Abstract
The UK Biobank study contains several sources of diagnostic data, including hospital inpatient data and self-reported conditions for approximately 500,000 participants and primary-care data for approximately 177,000 participants (35%). Epidemiologic investigations require a primary disease definition, but it is unclear whether to combine data sources to maximize statistical power or to rely on only 1 source to ensure a consistent outcome definition. The consistency of disease definitions was investigated for venous thromboembolism (VTE) by evaluating the overlap among cases defined from 3 sources: hospital inpatient data, primary-care reports, and self-reported questionnaires. VTE cases showed little overlap between data sources: among persons with primary-care data, only 6% of reported events were identified by all 3 sources (hospital, primary-care, and self-reports), while 71% appeared in only 1 source. Deep vein thrombosis-only events represented 68% of self-reported VTE cases and 36% of hospital-reported VTE cases, while pulmonary embolism-only events represented 20% of self-reported VTE cases and 50% of hospital-reported VTE cases. Additionally, the distributions of sociodemographic characteristics differed by source; for example, 46% of hospital-reported VTE cases occurred in women, compared with 58% of self-reported VTE cases. These results illustrate how seemingly neutral decisions taken to improve data quality can affect the representativeness of a data set.
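
The overlap evaluation described above can be illustrated with a minimal sketch. It assumes each source has already been reduced to a set of participant identifiers flagged as VTE cases; the function name `source_overlap`, the source labels, and the toy identifiers are hypothetical and are not taken from the study or from UK Biobank. The sketch reports the share of cases found in exactly 1, 2, or 3 sources, which is the kind of quantity behind the 6% and 71% figures.

```python
from collections import Counter


def source_overlap(sources):
    """Return the share of cases reported by exactly k sources, for each k.

    `sources` maps a source label to a collection of participant IDs
    flagged as cases by that source (a hypothetical input layout).
    """
    # Count, for each participant, how many sources report them as a case.
    per_person = Counter()
    for ids in sources.values():
        per_person.update(set(ids))

    n_cases = len(per_person)  # cases identified by at least one source
    n_sources = len(sources)
    if n_cases == 0:
        return {k: 0.0 for k in range(1, n_sources + 1)}

    # Tally participants by the number of sources that report them.
    by_k = Counter(per_person.values())
    return {k: by_k.get(k, 0) / n_cases for k in range(1, n_sources + 1)}


# Made-up participant IDs, for illustration only (not study data).
toy = {
    "hospital": {1, 2, 3, 4},
    "primary_care": {3, 4, 5},
    "self_report": {4, 6, 7, 8},
}
shares = source_overlap(toy)
print(shares)
```

In this sketch the denominator is the union of cases across all sources, so the shares sum to 1. Restricting the input to participants with primary-care data, as the abstract does, would be a filtering step applied before calling the function.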