Abstract
The UK Biobank study contains several sources of diagnostic data, including hospital inpatient data and self-reported conditions for ~500,000 participants, and primary care data for ~177,000 participants (35%). Epidemiological investigations require a primary disease definition, but it is unclear whether to combine sources to maximize power or to rely on a single source to ensure a consistent outcome. The consistency of definitions was investigated for venous thromboembolism (VTE) by examining the overlap between cases defined from hospital inpatient data, primary care reports, and self-reported questionnaires. VTE cases showed little overlap between data sources: among participants with primary care data, only 6% of reported events were identified by all three sources (hospital, primary care, and self-report), while 71% appeared in only one source. Deep vein thrombosis-only events represented 68% of self-reported and 36% of hospital-reported VTE cases, while pulmonary embolism-only events represented 20% of self-reported and 50% of hospital-reported VTE cases. The distributions of sociodemographic characteristics also differed between sources; for example, 46% of hospital-reported VTE cases were female, compared with 58% of self-reported VTE cases. These results illustrate how seemingly neutral decisions taken to improve data quality can affect the representativeness of a dataset.