Abstract
OBJECTIVES: To evaluate whether empirical calibration of P values using negative controls can effectively control type I and type II errors under unmeasured confounding bias in both simulated and real-world observational settings.
STUDY DESIGN AND SETTING: A simulation study was conducted under five settings reflecting different degrees of adherence to the U-comparability assumption, that is, the extent to which negative controls share the same unmeasured confounding structure as the exposure of interest. These included three primary scenarios (ideal, realistic, and violation of U-comparability) and two mixed scenarios reflecting partial violations. We varied sample size, the direction and strength of unmeasured confounding bias, and the number of negative controls. Based on UK Biobank data, the method was also applied to evaluate the association between hypertension and peripheral artery disease (PAD) in individuals with type 2 diabetes mellitus.
RESULTS: Standard logistic regression showed inflated type I error rates across almost all settings, peaking at 44.2% under realistic U-comparability with a sample size of 20,000. In contrast, empirical calibration generally controlled type I error close to the nominal 5% level and reduced bias by 80%-100% under both ideal and realistic U-comparability. Type I error control improved with more negative controls, while type II error control was influenced by whether the unmeasured confounding bias acted in the same or opposite direction as the true exposure-outcome effect. In the UK Biobank case study, 4 of 15 negative controls showed P < .05 after adjustment for measured confounders, indicating residual unmeasured confounding. After empirical calibration with 5, 10, or 15 negative controls, the association between hypertension and PAD remained statistically significant (calibrated P ≈ .004-.006).
CONCLUSION: Empirical calibration of P values can mitigate residual unmeasured confounding and reduce type I error inflation in observational studies. Its performance depends on the validity and number of negative controls.
PLAIN LANGUAGE SUMMARY: When researchers use large health databases to study whether a treatment or risk factor causes a disease, results can sometimes be misleading. This can happen because of unmeasured confounding (hidden factors that influence both the exposure and the outcome), which leads to "false alarms," or false positive findings. We evaluated a statistical method called empirical calibration of P values, designed to correct this bias. The method uses negative controls (exposures known not to have a causal effect on the outcome) to estimate the amount of bias in the data and then adjust the P value for the exposure of interest. Our simulation study showed that this approach effectively reduced the false alarm rate to the expected 5% level, but only under certain conditions. Its success depended on selecting appropriate negative controls that shared the same bias structure as the exposure of interest and on using a sufficient number of controls to ensure stable results. The method failed when the negative controls were poorly chosen or unrelated. When applied to real-world data from the UK Biobank, it successfully corrected for unmeasured bias while still confirming the true, significant link between high blood pressure and PAD. These findings suggest that empirical calibration of P values can make observational research more reliable, provided that enough well-chosen negative controls are available.
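ILLUSTRATIVE SKETCH: The calibration idea described above (estimate the bias distribution from negative controls, then judge the exposure of interest against that distribution instead of against zero) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, the empirical null is assumed to be normal on the log odds ratio scale, its mean and variance are estimated by a simple method-of-moments step, and all numbers are invented for demonstration.

```python
import math
from statistics import NormalDist, mean, variance


def fit_empirical_null(nc_log_or, nc_se):
    """Estimate the mean and variance of the systematic error from negative controls.

    Assumption: a simple method-of-moments estimate. The spread of the
    negative-control estimates is reduced by their average sampling variance,
    and the result is floored at zero.
    """
    mu = mean(nc_log_or)
    sampling_var = mean([s ** 2 for s in nc_se])
    tau2 = max(0.0, variance(nc_log_or) - sampling_var)
    return mu, tau2


def uncalibrated_p(log_or, se):
    """Conventional two-sided Wald P value against a null of log OR = 0."""
    z = log_or / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def calibrated_p(log_or, se, mu, tau2):
    """Two-sided P value against the empirical null N(mu, tau2 + se^2)."""
    z = (log_or - mu) / math.sqrt(tau2 + se ** 2)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical negative-control estimates (log odds ratios and standard errors).
nc_log_or = [0.10, 0.25, 0.05, 0.30, 0.15]
nc_se = [0.08, 0.08, 0.08, 0.08, 0.08]

mu, tau2 = fit_empirical_null(nc_log_or, nc_se)

# Hypothetical estimate for the exposure of interest.
exposure_log_or, exposure_se = 0.50, 0.08
print("uncalibrated P:", uncalibrated_p(exposure_log_or, exposure_se))
print("calibrated P:  ", calibrated_p(exposure_log_or, exposure_se, mu, tau2))
```

In this sketch the negative controls are all shifted away from zero, so the calibrated P value is larger than the conventional one: part of the apparent effect is attributed to bias shared with the negative controls. The correction is only as good as the assumption that the negative controls carry the same bias structure as the exposure of interest.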