Publication 16530

Title:	Mastering rare event analysis: subsample-size determination in Cox and logistic regressions
Journal:	Biometrics
Published:	3 Jul 2025
Pubmed:	https://pubmed.ncbi.nlm.nih.gov/40856106/
DOI:	https://doi.org/10.1093/biomtc/ujaf110
URL:	https://academic.oup.com/biometrics/article-pdf/81/3/ujaf110/64136131/ujaf110.pdf

Abstract

In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, our work introduces tools designed for choosing the subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a new optimal subsampling procedure tailored to logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows and logistic regression of linked birth and infant death data with about 28 million observations.

9 Keywords

Colorectal Neoplasms
Computer Simulation
Data Interpretation, Statistical
Humans
Logistic Models
Proportional Hazards Models
Sample Size
Survival Analysis
United Kingdom

3 Authors

Tal Agassi
Nir Keret
Malka Gorfine

1 Application

Application ID	Title
56885	Advanced Risk Prediction and Heritability Estimation Methods
