Abstract
Background: This study aimed to evaluate whether integrating clinical and genomic data improves the performance of machine learning (ML) models for predicting Type 2 Diabetes (T2D) risk.
Methods: Six models (Random Forest, Support Vector Machine, Linear Discriminant Analysis, Logistic Regression, Gradient Boosting Machine, and Decision Tree) were trained and tested on a discovery dataset (N=3,546) and validated in the UK Biobank (N=31,620). Model performance was assessed with clinical data alone and with combined clinical and genomic data, and separately within age-specific groups (>55 and ≤55 years).
Results: The inclusion of genomic data modestly improved model performance across all algorithms in the discovery dataset. Clinical features such as family history of T2D and hypertension consistently ranked as top features. When single-nucleotide polymorphisms (SNPs) were added, T2D-associated variants, including rs2943641 (IRS1), rs7903146 (TCF7L2), and rs7756992 (CDKAL1), emerged among the most important features, particularly in younger individuals. These findings demonstrate the translational potential of incorporating genomics for early risk identification. In the UK Biobank, all models achieved areas under the curve (AUCs) exceeding 91% with combined clinical and genomic data. Performance was notably better among younger individuals (≤55 years), emphasizing the models' potential for early detection. Integration of a polygenic risk score (PRS) further supported risk prediction, particularly in younger individuals, though incremental gains were modest.
Conclusions: While traditional clinical factors remained the strongest predictors of T2D risk, integration of genomic data produced a modest improvement in model performance, especially among younger adults. Validation across independent datasets confirmed the generalizability of these findings, underscoring the value of multi-dimensional risk-prediction models to refine T2D risk assessment.