Application 46789

Title:	How large is large enough? Is small datasets still valuable under the shadow of huge biobank data?
Lead Institution:	Academia Sinica
Principal investigator:	Professor Cathy Fann

About

Our research project addresses two questions. 1) Setting aside ethnicity-specific disease susceptibility markers, are smaller biobank datasets still valuable now that large biobank datasets are available? 2) Huge datasets provide great opportunities for mapping disease-associated markers, but how large is large enough? In this project, we will use machine learning techniques to "oversample" small biobank data and compare the results with those obtained from UK Biobank data across different phenotypes (diseases). The oversampling will be carried out with machine learning methods such as SMOTE (Synthetic Minority Over-Sampling Technique), GUN (Generative Unadversarial Networks) and GAN (Generative Adversarial Networks). The purpose of oversampling is to improve statistical power, which is usually lower for small datasets. To gain a thorough understanding, we plan to select the phenotypes common to the two biobank datasets (Taiwan Biobank and UK Biobank), about 24 in total, such as hypertension, asthma and cancers. We examine different phenotypes because the genetic contribution differs between diseases, so for a fixed sample size the association results may differ as well. For diseases with a higher genetic contribution, a modest dataset might be sufficient to identify susceptibility markers. Most traditional statistical models rest on a few assumptions. For example, the ratio of cases to controls should not be too far from one; in large biobank data, however, the ratio can be 0.01 or even lower, which might distort the results. By using oversampling techniques, we can observe how the test statistics perform under various ratios and thereby identify more appropriate ratios for association tests.
Overall, using the UK and Taiwan biobank datasets, the goals of our study are to identify important parameters, such as the case-control ratio, disease prevalence and effect size (the difference in allele frequency between cases and controls), and to assess their impact on association test results by using machine learning techniques.
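
To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation used to raise the case-control ratio to a chosen target. It is an illustration only, not the project's implementation: the function names (`smote_oversample`, `oversample_to_ratio`), the parameters `k` and `seed`, and the use of plain numeric feature vectors with Euclidean distance are all assumptions made for this example.

```python
import math
import random


def smote_oversample(cases, n_new, k=3, seed=0):
    """Return n_new synthetic rows, each interpolated between a randomly
    chosen case and one of its k nearest neighbours among the cases.

    Hypothetical helper for illustration; rows are lists of numbers.
    """
    if len(cases) < 2:
        raise ValueError("need at least two cases to interpolate between")
    rng = random.Random(seed)
    k = min(k, len(cases) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(cases)
        # Rank the other cases by Euclidean distance to the base row.
        others = sorted(
            (c for c in cases if c is not base),
            key=lambda c: math.dist(base, c),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def oversample_to_ratio(cases, n_controls, target_ratio, **kwargs):
    """Append synthetic cases until cases / controls reaches target_ratio.

    Returns the original cases unchanged if the ratio is already met.
    """
    n_needed = max(0, math.ceil(target_ratio * n_controls) - len(cases))
    return cases + smote_oversample(cases, n_needed, **kwargs)
```

For example, `oversample_to_ratio(cases, n_controls=100, target_ratio=0.5)` would pad five observed cases up to fifty. Repeating the call over a grid of target ratios is one way the behaviour of an association test could be compared across case-control ratios. Note that interpolating genotype codes (0/1/2) produces non-integer values, so a real analysis would need to decide how to handle that.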

2 Publications

Pub ID	Title	Author(s)	Year	Journal
11062	Effects of insomnia and non-vasomotor menopausal symptoms on coronary heart disease risk: a mendelian randomization study	Ie-Bin Lian (+4)	2023	Heliyon
9996	Unsupervised clustering identified clinically relevant metabolic syndrome endotypes in UK and Taiwan Biobanks	Aylwin Ming Wee Lim (+3)	2024	iScience