Publication 5699

Title:	Scalable probabilistic PCA for large-scale genetic variation data
Journal:	PLOS Genetics
Published:	29 May 2020
Pubmed:	https://pubmed.ncbi.nlm.nih.gov/32469896/
DOI:	https://doi.org/10.1371/journal.pgen.1008773
URL:	https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1008773&type=printable
Citations:	37 (17 in last 2 years) as of 8 Aug 2024

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.

14 Keywords

Adaptor Proteins, Signal Transducing
Algorithms
Biological Specimen Banks
Computational Biology
Genetics, Population
Genome-Wide Association Study
Humans
Models, Genetic
Mutation, Missense
Polymorphism, Single Nucleotide
Principal Component Analysis
Toll-Like Receptor 4
United Kingdom
White People

5 Authors

Aman Agrawal
Alec M. Chiu
Minh Le
Eran Halperin
Sriram Sankararaman

1 Application

Application ID	Title
33127	Methods for large-scale medical and population genetic data
