Exome sequences - Genomics
The first tranche of UK Biobank whole exome sequencing (WES) data was made available for 50,000 UK Biobank participants in March 2019, and the data for an additional 150,000 participants was made available in October 2020. Initially the data was processed using two protocols, FE and SPB. These were replaced by an improved, unified OQFE pipeline when the additional 150,000 sequences were released, and the earlier data was reprocessed to match.
Fields 23141-23146 will eventually contain information on all the participants for whom exome sequencing is possible. It will only be possible to access the data for these fields in-situ via UK Biobank's Research Access Platform.
The first 50k release prioritized individuals with whole body MRI imaging data, enhanced baseline measurements, hospital episode statistics (HES), and/or linked primary care records. Additionally, one disease area was selected for enrichment: individuals with admission to hospital with a primary diagnosis of asthma (ICD10 J45 or J46). With the further 150k samples added, the 200k release includes 1,135 parent-offspring pairs, 3,855 full-sibling pairs, including 101 trios, 27 monozygotic twin pairs and 7,461 second-degree genetically determined relationships.
Exomes were captured with the IDT xGen Exome Research Panel v1.0 including supplemental probes. The basic design targets 39 Mbp of the human genome (19,396 genes). Multiplexed samples were sequenced with dual-indexed 75x75 bp paired-end reads on the Illumina NovaSeq 6000 platform using S2 (initial 50k release) and S4 flow cells (all subsequent samples). A different IDT v1.0 oligo lot was used in the initial 50k sequencing than was used in the sequencing of all subsequent samples. Inclusion of this information as a covariate in downstream analyses is recommended. In each sample and among targeted bases, coverage exceeds 20X at 95.2% of sites on average. Complete sequencing protocols are described in detail by the summary manuscript (https://pubmed.ncbi.nlm.nih.gov/33087929/).
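The covariate recommendation above can be sketched as a simple binary indicator. This is a minimal illustration, assuming a list of sample IDs belonging to the initial 50k release is already available; the function name, argument names and the source of that list are hypothetical and not part of the release.

```python
def oligo_lot_covariate(sample_ids, initial_50k_ids):
    """Return {sample_id: 0 or 1}, where 1 marks the initial 50k oligo lot.

    Both arguments are hypothetical inputs; how membership of the
    initial 50k release is obtained is not specified here.
    """
    initial = set(initial_50k_ids)
    return {sid: int(sid in initial) for sid in sample_ids}


# Toy example with made-up sample IDs.
covariate = oligo_lot_covariate(["s1", "s2", "s3"], ["s2"])
```

The resulting indicator could then be added alongside other covariates in a downstream association model.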
Primary and secondary analysis for the UKB 200k release was performed with an updated Functional Equivalence (FE) protocol that retains original quality scores in the CRAM files (referred to as the OQFE protocol, https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1). The OQFE protocol aligns and duplicate-marks all raw sequencing data (FASTQs) to the full GRCh38 reference in an alt-aware manner as described in the original FE manuscript (https://pubmed.ncbi.nlm.nih.gov/30279509/). The OQFE CRAMs were then called for small variants with DeepVariant to generate per-sample gVCFs. These gVCFs were aggregated and joint-genotyped with GLnexus (https://www.biorxiv.org/content/10.1101/572347v1) to create a single multi-sample VCF (pVCF) for all UKB 200k samples. PLINK files were derived directly from this pVCF. Please note: to ensure that the UKB 200k data supports a broad range of analyses, no variant- or sample-level filters were pre-applied to the pVCF or PLINK files. The publicly released pVCF is the direct output of GLnexus, from which the PLINK files are generated. The pVCF contains allele-read depths and genotype qualities for all genotypes from which variant- and sample-level QC metrics can be calculated and to which analysis-specific filters can be applied. Examples of such filtering are described in the UKB 200K preprint (https://www.medrxiv.org/content/10.1101/2020.11.02.20222232v1).
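Because no filters were pre-applied, per-genotype metrics have to be derived from the pVCF by the analyst. The following is a minimal sketch of reading one sample entry, assuming the FORMAT column carries the standard VCF keys GT, AD and GQ; the exact FORMAT layout of the released pVCF is an assumption, and the function name is hypothetical. Allele balance is taken here as the non-reference read fraction of a heterozygous genotype, which is one common definition.

```python
def parse_genotype(format_field, sample_field):
    """Parse one VCF sample entry into GT, depth, GQ and allele balance."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    gt = values.get("GT", "./.")
    # AD lists read counts per allele, reference first; "." means missing.
    ad = [int(x) for x in values.get("AD", ".").split(",") if x != "."]
    depth = sum(ad)
    gq_raw = values.get("GQ", ".")
    gq = int(gq_raw) if gq_raw != "." else None

    alleles = gt.replace("|", "/").split("/")
    is_het = (
        len(alleles) == 2
        and "." not in alleles
        and alleles[0] != alleles[1]
    )
    # Allele balance: fraction of reads supporting non-reference alleles.
    ab = sum(ad[1:]) / depth if is_het and depth > 0 else None
    return {"gt": gt, "depth": depth, "gq": gq, "is_het": is_het, "ab": ab}


example = parse_genotype("GT:AD:GQ", "0/1:18,12:50")
```

Analysis-specific thresholds on depth, genotype quality and allele balance can then be applied to the returned values.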
Please note that the OQFE protocol differs from both previous UKB 50k releases, SPB and FE, which are described below for reference. All UKB 200k samples were processed from FASTQ with the OQFE docker (https://hub.docker.com/r/dnanexus/oqfe). Further details are provided in the WES FAQ at https://www.ukbiobank.ac.uk/media/cfulxh52/uk-biobank-exome-release-faq_v9-december-2020.pdf
In the original protocol, the SPB pipeline first converted all raw sequencing data to FASTQs according to Illumina NovaSeq best practices and aligned those reads to the GRCh38 reference genome with BWA-mem to generate a CRAM file for each sample. After read-duplicate marking, SNVs and indels were called with WeCall (GenomicsPLC), generating a gVCF per sample. These gVCFs were joint-genotyped using GLnexus (https://www.biorxiv.org/content/10.1101/572347v1) to create a single, unfiltered project-level VCF (pVCF). Genotype depth filters (SNV DP≥7, indel DP≥10) were applied prior to variant site filters requiring at least one variant genotype passing an allele balance filter (heterozygous SNV AB>0.15, heterozygous indel AB>0.20), resulting in a second 'filtered' pVCF.
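The two-stage SPB filtering described above can be sketched as follows. This is an illustration only, not the pipeline's actual code: the class and function names are hypothetical, the indel allele balance threshold is read as AB>0.20, and homozygous variant genotypes that pass the depth filter are assumed to satisfy the allele balance requirement (the text only states thresholds for heterozygotes).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Genotype:
    is_variant: bool                  # carries at least one non-reference allele
    is_het: bool                      # heterozygous call
    depth: int                        # read depth (DP)
    allele_balance: Optional[float]   # non-reference read fraction (AB)


def keep_site(genotypes: List[Genotype], is_indel: bool) -> bool:
    """Return True if the site survives the depth and allele balance filters."""
    min_dp = 10 if is_indel else 7
    min_ab = 0.20 if is_indel else 0.15
    for g in genotypes:
        # Stage 1: genotypes below the depth threshold are disregarded.
        if not g.is_variant or g.depth < min_dp:
            continue
        # Stage 2: at least one variant genotype must pass allele balance.
        if not g.is_het:
            return True  # assumption: AB applies to heterozygotes only
        if g.allele_balance is not None and g.allele_balance > min_ab:
            return True
    return False


snv_kept = keep_site([Genotype(True, True, 8, 0.20)], is_indel=False)
indel_dropped = keep_site([Genotype(True, True, 8, 0.50)], is_indel=True)
```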
To maximize data utility and ease of use, an additional "Functionally Equivalent" (FE) pVCF was generated from FASTQs, following the primary analysis protocol described in the 2018 manuscript (PMID: 30279509). This was followed by GATK 3.0 variant calling and hard filtering, which removed variants with an inbreeding coefficient < -0.03 or without at least one variant genotype of DP≥10, GQ≥20 and, if heterozygous, AB≥0.20.
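The FE hard filter can be expressed as a single predicate per variant. The sketch below is a hedged illustration of the stated thresholds, not GATK code; the function name and the tuple layout (DP, GQ, heterozygous flag, AB) are hypothetical.

```python
def passes_fe_hard_filter(inbreeding_coeff, variant_genotypes):
    """Return True if a variant survives the FE-style hard filter.

    variant_genotypes: iterable of (dp, gq, is_het, ab) tuples, one per
    genotype carrying a non-reference allele; ab may be None.
    """
    # Variants with a strongly negative inbreeding coefficient are removed.
    if inbreeding_coeff < -0.03:
        return False
    # At least one variant genotype must meet all per-genotype thresholds.
    for dp, gq, is_het, ab in variant_genotypes:
        ab_ok = (not is_het) or (ab is not None and ab >= 0.20)
        if dp >= 10 and gq >= 20 and ab_ok:
            return True
    return False


kept = passes_fe_hard_filter(0.0, [(12, 30, True, 0.45)])
removed = passes_fe_hard_filter(-0.10, [(12, 30, True, 0.45)])
```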