To use the UK Biobank secure online repository services a researcher must
This webpage details the means by which large-scale genetic data held by UK Biobank can be accessed and manipulated once access has been approved.
Some of the UKB utilities are supplied pre-compiled for both MS-Windows and Linux systems. The MS-Windows utilities have the suffix .exe; the explanations given in this guide omit it for generality. All the utility programs are command-line tools, so Windows versions are best run from a Command Prompt window and Linux versions directly from a Terminal.
The repository consists of a pair of mirrored systems each connected to the UK JANET network by independent links. The system names are:
Anonymised data includes information such as Calls and Imputed values related to samples. When a participant withdraws consent, a small fraction of these datasets is invalidated and should not be used for analysis; however, this subset cannot be identified from the information within the file.
Link data includes information such as Fam files, which allow the pseudo-IDs assigned to individual participants to be linked to particular subsets (e.g. a column) of the sample results files. When a participant withdraws their consent from UK Biobank, their pseudo-ID information is removed from these datasets at the earliest opportunity.
A summary of the file types and groups is given in the table below:
Data type | Group | Filename(s) | How to obtain |
---|---|---|---|
Calls BED | Anon | ukb_cal_chrN_vZ.bed | ukbgene cal |
Calls BIM | Anon | ukb_snp_chrN_vZ.bim | Resource 1963, ukb_snp_bim.tar |
Calls FAM | Link | ukbA_cal_chrN_vZ_sP.fam | ukbgene cal -m |
Marker-QC | Static | ukb_snp_qc.txt | Resource 1955, ukb_snp_qc.txt |
Sample-QC | Anon | ukb_sqc_vZ.txt | standard fields in Category 100313 |
Relatedness | Link | ukbA_rel_sP.txt | ukbgene rel |
Imputation BGEN | Anon | ukb_imp_chrN_vZ.bgen | ukbgene imp |
Imputation BGI | Anon | ukb_bgi_chrN_vZ.bgi | Resource 1965, ukb_imp_bgi.tar |
Imputation MAF+info | Anon | ukb_mfi_chrN_vZ.txt | Resource 1967, ukb_imp_mfi.tar |
Imputation sample | Link | ukbA_imp_chrN_vZ_sP.sample | ukbgene imp -m |
Haplotypes BGEN | Anon | ukb_hap_chrN_vZ.bgen | ukbgene hap |
Haplotypes BGI | Anon | ukb_hbg_chrN_vZ.bgi | Resource 1671, ukb_hap_bgi.tar |
HLA Imputation | Anon | ukb_hla_vZ.txt | Field 22182 |
Intensity | Anon | ukb_int_chrN_vZ.bin | ukbgene int |
Confidences | Anon | ukb_con_chrN_vZ.txt | ukbgene con |
CNV log2r | Anon | ukb_l2r_chrN_vZ.txt | ukbgene l2r |
CNV baf | Anon | ukb_baf_chrN_vZ.txt | ukbgene baf |
SNP-posterior | Static | ukb_snp_posterior_chrN.bin | Resource 1817, ukb_snp_posterior.tar |
SNP-posterior X BIM | Static | ukb_snp_posterior_chrX_haploid.bim | Resource 1817, ukb_snp_posterior.tar |
Batch | Static | ukb_snp_posterior.batch | Resource 1968, ukb_snp_posterior.batch |
ukbgene typename -cchrom [flags]
where typename is the type of data being retrieved, selected from the list:
typename | type of data to be retrieved | format | link format |
---|---|---|---|
cal | genotype calls | bed | fam |
con | genotype confidences | txt | fam |
int | genotype intensities | bin | fam |
baf | genotype CNV b-allele frequencies | txt | fam |
l2r | genotype CNV log2ratios | txt | fam |
imp | imputation | bgen | sample |
hap | haplotypes | bgen | sample |
and chrom is the chromosome: 1, 2, ..., 22, X, Y, XY or MT. Additional person/sample-independent elements of a dataset (e.g. index files or QC) are not regarded as confidential and may be downloaded directly from the UKB Showcase Resource areas.
A full list of the available flags can be obtained by running ukbgene without any parameters. Particularly important are:
For example, the command
ukbgene cal -c17
will fetch the genotype calls for chromosome 17 and produce a bed-format file.
To fetch the Link file associated with that Anonymous dataset, add the -m parameter to the command line; hence
ukbgene cal -c17 -m
will fetch the corresponding fam-format file.
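When several chromosomes are needed, the command pattern above can be scripted. The following Python sketch only assembles command lines from the typenames, chromosome labels and the -c and -m flags described in this guide. The helper name build_command is hypothetical, and the sketch does not check whether a given data type is actually available for every chromosome.

```python
# Chromosome labels and typenames as listed in this guide.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "XY", "MT"]
TYPENAMES = ("cal", "con", "int", "baf", "l2r", "imp", "hap")


def build_command(typename, chrom, link=False):
    """Return the ukbgene argument list for one dataset (hypothetical helper)."""
    if typename not in TYPENAMES:
        raise ValueError(f"unknown typename: {typename}")
    if str(chrom) not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chrom}")
    command = ["ukbgene", typename, f"-c{chrom}"]
    if link:
        # -m requests the Link file (fam or sample) rather than the Anonymised data.
        command.append("-m")
    return command


# Print the two commands for chromosome 17: the calls, then the Link file.
for wants_link in (False, True):
    print(" ".join(build_command("cal", 17, link=wants_link)))
```

Each returned list could be handed to subprocess.run once the utility is installed and access has been set up.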
ukbgene rel
This will produce a 5-column file giving a pairwise listing of related individual pseudo-IDs accompanied by the values:
When a participant withdraws consent, UK Biobank immediately flags this in the central databases. The Link files are dynamically generated for each Researcher at the time of download and respond to this change immediately, substituting negative dummy-IDs for the pseudo-ID of any withdrawn participants. Using these new Link files for analysis work instantly removes any connection to withdrawn elements in the Anonymised data.
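Because withdrawn participants appear in a freshly downloaded Link file with negative dummy-IDs, they can be located by scanning the first field of each record. The Python sketch below assumes a whitespace-separated fam file with one record per line; the function name withdrawn_rows is hypothetical, and sample-format Link files would need different handling.

```python
def withdrawn_rows(fam_lines):
    """Return zero-based positions of records carrying a negative dummy-ID."""
    # Ignore blank lines so positions count actual records only.
    records = [line.split() for line in fam_lines if line.strip()]
    return [i for i, fields in enumerate(records) if int(fields[0]) < 0]


# Illustrative records in the six-column fam layout.
example = [
    "3298462 3298462 0 0 2 Batch_b001",
    "-1 -1 0 0 0 redacted",
    "9023752 9023752 0 0 1 UKBiLEVEAX_b11",
]
print(withdrawn_rows(example))  # [1]
```

The returned positions identify which entries of the matching Anonymised file should be excluded from analysis.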
At manageable intervals UK Biobank will regenerate the Anonymised files to purge any accumulating unusable entries - at which point Researchers will also need new Link files due to a change in the number of rows/columns in the Anonymised files. Notices will be sent to all Researchers registered for genetic data in advance of such a purge being made.
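Since a purge changes the version number in the file names, a simple guard against pairing an old Link file with a regenerated Anonymised file is to compare the vZ parts of the two names. The sketch below is an assumption-laden illustration: the regular expressions are derived only from the naming patterns in the table above, and the helper names file_version and versions_match are hypothetical.

```python
import re

# Patterns inferred from the file naming table in this guide (an assumption).
_CHROM = r"(?:[0-9]+|X|Y|XY|MT)"
ANON_PATTERN = re.compile(rf"^ukb_[a-z0-9]+_chr{_CHROM}_v(\d+)\.\w+$")
LINK_PATTERN = re.compile(rf"^ukb\d+_[a-z0-9]+_chr{_CHROM}_v(\d+)_s\d+\.(?:fam|sample)$")


def file_version(name):
    """Extract the version number Z from an Anonymised or Link file name."""
    for pattern in (ANON_PATTERN, LINK_PATTERN):
        match = pattern.match(name)
        if match:
            return int(match.group(1))
    raise ValueError(f"unrecognised file name: {name}")


def versions_match(anon_name, link_name):
    """True when both file names carry the same version number."""
    return file_version(anon_name) == file_version(link_name)


print(versions_match("ukb_con_chr1_v2.txt", "ukb99_cal_chr1_v2_s6.fam"))  # True
print(versions_match("ukb_con_chr1_v3.txt", "ukb99_cal_chr1_v2_s6.fam"))  # False
```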
Note that there will generally be a lower number of participants present in the imputation-derived Anonymous and Link files compared to the genotype-related ones.
ukb99_cal_chr1_v2_s6.fam | ukb_con_chr1_v2.txt |
---|---|
3298462 3298462 0 0 2 Batch_b001 | 0.0011 0.0012 0.0013 0.0014 0.0015 0.0016 |
8029816 8029816 0 0 1 Batch_b007 | 0.0021 0.0022 0.0023 0.0024 0.0025 0.0026 |
2874520 2874520 0 0 1 UKBiLEVEAX_b11 | 0.0031 0.0032 0.0033 0.0034 0.0035 0.0036 |
9023752 9023752 0 0 1 UKBiLEVEAX_b11 | |
3679861 3679861 0 0 2 Batch_b032 | |
7397822 3679861 0 0 2 Batch_b024 | |
If the participant with pseudo-ID 3679861 withdraws then the Fam file contents and name ("s6" becoming "s5") would change to:
ukb99_cal_chr1_v2_s5.fam | ukb_con_chr1_v2.txt |
---|---|
3298462 3298462 0 0 2 Batch_b001 | 0.0011 0.0012 0.0013 0.0014 0.0015 0.0016 |
8029816 8029816 0 0 1 Batch_b007 | 0.0021 0.0022 0.0023 0.0024 0.0025 0.0026 |
2874520 2874520 0 0 1 UKBiLEVEAX_b11 | 0.0031 0.0032 0.0033 0.0034 0.0035 0.0036 |
9023752 9023752 0 0 1 UKBiLEVEAX_b11 | |
-1 -1 0 0 0 redacted | |
7397822 3679861 0 0 2 Batch_b024 | |
If the participant with pseudo-ID 2874520 also withdraws then the Fam file would change to:
ukb99_cal_chr1_v2_s4.fam | ukb_con_chr1_v2.txt |
---|---|
3298462 3298462 0 0 2 Batch_b001 | 0.0011 0.0012 0.0013 0.0014 0.0015 0.0016 |
8029816 8029816 0 0 1 Batch_b007 | 0.0021 0.0022 0.0023 0.0024 0.0025 0.0026 |
-1 -1 0 0 0 redacted | 0.0031 0.0032 0.0033 0.0034 0.0035 0.0036 |
9023752 9023752 0 0 1 UKBiLEVEAX_b11 | |
-2 -2 0 0 0 redacted | |
7397822 3679861 0 0 2 Batch_b024 | |
If the Anonymised confidences file is then purged and regenerated (moving from Version 2 to 3), both the files and their names would alter to become:
ukb99_cal_chr1_v3_s4.fam | ukb_con_chr1_v3.txt |
---|---|
3298462 3298462 0 0 2 Batch_b001 | 0.0011 0.0012 0.0014 0.0016 |
8029816 8029816 0 0 1 Batch_b007 | 0.0021 0.0022 0.0024 0.0026 |
9023752 9023752 0 0 1 UKBiLEVEAX_b11 | 0.0031 0.0032 0.0034 0.0036 |
7397822 3679861 0 0 2 Batch_b024 | |
If the participant with pseudo-ID 8029816 subsequently withdraws then the Fam file would change to:
ukb99_cal_chr1_v3_s3.fam | ukb_con_chr1_v3.txt |
---|---|
3298462 3298462 0 0 2 Batch_b001 | 0.0011 0.0012 0.0014 0.0016 |
-1 -1 0 0 0 redacted | 0.0021 0.0022 0.0024 0.0026 |
9023752 9023752 0 0 1 UKBiLEVEAX_b11 | 0.0031 0.0032 0.0034 0.0036 |
7397822 3679861 0 0 2 Batch_b024 | |
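The worked example can be reproduced in a few lines of Python. The sketch below assumes, as in the example, that each record (row) of the Fam file corresponds to one column of the confidences file; real files may hold more than one value per participant, so this is an illustration of the principle rather than a parser for the actual data. The function name drop_withdrawn is hypothetical.

```python
def drop_withdrawn(data_rows, fam_lines):
    """Remove the columns whose Fam record carries a negative dummy-ID."""
    keep = [i for i, line in enumerate(fam_lines) if int(line.split()[0]) > 0]
    for row in data_rows:
        if len(row) != len(fam_lines):
            raise ValueError("each data row needs one value per fam record")
    return [[row[i] for i in keep] for row in data_rows]


# The s4 Fam file and the version 2 confidences file from the example above.
fam_s4 = [
    "3298462 3298462 0 0 2 Batch_b001",
    "8029816 8029816 0 0 1 Batch_b007",
    "-1 -1 0 0 0 redacted",
    "9023752 9023752 0 0 1 UKBiLEVEAX_b11",
    "-2 -2 0 0 0 redacted",
    "7397822 3679861 0 0 2 Batch_b024",
]
confidences_v2 = [
    "0.0011 0.0012 0.0013 0.0014 0.0015 0.0016".split(),
    "0.0021 0.0022 0.0023 0.0024 0.0025 0.0026".split(),
    "0.0031 0.0032 0.0033 0.0034 0.0035 0.0036".split(),
]

for row in drop_withdrawn(confidences_v2, fam_s4):
    print(" ".join(row))
# 0.0011 0.0012 0.0014 0.0016
# 0.0021 0.0022 0.0024 0.0026
# 0.0031 0.0032 0.0034 0.0036
```

The printed values match the purged version 3 confidences file shown in the example.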
Common Link files
Many of the Anonymous files share the same Link files. It is possible to
download a separate Link file for every Anonymous file; however, this
process would have to be repeated whenever a participant withdrew.
There are various ways of making this process more efficient, starting
with downloading only a pair of Link files (for instance
ukb_cal_chr1_vZ_sP.fam and ukb_imp_chr1_vZ_sP.sample) thus:
ln -s ukb_cal_chr1_vZ_sP.fam ukb_cal_chr2_vZ_sP.fam
sets up an alias whereby the Chromosome 1 file 'looks like' the Chromosome 2 file.
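The same aliasing can be repeated for several chromosomes in one pass. The Python sketch below uses os.symlink with a relative target, which mirrors the ln -s command above. The function name link_chromosomes is hypothetical, the caller is responsible for listing only chromosomes that genuinely share the Chromosome 1 Link file, and creating symbolic links on MS-Windows may require additional privileges.

```python
import os


def link_chromosomes(directory, source_name, chromosomes):
    """Alias a Chromosome 1 Link file under the names of other chromosomes."""
    if "_chr1_" not in source_name:
        raise ValueError("source_name must be a chr1 Link file name")
    created = []
    for chrom in chromosomes:
        link_name = source_name.replace("_chr1_", f"_chr{chrom}_")
        link_path = os.path.join(directory, link_name)
        # Skip Chromosome 1 itself and any name that already exists.
        if link_name == source_name or os.path.lexists(link_path):
            continue
        # Relative target, equivalent to: ln -s source_name link_name
        os.symlink(source_name, link_path)
        created.append(link_name)
    return created
```

For instance, link_chromosomes(".", "ukb_cal_chr1_vZ_sP.fam", range(2, 23)) would create aliases for Chromosomes 2 to 22 in the current directory.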
Shared datasets
Because of the large size of the Anonymous files, UK Biobank has agreed
that, with permission, Researchers from multiple approved Applications
within a unit may share a common copy of them. However, this can create
problems, as some analysis programs assume specific names and/or
locations for their input files and it is likely that there will be more
than one set of Link files in use simultaneously. Possible remedies include: