Accessing Genetic Data within UK Biobank

The genetic datasets held by UK Biobank are too large to be distributed as part of a standard phenotype dataset. Instead, the gfetch client has been developed to allow Approved researchers to download elements of them piecemeal to their local systems from secure online repositories outside the main UK Biobank Showcase system. This guide explains how to use gfetch.

To use the UK Biobank secure online repository services a researcher must:

be a validated UK Biobank researcher;
be part of an Approved Application;
have been issued a standard dataset together with the associated password credentials;
have included the desired genetic fields in an approved Basket.

This webpage details the means by which large-scale genetic data held by UK Biobank can be accessed and manipulated once access has been approved.

Preparation
Notices
Data Groups
Authentication
Fetching Data
Fetching relatedness
File versioning
Standard usages

1. Preparation

Following approval of a research application, researchers will be sent a 32-character MD5 Checksum and a 64-character password. The next step is to acquire the gfetch utility from the Downloads section of the Showcase website.

Some of the UKB utilities are supplied pre-compiled for both MS-Windows and Linux systems. The MS-Windows utilities have the suffix .exe, but the explanations given in this guide omit this for generality. All the utility programs are command-line tools, so Windows versions are best run from a Command Prompt window and Linux versions are best run directly from a Terminal.

The repository consists of a pair of mirrored systems each connected to the UK JANET network by independent links. The system names are:

biota.ndph.ox.ac.uk
chest.ndph.ox.ac.uk

To access genetic data from a remote computer, the system that the download utility is running on must be able to make HTTP (port 80) connections to at least one, and preferably both, of the repository systems. If this is not possible then researchers should contact their local IT team to resolve the issue.
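The connectivity requirement can be checked before attempting a download. The following is a minimal sketch, not part of the UKB utilities: it tries to open a TCP connection to port 80 on each mirror named above. The function names `tcp_connect` and `check_mirrors` and the 5-second timeout are assumptions made for illustration.

```python
import socket
from typing import Callable, Dict, Iterable

# Mirror host names as listed in this guide.
MIRRORS = ("biota.ndph.ox.ac.uk", "chest.ndph.ox.ac.uk")


def tcp_connect(host: str, port: int, timeout: float = 5.0) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def check_mirrors(
    hosts: Iterable[str] = MIRRORS,
    port: int = 80,
    connect: Callable[[str, int], None] = tcp_connect,
) -> Dict[str, bool]:
    """Return a mapping of host name to whether a connection could be opened."""
    results = {}
    for host in hosts:
        try:
            connect(host, port)
            results[host] = True
        except OSError:
            # Covers refused connections, timeouts and DNS lookup failures.
            results[host] = False
    return results
```

Calling `check_mirrors()` with no arguments tests both mirrors. A result of `False` for both hosts suggests that outbound port 80 traffic is blocked, which is the situation to raise with the local IT team. A successful connection only shows that the port is reachable; it says nothing about whether authentication will succeed.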

2. Notices

Before downloading any data, Researchers are reminded that:

All access attempts, whether successful or denied, are logged and monitored with the IP address recorded.
Any data exported outside of the UK Biobank systems must be protected by strong (e.g. AES256) encryption when not actively in use.
The volume of data available in the repository is subject to gradual change and may not match the list supplied when an application is processed. These changes are due to participant withdrawals (which require the removal of data) and the incremental addition of new data for continuing participants.
It is possible to run multiple downloads in parallel. However, to provide fair usage, the system will not permit a single Application to run more than 10 simultaneously, and additional attempts will be rejected.
To ensure continuity of service, the download servers will reject requests when there are more than 500 simultaneous downloads in progress.
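When scripting parallel downloads it is sensible to enforce the per-Application ceiling locally rather than rely on the server rejecting the excess. The sketch below is one way to do that with a thread pool; `run_downloads`, `run_command` and the default of 4 workers are illustrative choices, not part of gfetch.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

MAX_PARALLEL = 10  # per-Application ceiling stated in the notices above


def run_command(argv: Sequence[str]) -> int:
    """Run one download command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def run_downloads(
    commands: Sequence[Sequence[str]],
    workers: int = 4,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[int]:
    """Run commands with at most min(workers, MAX_PARALLEL) in flight at once."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    limit = min(workers, MAX_PARALLEL)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # map() returns results in the same order as the input commands.
        return list(pool.map(runner, commands))
```

This only limits downloads started by one script. Downloads started elsewhere under the same Application count towards the same ceiling, so a lower `workers` value leaves headroom for colleagues.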

3. Data Groups

The genetic data in UK Biobank can be grouped into 3 types according to how malleable and aggregated it is:

Static meta-data and aggregated results;
Anonymised results from sample analysis;
Link files mapping pseudo-ids to other participant information.

Static data includes such datasets as Marker QC (quality control) information. This data is not affected by the withdrawal of individual participants. Similarly, aggregated research output such as GWAS results is not affected by the withdrawal of participants after the computations have been made.

Anonymised data includes such information as Calls and Imputed values related to samples. When a participant withdraws consent a small fraction of these datasets is invalidated and should not be used for analysis. However, this subset cannot be identified from the information within the file.

Link data includes such information as Fam files which allow the pseudo-IDs assigned to individual participants to be linked to particular subsets (e.g. a column) of the sample results files. When a participant withdraws their consent from UK Biobank their pseudo-ID information is removed from these datasets at the earliest opportunity.

4. Authentication

To access the repository it is necessary to prove one's identity to the system using a keyfile. See Resource 667 for detailed information on this.

5. Fetching data

Most genetic data has been divided into per-chromosome datasets for convenience of downloading and use. Some is stored as paired files, with one containing anonymised data and the other the Application-specific identifiers (for instance genotype calls), while in other cases the actual data is customised dynamically. The data for some chromosomes/formats is sufficiently large that it has been broken up into a number of separate files per chromosome. These are indexed by a 'block' counter, which begins at 0 for each chromosome.

To download the results files using gfetch, enter the following at the command line:

 gfetch  field_id -cchrom [flags]

where field_id is the ID of the field as given in the Showcase and chrom is the chromosome: 1,2,...,22,X,Y,XY or MT. Additional person/sample-independent elements of a dataset (e.g. index files or QC) are not regarded as confidential and may be downloaded directly from the UKB Showcase Resource areas.

A full list of the available flags can be obtained by running gfetch without any parameters. Particularly important are:

-m will produce the Link file associated with the Anonymised dataset if such exists for the type of data;
-b will fetch a particular block from a multi-block dataset (block 0 is fetched if not specified);
-v will produce extra diagnostic information in case of problems.
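For scripted use, the command line can be assembled programmatically. The helper below is a sketch that covers only the three flags listed above and writes the block flag in the attached form used by the example in Section 5.1 (`-b2`). The function name `gfetch_args` and its keyword arguments are hypothetical; run gfetch without parameters for the authoritative flag list.

```python
from typing import List, Optional, Union

# Chromosome labels accepted by the -c flag, as listed in this guide.
VALID_CHROMS = [str(n) for n in range(1, 23)] + ["X", "Y", "XY", "MT"]


def gfetch_args(
    field_id: int,
    chrom: Union[int, str],
    block: Optional[int] = None,
    link: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list of the form: gfetch field_id -cchrom [flags]."""
    chrom = str(chrom)
    if chrom not in VALID_CHROMS:
        raise ValueError(f"unknown chromosome: {chrom!r}")
    args = ["gfetch", str(field_id), f"-c{chrom}"]
    if block is not None:
        if block < 0:
            raise ValueError("block index starts at 0")
        args.append(f"-b{block}")
    if link:
        args.append("-m")  # request the Link file instead of the Anonymised file
    if verbose:
        args.append("-v")  # extra diagnostic output
    return args
```

The returned list can be passed directly to `subprocess.run`. Passing a list rather than a single string avoids shell quoting issues.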

Downloaded files will have names which reflect the parameters used to acquire them (and thus their contents), generally

 ukbF_cC_bB_vV.typ

where

F - field id
C - chromosome
B - block index
V - version of data

with the names of Link files also containing a final "_s" component followed by the number of participants listed in them (for example "_s6" for six participants).
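The naming pattern can be decoded mechanically, which is useful when checking that a Link file and an Anonymised file share the same field, chromosome, block and version. The regular expression below is a sketch based only on the general pattern given above; since the guide says names "generally" follow it, real files may include variations that this does not handle. The function name `parse_ukb_filename` is an assumption.

```python
import re
from typing import Dict, Optional, Union

# ukbF_cC_bB_vV[_sN].typ, where the _sN part appears on Link files only.
_NAME = re.compile(
    r"^ukb(?P<field>\d+)"
    r"_c(?P<chrom>\d{1,2}|X|Y|XY|MT)"
    r"_b(?P<block>\d+)"
    r"_v(?P<version>\d+)"
    r"(?:_s(?P<samples>\d+))?"
    r"\.(?P<ext>\w+)$"
)


def parse_ukb_filename(name: str) -> Optional[Dict[str, Union[int, str, None]]]:
    """Split a downloaded file name into its parts, or return None if it does not match."""
    match = _NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "field": int(parts["field"]),
        "chrom": parts["chrom"],
        "block": int(parts["block"]),
        "version": int(parts["version"]),
        # None for Anonymised files, which carry no participant count.
        "samples": int(parts["samples"]) if parts["samples"] else None,
        "ext": parts["ext"],
    }
```

Two files belong together when their `field`, `chrom`, `block` and `version` values agree.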

5.1 Examples

To fetch the Anonymous genotype calls (Field 22418) for Chromosome 17 enter

 gfetch 22418 -c17

which will produce a bed-format file. To fetch the Link file associated with that Anonymous dataset add the -m parameter to the command line, hence

 gfetch 22418 -c17 -m

will fetch the corresponding fam-format file.

To fetch the 3rd block (index 2, since blocks are numbered from 0) of the OQFE exome pVCF (Field 23156) for Chromosome 9 enter

 gfetch 23156 -c9 -b2

If data has been divided into blocks then a Resource attached to the relevant field (837 for field 23156) will indicate how many blocks there are for each chromosome.

5.2 Duplication

Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically:

The fam file is identical for all chromosomes with genotype data formats.
The sample file is identical for chromosomes 1-22 in the WTCHG imputed data.
The fam file is identical for all chromosomes with exome data.

See Standard Usages for help on working with this.

6. Fetching relatedness

The genotype information allows one to infer which participants within UK Biobank are related, and how closely. To retrieve this information run gfetch with the "rel" parameter thus:

 gfetch rel

This will produce a 5 column file giving a pairwise listing of related individual pseudo-IDs accompanied by the values:

HetHet : the fraction of markers for which the pair both have a heterozygous genotype;
IBS0 : the fraction of markers for which the pair shares zero alleles;
Kinship : estimate of the kinship coefficient for the pair based on the set of markers used in the kinship inference.

In any pair where one or more of the participants has withdrawn, both pseudo-IDs are replaced by negative numbers.
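Pairs involving withdrawn participants can be dropped when the file is loaded. The sketch below assumes a whitespace-delimited layout with the two pseudo-IDs in the first two columns followed by HetHet, IBS0 and Kinship in the order listed above, and it tolerates a header row. The delimiter, the column order and the names `RelatedPair` and `read_related_pairs` are assumptions to verify against the downloaded file.

```python
from typing import Iterable, List, NamedTuple


class RelatedPair(NamedTuple):
    id1: int
    id2: int
    hethet: float
    ibs0: float
    kinship: float


def read_related_pairs(lines: Iterable[str]) -> List[RelatedPair]:
    """Parse 5-column relatedness rows, skipping pairs with negative pseudo-IDs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # blank or malformed line
        try:
            id1, id2 = int(fields[0]), int(fields[1])
            hethet, ibs0, kinship = (float(x) for x in fields[2:])
        except ValueError:
            continue  # non-numeric row, e.g. a header
        if id1 < 0 or id2 < 0:
            continue  # pair involves a withdrawn participant
        pairs.append(RelatedPair(id1, id2, hethet, ibs0, kinship))
    return pairs
```

The function accepts any iterable of lines, so an open file handle can be passed directly.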

7. File versioning

UK Biobank is a large study involving over 500,000 members of the general UK population. As a result of its size and composition it regularly encounters issues which are rare or absent in more tightly focussed studies involving only a few hundred or a few thousand participants. In particular, in most years a small number of participants decide to withdraw their consent to being in the study completely. UK Biobank and all Researchers using its data then have a legal duty (as detailed in the MTA signed when an Application is approved) to desist from any analysis work on individual-level data concerning those participants.

When a participant withdraws consent, UK Biobank immediately flags this in the central databases. The Link files are dynamically generated for each Researcher at the time of download and respond to this change immediately, substituting negative dummy-IDs for the pseudo-ID of any withdrawn participants. Using these new Link files for analysis work instantly removes any connection to withdrawn elements in the Anonymised data.

At manageable intervals UK Biobank will regenerate the Anonymised files to purge any accumulating unusable entries - at which point Researchers will also need new Link files due to a change in the number of rows/columns in the Anonymised files. Notices will be sent to all Researchers registered for genetic data in advance of such a purge being made.

Note that there will generally be different numbers of participants present in the various types of genetic files due to different samples being used to generate them and subsequent processing and quality control.

7.1 File versioning illustration

To illustrate how the file versioning is performed, consider an initial Fam (Link) file which works with a Confidence (Anonymised) file; the latter is chosen for clarity because of its plain-text format. To simplify further, imagine that the Version 2 dataset contained only 6 participants and 3 SNPs on Chromosome 1, in which case the initial release (for Application 99) might be:

ukb22419_c1_b0_v2_s6.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
3679861 3679861 0 0 2 Batch_b032
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v2.txt

0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 3679861 withdraws then the Fam file contents and name ("s6" becoming "s5") would change to:

ukb22419_c1_b0_v2_s5.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
2874520 2874520 0 0 1 UKBiLEVEAX_b11
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-1 -1 0 0 0 redacted
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v2.txt

0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the participant with pseudo-ID 2874520 also withdraws then the Fam file would change to:

ukb22419_c1_b0_v2_s4.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
-2 -2 0 0 0 redacted
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v2.txt

0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
0.0031 0.0032 0.0033 0.0034 0.0035 0.0036

If the Anonymised confidences file is then purged and regenerated (moving from Version 2 to 3), both the files and their names would alter to become:

ukb22419_c1_b0_v3_s4.fam

3298462 3298462 0 0 2 Batch_b001
8029816 8029816 0 0 1 Batch_b007
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v3.txt

0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036

If the participant with pseudo-ID 8029816 subsequently withdraws then the Fam file would change to:

ukb22419_c1_b0_v3_s3.fam

3298462 3298462 0 0 2 Batch_b001
-1 -1 0 0 0 redacted
9023752 9023752 0 0 1 UKBiLEVEAX_b11
7397822 3679861 0 0 2 Batch_b024

ukb22419_c1_b0_v3.txt

0.0011 0.0012 0.0014 0.0016
0.0021 0.0022 0.0024 0.0026
0.0031 0.0032 0.0034 0.0036
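In an analysis pipeline, the redacted rows of the current Fam file can be turned into a column mask for the Anonymised file. The sketch below assumes, as in this illustration, that the n-th row of the Fam file corresponds to the n-th column of the plain-text confidence file. It uses the Version 2 "s4" Fam contents and Version 2 confidences shown above; the function names `retained_columns` and `drop_withdrawn` are hypothetical.

```python
from typing import List, Sequence


def retained_columns(fam_lines: Sequence[str]) -> List[int]:
    """Return the row indices of a Fam file whose first-column ID is not negative."""
    rows = [line.split() for line in fam_lines if line.strip()]
    return [index for index, fields in enumerate(rows) if int(fields[0]) >= 0]


def drop_withdrawn(conf_lines: Sequence[str], keep: Sequence[int]) -> List[List[str]]:
    """Keep only the listed columns of each confidence row (values stay as text)."""
    result = []
    for line in conf_lines:
        if not line.strip():
            continue
        values = line.split()
        result.append([values[i] for i in keep])
    return result


# Contents of ukb22419_c1_b0_v2_s4.fam from the illustration.
FAM_V2_S4 = [
    "3298462 3298462 0 0 2 Batch_b001",
    "8029816 8029816 0 0 1 Batch_b007",
    "-1 -1 0 0 0 redacted",
    "9023752 9023752 0 0 1 UKBiLEVEAX_b11",
    "-2 -2 0 0 0 redacted",
    "7397822 3679861 0 0 2 Batch_b024",
]

# Contents of ukb22419_c1_b0_v2.txt from the illustration.
CONF_V2 = [
    "0.0011 0.0012 0.0013 0.0014 0.0015 0.0016",
    "0.0021 0.0022 0.0023 0.0024 0.0025 0.0026",
    "0.0031 0.0032 0.0033 0.0034 0.0035 0.0036",
]

keep = retained_columns(FAM_V2_S4)         # [0, 1, 3, 5]
usable = drop_withdrawn(CONF_V2, keep)     # first row: 0.0011 0.0012 0.0014 0.0016
```

The filtered rows match the Version 3 confidence file in the illustration, which is the effect that the purge and regeneration has on the server side.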

8. Standard usages

This section details some commonly encountered conundrums with analysis pipelines and suggests workarounds for them.

Common Link files
Many of the Anonymous files share the same Link files. It is possible to download a separate Link file for every Anonymous file, but this process would have to be repeated whenever a participant withdrew. There are various ways of making this more efficient, all of which start by downloading only a small number of Link files (for instance ukb22418_c1_b0_vZ_sP.fam for the genotype data) and then doing one of the following:

Create multiple symlinks to act as the other Link files. For instance
```
ln   -s   ukb22418_c1_b0_vZ_sP.fam   ukb22418_c2_b0_vZ_sP.fam
```
sets up an alias whereby the Chromosome 1 file 'looks like' the Chromosome 2 file.
Set up a script to physically copy the initial file into any other names required.
Use the optional parameters that some analysis programs provide for specifying the Anonymous and Link files separately.

Using any of these methods only the original two Link files need to be re-downloaded when a participant withdraws.
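The symlink approach can be automated for all chromosomes at once. The sketch below is intended for Linux systems (creating symlinks on MS-Windows may need extra privileges) and assumes the file naming pattern from Section 5, in which the chromosome appears as a "_cC_" component. The function name `link_chromosomes` is an assumption.

```python
import os
import re
from pathlib import Path
from typing import Iterable, List, Union


def link_chromosomes(source: Union[str, Path], chroms: Iterable[str]) -> List[Path]:
    """Create symlinks beside `source`, one per chromosome label, and return them."""
    source = Path(source)
    created = []
    for chrom in chroms:
        # Swap only the first "_cC_" component of the file name.
        name = re.sub(r"_c[^_]+_", f"_c{chrom}_", source.name, count=1)
        target = source.with_name(name)
        if target == source:
            continue  # the downloaded file itself needs no alias
        if target.is_symlink():
            target.unlink()  # refresh an alias left over from an earlier run
        elif target.exists():
            raise FileExistsError(f"refusing to overwrite real file: {target}")
        # A relative link keeps working if the whole directory is moved.
        os.symlink(source.name, target)
        created.append(target)
    return created
```

Because the participant count in the Link file name changes after a withdrawal, aliases created for an older Link file will have different names and need to be removed separately.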

Shared datasets
Because of the large size of the Anonymous files, UK Biobank has agreed that, with permission, Researchers from multiple approved Applications within a unit may share a common copy of them. However, this can create problems: some analysis programs assume specific names and/or locations for their input files, and it is likely that there will be more than one set of Link files in use simultaneously. Possible remedies include:

Use symlinks to create multiple virtual copies of the Anonymous files in the same apparent location as each set of Link files.
Some analysis programs have optional parameters which allow the names and locations of their multiple input files to be specified independently.

In general, however, the use of shared datasets is discouraged and will be deprecated as UKB develops its own online access platforms.
