Accessing Genetic Data within UK Biobank
The genetic datasets held by UK Biobank are too large to be distributed as part of a standard phenotype dataset.
Instead, the gfetch client has been developed to allow Approved researchers to download elements of them piecemeal to their local systems
from secure online repositories outside the main UK Biobank Showcase system. This guide explains how to use gfetch.
To use the UK Biobank secure online repository services a researcher must:
- be a validated UK Biobank researcher;
- be part of an Approved Application;
- have been issued a standard dataset together with the associated password credentials;
- have included the desired genetic fields in an approved Basket.
This webpage details the means by which large-scale genetic data held by UK
Biobank can be accessed and manipulated once access has been approved.
- Preparation
- Notices
- Data Groups
- Authentication
- Fetching Data
- Fetching relatedness
- File versioning
- Standard usages
1. Preparation
Following approval of a research application, researchers will be sent
a 32-character MD5 Checksum and a 64-character password. The next step is to acquire the gfetch utility from the
Downloads section of the Showcase website.
Some of the UKB utilities are supplied pre-compiled for both MS-Windows and
Linux systems. The MS-Windows utilities have the suffix .exe; however, the
explanations given in this guide omit this for generality. All the utility programs
are command-line tools, so Windows versions are best run from a Command Prompt window,
and Linux versions are best run directly from a Terminal.
The repository consists of a pair of mirrored systems each connected to the UK JANET network
by independent links. The system names are:
- biota.ndph.ox.ac.uk
- chest.ndph.ox.ac.uk
To access genetic data from a remote computer, the system that the download utility is running on
must be able to make HTTP (port 80) connections to at least one, and preferably both, of the repository systems.
If this is not possible then researchers should contact their local IT team to resolve the issue.
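Before contacting an IT team it can help to confirm whether the problem really is connectivity. The following is a minimal Python sketch, not part of the UK Biobank tooling, that tries to open a TCP connection to port 80 on each of the two repository systems named above. The function names are illustrative assumptions, and a successful connection only shows that the port is reachable; it does not prove that gfetch itself will succeed (for example where a web proxy is in use).

```python
import socket

# Repository systems named in this guide.
REPOSITORY_HOSTS = ("biota.ndph.ox.ac.uk", "chest.ndph.ox.ac.uk")


def can_connect(host, port=80, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be opened.

    `connect` is injectable so the check can be exercised without a network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
    conn.close()
    return True


def reachable_hosts(hosts=REPOSITORY_HOSTS, **kwargs):
    """Return the hosts (in the order given) that accepted a connection."""
    return [host for host in hosts if can_connect(host, **kwargs)]
```

Calling `reachable_hosts()` with no arguments checks both mirrors; an empty list suggests that outbound connections to port 80 are being blocked.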
2. Notices
Before downloading any data, Researchers are reminded that:
- All access attempts, whether successful or denied, are logged and
monitored with the IP address recorded.
- Any data exported outside of the UK Biobank systems must be protected
by strong (e.g. AES256) encryption when not actively in use.
- The volume of data available
in the repository is subject to gradual change and may not match
the list supplied when an application is processed. These changes are
due to participant withdrawals (which require the removal of data) and
the incremental addition of new data for continuing participants.
- It is possible to run multiple downloads in parallel; however, to
provide fair usage the system will not permit a single Application to run
more than 10 simultaneously, and additional attempts will be rejected.
- To ensure continuity of service, the download servers will reject requests
when there are more than 500 simultaneous downloads in progress.
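One way to stay within the limit of 10 simultaneous downloads per Application is to drive gfetch from a small worker pool. The sketch below is an illustration in Python under stated assumptions: the helper names are hypothetical, each command is a list such as `["gfetch", "22418", "-c17"]`, and any authentication arguments (see the Authentication section) are left out. The cap only covers downloads started by this script, not those started elsewhere under the same Application.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Per-Application limit on simultaneous downloads stated in the notices above.
MAX_PARALLEL_DOWNLOADS = 10


def capped_workers(requested):
    """Clamp a requested worker count to the range 1..MAX_PARALLEL_DOWNLOADS."""
    return max(1, min(requested, MAX_PARALLEL_DOWNLOADS))


def run_command(command):
    """Run one download command and return its exit status."""
    return subprocess.run(command, check=False).returncode


def run_downloads(commands, workers=MAX_PARALLEL_DOWNLOADS, runner=run_command):
    """Run commands concurrently without exceeding the per-Application cap.

    `runner` is injectable so the scheduling can be tested without gfetch.
    """
    with ThreadPoolExecutor(max_workers=capped_workers(workers)) as pool:
        # map() preserves input order, so results line up with commands.
        return list(pool.map(runner, commands))
```

For example, `run_downloads([["gfetch", "22418", f"-c{c}"] for c in range(1, 23)])` would queue the 22 autosomes and run at most 10 at a time.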
3. Data Groups
The genetic data in UK Biobank can be grouped into 3 types according to
how malleable and aggregated it is:
- Static meta-data and aggregated results;
- Anonymised results from sample analysis;
- Link files mapping pseudo-ids to other participant information.
Static data includes such datasets as Marker QC (quality control)
information. This data is not affected by the withdrawal of individual
participants. Similarly, aggregated research output such as GWAS results
is not affected by the withdrawal of participants after the
computations have been made.
Anonymised data includes such information as Calls and Imputed values
related to samples. When a participant withdraws consent a small
fraction of these datasets is invalidated and should not be used for
analysis, however this subset cannot be identified from the information
within the file.
Link data includes such information as Fam files which allow the
pseudo-IDs assigned to individual participants to be linked to
particular subsets (e.g. a column) of the sample results files. When a
participant withdraws their consent from UK Biobank their pseudo-ID
information is removed from these datasets at the earliest opportunity.
4. Authentication
To access the repository it is necessary to prove one's identity to the system using a keyfile.
See Resource 667 for detailed information on this.
5. Fetching data
Most genetic data has been divided into per-chromosome datasets for convenience of downloading and use. Some is stored as paired
files (for instance genotype calls), with one containing anonymised data and the other the Application-specific identifiers, while in
others the actual data is customised dynamically. The data for some chromosomes/formats is sufficiently large that it has been broken
up into a number of separate files for a chromosome; these are indexed by a 'block' counter, which begins at 0 for each chromosome.
To download the results files using gfetch, enter the following at the command line:
gfetch field_id -cchrom [flags]
where field_id is the ID of the field as given in the Showcase and chrom is the chromosome 1,2,...,22,X,Y,XY or MT. Additional
person/sample-independent elements of a dataset (e.g. index files or QC) are not regarded as confidential and may be downloaded directly from
the UKB Showcase Resource areas.
A full list of the available flags can be obtained by running gfetch
without any parameters. Particularly important are:
- -m will produce the Link file associated with the Anonymised dataset if such exists for the type of data;
- -b will fetch a particular block from a multi-block dataset (block 0 is fetched if not specified);
- -v will produce extra diagnostic information in case of problems.
Downloaded files will have names which reflect the parameters used to acquire them (and thus their contents), generally
ukbF_cC_bB_vV.typ
where
- F - field id
- C - chromosome
- B - block index
- V - version of data
with the names of Link files also containing a final "_sP" element, where P is the number of participants listed in them.
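Because the file name encodes the download parameters, a pipeline can recover them from the name alone. The following Python sketch parses names of the general form described above, including the optional participant count on Link files. The class and function names are illustrative assumptions, and since the guide says names "generally" follow this pattern, files that deviate from it are rejected rather than guessed at.

```python
import re
from typing import NamedTuple, Optional


class GfetchName(NamedTuple):
    field_id: int
    chromosome: str              # "1".."22", "X", "Y", "XY" or "MT"
    block: int
    version: int
    participants: Optional[int]  # only present in Link file names
    file_type: str


_NAME_RE = re.compile(
    r"ukb(?P<field>\d+)"
    r"_c(?P<chrom>[0-9A-Za-z]+)"
    r"_b(?P<block>\d+)"
    r"_v(?P<version>\d+)"
    r"(?:_s(?P<participants>\d+))?"
    r"\.(?P<typ>.+)"
)


def parse_name(filename):
    """Split a ukbF_cC_bB_vV[_sP].typ file name into its components."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not in the ukbF_cC_bB_vV.typ form: {filename!r}")
    participants = match.group("participants")
    return GfetchName(
        field_id=int(match.group("field")),
        chromosome=match.group("chrom"),
        block=int(match.group("block")),
        version=int(match.group("version")),
        participants=None if participants is None else int(participants),
        file_type=match.group("typ"),
    )
```

Comparing the `version` of a Link file with that of its Anonymised file is a cheap way to catch a mismatched pair before analysis starts.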
5.1 Examples
To fetch the Anonymous genotype calls (Field 22418) for Chromosome 17 enter
gfetch 22418 -c17
which will produce a bed-format file.
To fetch the Link file associated with that Anonymous dataset add the
-m parameter to the command line, hence
gfetch 22418 -c17 -m
will fetch the corresponding fam-format file.
To fetch the 3rd block of the OQFE pVCF (Field 23156) for Chromosome 9 exomes enter
gfetch 23156 -c9 -b2
If data has been divided into blocks then a Resource attached to the relevant field
(837 for field 23156) will indicate how many blocks there are for each chromosome.
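When many such commands have to be generated, building them programmatically avoids off-by-one mistakes with the zero-based block index. The Python sketch below is a hypothetical helper, not part of gfetch: it assembles an argument list using only the flags described in this guide (-c, -b and -m). Authentication arguments are deliberately omitted, and on MS-Windows the program name would carry the .exe suffix.

```python
# Chromosome labels accepted by gfetch according to this guide.
VALID_CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "XY", "MT"]


def gfetch_command(field_id, chromosome, block=None, link_file=False):
    """Build the argument list for one gfetch download.

    block     -- zero-based block index; None leaves gfetch's default (block 0)
    link_file -- True adds -m to request the Link file instead
    """
    chrom = str(chromosome)
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    if block is not None and block < 0:
        raise ValueError("block index starts at 0")
    command = ["gfetch", str(field_id), f"-c{chrom}"]
    if block is not None:
        command.append(f"-b{block}")
    if link_file:
        command.append("-m")
    return command


def block_index(ordinal):
    """Convert a 1-based position ('3rd block') to the 0-based -b value."""
    if ordinal < 1:
        raise ValueError("ordinal positions start at 1")
    return ordinal - 1
```

For instance, `gfetch_command(23156, 9, block=block_index(3))` yields the arguments for the third example above.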
5.2 Duplication
Note that many of the Link files have the same contents for different
anonymous files and hence only a single instance needs to be downloaded.
Specifically:
- The fam file is identical for all chromosomes with genotype data formats.
- The sample file is identical for chromosomes 1-22 in the WTCHG imputed data.
- The fam file is identical for all chromosomes with exome data.
See Standard Usages for help on working with this.
6. Fetching relatedness
The genotype information allows one to infer which participants within UK Biobank are related and how closely. To
retrieve this information run gfetch with the "rel" parameter thus:
gfetch rel
This will produce a 5-column file giving a pairwise listing of related
individual pseudo-IDs accompanied by the values:
- HetHet : the fraction of markers for which the pair both have a heterozygous genotype;
- IBS0 : the fraction of markers for which the pair shares zero alleles;
- Kinship : estimate of the kinship coefficient for the pair based on the set of markers used in the kinship inference.
In any pair where one or more of the participants has withdrawn, both
pseudo-IDs are replaced by negative numbers.
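Pairs involving a withdrawn participant should be dropped before any analysis. The following Python sketch shows one way to do this, under assumptions that the guide does not confirm: that the file is whitespace-delimited, that its columns are in the order ID, ID, HetHet, IBS0, Kinship, and that any header row can be recognised because its first two fields are not integers. The names used are illustrative.

```python
from typing import NamedTuple


class RelatedPair(NamedTuple):
    id1: int
    id2: int
    het_het: float
    ibs0: float
    kinship: float


def parse_relatedness(lines):
    """Return the usable pairs from the lines of a relatedness file.

    Lines without exactly five fields, or whose first two fields are not
    integers (e.g. a header row, if present), are skipped. Pairs with a
    negative pseudo-ID involve a withdrawn participant and are dropped.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue
        try:
            id1, id2 = int(fields[0]), int(fields[1])
        except ValueError:
            continue
        if id1 < 0 or id2 < 0:
            continue
        het_het, ibs0, kinship = (float(value) for value in fields[2:])
        pairs.append(RelatedPair(id1, id2, het_het, ibs0, kinship))
    return pairs
```

The function accepts any iterable of lines, so an open file object can be passed directly.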
7. File versioning
UK Biobank is a large study involving over 500,000 members of the
general UK population. As a result of its size and composition it
regularly encounters issues which are rare or absent in more tightly
focussed studies involving only a few hundreds or thousands of
participants. In particular, in most years a small number of participants decide
to completely withdraw their consent to being in the study which means
that UK Biobank and all Researchers using their data have a legal duty (as
detailed in the MTA signed when an Application is approved) to desist
from doing any analysis work on individual-level data concerning them.
When a participant withdraws consent, UK Biobank immediately flags this
in the central databases. The Link files are dynamically generated for
each Researcher at the time of download and respond to this change
immediately, substituting negative dummy-IDs for the pseudo-ID of any
withdrawn participants. Using these new Link files for analysis work
instantly removes any connection to withdrawn elements in the Anonymised
data.
At manageable intervals UK Biobank will regenerate the Anonymised files
to purge any accumulating unusable entries - at which point Researchers
will also need new Link files due to a change in the number of
rows/columns in the Anonymised files. Notices will be sent to all
Researchers registered for genetic data in advance of such a purge being
made.
Note that there will generally be different numbers of participants
present in the various types of genetic files due to different samples being
used to generate them and subsequent processing and quality control.
7.1 File versioning illustration
To illustrate how the file versioning is performed consider an initial
Fam (Link) file which works with a Confidence (Anonymised) file,
choosing the latter for clarity because of the plain-text format. To
simplify further, imagine the Version 2 dataset contained only 6
participants and 3 SNPs on Chromosome 1, in which case the initial
release (for Application 99) might be:
ukb22419_c1_b0_v2_s6.fam:
  3298462 3298462 0 0 2 Batch_b001
  8029816 8029816 0 0 1 Batch_b007
  2874520 2874520 0 0 1 UKBiLEVEAX_b11
  9023752 9023752 0 0 1 UKBiLEVEAX_b11
  3679861 3679861 0 0 2 Batch_b032
  7397822 3679861 0 0 2 Batch_b024
ukb22419_c1_b0_v2.txt:
  0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
  0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
  0.0031 0.0032 0.0033 0.0034 0.0035 0.0036
If the participant with pseudo-ID 3679861 withdraws then the Fam file
contents and name ("s6" becoming "s5") would change to:
ukb22419_c1_b0_v2_s5.fam:
  3298462 3298462 0 0 2 Batch_b001
  8029816 8029816 0 0 1 Batch_b007
  2874520 2874520 0 0 1 UKBiLEVEAX_b11
  9023752 9023752 0 0 1 UKBiLEVEAX_b11
  -1 -1 0 0 0 redacted
  7397822 3679861 0 0 2 Batch_b024
ukb22419_c1_b0_v2.txt:
  0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
  0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
  0.0031 0.0032 0.0033 0.0034 0.0035 0.0036
If the participant with pseudo-ID 2874520 also withdraws then the Fam
file would change to:
ukb22419_c1_b0_v2_s4.fam:
  3298462 3298462 0 0 2 Batch_b001
  8029816 8029816 0 0 1 Batch_b007
  -1 -1 0 0 0 redacted
  9023752 9023752 0 0 1 UKBiLEVEAX_b11
  -2 -2 0 0 0 redacted
  7397822 3679861 0 0 2 Batch_b024
ukb22419_c1_b0_v2.txt:
  0.0011 0.0012 0.0013 0.0014 0.0015 0.0016
  0.0021 0.0022 0.0023 0.0024 0.0025 0.0026
  0.0031 0.0032 0.0033 0.0034 0.0035 0.0036
If the Anonymised confidences file is then purged and regenerated
(moving from Version 2 to 3), both the files and their names would alter
to become:
ukb22419_c1_b0_v3_s4.fam:
  3298462 3298462 0 0 2 Batch_b001
  8029816 8029816 0 0 1 Batch_b007
  9023752 9023752 0 0 1 UKBiLEVEAX_b11
  7397822 3679861 0 0 2 Batch_b024
ukb22419_c1_b0_v3.txt:
  0.0011 0.0012 0.0014 0.0016
  0.0021 0.0022 0.0024 0.0026
  0.0031 0.0032 0.0034 0.0036
If the participant with pseudo-ID 8029816 subsequently withdraws then
the Fam file would change to:
ukb22419_c1_b0_v3_s3.fam:
  3298462 3298462 0 0 2 Batch_b001
  -1 -1 0 0 0 redacted
  9023752 9023752 0 0 1 UKBiLEVEAX_b11
  7397822 3679861 0 0 2 Batch_b024
ukb22419_c1_b0_v3.txt:
  0.0011 0.0012 0.0014 0.0016
  0.0021 0.0022 0.0024 0.0026
  0.0031 0.0032 0.0034 0.0036
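The illustration can also be read as a masking rule: a negative dummy-ID in row i of the Link file means that column i of the Anonymised file must be ignored. The Python sketch below applies that rule to plain-text rows such as the illustrative confidence file. It is an assumption-laden illustration rather than a UK Biobank procedure: the function names are hypothetical, a row is treated as withdrawn if either of its first two fields is negative, and real binary formats (such as bed) should be filtered with the analysis tool's own sample-exclusion options instead.

```python
def kept_sample_indices(fam_lines):
    """Return the row indices of fam entries that are still usable.

    A row is treated as withdrawn when either of its first two fields has
    been replaced by a negative dummy-ID. Blank lines are ignored.
    """
    rows = [line.split() for line in fam_lines if line.strip()]
    return [
        index
        for index, fields in enumerate(rows)
        if int(fields[0]) >= 0 and int(fields[1]) >= 0
    ]


def drop_withdrawn_columns(data_rows, kept):
    """Keep only the columns of each data row whose index is in `kept`."""
    return [[row[index] for index in kept] for row in data_rows]
```

Applied to the "s4" Fam file in the illustration, `kept_sample_indices` returns `[0, 1, 3, 5]`, and using those indices on the Version 2 confidence rows gives the same values as the purged Version 3 file.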
8. Standard usages
This section details some commonly encountered conundrums with analysis
pipelines and suggests workarounds for them.
Common Link files
Many of the Anonymous files share the same Link files. It is possible to
download a separate Link file for every Anonymous file; however, this
process would have to be repeated whenever a participant withdrew.
There are various ways of making this more efficient, starting
with downloading only a small number of Link files (for instance
ukb22418_c1_b0_vZ_sP.fam for the genotype data).
Using any of these methods only the original two Link files need to be
re-downloaded when a participant withdraws.
Shared datasets
Because of the large size of the Anonymous files UK Biobank has agreed
that, with permission, Researchers from multiple approved Applications
within a unit may share a common copy of them. However this can create
problems as some analysis programs assume specific names and/or
locations for their input files and it is likely that there will be more
than one set of Link files in use simultaneously. Possible remedies include:
- Use symlinks to create multiple virtual copies of the Anonymous
files in the same apparent location as each set of Link files.
- Some analysis programs have optional parameters which allow the
names and locations of their multiple input files to be specified
independently.
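The symlink remedy can be scripted so that each Application gets its own directory containing its Link files plus links to the shared Anonymous files. The following Python sketch is one possible approach, with hypothetical names and directory layout; it assumes a filesystem and account that permit symbolic links (on MS-Windows this may require additional privileges).

```python
import os
from pathlib import Path


def link_shared_files(shared_dir, application_dir, pattern="*"):
    """Symlink shared Anonymous files into an Application's own directory.

    Files already present in application_dir (for example that
    Application's Link files) are left untouched. Returns the list of
    symlinks created.
    """
    shared_dir = Path(shared_dir).resolve()
    application_dir = Path(application_dir)
    application_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for source in sorted(shared_dir.glob(pattern)):
        if not source.is_file():
            continue
        target = application_dir / source.name
        if target.exists() or target.is_symlink():
            continue
        os.symlink(source, target)
        created.append(target)
    return created
```

Running it again after new Anonymous files arrive only adds the missing links, so it is safe to repeat.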
Generally, however, the use of shared datasets is discouraged and will
be deprecated as UKB develops its own online access platforms.