Accessing Bulk Data within UK Biobank

Some of the data held by UK Biobank is too large or complexly formatted to be distributed as part of a standard phenotype dataset. Instead, the ukbfetch client has been developed to allow Approved researchers to download elements of it piecemeal to their local systems from secure online repositories outside the main UK Biobank showcase system. This guide explains how to use ukbfetch.

To use the UK Biobank secure online repository services a researcher must

  1. be a validated UK Biobank researcher;
  2. be part of an Approved Application;
  3. have been issued a standard dataset together with the associated password credentials;
  4. have included the desired bulk fields in an approved Basket.

This webpage details the means by which bulk data held by UK Biobank can be accessed and manipulated once access has been approved.

  1. Preparation
  2. Notices
  3. Authentication
  4. Fetching data
  5. Automation

1. Preparation

Following approval of a research application, researchers will be sent a 32-character MD5 Checksum and a 64-character password. The next step is to acquire the ukbfetch utility from the Downloads section of the Showcase website.
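A checksum of this kind is normally used to confirm that a downloaded file arrived intact. This guide does not say which file the checksum covers, so the following is only a minimal sketch. It assumes the checksum applies to a file you have downloaded and that the standard Linux md5sum tool is available. The function name check_md5 is hypothetical.

```shell
# Succeed if FILE's MD5 digest matches the EXPECTED 32-character hex string.
# Assumption: the checksum refers to a file that has already been downloaded.
check_md5() {
    local file="$1" expected actual
    [ -r "$file" ] || return 2
    # Normalise to lower case so an upper-case checksum still compares equal.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # Reading from stdin makes md5sum print "<digest>  -"; keep only the digest.
    actual=$(md5sum < "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}
```

It would be called as check_md5 followed by the file name and the checksum you were sent. It returns success only when the two digests agree.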

Some of the UKB utilities are supplied pre-compiled for both MS-Windows and Linux systems. The MS-Windows utilities have the suffix .exe, but the explanations given in this guide omit it for generality. All the utility programs are command-line tools, so Windows versions are best run from a Command Prompt window, and Linux versions are best run directly from a Terminal.

The repository consists of a pair of mirrored systems each connected to the UK JANET network by independent links. The system names are:

To access bulk data from a remote computer the system that the download utility is running on must be able to make http (Port 80) connections to at least one, and preferably both, of the repository systems. If this is not possible then researchers should contact their local IT team to resolve the issue.

2. Notices

Before downloading any data, Researchers are reminded that:

Note also that, while it is possible to run multiple downloads in parallel, to ensure fair usage the system will not permit a single Application to run more than 10 downloads simultaneously.

3. Authentication

To access the repository it is necessary to prove one's identity to the system using a keyfile. See Resource 667 for detailed information on this.

4. Fetching Data

The standard dataset (downloadable directly by researchers) contains a record of all the bulk data-files approved. However, it holds only the data-file IDs, not the contents of the files themselves. These data-file IDs have the format "F_I_A" where F is the field ID, I is the instance index and A is the array index. Hence 8034_4_2 corresponds to Field 8034, Instance 4, Array 2.
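The F_I_A layout can be taken apart with standard shell tools. The following is a minimal sketch, assuming that all three parts are plain decimal numbers. The function name parse_datafile_id is hypothetical and is not part of any UK Biobank utility.

```shell
# Split a data-file ID of the form F_I_A into "field instance array".
# Returns non-zero if the ID is not three underscore-separated numbers.
parse_datafile_id() {
    # Accept only three runs of digits joined by underscores.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+_[0-9]+_[0-9]+$' || return 1
    # Replace the underscores with spaces.
    printf '%s\n' "$1" | tr '_' ' '
}

# Example: prints "8034 4 2" (Field 8034, Instance 4, Array 2).
parse_datafile_id 8034_4_2
```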

To analyse a particular bulk data file, a copy must be retrieved from the repository using the ukbfetch utility. This program can be obtained from the Downloads section of the UK Biobank Showcase website.

The ukbconv utility, downloadable similarly, can be used to produce lists of all the Bulk data-files included in an application.

4.1 Using ukbfetch

The ukbfetch utility can be used with various flags to retrieve either single or multiple Bulk data-files. The format of the ukbfetch command is:
  ukbfetch  -eperson_id  -ddataset_name  -bbatch_file  [-aauthentication_keyfile] [-v]
where the flags are as follows:

-a  Specifies the authentication keyfile containing application ID and truncated password. This is an optional flag and is not required if the default authentication file name (.ukbkey) has been used.
-b  Specifies a batch file containing participant-ID and data-file ID pairs, for retrieving multiple data-files at once. Details on creating a batch file are given in Section 5 below.
-d  Specifies a single data-file ID to be retrieved.
-e  Specifies a single participant ID to be retrieved.
-h  Shows a basic help message.
-m  Specifies that only the first N data-files listed in a batch file should be retrieved (entered as -mN, e.g. -m20).
-o  Specifies an alternate name for the output logfile.
-s  Specifies line N as the starting point for retrieving data-files listed in a batch file (entered as -sN, e.g. -s50).
-v  Specifies that output should be verbose (useful for tracing errors).

Either both -d and -e must be present, or -b alone must be present.
As an example, suppose the authentication keyfile .ukbkey exists, then to retrieve datafile 6025_1_0 for person 829423 enter the following:

  ukbfetch -e829423 -d6025_1_0
which will create the file 829423_6025_1_0.typ on the local disk, where typ is an extension appropriate to the type of file. On failure the program will output an error message.

Note that ukbfetch will exit if it attempts to download a datafile to a disk location with insufficient space. Once access has been granted, individual files may be downloaded an unlimited number of times, so we suggest that researchers delete each file once they have finished analysing it.
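Because ukbfetch stops when the disk fills up, it can help to check the free space before starting a long run. The following is a minimal sketch using the POSIX df command. The function name has_free_kb is hypothetical, and the 1 GB threshold is a placeholder rather than a figure from UK Biobank, since the size of bulk files varies by field.

```shell
# Succeed if the file system holding DIR has at least NEEDED_KB kilobytes free.
has_free_kb() {
    local dir="$1" needed_kb="$2" avail_kb
    # df -Pk prints one line per file system; column 4 is the free space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$needed_kb" ]
}

# Example: require roughly 1 GB free in the current directory (placeholder value).
if has_free_kb . 1048576; then
    echo "enough space to start fetching"
else
    echo "free up disk space before running ukbfetch" >&2
fi
```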

5. Automation

It is possible to retrieve multiple datafiles as a batch by supplying ukbfetch with a file containing lists of the person and datafile identifiers (in which case the authentication file must be named .ukbkey). As an example, creating a file input.txt with the contents

829423 6025_0_0
829582 6025_1_1
829582 21012_0_2

then entering

  ukbfetch -binput.txt
would instruct ukbfetch to retrieve the three datafiles listed. The names of the files retrieved will be saved as a list in the file fetched.lis which can be used as an input file to produce worklists for analysis programs (successive runs will over-write this file, so use the -o option to specify alternative names if you wish to keep the files).

Lines beginning with # in a batch file will be ignored - this may be used to embed comments.
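A batch file can be checked before it is passed to ukbfetch. The following is a minimal sketch. It assumes that the two columns are separated by whitespace, that participant IDs are plain numbers, and that blank lines are harmless. The guide itself states only that lines beginning with # are ignored. The function name check_batch_file is hypothetical.

```shell
# Print the number of fetchable entries in a batch file.
# Exits non-zero and reports the line number if any entry is malformed.
check_batch_file() {
    awk '
        /^#/    { next }    # comment line, ignored by ukbfetch
        NF == 0 { next }    # blank line (assumed to be harmless)
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+_[0-9]+_[0-9]+$/ { n++; next }
        { printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"; bad = 1 }
        END { print n + 0; exit bad + 0 }
    ' "$1"
}
```

Run against the input.txt example above, check_batch_file input.txt would print 3. The count can also be compared with the 50,000-file limit on a single run.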

Note that no more than 50,000 files can be retrieved in a single run of ukbfetch. However, researchers may run multiple instances of ukbfetch simultaneously using different input files.

To facilitate producing lists of datafiles, the ukbconv utility has an option to output "bulk" format, which can be loaded directly by ukbfetch. This produces a single file containing all the datafile names. If this exceeds 50,000 lines, then the -s and -m flags will need to be used with ukbfetch.

Please be aware that, because the files are encrypted in transit (and decrypted only on receipt), receiving multiple streams of them simultaneously may stress your local system and actually result in decreased throughput overall.

5.1 Automation Example

To illustrate, suppose we have a (decrypted) standard dataset ukb789.enc_ukb, belonging to application 789, with password c3d4a1b2c3d4a1b2c3d4a1b2c3d4a1b2c3d4a1b2c3d4a1b2c3d4a1b2c3d4a1b2c3d4, and containing bulk fields 145 and 728.

To generate a list of datafiles (to be fetched) for field 145, enter the command

  ukbconv  ukb789.enc_ukb  bulk  -s145
which will output the file ukb789.bulk. To fetch the datafiles listed in ukb789.bulk, create a .ukbkey file containing


and enter the command

  ukbfetch -bukb789.bulk
which will connect to the repository and download copies of the information. The names of the successfully fetched datafiles will be written to the logfile fetched.lis.

Note: A long batch file can be fetched in several runs using the -s and -m flags. For example, if the ukb789.bulk file contained 2300 lines, then the data could be retrieved in three runs using the following set of commands

  ukbfetch -bukb789.bulk -s1 -m800 -of1
  ukbfetch -bukb789.bulk -s801 -m800 -of2
  ukbfetch -bukb789.bulk -s1601 -m700 -of3
The end result would be 2300 datafiles (assuming sufficient disk-space) in the current directory, accompanied by the output logfiles f1.lis, f2.lis and f3.lis.
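The start line and count for each run can be computed rather than worked out by hand. The following is a minimal sketch that only prints the commands and does not run ukbfetch. It uses -s for the start line and -m for the number of files, as described in the flag list in Section 4.1. The function name print_chunk_commands and the f1, f2, ... logfile prefixes are illustrative choices.

```shell
# Print one ukbfetch command per chunk of a batch file.
# Arguments: batch file name, total number of lines, lines per run.
print_chunk_commands() {
    local batch="$1" total="$2" chunk="$3"
    local start=1 i=1 remaining count
    # A chunk size of zero or less would never advance, so refuse it.
    [ "$chunk" -gt 0 ] || return 1
    while [ "$start" -le "$total" ]; do
        remaining=$((total - start + 1))
        # The last run takes whatever is left over.
        if [ "$remaining" -lt "$chunk" ]; then
            count=$remaining
        else
            count=$chunk
        fi
        printf 'ukbfetch -b%s -s%d -m%d -of%d\n' "$batch" "$start" "$count" "$i"
        start=$((start + count))
        i=$((i + 1))
    done
}

# Example: 2300 lines in runs of at most 800 lines each.
print_chunk_commands ukb789.bulk 2300 800
```

If the printed commands are run in parallel, keep in mind the limit of 10 simultaneous downloads per Application.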