
UK Biobank Synthetic Dataset

The Synthetic Dataset has been created to allow large scale system testing using data which is comparable in size and constitution to the real dataset.

Values are presented attached to participant EIDs belonging to some fictional Application. Values were randomly generated and the dataset is presented without any warranty of correctness or accuracy. The datasets are therefore not internally consistent – for instance they may contain reports of prostate cancer linked to female participants, medical events after death or dates of disease without corresponding diagnoses.

Files for download are listed below alongside their MD5 checksums (as generated by the standard Linux md5sum utility). The checksums are also available as separate files.
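As a rough illustration, a downloaded file could be checked against one of the checksum files along the following lines. This is a minimal sketch which assumes that each checksum file uses the usual md5sum layout of one "digest, whitespace, file name" entry per line; the function names are hypothetical and are not part of any UK Biobank tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines into a {file name: digest} dict (assumed layout)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary-mode entries with a leading '*' on the name
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(listing_path, directory="."):
    """Map each listed file name to True if its MD5 matches, else False."""
    expected = parse_md5_listing(Path(listing_path).read_text())
    return {
        name: md5_of_file(Path(directory) / name) == digest
        for name, digest in expected.items()
    }
```

For example, verify("tabular.md5") would return a dictionary with one True/False entry per listed file, provided those files sit in the current directory.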

Methods Summary

A more detailed summary of the methods used to construct the UK Biobank synthetic dataset is provided in this document

Tabular Records

This corresponds to the information issued to researchers via the standard UK Biobank mechanism of creating a Showcase basket and then downloading the extracted results. It consists of plain ASCII text files containing variables separated by tabs across columns and new-lines between rows. The first column of each row is the participant identifier (EID). The dataset contains approximately 27,000 columns by 600,000 rows. A header row is included, containing “EID” followed by the data column names using the convention:
 FieldID – InstanceID . ArrayID 
Because of its size, the dataset is delivered as a set of 23 files, each containing a non-overlapping subset of the whole, split into groups of fields (i.e. all persons are present in every file).
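The header convention can be unpacked programmatically. The sketch below assumes that the separator between FieldID and InstanceID may appear either as a plain hyphen or as an en dash, and that all three parts are non-negative integers; the names ColumnName, parse_column_name and parse_header are hypothetical, and the column names in the usage note are made up for illustration.

```python
import re
from typing import NamedTuple


class ColumnName(NamedTuple):
    field_id: int
    instance_id: int
    array_id: int


# Accept '-' or an en dash between FieldID and InstanceID (an assumption),
# and tolerate optional spaces around the separators.
_COLUMN_RE = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*\.\s*(\d+)\s*$")


def parse_column_name(name):
    """Split one header entry into (FieldID, InstanceID, ArrayID)."""
    match = _COLUMN_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised column name: {name!r}")
    return ColumnName(*(int(group) for group in match.groups()))


def parse_header(line):
    """Parse a tab-separated header row whose first entry is 'EID'."""
    first, *rest = line.rstrip("\r\n").split("\t")
    if first.strip().upper() != "EID":
        raise ValueError("header row does not start with EID")
    return [parse_column_name(column) for column in rest]
```

Calling parse_header("EID\t31-0.0\t4080-1.2\n") on this made-up header would give two ColumnName entries, the second with field_id=4080, instance_id=1 and array_id=2.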

Additional information regarding the names, types and encoding meta-data related to the phenotype file is available from the Schemas section of the public UK Biobank Showcase at: http://biobank.ndph.ox.ac.uk/showcase/schema.cgi

The following files are present:

The file tabular.md5 contains the above MD5s.

Medical Records

This is plain ASCII text containing variables separated by tabs across columns and new-lines between rows. It corresponds to the gp_clinical table (as specified in 2019) accessible to researchers via the portal on the Showcase website. The data is split into 6 files which together contain approximately 400,000,000 rows of 8 columns each. The columns in the files are:
  1. EID, 7-digit integer
  2. data provider, integer
  3. event date, date in ANSI yyyymmdd format
  4. read2, char(8) coding
  5. read3, char(20) coding
  6. value1, char(1024) text
  7. value2, char(1024) text
  8. value3, char(1024) text
Encoding maps for the read2 and read3 columns may be found at: https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=592
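A single row of these files could be read into a typed record roughly as follows. This is a sketch only: it assumes the rows carry no header, that an empty event date should map to None, and that the text columns need no further decoding. The ClinicalEvent type and parse_clinical_row function are hypothetical names, and the row in the usage note is invented.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional


class ClinicalEvent(NamedTuple):
    eid: int
    data_provider: int
    event_date: Optional[date]
    read2: str
    read3: str
    value1: str
    value2: str
    value3: str


def parse_clinical_row(line):
    """Parse one tab-separated row of 8 columns into a ClinicalEvent."""
    # Strip only the line terminator so that trailing empty columns survive.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    eid, provider, raw_date, read2, read3, value1, value2, value3 = fields
    # Dates are given as yyyymmdd; treating an empty field as None is an assumption.
    event_date = datetime.strptime(raw_date, "%Y%m%d").date() if raw_date else None
    return ClinicalEvent(
        int(eid), int(provider), event_date, read2, read3, value1, value2, value3
    )
```

For instance, parse_clinical_row("1234567\t3\t20100315\t246..\t\t120\t80\t\n") would yield a record with eid=1234567 and event_date=date(2010, 3, 15), with empty strings for the read3 and value3 columns.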

The following files are present:

The file medrec.md5 contains the above MD5s.

Genetic Records

This contains the SNP information produced by the UKB genotyping chip, with meta-data available from http://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=15. The dataset contains approximately 600,000 “rows” of 840,000 columns each. The data are supplied as a dictionary file plus a series of data files.

The dictionary file “gene_dic.dat” is a 7-column tab separated file, containing the columns:

  1. Affymetrix ID (integer)
  2. Chromosome ID (1-2 chars)
  3. Index (integer, zero-based) along data row
  4. SNP variant 0, e.g. "T T"
  5. SNP variant 1, e.g. "T T"
  6. SNP variant 2, e.g. "T T"
  7. SNP variant 3, e.g. "AGC G"
If there are fewer than 4 variants for a particular SNP then unused entries are blank (i.e. empty quotes).

The data are supplied as a series of 26 files named "rand_chr*.dat.gz", each containing the SNPs for a single chromosome across the whole cohort and accompanied by rand_chr.md5 giving the MD5 checksums for the uncompressed files. Data are provided as plain ASCII text, without spaces/tabs. This format corresponds to the internal format used by UK Biobank to service basket requests (typically someone asking for a few dozen SNPs). Each line has the format:

 eeeeeee ABCD….Z 
where eeeeeee is the 7-digit EID and A….Z are the index values for the SNP variants in the same order as specified in the dictionary file. To illustrate, the first 3 lines of rand_chr21.dat begin:

In the first row, person 1707540 has value=0 for the 5th SNP (index=4). Referring to gene_dic.dat one finds the line:

which means that for the SNP with AffyID 52233461, person 1707540 has genotype "C C".
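The lookup described here can be sketched in code. The example below rests on several assumptions that the text does not settle: that the dictionary index counts SNPs within a single chromosome file, that variant entries may be wrapped in double quotes, that each position in a data line holds a single digit from 0 to 3, and that any whitespace after the 7-digit EID can be ignored. The function names are hypothetical, and the sample values in the usage note are invented rather than taken from the real files.

```python
import csv
import io


def load_dictionary(handle, chromosome):
    """Map zero-based SNP index -> (affy_id, [variant0..variant3]) for one chromosome."""
    snps = {}
    # csv.reader strips the surrounding double quotes, so "" becomes an empty string.
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        affy_id, chrom, index, *variants = row
        if chrom == chromosome:
            snps[int(index)] = (int(affy_id), variants)
    return snps


def decode_line(line, snps):
    """Return (eid, {affy_id: genotype}) for one line of a chromosome data file."""
    line = line.rstrip("\r\n")
    eid = int(line[:7])          # the first 7 characters are the EID
    codes = line[7:].strip()     # the remainder holds one variant index per SNP
    genotypes = {}
    for index, code in enumerate(codes):
        if index not in snps:
            continue  # position not described in the dictionary
        affy_id, variants = snps[index]
        genotypes[affy_id] = variants[int(code)]
    return eid, genotypes


# Invented two-SNP dictionary and data line, for illustration only.
SAMPLE_DICTIONARY = (
    '101\t21\t0\t"A A"\t"A G"\t"G G"\t""\n'
    '102\t21\t1\t"C C"\t"C T"\t"T T"\t""\n'
)
sample_snps = load_dictionary(io.StringIO(SAMPLE_DICTIONARY), "21")
sample_eid, sample_genotypes = decode_line("1234567 02\n", sample_snps)
```

With the invented sample above, sample_genotypes maps SNP 101 to "A A" (variant index 0) and SNP 102 to "T T" (variant index 2). When reading gene_dic.dat from disk, opening it with newline="" is the usual recommendation for the csv module.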

The following files are present:

The file genotype.md5 contains the above MD5s.
The file rand_chr.md5 contains MD5s of the genotype files prior to their being gzipped.

Bulk Files

This is a collection of ~6,000,000 files which simulate the UK Biobank bulk file repository. The file contents, however, are unrelated to those found in the live system and are also much smaller, to facilitate download and handling in a reasonable interval. They are provided primarily to allow system developers to practise handling and pseudonymising such a collection rather than for use with type-specific analysis pipelines. Files are named according to the standard UK Biobank download convention of:

The following files are present:

The file bulk.md5 contains the above MD5s.