UK Biobank Synthetic Dataset

The Synthetic Dataset has been created to allow large-scale system testing using data which is comparable in size and constitution to the real dataset.

Values are attached to participant EIDs belonging to a fictional Application project. Values were randomly generated and the dataset is presented without any warranty of correctness or accuracy. The datasets are therefore not internally consistent; for instance, they may contain reports of prostate cancer linked to female participants, medical events after death or dates of disease without corresponding diagnoses.

Files for download are listed below alongside their MD5 checksums (as generated by the standard Linux md5sum utility). The checksums are also available as separate files.
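A downloaded file can be checked against its listed checksum along the following lines. This is a minimal sketch in Python rather than an official tool: the function names are illustrative, and it assumes the separate checksum files use the usual md5sum layout of a hex digest, whitespace, then the file name, as in the listings on this page.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines ("<hex digest>  <file name>") into {name: digest}."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading "*" on the file name.
        expected[name.strip().lstrip("*")] = checksum.lower()
    return expected


def verify_listing(listing_path, data_dir="."):
    """Map each listed file to True/False (digest matches) or None (file not found)."""
    expected = parse_md5_listing(Path(listing_path).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(data_dir) / name
        results[name] = md5_of_file(target) == checksum if target.exists() else None
    return results


# One line copied from the tabular listing, parsed in memory.
sample = parse_md5_listing("a68feb44e037397bc3cb43a6c0c86ef9  41257_HES_SimDates.tsv\n")
print(sample)
```

Once the files are downloaded, a call such as verify_listing("tabular.md5", "downloads") would report which of them match; the directory name "downloads" is only a placeholder.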

Tabular Records

This corresponds to the information issued to researchers via the standard UK Biobank mechanism of creating a Showcase basket and then downloading the extracted results. It consists of plain ASCII text containing variables separated by tabs across columns and new-lines between rows. The first column of each row is the participant identifier (EID). The dataset contains approximately 27,000 columns by 600,000 rows. A header row is included, containing “EID” followed by the data column names using the convention of
FieldID – InstanceId . ArrayID
Because of the size of the dataset, the data is delivered as a set of 23 files, each containing a non-overlapping subset of the whole, split into groups of fields (i.e. all persons are present in every file).
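The column naming convention can be split back into its three parts when reading a header row. The sketch below is illustrative only: it assumes names appear in the header as digits, a hyphen, digits, a dot, digits (for example "31-0.0"), while also tolerating the spaced form with an en dash printed above. The field numbers in the sample header are invented, and the names ColumnId, parse_column_name and parse_header are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed form: FieldID, hyphen (or en dash), InstanceID, dot, ArrayID,
# with optional spaces around the separators.
COLUMN_PATTERN = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*\.\s*(\d+)\s*$")


class ColumnId(NamedTuple):
    field_id: int
    instance_id: int
    array_id: int


def parse_column_name(name):
    """Split one data column name into (field, instance, array) integers."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised column name: {name!r}")
    return ColumnId(*(int(part) for part in match.groups()))


def parse_header(header_line):
    """Return the identifier column name and the parsed data column names."""
    names = header_line.rstrip("\r\n").split("\t")
    # The first column is the participant identifier, not a data field.
    return names[0], [parse_column_name(name) for name in names[1:]]


# Invented header row, for illustration only.
id_column, columns = parse_header("EID\t31-0.0\t53-2.0\n")
print(id_column, columns)
```

Because every file contains all persons, the parsed field IDs are enough to work out which of the 23 files holds a given field before any rows are read.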

Additional information regarding the names, types and encoding meta-data related to the phenotype file is available from the Schemas section of the public UK Biobank Showcase at:

The following files are present:

a68feb44e037397bc3cb43a6c0c86ef9  41257_HES_SimDates.tsv
4ff448b195ad417c3ae1324312782c30  41260_HES_SimDates.tsv
46aced37adea430907b81b8370f4718b  41262_HES_SimDates.tsv
5fc75c1d4d221d4e8366d4ce7920e7f8  41263_HES_SimDates.tsv
60007421300548e3a03c317e3392e5d1  41280_HES_SimDates.tsv
3b5a706c475050c5a64ad4359d224309  41281_HES_SimDates.tsv
7592c86dbb8502ca0630a763aa85be47  41282_HES_SimDates.tsv
5c35335d9e91f1eb4c0dca92213f6cb9  41283_HES_SimDates.tsv
7e7ec9ba895eaf465cb766cddcf29a72  bulk_strings.tsv
44af5c5d7bf4c4a6ca8fcdaa5329c9ec  dates_death.tsv
0f50afe342c6dca8c9a23ee41df0c8e3  datetime_fields_2.tsv
708dff0bf4989c50cad25f7ffd15623b  datetime_fields.tsv
ff0689f3629da3cd46097199f59db826  fo_fields_trimmed.tsv
47e4214a945327914d5a82189cf0d560  integer_arrays_part1.tsv
1630aa738230ea4d5a28cf91f3c66f6d  integer_arrays_part2.tsv
86944f36c7ea72b397e5740f7ee6453a  integer_diet_quest_fields.tsv
d57178f580c9bd90ba7f33a9c371a894  integer_no_arrays.tsv
e0793fe5cb9e87f28f3af77cbd14e0e6  integer_other_quest_fields.tsv
7701c46303680aa3f06c4e85b2babf35  oaa_fields.tsv
f63dd2423b242f6c00ebee665190265b  real_fields1.tsv
6a32a2d4341d8abedd31303ec25915e9  real_fields2.tsv
b327f5edbfc7693b1a53379f1bd899e7  string_fields1.tsv
3220e8fbdeffd86ebe56356b1d53fbae  string_fields2.tsv
dc5fafa749070aafcff6f1b87d889836  tabular.md5
The file tabular.md5 contains the above MD5s.

Medical Records

This is plain ASCII text containing variables separated by tabs across columns and new-lines between rows. It corresponds to the gp_clinical table (as specified in 2019) accessible to researchers via the portal on the Showcase website. The data is split into 6 files which together contain approximately 400,000,000 rows of 8 columns each. The columns in the files are:
  1. EID, 7-digit integer
  2. data provider, integer
  3. event date, date in ANSI yyyymmdd format
  4. read2, char(8) coding
  5. read3, char(20) coding
  6. value1, char(1024) text
  7. value2, char(1024) text
  8. value3, char(1024) text
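The eight columns above can be read into a typed record as sketched below. This is a hedged illustration, not the portal's own loader: the sample row is invented, the sketch assumes the files carry no header row, and it treats an empty event date as missing, none of which is confirmed by the description. The names ClinicalEvent and parse_clinical_row are hypothetical.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional


class ClinicalEvent(NamedTuple):
    eid: int
    data_provider: int
    event_date: Optional[date]
    read2: str
    read3: str
    value1: str
    value2: str
    value3: str


def parse_clinical_row(line):
    """Parse one tab-separated row of 8 columns into a ClinicalEvent."""
    # Strip only the line terminator so that trailing empty columns survive.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    eid, provider, raw_date, read2, read3, value1, value2, value3 = fields
    # Assumption: an empty date column means the event date is unknown.
    event_date = datetime.strptime(raw_date, "%Y%m%d").date() if raw_date else None
    return ClinicalEvent(int(eid), int(provider), event_date,
                         read2, read3, value1, value2, value3)


# Invented row, for illustration only.
event = parse_clinical_row("1234567\t3\t20050412\t246..\t\t120\t80\t\n")
print(event)
```

With roughly 400,000,000 rows across the six files, rows are best parsed one at a time while iterating over an open file rather than loaded into memory together.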
Encoding maps for the read2 and read3 columns may be found at:

The following files are present:

a355712357f968daa1b47d56a4d6d39b  set3a1.txt
3403f6b997fabbb761d08b2bc33b9a9d  set3a2a.txt
0f7a1bb0a20e4576fa4aeebb8394f94b  set3a3.txt
87446b78921c412b1417e514f91cc7e7  set3a4.txt
9d70c29ad9343528f6e4dd2a07a89bcc  set3b.txt
fd5a2a07e8b3b93c7b11ec01dbb45536  set3c.txt
58348146c9e59f8591f90985d92ed125  medrec.md5
The file medrec.md5 contains the above MD5s.

Genetic Records

This contains the SNP information produced by the UKB genotyping chip, with meta-data available from:

The dataset contains approximately 600,000 “rows” of 840,000 columns each. The data are supplied as a dictionary file plus a series of data files.

The dictionary file “gene_dic.dat” is a 7-column tab-separated file containing the columns:

  1. Affymetrix ID (integer)
  2. Chromosome ID (1-2 chars)
  3. Index (integer, zero-based) along data row
  4. SNP variant 0, e.g. "T T"
  5. SNP variant 1, e.g. "T T"
  6. SNP variant 2, e.g. "T T"
  7. SNP variant 3, e.g. "AGC G"
If there are fewer than 4 variants for a particular SNP then the unused entries are blank (i.e. empty quotes).
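A dictionary file of this shape can be loaded with a standard tab-separated reader. The following is a sketch under stated assumptions: it assumes the variant columns are wrapped in double quotes exactly as described, and it keys entries by chromosome and index because the index is described as a position along a data row. The names SnpEntry and read_dictionary are hypothetical, and the sample line is written out with explicit tabs.

```python
import csv
import io
from typing import NamedTuple, Tuple


class SnpEntry(NamedTuple):
    affy_id: int
    chromosome: str
    index: int
    variants: Tuple[str, ...]  # always 4 entries; unused ones are ""


def read_dictionary(handle):
    """Read 7-column tab-separated rows into {(chromosome, index): SnpEntry}."""
    entries = {}
    # quotechar='"' removes the surrounding quotes, so "" becomes an empty string.
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        if not row:
            continue
        if len(row) != 7:
            raise ValueError(f"expected 7 columns, found {len(row)}")
        entry = SnpEntry(int(row[0]), row[1], int(row[2]), tuple(row[3:7]))
        entries[(entry.chromosome, entry.index)] = entry
    return entries


# A single dictionary line held in memory instead of the real gene_dic.dat.
sample = io.StringIO('52233461\t21\t4\t"C C"\t"0 0"\t"T C"\t""\n')
dictionary = read_dictionary(sample)
print(dictionary[("21", 4)])
```

For the real file, the same function could be given open("gene_dic.dat", newline="") in place of the in-memory sample.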

The data are supplied as a series of 26 files named "rand_chr*.dat.gz", each containing the SNPs for a single chromosome across the whole cohort, accompanied by rand_chr.md5, which gives the MD5 checksums for the uncompressed files. Data are provided as plain ASCII text, with no spaces or tabs between the individual SNP values. This format corresponds to the internal format used by UK Biobank to service basket requests (typically someone asking for a few dozen SNPs). Each line has the format:

eeeeeee ABCD….Z
where eeeeeee is the 7-digit EID and A….Z are the index values for the SNP variants in the same order as specified in the dictionary file. To illustrate, the first 3 lines of rand_chr21.dat begin
1707540 312202120
7843592 302112201
5903945 312201202
In the first row, person 1707540 has value=0 for the 5th SNP (index=4). Referring to gene_dic.dat one finds the line:
52233461        21      4       "C C"   "0 0"   "T C"   ""
which means that for the SNP with AffyID 52233461, person 1707540 has genotype "C C".
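The lookup just described can be written as a short function. This is a minimal sketch with a hypothetical name, decode_genotype; it takes the four variant strings of the relevant dictionary line directly rather than reading gene_dic.dat, and assumes the EID and the run of index values are separated by whitespace as in the lines shown above.

```python
def decode_genotype(data_line, snp_index, variants):
    """Return (EID, genotype) for the SNP at snp_index on one data line.

    variants holds the four variant strings (variant 0 to variant 3) taken
    from the dictionary line whose index column equals snp_index.
    """
    eid_text, codes = data_line.split(maxsplit=1)
    # Each character is one SNP's variant number (0-3); snp_index is zero-based.
    variant_number = int(codes.strip()[snp_index])
    return int(eid_text), variants[variant_number]


# Variant columns of the dictionary line for AffyID 52233461 (chromosome 21, index 4).
variants_at_index_4 = ("C C", "0 0", "T C", "")

# Beginning of the first line of rand_chr21.dat, as quoted above.
eid, genotype = decode_genotype("1707540 312202120", 4, variants_at_index_4)
print(eid, genotype)  # prints: 1707540 C C
```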

The following files are present:

4693cf9eb7f91317d7daac37e988014f  gene_dic.dat
4d7b40fe2eb6c826202775ae41722bf5  rand_chr10.dat.gz
3727eeab271981f3da2896a931b04c31  rand_chr11.dat.gz
a9b24a033c4934ed43cfeb0d897183e0  rand_chr12.dat.gz
904fda8a5e2b1ef15c518b8e3beccdbb  rand_chr13.dat.gz
6d08650c317be1cbb8ff6ac8aed86de7  rand_chr14.dat.gz
6675d8d3f751db0c0c9d781b547ce17a  rand_chr15.dat.gz
bedda0edd3cffefa0520057fa3c1a428  rand_chr16.dat.gz
afe53d77480342b0a393a95ed76c5b32  rand_chr17.dat.gz
ffb67ae3a56a68344f502eb14b04847c  rand_chr18.dat.gz
5df0ee5dbe05e1c0f00c88fc99282bf1  rand_chr19.dat.gz
7ca58214640718632dc4a01429513f1d  rand_chr1.dat.gz
f40dc53eea55a7de31d85b63202c94a2  rand_chr20.dat.gz
0ddecef031166409db2c3ff3926c05a7  rand_chr21.dat.gz
ff5df3309294b36af9996ad32a4776e8  rand_chr22.dat.gz
58b4813f57abb674b07c0b7e41dbdab0  rand_chr2.dat.gz
071e91174af1cb4073fc8f1e8feab1e8  rand_chr3.dat.gz
657896f8a50a46ed974e744e1b94ffc7  rand_chr4.dat.gz
b916295d01dc7587ecb2947633a44af8  rand_chr5.dat.gz
d92673bd9762167bfae4001e058e8375  rand_chr6.dat.gz
69b0a4c003f21aa14e78d4e5de5fe5a9  rand_chr7.dat.gz
7bba0d96e922f3145896688067fa2161  rand_chr8.dat.gz
fba274d9c0ba39d02baa3a9bc568f974  rand_chr9.dat.gz
fbce95deada157bce69befd104d92ac7  rand_chrmt.dat.gz
b8e81f5ab174c60f08180145a4b0cf38  rand_chrx.dat.gz
712fcee16d4208b60c11fe94e9e01f64  rand_chrxy.dat.gz
70e527b49fff8333de13f2c2fe02c6cf  rand_chry.dat.gz
3bf65e4d0b943523fa8fcf7a9cb5cafb  genotype.md5
08f8cd13d36b65e701ec4fa56f5a6f29  rand_chr.md5
The file genotype.md5 contains the above MD5s.
The file rand_chr.md5 contains MD5s of the genotype files prior to their being gzipped.
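Since rand_chr.md5 refers to the uncompressed data, a file can be checked against it without first writing the decompressed copy to disk. The sketch below streams the decompressed bytes through the hash; the function name md5_of_gzip_payload is illustrative.

```python
import gzip
import hashlib


def md5_of_gzip_payload(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the decompressed contents of a .gz file."""
    digest = hashlib.md5()
    # gzip.open in "rb" mode yields the uncompressed bytes as they are read.
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result for, say, rand_chr21.dat.gz would be compared with the rand_chr.md5 entry for the uncompressed file, whereas the digest of the .gz file itself is compared with genotype.md5.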

Bulk Files

This is a collection of ~6,000,000 files which simulate the UK Biobank bulk file repository. The file contents, however, are unrelated to those found in the live system and are also much smaller, to facilitate download and handling in a reasonable time. They are provided primarily to allow system developers to practise handling and pseudonymising such a collection rather than for use with type-specific analysis pipelines. Files are named according to the standard UK Biobank download convention of:
The following files are present:
37f62129f1287949d10f2ba765053b4b  bulk.md5
The file bulk.md5 contains the above MD5s.
