UK Biobank Synthetic Dataset

The Synthetic Dataset has been created to allow large scale system testing using data which is comparable in size and constitution to the real dataset.

Values are presented attached to participant EIDs belonging to some fictional Application. Values were randomly generated and the dataset is presented without any warranty of correctness or accuracy. The datasets are therefore not internally consistent - for instance they may contain reports of prostate cancer linked to female participants, medical events after death or dates of disease without corresponding diagnoses.

Files for download are listed below alongside their MD5 checksums (as generated by the standard Linux md5sum utility). The checksums are also available as separate files.
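As an illustration of how a download might be checked against one of the checksum files, here is a minimal Python sketch. It assumes each line of a checksum file has the form "checksum filename", as in the listings on this page; the function names are hypothetical and not part of any UK Biobank tool.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(text):
    """Parse 'checksum filename' lines into a {filename: checksum} dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            checksum, name = parts
            # md5sum marks binary-mode entries with a leading '*' on the name.
            expected[name.lstrip("*").strip()] = checksum.lower()
    return expected


def verify_directory(checksum_file, directory="."):
    """Return {filename: True/False/None}; None means the file was not found."""
    expected = parse_checksum_lines(Path(checksum_file).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(directory) / name
        results[name] = md5_of_file(target) == checksum if target.exists() else None
    return results
```

For example, verify_directory("tabular.md5", "downloads") would report which of the tabular files in a (hypothetical) "downloads" directory match their listed checksums.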

Methods Summary

A more detailed summary of the methods used to construct the UK Biobank synthetic dataset is provided in this document.

Tabular Records

This corresponds to the information issued to researchers via the standard UK Biobank mechanism of creating a Showcase basket then downloading the extracted results. It consists of plain ASCII text files containing variables separated by tabs across columns and new-lines between rows. The first column of each row is the participant identifier (EID). The dataset contains approximately 27,000 columns by 600,000 rows. A header row will be included containing "EID" followed by the data column names using the convention of

 FieldID-InstanceID.ArrayID
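A header following the convention above can be split into its numeric parts with a few lines of Python. The sketch below assumes all three parts are non-negative integers and tolerates optional spaces around the separators; the names ColumnName, parse_column_name and parse_header are hypothetical, and the column names in the usage note are made up for illustration.

```python
import re
from typing import NamedTuple


class ColumnName(NamedTuple):
    field_id: int
    instance_id: int
    array_id: int


# Assumed pattern: three integers joined as FieldID-InstanceID.ArrayID.
_COLUMN_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*\.\s*(\d+)\s*$")


def parse_column_name(name):
    """Split one data column name into (field_id, instance_id, array_id)."""
    match = _COLUMN_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected column name: {name!r}")
    return ColumnName(*(int(group) for group in match.groups()))


def parse_header(header_line):
    """Parse a tab-separated header row whose first cell is the EID column."""
    cells = header_line.rstrip("\r\n").split("\t")
    if cells[0].strip().upper() != "EID":
        raise ValueError("first column is expected to be the EID")
    return [parse_column_name(cell) for cell in cells[1:]]
```

Calling parse_header on a line such as "EID", "31-0.0", "53-2.0" joined by tabs would yield two ColumnName tuples, one per data column.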

Because of its size, the dataset is delivered as a set of 23 files, each containing a non-overlapping subset of the whole. The split is by groups of fields (i.e. all persons are present in every file).
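Since every file carries the same participants but different fields, columns from several files can be recombined by joining on the EID. The following Python sketch shows one way this might be done for a small selection of columns. It assumes the EID is the first column of each file and does not assume that rows appear in the same order across files; the function names are hypothetical.

```python
import csv


def read_columns(path, wanted):
    """Return {eid: {column: value}} for the wanted columns present in one file."""
    rows = {}
    with open(path, newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data in the cells.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        keep = [(i, name) for i, name in enumerate(header) if i > 0 and name in wanted]
        for record in reader:
            if not record:
                continue  # skip blank lines
            rows[record[0]] = {name: record[i] for i, name in keep}
    return rows


def merge_by_eid(paths, wanted):
    """Combine the wanted columns from several files into one dict keyed by EID."""
    merged = {}
    for path in paths:
        for eid, values in read_columns(path, wanted).items():
            merged.setdefault(eid, {}).update(values)
    return merged
```

Holding the result in memory is only practical for a modest number of columns; for the full 27,000 columns a streaming or database-backed approach would be more appropriate.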

Additional information regarding the names, types and encoding meta-data related to the phenotype file is available from the Schemas section of the UK Biobank Showcase.

The following files are present:

a68feb44e037397bc3cb43a6c0c86ef9 41257_HES_SimDates.tsv
4ff448b195ad417c3ae1324312782c30 41260_HES_SimDates.tsv
46aced37adea430907b81b8370f4718b 41262_HES_SimDates.tsv
5fc75c1d4d221d4e8366d4ce7920e7f8 41263_HES_SimDates.tsv
60007421300548e3a03c317e3392e5d1 41280_HES_SimDates.tsv
3b5a706c475050c5a64ad4359d224309 41281_HES_SimDates.tsv
7592c86dbb8502ca0630a763aa85be47 41282_HES_SimDates.tsv
5c35335d9e91f1eb4c0dca92213f6cb9 41283_HES_SimDates.tsv
7e7ec9ba895eaf465cb766cddcf29a72 bulk_strings.tsv
44af5c5d7bf4c4a6ca8fcdaa5329c9ec dates_death.tsv
0f50afe342c6dca8c9a23ee41df0c8e3 datetime_fields_2.tsv
708dff0bf4989c50cad25f7ffd15623b datetime_fields.tsv
ff0689f3629da3cd46097199f59db826 fo_fields_trimmed.tsv
47e4214a945327914d5a82189cf0d560 integer_arrays_part1.tsv
1630aa738230ea4d5a28cf91f3c66f6d integer_arrays_part2.tsv
86944f36c7ea72b397e5740f7ee6453a integer_diet_quest_fields.tsv
d57178f580c9bd90ba7f33a9c371a894 integer_no_arrays.tsv
e0793fe5cb9e87f28f3af77cbd14e0e6 integer_other_quest_fields.tsv
7701c46303680aa3f06c4e85b2babf35 oaa_fields.tsv
f63dd2423b242f6c00ebee665190265b real_fields1.tsv
6a32a2d4341d8abedd31303ec25915e9 real_fields2.tsv
b327f5edbfc7693b1a53379f1bd899e7 string_fields1.tsv
3220e8fbdeffd86ebe56356b1d53fbae string_fields2.tsv
dc5fafa749070aafcff6f1b87d889836 tabular.md5

The file tabular.md5 contains the above MD5s.

Medical Records

This is plain ASCII text containing variables separated by tabs across columns and new-lines between rows. It corresponds to the gp_clinical table (as specified in 2019) accessible to researchers via the portal on the Showcase website. The data is split into 6 files which together contain approximately 400,000,000 rows of 8 columns each. The columns in the files are:

EID, 7-digit integer
data provider, integer
event date, date in ANSI yyyymmdd format
read2, char(8) coding
read3, char(20) coding
value1, char(1024) text
value2, char(1024) text
value3, char(1024) text

Encoding maps for the read2 and read3 columns may be found in Resource 592.
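The eight-column layout described above maps naturally onto a small record type. The Python sketch below parses one tab-separated line into such a record. It assumes that empty cells are possible and that an empty event date should become None; whether the files include a header row is not stated here, so the caller is assumed to skip one if present. The names ClinicalEvent and parse_clinical_line are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class ClinicalEvent:
    eid: int
    data_provider: int
    event_date: Optional[date]
    read2: str
    read3: str
    value1: str
    value2: str
    value3: str


def parse_event_date(text):
    """Convert a yyyymmdd string to a date; an empty string gives None (assumption)."""
    text = text.strip()
    if not text:
        return None
    return datetime.strptime(text, "%Y%m%d").date()


def parse_clinical_line(line):
    """Parse one tab-separated line of 8 columns into a ClinicalEvent."""
    cells = line.rstrip("\r\n").split("\t")
    if len(cells) != 8:
        raise ValueError(f"expected 8 columns, got {len(cells)}")
    return ClinicalEvent(
        eid=int(cells[0]),
        data_provider=int(cells[1]),
        event_date=parse_event_date(cells[2]),
        read2=cells[3],
        read3=cells[4],
        value1=cells[5],
        value2=cells[6],
        value3=cells[7],
    )
```

Because the values are randomly generated, a parser used for system testing should expect implausible dates and codes rather than treat them as errors.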

The following files are present:

a355712357f968daa1b47d56a4d6d39b set3a1.txt
3403f6b997fabbb761d08b2bc33b9a9d set3a2a.txt
0f7a1bb0a20e4576fa4aeebb8394f94b set3a3.txt
87446b78921c412b1417e514f91cc7e7 set3a4.txt
9d70c29ad9343528f6e4dd2a07a89bcc set3b.txt
fd5a2a07e8b3b93c7b11ec01dbb45536 set3c.txt
58348146c9e59f8591f90985d92ed125 medrec.md5

The file medrec.md5 contains the above MD5s.

Genetic Records

This contains the SNP information produced by the UKB genotyping chip, with meta-data available from Schema 15. The dataset contains approximately 600,000 "rows" of 840,000 columns each. The data are supplied as a dictionary file plus a series of data files.

The dictionary file "gene_dic.dat" is a 7-column tab separated file, containing the columns:

Affymetrix ID (integer)
Chromosome ID (1-2 chars)
Index (integer, zero-based) along data row
SNP variant 0, e.g. "T T"
SNP variant 1, e.g. "T T"
SNP variant 2, e.g. "T T"
SNP variant 3, e.g. "AGC G"

If there are fewer than 4 variants for a particular SNP then unused entries are blank (i.e. empty quotes).

The data are supplied as a series of 26 files named "rand_chr*.dat.gz", each containing the SNPs for a single chromosome across the whole cohort. They are accompanied by rand_chr.md5, which gives the MD5 checksums of the uncompressed files. Data are provided as plain ASCII text, with no spaces or tabs between the index values. This format corresponds to the internal format used by UK Biobank to service basket requests (typically someone asking for a few dozen SNPs). Each line has the format:

 eeeeeee ABCD...Z

where eeeeeee is the 7-digit EID and A...Z are the index values for the SNP variants in the same order as specified in the dictionary file. To illustrate, the first 3 lines of rand_chr21.dat begin

1707540 312202120
7843592 302112201
5903945 312201202

In the first row, person 1707540 has value=0 for the 5th SNP (index=4). Referring to gene_dic.dat one finds the line:

52233461 21 4 "C C" "0 0" "T C" ""

which means that for the SNP with AffyID 52233461, person 1707540 has genotype "C C".
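The lookup just described can be sketched in Python as follows. The sketch assumes that the index in gene_dic.dat counts positions within the data row of the corresponding chromosome file (as the chromosome 21 example suggests), that the dictionary has no header row, and that variants are quoted as shown. The function names are hypothetical.

```python
import csv


def load_dictionary(lines):
    """Map (chromosome, index) to (affy_id, [variant0, ..., variant3])."""
    entries = {}
    # The csv reader strips the surrounding quotes; '""' becomes an empty string.
    for row in csv.reader(lines, delimiter="\t", quotechar='"'):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        affy_id, chromosome, index = int(row[0]), row[1], int(row[2])
        entries[(chromosome, index)] = (affy_id, row[3:7])
    return entries


def decode_genotype(data_line, chromosome, index, dictionary):
    """Return (eid, affy_id, genotype) for one SNP position on one data line."""
    eid, codes = data_line.split()
    affy_id, variants = dictionary[(chromosome, index)]
    # Each character of 'codes' selects one of the four variant columns.
    return int(eid), affy_id, variants[int(codes[index])]


dictionary = load_dictionary(['52233461\t21\t4\t"C C"\t"0 0"\t"T C"\t""'])
print(decode_genotype("1707540 312202120", "21", 4, dictionary))
# prints (1707540, 52233461, 'C C')
```

In practice load_dictionary would be passed an open file handle for gene_dic.dat, and the data lines would be read from a decompressed rand_chr*.dat.gz file.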

The following files are present:

4693cf9eb7f91317d7daac37e988014f gene_dic.dat
4d7b40fe2eb6c826202775ae41722bf5 rand_chr10.dat.gz
3727eeab271981f3da2896a931b04c31 rand_chr11.dat.gz
a9b24a033c4934ed43cfeb0d897183e0 rand_chr12.dat.gz
904fda8a5e2b1ef15c518b8e3beccdbb rand_chr13.dat.gz
6d08650c317be1cbb8ff6ac8aed86de7 rand_chr14.dat.gz
6675d8d3f751db0c0c9d781b547ce17a rand_chr15.dat.gz
bedda0edd3cffefa0520057fa3c1a428 rand_chr16.dat.gz
afe53d77480342b0a393a95ed76c5b32 rand_chr17.dat.gz
ffb67ae3a56a68344f502eb14b04847c rand_chr18.dat.gz
5df0ee5dbe05e1c0f00c88fc99282bf1 rand_chr19.dat.gz
7ca58214640718632dc4a01429513f1d rand_chr1.dat.gz
f40dc53eea55a7de31d85b63202c94a2 rand_chr20.dat.gz
0ddecef031166409db2c3ff3926c05a7 rand_chr21.dat.gz
ff5df3309294b36af9996ad32a4776e8 rand_chr22.dat.gz
58b4813f57abb674b07c0b7e41dbdab0 rand_chr2.dat.gz
071e91174af1cb4073fc8f1e8feab1e8 rand_chr3.dat.gz
657896f8a50a46ed974e744e1b94ffc7 rand_chr4.dat.gz
b916295d01dc7587ecb2947633a44af8 rand_chr5.dat.gz
d92673bd9762167bfae4001e058e8375 rand_chr6.dat.gz
69b0a4c003f21aa14e78d4e5de5fe5a9 rand_chr7.dat.gz
7bba0d96e922f3145896688067fa2161 rand_chr8.dat.gz
fba274d9c0ba39d02baa3a9bc568f974 rand_chr9.dat.gz
fbce95deada157bce69befd104d92ac7 rand_chrmt.dat.gz
b8e81f5ab174c60f08180145a4b0cf38 rand_chrx.dat.gz
712fcee16d4208b60c11fe94e9e01f64 rand_chrxy.dat.gz
70e527b49fff8333de13f2c2fe02c6cf rand_chry.dat.gz
3bf65e4d0b943523fa8fcf7a9cb5cafb genotype.md5
08f8cd13d36b65e701ec4fa56f5a6f29 rand_chr.md5

The file genotype.md5 contains the above MD5s.
The file rand_chr.md5 contains MD5s of the genotype files prior to their being gzipped.
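Because rand_chr.md5 refers to the uncompressed content, checking against it does not require writing the decompressed files to disk. A minimal Python sketch, with a hypothetical function name, that hashes the decompressed stream directly:

```python
import gzip
import hashlib


def md5_of_gzip_contents(path, chunk_size=1 << 20):
    """Return the hex MD5 of the uncompressed contents of a .gz file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        # Read the decompressed stream in chunks so large files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result would be compared with the entry in rand_chr.md5, whereas an MD5 of the .gz file itself would be compared with the entry in genotype.md5.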

Bulk Files

This is a collection of ~6,000,000 files which simulate the UK Biobank bulk file repository. The file contents, however, are unrelated to those found in the live system and are also much smaller, to allow download and handling in a reasonable time. They are provided primarily to allow system developers to practise handling and pseudonymising such a collection rather than for use with type-specific analysis pipelines. Files are named according to the standard UK Biobank download convention of:

FieldID_InstanceID_ArrayID_EID.type
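When handling such a collection, it can be useful to split each file name back into its components. The Python sketch below follows the naming convention as stated above and assumes the four leading parts are integers; the names BulkFileName and parse_bulk_name, and the example file name in the usage note, are hypothetical.

```python
import os
import re
from typing import NamedTuple


class BulkFileName(NamedTuple):
    field_id: int
    instance_id: int
    array_id: int
    eid: int
    file_type: str


# Assumed pattern: FieldID_InstanceID_ArrayID_EID.type with integer parts.
_BULK_RE = re.compile(r"^(\d+)_(\d+)_(\d+)_(\d+)\.(.+)$")


def parse_bulk_name(path):
    """Split a bulk file name into its identifying components."""
    name = os.path.basename(path)
    match = _BULK_RE.match(name)
    if match is None:
        raise ValueError(f"not a bulk file name: {name!r}")
    field_id, instance_id, array_id, eid, file_type = match.groups()
    return BulkFileName(int(field_id), int(instance_id), int(array_id), int(eid), file_type)
```

For instance, parse_bulk_name("20158_2_0_1707540.dat") would return field 20158, instance 2, array 0, EID 1707540 and type "dat". A pseudonymisation step could then rebuild the name with a replacement EID.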

The following files are present:

22b0e0395489636e70d68e49c11f02e8 bulk_20158.zip
19fc94a0c71673fd43de461c087a84c3 bulk_20203.zip
56f75f157596329e5a8fd13f80486d95 bulk_20205.zip
4a50584a01788027c2b3d9070c11da15 bulk_20206.zip
4986c4b3544eb788fa19fe4f0e926c16 bulk_20220.zip
813add11bf33b8954559961aa04e64e5 bulk_20221.zip
e131f6265b472892b34d2fde660b0e6d bulk_20222.zip
5bbfd22d65626b3e063811f5ce654c19 bulk_20223.zip
bc8993e333a32458c8f5f67a21df0584 bulk_20224.zip
59c5fd760a1a9a2de3240c0a3ae3712d bulk_20225.zip
4f97913068281f4f44de785a101b3941 bulk_20226.zip
cd9b71e3f000b5f61b8c99678d0cff6f bulk_20227.zip
1ab8ad5e5f8b1782cec09e9abaacf8ca bulk_20249.zip
895af470d160a44fa57a5d1683f61356 bulk_20250.zip
fab0e60572f2d4bbe3a827e045e63efa bulk_20251.zip
51663fe25e4b657587683a2ba35ca255 bulk_20252.zip
4cb1c4d09b420df949f7e51d6eed3a9d bulk_20253.zip
b3219e5ae8defa2b000c85e5fb2fecf2 bulk_20254.zip
eb6a158a98c6edda134b9dd406ac98dc bulk_20259.zip
c4fc581515d09d9a7e371b230df602c2 bulk_20260.zip
46ee0b12f137908bceb1c22c31fd66dd bulk_21017.zip
67941f04ec3e061883820f547f124dfb bulk_22002.zip
02d2d0aeb29189e656b97268c1c7c5b8 bulk_23164.zip
7baebecb68a4b05f16bde7d592e8bc64 bulk_23184.zip
e01f74af130f72b950a1e3e2eae2e3e7 bulk_25747.zip
6c43c9e2c354388c130ceef57c188f34 bulk_25748.zip
7e2307bfea721001fcf228c727b51912 bulk_25749.zip
0a6fed624a693ef78894ce2757918705 bulk_25750.zip
e389ea73d386d6dbe6e2db4df90b5b3b bulk_25751.zip
bd5c23490b6d1b1f630d93c199cab97f bulk_25752.zip
90c911d989f040bd94c51ad6898026fd bulk_25753.zip
83f1f096a0841488b2d9e0c959d68ae1 bulk_25754.zip
ace2c18709c985d1a989898e885a3cfd bulk_25755.zip
e7c606f980c37273bd824048cd2a529d bulk_6025.zip
8ae7f063988737fbf5325d6781408458 bulk_90001.zip
9868a3be6429e8d9de690ca5fbce3233 bulk_90004.zip
37f62129f1287949d10f2ba765053b4b bulk.md5

The file bulk.md5 contains the above MD5s.