UK Biobank Synthetic Dataset

The Synthetic Dataset has been created to allow large-scale system testing using data which are comparable in size and constitution to the real dataset.

Values are presented attached to participant EIDs belonging to some fictional Application project. Values were randomly generated and the dataset is presented without any warranty of correctness or accuracy. The datasets are therefore not internally consistent – for instance they may contain reports of prostate cancer linked to female participants, medical events after death or dates of disease without corresponding diagnoses.

Files for download are listed below alongside their MD5 checksums (as generated by the standard Linux md5sum utility). The checksums are also available as separate files.
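As an illustration, a downloaded file can be checked against its listed checksum without the md5sum utility. The following is a minimal sketch, assuming the separate checksum files use the usual md5sum layout of a digest, whitespace, then a file name (as in the listings on this page). The function names and the "downloads" directory are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines ("<digest>  <filename>") into {filename: digest}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading "*" on the file name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(listing_path, directory="."):
    """Return {filename: True/False/None}; None marks a file not found locally."""
    expected = parse_md5_listing(Path(listing_path).read_text())
    results = {}
    for name, digest in expected.items():
        target = Path(directory) / name
        results[name] = md5_of_file(target) == digest if target.exists() else None
    return results
```

For example, verify("tabular.md5", "downloads") would report, for each file named in tabular.md5, whether the local copy in the hypothetical "downloads" directory matches.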

Tabular Records

This corresponds to the information issued to researchers via the standard UK Biobank mechanism of creating a Showcase basket then downloading the extracted results. It consists of plain ASCII text containing variables separated by tabs across columns and new-lines between rows. The first column of each row is the participant identifier (EID). The dataset contains approximately 27,000 columns by 600,000 rows. A header row is included containing “EID” followed by the data column names using the convention of

FieldID-InstanceID.ArrayID
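The naming convention above can be split back into its three numeric parts when processing a header row. This is a minimal sketch, assuming all three parts are non-negative integers. The parser accepts either a plain hyphen or an en dash, with optional spaces, since the exact separator characters in the files have not been checked here. The names ColumnName, parse_column_name and parse_header are illustrative.

```python
import re
from typing import NamedTuple


class ColumnName(NamedTuple):
    field_id: int
    instance_id: int
    array_id: int


# Accept a plain hyphen or an en dash, with optional surrounding spaces.
_COLUMN_RE = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*\.\s*(\d+)\s*$")


def parse_column_name(name):
    """Split one data column name into (field_id, instance_id, array_id)."""
    match = _COLUMN_RE.match(name)
    if match is None:
        raise ValueError(f"not a FieldID-InstanceID.ArrayID name: {name!r}")
    return ColumnName(*(int(part) for part in match.groups()))


def parse_header(header_line):
    """Split a tab-separated header row and parse every column after EID."""
    columns = header_line.rstrip("\r\n").split("\t")
    return [parse_column_name(column) for column in columns[1:]]
```

Grouping the parsed names by field_id is one way to locate all instances and array elements of a given field within a file.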

Because of the size of the dataset, the data are delivered as a set of 23 files, each containing a non-overlapping subset of the whole. The split is by groups of fields (i.e. all persons are present in every file).

Additional information regarding the names, types and encoding meta-data related to the phenotype file is available from the Schemas section of the public UK Biobank Showcase at: http://biobank.ndph.ox.ac.uk/showcase/schema.cgi

The following files are present:

a68feb44e037397bc3cb43a6c0c86ef9  41257_HES_SimDates.tsv
4ff448b195ad417c3ae1324312782c30  41260_HES_SimDates.tsv
46aced37adea430907b81b8370f4718b  41262_HES_SimDates.tsv
5fc75c1d4d221d4e8366d4ce7920e7f8  41263_HES_SimDates.tsv
60007421300548e3a03c317e3392e5d1  41280_HES_SimDates.tsv
3b5a706c475050c5a64ad4359d224309  41281_HES_SimDates.tsv
7592c86dbb8502ca0630a763aa85be47  41282_HES_SimDates.tsv
5c35335d9e91f1eb4c0dca92213f6cb9  41283_HES_SimDates.tsv
7e7ec9ba895eaf465cb766cddcf29a72  bulk_strings.tsv
44af5c5d7bf4c4a6ca8fcdaa5329c9ec  dates_death.tsv
0f50afe342c6dca8c9a23ee41df0c8e3  datetime_fields_2.tsv
708dff0bf4989c50cad25f7ffd15623b  datetime_fields.tsv
ff0689f3629da3cd46097199f59db826  fo_fields_trimmed.tsv
47e4214a945327914d5a82189cf0d560  integer_arrays_part1.tsv
1630aa738230ea4d5a28cf91f3c66f6d  integer_arrays_part2.tsv
86944f36c7ea72b397e5740f7ee6453a  integer_diet_quest_fields.tsv
d57178f580c9bd90ba7f33a9c371a894  integer_no_arrays.tsv
e0793fe5cb9e87f28f3af77cbd14e0e6  integer_other_quest_fields.tsv
7701c46303680aa3f06c4e85b2babf35  oaa_fields.tsv
f63dd2423b242f6c00ebee665190265b  real_fields1.tsv
6a32a2d4341d8abedd31303ec25915e9  real_fields2.tsv
b327f5edbfc7693b1a53379f1bd899e7  string_fields1.tsv
3220e8fbdeffd86ebe56356b1d53fbae  string_fields2.tsv
dc5fafa749070aafcff6f1b87d889836  tabular.md5

The file tabular.md5 contains the above MD5s.

Medical Records

This is plain ASCII text containing variables separated by tabs across columns and new-lines between rows. It corresponds to the gp_clinical table (as specified in 2019) accessible to researchers via the portal on the Showcase website. The data is split into 6 files which together contain approximately 400,000,000 rows of 8 columns each. The columns in the files are:

EID, 7-digit integer
data provider, integer
event date, date in ANSI yyyymmdd format
read2, char(8) coding
read3, char(20) coding
value1, char(1024) text
value2, char(1024) text
value3, char(1024) text
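The column layout above maps naturally onto a typed record. The sketch below is illustrative only: it assumes the columns appear in the order listed, that an empty event date should be read as missing, and that any header row (if the files have one) has already been skipped. The names ClinicalEvent and parse_clinical_line are not part of the dataset.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class ClinicalEvent:
    eid: int
    data_provider: int
    event_date: Optional[date]
    read2: str
    read3: str
    value1: str
    value2: str
    value3: str


def parse_clinical_line(line):
    """Parse one tab-separated row of the medical records files."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated columns, got {len(fields)}")
    eid, provider, raw_date, read2, read3, value1, value2, value3 = fields
    # Assumption: an empty date column means "no date recorded".
    event_date = datetime.strptime(raw_date, "%Y%m%d").date() if raw_date else None
    return ClinicalEvent(int(eid), int(provider), event_date,
                         read2, read3, value1, value2, value3)
```

With roughly 400,000,000 rows in total, it is advisable to iterate over each file line by line rather than loading it into memory at once.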

Encoding maps for the read2 and read3 columns may be found at: https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=592

The following files are present:

a355712357f968daa1b47d56a4d6d39b  set3a1.txt
3403f6b997fabbb761d08b2bc33b9a9d  set3a2a.txt
0f7a1bb0a20e4576fa4aeebb8394f94b  set3a3.txt
87446b78921c412b1417e514f91cc7e7  set3a4.txt
9d70c29ad9343528f6e4dd2a07a89bcc  set3b.txt
fd5a2a07e8b3b93c7b11ec01dbb45536  set3c.txt
58348146c9e59f8591f90985d92ed125  medrec.md5

The file medrec.md5 contains the above MD5s.

Genetic Records

This contains the SNP information produced by the UKB genotyping chip, with meta-data available from http://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=15. The dataset contains approximately 600,000 “rows” of 840,000 columns each. The data are supplied as a dictionary file plus a series of data files.

The dictionary file “gene_dic.dat” is a 7-column tab separated file, containing the columns:

Affymetrix ID (integer)
Chromosome ID (1-2 chars)
Index (integer, zero-based) along data row
SNP variant 0, e.g. "T T"
SNP variant 1, e.g. "T T"
SNP variant 2, e.g. "T T"
SNP variant 3, e.g. "AGC G"

If there are fewer than 4 variants for a particular SNP then unused entries are blank (i.e. empty quotes).

The data are supplied as a series of 26 files named "rand_chr*.dat.gz", each containing the SNPs for a single chromosome across the whole cohort, accompanied by rand_chr.md5 giving the MD5 checksums for the uncompressed files. Data are provided as plain ASCII text, without spaces/tabs between the SNP values. This format corresponds to the internal format used by UK Biobank to service basket requests (typically someone asking for a few dozen SNPs). Each line has the format:

eeeeeee ABCD….Z

where eeeeeee is the 7-digit EID and A….Z are the index values for the SNP variants in the same order as specified in the dictionary file. To illustrate, the first 3 lines of rand_chr21.dat begin

1707540 312202120
7843592 302112201
5903945 312201202

In the first row, person 1707540 has value=0 for the 5th SNP (index=4). Referring to gene_dic.dat one finds the line:

52233461        21      4       "C C"   "0 0"   "T C"   ""

which means that for the SNP with AffyID 52233461, person 1707540 has genotype "C C".
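The lookup described above can be sketched in a few lines. This is an illustration under stated assumptions: the dictionary index is taken to be relative to the data row of the named chromosome (as in the chromosome 21 example), dictionary values are taken to be quoted with double quotes, and any row whose first column is not an integer is skipped. The function names are illustrative.

```python
import csv
import io


def load_dictionary(handle):
    """Map (chromosome, index) -> (affy_id, [variant0, ..., variant3])."""
    lookup = {}
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        # Skip blank, malformed or non-data rows.
        if len(row) != 7 or not row[0].isdigit():
            continue
        affy_id, chromosome, index = int(row[0]), row[1], int(row[2])
        lookup[(chromosome, index)] = (affy_id, row[3:7])
    return lookup


def decode_genotype(data_line, chromosome, index, lookup):
    """Return (eid, affy_id, genotype) for one SNP on one data line."""
    eid, calls = data_line.split()
    affy_id, variants = lookup[(chromosome, index)]
    # Each character of the call string is an index into the variant columns.
    return int(eid), affy_id, variants[int(calls[index])]


# The example from the text, using the dictionary line quoted above.
dictionary_text = '52233461\t21\t4\t"C C"\t"0 0"\t"T C"\t""\n'
lookup = load_dictionary(io.StringIO(dictionary_text))
print(decode_genotype("1707540 312202120", "21", 4, lookup))
# prints (1707540, 52233461, 'C C')
```

In practice the dictionary would be read from gene_dic.dat and the data lines from the relevant uncompressed rand_chr file.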

The following files are present:

4693cf9eb7f91317d7daac37e988014f  gene_dic.dat
4d7b40fe2eb6c826202775ae41722bf5  rand_chr10.dat.gz
3727eeab271981f3da2896a931b04c31  rand_chr11.dat.gz
a9b24a033c4934ed43cfeb0d897183e0  rand_chr12.dat.gz
904fda8a5e2b1ef15c518b8e3beccdbb  rand_chr13.dat.gz
6d08650c317be1cbb8ff6ac8aed86de7  rand_chr14.dat.gz
6675d8d3f751db0c0c9d781b547ce17a  rand_chr15.dat.gz
bedda0edd3cffefa0520057fa3c1a428  rand_chr16.dat.gz
afe53d77480342b0a393a95ed76c5b32  rand_chr17.dat.gz
ffb67ae3a56a68344f502eb14b04847c  rand_chr18.dat.gz
5df0ee5dbe05e1c0f00c88fc99282bf1  rand_chr19.dat.gz
7ca58214640718632dc4a01429513f1d  rand_chr1.dat.gz
f40dc53eea55a7de31d85b63202c94a2  rand_chr20.dat.gz
0ddecef031166409db2c3ff3926c05a7  rand_chr21.dat.gz
ff5df3309294b36af9996ad32a4776e8  rand_chr22.dat.gz
58b4813f57abb674b07c0b7e41dbdab0  rand_chr2.dat.gz
071e91174af1cb4073fc8f1e8feab1e8  rand_chr3.dat.gz
657896f8a50a46ed974e744e1b94ffc7  rand_chr4.dat.gz
b916295d01dc7587ecb2947633a44af8  rand_chr5.dat.gz
d92673bd9762167bfae4001e058e8375  rand_chr6.dat.gz
69b0a4c003f21aa14e78d4e5de5fe5a9  rand_chr7.dat.gz
7bba0d96e922f3145896688067fa2161  rand_chr8.dat.gz
fba274d9c0ba39d02baa3a9bc568f974  rand_chr9.dat.gz
fbce95deada157bce69befd104d92ac7  rand_chrmt.dat.gz
b8e81f5ab174c60f08180145a4b0cf38  rand_chrx.dat.gz
712fcee16d4208b60c11fe94e9e01f64  rand_chrxy.dat.gz
70e527b49fff8333de13f2c2fe02c6cf  rand_chry.dat.gz
3bf65e4d0b943523fa8fcf7a9cb5cafb  genotype.md5
08f8cd13d36b65e701ec4fa56f5a6f29  rand_chr.md5

The file genotype.md5 contains the above MD5s.
The file rand_chr.md5 contains MD5s of the genotype files prior to their being gzipped.
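Because rand_chr.md5 refers to the uncompressed data, a genotype file can be checked against it without first writing the decompressed copy to disk. The following is a minimal sketch using streaming decompression. The function names are illustrative, and the expected digest is assumed to have been read from rand_chr.md5 by the caller.

```python
import gzip
import hashlib


def md5_of_gzip_contents(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the decompressed contents of a .gz file."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_uncompressed_md5(path, expected_digest):
    """Compare the decompressed contents against a digest from rand_chr.md5."""
    return md5_of_gzip_contents(path) == expected_digest.strip().lower()
```

The digests listed in genotype.md5, by contrast, apply to the files as downloaded, so they should be compared against the MD5 of the .gz files themselves.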

Bulk Files

This is a collection of ~6,000,000 files which simulate the UK Biobank bulk file repository. The file contents, however, are unrelated to those found in the live system and are also much smaller, so that they can be downloaded and handled in a reasonable time. They are provided primarily to allow system developers to practise handling and pseudonymising such a collection rather than for use with type-specific analysis pipelines. Files are named according to the standard UK Biobank download convention of:

 FieldID_InstanceID_ArrayID_EID.type
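File names following this convention can be decomposed, and rebuilt with a different EID when practising pseudonymisation. This is a minimal sketch, assuming the four leading parts are integers and that everything after the first dot is the file type. The names BulkFileName, parse_bulk_name and with_eid are illustrative, and any mapping from old to new EIDs would have to be supplied by the caller.

```python
import re
from typing import NamedTuple


class BulkFileName(NamedTuple):
    field_id: int
    instance_id: int
    array_id: int
    eid: int
    file_type: str


_BULK_RE = re.compile(r"^(\d+)_(\d+)_(\d+)_(\d+)\.(.+)$")


def parse_bulk_name(filename):
    """Split FieldID_InstanceID_ArrayID_EID.type into its components."""
    match = _BULK_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected bulk file name: {filename!r}")
    field_id, instance_id, array_id, eid, file_type = match.groups()
    return BulkFileName(int(field_id), int(instance_id), int(array_id),
                        int(eid), file_type)


def with_eid(filename, new_eid):
    """Return the same file name with the EID component replaced."""
    parts = parse_bulk_name(filename)
    return (f"{parts.field_id}_{parts.instance_id}_{parts.array_id}"
            f"_{new_eid}.{parts.file_type}")
```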

The following files are present:

22b0e0395489636e70d68e49c11f02e8  bulk_20158.zip
19fc94a0c71673fd43de461c087a84c3  bulk_20203.zip
56f75f157596329e5a8fd13f80486d95  bulk_20205.zip
4a50584a01788027c2b3d9070c11da15  bulk_20206.zip
4986c4b3544eb788fa19fe4f0e926c16  bulk_20220.zip
813add11bf33b8954559961aa04e64e5  bulk_20221.zip
e131f6265b472892b34d2fde660b0e6d  bulk_20222.zip
5bbfd22d65626b3e063811f5ce654c19  bulk_20223.zip
bc8993e333a32458c8f5f67a21df0584  bulk_20224.zip
59c5fd760a1a9a2de3240c0a3ae3712d  bulk_20225.zip
4f97913068281f4f44de785a101b3941  bulk_20226.zip
cd9b71e3f000b5f61b8c99678d0cff6f  bulk_20227.zip
1ab8ad5e5f8b1782cec09e9abaacf8ca  bulk_20249.zip
895af470d160a44fa57a5d1683f61356  bulk_20250.zip
fab0e60572f2d4bbe3a827e045e63efa  bulk_20251.zip
51663fe25e4b657587683a2ba35ca255  bulk_20252.zip
4cb1c4d09b420df949f7e51d6eed3a9d  bulk_20253.zip
b3219e5ae8defa2b000c85e5fb2fecf2  bulk_20254.zip
eb6a158a98c6edda134b9dd406ac98dc  bulk_20259.zip
c4fc581515d09d9a7e371b230df602c2  bulk_20260.zip
46ee0b12f137908bceb1c22c31fd66dd  bulk_21017.zip
67941f04ec3e061883820f547f124dfb  bulk_22002.zip
02d2d0aeb29189e656b97268c1c7c5b8  bulk_23164.zip
7baebecb68a4b05f16bde7d592e8bc64  bulk_23184.zip
e01f74af130f72b950a1e3e2eae2e3e7  bulk_25747.zip
6c43c9e2c354388c130ceef57c188f34  bulk_25748.zip
7e2307bfea721001fcf228c727b51912  bulk_25749.zip
0a6fed624a693ef78894ce2757918705  bulk_25750.zip
e389ea73d386d6dbe6e2db4df90b5b3b  bulk_25751.zip
bd5c23490b6d1b1f630d93c199cab97f  bulk_25752.zip
90c911d989f040bd94c51ad6898026fd  bulk_25753.zip
83f1f096a0841488b2d9e0c959d68ae1  bulk_25754.zip
ace2c18709c985d1a989898e885a3cfd  bulk_25755.zip
e7c606f980c37273bd824048cd2a529d  bulk_6025.zip
8ae7f063988737fbf5325d6781408458  bulk_90001.zip
9868a3be6429e8d9de690ca5fbce3233  bulk_90004.zip
37f62129f1287949d10f2ba765053b4b  bulk.md5

The file bulk.md5 contains the above MD5s.
