UK Biobank Synthetic Dataset
The Synthetic Dataset has been created to allow large scale system testing
using data which is comparable in size and constitution to the real dataset.
Values are presented attached to participant EIDs belonging to some fictional
Application project.
Values were randomly generated and the dataset is presented without any warranty of
correctness or accuracy.
The datasets are therefore not internally consistent – for
instance they may contain reports of prostate cancer linked to female participants,
medical events after death or dates of disease without corresponding diagnoses.
Files for download are listed below alongside the MD5 checksum (as
generated by the standard Linux md5sum utility). The checksums are also
available as separate files.
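A downloaded file can be checked against its listed checksum without the md5sum utility. The following Python code is a minimal sketch, assuming the checksum files use the usual md5sum layout of one "checksum filename" pair per line; the function names md5_of_file, parse_md5_manifest and verify are illustrative and not part of any UK Biobank tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_manifest(text):
    """Parse md5sum-style lines ("<checksum> <filename>") into a dict.

    Assumption: file names contain no whitespace, as in the lists here.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            checksum, name = parts
            # md5sum marks binary mode with a leading "*" on the name.
            expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify(manifest_path, data_dir="."):
    """Yield (filename, matches) for every file named in the manifest."""
    expected = parse_md5_manifest(Path(manifest_path).read_text())
    for name, checksum in expected.items():
        yield name, md5_of_file(Path(data_dir) / name) == checksum
```

For example, list(verify("medrec.md5")) would report, for each listed file in the current directory, whether its checksum matches.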
Medical Records
This is plain ASCII text containing variables separated by tabs across columns and
new-lines between rows. It corresponds to the gp_clinical table (as specified in 2019)
accessible to researchers via the portal on the Showcase website. The data is split into
6 files which together contain approximately 400,000,000 rows of 8 columns each. The
columns in the files are:
- EID, 7-digit integer
- data provider, integer
- event date, date in ANSI yyyymmdd format
- read2, char(8) coding
- read3, char(20) coding
- value1, char(1024) text
- value2, char(1024) text
- value3, char(1024) text
Encoding maps for the read2 and read3 columns may be found at:
https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=592
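The eight columns listed above map naturally onto a small record type. The following Python code is a sketch only: it assumes the files carry no header row, that an empty event date is possible and should become None, and that the three value columns are best kept as raw text. The ClinicalEvent class and parse_clinical_line function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class ClinicalEvent:
    eid: int
    data_provider: int
    event_date: Optional[date]
    read2: str
    read3: str
    value1: str
    value2: str
    value3: str


def parse_clinical_line(line):
    """Split one tab-separated row into its 8 columns and convert types."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    eid, provider, raw_date, read2, read3, value1, value2, value3 = fields
    # Dates are given as yyyymmdd; an empty field is treated as missing.
    event_date = datetime.strptime(raw_date, "%Y%m%d").date() if raw_date else None
    return ClinicalEvent(int(eid), int(provider), event_date,
                         read2, read3, value1, value2, value3)


# A made-up row, not taken from the dataset.
example = "1234567\t3\t20050117\tABCDE\t\t120\t80\t\n"
print(parse_clinical_line(example))
```

Because the files together hold roughly 400,000,000 rows, it is likely preferable to iterate over an open file line by line rather than read a whole file into memory.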
The following files are present:
a355712357f968daa1b47d56a4d6d39b set3a1.txt
3403f6b997fabbb761d08b2bc33b9a9d set3a2a.txt
0f7a1bb0a20e4576fa4aeebb8394f94b set3a3.txt
87446b78921c412b1417e514f91cc7e7 set3a4.txt
9d70c29ad9343528f6e4dd2a07a89bcc set3b.txt
fd5a2a07e8b3b93c7b11ec01dbb45536 set3c.txt
58348146c9e59f8591f90985d92ed125 medrec.md5
The file medrec.md5 contains the above MD5s.
Genetic Records
This contains the SNP information produced by the UKB genotyping chip, with meta-data available from
http://biobank.ndph.ox.ac.uk/showcase/schema.cgi?id=15.
The dataset contains approximately 600,000 “rows” of 840,000 columns each. The data are
supplied as a dictionary file plus a series of data files.
The dictionary file “gene_dic.dat” is a 7-column tab separated file, containing the columns:
- Affymetrix ID (integer)
- Chromosome ID (1-2 chars)
- Index (integer, zero-based) along data row
- SNP variant 0, e.g. "T T"
- SNP variant 1, e.g. "T T"
- SNP variant 2, e.g. "T T"
- SNP variant 3, e.g. "AGC G"
If there are fewer than 4 variants for a particular SNP then unused entries are blank (i.e. empty quotes).
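The dictionary layout described above can be read with Python's standard csv module. This is a sketch under stated assumptions: the variant columns are wrapped in literal double quotes in the file, there is no header row, and the SnpEntry type and read_dictionary function are names invented here for illustration.

```python
import csv
import io
from typing import NamedTuple, Tuple


class SnpEntry(NamedTuple):
    affy_id: int
    chromosome: str
    index: int                 # zero-based position along a data row
    variants: Tuple[str, ...]  # variants 0..3; unused ones are ""


def read_dictionary(handle):
    """Read 7-column tab-separated dictionary rows from a text handle."""
    reader = csv.reader(handle, delimiter="\t", quotechar='"')
    entries = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        if len(row) != 7:
            raise ValueError(f"expected 7 columns, got {len(row)}")
        entries.append(
            SnpEntry(int(row[0]), row[1], int(row[2]), tuple(row[3:7]))
        )
    return entries


# One tab-separated line in the layout described above.
sample = io.StringIO('52233461\t21\t4\t"C C"\t"0 0"\t"T C"\t""\n')
print(read_dictionary(sample))
```

When reading the real file, opening it with open("gene_dic.dat", newline="") follows the csv module's recommendation for file handles.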
The data are supplied as a series of 26 files named "rand_chr*.dat.gz", each containing the SNPs for a single chromosome across the whole cohort and
accompanied by rand_chr.md5 giving the MD5 checksums for the uncompressed files. Data are provided as plain ASCII text, without spaces/tabs between the SNP values.
This format
corresponds to the internal format used by UK Biobank to service basket requests (typically someone asking for a few
dozen SNPs). Each line has the format:
eeeeeee ABCD….Z
where
eeeeeee is the 7-digit EID and A….Z are the index values for the SNP variants in the same order as specified in the dictionary file.
To illustrate, the first 3 lines of rand_chr21.dat begin
1707540 312202120
7843592 302112201
5903945 312201202
In the first row, person 1707540 has value=0 for the 5th SNP (index=4). Referring to gene_dic.dat one finds the line:
52233461 21 4 "C C" "0 0" "T C" ""
which means that for the SNP with AffyID 52233461, person 1707540 has genotype "C C".
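The lookup in this worked example can be expressed in a few lines of Python. The sketch below uses the first data line and the dictionary line quoted above; the helper names split_data_line and genotype_at are illustrative, and the code assumes each index value is a single digit from 0 to 3 and that a space separates the EID from the values, as in the sample lines.

```python
def split_data_line(line):
    """Split "eeeeeee ABCD...Z" into the integer EID and the value string."""
    eid_text, calls = line.split()
    return int(eid_text), calls


def genotype_at(calls, index, variants):
    """Return the variant text selected by the digit at position `index`."""
    code = int(calls[index])
    return variants[code]


# Beginning of the first line of rand_chr21.dat, as quoted above.
eid, calls = split_data_line("1707540 312202120")

# Variants 0..3 from the dictionary line for AffyID 52233461 (index 4).
variants = ("C C", "0 0", "T C", "")

print(eid, genotype_at(calls, 4, variants))  # prints: 1707540 C C
```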
The following files are present:
4693cf9eb7f91317d7daac37e988014f gene_dic.dat
4d7b40fe2eb6c826202775ae41722bf5 rand_chr10.dat.gz
3727eeab271981f3da2896a931b04c31 rand_chr11.dat.gz
a9b24a033c4934ed43cfeb0d897183e0 rand_chr12.dat.gz
904fda8a5e2b1ef15c518b8e3beccdbb rand_chr13.dat.gz
6d08650c317be1cbb8ff6ac8aed86de7 rand_chr14.dat.gz
6675d8d3f751db0c0c9d781b547ce17a rand_chr15.dat.gz
bedda0edd3cffefa0520057fa3c1a428 rand_chr16.dat.gz
afe53d77480342b0a393a95ed76c5b32 rand_chr17.dat.gz
ffb67ae3a56a68344f502eb14b04847c rand_chr18.dat.gz
5df0ee5dbe05e1c0f00c88fc99282bf1 rand_chr19.dat.gz
7ca58214640718632dc4a01429513f1d rand_chr1.dat.gz
f40dc53eea55a7de31d85b63202c94a2 rand_chr20.dat.gz
0ddecef031166409db2c3ff3926c05a7 rand_chr21.dat.gz
ff5df3309294b36af9996ad32a4776e8 rand_chr22.dat.gz
58b4813f57abb674b07c0b7e41dbdab0 rand_chr2.dat.gz
071e91174af1cb4073fc8f1e8feab1e8 rand_chr3.dat.gz
657896f8a50a46ed974e744e1b94ffc7 rand_chr4.dat.gz
b916295d01dc7587ecb2947633a44af8 rand_chr5.dat.gz
d92673bd9762167bfae4001e058e8375 rand_chr6.dat.gz
69b0a4c003f21aa14e78d4e5de5fe5a9 rand_chr7.dat.gz
7bba0d96e922f3145896688067fa2161 rand_chr8.dat.gz
fba274d9c0ba39d02baa3a9bc568f974 rand_chr9.dat.gz
fbce95deada157bce69befd104d92ac7 rand_chrmt.dat.gz
b8e81f5ab174c60f08180145a4b0cf38 rand_chrx.dat.gz
712fcee16d4208b60c11fe94e9e01f64 rand_chrxy.dat.gz
70e527b49fff8333de13f2c2fe02c6cf rand_chry.dat.gz
3bf65e4d0b943523fa8fcf7a9cb5cafb genotype.md5
08f8cd13d36b65e701ec4fa56f5a6f29 rand_chr.md5
The file genotype.md5 contains the above MD5s.
The file rand_chr.md5 contains MD5s of the genotype files prior to their being gzipped.
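Because rand_chr.md5 refers to the uncompressed data, a genotype file can be checked against it without first writing the decompressed copy to disk. The following Python code is a minimal sketch of that idea; md5_of_gzip_content is a hypothetical helper name, and how the file names are recorded inside rand_chr.md5 is not assumed here.

```python
import gzip
import hashlib


def md5_of_gzip_content(path, chunk_size=1 << 20):
    """Return the hex MD5 of the decompressed content of a .gz file.

    The data is streamed in chunks, so memory use stays small even
    for very large files.
    """
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result for, say, rand_chr21.dat.gz would then be compared with the entry for the corresponding uncompressed file in rand_chr.md5, while the checksum of the .gz file itself is compared with genotype.md5.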