About
This project aims to develop computational tools for understanding how genes and the environment influence the risk of disease. Answering this question requires large datasets, but the computational tools needed to analyse data at that scale are not yet available. We will develop such tools and demonstrate their use by studying how the environment and genomes of UK Biobank participants shape their risk of disease. Only by understanding the complex interplay of genetic and environmental risk factors will we be able to develop medicines targeted to the subgroups of individuals most likely to benefit, and to guide public health interventions.
Genome-wide association studies have been instrumental in identifying genes that determine disease risk. However, many genes remain to be identified, because genotyping arrays assay only a small proportion of genetic variation. To overcome this problem, the full DNA of UK Biobank participants will now be examined, that is, sequenced. This raises two scientific challenges: (1) the number of statistical tests that scientists perform will increase massively, and (2) the volume of data will make the current model, in which researchers download the data to their own institutions, impossible (downloading the data for analysis would take years). The model will therefore change: analysis tools will move to a common informatics platform where researchers can run their analyses. The tool will also address a major issue in research, reproducibility. It will store the metadata of the data used, the order in which the data were selected, and the statistical model used, so that any other researcher (or the same researcher) can reproduce exactly the same results (allowing for participants' withdrawals).
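As an illustration of the reproducibility record described above, the following is a minimal sketch in Python, not the project's actual design. The class name `AnalysisRecord`, its fields, and the placeholder dataset label are all assumptions made for this example. It stores the three items mentioned in the text (data metadata, the ordered selection steps, and the statistical model) and derives a fingerprint that two researchers could compare.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisRecord:
    """Hypothetical provenance record for a single analysis run."""

    dataset_version: str                                   # metadata of the data used
    selection_steps: list = field(default_factory=list)    # order in which data were selected
    model: str = ""                                        # statistical model used

    def add_step(self, description):
        """Append one selection step; the order of calls is preserved."""
        self.selection_steps.append(description)

    def fingerprint(self):
        """Return a SHA-256 digest of the record's canonical JSON form."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_record(steps):
    """Build a record from an ordered list of step descriptions (illustrative values)."""
    record = AnalysisRecord(dataset_version="release-placeholder")
    for step in steps:
        record.add_step(step)
    record.model = "logistic regression: disease ~ variant + smoking"
    return record


steps = ["keep participants with sequence data",
         "keep participants with recorded smoking status"]
original = build_record(steps)
replay = build_record(steps)
reordered = build_record(list(reversed(steps)))

# The same steps in the same order give the same fingerprint;
# changing the order of selection changes it.
print(original.fingerprint() == replay.fingerprint())
print(original.fingerprint() == reordered.fingerprint())
```

Matching fingerprints only show that two runs were specified identically; reproducing the results would still depend on the underlying data being unchanged.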
The tool will also allow sub-setting of the data. Researchers are often interested in stratified analyses (for instance, of post-menopausal females), but extracting the data and keeping track of all the files is cumbersome and prone to error. The tool will streamline and rationalise this type of analysis.
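The kind of sub-setting described above can be sketched as a single declarative filter, so that a stratum is defined by its criteria rather than by a collection of extracted files. This is a toy Python example under stated assumptions: the function `select_stratum`, the field names, and the four participant records are invented for illustration and do not reflect real UK Biobank fields or data.

```python
def select_stratum(participants, **criteria):
    """Return the IDs of participants matching every criterion, in input order."""
    return [
        person["id"]
        for person in participants
        if all(person.get(key) == value for key, value in criteria.items())
    ]


# Invented toy records; field names are hypothetical.
participants = [
    {"id": "P1", "sex": "female", "post_menopausal": True},
    {"id": "P2", "sex": "female", "post_menopausal": False},
    {"id": "P3", "sex": "male", "post_menopausal": None},
    {"id": "P4", "sex": "female", "post_menopausal": True},
]

# The stratum is fully described by the criteria passed here.
stratum = select_stratum(participants, sex="female", post_menopausal=True)
print(stratum)  # prints ['P1', 'P4']
```

Because the stratum is described by its criteria alone, the same definition can be stored and re-applied later without tracking intermediate files.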
Finally, the tool will support standard epidemiological analyses for studying environmental risk factors, such as logistic regression and Cox regression, among others.