There are four main types of data in the Research Environment. These are:
- Clinical and phenotype data for each participant
- Genomic data for each participant from our sequencing provider
- Genomic and associated data from the Genomics England bioinformatic pipelines
- Publicly available genomic datasets and cohorts
Each set of data is described in a data release; more information can be found on the Data Releases page.
At Genomics England, identifiable information is stripped from the data, and each record is associated with the participant's participant_id so that all of a participant's clinical and genomic data can be linked.
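As a minimal sketch of what linking on participant_id can look like once you have both kinds of record in hand, the Python below attaches genome file paths to clinical rows. Apart from participant_id, every name here (the function, the other field names, the IDs and paths) is made up for illustration and does not reflect the actual tables or file layout.

```python
from collections import defaultdict


def link_by_participant(clinical_rows, genome_rows):
    """Attach each participant's genome paths to their clinical record."""
    paths_by_id = defaultdict(list)
    for row in genome_rows:
        paths_by_id[row["participant_id"]].append(row["genome_path"])
    # Participants with no genome record get an empty list.
    return [
        {**row, "genome_paths": paths_by_id.get(row["participant_id"], [])}
        for row in clinical_rows
    ]


# Made-up example records; field names other than participant_id are assumptions.
clinical = [
    {"participant_id": "P0001", "recruited_disease": "example disease A"},
    {"participant_id": "P0002", "recruited_disease": "example disease B"},
]
genome_files = [
    {"participant_id": "P0001", "genome_path": "/example/path/P0001.bam"},
    {"participant_id": "P0001", "genome_path": "/example/path/P0001.vcf.gz"},
]

linked = link_by_participant(clinical, genome_files)
for record in linked:
    print(record["participant_id"], record["genome_paths"])
```

Because the key is the de-identified participant_id, the same approach works for any pair of tables that carry that column.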
Clinical and phenotype data
The clinical and phenotype data are stored in a data management application called LabKey, which is accessible from the Research Environment desktop. Clinical and phenotype data are sourced from the Genomic Medicine Centres (GMCs) according to set data models that specify the variables and their data types. Not all variables are compulsory, and some contain personally identifiable data, so they are not present in the de-identified data within the Research Environment. Participant phenotypes such as age, sex, ethnicity, pedigree, recruited disease, associated HPO terms, and tumour categories can be analysed using the LabKey application.
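If you export participant phenotypes from LabKey for further analysis, selecting a cohort usually comes down to filtering rows on a few variables. The sketch below assumes the export has already been loaded into a list of dictionaries; the column names (sex, hpo_terms), the value formats, and the participant IDs are hypothetical and may not match the real data model.

```python
def select_participants(rows, sex=None, hpo_term=None):
    """Return participant_ids matching the optional sex and HPO term filters."""
    selected = []
    for row in rows:
        if sex is not None and row.get("sex") != sex:
            continue
        if hpo_term is not None and hpo_term not in row.get("hpo_terms", []):
            continue
        selected.append(row["participant_id"])
    return selected


# Made-up rows; column names and value formats are assumptions.
phenotype_rows = [
    {"participant_id": "P0001", "sex": "Female", "hpo_terms": ["HP:0001250"]},
    {"participant_id": "P0002", "sex": "Male", "hpo_terms": ["HP:0001250", "HP:0000252"]},
    {"participant_id": "P0003", "sex": "Female", "hpo_terms": []},
]

with_seizures = select_participants(phenotype_rows, hpo_term="HP:0001250")
female_with_seizures = select_participants(
    phenotype_rows, sex="Female", hpo_term="HP:0001250"
)
print(with_seizures)
print(female_with_seizures)
```

Leaving a filter as None skips it, so the same function covers single-variable and combined selections.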
Genomic data
The genomic data from our sequencing provider are supplied as genome deliveries, each of which includes BAMs and VCFs for a participant. The genomic data are accessed through the file system, where each genome delivery represents a unique sequenced genome. To browse them, click the Home icon on the Research Environment desktop, open the 'genomes' folder, then the 'by_date' folder. You will see all available genomes organised by date of delivery.
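The same folders can be browsed from a script instead of the desktop. The sketch below assumes that 'by_date' contains one subfolder per delivery date, each holding one folder per genome delivery; that layout, the function names, and the file suffixes checked are assumptions for illustration, so adjust them to what you actually see on the file system.

```python
from pathlib import Path


def list_deliveries(by_date_dir):
    """Map each date folder name to the sorted delivery folder names inside it."""
    root = Path(by_date_dir)
    deliveries = {}
    for date_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        deliveries[date_dir.name] = sorted(
            child.name for child in date_dir.iterdir() if child.is_dir()
        )
    return deliveries


def find_bams_and_vcfs(delivery_dir):
    """Recursively collect BAM and VCF files under one delivery folder."""
    suffixes = (".bam", ".vcf", ".vcf.gz")
    return sorted(
        p for p in Path(delivery_dir).rglob("*")
        if p.is_file() and p.name.endswith(suffixes)
    )
```

You would call list_deliveries with the full path to your own 'by_date' folder, then pass one of the delivery folders it reports to find_bams_and_vcfs.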
Genomics England data
A subset of the genomic and associated data from the Genomics England bioinformatic pipelines is also provided through the file system. These data include files that were necessary for genome interpretation, such as the family joint-called VCFs, tiered variants, aggregated variant calls, and internal allele frequency files. You can access these files by navigating to the Home icon on the Research Environment desktop and selecting the 'gel_data_resources' folder.
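To get a quick overview of what a resource folder such as 'gel_data_resources' contains, you can count its files by type. This is only a sketch: the suffix list is illustrative, the function name is made up, and walking a very large folder tree may take some time.

```python
import os
from collections import Counter

# Illustrative suffixes only; extend this to the file types you care about.
KNOWN_SUFFIXES = (".vcf.gz", ".vcf", ".bam", ".tsv", ".txt")


def summarise_file_types(folder, suffixes=KNOWN_SUFFIXES):
    """Count files under a folder, grouped by the first matching suffix."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            label = next((s for s in suffixes if name.endswith(s)), "other")
            counts[label] += 1
    return counts
```

Calling summarise_file_types with the path to a folder returns a Counter keyed by suffix, with unrecognised files grouped under "other".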
Publicly available data
The publicly available genomic datasets and cohorts include the 1000 Genomes data, reference genomes for both the GRCh37 and GRCh38 assemblies, BLAST databases, CADD databases, and many others. These can be accessed through our file system by navigating to the Home icon on the Research Environment desktop and selecting the 'public_data_resources' folder. You can request additional publicly available data at any time by contacting the Service Desk.
See how these files are organised in this video tutorial: