Overview and access
Clinical and phenotypic data for de-identified participants in the Research Environment are stored in a desktop application called LabKey. LabKey is a software suite that researchers can use to integrate, analyse, and share biomedical research data. The platform provides a secure data repository that supports web-based querying, reporting, and collaboration across a range of data sources. To access LabKey, log into the Research Environment and double-click the LabKey desktop application. You can then log into LabKey with your Research Environment credentials by clicking 'Sign In' at the top right-hand side of the LabKey application.
The clinical and phenotypic data housed and presented in LabKey are derived from a number of sources which can be broken down into the following categories:
|Primary clinical data|These data are captured and submitted by the Genomic Medicine Centres at participant recruitment and follow-up. They include demographic data, disease characteristics, and rare disease and cancer-specific clinical and phenotypic data such as recruited disease, family pedigree, laboratory blood tests, HPO terms, and tumour morphology and topography.|
|Secondary clinical data|These data are derived from NHS Digital and Public Health England and comprise additional clinical and phenotypic data provided to Genomics England by other data collectors. They include Hospital Episode Statistics, Diagnostic Imaging, Patient Reported Outcome Measures, Mortality, and Systemic Anti-Cancer Therapy.|
|Sequencing and sample-level data|These data include the plated-sample data and related QC data associated with the laboratory sample submitted.|
|Genomics England Bioinformatics data|These data comprise outputs from the Genomics England Bioinformatics pipeline such as results from the tiering pipeline, exit questionnaire data, and paths to the genomic data for each participant (BAMs, VCFs, and other metadata).|
This page only covers how to use the LabKey application. To understand the data held in LabKey, please visit the pages Clinical and Phenotype data and Main Programme Data Releases.
Upon login, you will be presented with the Projects page. Each folder on this page represents a different LabKey Project, and the folders you see depend on your level of access. If you are a member of GeCIP or the Discovery Forum, you will be able to see the Project folders described below.
|LabKey Project Name|Description|
|MAIN PROGRAMME PRE-RELEASE|This folder contains the June 2017 "pre-release" of 1,207 participants. These data were released to the "early on-boarding" GeCIP group for use in testing the Research Environment. Using these data is no longer advised, as they are now included in the Main Programme folder.|
| |This folder contains the Main Programme data release. Detailed information on the Main Programme data release can be found under: Main Programme Data Releases. Researchers should use the Main Programme Project as it comprises the largest, most up-to-date, and most comprehensive dataset.|
| |This folder contains summary statistics for the data in the Main Programme data release. These data are displayed graphically so that the disease cohorts available in the Main Programme dataset are easy to visualise. It is useful for browsing high-level information on each of the Main Programme data releases.|
| |This folder contains a randomised sub-sample of the Main Programme data release. It is used for demo purposes only and should not be used by researchers.|
The folders and tables shown in this guide may differ from what you see in LabKey, owing to either your level of access or subsequent data releases.
Data release versions
The Main Programme dataset is typically updated every three months with additional data as we receive it. Each release is given a version number and a date. Upon selecting the Main Programme Project folder, you will see the list of available Main Programme data releases with their version numbers and release dates; each release appears as a separate sub-folder. The Main Programme Data Releases page describes the content of each release along with any changes between releases. We recommend always using the latest Main Programme data release as it comprises the richest dataset.
Upon selecting the Main Programme release version of choice, you will be taken to the Data Views page. This page lists the tables available in the chosen Project. The tables are organised by category: those that are common to all participants, those that are specific to participants from either the rare disease or cancer programme, those that are part of the secondary clinical data, and those that are part of the quick-view tables. Again, we will not describe the data in full here, but each Main Programme data release is accompanied by a Release Note and Data Dictionary, which can be found on the Main Programme Data Releases page. We also supply an overview of the available clinical and phenotype data here: Clinical and Phenotype data.
Browsing data in LabKey
When you click on a table, it will be displayed in the LabKey application. LabKey tables look and behave much like Excel spreadsheets.
Using the participant id
All participants within the Main Programme are assigned a participant id, a pseudo-anonymised identifier that is unique to each participant. It is found in almost all tables under the 'Participant ID' column and can be used to link data across tables.
By default, 100 rows of the table are displayed. This can be changed by clicking on the PAGING dropdown button and selecting the number of rows you want to display. If you scroll to the far-right of the table, the total number of rows in the entire table will be displayed. Please be careful, as this does not necessarily equal the number of participants in the table. This is because the same participant may appear on multiple rows of a table. For instance, if a participant has many associated disease terms, each term is displayed on its own row.
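The difference between the row count and the participant count can be checked on an exported table. The following is a minimal sketch using only the Python standard library; the CSV content and the column names `participant_id` and `disease_term` are invented for illustration and are not taken from a real LabKey table.

```python
import csv
import io

# Hypothetical stand-in for a table exported from LabKey as a .csv file.
# Column names and values are invented for illustration only.
EXPORTED_CSV = """participant_id,disease_term
1001,Term A
1001,Term B
1002,Term A
1003,Term C
1003,Term D
1003,Term E
"""


def count_rows_and_participants(csv_text, id_column="participant_id"):
    """Return (row_count, unique_participant_count) for CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # A set keeps each participant id once, however many rows it appears on.
    unique_ids = {row[id_column] for row in rows}
    return len(rows), len(unique_ids)


row_count, participant_count = count_rows_and_participants(EXPORTED_CSV)
print(f"{row_count} rows, {participant_count} unique participants")
```

With a real export, the same function could be applied to the file contents, provided `id_column` is set to the participant id column name used in that file.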
Tables in LabKey can easily be sorted in ascending or descending order. To do this, click on the column you want to sort by. This will open a box which can be used to sort the data in that column.
Data in table columns can be filtered in many different ways. To filter a table, click on the column you want to filter and select 'Filter...' (as shown above). This will open a dialog box. By default, the 'Choose Values' filter tab will be displayed. In this tab, you can simply tick the row values you want to retrieve.
In the example below (top-left), only 'Rare Diseases' will be returned in the Programme column of the participant table ('Cancer' is filtered out). By selecting the 'Choose Filters' tab, you can apply different filter logic by clicking 'Filter Type', which shows a list of available filters. In the example below (top-right), the Year of Birth column is filtered for values between 1970 and 1980 (inclusive).
The displayed table will have your filters applied. The total row count, shown at the top far-right of the table header, will have decreased accordingly.
You can build up as many filters as you like as shown below (bottom). You can clear the filter using the CLEAR ALL button.
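The same two filters (a 'Choose Values' match on Programme plus an inclusive range on Year of Birth) can be reproduced on an exported text file. This is a minimal sketch using the Python standard library; the CSV content and the column names `programme` and `year_of_birth` are assumptions made for illustration, not the names used in any particular release.

```python
import csv
import io

# Hypothetical export of the participant table; values are invented.
EXPORTED_CSV = """participant_id,programme,year_of_birth
1001,Rare Diseases,1969
1002,Rare Diseases,1970
1003,Cancer,1975
1004,Rare Diseases,1980
1005,Rare Diseases,1981
"""


def filter_participants(csv_text, programme, min_year, max_year):
    """Keep rows matching the programme with a birth year in [min_year, max_year]."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row["programme"] == programme
        # Both bounds are inclusive, mirroring the 'between' filter.
        and min_year <= int(row["year_of_birth"]) <= max_year
    ]


matches = filter_participants(EXPORTED_CSV, "Rare Diseases", 1970, 1980)
print([row["participant_id"] for row in matches])
```

Stacking further conditions in the list comprehension corresponds to building up additional filters in the application.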
Exporting to spreadsheet or text file
Full or filtered tables can easily be exported in several formats. To export a table, click the EXPORT button and choose the export type. Export types include Excel Workbook (.xlsx) and text file (.csv, .tsv) as shown below. We don't recommend exporting large tables into an Excel Workbook; it may be better to first filter the table down to a more manageable size.
Joining across multiple tables
It is often necessary to join data across multiple tables. For example, you may want to join the participant table with the sequencing report table to identify the file system paths for your participants of interest. Joining across tables cannot be done directly from the LabKey application. Instead, the data must either be exported (as an Excel Workbook or text file) and joined using Excel or similar, or queried using the LabKey APIs (see below).
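As an alternative to Excel, two exported text files can be joined on the participant id with a short script. The sketch below uses only the Python standard library and performs an inner join; the CSV content, the column names (`participant_id`, `programme`, `path`), and the file paths are all hypothetical. It assumes the participant table has one row per participant, while the sequencing report may have several.

```python
import csv
import io

# Hypothetical exports of two tables; all values are invented.
PARTICIPANT_CSV = """participant_id,programme
1001,Rare Diseases
1002,Cancer
1003,Rare Diseases
"""

SEQUENCING_REPORT_CSV = """participant_id,path
1001,/hypothetical/path/1001_a
1001,/hypothetical/path/1001_b
1003,/hypothetical/path/1003_a
1004,/hypothetical/path/1004_a
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def inner_join(left_rows, right_rows, key="participant_id"):
    """Join right_rows onto left_rows, keeping only keys present in both.

    Assumes left_rows has at most one row per key value.
    """
    left_by_key = {row[key]: row for row in left_rows}
    joined = []
    for right in right_rows:
        left = left_by_key.get(right[key])
        if left is not None:
            # Merge the two rows; the shared key column has the same value.
            joined.append({**left, **right})
    return joined


joined = inner_join(read_rows(PARTICIPANT_CSV), read_rows(SEQUENCING_REPORT_CSV))
for row in joined:
    print(row["participant_id"], row["programme"], row["path"])
```

Participants without a matching row in the other table (1002 and 1004 in this invented data) are dropped, which is the usual behaviour when looking up file paths for a cohort of interest.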
Using the LabKey APIs
The LabKey APIs are a powerful way of connecting with the data and provide consistent, reproducible results, which can be shared with others easily. The LabKey APIs are provided in many programming languages (R, Python, Java...), are simple to use and only require a few lines of code. We always recommend using the LabKey APIs to interrogate the data once exploratory analysis has been conducted using the LabKey Desktop application. We have written documentation and provide example code of how to use the LabKey APIs here: Using the LabKey API.
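A join across tables can also be expressed as a LabKey SQL query and sent through the Python API. The sketch below is illustrative only and is not the documented Genomics England method: it assumes the `labkey` Python package with its `APIWrapper` class and `query.execute_sql` method, and the server name, project path, context path, schema, table names, and column names are all hypothetical placeholders. It also assumes that credentials have already been configured for the package. Refer to Using the LabKey API for the supported examples.

```python
def build_join_sql(participant_ids):
    """Build a LabKey SQL query joining two tables on participant_id.

    The table names (participant, sequencing_report) and column names
    (participant_id, programme, path) are assumptions for illustration.
    """
    if not participant_ids:
        raise ValueError("at least one participant id is required")
    # Casting to int ensures only numeric ids are placed in the query text.
    ids = ", ".join(str(int(pid)) for pid in participant_ids)
    return (
        "SELECT p.participant_id, p.programme, s.path "
        "FROM participant AS p "
        "INNER JOIN sequencing_report AS s "
        "ON p.participant_id = s.participant_id "
        f"WHERE p.participant_id IN ({ids})"
    )


def fetch_rows(sql, server, container_path, context_path, schema):
    """Run the query against a LabKey server and return the result rows."""
    # Imported here so the query builder can be used without the package.
    from labkey.api_wrapper import APIWrapper

    api = APIWrapper(server, container_path, context_path, use_ssl=True)
    result = api.query.execute_sql(schema, sql)
    return result["rows"]


sql = build_join_sql([1001, 1003])
print(sql)

# Hypothetical call; every argument below is a placeholder:
# rows = fetch_rows(sql, "labkey.example.org", "project/release-folder",
#                   "labkey", "lists")
```

Keeping the query text in a script means the same join can be rerun unchanged against a newer data release by changing only the project path.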