External links on this page can only be accessed from outside the RE
Overview
This tutorial uses R LabKey API to query the clinical and phenotype data. Please see Rlabkey Package guide to find out additional information on the R LabKey API.
Prerequisites
- You will firstly need to install and load the Rlabkey package into your R session by following this guide. We used Rlabkey version 2.2.1 in these examples.
- You will need to create and enter your credentials into your .netrc profile.
Types of queries
Fetching an entire table
Fetching an entire LabKey table is easy. You will only need to change the name of the main programme release version and the name of the the LabKey table you need. You can get the release versions by navigating to the LabKey Application and selecting the Main Programme project - the names of the releases are detailed in the subsequent folder. The names of each table can be seen in the data views when you click on a release folder.
The code below loads the entire sequencing_report table and saves it as an R data.frame. The function labkey.selectRows is used to do this.
library(Rlabkey) # Set the baseURL labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/") sequencing_report <- labkey.selectRows( schemaName="lists", # Do not change this colNameOpt="rname", # Do not change this maxRows = 100000000, # Do not change this folderPath="/main-programme/main-programme_v8_2019-11-28", # This can be changed to different main programme releases queryName="sequencing_report" # This can be changed to different table names )
Fetching a table and filtering
The function labkey.selectRows can also include additional arguments to select specific columns (colSelect), filter for specified row values (colFilter), and sort by columns (colSort).
The code below loads and subsets the participant table for rare disease probands, born between 1990 and 2010. It then selects only the participant id, family id, and year of birth column, and sorts by year of birth. You can see the full range of R LabKey filters here.
library(Rlabkey) # Set the baseURL labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/") participant <- labkey.selectRows( # Default selection schemaName="lists", # Do not change this colNameOpt="rname", # Do not change this maxRows = 100000000, # Do not change this folderPath="/main-programme/main-programme_v8_2019-11-28", # This can be changed to different main programme releases queryName="participant", # This can be changed to different table names # Additional parameters colFilter=makeFilter( # Make various filters to subset rows c("participant_type", "EQUAL", "Proband"), c("year_of_birth", "BETWEEN", "1990,2010")), colSelect=c("participant_id", "rare_diseases_family_id", "year_of_birth"), # Choose to only select these columns colSort=("+year_of_birth") # Sort by these columns )
Fetching a table using SQL
Often, you will need to merge data from multiple tables using a common key (such as the unique participant id). The easiest, safest, and most reproducible way to do this is by using the labkey.executeSql function. We highly recommend using this function to perform more complex analysis across multiple tables.
The following code will use the labkey.executeSql function to perform a complex query which:
- Selects
- The participant id and HPO term columns from the 'rare diseases participant phenotype' table
- The year of birth column from the 'participant' table
- The file path column to the genomic data from the 'genome file paths and types' table
- Left joins
- By the unique 'participant id' common to all tables
- Filters
- For positive diagnoses (hpo present = YES) in the 'rare diseases participant phenotype' table
- For two HPO terms (hpo id = HP:0001249 and HP:0000365) in the 'rare diseases participant phenotype' table
- For small variant VCF file paths (file sub type = standard VCF) in the 'genome file paths and types' table
- For genomes aligned to GRCh38 (genome build = 38) in the 'genome file paths and types' table
- Orders
- By year of birth
library(Rlabkey) # Set the baseURL labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") # Write your SQL query here query <- "SELECT rd.participant_id, rd.hpo_id, par.year_of_birth, sq.file_path FROM rare_diseases_participant_phenotype AS rd LEFT JOIN genome_file_paths_and_types AS sq ON rd.participant_id = sq.participant_id LEFT JOIN participant AS par ON rd.participant_id = par.participant_id WHERE rd.hpo_id IN ('HP:0001249', 'HP:0000365') AND rd.hpo_present = 'Yes' AND sq.file_sub_type = 'Standard VCF' AND sq.genome_build = 'GRCh38' ORDER BY par.year_of_birth;" mysql <- labkey.executeSql( schemaName="lists", # Do not change this colNameOpt = "rname", # Do not change this maxRows = 100000000, # Do not change this folderPath="/main-programme/main-programme_v8_2019-11-28", # This can be changed to different main programme releases sql = query # This can be changed to your query of choice )
Querying the Hospital Episode Statistics tables
There are two 'Medical History' tables within the Main Programme LabKey projects which give diagnoses information encoded by the ICD-10 medical classification. These are the Admitted Patient Care and Outpatient Care tables which are named as 'hes_apc' and 'hes_op' respectively in LabKey (hes = Hospital Episode Statistics). Though these tables are very large, the information containing the ICD-10 terms can be found in the columns prefixed by 'diag'. For example 'diag_01', which is the primary diagnosis, and 'diag_02' and so on. The hes_apc table has 20 diag columns whereas the hes_op table has 12 diag columns. We can retrieve these columns using the labkey.executeSql function as shown below.
The following code will use the labkey.executeSql function to retrieve all of the ICD-10 columns for each participant from both the hes_apc and hes_op tables. It will also retrieve the time of the participant's visit and their participant ID which you can use to link back to the other tables.
library(Rlabkey) # Set the baseURL labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/") # Gather the 20 diag terms for the hes_apc table hes_apc_q <- paste("diag_", sprintf("%02d", seq(1:20)), sep = "", collapse = ", ") # Execute the SQL query to get the ICD-10 terms from the hes_apc table hes_apc <- labkey.executeSql( schemaName ="lists", # Do not change this colNameOpt ="rname", # Do not change this maxRows = 100000000, # Do not change this folderPath="/main-programme/main-programme_v8_2019-11-28", # This can be changed to different main programme releases sql = paste("SELECT participant_id, epistart,", hes_apc_q, "FROM hes_apc") # Do not change this ) # Gather the 12 diag terms for the hes_op table hes_op_q <- paste("diag_", sprintf("%02d", seq(1:12)), sep = "", collapse = ", ") # Execute the SQL query to get the ICD-10 terms from the hes_op table hes_op <- labkey.executeSql( schemaName ="lists", # Do not change this colNameOpt ="rname", # Do not change this maxRows = 100000000, # Do not change this folderPath="/main-programme/main-programme_v8_2019-11-28", # This can be changed to different main programme releases sql = paste("SELECT participant_id, apptdate,", hes_op_q, "FROM hes_op") # Do not change this )