Page tree
Skip to end of metadata
Go to start of metadata


(warning) External links on this page can only be accessed from outside the RE (warning)

Overview

This tutorial uses R LabKey API to query the clinical and phenotype data. Please see Rlabkey Package guide to find out additional information on the R LabKey API.

Prerequisites 

  • You will firstly need to install and load the Rlabkey package into your R session by following this guide. We used Rlabkey version 2.2.1 in these examples.
  • You will need to create and enter your credentials into your .netrc profile

Types of queries

Fetching an entire table

Fetching an entire LabKey table is easy. You will only need to change the name of the main programme release version and the name of the the LabKey table you need. You can get the release versions by navigating to the LabKey Application and selecting the Main Programme project - the names of the releases are detailed in the subsequent folder. The names of each table can be seen in the data views when you click on a release folder. 

(question) The code below loads the entire sequencing_report table and saves it as an R data.frame. The function labkey.selectRows is used to do this. 

library(Rlabkey)

# Set the baseURL
labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")

sequencing_report <- labkey.selectRows(								
    
    schemaName="lists",                                             # Do not change this
    colNameOpt="rname",                                             # Do not change this
	maxRows = 100000000,                                            # Do not change this
    folderPath="/main-programme/main-programme_v8_2019-11-28",    # This can be changed to different main programme releases    
    queryName="sequencing_report"                                   # This can be changed to different table names
										
)

Fetching a table and filtering

The function labkey.selectRows can also include additional arguments to select specific columns (colSelect), filter for specified row values (colFilter), and sort by columns (colSort). 

(question) The code below loads and subsets the participant table for rare disease probands, born between 1990 and 2010. It then selects only the participant id, family id, and year of birth column, and sorts by year of birth. You can see the full range of R LabKey filters here.

library(Rlabkey)

# Set the baseURL
labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")

participant <- labkey.selectRows(
    
    # Default selection
    schemaName="lists",                                             # Do not change this
    colNameOpt="rname",                                             # Do not change this
	maxRows = 100000000,                                            # Do not change this
    folderPath="/main-programme/main-programme_v8_2019-11-28",    # This can be changed to different main programme releases
    queryName="participant",                                        # This can be changed to different table names
    
    # Additional parameters
    colFilter=makeFilter(                                                          # Make various filters to subset rows
      c("participant_type", "EQUAL", "Proband"),
      c("year_of_birth", "BETWEEN", "1990,2010")),
    colSelect=c("participant_id", "rare_diseases_family_id", "year_of_birth"),     # Choose to only select these columns
    colSort=("+year_of_birth")												       # Sort by these columns
)

Fetching a table using SQL

Often, you will need to merge data from multiple tables using a common key (such as the unique participant id). The easiest, safest, and most reproducible way to do this is by using the labkey.executeSql function. We highly recommend using this function to perform more complex analysis across multiple tables.

(question) The following code will use the labkey.executeSql function to perform a complex query which:

  1. Selects
    1. The participant id and HPO term columns from the 'rare diseases participant phenotype' table
    2. The year of birth column from the 'participant' table
    3. The file path column to the genomic data from the 'genome file paths and types' table
  2. Left joins
    1. By the unique 'participant id' common to all tables
  3. Filters
    1. For positive diagnoses (hpo present = YES) in the 'rare diseases participant phenotype' table
    2. For two HPO terms (hpo id =  HP:0001249 and HP:0000365) in the 'rare diseases participant phenotype' table
    3. For small variant VCF file paths (file sub type = standard VCF) in the 'genome file paths and types' table
    4. For genomes aligned to GRCh38 (genome build = 38) in the 'genome file paths and types' table
  4. Orders 
    1. By year of birth


library(Rlabkey)

# Set the baseURL
labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

# Write your SQL query here
query <- "SELECT rd.participant_id, rd.hpo_id, par.year_of_birth, sq.file_path
          FROM rare_diseases_participant_phenotype AS rd
          LEFT JOIN genome_file_paths_and_types AS sq 
          ON rd.participant_id = sq.participant_id
          LEFT JOIN participant AS par 
          ON rd.participant_id = par.participant_id
          WHERE rd.hpo_id IN ('HP:0001249', 'HP:0000365') AND rd.hpo_present = 'Yes'
          AND sq.file_sub_type = 'Standard VCF' AND sq.genome_build = 'GRCh38'
          ORDER BY par.year_of_birth;"

mysql <- labkey.executeSql(
    schemaName="lists",  												# Do not change this
    colNameOpt = "rname",												# Do not change this
    maxRows = 100000000, 												# Do not change this
    folderPath="/main-programme/main-programme_v8_2019-11-28",		# This can be changed to different main programme releases
    sql = query															# This can be changed to your query of choice
)

Querying the Hospital Episode Statistics tables

There are two 'Medical History' tables within the Main Programme LabKey projects which give diagnoses information encoded by the ICD-10 medical classification. These are the Admitted Patient Care and Outpatient Care tables which are named as 'hes_apc' and 'hes_op' respectively in LabKey (hes = Hospital Episode Statistics). Though these tables are very large, the information containing the ICD-10 terms can be found in the columns prefixed by 'diag'. For example 'diag_01', which is the primary diagnosis, and 'diag_02' and so on. The hes_apc table has 20 diag columns whereas the hes_op table has 12 diag columns. We can retrieve these columns using the labkey.executeSql function as shown below. 

(question) The following code will use the labkey.executeSql function to retrieve all of the ICD-10 columns for each participant from both the hes_apc and hes_op tables. It will also retrieve the time of the participant's visit and their participant ID which you can use to link back to the other tables. 


library(Rlabkey)

# Set the baseURL
labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

# Gather the 20 diag terms for the hes_apc table
hes_apc_q <- paste("diag_", sprintf("%02d", seq(1:20)), sep = "", collapse = ", ")

# Execute the SQL query to get the ICD-10 terms from the hes_apc table
hes_apc <- labkey.executeSql(                             
  
  schemaName ="lists",                                                     		# Do not change this
  colNameOpt ="rname",                                             		 		# Do not change this
  maxRows = 100000000,                                            				# Do not change this
  folderPath="/main-programme/main-programme_v8_2019-11-28",    				# This can be changed to different main programme releases   
  sql = paste("SELECT participant_id, epistart,", hes_apc_q, "FROM hes_apc")    # Do not change this
)

# Gather the 12 diag terms for the hes_op table
hes_op_q <- paste("diag_", sprintf("%02d", seq(1:12)), sep = "", collapse = ", ")

# Execute the SQL query to get the ICD-10 terms from the hes_op table
hes_op <- labkey.executeSql(                             
  
  schemaName ="lists",                                             				# Do not change this
  colNameOpt ="rname",                                             				# Do not change this
  maxRows = 100000000,                                            				# Do not change this
  folderPath="/main-programme/main-programme_v8_2019-11-28",    				# This can be changed to different main programme releases   
  sql = paste("SELECT participant_id, apptdate,", hes_op_q, "FROM hes_op")      # Do not change this
)
  • No labels