Page tree
Skip to end of metadata
Go to start of metadata


(warning) External links on this page can only be accessed from outside the RE (warning)

Overview

This tutorial uses Python LabKey API to query the clinical and phenotype data. Please see Python LabKey API guide to find out additional information on the Python LabKey API.

Prerequisites 

  • You will need to create and enter your credentials into your .netrc profile

Loading the LabKey module in Python

IVD - loading Python

On Inuvika Virtual Desktop machines, the Python LabKey module is installed for both Python 2 (version 2.7.12) and Python 3 (versions 3.6.5 and 3.8.1).

To load Python, open a terminal and type the following:

# to use Python 2
module load python/2.7.12
# to use Python 3, e.g. version 3.8.1
module load python/3.8.1

HPC Helix - loading Python

On Helix, Python is managed using conda environments - please find more details at: ARCHIVED: HPC (Helix) Migration 2020

In particular, for Python 3, the Labkey API is accessible for example from the py3pypi environment, so you can load Python like this:

. /resources/conda/miniconda3/etc/profile.d/conda.sh && conda activate py3pypi

Both systems - loading the module for Labkey

After that, on both IVD and Helix, you can load the LabKey API in Python by importing it like any other Python module. Open Python in the terminal (or in your script) and type:

import labkey

Important Note

All query functions within the Python LabKey module (we will use select_rows and execute_sql) will by default extract no more than 100,000 rows from any given LabKey table, rather than the whole table.

All such query functions have an optional parameter called max_rows that sets the maximum number of rows that can be retrieved by that query, and that can be set to any integer value. The default value for this parameter is "None", which translates into 100,000 rows being retrieved.

There is no standard way to retrieve all rows in a large table using this version of the API - if you believe that you will retrieve more than 100,000 rows, or you are unsure, you must set the optional parameter max_rows to an integer value that is large enough for your use case. 

Please see the examples section below, however remember that the value you need to set max_rows to will depend on the size of the table and on your query.


You are now ready to use Python and the LabKey API.

Types of queries

Fetching an entire table from LabKey

(question) The following code outlines how to fetch an entire table from LabKey and store it in a Pandas dataframe - note the use of the max_rows parameter (see above), which is essential here as the sequencing_report table has 107,623 rows in Data Release 8.

# Import the needed modules, labkey and pandas (for dataframes)
import labkey
import pandas as pd
 
# Specify what we are connecting to, and what schema and tables we want
labkey_server = "labkey-embassy.gel.zone"  # The labkey server we are connecting to. This will not change
project_name = "main-programme/main-programme_v8_2019-11-28"  # The data we want to access. This will change depending on your study
context_path = "labkey"   # This does not change
schema_name = "lists"  # The schema we are getting data from. This does not change
query_name = "sequencing_report"  # The table we want to get data from. This does change
 
# Create an object that will let us connect to the LabKey databases. This does not change.
server_context = labkey.utils.create_server_context(
    labkey_server, project_name, context_path, use_ssl=True
)
 
# The data are returned and stored in the variable results.
results = labkey.query.select_rows(server_context, schema_name, query_name, max_rows=200000)
 
# Data are returned as a dictionary, will all of the table information stored under the key "rows".
# We make a dataframe of all of the table information using pandas.
table_of_data = pd.DataFrame(results["rows"])

Fetching a specific part of a table from LabKey

(question) The following code outlines how to filter a table by columns and/or rows in order to return a subset of a table.

# Import the needed modules, labkey and pandas (for dataframes)
import labkey
import pandas as pd
 
# Specify what we are connecting to, and what schema and tables we want
labkey_server = "labkey-embassy.gel.zone"  # The labkey server we are connecting to. This will not change
project_name = "main-programme/main-programme_v8_2019-11-28"  # The data we want to access. This will change depending on your study
context_path = "labkey"  # This does not change
schema_name = "lists"  # The schema we are getting data from. This does not change
 
query_name = "sequencing_report"  # The table we want to get data from. This does change
 
column1 = "Participant_Id"  # The first column that we are interested in. This does change.
column2 = "Type"  # The second column that we are interested in. This does change.
column3 = "Delivery_Date"  # The third column that we are interested in. This does change.
 
filter1 = labkey.query.QueryFilter("Type", "cancer somatic") # Return only reports for cancer somatic samples.
filter2 = labkey.query.QueryFilter("Lab_sample_id", "1018056774, 1018068634", labkey.query.QueryFilter.Types.BETWEEN) # Return only reports with Lab sample id included in a range.

combined_filter = [filter1, filter2]  # Combine the previous two filters.
 
# Create an object that will let us connect to the LabKey databases. This does not change.
server_context = labkey.utils.create_server_context(
    labkey_server, project_name, context_path, use_ssl=True
)
 
# Fetch and store only the columns that we want.
selected_columns = labkey.query.select_rows(
    server_context, schema_name, query_name,
    columns=",".join([column1, column2, column3])
)
 
# Filters can be used to fetch only the rows that we are interested in.
selected_rows1 = labkey.query.select_rows(
    server_context, schema_name, query_name,
    filter_array =  [filter1]
)
selected_rows2 = labkey.query.select_rows(#
    server_context, schema_name, query_name,
    filter_array =  [filter2]
)
 
# Filters can also be combined to with each other and with column selection to do more powerful selection of data from a table.
selected_data = labkey.query.select_rows(
    server_context, schema_name, query_name,
    filter_array = combined_filter,
    columns = ",".join([column1, column2, column3])
)
 
# Data can then be stored in a dataframe as normal.
table_of_data = pd.DataFrame(selected_data["rows"])

Fetching data spread across multiple tables in LabKey

The previous code is good for when all the data you want is stored in one table. Often this is not the case. The following code will show how to get data that is spread across multiple tables in LabKey.

Until now we have been using the select_rows function in LabKey. We could continue to do this for this example, but then we would have to make a call for each table we want to select data from, applying the filters we want and joining the resulting data returned into one big dataframe.

An easier way to achieve this is with the execute_sql function in LabKey. With this, we can write a block of sql code that LabKey can execute, which can combine all of the above steps into one step.

(question) In the following code, we will get a table of all the people in the cancer programme who are numbered between 6000 and 6200.

# Import the needed modules, labkey and pandas (for dataframes)
import labkey
import pandas as pd
 
# Specify what we are connecting to, and what schema and tables we want
labkey_server = "labkey-embassy.gel.zone"  # The labkey server we are connecting to. This will not change
project_name = "main-programme/main-programme_v8_2019-11-28"  # The data we want to access. This will change depending on your study
context_path = "labkey" # This does not change
schema_name = "lists"  # The schema we are getting data from. This does not change
 
# Create the SQL query as a string (note the spaces)
sql 
= (
"select "
    "participant.Participant_Id, participant.Programme, sequencing_report.lab_sample_id "
"from "
    "lists.participant "
"join "
    "lists.sequencing_report "
"on "
    "participant.Participant_Id = sequencing_report.Participant_Id "
"where "
    "sequencing_report.lab_sample_id between 1018056774 and 1018068634"
)


# Create an object that will let us connect to the LabKey databases. This does not change.
server_context = labkey.utils.create_server_context(
    labkey_server, project_name, context_path, use_ssl=True
)
 
# The data are returned and stored in the variable results.
results = labkey.query.execute_sql(server_context, schema_name, sql)
 
# Data are returned as a dictionary, will all of the table information stored under the key "rows".
# We make a dataframe of all of the table information using pandas.
table_of_data = pd.DataFrame(results["rows"])

Potential error upon an incorrect setup ("DOCTYPE"-error)

In case the LabKey API is incorrectly setup, please have a look here:
LabKey DOCTYPE error
If this is not the issue, but you still have problems using the LabKey API, please contact the Service Desk.

  • No labels