About

This page highlights some best practices for working with containers within the Research Environment. Please note that this page is aimed at more advanced command-line users and provides them with the setup necessary to run their containers. It does not aim to teach working with containers from scratch; some of the information here may be useful during that learning process, but we suggest other online resources for learning how to work with containers.

This is a new feature that we are rolling out on Helix. If you have any feedback or suggestions on the use of Singularity and containers, please reach out to us via the Genomics England Service Desk.


How-to

It is now possible to run containerised software on the Helix HPC. While Docker is not available, Singularity is, and through it various Docker containers can be pulled and run. Here we provide some best practices on how to set it up and run it. For security reasons, pushing containers out of the environment is not allowed.

Loading Singularity on the HPC

To use Singularity on Helix please type the following: module load singularity/3.2.1
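For example, a quick way to check that the module has loaded correctly is shown below; the version string in the comment is what we would expect for this module, but may differ if a newer version is installed:

Checking the Singularity module
module load singularity/3.2.1
singularity --version    # e.g. singularity version 3.2.1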

Caching (Singularity)

Whenever you create an image with Singularity on the HPC, the pulled files are automatically cached. By default the cached files are located in /home/<username>/.singularity/. However, it may be that you are submitting and creating an image via a compute node in an interactive session; in that case the caching will write the files there, which may potentially flood the compute node's memory. You can redirect this location by setting the environment variable SINGULARITY_CACHEDIR.

For example, we recommend placing the environment variable in your .bashrc script as follows: export SINGULARITY_CACHEDIR="/re_gecip/my_gecip_/username/singularity_cache/".

To view your current cache, use the command singularity cache list; use singularity cache list --all to view all the individual blobs that have been pulled.

To clean up your cache you can use the command: singularity cache clean
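Putting the caching commands together, a minimal setup could look like the following; the cache directory matches the example above and should be changed to a folder you own within your Gecip or Discovery Forum area:

Example cache setup
# redirect the cache away from /home (add this line to your ~/.bashrc)
export SINGULARITY_CACHEDIR="/re_gecip/my_gecip_/username/singularity_cache/"
mkdir -p "$SINGULARITY_CACHEDIR"

# inspect the cache
singularity cache list
singularity cache list --all

# remove cached images and blobs
singularity cache clean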

Running bcftools from containers (Example quay.io)

As an example of how to run containers on Helix, we showcase bcftools 1.13 (this version is not currently available on Helix as an installed module). The repository on quay.io has various builds available that run seamlessly on Helix: https://quay.io/repository/biocontainers/bcftools?tab=info.

Below we first load Singularity, pull the container and build a Singularity image so that you do not need to pull the container every time. We then show the basic command, and an example where we mount the /genomes/ folder and run a simple bcftools view command.

Some containers may be sizeable, so we recommend pulling and/or creating images via an interactive session. The bcftools container in this example is ~234 MB, but containers can easily exceed a gigabyte depending on the complexity of the software. Please also note the caching section above.


Running bcftools via containers
# load Singularity
module load singularity/3.2.1

# pull the container and build a local image (bcftools_v1.13.sif)
singularity pull bcftools_v1.13.sif docker://quay.io/biocontainers/bcftools:1.13--h3a49de5_0

# basic command: check the bcftools version inside the image
singularity exec bcftools_v1.13.sif bcftools --version

# mount the /genomes folder (underlying path and short path) and view a VCF header
singularity exec --bind /nas/weka.gel.zone/pgen_genomes:/nas/weka.gel.zone/pgen_genomes \
--bind /genomes:/genomes bcftools_v1.13.sif \
bcftools view -h /genomes/by_date/2021-03-02/BE00018796/LP3000099-DNA_C07/Variations/LP3000099-DNA_C07.vcf.gz | head


Mounting drives and environment variables

In the above example we use the --bind argument to mount the /genomes folder into the container. By default, containers will not have the same drives mounted, so this needs to be added manually. An added complication of our file system is that the short paths you use day to day (such as /genomes/) are aliases for longer underlying paths: the actual path of our /genomes/ folder is /nas/weka.gel.zone/pgen_genomes/. On a day-to-day basis this will not hinder you, but for containers it is something to be aware of: you first need to --bind the full underlying path, and then add another --bind for the short path. As we understand that this can be rather frustrating, we provide a list of useful bind variables below (an example of defining them for use in the later examples follows the list) to ensure a path of least resistance.

--bind variables of interest
MOUNT_GENOMES='--bind /nas/weka.gel.zone/pgen_genomes:/nas/weka.gel.zone/pgen_genomes --bind /genomes:/genomes'

MOUNT_GEL_DATA_RESOURCES='--bind /nas/weka.gel.zone/pgen_int_data_resources:/nas/weka.gel.zone/pgen_int_data_resources --bind /gel_data_resources:/gel_data_resources'

MOUNT_PUBLIC_DATA_RESOURCES='--bind /nas/weka.gel.zone/pgen_public_data_resources:/nas/weka.gel.zone/pgen_public_data_resources --bind /public_data_resources:/public_data_resources'

MOUNT_SCRATCH='--bind /nas/weka.gel.zone/re_scratch:/nas/weka.gel.zone/re_scratch --bind /re_scratch:/re_scratch'

MOUNT_RE_GECIP='--bind /nas/weka.gel.zone/re_gecip:/nas/weka.gel.zone/re_gecip --bind /re_gecip:/re_gecip'

MOUNT_DISCOVERY_FORUM='--bind /nas/weka.gel.zone/discovery_forum:/nas/weka.gel.zone/discovery_forum --bind /discovery_forum:/discovery_forum'
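These variables need to be defined in your shell session or script before they can be used in the examples that follow; a minimal sketch is shown below, using two of the variables, with an additional hint for looking up the underlying path of a symlinked folder:

Defining the mount variables
# define these (in your session, ~/.bashrc or job script) before calling singularity
export MOUNT_GENOMES='--bind /nas/weka.gel.zone/pgen_genomes:/nas/weka.gel.zone/pgen_genomes --bind /genomes:/genomes'
export MOUNT_RE_GECIP='--bind /nas/weka.gel.zone/re_gecip:/nas/weka.gel.zone/re_gecip --bind /re_gecip:/re_gecip'

# if a folder is a symbolic link, readlink shows the underlying path it points to
readlink -f /genomes    # e.g. /nas/weka.gel.zone/pgen_genomes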

Below is an example where we use two of these variables to save the header of a VCF into a .txt file. The example assumes that you have also run the initial bcftools example shown above. Please note that you should change the file path to your own folder, and check whether you need the Gecip or Discovery Forum example.

Example combined mounts
# Gecip example
singularity exec $MOUNT_GENOMES $MOUNT_RE_GECIP bcftools_v1.13.sif \
bcftools view -h /genomes/by_date/2021-03-02/BE00018796/LP3000099-DNA_C07/Variations/LP3000099-DNA_C07.vcf.gz > /re_gecip/<YOUR_FILE_PATH>/sing_cont_bcftools_test.txt

# Discovery Forum example
singularity exec $MOUNT_GENOMES $MOUNT_DISCOVERY_FORUM bcftools_v1.13.sif \
bcftools view -h /genomes/by_date/2021-03-02/BE00018796/LP3000099-DNA_C07/Variations/LP3000099-DNA_C07.vcf.gz > /discovery_forum/<YOUR_FILE_PATH>/sing_cont_bcftools_test.txt


Working with containers within a workflow

There are two ways to go about this: either pull the container directly within a task of the workflow (as the Cromwell configuration below does), or create an image beforehand and have the workflow call upon that image (a sketch of this follows directly below). You can also add some of the --bind examples from above to the SINGULARITY_MOUNTS variable (see the note after the configuration).
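For the second approach, a minimal sketch of a task or job script that calls a pre-built image is shown below; the image location is a placeholder and the mount variables are assumed to be defined as in the section above:

Calling a pre-built image from a workflow task
module load singularity/3.2.1

# image built beforehand with:
#   singularity pull bcftools_v1.13.sif docker://quay.io/biocontainers/bcftools:1.13--h3a49de5_0
IMAGE=/re_gecip/<YOUR_FILE_PATH>/bcftools_v1.13.sif

singularity exec $MOUNT_GENOMES $MOUNT_RE_GECIP "$IMAGE" bcftools --version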

Caching within Cromwell
submit-docker = """
module load singularity/3.2.1
SINGULARITY_MOUNTS='--bind /nas/weka.gel.zone/re_scratch:/nas/weka.gel.zone/re_scratch \
					--bind /nas/weka.gel.zone/pgen_genomes:/nas/weka.gel.zone/pgen_genomes \
					--bind /nas/weka.gel.zone/re_gecip:/nas/weka.gel.zone/re_gecip \
					--bind /nas/weka.gel.zone/discovery_forum:/nas/weka.gel.zone/discovery_forum'

if [ -z "$SINGULARITY_CACHEDIR" ];
then
    CACHE_DIR=$HOME/.singularity/cache
else
    CACHE_DIR=$SINGULARITY_CACHEDIR
fi

mkdir -p $CACHE_DIR
LOCK_FILE=$CACHE_DIR/singularity_pull_flock

flock --exclusive --timeout 900 $LOCK_FILE \
singularity exec docker://${docker} \
echo "Sucessfully pulled ${docker}"

bsub \
-q ${lsf_queue} \
-P ${lsf_project} \
-J ${job_name} \
-cwd ${cwd} \
-o ${out} \
-e ${err} \
-n ${cpu} \
-R 'rusage[mem=${memory_mb}] span[hosts=1]' \
-M ${memory_mb} \
singularity exec --containall $SINGULARITY_MOUNTS --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
"""


List of available repositories

There are various container repositories which have been whitelisted for Helix. Below you will find the current list of available repositories; example pull commands follow the list:

Sylabs.io (images at cloud.sylabs.io)

Docker Hub

Quay.io
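For reference, the pull URI differs per repository; the image names below are only illustrative examples:

Example pull commands per repository
# Docker Hub
singularity pull docker://ubuntu:20.04

# Quay.io
singularity pull docker://quay.io/biocontainers/bcftools:1.13--h3a49de5_0

# Sylabs Cloud Library
singularity pull library://alpine:latest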

