Illumina has partnered with Genomics England to provide a whole genome sequencing (WGS) service pipeline that utilises a series of established algorithms to detect genomic variants with high accuracy.
The genomic data deposited in the Research Environment comprises the full output of the Illumina Whole Genome Sequencing Service Informatics pipeline and the Cancer Analysis Services pipeline.
This page summarises the genomic data within the Research Environment. For a more in-depth description, you can consult the Whole Genome Sequencing Service Informatics or Cancer Services Guide from Illumina.
External links on this page can only be accessed from outside the Research Environment (RE).
Genomic data structure
The genomic data in the Research Environment and their associated files are an exact copy of those generated by the Illumina sequencing pipeline.
The files for each sample are maintained in a consistent file structure and are named according to the sequencing platekey of the sample. The platekey is the combination of plate barcode and well coordinate; for example: LP12345678-DNA_A01.
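As an illustration of the platekey convention, the sketch below splits a platekey into its two parts. The function name `split_platekey` is hypothetical, and the code assumes that the well coordinate is everything after the last underscore, which matches the example above but is not a documented rule.

```python
def split_platekey(platekey: str) -> tuple[str, str]:
    """Split a platekey into (plate barcode, well coordinate)."""
    # Assumption: the well coordinate follows the last underscore,
    # as in the example LP12345678-DNA_A01.
    plate, separator, well = platekey.rpartition("_")
    if not separator or not plate or not well:
        raise ValueError(f"not a recognisable platekey: {platekey!r}")
    return plate, well


plate, well = split_platekey("LP12345678-DNA_A01")
print(plate, well)
```

For the example platekey this gives the plate barcode `LP12345678-DNA` and the well coordinate `A01`.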
The genomic data can be accessed through the file system, either by clicking on the Home icon on the Desktop or by navigating to 'home' through the command line. In home, you will see a folder called 'genomes', which contains a sub-folder called 'by_date'. This folder contains the genomic data generated by the Illumina sequencing pipeline, organised by date of delivery. Within each 'by_date' folder is the delivery folder for each genome. These are labelled with a unique identifier per genome, such as HX00122252. You can then click on the platekey folder to see all genomic data delivered for the sample in question.
Below is an example of the genomic file structure for the sample LP3000588-DNA_A01:
|Genomes folder||By date folder||Delivery ID folder||Platekey folder|
The best way to find the locations of genome files is to use the latest LabKey genome_file_types_and_paths table or the file locations in the latest version of Participant Explorer, as the files listed there are currently consented for use. Do not locate files by traversing or browsing directories: you may find genomes of participants who have since withdrawn consent, and any request to export these data via Airlock will be rejected.
Remember that the ~/genomes/ folder on the Desktop is mounted on the HPC under /genomes so you will be able to access genome folders from both environments. See here for more information.
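The mapping between the two mount points can be sketched as a small path conversion. This is a minimal sketch, not an official utility: the function name `to_hpc_path` and the home directory `/home/jbloggs` are hypothetical, `<date>` is a placeholder rather than a real folder name, and the input path is assumed to come from the LabKey table or Participant Explorer rather than from browsing.

```python
from pathlib import PurePosixPath


def to_hpc_path(desktop_path: str, home: str) -> str:
    """Map a path under <home>/genomes to the matching path under /genomes."""
    desktop_root = PurePosixPath(home) / "genomes"
    # relative_to raises ValueError if the path is not under <home>/genomes.
    relative = PurePosixPath(desktop_path).relative_to(desktop_root)
    return str(PurePosixPath("/genomes") / relative)


# Hypothetical home directory; "<date>" is a placeholder, not a real folder name.
print(to_hpc_path(
    "/home/jbloggs/genomes/by_date/<date>/HX00122252/LP3000588-DNA_A01",
    home="/home/jbloggs",
))
```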
In the desktop and terminal interface, you will be able to see all genome delivery folders.
Note that this does not mean you will have access to all of these folders.
You will only have access to the genome folders that you have been given permissions to based on your credentials.
The Illumina Whole Genome Sequencing Service Informatics pipeline and the Cancer Analysis Services pipeline perform a series of processes with the following software packages:
|Isaac||Aligns reads to the reference genome, trims and flags duplicates in the raw sequence.|
|Starling||A germline small variant caller; it generates small variant (SNV and small indels ≤ 50 bp) analysis calls.|
|Manta||A germline and somatic structural variant caller; it generates structural variant (SV) analysis calls.|
|Canvas||A germline and somatic copy number variant caller; it generates copy number (CNV) and loss of heterozygosity (LOH) analysis calls.|
|Strelka||Joint tumour/normal small-variant caller.|
|ExpansionHunter||A tool that looks for repeat expansions at several positions of interest.|
|HLATyper||A tool that generates likely HLA types for the sample.|
|ROHcaller||Identifies runs of homozygosity (ROHs) from whole-genome SNV variant call sets and predicts the most likely relationships of the sequenced individual's parents.|
The tools and versions used for each Illumina genome delivery depend on the sample type (germline/tumour) and on the Illumina pipeline version.
You can look at the header of the BAMs/VCFs to identify the Illumina pipeline and tool version used for each delivery.
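One way to inspect a VCF header is sketched below using only the Python standard library. The function names are hypothetical, and because this page does not state which header keys record the pipeline and tool versions, the keyword filter is only a convenience for scanning the header, not a statement of where the version is stored.

```python
import gzip


def vcf_header_lines(vcf_gz_path: str) -> list[str]:
    """Return the '##' meta-information lines of a compressed VCF."""
    header = []
    # bgzip output is gzip-compatible, so the gzip module can read it.
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            header.append(line.rstrip("\n"))
    return header


def lines_mentioning(header: list[str], keyword: str) -> list[str]:
    """Keep header lines containing the keyword, ignoring case."""
    keyword = keyword.lower()
    return [line for line in header if keyword in line.lower()]
```

For example, `lines_mentioning(vcf_header_lines(path), "version")` lists the header lines that mention a version, where `path` points to one of your `.vcf.gz` files. BAM headers are binary and are usually viewed with `samtools view -H` instead.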
Variants are annotated and the resulting statistics are compiled into a summary PDF. See here for the variant annotation pipeline from Illumina.
All genomes have been aligned against either GRCh37 or GRCh38. For each data release, the reference genome is indicated in LabKey along with the path to the folder where the genome is stored.
Genomic Data Contents
An example file structure for LP12345678-DNA_A01 is shown below. This is an exhaustive list, so not all samples will have all of the files below.
The key files are:
- ./Assembly/[Platekey].bam - Archival BAM file for sample
- ./Assembly/[Platekey].bam.bai - Index for the BAM file
- ./Variations/[Platekey].vcf.gz - Single nucleotide variant (SNV) and small insertion/deletion (1 bp–50 bp) calls in VCF format.
- ./Variations/[Platekey].genome.vcf.gz - Genome *.VCF file containing SNVs, indels, and reference covered regions
- ./Variations/[Platekey].SV.vcf.gz - Large Structural Variation calls (51 bp–10 kb) and copy number calls (10 kb+) in *.VCF format.
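The naming pattern of the key files can be expressed as a small lookup. This is a sketch under assumptions: the dictionary keys and function names are illustrative labels, not official terms, and the platekey folder path is expected to come from the LabKey table or Participant Explorer. Because not every sample has every file, a second helper keeps only the paths that exist.

```python
from pathlib import Path

# Relative paths of the key files, as listed above. The dictionary keys are
# illustrative labels chosen for this example.
KEY_FILE_TEMPLATES = {
    "bam": "Assembly/{platekey}.bam",
    "bam_index": "Assembly/{platekey}.bam.bai",
    "small_variants_vcf": "Variations/{platekey}.vcf.gz",
    "genome_vcf": "Variations/{platekey}.genome.vcf.gz",
    "sv_vcf": "Variations/{platekey}.SV.vcf.gz",
}


def key_file_paths(platekey_dir: str, platekey: str) -> dict[str, Path]:
    """Build the expected path of each key file inside a platekey folder."""
    base = Path(platekey_dir)
    return {
        label: base / template.format(platekey=platekey)
        for label, template in KEY_FILE_TEMPLATES.items()
    }


def existing_key_files(platekey_dir: str, platekey: str) -> dict[str, Path]:
    """Keep only the key files that are actually present on disk."""
    paths = key_file_paths(platekey_dir, platekey)
    return {label: path for label, path in paths.items() if path.exists()}
```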
Note that the genotyping folders from early genome deliveries contain the results of the sample's run on the Infinium platform, an initial run done to confirm the sample identity and make sure that it is of high quality.
Quality metrics and other data such as coverage information can be found in the Metrics folder for each genome.
File Types Explained
The BAM file contains all pass-filter reads input into the analysis pipeline for a sample, including aligned, duplicate and unaligned reads. It adheres to the SAM format specification wherever possible.
All VCF files are compressed and indexed using tabix; each tabix index appears alongside its VCF as the corresponding *.tbi file.
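If you want to confirm that a VCF has its index before passing it to a tool, a check along the following lines may help. It is a sketch with hypothetical function names, and it assumes the index sits next to the VCF with `.tbi` appended to the full file name.

```python
from pathlib import Path


def tabix_index_path(vcf_gz) -> Path:
    """Return the expected index path: the VCF file name plus '.tbi'."""
    vcf_gz = Path(vcf_gz)
    return vcf_gz.with_name(vcf_gz.name + ".tbi")


def vcfs_missing_index(folder) -> list[Path]:
    """List the *.vcf.gz files in a folder that have no index beside them."""
    return sorted(
        vcf for vcf in Path(folder).glob("*.vcf.gz")
        if not tabix_index_path(vcf).exists()
    )
```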
Human genome sequencing applications require sequencing information for both variant and nonvariant positions, yet there is no common exchange format for such data. gVCF addresses this issue.
gVCF is a set of conventions applied to the standard variant call format. These conventions allow representation of genotype, annotation and additional information across all sites in the genome, in a reasonably compact format (typically about 1/50 the size of the BAM file used for variant calling).
gVCF is equally appropriate for representing and compressing targeted sequencing results. Compression is achieved by joining contiguous nonvariant regions with similar properties into single 'block' VCF records. To maximise the utility of gVCF, especially for high stringency applications, the properties of the compressed block are conservative: block properties such as depth and genotype quality reflect the minimum of any site in the block. The gVCF file is also a valid VCF v4.1 file and can be indexed and used with existing tools such as tabix and IGV.
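To make the block convention concrete, the sketch below works out the genomic span covered by a single VCF data line. It assumes that a block record stores its last position in the `END` key of the INFO column, which is the key the VCF specification reserves for this purpose; the function name and the two records are made up for illustration and are not taken from a real genome.

```python
def record_span(vcf_line: str) -> tuple[str, int, int]:
    """Return (chrom, start, end), 1-based and inclusive, for a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
    # A nonvariant block carries END=<last position> in the INFO column.
    for entry in info.split(";"):
        if entry.startswith("END="):
            return chrom, pos, int(entry[len("END="):])
    # An ordinary record covers only the bases of its REF allele.
    return chrom, pos, pos + len(ref) - 1


# Made-up records: a 50 bp nonvariant block followed by a 1 bp deletion.
block = "chr1\t10001\t.\tT\t.\t.\tPASS\tEND=10050\tGT:GQX:DP\t0/0:30:12"
deletion = "chr1\t10051\t.\tAT\tA\t50\tPASS\t.\tGT\t0/1"
for line in (block, deletion):
    chrom, start, end = record_span(line)
    print(chrom, start, end, end - start + 1)
```

In the made-up block record, the depth and quality values would describe the minimum over all 50 positions, in line with the conservative block properties described above.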