External links on this page can only be accessed from outside the RE
The somatic aggregate multi-sample VCF (somAgg) comprises somatic genomic data from 16,341 tumour samples. It is an aggregation of single nucleotide variants and small indels (<=50bp) from single-sample somatic vcf files that have been successfully sequenced and interpreted. The input data can be found in the cancer_analysis table in Labkey, for the Main Programme data release V12 of the 100,000 Genomes Project, under column somatic_small_variants_annotation_vcf.
We have aggregated 16,341 single-sample somatic VCF files from the 100,000 Genomes Project and made them available as a multi-sample VCF dataset (somAgg). somAgg comprises over 573 million annotated single nucleotide variants and small indels (<=50bp) from quality-controlled tumour whole genomes. For a breakdown of variants per chunk see here.
The multi-sample VCF is split into 1,371 roughly equal 'chunks' across the genome for faster processing. Each chunk contains the full set of samples and is in the VCF.gz file format with accompanying tabix index files (.tbi). Chromosomes 1-22, X, Y, and M are included.
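Because each chunk is a standard compressed VCF, it can be read with general-purpose tools. The sketch below is a minimal illustration using only the Python standard library: it scans one chunk linearly and yields the records that fall in a region. The function name `iter_region` and the contig naming (`chr1` rather than `1`) are assumptions for illustration, and the scan does not use the .tbi index, so an index-aware tool will be much faster on real chunks.

```python
import gzip


def iter_region(vcf_gz_path, chrom, start, end):
    """Yield the tab-split fields of records with start <= POS <= end on chrom.

    Hypothetical helper: reads the whole chunk sequentially and ignores the
    tabix index. bgzip-compressed files are valid multi-member gzip streams,
    so the standard gzip module can decompress them.
    """
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header line
            fields = line.rstrip("\n").split("\t")
            pos = int(fields[1])  # POS is the second VCF column, 1-based
            if fields[0] == chrom and start <= pos <= end:
                yield fields
```

Since every chunk contains the full set of samples, the sample columns of any record yielded here cover the whole cohort.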
The usage of GT
In the somatic aggregated files there are only two possible GT values:
All variants are in their bi-allelic forms (instead of potential multi-allelic) and samples that have multi-allelic sites are indicated by the FORMAT tag: SAMPLE_MULTIALLELIC (See Genotype-level Metrics for further details on SAMPLE_MULTIALLELIC).
- Multi-allelic: where a single variant contains three or more observed alleles, counting the reference as one, therefore allowing for two or more variant alleles (heterozygous genotype example: 1/2)
- Bi-allelic: where a variant contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele (heterozygous genotypes are always: 0/1)
Each step of the pipeline to generate somAgg is documented in the sections below:
A code book of popular queries to help you use somAgg is found here: somAgg Code Book
somAgg Manifest & Location
The somAgg dataset comprises:
- A multi-sample VCF file for each chunk containing the genotypes and per variant quality metrics and filter flags
All somAgg (v0.2) outputs can be found in the following folder within the Genomics England Research Environment:
This folder is accessible from the Desktop Environment and from the HPC as shown below:
|Desktop Access||HPC Access|
Overview of Quality Control Flags
Variants in the multi-sample VCF files are flagged against this set of basic site quality metrics. Note that hard variant filtering has not been applied to the dataset (no variants have been removed).
All 16,341 samples included in somAgg have successfully passed our internal sequencing and interpretation pipeline. These samples are listed in the LabKey table cancer_analysis. Some quality control statistics for these samples are provided below.
|Tumour Cross-Contamination||less than 5%|
|Germline Cross-Contamination||less than 3%|
|Median Fragment Size||greater than 279bp|
|Excess of Chimeric Reads||mean of 0.3%|
|Percentage of Somatic Mapped Reads||mean of 93.4%|
|Percentage AT Dropout||mean of 3.1%|
Single sample Genomics England Filters
At the single-sample VCF level (the somAgg input), Genomics England has defined extra FILTERs, which are described here. In the single-sample VCF file, a variant is flagged with PASS only if it has passed all Strelka filters and the filters listed below.
Applied to indels only. It aims to flag calls with too many filtered basecalls. More specifically, a variant is flagged if the average fraction of filtered basecalls within 50 bases of the indel exceeds 0.1, i.e. FDP50/DP50 > 0.1.
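The threshold above can be written as a small predicate. This is a sketch only: the function name is hypothetical, and the treatment of a zero DP50 (returning False because the ratio is undefined) is an assumption rather than documented pipeline behaviour.

```python
def exceeds_filtered_basecall_fraction(fdp50, dp50, threshold=0.1):
    """Return True when FDP50/DP50 is strictly greater than the threshold.

    fdp50: average filtered basecall depth within 50 bases of the indel.
    dp50:  average basecall depth within 50 bases of the indel.
    """
    if dp50 <= 0:
        # Assumption: with no depth the ratio is undefined, so do not flag.
        return False
    return fdp50 / dp50 > threshold
```

For example, FDP50 = 2.0 with DP50 = 10.0 gives a ratio of 0.2, which exceeds 0.1, so the indel would be flagged.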
Applied to SNVs only. It aims to flag variants in a region of mapping/sequencing error. More specifically, a variant is flagged if SomaticFisherPhred is below 50, indicating that the somatic SNV is likely a systematic mapping/sequencing error. Unlike the other filters, this filter is only applied to variants that pass all Strelka filters.
Applied to both SNVs and indels. It aims to flag variants that overlap repetitive regions, since these are prone to error. More specifically, a variant is flagged if it overlaps simple repeats as defined by Tandem Repeats Finder:
Note, however, that a few samples were analysed with previous versions of this cohort, and hence some inconsistency has been carried over to somAgg.
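The rule for this SNV filter, including its restriction to variants that already pass Strelka, can be sketched as follows. The function and parameter names are hypothetical; only the threshold of 50 and the pass-Strelka precondition come from the description above.

```python
def is_flagged_by_fisher_filter(somatic_fisher_phred, passes_strelka_filters,
                                min_phred=50.0):
    """Return True when an SNV would be flagged by the SomaticFisherPhred rule.

    The rule is only evaluated for SNVs that pass all Strelka filters;
    among those, a SomaticFisherPhred below min_phred triggers the flag.
    """
    return passes_strelka_filters and somatic_fisher_phred < min_phred
```

An SNV with SomaticFisherPhred = 30 is flagged only if it has passed every Strelka filter; if it failed any of them, this rule is not evaluated.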
Variant- and genotype- level flags (FILTER)
The FILTER field has not been populated in this version of the aggregate. Hence, all variants have FILTER "." in the respective field of the aggregate VCF. All filter flags of the individual annotated VCF files have been moved to the INFO or FORMAT fields in the aggregate. Variant-level flags have been moved to the INFO field of the aggregate. Genotype-level flags have been kept in the FORMAT field of the aggregate. Note that no variants have been filtered out on the basis of these filters in this version of the aggregate.
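Since FILTER is always "." in the aggregate, any filtering has to be based on the INFO (variant-level) or FORMAT (genotype-level) fields. The sketch below parses a VCF INFO string into a dictionary, following the standard VCF convention that entries are separated by semicolons and that a key without "=" is a presence flag. The tag names in the usage note are placeholders, not real somAgg tags.

```python
def parse_info(info_field):
    """Parse a VCF INFO string into a dict.

    KEY=VALUE entries keep their value as a string; bare KEY entries
    (presence flags) are stored as True. A missing field (".") gives {}.
    """
    if info_field in (".", ""):
        return {}
    parsed = {}
    for entry in info_field.split(";"):
        if not entry:
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed
```

With a made-up INFO string such as `EXAMPLE_FLAG;EXAMPLE_METRIC=0.25`, `parse_info` returns a dictionary in which `EXAMPLE_FLAG` maps to `True` and `EXAMPLE_METRIC` maps to the string `"0.25"`, so a flag can be tested with `"EXAMPLE_FLAG" in parsed`.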
Filter flags are marked in purple on the Variant- and Genotype- level metrics and flags below.
Variant-level Metrics (INFO)
Per variant quality metrics are kept in the INFO field of the multi-sample VCF files. The INFO tags with descriptions are shown in the table below. Note that the source column in the table indicates if the TAG is generated by the variant caller (Strelka), has been added as part of Genomics England sequencing and interpretation pipeline (internal) or as part of post-processing/annotation specifically for the aggregate (BRS).
* The repetitive regions filter was introduced after some samples of the 100,000 Genomes Project had already been sequenced and analysed, so it is not applied consistently throughout the cohort.
Genotype-level Metrics (FORMAT)
Genotype-level metrics are kept in the FORMAT field of the multi-sample VCF files. The FORMAT tags with descriptions are shown in the table below. Note that the source column below indicates if the TAG is generated by default by the variant caller (Strelka), has been added as part of Genomics England sequencing and interpretation pipeline (internal) or as part of post-processing/annotation specifically for the aggregate (BRS). The SNV/indel column indicates whether the respective FORMAT field has been populated for SNVs, indels or both.
** Note that the VAF calculation for SNVs does not take multi-allelic sites into account. This is intended to remove potential noise. As a consequence, the VAFs at a multi-allelic site may sum to more than 1. In the most extreme case, REF is completely replaced by the two (or more) ALT alleles, and each ALT allele has VAF = 1.
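A short worked example may help show why the sum can exceed 1. The sketch assumes that each bi-allelic record's VAF is computed as ALT reads divided by REF plus ALT reads for that record alone, ignoring reads that support any other ALT allele. That formula is an assumption chosen because it reproduces the extreme case described above; it is not taken from the pipeline documentation, and the read counts are invented.

```python
def biallelic_vaf(ref_reads, alt_reads):
    """VAF of one bi-allelic record, ignoring any other ALT allele at the site.

    Assumed formula: alt / (ref + alt).
    """
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads support REF or this ALT allele")
    return alt_reads / total


# Invented site with two ALT alleles and no remaining REF reads.
vaf_alt1 = biallelic_vaf(ref_reads=0, alt_reads=12)
vaf_alt2 = biallelic_vaf(ref_reads=0, alt_reads=8)
extreme_sum = vaf_alt1 + vaf_alt2  # each VAF is 1.0, so the sum is 2.0

# Invented site where REF is still present: the sum is still above 1.
partial_sum = biallelic_vaf(10, 10) + biallelic_vaf(10, 20)
```

When interpreting VAF at sites marked with SAMPLE_MULTIALLELIC, it is therefore safer to treat each record's VAF on its own than to add the values together.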
Help & Support
Please reach out via the Genomics England Service Desk for any issues related to the somAgg aggregation or companion datasets, including "somAgg" in the title / description of your inquiry.