Overview

This page summarises information about the 16,341 samples present in somAgg. All samples present in cancer_analysis in the Main Programme Release v12 have been included. These are all the somatic samples that have been successfully sequenced and interpreted.

Single sample sequencing and variant calling

SomAgg combines the somatic vcf files generated by Strelka and annotated with CellBase. Each tumour sample has a matched germline sample; both were deep whole-genome sequenced, with an average coverage of 100x and 30x, respectively. A small number of patients had more than one tumour sample sequenced.

Samples were prepared using an Illumina TruSeq DNA Nano, TruSeq DNA PCR-Free or FFPE library preparation kit and then sequenced on a HiSeq X, generating 150 bp paired-end reads. Illumina’s North Star pipeline (version 2.6.53.23) was used for primary WGS analysis. Read alignment against the human reference genome GRCh38-Decoy+EBV was performed with iSAAC Aligner (version iSAAC-03.16.02.19). Small variant calling together with tumour-normal subtraction was performed using Strelka2 (version 2.4.7).

Strelka FILTERs flag the following germline variant calls as NOT PASS; they are nonetheless included in the single vcf files and somAgg:

  • All calls with a sample depth three times higher than the chromosomal mean
  • Site genotype conflicts with proximal indel call. This is typically a heterozygous SNV call made inside of a heterozygous deletion
  • Locus read evidence displays unbalanced phasing patterns
  • Genotype call from variant caller not consistent with chromosome ploidy
  • The fraction of basecalls filtered out at a site > 0.4
  • Locus quality score < 14 for het or hom SNP
  • Locus quality score < 6 for het, hom or het-alt indels
  • Locus quality score < 30 for other small variant types or quality score is not calculated

Strelka FILTERs flag the following somatic variant calls as NOT PASS; they are nonetheless included in the single vcf files and somAgg:

  • All calls with a normal sample depth three times higher than the chromosomal mean
  • All calls where the site in the normal sample is not a homozygous reference
  • Somatic SNV calls with empirically fitted VQSR score < 2.75 (recalibrated quality score expressing the phred scaled probability of the somatic call being a false positive observation)
  • Somatic indels where fraction of basecalls filtered out in a window extending 50 bases to either side of the indel call position is > 0.3
  • Somatic indels with quality score < 30 (joint probability of the somatic variant and a homozygous reference normal genotype)
  • All calls that overlap LINE repeat region

Variants are not removed on the basis of low read count/frequency in the current version of the analysis pipeline.
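
Because calls that fail these filters remain in the files, an analysis that needs only PASS calls has to select on the FILTER column itself. The sketch below is one minimal way to do that with awk on an uncompressed, tab-delimited single-sample vcf. The function names keep_pass and count_by_filter are illustrative and are not part of the somAgg pipeline.

```shell
# Minimal sketch: both helpers read a VCF on stdin.
# Assumption: standard VCF layout, i.e. FILTER is the 7th tab-separated column.

# Print header lines plus the records whose FILTER is exactly PASS.
keep_pass() {
    awk -F'\t' '/^#/ || $7 == "PASS"'
}

# Tabulate how many records carry each FILTER value (sorted by value).
count_by_filter() {
    awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' | sort
}

# Example usage (file names are hypothetical):
#   keep_pass < tumour.somatic.vcf > tumour.somatic.pass.vcf
#   zcat tumour.somatic.vcf.gz | count_by_filter
```

Running count_by_filter first shows which FILTER values occur and how many calls each one affects before anything is discarded.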

Single sample decomposition

The annotated small variant vcf files used as input have been decomposed. They are generated from the somatic small variant vcf files (somatic_small_variants_vcf_path in cancer_analysis) produced by the variant calling pipeline, which comprises Strelka2 for calling and vt for decomposition. In other words, the somatic variant vcf files contain no multi-allelic entries, because each multi-allelic site is represented by two or more bi-allelic variants. vt performs the decomposition in three steps:

  1. Decompose variants of the same length.

         vt decompose_blocksub -p {vcf_input} -o {vcf_output}

  2. Split records with multiple alternate alleles into multiple bi-allelic records; e.g. a 1/2 genotype will be split into 1/. and ./1. The -s (“smart”) option retains INFO and FORMAT fields of type A and R and decomposes them appropriately.

         vt decompose -s {vcf_input} -o {vcf_output}

  3. Left-align indels and trim redundant bases. The “non-ambiguous” reference genome is used; this file contains only A, T, G, C and N characters.

         vt normalize -n -w {window_size} -r {reference} {vcf_input} -o {vcf_output}
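
As an illustration of step 2, the genotype split described there can be mimicked for a single genotype string. This is a toy sketch, not vt itself: split_gt is a hypothetical helper, and its rule (reference alleles stay 0, the allele being extracted becomes 1, every other alternate allele becomes missing) is an assumption based on the 1/2 example given in step 2.

```shell
# Toy sketch of the per-allele genotype split (assumed rule, see lead-in).
# Usage: split_gt GENOTYPE NUMBER_OF_ALT_ALLELES
# Prints one genotype per ALT allele, in ALT order.
split_gt() {
    awk -v gt="$1" -v n="$2" 'BEGIN {
        # Keep the original separator: "|" for phased, "/" for unphased.
        sep = (index(gt, "|") > 0) ? "|" : "/"
        k = split(gt, allele, "[/|]")
        for (i = 1; i <= n; i++) {
            out = ""
            for (j = 1; j <= k; j++) {
                if (allele[j] == "0")         x = "0"   # reference allele is kept
                else if (allele[j] == (i "")) x = "1"   # the ALT being extracted
                else                          x = "."   # any other ALT or missing
                out = out (j > 1 ? sep : "") x
            }
            print out
        }
    }'
}

split_gt "1/2" 2   # one line per bi-allelic record
```

For the 1/2 genotype with two alternate alleles this prints 1/. followed by ./1, matching the example in step 2.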

Genotype-level metrics

All 16,341 samples included in somAgg have successfully passed our internal sequencing and interpretation pipeline. These samples are listed in the LabKey table cancer_analysis. Some quality control statistics for these samples are provided below.

| Sample Attribute | Description |
|---|---|
| Tumour Cross-Contamination | less than 5% |
| Germline Cross-Contamination | less than 3% |
| Median Fragment Size | greater than 279bp |
| Excess of Chimeric Reads | mean of 0.3% |
| Percentage of Mapped Reads | mean of 93.4% |
| Percentage AT Dropout | mean of 3.1% |

Sample source & Library preparation

The vast majority of the samples (about 89%) were collected by surgical resection.

| tissue_source | number_of_samples | percent_of_samples |
|---|---|---|
| SURGICAL RESECTION | 14602 | 89.36 |
| NOT SPECIFIED | 521 | 3.19 |
| USS GUIDED BIOPSY | 490 | 3.00 |
| ENDOSCOPIC BIOPSY | 227 | 1.39 |
| NON GUIDED BIOPSY | 136 | 0.83 |
| BMA TUMOUR SORTED CELLS | 133 | 0.81 |
| NON STANDARD BIOPSY | 85 | 0.52 |
| CT GUIDED BIOPSY | 69 | 0.42 |
| STEREOTACTICALLY GUIDED BIOPSY | 49 | 0.30 |
| ENDOSCOPIC ULTRASOUND GUIDED FNA | 12 | 0.07 |
| LAPAROSCOPIC EXCISION | 8 | 0.05 |
| MRI GUIDED BIOPSY | 6 | 0.04 |
| ENDOSCOPIC ULTRASOUND GUIDED BIOPSY | 3 | 0.02 |

In addition, most somAgg samples are fresh-frozen (FF, ~92%) and most were prepared with a PCR-free library (~88%).

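| library_type | preparation_method | number_of_samples | percent_of_samples |
|---|---|---|---|
| PCR-Free | FF | 13711 | 83.91 |
| PCR | FF | 1285 | 7.86 |
| PCR | FFPE | 602 | 3.68 |
| PCR-Free | EDTA | 494 | 3.02 |
| PCR-Free | ASPIRATE | 124 | 0.76 |
| PCR-Free | CD128 SORTED CELLS | 70 | 0.43 |
| PCR | CD128 SORTED CELLS | 25 | 0.15 |
| PCR | EDTA | 19 | 0.12 |
| PCR | ASPIRATE | 8 | 0.05 |
| PCR-Free | FFPE | 3 | 0.02 |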
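
The fresh-frozen and PCR-free percentages quoted before this table can be recomputed from its counts. The snippet below is a small sketch of that arithmetic; the pipe-delimited layout and the helper names library_table and share are choices made for this example only.

```shell
# Counts copied from the library table: library_type|preparation_method|number_of_samples
library_table() {
    cat <<'EOF'
PCR-Free|FF|13711
PCR|FF|1285
PCR|FFPE|602
PCR-Free|EDTA|494
PCR-Free|ASPIRATE|124
PCR-Free|CD128 SORTED CELLS|70
PCR|CD128 SORTED CELLS|25
PCR|EDTA|19
PCR|ASPIRATE|8
PCR-Free|FFPE|3
EOF
}

# Usage: share COLUMN_NUMBER VALUE
# Prints the percentage of all samples whose given column equals VALUE.
share() {
    library_table | awk -F'|' -v col="$1" -v val="$2" '
        { total += $3; if ($col == val) hit += $3 }
        END { printf "%.2f\n", 100 * hit / total }'
}

share 2 FF         # fresh-frozen share of all samples
share 1 PCR-Free   # PCR-free share of all samples
```

The counts sum to 16,341, and the two calls print 91.77 and 88.13, which round to the ~92% and ~88% quoted in the text.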

As expected, we see an increased AT drop-out for FFPE samples, but overall the vast majority of samples have a good mapping rate.

Tumour purity and coverage