Page tree
Skip to end of metadata
Go to start of metadata

Overview

There is a very useful plugin for bcftools called split-vep (available with bcftools version 1.11.2-foss-2018b) which we will make use of here to query and parse the VEP-like functional annotation from CellBase. Please read the split-vep documentation for more information. 

bcftools version

Please use bcftools version 1.11.2 via:  module load bio/BCFtools/1.11.2-GCC-8.3.0


Snippets

Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read. 


One can view all of the available CellBase annotations using the command below - all annotations can be extracted and / or filtered for. Note that, split-vep has been designed to work with VEP and VEP-like annotation. The Cellbase annotation provided in somAgg is VEP-like. This requires the annotation field to be given explicitly to split-vep, which is done by using the flag -a CT.

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT -l somAgg_dr12_chr21_21996481_24481186.vcf.gz

Output: 

...
0	GeneName
1	TranscriptID
2	CDSchange
3	ProteinChange
4	Consequence
5	siftDescription
6	siftScore
7	polyphenDescription
8	polyphenScore
...

Extract variants for a gene of interest 

Question: "I want to extract all variants in the gene IKZF1 and view some basic annotation". 

Script: Use bcftools split-vep. Use the flag -a CT to tell split-vep what the annotation field is calledThis example will output variants annotated against all transcripts for the gene of interest - IKZF1 - using the -i flag. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file. 

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT somAgg_dr12_chr7_48864936_51531027.vcf.gz \
-i 'GeneName="IKZF1"' -d \
-f '%CHROM\t%POS\t%REF\t%ALT\t%GeneName\t%TranscriptID\t%Consequence\t%CDSchange\n' \
> IKZF1_variants.tsv

Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above. 

...
chr7	50299131	G	A	IKZF1	ENST00000484847	upstream_variant		.
chr7	50299139	A	AGT	IKZF1	ENST00000484847	upstream_gene_variant	.
chr7	50299147	C	T	IKZF1	ENST00000484847	upstream_variant		.
chr7	50299148	G	A	IKZF1	ENST00000484847	upstream_variant		.
...

* data have been randomised and subset

Extract variants for a gene of interest with a damaging prediction 

Question: "I want to view the variants that that are missense or worse in the gene IKZF1".

Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - that are missense or worse (-s and -S flag). Use the flag -a CT to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file. 

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT. \
-i 'GeneName="IKZF1"' -d  \
-f '%CHROM\t%POS\t%REF\t%ALT\t%GeneName\t%TranscriptID\t%Consequence\t%ProteinChange\n'  \
-c GeneName,TranscriptID,Consequence,ProteinChange  \
-s worst:missense+  \
-S ../additional_data/vep_severity_scale/VEP_severity_scale_2020.txt  \
somAgg_dr12_chr7_48864936_51531027.vcf.gz    \
> ../IKZF1_worse_missense.tsv

Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above. 

...
chr7	50327647	G	T	IKZF1	ENST00000413698	missense_variant	p.(SER17ILE)
chr7	50327647	GC	G	IKZF1	ENST00000413698	frameshift_variant	p.(SER17SER)
chr7	50327653	C	T	IKZF1	ENST00000413698	missense_variant	p.(PRO19LEU)
chr7	50327661	G	C	IKZF1	ENST00000413698	missense_variant	p.(ASP22HIS)
chr7	50327661	G	A	IKZF1	ENST00000413698	missense_variant	p.(ASP22ASN)
...

* data have been randomised and subset

Please note that when using bcftools 1.11.x specifying severity constraints via the -s flag, as above, is redudant. However, the use may refer to the example to provide their own severity scale if needed. The one above can be found on:

/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt

Extract variants for a gene of interest – Genomics England Interpretation pipeline

Question: "I want to extract per sample variants as presented in the analysis_csv (LabKey table cancer_analysis) in the gene IKZF1".

Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - that are missense or worse (-s and -S flag). Use the flag -a CT to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file. 

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT \
-i 'GeneName="IKZF1" & TranscriptID="ENST00000331340" & FMT/FILTER="PASS" & (Consequence="frameshift_variant" | Consequence="inframe_deletion" | Consequence="inframe_insertion" | Consequence="missense_variant" | Consequence="splice_acceptor_variant" | Consequence="splice_donor_variant" | Consequence="splice_region_variant" | Consequence="start_lost" | Consequence="stop_gained" | Consequence="stop_lost")' -d  \
-f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GeneName\t%TranscriptID\t%Consequence\t%CDSchange\t%ProteinChange\n]'  \
-c GeneName,TranscriptID,Consequence,CDSchange,ProteinChange  \
somAgg_dr12_chr7_48864936_51531027.vcf.gz    \
> IKZF1_interpreted.tsv


Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above. 

...
chr7	50327647	G	T	IKZF1	ENST00000413698	missense_variant	p.(SER17ILE)
chr7	50327647	GC	G	IKZF1	ENST00000413698	frameshift_variant	p.(SER17SER)
chr7	50327653	C	T	IKZF1	ENST00000413698	missense_variant	p.(PRO19LEU)
chr7	50327661	G	C	IKZF1	ENST00000413698	missense_variant	p.(ASP22HIS)
chr7	50327661	G	A	IKZF1	ENST00000413698	missense_variant	p.(ASP22ASN)
...

* data have been randomised and subset


The list of transcripts used by the Cancer Programme Genomics England Interpretation pipeline can be found here:

/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/LIST_OF_CANONICAL_TRANSCRIPTS.v1.11.tsv

Alternatively, the MANE project is also available on our HPC, at:

/public_data_resources/ensembl_cds/MANE/v0.95/