FILTER, INFO, and FORMAT fields in somAgg

A substantial number of queries to somAgg can be made using the FILTER, INFO, and FORMAT tags within the VCFs.

  • The FILTER field has been forced to ".". 
  • The INFO field shows the per-variant list of key-value pairs describing the variation, such as variant filter flags (for example CommonGermlineVariant) or the fraction of the panel containing non-reference noise at the site (PNOISE).
  • The FORMAT field shows an extensible list of fields describing the samples per variant, such as the number of reads supporting each allele (AU:CU:GU:TU) or the sample depth.

One can extract all tags per field using the code below which uses bcftools to view the header of a single chunk then extracts the specific field:

#!/bin/bash

module load bio/BCFtools/1.11-GCC-8.3.0

cd /gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2

# This command will print out all of the FILTER tags in the VCF
# (the pattern is anchored so that only header definition lines match)
bcftools view -h somAgg_dr12_chr1_104205047_106476576.vcf.gz | grep '^##FILTER'

# This command will print out all of the INFO tags in the VCF
bcftools view -h somAgg_dr12_chr1_104205047_106476576.vcf.gz | grep '^##INFO'

# This command will print out all of the FORMAT tags in the VCF
bcftools view -h somAgg_dr12_chr1_104205047_106476576.vcf.gz | grep '^##FORMAT'
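
If only the tag names are needed rather than the full header lines, the header output can be reduced to the ID of each definition. The following is a minimal sketch, assuming the header lines follow the standard VCF layout "##FIELD=<ID=...,...>". The function name list_tag_ids is hypothetical and is not part of bcftools.

```shell
#!/bin/bash

# list_tag_ids FIELD   (hypothetical helper, not part of bcftools)
# Reads VCF header lines on stdin and prints the ID of every
# "##FIELD=<ID=...,...>" definition, one per line.
list_tag_ids() {
    sed -n "s/^##$1=<ID=\([^,>]*\).*/\1/p"
}

# Example usage (assumes the BCFtools module is loaded as above):
# bcftools view -h somAgg_dr12_chr1_104205047_106476576.vcf.gz | list_tag_ids INFO
```

Replacing INFO with FORMAT or FILTER in the usage line lists the tags of the other fields in the same way.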

Identifying which chunk to use

somAgg is split into 1,371 'chunks' across the genome. This is true for both the genotype VCFs and the functional annotation VCFs, and the chromosome, start, and stop in the chunk names are identical across the two data types.

It is often necessary to know which chunk(s) your gene(s), variant(s), or region(s) of interest are located in. The script below helps you do this.

Chunk Names

Chunks are named in the following format: 

Genotype VCFs:

somAgg_dr12_chromosome_start_stop.vcf.gz

- for example - 

somAgg_dr12_chr1_146620016_147701894.vcf.gz
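
Because the coordinates are encoded in the file name, they can be recovered from the name alone. The sketch below assumes the genotype VCF naming format shown above and a chromosome name that contains no underscore. The function name parse_chunk_name is hypothetical.

```shell
#!/bin/bash

# parse_chunk_name FILE   (hypothetical helper)
# Prints "chromosome<TAB>start<TAB>stop" for a name of the form
# somAgg_dr12_<chromosome>_<start>_<stop>.vcf.gz.
# Returns a non-zero status if the name does not match that format.
parse_chunk_name() {
    basename "$1" | awk -F'_' '
        /^somAgg_dr12_[^_]+_[0-9]+_[0-9]+\.vcf\.gz$/ {
            stop = $5
            sub(/\.vcf\.gz$/, "", stop)   # drop the file extension from the stop field
            printf "%s\t%s\t%s\n", $3, $4, stop
            found = 1
        }
        END { exit !found }'
}

# Example usage:
# parse_chunk_name somAgg_dr12_chr1_146620016_147701894.vcf.gz
```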

List of chunk names and somAgg VCF files

The list of chunk names and full file paths to both the genotype and functional annotation VCFs can be found in the following file:

/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/chunk_names/somAgg_chunk_names.bed

Each of the 1,371 chunks is on a separate line and each line contains 7 fields:

Column number   Description                              Example
1               Chromosome                               chr1
2               Chunk start                              1
3               Chunk stop                               506426
4               Chromosome, start, stop (format 1)       chr1_1_506426
5               Chromosome, start, stop (format 2)       chr1:1-506426
6               Full path to genotype annotation VCF     /gel_data_resources/main_programme/aggregated_somatic_strelka/somAg/genomic_data/somAgg_dr12_chr1_1_506426.vcf.gz
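
For a single position, the chunk can be looked up directly from this file without bedtools. The sketch below assumes that the file is tab-delimited, that chunk start and stop are both inclusive, and that column 6 holds the genotype VCF path as in the table above. The function name chunk_for_position is hypothetical.

```shell
#!/bin/bash

# chunk_for_position CHROM POS [CHUNK_FILE]   (hypothetical helper)
# Prints the genotype VCF path (column 6) of every chunk whose
# start..stop interval contains POS on CHROM.
# Assumption: chunk start and stop are both inclusive.
chunk_for_position() {
    awk -F'\t' -v chrom="$1" -v pos="$2" '
        $1 == chrom && pos + 0 >= $2 + 0 && pos + 0 <= $3 + 0 { print $6 }
    ' "${3:-somAgg_chunk_names.bed}"
}

# Example usage, run from the chunk_names directory:
# chunk_for_position chr2 213005363
```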

Create your own regions file 

You must first create a regions file of your gene(s), variant(s), or region(s) of interest. This must be a three- or four-column tab-delimited file of chromosome, start, and stop (with an optional fourth column holding an identifier, e.g. a gene name). The file should have the .bed extension. There is no limit to how many lines you can have in this file.

Sort

Please pre-sort your data by chromosome and then by start position (sort -k1,1 -k2,2n in.bed > in.sorted.bed)

Example: 

chr2	213005363	213151603	IKZF2
chr7	50304716	50405101	IKZF1
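
A regions file can be checked against these requirements before it is intersected. The following is a minimal sketch that tests only the layout described above: three or four tab-delimited columns, integer start and stop, and start not greater than stop. The function name check_regions_file is hypothetical.

```shell
#!/bin/bash

# check_regions_file FILE   (hypothetical helper)
# Reports malformed lines on stderr and returns non-zero if any are found.
check_regions_file() {
    awk -F'\t' '
        NF != 3 && NF != 4 {
            printf "line %d: expected 3 or 4 tab-delimited columns, found %d\n", NR, NF
            bad = 1; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: start and stop must be integers\n", NR
            bad = 1; next
        }
        $2 + 0 > $3 + 0 {
            printf "line %d: start is greater than stop\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1" >&2
}

# Example usage:
# check_regions_file my_regions.bed && echo "regions file looks OK"
```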

Intersect the two files

Now you can intersect the bed file of chunk names with your regions file using bedtools as shown below: 

#!/bin/bash

module load bio/BEDTools/2.27.1-foss-2018b

bedtools intersect -wo -a my_regions.bed -b somAgg_chunk_names.bed | cut -f 1-4,10

With a four-column regions file, this will print out a five-column tab-delimited file with one line per overlap between a region and a chunk, so a region that spans a chunk boundary appears once for each chunk it overlaps. It will have the following format:

Column number   Description                              Example
1               Chromosome                               chr2
2               Region start                             213005363
3               Region stop                              213151603
4               Region identifier                        IKZF2
5               Full path to genotype annotation VCF     /gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/genomic_data/somAgg_dr12_chr2_211052166_213676386.vcf.gz

The full array of columns can also be printed by omitting the cut command. 
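
The five-column output can then drive per-region queries. The sketch below only prints one bcftools command per line so that the commands can be inspected before anything is run. It assumes that each chunk VCF is indexed, and it passes the start and stop through unchanged, which assumes they are already in the 1-based inclusive form that bcftools view -r expects. The function name print_bcftools_commands is hypothetical.

```shell
#!/bin/bash

# print_bcftools_commands   (hypothetical helper)
# Reads the five-column output (chromosome, start, stop, identifier, VCF path)
# on stdin and prints one "bcftools view -r" command per input line.
# Assumption: start and stop are passed through unchanged.
print_bcftools_commands() {
    awk -F'\t' '{ printf "bcftools view -r %s:%s-%s %s\n", $1, $2, $3, $5 }'
}

# Example usage (assumes the BEDTools module is loaded as above):
# bedtools intersect -wo -a my_regions.bed -b somAgg_chunk_names.bed \
#     | cut -f 1-4,10 | print_bcftools_commands
```

Printing the commands rather than executing them keeps the step a dry run. Once the commands look right, the output can be redirected to a file and run as a script.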