

Warning: External links on this page can only be accessed from outside the RE.

This aggregate dataset contains data from a subset of participants who have since been withdrawn from research. Use of their data in any new analysis is not permitted. It is therefore extremely important to remove these samples from your analyses and to ensure that you use only samples included in the latest data release.

The list of samples for consented participants in the latest data release can be found in the 'aggregate_gvcf_sample_stats' table in LabKey.

For the Main Programme version 13 data release, the list of consented samples is provided in the file main_programme_v13_samples.txt, located in the folder /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/

To filter the aggregate to these samples, all bcftools commands should include the flag -S /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/main_programme_v13_samples.txt
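As a minimal sketch of the rule above, the following Python helper assembles a `bcftools view` command that always carries the `-S` flag with the v13 sample list. The function name and the input and output file names are hypothetical and used only for illustration; the sample list path is the one quoted above, and `-S`, `-r`, `-O` and `-o` are standard `bcftools view` options.

```python
import shlex

# Path quoted from the text above (Main Programme v13 consented samples).
SAMPLES_FILE = (
    "/gel_data_resources/main_programme/aggregation/"
    "aggregate_gVCF_strelka/aggV2/docs/main_programme_v13_samples.txt"
)


def build_view_command(chunk_vcf, out_vcf, region=None):
    """Return a `bcftools view` argument list restricted to consented samples."""
    cmd = ["bcftools", "view", "-S", SAMPLES_FILE]
    if region is not None:
        cmd += ["-r", region]  # optional region, e.g. "chr1:100000-200000"
    # Write bgzip-compressed VCF output to out_vcf.
    cmd += ["-O", "z", "-o", out_vcf, chunk_vcf]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_view_command(
        "aggV2_chunk.vcf.gz", "subset.vcf.gz", region="chr1:100000-200000"
    )
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where bcftools is available. Building the command in one place makes it harder to forget the `-S` flag in an individual call.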

Submit a ticket to the Genomics England Service Desk if you are unsure how to filter the dataset for any other use.

Brief Overview

As part of the Main Programme V10 data release, we make available an aggregate multi-sample VCF (aggV2) comprising 78,195 germline genomes from the 100,000 Genomes Project on GRCh38. We also provide functional annotation files for all variants, variant and sample quality control (QC) metrics, inferred sample relatedness information, Principal Components, and inferred ancestry information for all samples in aggV2.

Description

We have aggregated 78,195 germline gVCFs (genomic VCFs) from the 100,000 Genomes Project and made them available as a multi-sample VCF dataset (aggV2). aggV2 comprises over 722 million annotated single nucleotide variants and small indels (<=50bp) from quality-controlled rare disease and cancer germline whole genomes. Note that the aggV2 pipeline differs significantly from the aggV1 pipeline, which was based on the Main Programme V5.1 Data Release (59,464 samples).

Use aggV2

Unless an analysis using aggV1 is ongoing, aggV2 should be used in its place by all researchers.

All samples in the dataset were sequenced with 150bp paired-end reads in a single lane of an Illumina HiSeq X instrument and uniformly processed on the Illumina North Star Version 4 Whole Genome Sequencing Workflow (NSV4, version 2.6.53.23), which comprises the iSAAC Aligner (version 03.16.02.19) and Starling Small Variant Caller (version 2.4.7). Samples were aligned to the Homo sapiens NCBI GRCh38 assembly with decoys.

The dataset was constructed by aggregating single-sample gVCFs with the Illumina software gVCF genotyper (version: 2019.02.26). Variant normalisation and decomposition were performed with vt (version 0.57721). Genomic annotation and calculation of allele statistics (count, frequency etc.) were performed using Ensembl VEP and bcftools respectively.

The multi-sample VCF is split into 1,371 roughly equal 'chunks' across the genome for faster processing. Each chunk contains the full set of samples and is in the VCF.gz file format with accompanying tabix index files (.tbi). Chromosomes 1-22, X, Y, and M are included. 
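Because every chunk is expected to have a tabix index beside it, a quick check before launching a large job can be useful. The sketch below assumes, purely for illustration, that the chunk files sit directly in one folder and that each index is named by appending `.tbi` to the chunk file name; the function name is hypothetical.

```python
from pathlib import Path


def chunks_missing_index(folder):
    """Return sorted names of *.vcf.gz files in `folder` with no .tbi index beside them."""
    missing = []
    for vcf in sorted(Path(folder).glob("*.vcf.gz")):
        # Assumed naming: <chunk>.vcf.gz is indexed by <chunk>.vcf.gz.tbi
        index = vcf.with_name(vcf.name + ".tbi")
        if not index.exists():
            missing.append(vcf.name)
    return missing
```

An empty list means every chunk found in the folder has an index next to it.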

Extended Details

Each step of the pipeline to generate aggV2 is documented in the sections below: 

FAQ & Code Book

A FAQ regarding all aggV2 queries can be found here: aggV2 FAQ

A code book of popular queries to help you use aggV2 is found here:  aggV2 Code Book

aggV2 Manifest & Location

Manifest

The aggV2 dataset comprises four main parts: 

  1. A multi-sample VCF file for each chunk containing the genotypes and per variant quality metrics and filter flags
  2. A corresponding VCF file for each chunk containing the functional (genomic) annotation and allele statistics for all variants
  3. The aggregate_gvcf_sample_stats table in the Main Programme V10 LabKey folder which contains all sample quality metrics and accompanying meta-data
  4. Associated files, based on aggV2, which may be useful for downstream analyses, such as Principal Components across all included samples, information on sample relatedness, and assignment of predicted super-population to each sample. Some of this information is also provided in LabKey, in the aggregate_gvcf_sample_stats table mentioned above.

Location

All aggV2 outputs can be found in the following folder within the Genomics England Research Environment: 

/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/

This folder is accessible from both the Desktop Environment and the HPC.

Overview of Quality Control Flags

Variants in the multi-sample VCF files and annotation files are flagged against this set of basic site quality metrics. Note that hard variant filtering has not been applied to the dataset (no variants have been removed). 

Sample QC

All 78,195 samples included in aggV2 pass the following quality control filters:

| Sample Attribute | Description |
|---|---|
| Sample Contamination (freemix) | less than 0.03 |
| Ratio of SNV Het to Hom calls | less than 3 |
| Total Number of SNVs | between 3.2M - 4.7M |
| Array Concordance | greater than 90% |
| Median Fragment Size | greater than 250bp |
| Excess of Chimeric Reads | less than 5% |
| Percentage of Mapped Reads | greater than 60% |
| Percentage AT Dropout | less than 10% |
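
The thresholds in the table can be expressed as a small check, for example when comparing your own samples against the same criteria. This is a sketch only: the dictionary keys are hypothetical names, not the column names used in LabKey, and the SNV count range is assumed to be inclusive at both ends.

```python
def sample_qc_failures(sample):
    """Return the names of the sample QC criteria that `sample` does not meet."""
    checks = {
        "freemix": sample["freemix"] < 0.03,
        "het_hom_ratio": sample["het_hom_ratio"] < 3,
        # Assumption: the 3.2M to 4.7M range is inclusive.
        "snv_count": 3_200_000 <= sample["snv_count"] <= 4_700_000,
        "array_concordance_pct": sample["array_concordance_pct"] > 90,
        "median_fragment_size_bp": sample["median_fragment_size_bp"] > 250,
        "chimeric_reads_pct": sample["chimeric_reads_pct"] < 5,
        "mapped_reads_pct": sample["mapped_reads_pct"] > 60,
        "at_dropout_pct": sample["at_dropout_pct"] < 10,
    }
    return [name for name, passed in checks.items() if not passed]
```

An empty list means the sample meets every criterion in the table.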

Site Flags (FILTER)

The flags are presented within the FILTER column of the multi-sample VCF files and the annotation files as follows:

| FILTER TAG | Description |
|---|---|
| PASS | All filters passed |
| missingness | Missingness (fully missing genotypes with DP=0) ≤ 5% |
| depth | Median Depth ≥ 10 |
| GQ | Median GQ ≥ 15 |
| ABratio | Percentage of het calls not showing significant allele imbalance for reads supporting the ref and alt alleles ≥ 25% |
| completeGTRatio | Percentage of complete sites (sites with no missing data) ≥ 50% |
| phwe_eur | p-value for deviations from HWE in unrelated samples of inferred European ancestry ≥ 1e-5 |
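
Since no hard filtering has been applied, downstream code usually needs to read the FILTER column itself. The sketch below relies only on the VCF convention that FILTER is the seventh tab-separated column and holds either `PASS` or a semicolon-separated list of the filters a site did not pass. The function names and the example lines in the usage note are hypothetical.

```python
def filter_tags(vcf_line):
    """Return the set of FILTER tags on one VCF data line (empty set for PASS)."""
    fields = vcf_line.rstrip("\n").split("\t")
    value = fields[6]  # FILTER is the 7th fixed VCF column
    # "." means no filters were applied; it is treated here as carrying no tags.
    if value in ("PASS", "."):
        return set()
    return set(value.split(";"))


def is_pass(vcf_line):
    """True only when the FILTER column is exactly PASS."""
    return vcf_line.rstrip("\n").split("\t")[6] == "PASS"
```

For example, a line whose FILTER column reads `depth;GQ` yields `{"depth", "GQ"}` from `filter_tags`, while `is_pass` returns `False` for it.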

Site Metrics (INFO)

Per variant quality metrics were calculated and populated in the INFO field of the multi-sample VCF files and the annotation files. The INFO tags with descriptions are as follows: 

| INFO TAG | Description |
|---|---|
| medianDepthAll | Median depth (taken from the DP FORMAT field) from all samples |
| medianDepthNonMiss | Median depth (taken from the DP FORMAT field) from samples with complete genotypes only |
| medianGQ | Median genotype quality (taken from the GQ FORMAT field) from samples with complete genotypes only |
| missingness | Percent of fully missing genotypes where GT = './.' and DP = 0 |
| completeSites | The ratio of complete genotypes by the total number of samples |
| AB_Ratio | The number of heterozygous genotypes showing imbalance (p<0.01) divided by the total number of heterozygous genotypes. |
| MendelSite | Number of Mendelian errors at this site from confirmed trios |
| phwe_afr | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred African ancestry |
| phwe_amr | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred American ancestry |
| phwe_eas | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred East Asian ancestry |
| phwe_eur | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred European ancestry |
| phwe_sas | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred South Asian ancestry |
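
To apply custom thresholds to these metrics, the INFO column has to be parsed first. The following sketch uses only the standard VCF layout of INFO (semicolon-separated `key=value` pairs, with bare keys as flags) and the tag names from the table. The function names are hypothetical, and values that cannot be read as numbers are skipped rather than guessed.

```python
SITE_TAGS = (
    "medianDepthAll", "medianDepthNonMiss", "medianGQ", "missingness",
    "completeSites", "AB_Ratio", "MendelSite",
    "phwe_afr", "phwe_amr", "phwe_eas", "phwe_eur", "phwe_sas",
)


def parse_info(info):
    """Parse a VCF INFO string into a dict; flags without '=' map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def site_metrics(info, tags=SITE_TAGS):
    """Return the numeric site metrics present in an INFO string."""
    parsed = parse_info(info)
    metrics = {}
    for tag in tags:
        value = parsed.get(tag)
        if not isinstance(value, str):
            continue  # tag absent, or present only as a flag
        try:
            metrics[tag] = float(value)
        except ValueError:
            continue  # e.g. "." for a missing value
    return metrics
```

A caller could then test, for instance, `site_metrics(info).get("medianGQ", 0) >= 30` to apply a stricter genotype quality threshold than the one behind the GQ FILTER flag.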

Help & Support

Help with aggV2

Please reach out via the Genomics England Service Desk for any issues related to the aggV2 aggregation or companion datasets, including "aggV2" in the title / description of your inquiry.