External links on this page can only be accessed from outside the Research Environment (RE).
This aggregate dataset contains information on a subset of participants who have since been withdrawn from research. The use of their data in any new analyses is not permitted. It is therefore extremely important to remove these samples from your analyses and ensure that you are only using samples included in the latest data release.
The list of samples for the consented participants can be found in the aggregate_gvcf_sample_stats table in LabKey for the latest data release.
For the main programme version 13 data release, the list of consented samples is detailed in the file main_programme_v13_samples.txt, located in the folder /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/
To filter the aggregate to these samples, all bcftools commands should include the flag -S /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/main_programme_v13_samples.txt
Submit a ticket to the Genomics England Service desk if you are unsure of how to filter the dataset for any other use.
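As a minimal sketch of the sample filtering described above, the following Python helper builds a bcftools view command that always carries the -S flag with the version 13 sample list. The helper name and the chunk and output file names are hypothetical and used for illustration only; only the sample list path comes from this page.

```python
import shlex

# Path to the consented-sample list quoted in the text above.
SAMPLES_FILE = (
    "/gel_data_resources/main_programme/aggregation/"
    "aggregate_gVCF_strelka/aggV2/docs/main_programme_v13_samples.txt"
)


def bcftools_view_command(chunk_vcf, output_vcf, samples_file=SAMPLES_FILE):
    """Build a `bcftools view` argument list restricted to consented samples."""
    return [
        "bcftools", "view",
        "-S", samples_file,  # keep only the samples listed in this file
        "-Oz",               # write compressed VCF output
        "-o", output_vcf,
        chunk_vcf,
    ]


# Hypothetical input and output names, for illustration only.
cmd = bcftools_view_command(
    "example_chunk.vcf.gz", "example_chunk.consented.vcf.gz"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where bcftools is available, or printed and pasted into a shell as shown.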
Brief Overview
As part of the Main Programme V10 data release, we make available an aggregate multi-sample VCF (aggV2) comprising 78,195 germline genomes from the 100,000 Genomes Project on GRCh38. We also provide functional annotation files for all variants, variant and sample quality control (QC) metrics, inferred sample relatedness information, Principal Components, and inferred ancestry information for all samples in aggV2.
Description
We have aggregated 78,195 germline gVCFs (genomic VCFs) from the 100,000 Genomes Project and made them available as a multi-sample VCF dataset (aggV2). aggV2 comprises over 722 million annotated single nucleotide variants and small indels (≤50 bp) from quality-controlled rare disease and cancer germline whole genomes. Note that the pipeline has changed significantly from aggV1, which was based on the Main Programme V5.1 Data Release (59,464 samples).
Use aggV2
Unless an analysis using aggV1 is ongoing, aggV2 should be used in its place by all researchers.
All samples in the dataset were sequenced with 150bp paired-end reads in a single lane of an Illumina HiSeq X instrument and uniformly processed on the Illumina North Star Version 4 Whole Genome Sequencing Workflow (NSV4, version 2.6.53.23), which comprises the iSAAC Aligner (version 03.16.02.19) and Starling Small Variant Caller (version 2.4.7). Samples were aligned to the Homo sapiens NCBI GRCh38 assembly with decoys.
The dataset was constructed by aggregating single-sample gVCFs with the Illumina software gVCF genotyper (version: 2019.02.26). Variant normalisation and decomposition were performed with vt (version 0.57721). Genomic annotation and calculation of allele statistics (count, frequency, etc.) were performed using Ensembl VEP and bcftools respectively.
The multi-sample VCF is split into 1,371 roughly equal 'chunks' across the genome for faster processing. Each chunk contains the full set of samples and is in the VCF.gz file format with accompanying tabix index files (.tbi). Chromosomes 1-22, X, Y, and M are included.
Extended Details
Each step of the pipeline to generate aggV2 is documented in the sections below:
- Sample QC
- gVCF Aggregation
- Variant Normalisation
- Variant Representation
- Site QC, FILTER and INFO Fields
- Functional Annotation
FAQ & Code Book
A FAQ regarding all aggV2 queries can be found here: aggV2 FAQ
A code book of popular queries to help you use aggV2 is found here: aggV2 Code Book
aggV2 Manifest & Location
Manifest
The aggV2 dataset comprises four main parts:
- A multi-sample VCF file for each chunk containing the genotypes and per variant quality metrics and filter flags
- A corresponding VCF file for each chunk containing the functional (genomic) annotation and allele statistics for all variants
- The aggregate_gvcf_sample_stats table in the Main Programme V10 LabKey folder which contains all sample quality metrics and accompanying meta-data
- Associated files, based on aggV2, which may be useful for downstream analyses, such as Principal Components across all included samples, information on sample relatedness, and assignment of predicted super-population to each sample. Some of this information is also provided in LabKey, in the aggregate_gvcf_sample_stats table mentioned above.
Location
All aggV2 outputs can be found in the following folder within the Genomics England Research Environment:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/
This folder is accessible from the Desktop Environment and from the HPC as shown below:
Desktop Access | HPC Access |
---|---|
Overview of Quality Control Flags
Variants in the multi-sample VCF files and annotation files are flagged against the basic site quality metrics described in the sections below. Note that hard variant filtering has not been applied to the dataset (no variants have been removed).
Sample QC
All 78,195 samples included in aggV2 pass the following quality control filters:
Sample Attribute | Description |
---|---|
Sample Contamination (freemix) | less than 0.03 |
Ratio of SNV Het to Hom calls | less than 3 |
Total Number of SNVs | between 3.2M and 4.7M |
Array Concordance | greater than 90% |
Median Fragment Size | greater than 250bp |
Excess of Chimeric Reads | less than 5% |
Percentage of Mapped Reads | greater than 60% |
Percentage AT Dropout | less than 10% |
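For readers who want to apply the same thresholds to their own sample metrics, the table above can be sketched as a single check. The dictionary keys below are hypothetical names chosen for illustration and are not LabKey column names; treating the SNV count range as inclusive is also an assumption, as is the example record.

```python
def passes_sample_qc(sample):
    """Return True if a sample meets every threshold in the Sample QC table.

    `sample` is a dict of metrics; the key names are hypothetical.
    """
    return (
        sample["freemix"] < 0.03                    # sample contamination
        and sample["het_hom_ratio"] < 3             # SNV het to hom ratio
        and 3.2e6 <= sample["n_snvs"] <= 4.7e6      # bounds assumed inclusive
        and sample["array_concordance_pct"] > 90
        and sample["median_fragment_size_bp"] > 250
        and sample["chimeric_reads_pct"] < 5
        and sample["mapped_reads_pct"] > 60
        and sample["at_dropout_pct"] < 10
    )


# Made-up metrics for illustration; not taken from any real sample.
example_sample = {
    "freemix": 0.004,
    "het_hom_ratio": 1.6,
    "n_snvs": 3_900_000,
    "array_concordance_pct": 99.2,
    "median_fragment_size_bp": 410,
    "chimeric_reads_pct": 0.8,
    "mapped_reads_pct": 97.5,
    "at_dropout_pct": 3.1,
}
print(passes_sample_qc(example_sample))
```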
Site Flags (FILTER)
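The flags appear in the FILTER column of the multi-sample VCF files and the annotation files. Each description below gives the threshold a site must meet; a site that fails a threshold carries the corresponding tag: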
FILTER TAG | Description |
---|---|
PASS | All filters passed |
missingness | Missingness (fully missing genotypes with DP=0) ≤ 5% |
depth | Median Depth ≥ 10 |
GQ | Median GQ ≥ 15 |
ABratio | Percentage of het calls not showing significant allele imbalance for reads supporting the ref and alt alleles ≥ 25% |
completeGTRatio | Percentage of complete sites (sites with no missing data) ≥ 50% |
phwe_eur | p-value for deviations from HWE in unrelated samples of inferred European ancestry ≥ 1e-5 |
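Because no variants have been removed, most analyses need to select on the FILTER column themselves. The sketch below reads the FILTER column (the seventh tab-separated column) of a VCF data line and tests whether a site passed every filter. It assumes that multiple tags are separated by semicolons, as in the VCF specification; the two example lines are made up for illustration.

```python
def filter_tags(vcf_line):
    """Return the set of FILTER tags on one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    return set(fields[6].split(";"))  # FILTER is the 7th column


def is_pass(vcf_line):
    """True only when the site carries the single tag PASS."""
    return filter_tags(vcf_line) == {"PASS"}


# Made-up data lines (first eight VCF columns), for illustration only.
pass_line = "\t".join(["chr1", "12345", ".", "A", "G", ".", "PASS", "medianGQ=48"])
fail_line = "\t".join(["chr1", "12400", ".", "C", "T", ".", "depth;GQ", "medianGQ=9"])

print(is_pass(pass_line), sorted(filter_tags(fail_line)))
```

When working with bcftools directly, bcftools view -f PASS makes the equivalent selection.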
Site Metrics (INFO)
Per-variant quality metrics were calculated and written to the INFO field of the multi-sample VCF files and the annotation files. The INFO tags with descriptions are as follows:
INFO TAG | Description |
---|---|
medianDepthAll | Median depth (taken from the DP FORMAT field) from all samples |
medianDepthNonMiss | Median depth (taken from the DP FORMAT field) from samples with complete genotypes only |
medianGQ | Median genotype quality (taken from the GQ FORMAT field) from samples with complete genotypes only |
missingness | Percent of fully missing genotypes where GT = './.' and DP = 0 |
completeSites | The ratio of complete genotypes to the total number of samples |
AB_Ratio | The number of heterozygous genotypes showing imbalance (p<0.01) divided by the total number of heterozygous genotypes. |
MendelSite | Number of Mendelian errors at this site from confirmed trios |
phwe_afr | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred African ancestry |
phwe_amr | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred American ancestry |
phwe_eas | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred East Asian ancestry |
phwe_eur | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred European ancestry |
phwe_sas | Hardy-Weinberg equilibrium mid p-value in unrelated samples of inferred South Asian ancestry |
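To apply custom thresholds to these metrics, the INFO column first has to be split into key and value pairs. The following is a minimal sketch of such a parser, assuming the standard VCF layout of semicolon-separated key=value entries. The example INFO string and its values are made up, and values are kept as strings so that the caller decides how to convert them.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; keys without a value map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue  # skip empty entries and the VCF missing-value marker
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# Made-up INFO string using tag names from the table above.
example_info = "medianDepthAll=31;medianGQ=48;missingness=0.002;phwe_eur=0.41"
metrics = parse_info(example_info)

# Example of a custom, stricter threshold than the FILTER flag (illustrative).
print(float(metrics["medianGQ"]) >= 30)
```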
Help & Support
Help with aggV2