External links on this page can only be accessed from outside the RE
Using the multi-sample VCFs from aggV2, we have generated Principal Components (PCs) for participants in aggV2, calculated pairwise relatedness amongst samples, and estimated probabilities of genetic ancestry for five broad super-populations. On this page we outline our approach and link to the outputs as they are provided in the Genomics England Research Environment.
We estimated broad genetic ancestry using ethnicities from the 1000 Genomes Project phase 3 (1KGP3) as the truth, by generating PCs for 1KGP3 samples and projecting all aggV2 participants onto these. The five broad super-populations are the 1KGP3 AFR, AMR, EAS, EUR and SAS super-populations.
We used the 1KGP3 to infer ancestry as follows:
1. We took all unrelated samples from the 1KGP3
2. We subsetted to just our 188,382 HQ SNPs
3. We further filtered for MAF > 0.05 in 1KGP3 (as well as in our data)
4. We calculated the first 20 PCs using GCTA
5. We projected the aggV2 data onto the 1KGP3 PC loadings
6. We trained a random forest model to predict ancestries based on:
   - the first 8 1KGP3 PCs
   - Ntrees set to 400
   - training and prediction on the 1KGP3 amr, afr, eas, eur and sas super-populations
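The projection step above can be illustrated with a minimal sketch. It assumes one common convention: each genotype is standardised using the reference-panel allele frequency and then multiplied by the per-SNP loadings. The exact scaling applied by GCTA is not described on this page, so the function name `project_onto_loadings`, its inputs, and the toy values are hypothetical and only show the idea.

```python
from math import sqrt

def project_onto_loadings(genotypes, ref_freqs, loadings):
    """Project one sample onto reference PC loadings (illustrative only).

    genotypes: alternate-allele counts per SNP (0, 1, 2, or None if missing)
    ref_freqs: alternate-allele frequency of each SNP in the reference panel
    loadings:  one list per SNP, holding that SNP's loading on each PC
    """
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, p, snp_loadings in zip(genotypes, ref_freqs, loadings):
        if g is None or p <= 0.0 or p >= 1.0:
            continue  # skip missing calls and monomorphic SNPs
        # Standardise with the reference frequency so the sample sits on
        # the same scale as the panel that produced the loadings.
        z = (g - 2.0 * p) / sqrt(2.0 * p * (1.0 - p))
        for k in range(n_pcs):
            scores[k] += z * snp_loadings[k]
    return scores

# Toy data (assumed values, not taken from aggV2 or 1KGP3).
toy_freqs = [0.5, 0.25]
toy_loadings = [[1.0, 0.0], [0.0, 2.0]]
toy_scores = project_onto_loadings([2, 0], toy_freqs, toy_loadings)
```

Because the loadings and frequencies come only from the reference panel, the projected samples do not influence the PC axes themselves.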
Below we show the summary data for the random forest model fit. The OOB error rate and confusion matrix show very high performance in the prediction of 1KGP3 super-populations.
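As a rough illustration of how an error rate and confusion matrix relate to each other, the sketch below builds both from a list of true and predicted labels. For an OOB estimate, each prediction would come only from trees that did not see that sample during training; here the predictions are simply taken as given. The labels are toy data and the helper names are hypothetical, not the model output described above.

```python
from collections import Counter

SUPER_POPS = ["afr", "amr", "eas", "eur", "sas"]

def confusion_matrix(true_labels, predicted_labels, classes=SUPER_POPS):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

def error_rate(matrix):
    """Fraction of samples that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total if total else 0.0

# Toy labels (assumed, for illustration only).
toy_true = ["afr", "afr", "eur", "eas", "sas", "amr"]
toy_pred = ["afr", "afr", "eur", "eas", "sas", "eur"]
toy_matrix = confusion_matrix(toy_true, toy_pred)
toy_error = error_rate(toy_matrix)
```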
The probabilities for each individual are found at:
Additionally, for users interested in more fine-grained population structure, we provide a set of ancestry predictions based on sub-population ancestries from the 1KGP3. The steps are as above and differ only for steps 3 and 6.
3 - MAF filter of >0.01 for 1KGP3 and aggV2 data
6 - We trained a random forest model to predict ancestries based on 1KGP3 sub-populations
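The two MAF thresholds (0.05 for super-populations, 0.01 for sub-populations) can be expressed as one filter with a threshold parameter. This is a minimal sketch, assuming allele frequencies are available as dictionaries keyed by SNP identifier; the function names, SNP identifiers, and frequencies below are made up for illustration.

```python
def minor_allele_frequency(alt_freq):
    """Convert an alternate-allele frequency to a minor-allele frequency."""
    return min(alt_freq, 1.0 - alt_freq)

def snps_passing_maf(ref_freqs, cohort_freqs, threshold):
    """Keep SNPs present in both datasets whose MAF exceeds the threshold in both."""
    shared = ref_freqs.keys() & cohort_freqs.keys()
    return sorted(
        snp for snp in shared
        if minor_allele_frequency(ref_freqs[snp]) > threshold
        and minor_allele_frequency(cohort_freqs[snp]) > threshold
    )

# Hypothetical frequencies; the SNP identifiers are placeholders.
toy_ref = {"snpA": 0.30, "snpB": 0.03, "snpC": 0.98, "snpD": 0.50}
toy_cohort = {"snpA": 0.25, "snpB": 0.04, "snpC": 0.97, "snpE": 0.40}

super_pop_snps = snps_passing_maf(toy_ref, toy_cohort, threshold=0.05)
sub_pop_snps = snps_passing_maf(toy_ref, toy_cohort, threshold=0.01)
```

Lowering the threshold retains rarer variants, which is what lets the second SNP set carry finer-scale structure in this toy case.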
These data are available at:
Ancestry summary stats
The table below summarises the number of individuals (and the percentage of the cohort) assigned to any one ancestry with a probability of >0.8.
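The counting rule behind such a summary can be sketched as follows: an individual is assigned to the ancestry with the highest probability only if that probability exceeds 0.8. The label `"unassigned"` for everyone else, the function names, and the toy probabilities are assumptions made for this example.

```python
from collections import Counter

def assign_ancestry(probabilities, threshold=0.8):
    """Return the top ancestry if its probability exceeds the threshold."""
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob > threshold else "unassigned"

def summarise(cohort_probabilities, threshold=0.8):
    """Map each label to (count, percentage of cohort)."""
    total = len(cohort_probabilities)
    if total == 0:
        return {}
    counts = Counter(assign_ancestry(p, threshold) for p in cohort_probabilities)
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}

# Toy probabilities for four hypothetical individuals.
toy_cohort_probs = [
    {"afr": 0.95, "amr": 0.02, "eas": 0.01, "eur": 0.01, "sas": 0.01},
    {"afr": 0.02, "amr": 0.05, "eas": 0.01, "eur": 0.90, "sas": 0.02},
    {"afr": 0.01, "amr": 0.10, "eas": 0.02, "eur": 0.85, "sas": 0.02},
    {"afr": 0.05, "amr": 0.10, "eas": 0.05, "eur": 0.40, "sas": 0.40},
]
toy_summary = summarise(toy_cohort_probs)
```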
PCs with 1KG samples and projected aggV2 samples, coloured by predicted ancestry
Below we show the first 6 PCs, which were used for the ancestry inference of the aggV2 samples. The plots to the left show all samples (in gray), with the 1KGP3 samples plotted in different colours by super-population. The plots to the right show all samples (in gray), with the aggV2 samples plotted in different colours by predicted super-population (using a threshold of T=0.8). 1KG samples are represented by crosses, and aggV2 samples by solid circles.
The following plot focuses on EUR and EAS sub-populations from 1KGP3. 1KG samples are represented by crosses, and aggV2 samples by solid circles. PCs for all 1KGP3 and aggV2 samples are included, in gray. In addition:
Left: 1KGP3 samples in different colours by super-population
Middle: 1KGP3 samples in different colours by EAS sub-populations, with aggV2 predicted EAS plotted on top
Right: 1KGP3 samples in different colours by NFE and FIN populations, with aggV2 predicted EUR samples plotted in dark blue