Evaluating Reference Databases for Phasing and Imputation in Humans


Data from the 1000 Genomes Project are widely used as a reference for human genomic analyses. However, the accuracy of these data must be assessed in order to understand the quality of predictions made with this reference. I will present an assessment of the genotyping, phasing, and imputation accuracy of the 1000 Genomes Project data. We compare the phased haplotype calls from the 1000 Genomes Project to experimentally phased haplotypes for 28 of the same individuals, sequenced using the 10X Genomics platform. We observe that phasing and imputation are unreliable for rare variants, which likely reflects the limited sample size of the 1000 Genomes Project. Further, using a population-specific reference panel does not appear to improve imputation accuracy over using the entire 1000 Genomes data set as the reference panel. We also note that error rates and trends depend on how error is defined, so any reported error rate must be interpreted in light of the definition used. The quality of the 1000 Genomes data should be taken into account when this database is used in further studies, and this work presents an analysis that can be used for such assessments. Because we find imputation accuracy to be considerably lower when using a non-matching reference panel, I will also describe our current efforts to evaluate a recently generated South Asia-specific whole-genome data set as a reference panel for imputation.

Berkeley Center for Theoretical and Evolutionary Genetics Seminar
Berkeley, CA
Saurabh Belsare

Senior Scientist, Bioinformatics at Pacific Biosciences