Imputing missing genotypes: effects of methods and patterns of missing data

POSTER PRESENTATION (Open Access)

Funda Ogut 1, Fikret Isik 2, Steven McKeand 2, Ross Whetten 2*

From IUFRO Tree Biotechnology Conference 2011: From Genomes to Integration and Delivery. Arraial d'Ajuda, Bahia, Brazil. 26 June - 2 July 2011

Costs of high-throughput genotyping have decreased to the point where it appears economically feasible to use molecular genetic marker information in applied breeding programs. Some practical questions remain about how best to deal with missing data in the resulting genotype datasets, so as to minimize the impact of the missing data on the accuracy of breeding value prediction. Data can be missing for two reasons: first, genotyping assay failure is likely for at least some loci in some samples; second, it may prove economically desirable to invest more resources in high-density genotyping of a few individuals and fewer resources in lower-density genotyping of many individuals [1]. The proportion of missing genotypes may range from less than one percent due to genotyping assay failure to over 80% if a selective genotyping strategy is used. Many methods for predicting the genetic merit of trees from marker genotype data require complete genotype information for mathematical reasons. It is therefore important to use efficient statistical methods to impute missing genotypes accurately. In species with complete reference genome sequences available, the map order of markers and linkage disequilibrium (LD) information can be used to guide imputation of missing genotypes. Completely sequenced reference genomes are available for only two forest tree species, so these methods are not suitable for most forest trees.

Gengler et al. [2] described a method to impute missing genotypes using mixed linear models and BLUP.
We determined the effect of imputation on the accuracy of BLUP-estimated breeding values at different levels (10%, 20%, 40%, 60% and 80%) of missing genotypes. Analyses were conducted with both empirical data (3461 SNP markers in a cloned loblolly pine population of 165 genotypes) and simulated data, using missing data created by random sampling (some loci missing in all individuals) or by structured sampling (all loci missing in some individuals). Simulations were used to examine the effect of family and progeny size, mating design, proportion of missing genotypes, genotyping strategy and imputation method on the accuracy of breeding values. Imputed genotypes were obtained using the numerator relationship matrix (the A matrix) and solving the mixed model equations for y = Xb + Mu + e, where y is the vector of gene content values, X is the design matrix (a vector of 1s) for the mean b, M is the design matrix connecting trees to the gene content vector y, u is the vector of individual tree effects and e is the vector of residual errors. The solutions of the mixed model equations produce predicted SNP genotypes for trees with missing genotypes. The solutions are continuous values, centered on 1 because the gene content values are 0, 1 or 2.

Imputation of missing genotypes in empirical data from an unbalanced mating design, with family sizes ranging from 1 to 35, was more effective for structured missing genotypes than for random missing genotypes at every level of missing data. As the proportion of missing genotypes increased from 10% to 80%, the accuracy of imputation fell from 0.96 to 0.23 for structured missing genotypes and from 0.96 to 0.16 for random missing genotypes. With simulation, we found that imputation was less affected by the distribution of missing genotypes in a balanced mating design with families of equal size.
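
The gene content model y = Xb + Mu + e described above can be illustrated for a single SNP with a minimal sketch. This is not the authors' code. It assumes var(u) is proportional to A and var(e) to the identity, uses a hypothetical variance ratio `lam` that the abstract does not report, and solves the BLUP in its equivalent generalized least squares form rather than by building the mixed model equations explicitly. The function name `impute_gene_content` and the four-tree pedigree are made up for illustration.

```python
import numpy as np


def impute_gene_content(y, A, lam=1.0):
    """Predict gene content at one SNP for trees with missing genotypes.

    y   : 1-D array of gene content (0, 1 or 2); np.nan marks a missing call.
    A   : numerator relationship matrix, rows/columns in the same order as y.
    lam : ASSUMED ratio of residual to additive variance (not from the source).
    """
    y = np.asarray(y, dtype=float)
    A = np.asarray(A, dtype=float)
    obs = ~np.isnan(y)          # trees with an observed genotype
    mis = ~obs                  # trees to be imputed

    # Covariance among observed records, up to a constant: A_oo + lam * I.
    V = A[np.ix_(obs, obs)] + lam * np.eye(obs.sum())
    ones = np.ones(obs.sum())

    # Generalized least squares estimate of the mean gene content (b).
    mu = ones @ np.linalg.solve(V, y[obs]) / (ones @ np.linalg.solve(V, ones))

    # BLUP of the tree effects (u) for trees without a genotype.
    u_mis = A[np.ix_(mis, obs)] @ np.linalg.solve(V, y[obs] - mu)

    out = y.copy()
    out[mis] = mu + u_mis       # continuous predicted gene content
    return out


# Hypothetical pedigree: two unrelated parents and two full-sib offspring.
A_example = np.array([
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
])
y_example = np.array([2.0, 0.0, 1.0, np.nan])   # second offspring ungenotyped
imputed_example = impute_gene_content(y_example, A_example)
print(imputed_example)
```

In this toy pedigree the ungenotyped offspring of parents with gene content 2 and 0 is predicted to carry about one copy of the allele. Predictions are continuous, so a rounding or thresholding step would be needed if discrete genotypes are required downstream.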
In the simulated balanced design, the accuracy of imputation ranged from 0.97 with 10% missing genotypes to 0.75 with 80% missing genotypes.
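
The two missing-data patterns and an accuracy measure can be sketched as follows. The abstract does not define accuracy, so the Pearson correlation between true and imputed gene content at the masked cells is an assumption here. The reading of "random" as a random subset of loci hidden in every individual is likewise an interpretation of the text, and all function names are hypothetical.

```python
import numpy as np


def mask_random(G, prop, rng):
    """Random pattern: hide a random subset of loci within every individual."""
    G = G.astype(float)                      # astype returns a copy
    n_ind, n_loci = G.shape
    n_hide = int(round(prop * n_loci))
    for i in range(n_ind):
        G[i, rng.choice(n_loci, size=n_hide, replace=False)] = np.nan
    return G


def mask_structured(G, prop, rng):
    """Structured pattern: hide all loci for a random subset of individuals."""
    G = G.astype(float)
    n_ind = G.shape[0]
    n_hide = int(round(prop * n_ind))
    G[rng.choice(n_ind, size=n_hide, replace=False), :] = np.nan
    return G


def imputation_accuracy(G_true, G_masked, G_imputed):
    """ASSUMED measure: correlation of true and imputed values at masked cells."""
    hidden = np.isnan(G_masked)
    return np.corrcoef(G_true[hidden], G_imputed[hidden])[0, 1]


# Toy genotype matrix: 20 individuals by 50 loci, gene content coded 0, 1 or 2.
rng = np.random.default_rng(0)
G_toy = rng.integers(0, 3, size=(20, 50))
G_rand = mask_random(G_toy, 0.2, rng)        # 20% of loci hidden per individual
G_struct = mask_structured(G_toy, 0.4, rng)  # 40% of individuals fully hidden
```

Both masks hide the requested proportion of genotype calls, so any difference in imputation accuracy between them reflects where the gaps fall rather than how many there are.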

* Correspondence: [email protected]
2 Cooperative Tree Improvement Program, Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC, USA
Full list of author information is available at the end of the article

Ogut et al. BMC Proceedings 2011, 5(Suppl 7):P61
http://www.biomedcentral.com/1753-6561/5/S7/P61

© 2011 Ogut et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Author details
1 Cooperative Tree Improvement Program, Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC, USA.
2 Cooperative Tree Improvement Program, Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC, USA.

Published: 13 September 2011

References
1. Habier D, Fernando RL, Dekkers JCM: Genomic selection using low-density marker panels. Genetics 2009, 182:343-353.
2. Gengler N, Mayeres P, Szydlowski M: A simple method to approximate gene content in large pedigree populations: application to the myostatin gene in dual-purpose Belgian Blue cattle. Animal 2007, 1(1):21-28.

doi:10.1186/1753-6561-5-S7-P61
Cite this article as: Ogut et al.: Imputing missing genotypes: effects of methods and patterns of missing data. BMC Proceedings 2011 5(Suppl 7):P61.
