Missing single nucleotide polymorphisms (SNPs) are common in genetic association studies. Subjects with missing SNPs are often discarded from analyses, which may seriously undermine inference on SNP-disease association. In this article, we compare two haplotype-based imputation approaches and one regression tree-based imputation approach for association studies. The goal is to assess imputation accuracy and to evaluate the impact of imputation on parameter estimation. The haplotype-based approaches build on haplotype reconstruction by the expectation-maximization (EM) algorithm or a weighted EM (WEM) algorithm, depending on whether case-control status is taken into account. The tree-based approach uses a Gibbs sampler to iteratively sample from a full conditional distribution, which is obtained from the classification and regression tree (CART) algorithm. We employ a standard multiple imputation procedure to account for the uncertainty of imputation. We apply the methods to simulated data as well as to a case-control study of developmental dyslexia. Our results suggest that imputation generally improves on the standard practice of ignoring missing data in terms of both bias and efficiency. The haplotype-based approaches slightly outperform the tree-based approach when there is a small number of SNPs in linkage disequilibrium (LD), but the latter has a computational advantage. Finally, we demonstrate that utilizing disease status in imputation helps to reduce the bias in subsequent parameter estimation.
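The multiple imputation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the "standard" procedure pools results with Rubin's combining rules, which the abstract does not state explicitly. The function name `pool_rubin` and the numerical inputs are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates using Rubin's combining rules.

    Hypothetical helper for illustration; assumes each completed data set
    yields one point estimate (e.g. a log odds ratio) and its variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    q_bar = mean(estimates)   # pooled point estimate
    w = mean(variances)       # within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, t, math.sqrt(t)


# Made-up log odds ratios and variances from five imputed data sets.
log_or = [0.42, 0.38, 0.45, 0.40, 0.47]
var_log_or = [0.010, 0.011, 0.009, 0.010, 0.012]
pooled, total_var, pooled_se = pool_rubin(log_or, var_log_or)
print(f"pooled estimate = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

The between-imputation term is what carries the imputation uncertainty: analyzing a single imputed data set as if it were fully observed would report only the within-imputation variance and so understate the standard error.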
Dai, James Y.; Ruczinski, Ingo; LeBlanc, Michael; and Kooperberg, Charles, "Comparison of Haplotype-based and Tree-based SNP Imputation in Association Studies" (January 2006). UW Biostatistics Working Paper Series. Working Paper 278.
James Y. Dai and Ingo Ruczinski contributed equally to this work.