Determining which of hundreds of thousands of single nucleotide polymorphisms (SNPs) are associated with disease poses one of the most challenging multiple testing problems. Under the empirical Bayes approach, the local false discovery rate (LFDR), estimated using popular semiparametric models, has enjoyed success in simultaneous inference. However, the estimated LFDR can be biased because the semiparametric approach tends to overestimate the proportion of non-associated SNPs. One negative consequence is that, like conventional p-values, such LFDR estimates cannot quantify the amount of information in the data that favors the null hypothesis of no disease association.
We address this problem of the semiparametric approach by proposing two simple parametric methods under the minimum description length (MDL) and empirical Bayes frameworks. We compare by simulation the performance of the estimators corresponding to the two proposed parametric models and to the popular semiparametric model in order to select a method for analyzing genome-wide association data.
Application to the coronary artery disease data indicates that the semiparametric method sometimes overfits because of its nonparametric density estimation. Unlike semiparametric methods, analyses based on the two parametric models can measure the amount of information in the data that favors one hypothesis over another. In multiple simulation studies, the estimators associated with the parametric mixture model consistently perform better than those of the other two models.
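As an illustration of the quantity discussed above, the LFDR under a generic two-group mixture can be sketched as the posterior probability that a SNP is non-associated given its test statistic. This is a minimal sketch and not the estimators proposed in the paper: the standard normal null, the normal alternative, the parameter values, and the function names `normal_pdf` and `local_fdr` are all assumptions made for illustration.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, alt_mean, alt_sd=1.0):
    """LFDR at statistic z under an assumed two-component normal mixture.

    pi0      : assumed proportion of non-associated SNPs (null component)
    alt_mean : assumed mean of the statistic for associated SNPs
    alt_sd   : assumed standard deviation for associated SNPs
    """
    null_part = pi0 * normal_pdf(z)                           # pi0 * f0(z), null is N(0, 1)
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_mean, alt_sd)  # (1 - pi0) * f1(z)
    return null_part / (null_part + alt_part)


# A larger assumed null proportion raises the LFDR at the same statistic.
lfdr_low_pi0 = local_fdr(2.0, pi0=0.80, alt_mean=3.0)
lfdr_high_pi0 = local_fdr(2.0, pi0=0.95, alt_mean=3.0)
```

Under these assumptions, the comparison at the end shows why overestimating the proportion of non-associated SNPs pushes LFDR estimates upward: with everything else fixed, the LFDR increases with `pi0`.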
Epidemiology | Genetics | Statistical Methodology | Statistical Models | Statistical Theory
Yang, Ye and Bickel, David R., "Minimum Description Length and Empirical Bayes Methods of Identifying SNPs Associated with Disease" (November 2010). COBRA Preprint Series. Working Paper 74.