Motivation: Feature subset selection is an important aspect of binary classification with gene expression data. Once feature subsets are obtained, the models built from them must be evaluated. This paper considers both univariate- and multivariate-based feature selection approaches for binary classification with microarray data. The aim in considering the more sophisticated multivariate approach is to determine whether its greater potential to identify jointly significant subsets of genes yields lower misclassification error rates than an approach that combines individually predictive genes selected by a univariate method. Further, we wish to see whether the multivariate approaches can perform well without overfitting the data.
Results: An empirical study is presented in which 10-fold cross-validation is applied externally to one univariate-based and two multivariate (genetic algorithm (GA)-based) feature selection processes. These procedures are applied with three supervised learning algorithms and six published two-class microarray datasets. We find that although the multivariate feature selection approaches may in general have more potential to select jointly significant combinations of genes than the simpler univariate approach, the 10-fold external cross-validation misclassification error rates of the univariate and GA-based approaches were very comparable for all classifiers and across all subset sizes. Considering all datasets, learning algorithms, and subset sizes together, the average 10-fold external cross-validation error rates for the univariate-, single-stage GA-, and two-stage GA-based processes are 14.2%, 14.6%, and 14.2%, respectively. Further, we find that the more sophisticated two-stage GA approach did not demonstrate a significant advantage over the single-stage approach. We also find that the univariate approach had higher optimism bias and lower selection bias than both GA approaches. Finally, considering all datasets, learning algorithms, and subset sizes together, we find that the optimism bias estimates from the GA analyses were half those of the univariate approach, but the selection bias estimates from the GA analyses were 2.5 times those of the univariate approach. This higher selection bias suggests that selecting genes in multivariate models using a GA may be more likely to select spurious genes than would be the case with a univariate-based approach.
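The distinction between applying cross-validation externally to the feature selection process and applying it only after genes have been chosen can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the t-statistic filter, the nearest-centroid classifier, the function names (`select_top`, `predict`, `cv_error`), and the simulated pure-noise data. The gap between the external and internal error estimates is shown as one common way to gauge selection bias; the abstract itself does not state the definition the authors used.

```python
import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def select_top(X, y, k):
    """Univariate filter: indices of the k genes with the largest |t| statistic."""
    scores = []
    for j in range(len(X[0])):
        a = [r[j] for r, lab in zip(X, y) if lab == 0]
        b = [r[j] for r, lab in zip(X, y) if lab == 1]
        se = (var(a) / len(a) + var(b) / len(b)) ** 0.5
        scores.append(abs(mean(a) - mean(b)) / (se + 1e-12))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def predict(X_tr, y_tr, x, genes):
    """Nearest-centroid rule restricted to the selected genes (illustrative classifier)."""
    dist = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        dist[lab] = sum((x[j] - mean([r[j] for r in rows])) ** 2 for j in genes)
    return 0 if dist[0] <= dist[1] else 1

def cv_error(X, y, k, n_folds=10, external=True, seed=0):
    """Cross-validated misclassification rate.

    external=True : genes are re-selected inside every training fold.
    external=False: genes are selected once on all samples, then only the
                    classifier is cross-validated (internal CV).
    Folds are not stratified; this sketch assumes roughly balanced classes.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    genes_all = None if external else select_top(X, y, k)
    wrong = 0
    for fold in folds:
        held = set(fold)
        tr = [i for i in idx if i not in held]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        genes = select_top(X_tr, y_tr, k) if external else genes_all
        wrong += sum(predict(X_tr, y_tr, X[i], genes) != y[i] for i in fold)
    return wrong / len(y)

if __name__ == "__main__":
    rng = random.Random(1)
    y = [i % 2 for i in range(40)]                          # two balanced classes
    X = [[rng.gauss(0, 1) for _ in range(100)] for _ in y]  # pure noise, no real signal
    ext = cv_error(X, y, k=5, external=True)
    internal = cv_error(X, y, k=5, external=False)
    print(f"external CV error: {ext:.3f}")
    print(f"internal CV error: {internal:.3f}")
    print(f"gap (external - internal): {ext - internal:.3f}")
```

Because the simulated genes carry no class information, any apparent accuracy under internal cross-validation would come from the held-out samples having taken part in gene selection. Re-selecting genes within each training fold removes that leakage, which is why the external estimate is the one reported when comparing selection methods.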
Lecocke, Michael L. and Hess, Kenneth, "An Empirical Study of Univariate and GA-Based Feature Selection in Binary Classification with Microarray Data" (March 2005). UT MD Anderson Cancer Center Department of Biostatistics Working Paper Series. Working Paper 5.