Many diseases and other important phenotypic outcomes are the result of a combination of factors. For example, expression levels of genes have been used as input to various statistical methods for predicting phenotypic outcomes. One particularly popular variety is gene set enrichment analysis (GSEA). This paper discusses an augmentation to an existing strategy for estimating the significance of an association between a disease outcome and a predetermined combination of biological factors, based on a specific data-adaptive regression method (the "Super Learner," van der Laan et al., 2007). The method uses an aggressive search procedure, potentially resulting in final models that imply associations that would not be discovered using non-data-adaptive procedures (e.g., multiple linear regression). A test statistic derived from the "fit" of the Super Learner model to the original data is compared to the permutation distribution of the same statistic, which is generated by permuting the outcome labels with respect to the covariate vectors. This comparison is the basis for the rejection criterion for the null hypothesis of no association between a set of biological factors (e.g., gene expression levels) and a binary phenotypic outcome. We include simulations that compare the statistical power of the test derived from the Super Learner method with that of other methods for two different data-generating distributions.
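
The permutation comparison described above can be illustrated with a minimal sketch. It is not the procedure used in the paper. As a stand-in for the Super Learner fit, it assumes the R^2 of an ordinary least-squares fit as the test statistic. The function names (`fit_statistic`, `permutation_p_value`), the number of permutations, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_statistic(X, y):
    """Stand-in fit statistic (an assumption): R^2 of an ordinary
    least-squares fit. The paper uses a Super Learner fit instead."""
    design = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = y - y.mean()
    return 1.0 - (resid @ resid) / (total @ total)


def permutation_p_value(X, y, n_perm=500, seed=0):
    """Compare the observed statistic with its permutation distribution,
    obtained by shuffling outcome labels relative to the covariate rows."""
    rng = np.random.default_rng(seed)
    observed = fit_statistic(X, y)
    null = np.array(
        [fit_statistic(X, rng.permutation(y)) for _ in range(n_perm)]
    )
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)


# Hypothetical data: 100 samples, 5 "expression" covariates, and a binary
# outcome that depends on the first covariate.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y_assoc = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)
p_assoc = permutation_p_value(X, y_assoc)
```

Under these assumptions, a small `p_assoc` would lead to rejecting the null hypothesis of no association. Replacing `fit_statistic` with a statistic derived from a data-adaptive fit leaves the permutation logic unchanged, provided the entire fitting procedure is rerun on each permuted data set.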


Bioinformatics | Computational Biology