One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes, which can provide insight into the disease process. With rapid developments in high-throughput genomic technologies over the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins, resulting in enormous data sets in which the number of genomic features far exceeds the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for every feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods can under- or overestimate the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² indices for the PH and PO models that can be interpreted in terms of explained variation. Lastly, we propose a generalized pseudo-R² measure that includes the PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability, according to feature expression, between subjects who experience the event and those who do not. We evaluate the performance of our measures using extensive simulation studies and publicly available data sets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
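For reference, the way the PH and PO models arise as special cases of the Yang-Prentice model can be sketched as follows. The notation here is an assumption introduced for illustration and is not taken from the abstract: z is the expression level of a single feature, λ0, S0 and F0 = 1 - S0 are the baseline hazard, survival and distribution functions, and β1 and β2 are short-term and long-term log hazard ratios. This is the form in which the model is commonly written; the parameterization used in the paper itself may differ.

```latex
% Yang-Prentice model for a single feature z (illustrative notation)
\[
\lambda(t \mid z)
  = \frac{e^{(\beta_1 + \beta_2) z}}
         {e^{\beta_1 z} F_0(t) + e^{\beta_2 z} S_0(t)} \, \lambda_0(t),
  \qquad F_0(t) = 1 - S_0(t).
\]

% Hazard ratio at the two ends of follow-up
\[
\lim_{t \to 0} \frac{\lambda(t \mid z)}{\lambda_0(t)} = e^{\beta_1 z},
\qquad
\lim_{t \to \infty} \frac{\lambda(t \mid z)}{\lambda_0(t)} = e^{\beta_2 z}.
\]

% Special case beta_1 = beta_2 = beta: proportional hazards
\[
\lambda(t \mid z) = e^{\beta z} \lambda_0(t).
\]

% Special case beta_2 = 0: proportional odds
\[
\lambda(t \mid z)
  = \frac{e^{\beta_1 z}}{e^{\beta_1 z} F_0(t) + S_0(t)} \, \lambda_0(t)
\quad\Longleftrightarrow\quad
\frac{F(t \mid z)}{S(t \mid z)} = e^{\beta_1 z} \, \frac{F_0(t)}{S_0(t)}.
\]
```

Under this parameterization the hazard ratio moves monotonically from its short-term to its long-term value, so β1 and β2 of opposite sign correspond to crossing hazards, one of the NPH patterns that a univariate Cox fit cannot represent.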
The proposed information divergence, R² and pseudo-R² measures were implemented in R (www.R-project.org), and the code is available upon request.
Applied Statistics | Biochemistry | Bioinformatics | Biology | Biometry | Biostatistics | Biotechnology | Cancer Biology | Cell and Developmental Biology | Cell Biology | Clinical Trials | Computational Biology | Genetics | Genetics and Genomics | Genomics | Integrative Biology | Mathematics | Microarrays | Molecular Genetics | Multivariate Analysis | Other Genetics and Genomics | Other Statistics and Probability | Physical Sciences and Mathematics | Probability | Statistical Methodology | Statistical Models | Statistical Theory | Statistics and Probability | Survival Analysis
Spirko-Burns, Lauren and Devarajan, Karthik, "Unified Methods for Feature Selection in Large-Scale Genomic Studies with Censored Survival Outcomes" (March 2019). COBRA Preprint Series. Working Paper 120.