Abstract

Background: An emerging feature in modern biomedical research is collecting and analyzing numerous variables. In the presence of many potential covariates, inference becomes challenging: it requires both identifying the set of covariates truly associated with an outcome and estimating their corresponding regression coefficients consistently. Traditional statistical inference typically focuses on estimating coefficients assuming a pre-specified set of covariates. Further, advanced machine/statistical learning methods that perform both selection and estimation predominantly focus on outcome prediction rather than association inference.

Methods: Motivated by our epidemiological research on long-term childhood cancer survivors, where we aimed to investigate associations between a large pool of longitudinal symptom patterns and future quality of life, we propose a novel approach called Bayesian Information Criterion Elastic Net (BIEN) for standard likelihood-based association inference in the presence of many potential covariates. To evaluate this approach, we compared it with alternative tools, namely Elastic Net (EN), Stepwise selection (SW), and a simpler version of BIEN (BIEN-B), via simulations designed to mimic our study of childhood cancer survivors.

Results: BIEN showed equal or superior performance compared to the other methods in retrieving truly associated covariates and estimating regression coefficients. Both BIEN-B and BIEN notably surpassed EN regardless of the sample size and outperformed SW when the sample size was smaller than the number of candidate covariates, whereas BIEN also attained results comparable to SW at larger sample sizes. Additional simulations under reduced collinearity and larger sample sizes confirmed BIEN’s ability to achieve near-optimal performance. We also applied BIEN to our highly multicollinear childhood-cancer survivors’ dataset, where the number of candidate variables was close to the sample size, and assessed selection and estimation uncertainty both with and without multicollinearity reduction.

Conclusions: BIEN provides a pragmatic regression approach for association inference with many potential covariates. Its performance in both simulated and real-world datasets demonstrates its potential as a useful analytic tool in modern biomedical studies, where large sets of correlated variables are increasingly common.

Disciplines

Biostatistics | Data Science | Epidemiology
