More often than not, biomarker studies analyze large numbers of variables with a complicated and generally unknown correlation structure. Numerous statistical methods attempt to untangle these variables and determine the underlying mechanism by identifying causally related biomarkers. Results from these methods are generally difficult to interpret and nearly impossible to compare across studies. The FDA has called for standardization of methods and protocols for biomarker detection. In response, we propose targeted variable importance (tVIM) as a standardized method for biomarker discovery. Through the use of targeted Maximum Likelihood, tVIM provides double robust estimates of variable importance along with formal inference. Under specified conditions, these measures are biologically interpretable as causal effects, allowing for reproducibility across populations. In this analysis we compare tVIM to four measures of importance provided by three statistical methods: univariate linear regression (LM), LASSO penalized multiple regression (Q), and two importance measures from randomForest (RF1 and RF2). Their performance is compared in simulation under conditions of increasing correlation. We are interested in their ability to distinguish "true" relevant biomarkers from correlated decoy biomarkers. The comparisons are based on the ranked variable list that each method produces from its importance measures and, when available, p-values. In simulation, tVIM coupled with a data-adaptive model selection method outperforms linear regression, LASSO, and randomForest and is more resilient to increases in correlation. We then apply all methods to the Golub et al. (1999) leukemia data and compare the resulting gene lists based on biological relevance. Both LM and tVIM are also applied to the van't Veer breast cancer data, where we compare them based on the 10 genes each ranks as most important.
From these results, tVIM appears to rank more biologically relevant genes at the top of its list than the other methods. Methods to reduce bias and provide realistic gene lists under extreme correlation are also discussed.


Biostatistics | Clinical Trials | Epidemiology | Microarrays