Published 2006 in International Journal of Biostatistics, Vol. 2, Issue 1.


Many statistical problems involve learning the importance, or effect, of a variable for predicting an outcome of interest, based on a sample of n independent and identically distributed observations on a list of input variables and an outcome. For example, although prediction/machine learning is, in principle, concerned with learning the optimal unknown mapping from input variables to an outcome from the data, the typically reported output is a list of importance measures for each input variable. The typical approach in prediction has been to learn the unknown optimal predictor from the data and to derive, for each of the input variables, the variable importance from the obtained fit. In this article we propose a new approach which involves, for each variable separately, 1) carefully defining the desired variable importance as a real-valued parameter, 2) deriving the efficient influence curve, and thereby the optimal estimating function, for this parameter in the assumed (possibly nonparametric) model, and 3) developing a corresponding locally efficient estimator of this variable importance, obtained by substituting data-adaptive estimators for the nuisance parameters in the optimal estimating function. We illustrate this methodology in the context of prediction, and obtain in this manner locally optimal estimators of marginal variable importance and covariate-adjusted variable importance, accompanied by p-values and statistical inference. We also propose a road map for statistical analysis based on this approach. Finally, we generalize this methodology to variable importance parameters for time-dependent variables.
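The three steps above can be illustrated with a minimal sketch. It is not the article's implementation and rests on simplifying assumptions: the variable of interest A is binary, there is a single binary covariate W, and the importance parameter is taken to be E[E(Y | A=1, W) - E(Y | A=0, W)]. The nuisance parameters (the outcome regression and the conditional distribution of A given W) are estimated here with plain stratified means, standing in for the data-adaptive estimators the abstract refers to. The function names `simulate` and `variable_importance`, and the simulated data-generating process, are hypothetical.

```python
import math
import random


def simulate(n, seed=0):
    """Hypothetical data: binary W, binary A depending on W, and
    Y = 1.0*A + 2.0*W + noise, so the true importance of A is 1.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < (0.3 + 0.4 * w) else 0
        y = 1.0 * a + 2.0 * w + rng.gauss(0.0, 1.0)
        data.append((w, a, y))
    return data


def mean(values):
    return sum(values) / len(values)


def variable_importance(data):
    """Estimating-equation estimator of E[E(Y|A=1,W) - E(Y|A=0,W)].

    Returns (estimate, standard error, two-sided p-value for H0: psi = 0).
    Assumes every (A, W) cell is non-empty.
    """
    # Nuisance estimates by stratified means (an assumption of this sketch):
    # g1[w] estimates P(A=1 | W=w), q[(a, w)] estimates E(Y | A=a, W=w).
    g1, q = {}, {}
    for w in (0, 1):
        stratum = [(a, y) for (wi, a, y) in data if wi == w]
        g1[w] = mean([a for a, _ in stratum])
        for a in (0, 1):
            q[(a, w)] = mean([y for ai, y in stratum if ai == a])

    # Uncentered efficient influence curve terms for each observation:
    # inverse-probability-weighted residual plus the regression contrast.
    terms = []
    for w, a, y in data:
        weight = a / g1[w] - (1 - a) / (1 - g1[w])
        terms.append(weight * (y - q[(a, w)]) + q[(1, w)] - q[(0, w)])

    n = len(data)
    psi = mean(terms)  # solves the estimating equation in psi
    influence = [t - psi for t in terms]
    se = math.sqrt(mean([v * v for v in influence]) / n)
    z = psi / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return psi, se, p_value


if __name__ == "__main__":
    estimate, std_err, p = variable_importance(simulate(5000, seed=1))
    print(f"estimate={estimate:.3f}  se={std_err:.3f}  p={p:.3g}")
```

In this sketch the standard error comes from the empirical variance of the estimated influence curve, which is what gives the p-value mentioned in the abstract. With real data and several covariates, the stratified means would be replaced by a data-adaptive regression fit.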


Biostatistics | Multivariate Analysis