Abstract

In Part I of this article we propose a general cross-validation criterian for selecting among a collection of estimators of a particular parameter of interest based on n i.i.d. observations. It is assumed that the parameter of interest minimizes the expectation (w.r.t. to the distribution of the observed data structure) of a particular loss function of a candidate parameter value and the observed data structure, possibly indexed by a nuisance parameter. The proposed cross-validation criterian is defined as the empirical mean over the validation sample of the loss function at the parameter estimate based on the training sample, averaged over random splits of the observed sample. The cross-validation selector is now the estimator which minimizes this cross-validation criterion. We illustrate that this general methodology covers, in particular, the selection problems in the current literature, but results in a wide range of new selection methods. We prove a finite sample oracle inequality, and asymptotic optimality of the cross-validated selector under general conditions. The asymptotic optimality states that the cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true data generating distribution).

Our general framework allows, in particular, the situation in which the observed data structure is a censored version of the full data structure of interest, and where the parameter of interest is a parameter of the full data structure distribution. As examples of the parameter of the full data distribution we consider a density of (a part of) the full data structure, a conditional expectation of an outcome, given explanatory variables, a marginal survival function of a failure time, and multivariate conditional expectation of an outcome vector, given covariates. In part II of this article we show that the general estimating function methodology for censored data structures as provided in van der Laan, Robins (2002) yields the wished loss functions for the selection among estimators of a full-data distribution parameter of interest based on censored data. The corresponding cross-validation selector generalizes any of the existing selection methods in regression and density estimation (including model selection) to the censored data case. Under general conditions, our optimality results now show that the corresponing cross-validation selector performs asymptotically exactly as well as the selector which for each given data set makes the best choice (knowing the true full data distribution).

In Part III of this article we propose a general estimator which is defined as follows. For a collection of subspaces and the complete parameter space, one defines an epsilon-net (i.e., a finite set of points whose epsilon-spheres cover the complete parameter space). For each epsilon and subspace one defines now a corresponding minimum cross-valided empirical risk estimator as the minimizer of cross-validated risk over the subspace-specific epsilon-net. In the special case that the loss function has no nuisance parameter, which thus covers the classical regression and density estimation cases, this epsilon and subspace specific minimum risk estimator reduces to the minimizer of the empirical risk over the corresponding epsilon-net. Finally, one selects epsilon and the subspace with the cross-validation selector. We refer to the resulting estimator as the cross-validated adaptive epsilon-net estimator. We prove an oracle inequality for this estimator which implies that the estimator minimax adaptive in the sense that it achieves the minimax optimal rate of convergence for the smallest of the guessed subspaces containing the true parameter value.

Disciplines

Numerical Analysis and Computation | Statistical Methodology | Statistical Models | Statistical Theory | Survival Analysis

Previous Versions

April 07, 2003