Abstract
Online estimators update a current estimate with each new incoming batch of data without having to revisit past data, thereby providing streaming estimates that scale to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to identify the algorithm with the best performance. We show that by basing estimates on the cross-validation-selected algorithm, we are asymptotically guaranteed to perform as well as the true, unknown best-performing algorithm. We provide extensions of this approach, including online estimation of the optimal ensemble of candidate online estimators. We illustrate the practical performance of our methods using simulations and a real data example in which we make streaming predictions of infectious disease incidence using data from a large database.
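The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate classes (RunningMean, SGDLinear), the function online_cv_select, the squared-error loss, and the simulated i.i.d. data are all assumptions chosen for illustration. Each incoming batch is first used to score every candidate, and only afterwards to update it, so the cumulative loss is always computed on data the candidate has not yet seen.

```python
import random


class RunningMean:
    """Hypothetical candidate: predicts the running mean of past outcomes."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x):
        return self.mean

    def update(self, batch):
        for _, y in batch:
            self.n += 1
            self.mean += (y - self.mean) / self.n


class SGDLinear:
    """Hypothetical candidate: one-feature linear model fit by SGD."""

    def __init__(self, lr):
        self.lr = lr
        self.a = 0.0  # intercept
        self.b = 0.0  # slope

    def predict(self, x):
        return self.a + self.b * x

    def update(self, batch):
        for x, y in batch:
            err = self.predict(x) - y
            self.a -= self.lr * err
            self.b -= self.lr * err * x


def online_cv_select(candidates, batches):
    """Return the index of the candidate with the smallest cumulative
    online cross-validated squared-error loss, plus all cumulative losses."""
    cum_loss = [0.0] * len(candidates)
    for batch in batches:
        for k, est in enumerate(candidates):
            # Score on the new batch BEFORE the candidate trains on it.
            cum_loss[k] += sum((est.predict(x) - y) ** 2 for x, y in batch)
            # Then fold the batch into the candidate; past data is never revisited.
            est.update(batch)
    best = min(range(len(candidates)), key=cum_loss.__getitem__)
    return best, cum_loss


# Simulated stream (assumption): y = 1 + 2x + small Gaussian noise.
random.seed(0)


def make_batch(size=20):
    xs = [random.uniform(-1, 1) for _ in range(size)]
    return [(x, 1.0 + 2.0 * x + random.gauss(0, 0.1)) for x in xs]


batches = [make_batch() for _ in range(50)]
candidates = [RunningMean(), SGDLinear(lr=0.05), SGDLinear(lr=0.5)]
best, cum_loss = online_cv_select(candidates, batches)
print("selected candidate index:", best)
```

In this toy stream the outcome depends linearly on x, so the selector should favor one of the linear candidates over the running mean. In a time-series setting, x would stand in for the summary measure of the past on which each candidate conditions.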
Disciplines
Biostatistics
Suggested Citation
Benkeser, David; Lendle, Samuel D.; Ju, Cheng; and van der Laan, Mark J., "Online Cross-Validation-Based Ensemble Learning" (October 2016). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 355.
https://biostats.bepress.com/ucbbiostat/paper355