"A Scalable Supervised Subsemble Prediction Algorithm" by Stephanie Sapp and Mark J. van der Laan

U.C. Berkeley Division of Biostatistics Working Paper Series

Title

A Scalable Supervised Subsemble Prediction Algorithm

Authors

Stephanie Sapp, University of California, Berkeley, Department of Statistics
Mark J. van der Laan, University of California, Berkeley, School of Public Health, Division of Biostatistics

Abstract

Subsemble is a flexible ensemble method that partitions a full data set into subsets of observations, fits the same algorithm on each subset, and uses a tailored form of V-fold cross-validation to construct a prediction function that combines the subset-specific fits with a second metalearner algorithm. Previous work studied the performance of Subsemble with randomly created subsets and showed that these Subsembles often predict better than the underlying algorithm fit just once on the full data set. Since the final Subsemble estimator varies depending on the data used to create the subset-specific fits, different strategies for creating the subsets result in different Subsembles. We propose supervised partitioning of the covariate space to create the subsets used in Subsemble, with a form of histogram regression as the metalearner that combines the subset-specific fits. We discuss applications to large-scale data sets and develop a practical Supervised Subsemble method that uses regression trees both to create the covariate space partitioning and to select the number of subsets. Through simulations and real data analysis, we show that this subset creation method can have better prediction performance than the random subset version.
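
As a rough illustration of the idea in the abstract, the following Python sketch partitions a one-dimensional covariate space with a supervised split, fits the same algorithm on each resulting subset, and predicts for a new point using the fit of the cell it falls into. It is a simplified sketch under stated assumptions, not the authors' implementation: the single-threshold split stands in for a regression tree, the underlying algorithm is assumed to be simple linear regression, the number of subsets is fixed at two, and the routing rule is only a crude stand-in for the paper's histogram regression metalearner and its cross-validation step. All function names (`fit_line`, `best_split`, `fit_supervised_subsemble`) and the toy data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all x equal: fall back to a constant fit
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys, min_size=2):
    """Depth-1 regression tree: threshold minimizing within-cell SSE.

    Assumption: one split into two cells; the paper uses regression
    trees to choose both the partition and the number of subsets.
    Returns None if no valid split exists.
    """
    pairs = sorted(zip(xs, ys))
    best_t, best_cost = None, float("inf")
    for i in range(min_size, len(pairs) - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        cost = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if cost < best_cost:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_cost = cost
    return best_t


def fit_supervised_subsemble(xs, ys):
    """Fit one line per cell of the supervised partition.

    Returns (threshold, predict), where predict routes a point to the
    subset-specific fit of the cell that contains it.
    """
    t = best_split(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    fits = [fit_line(*zip(*left)), fit_line(*zip(*right))]

    def predict(x):
        a, b = fits[0] if x <= t else fits[1]
        return a + b * x

    return t, predict


# Toy data (hypothetical): y = 2x for x < 5, y = 30 - x for x >= 5.
xs = list(range(10))
ys = [2 * x if x < 5 else 30 - x for x in xs]

threshold, predict = fit_supervised_subsemble(xs, ys)
full_a, full_b = fit_line(xs, ys)  # single fit on the full data set
```

On this toy data, the per-cell fits recover each linear piece, whereas the single full-data line in `full_a`, `full_b` has to compromise between the two regimes.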

Disciplines

Applied Statistics

Suggested Citation

Sapp, Stephanie and van der Laan, Mark J., "A Scalable Supervised Subsemble Prediction Algorithm" (April 2014). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 321.
https://biostats.bepress.com/ucbbiostat/paper321

Included in

Applied Statistics Commons
