The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. To select the best algorithm for a given dataset, we must therefore use cross-validation to compare several candidate algorithms. Super Learner (SL) is an ensemble learning algorithm that uses cross-validation to select among a "library" of candidate algorithms. The SL is not restricted to a single prediction algorithm, but instead leverages the strengths of a variety of learning algorithms to adapt to different datasets.
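As a rough illustration of the cross-validation step described above, the sketch below implements the *discrete* Super Learner: each candidate in a small library is scored by cross-validated risk, and the candidate with the lowest risk is selected. The data, library, and fold count are hypothetical; the full SL would additionally fit an optimal weighted combination of the candidates rather than picking a single one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for a healthcare dataset (hypothetical).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A small "library" of candidate learners, one parametric and one nonparametric.
library = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validated risk (negative log-likelihood) for each candidate.
cv_risk = {
    name: -cross_val_score(est, X, y, cv=5, scoring="neg_log_loss").mean()
    for name, est in library.items()
}

# The discrete Super Learner selects the candidate with the lowest CV risk.
best = min(cv_risk, key=cv_risk.get)
print(best, cv_risk[best])
```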
While the SL has been shown to perform well in a number of settings, it has not been evaluated in the large electronic healthcare datasets that are common in recent pharmacoepidemiology and medical research. In this article, we applied the SL to three electronic healthcare datasets and evaluated its ability to predict treatment assignment (i.e., the propensity score). We considered a library of algorithms that consisted of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional propensity score (hdPS) variable selection algorithm. While the goal of the propensity score model is to control for confounding by balancing covariates across treatment groups, in this study we were interested in the predictive performance of these modeling strategies.
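To give a sense of the variable selection idea, the sketch below caricatures one ingredient of the hdPS approach: from many empirically generated binary covariates (e.g., indicator variables built from diagnosis or procedure codes), keep those most strongly associated with treatment, then pass them to the prediction model. All data and the choice of ranking statistic here are hypothetical simplifications; the actual hdPS algorithm ranks covariates by their potential for confounding bias, which also involves the covariate-outcome association.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 200
codes = rng.binomial(1, 0.1, size=(n, p))  # candidate empirical covariates
treat = rng.binomial(1, 0.5, size=n)       # treatment indicator

def abs_log_odds_ratio(x, t):
    """|log odds ratio| between a binary covariate and treatment,
    from a 2x2 table with a 0.5 continuity correction."""
    a = ((x == 1) & (t == 1)).sum() + 0.5
    b = ((x == 1) & (t == 0)).sum() + 0.5
    c = ((x == 0) & (t == 1)).sum() + 0.5
    d = ((x == 0) & (t == 0)).sum() + 0.5
    return abs(np.log(a * d / (b * c)))

# Rank covariates by strength of association with treatment and keep the top 50.
strength = np.array([abs_log_odds_ratio(codes[:, j], treat) for j in range(p)])
top_k = np.argsort(strength)[::-1][:50]
X_hdps = codes[:, top_k]  # selected covariates, to be fed to the SL library
```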
Prediction performance was assessed across the three datasets using three metrics: negative log-likelihood, area under the curve (AUC), and computation time. The results showed: (1) the hdPS often outperformed algorithms that used only user-specified baseline variables; (2) the best individual algorithm was highly dependent on the dataset, and an ensemble, data-adaptive method like the SL generated the best predictive performance across datasets; (3) the SL that incorporated the hdPS methodology outperformed all other algorithms considered in this study; (4) moreover, the results demonstrated the reliability of the SL: though the SL was optimized with respect to the negative log-likelihood, it also performed best with respect to AUC in all three datasets.
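The two statistical metrics above can be computed directly from predicted treatment probabilities. The sketch below evaluates a hypothetical vector of propensity score predictions against true treatment assignments using the negative log-likelihood (log loss, lower is better) and the AUC (higher is better); the numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Hypothetical true treatment assignments and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_hat = np.array([0.2, 0.4, 0.7, 0.9, 0.6, 0.3, 0.8, 0.1])

nll = log_loss(y_true, p_hat)       # negative log-likelihood (lower is better)
auc = roc_auc_score(y_true, p_hat)  # discrimination (higher is better)
print(nll, auc)
```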


