parameters by targeted learning from independent and identically distributed data in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is carried out on the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE), which enables the construction of confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples in which the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results.
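The sub-sampling step can be sketched as follows; this is a minimal illustration in which the inclusion probabilities are taken proportional to the summary measures (one possible choice, not the optimized probabilities derived above), and rejective sampling simply redraws until the target size is hit:

```python
import random

def poisson_rejective_sample(inclusion_probs, target_size, seed=0):
    """Rejective (conditional) Poisson sampling: draw independent Bernoulli
    inclusions with unequal probabilities, and reject any realization whose
    size differs from the target, repeating until the size matches."""
    rng = random.Random(seed)
    while True:
        sample = [i for i, p in enumerate(inclusion_probs) if rng.random() < p]
        if len(sample) == target_size:
            return sample

# Hypothetical summary measures for 8 observations.
summaries = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6]
n_sub = 3  # desired sub-sample size
total = sum(summaries)
probs = [n_sub * s / total for s in summaries]  # sum to n_sub, each < 1 here

sample = poisson_rejective_sample(probs, n_sub)
```

Targeted learning would then proceed on the selected indices only, with the inclusion probabilities entering the estimator as weights.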

Methods: We predicted annual average PM_{2.5} concentrations at about 70,000 census tract centroids, using a point prediction model previously developed for estimating annual average PM_{2.5} concentrations in the continental U.S. for 1980-2010. We then averaged these predicted concentrations within each county, weighting by census tract population. In sensitivity analyses, we compared the resulting estimates to four alternative county-average estimates using MSE-based R^{2}, in order to capture both systematic and random differences between estimates. These alternatives included crude aggregates of regulatory monitoring data, averages of predictions at residential addresses in Southern California, and two additional sets of averages: census tract centroid predictions unweighted by population, and predictions interpolated from a 25-km national grid.
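The population-weighted aggregation step can be sketched as follows (the county codes, populations, and concentrations are hypothetical):

```python
from collections import defaultdict

def county_population_weighted_average(tracts):
    """Population-weighted aggregation of tract-centroid predictions:
    each county average weights tract-level PM2.5 by tract population.
    `tracts` holds (county_id, population, pm25) triples."""
    weighted_sum = defaultdict(float)
    pop_sum = defaultdict(float)
    for county, population, pm25 in tracts:
        weighted_sum[county] += population * pm25
        pop_sum[county] += population
    return {c: weighted_sum[c] / pop_sum[c] for c in weighted_sum}

# Hypothetical tract-level predictions for two illustrative counties.
tracts = [
    ("06037", 4000, 12.0),
    ("06037", 1000, 16.0),
    ("06059", 500, 10.0),
    ("06059", 1500, 8.0),
]
averages = county_population_weighted_average(tracts)
# e.g. county "06037": (4000*12.0 + 1000*16.0) / 5000 = 12.8
```

Weighting by population shifts each county average toward the concentrations where most residents actually live, which is the sense in which the estimates are "population representative."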

Results: The county-average mean PM_{2.5} was 14.40 (standard deviation=3.94) µg/m^{3} in 1980 and decreased to 12.24 (3.24), 10.42 (3.30), and 8.06 (2.06) µg/m^{3} in 1990, 2000, and 2010, respectively. These estimates were moderately related to crude averages in 2000 and 2010, when monitoring data were available (R^{2}=0.70-0.82), and almost identical to the unweighted averages in all four decennial years. County averages were also consistent with the county averages derived from residential estimates in Southern California (R^{2}=0.95-0.96). Grid-based estimates of county-average PM_{2.5} that also incorporated monitoring data were more consistent with our estimates (R^{2}=0.95-0.98) than grid-only estimates (0.91-0.96); both had slightly lower concentrations than the census tract-based estimates.
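The MSE-based R^{2} underlying these comparisons can be sketched as below, assuming the common definition 1 - MSE/Var(reference); unlike a squared correlation, it is reduced by systematic offsets as well as by random scatter:

```python
def mse_based_r2(estimates, reference):
    """MSE-based R^2: one minus the mean squared difference between the
    two sets of estimates, divided by the variance of the reference values.
    A systematic shift lowers this even when correlation is perfect."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mse = sum((e - r) ** 2 for e, r in zip(estimates, reference)) / n
    var_ref = sum((r - mean_ref) ** 2 for r in reference) / n
    return 1.0 - mse / var_ref

# Hypothetical county averages compared against a reference set.
ref = [8.0, 10.0, 12.0, 14.0]
est_close = [8.2, 9.9, 12.1, 13.8]    # small random differences
est_shifted = [r + 2.0 for r in ref]  # perfectly correlated, but biased

r2_close = mse_based_r2(est_close, ref)      # -> 0.995
r2_shifted = mse_based_r2(est_shifted, ref)  # -> 0.2: the bias is penalized
```

A correlation-based R^{2} would score the shifted estimates as 1.0; the MSE-based version is what lets the comparison "capture both systematic and random differences."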

Conclusions: Our approach to estimating population-representative, area-level PM_{2.5} concentrations yields estimates consistent with averages across residences. These exposure estimates will allow us to assess health impacts of ambient PM_{2.5} concentrations in datasets with area-level health data.

The collaborative double robust targeted maximum likelihood estimator (C-TMLE) is an extension of targeted minimum loss-based estimators (TMLE) that pursues an optimal strategy for estimation of the nuisance parameter. The original implementation of the C-TMLE algorithm uses a greedy forward stepwise selection procedure to construct a nested sequence of candidate nuisance parameter estimators. Cross-validation is then used to select the candidate that minimizes bias in the estimate of the target parameter, rather than basing selection on the fit of the nuisance parameter model. C-TMLE has exhibited superior relative performance in analyses of sparse data, but the time complexity of the algorithm is $\mathcal{O}(p^2)$, where $p$ is the number of covariates available for inclusion in the model. Despite a criterion that allows for early termination, the greedy algorithm does not scale to large-scale, high-dimensional data.
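The greedy construction described above can be sketched schematically; here `score` is a hypothetical stand-in for refitting the nuisance model and evaluating each candidate, whose repeated evaluation over all remaining covariates is where the quadratic cost arises:

```python
def greedy_nested_sequence(p, score):
    """Greedy forward stepwise selection: at each step, trial every
    not-yet-included covariate and add the one with the best score,
    yielding a nested sequence of candidate covariate sets. The inner
    loop over the remaining covariates makes this O(p^2) model fits."""
    selected, remaining, sequence, fits = [], list(range(p)), [], 0
    while remaining:
        best = None
        for j in remaining:
            fits += 1  # one nuisance-model fit per trialed covariate
            if best is None or score(selected + [j]) > score(selected + [best]):
                best = j
        selected.append(best)
        remaining.remove(best)
        sequence.append(list(selected))
    return sequence, fits

# Hypothetical score: pretend larger covariate indices are "more important".
sequence, fits = greedy_nested_sequence(4, score=lambda s: sum(s))
# fits = 4 + 3 + 2 + 1 = 10, i.e. p(p+1)/2 fits for p covariates
```

Cross-validation would then choose among the entries of `sequence` by their impact on the target parameter estimate rather than by nuisance-model fit.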

This article introduces two scalable versions of C-TMLE. Each relies on an easily computed, data-adaptive pre-ordering of the variables. The time complexity of these scalable algorithms is $\mathcal{O}(p)$, and an early data-adaptive stopping rule further reduces computation time without sacrificing statistical performance. We also introduce SL-CTMLE, an approach that uses super learning to select the best variable ordering from a set of ordering strategies. Simulation studies illustrate the performance of the scalable C-TMLEs relative to the original C-TMLE, the augmented inverse probability of treatment weighted estimator (A-IPTW), the inverse probability of treatment weighted (IPTW) estimator, and standard TMLE using an external, non-collaborative estimator of the treatment mechanism. Scalable C-TMLEs were also applied to three real-world health insurance claims datasets to estimate an average treatment effect. High-dimensional covariates were generated from the claims data based on high-dimensional propensity score (hdPS) screening. All C-TMLEs provided similar estimates and mean squared errors. Scalable C-TMLE analyses ran ten times faster than the original C-TMLE in larger datasets, making C-TMLE a feasible option for the analysis of large-scale, high-dimensional data.
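The pre-ordering idea can be sketched as follows, using absolute univariate correlation with the outcome as one hypothetical cheap ordering score (the actual ordering strategies in the article differ in detail). Because the ordering is fixed up front, only one nested candidate is fit per step:

```python
import math
import random

def univariate_score(x_col, y):
    """Absolute Pearson correlation between one covariate and the outcome."""
    n = len(y)
    mx, my = sum(x_col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x_col, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x_col))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def preordered_nested_candidates(X, y):
    """Rank covariates once by a cheap univariate score, then return the
    nested candidate covariate sets. Cross-validation picks among these p
    candidates (O(p) fits), versus O(p^2) fits for greedy selection."""
    p = len(X[0])
    order = sorted(range(p),
                   key=lambda j: univariate_score([row[j] for row in X], y),
                   reverse=True)
    return [order[:k] for k in range(1, p + 1)]

# Synthetic data where covariate 2 strongly drives the outcome.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [row[2] * 2.0 + rng.gauss(0, 0.1) for row in X]

candidates = preordered_nested_candidates(X, y)
```

An early stopping rule would simply truncate `candidates` once the cross-validated criterion stops improving, shrinking the constant in front of $p$.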

The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. To select the best algorithm for a given set of data we must therefore use cross-validation to compare several candidate algorithms. Super Learner (SL) is an ensemble learning algorithm that uses cross-validation to select among a "library" of candidate algorithms. The SL is not restricted to a single prediction algorithm, but uses the strengths of a variety of learning algorithms to adapt to different datasets.

While the SL has been shown to perform well in a number of settings, it has not been evaluated in the large electronic healthcare datasets that are common in pharmacoepidemiology and medical research. In this article, we applied the SL to three electronic healthcare datasets and evaluated its ability to predict treatment assignment (i.e., the propensity score). We considered a library of algorithms consisting of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional propensity score (hdPS) variable selection algorithm. While the goal of a propensity score model is to control for confounding by balancing covariates across treatment groups, in this study we were interested in the predictive performance of these modeling strategies.
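A minimal discrete Super Learner can be sketched as below; the two toy learners are hypothetical stand-ins for the nonparametric and parametric library, and the risk is the cross-validated negative Bernoulli log-likelihood used for propensity score prediction:

```python
import math
import random

def neg_log_likelihood(preds, labels, eps=1e-12):
    """Average negative Bernoulli log-likelihood (the SL risk used here)."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for p, t in zip(preds, labels)) / len(labels)

def discrete_super_learner(X, y, library, k=5):
    """Discrete Super Learner: k-fold cross-validation over a library of
    learners; refit and return the learner with the lowest CV risk.
    Each learner is a callable fit(X, y) -> predict(x) function."""
    n = len(y)
    folds = [list(range(i, n, k)) for i in range(k)]
    risks = []
    for fit in library:
        losses = []
        for fold in folds:
            held = set(fold)
            train = [i for i in range(n) if i not in held]
            predict = fit([X[i] for i in train], [y[i] for i in train])
            losses.append(neg_log_likelihood([predict(X[i]) for i in fold],
                                             [y[i] for i in fold]))
        risks.append(sum(losses) / k)
    best = min(range(len(library)), key=lambda j: risks[j])
    return library[best](X, y), best

def fit_mean(X, y):
    """Intercept-only learner: predicts the overall treatment rate."""
    p = sum(y) / len(y)
    return lambda x: p

def fit_stratified(X, y):
    """Learner using the first covariate: empirical rate per stratum."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(xi[0], []).append(yi)
    rates = {key: sum(v) / len(v) for key, v in groups.items()}
    overall = sum(y) / len(y)
    return lambda x: rates.get(x[0], overall)

# Synthetic data: a binary covariate strongly predicts treatment.
rng = random.Random(1)
X = [[rng.randint(0, 1)] for _ in range(300)]
y = [1 if rng.random() < (0.8 if xi[0] else 0.2) else 0 for xi in X]

model, best_index = discrete_super_learner(X, y, [fit_mean, fit_stratified])
```

The full SL additionally takes a weighted combination of the library's predictions rather than picking a single winner; the CV machinery is the same.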

Prediction performance was assessed across the three datasets using three metrics: likelihood, area under the curve (AUC), and time complexity. The results showed: 1. The hdPS often outperformed algorithms that used only user-specified baseline variables. 2. The best individual algorithm was highly dependent on the dataset; an ensemble, data-adaptive method like the SL delivered strong predictive performance across all of them. 3. The SL that incorporated the hdPS methodology outperformed all other algorithms considered in this study. 4. The results also showed the reliability of the SL: although the SL was optimized with respect to the negative log-likelihood, it performed best with respect to AUC in all three datasets.
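The AUC reported in point 4 can be computed directly from the rank (Mann-Whitney) identity: the probability that a randomly chosen treated unit is scored above a randomly chosen untreated one, counting ties as one half. A minimal sketch:

```python
def auc(preds, labels):
    """Area under the ROC curve via the Mann-Whitney identity:
    fraction of (positive, negative) pairs where the positive is
    scored higher, with ties counted as 0.5."""
    pos = [p for p, t in zip(preds, labels) if t == 1]
    neg = [p for p, t in zip(preds, labels) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy propensity scores and treatment labels.
preds = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
result = auc(preds, labels)  # 5 of 6 pairs correctly ordered -> 5/6
```

Unlike the likelihood, AUC depends only on the ranking of the predictions, which is why a model tuned on negative log-likelihood can still be compared fairly on AUC.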