parameters by targeted learning from independent and identically distributed data in contexts where the sample size is so large that it poses computational challenges. We observe summary measures of all the data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on these summary measures. Targeted learning is carried out from the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE), which enables the construction of confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results. ]]>
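The sub-sampling step can be illustrated with a minimal sketch. It assumes the textbook description of rejective (conditional Poisson) sampling: draw independent Bernoulli inclusions and accept the first draw whose size equals the target. The summary measure, the outcome, the rule mapping summaries to inclusion probabilities, and the inverse-probability weights are all hypothetical choices made for illustration; the weighting actually used by the TMLE is not stated in the abstract.

```python
import random


def poisson_rejective_sample(probs, size, rng, max_tries=100_000):
    """Draw independent Bernoulli(p_i) inclusions and return the first draw
    whose size equals `size` (rejective / conditional Poisson sampling)."""
    for _ in range(max_tries):
        chosen = [i for i, p in enumerate(probs) if rng.random() < p]
        if len(chosen) == size:
            return chosen
    raise RuntimeError("no sample of the requested size was accepted")


rng = random.Random(0)
N, n = 1000, 100  # full data size and target sub-sample size (illustrative)

# Hypothetical summary measure and outcome for each observation.
summary = [rng.random() for _ in range(N)]
outcome = [2.0 * s + rng.gauss(0.0, 0.1) for s in summary]

# Assumed rule: unequal inclusion probabilities that increase with the
# summary measure, rescaled so that they sum to n.
raw = [0.5 + s for s in summary]
scale = n / sum(raw)
probs = [scale * r for r in raw]

sample = poisson_rejective_sample(probs, n, rng)

# Inverse-probability weights based on the Poisson design probabilities.
# Under rejective sampling the true inclusion probabilities differ slightly
# from `probs`, so this is only an approximation.
weights = [1.0 / probs[i] for i in sample]
weighted_mean = sum(w * outcome[i] for w, i in zip(weights, sample)) / sum(weights)
print(len(sample), round(weighted_mean, 3))
```

Observations with larger summary values are over-represented in the sub-sample, and the weights compensate for that when estimating a population quantity.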

Methods: We predicted annual average PM_{2.5} concentrations at about 70,000 census tract centroids, using a point prediction model previously developed for estimating annual average PM_{2.5} concentrations in the continental U.S. for 1980-2010. We then averaged these predicted PM_{2.5} concentrations within each county, weighted by census tract population. In sensitivity analyses, we compared the resulting estimates to four alternative county average estimates using MSE-based R^{2} in order to capture both systematic and random differences in estimates. These estimates included crude aggregates of regulatory monitoring data, averages of predictions at residential addresses in Southern California, and two sets of averages of census tract centroid predictions unweighted by population and interpolated from predictions at 25-km national grid coordinates.

Results: The county-average mean PM_{2.5} was 14.40 (standard deviation=3.94) µg/m^{3} in 1980 and decreased to 12.24 (3.24), 10.42 (3.30), and 8.06 (2.06) µg/m^{3} in 1990, 2000, and 2010, respectively. These estimates were moderately consistent with crude averages in 2000 and 2010, when monitoring data were available (R^{2}=0.70-0.82), and almost identical to the unweighted averages in all four decennial years. County averages were also consistent with the county averages derived from residential estimates in Southern California (0.95-0.96). We found that grid-based estimates of county-average PM_{2.5} were more consistent with our estimates when we also included monitoring data (0.95-0.98) than grid-only estimates (0.91-0.96); both had slightly lower concentrations than census tract-based estimates.

Conclusions: Our approach to estimating population representative area-level PM_{2.5} concentrations is consistent with averages across residences. These exposure estimates will allow us to assess health impacts of ambient PM_{2.5} concentration in datasets with area-level health data.
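The two computations named in the Methods, a population-weighted county average and an MSE-based R^{2}, can be sketched as follows. The definition used here, max(0, 1 - MSE / variance of the reference estimates), is an assumption about what "MSE-based R^{2}" means, and the function names and numbers are hypothetical.

```python
def population_weighted_mean(values, populations):
    """County average of tract-level predictions, weighted by tract population."""
    total = sum(populations)
    if total <= 0:
        raise ValueError("total population must be positive")
    return sum(v * p for v, p in zip(values, populations)) / total


def mse_r2(estimates, reference):
    """Assumed definition: max(0, 1 - MSE / variance of the reference).

    Unlike a squared correlation, it penalizes systematic offsets as well
    as random scatter."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mse = sum((e - r) ** 2 for e, r in zip(estimates, reference)) / n
    var = sum((r - mean_ref) ** 2 for r in reference) / n
    return max(0.0, 1.0 - mse / var)


# Two hypothetical tracts: 10 and 20 µg/m^3, with populations 1,000 and 3,000.
print(population_weighted_mean([10.0, 20.0], [1000, 3000]))  # 17.5

# A constant offset leaves the correlation at 1 but lowers the MSE-based R^2.
print(mse_r2([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))  # 0.0
```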

The collaborative double robust targeted maximum likelihood estimator (C-TMLE) is an extension of targeted minimum loss-based estimators (TMLE) that pursues an optimal strategy for estimation of the nuisance parameter. The original implementation of the C-TMLE algorithm uses a greedy forward stepwise selection procedure to construct a nested sequence of candidate nuisance parameter estimators. Cross-validation is then used to select the candidate that minimizes bias in the estimate of the target parameter, rather than basing selection on the fit of the nuisance parameter model. C-TMLE has exhibited superior relative performance in analyses of sparse data, but the time complexity of the algorithm is $\mathcal{O}(p^2)$, where $p$ is the number of covariates available for inclusion in the model. Despite a criterion that allows for early termination, the greedy algorithm does not scale to large-scale, high-dimensional data.]]>

This article introduces two scalable versions of C-TMLE. Each relies on an easily computed data-adaptive pre-ordering of the variables. The time complexity of these scalable algorithms is $\mathcal{O}(p)$, and an early data-adaptive stopping rule further reduces computation time without sacrificing statistical performance. We also introduce SL-CTMLE, an approach that uses super learning to select the best variable ordering from a set of ordering strategies. Simulation studies illustrate the performance of the scalable C-TMLEs relative to the original C-TMLE, the augmented inverse probability of treatment weighted (A-IPTW) estimator, the inverse probability of treatment weighting (IPTW) estimator, and standard TMLE using an external non-collaborative estimator of the treatment mechanism. Scalable C-TMLEs were also applied to three real-world health insurance claims datasets to estimate an average treatment effect. High-dimensional covariates were generated from the claims data based on high-dimensional propensity score (hdPS) screening. All C-TMLEs provided similar estimates and mean squared errors. Scalable C-TMLE analyses ran ten times faster than the original C-TMLE in larger datasets, making C-TMLE a feasible option for the analysis of large-scale, high-dimensional data.
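The gap between $\mathcal{O}(p^2)$ and $\mathcal{O}(p)$ can be seen by counting how many candidates each strategy evaluates while building its nested sequence. The sketch below is not C-TMLE itself: the `score` function and the `importance` values are hypothetical stand-ins for the expensive step (fitting a nuisance model and updating the estimate), and only the search structure is illustrated.

```python
def greedy_forward_sequence(covariates, score):
    """Greedy forward selection: at each step, try every remaining covariate
    and keep the one giving the lowest score. Returns the nested sequence of
    covariate sets and the number of score evaluations."""
    selected, remaining = [], list(covariates)
    sequence, n_evals = [], 0
    while remaining:
        best_score, best_cov = None, None
        for cov in remaining:
            n_evals += 1
            s = score(selected + [cov])
            if best_score is None or s < best_score:
                best_score, best_cov = s, cov
        selected.append(best_cov)
        remaining.remove(best_cov)
        sequence.append(list(selected))
    return sequence, n_evals


def preordered_sequence(covariates, order_key):
    """Pre-ordered variant: sort once, then add covariates in that order,
    so only one candidate is evaluated per step."""
    ordered = sorted(covariates, key=order_key)
    sequence = [ordered[:k] for k in range(1, len(ordered) + 1)]
    return sequence, len(ordered)


# Hypothetical importance values standing in for a data-adaptive ordering.
importance = {f"x{i}": float(i + 1) for i in range(10)}


def toy_score(selected):
    """Toy criterion (lower is better); a real analysis would refit models."""
    return -sum(importance[c] for c in selected)


greedy_seq, greedy_evals = greedy_forward_sequence(importance, toy_score)
fast_seq, fast_evals = preordered_sequence(importance, lambda c: -importance[c])
print(greedy_evals, fast_evals)  # 55 10
```

With $p = 10$ covariates the greedy search makes $p(p+1)/2 = 55$ evaluations against 10 for the pre-ordered version. In this toy setting both produce the same sequence, which need not hold in practice.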

The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. To select the best algorithm for a given set of data we must therefore use cross-validation to compare several candidate algorithms. Super Learner (SL) is an ensemble learning algorithm that uses cross-validation to select among a "library" of candidate algorithms. The SL is not restricted to a single prediction algorithm, but uses the strengths of a variety of learning algorithms to adapt to different datasets.]]>
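Cross-validated selection among a library can be sketched in a few lines. This is only the selection step (often called the discrete super learner), written without any SL package; the three candidate learners, the simulated data, and the squared-error loss are assumptions chosen to keep the example self-contained.

```python
import random
import statistics


def fit_mean(xs, ys):
    m = statistics.fmean(ys)
    return lambda x: m


def fit_median(xs, ys):
    m = statistics.median(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Simple least-squares line through (xs, ys)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cv_risk(fit, xs, ys, n_folds=5):
    """K-fold cross-validated mean squared error of one candidate learner."""
    n = len(xs)
    sq_err = 0.0
    for k in range(n_folds):
        test_idx = set(range(k, n, n_folds))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        sq_err += sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return sq_err / n


# Hypothetical library and simulated data with a linear signal.
library = {"mean": fit_mean, "median": fit_median, "linear": fit_linear}
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

risks = {name: cv_risk(fit, xs, ys) for name, fit in library.items()}
selected = min(risks, key=risks.get)
print(selected)
```

Because each candidate is scored only on held-out folds, the choice reflects out-of-sample performance rather than fit to the training data.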

While the SL has been shown to perform well in a number of settings, it has not been evaluated in the large electronic healthcare datasets that are common in recent pharmacoepidemiology and medical research. In this article, we applied the SL to three electronic healthcare datasets and evaluated its ability to predict treatment assignment (i.e., the propensity score). We considered a library of algorithms that consisted of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional propensity score (hdPS) variable selection algorithm. While the goal of the propensity score model is to control for confounding by balancing covariates across treatment groups, in this study we were interested in the predictive performance of these modeling strategies.

Prediction performance was assessed across three datasets using three metrics: likelihood, area under the curve (AUC), and time complexity. The results showed: 1. The hdPS often outperformed other algorithms that used only user-specified baseline variables. 2. The best individual algorithm was highly dependent on the data set; an ensemble data-adaptive method like SL generated the best predictive performance. 3. SL that utilized the hdPS methodology outperformed all other algorithms considered in this study. 4. Moreover, the results showed the reliability of SL: though the SL was optimized with respect to the negative likelihood, it performed best with respect to AUC in all three data sets.
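Two of the three metrics can be computed directly from predicted probabilities. The sketch below assumes that "likelihood" refers to the average negative Bernoulli log-likelihood and that AUC is the usual rank-based probability that a treated subject receives a higher predicted score than an untreated one; the small data set is hypothetical.

```python
import math


def negative_log_likelihood(y, p, eps=1e-12):
    """Average negative Bernoulli log-likelihood (lower is better)."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)


def auc(y, p):
    """Probability that a positive outranks a negative; ties count one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical treatment indicators and predicted propensity scores.
treatment = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(treatment, scores))  # 0.75
print(round(negative_log_likelihood(treatment, scores), 4))
```

The two metrics measure different things: AUC depends only on the ranking of the scores, while the log-likelihood also rewards calibration, so a model tuned for one is not guaranteed to do well on the other.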

Our pivotal estimator, whose definition hinges on the targeted minimum loss estimation (TMLE) principle, actually infers the mean reward under the current estimate of the optimal treatment rule. This data-adaptive statistical parameter is worthy of interest in its own right. Our main result is a central limit theorem that enables the construction of confidence intervals on both the mean reward under the current estimate of the optimal treatment rule and the mean reward under the optimal treatment rule itself. The asymptotic variance of the estimator takes the form of the variance of an efficient influence curve at a limiting distribution, which allows us to discuss the efficiency of inference.

As a by-product, we also derive confidence intervals on two cumulated pseudo-regrets, a key notion in the study of bandit problems. Seen as two additional data-adaptive statistical parameters, they compare the sum of the rewards actually received during the course of the experiment with either the sum of the means of the rewards or the counterfactual rewards we would have obtained if we had used the current estimate of the optimal treatment rule from the start to assign treatment.

A simulation study illustrates the procedure. One of the cornerstones of the theoretical study is a new maximal inequality for martingales with respect to the uniform entropy integral. ]]>

Methods: We compared exact *P*-values, valid by definition, with normal and logit-normal approximations in a simulated study of 40 cases and 160 controls. The key measure of biomarker performance was sensitivity at 90% specificity. Data for 3000 uninformative markers and 30 true markers were generated randomly, with 10 replications of the simulation. We also analyzed real data on 2371 antibody array markers measured in plasma from 121 cases with ER/PR positive breast cancer and 121 controls.

Results: Using the same discovery criterion, the valid exact *P*-values led to the discovery of 24 true and 82 false biomarkers, while approximate *P*-values yielded 15 true and 15 false biomarkers (normal approximation) and 20 true and 86 false biomarkers (logit-normal approximation). Moreover, the estimated numbers of true markers among those discovered were substantially incorrect for approximate *P*-values: the normal approximation estimated 0 true markers discovered but found 15; the logit-normal approximation estimated 42 but found 20. The exact method estimated 22, close to the actual number of 24 true discoveries. With real data, exact and approximate *P*-values ranked candidate breast cancer biomarkers very differently.

Conclusions: Exact *P*-values should be used because they are universally valid. Approximate *P*-values can lead to inappropriate biomarker selection rules and incorrect conclusions.

Impact: Rigorous data analysis methodology in discovery research may improve the yield of biomarkers that validate clinically.
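The performance measure used in this study, sensitivity at 90% specificity, can be computed empirically as sketched below. The convention that higher marker values indicate disease, the choice of threshold as an order statistic of the controls, and the example values are assumptions; the calculation of the exact *P*-values is not described in the abstract and is not attempted here.

```python
import math


def sensitivity_at_specificity(cases, controls, specificity=0.90):
    """Pick the smallest control value such that at least `specificity` of
    controls are at or below it, then return the fraction of cases strictly
    above that threshold, together with the threshold."""
    ordered = sorted(controls)
    # The small tolerance protects the ceiling from floating-point noise.
    k = max(1, math.ceil(specificity * len(ordered) - 1e-9))
    threshold = ordered[k - 1]
    sensitivity = sum(1 for c in cases if c > threshold) / len(cases)
    return sensitivity, threshold


# Hypothetical marker values for 10 controls and 4 cases.
controls = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cases = [5, 9.5, 12, 8]
sens, thr = sensitivity_at_specificity(cases, controls)
print(sens, thr)  # 0.5 9
```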

]]>