parameters by targeted learning from independent and identically distributed data in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all the data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is carried out on the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE), which enables the construction of confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results.
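The sub-sampling step can be sketched as follows: each unit is retained independently with its own inclusion probability, and the draw is rejected until the realized sample has the prescribed size. This is a minimal illustration, not the paper's implementation; the function name and arguments are ours.

```python
import numpy as np

def poisson_rejective_sample(inclusion_probs, target_size, rng=None, max_tries=100_000):
    """Draw a fixed-size sub-sample by Poisson rejective sampling:
    independent Bernoulli draws with unequal inclusion probabilities,
    rejected until the realized sample size equals `target_size`."""
    rng = np.random.default_rng(rng)
    probs = np.asarray(inclusion_probs, dtype=float)
    for _ in range(max_tries):
        mask = rng.random(probs.size) < probs
        if mask.sum() == target_size:
            return np.flatnonzero(mask)  # indices of the selected units
    raise RuntimeError("failed to draw a sample of the target size")
```

In the paper the inclusion probabilities are built from the summary measures and then optimized; here they are simply taken as given.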

The original C-TMLE procedure can be presented as a greedy forward stepwise algorithm. It does not scale well when the number $p$ of covariates increases drastically. This motivates the introduction of a novel template of C-TMLE procedures in which the covariates are pre-ordered. Its time complexity is $\mathcal{O}(p)$ as opposed to the original $\mathcal{O}(p^2)$, a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb for developing other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce an SL-C-TMLE procedure that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is $\mathcal{O}(p)$ as well.
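The complexity gain can be seen structurally: greedy forward stepwise selection re-scores every remaining covariate at each step ($\mathcal{O}(p^2)$ score evaluations), whereas a pre-ordered pass scores each nested candidate set exactly once ($\mathcal{O}(p)$). A toy sketch, with `score` standing in for a cross-validated loss (names are ours, not the software's):

```python
def preordered_selection(p, score):
    """Single forward pass over pre-ordered covariates: O(p) calls to
    `score` (lower is better), versus O(p^2) for greedy stepwise."""
    best_size, best_loss = 0, score([])
    included = []
    for j in range(p):  # covariates are assumed already pre-ordered
        included.append(j)
        loss = score(included)
        if loss < best_loss:
            best_loss, best_size = loss, len(included)
    return list(range(best_size))  # smallest prefix achieving the best loss
```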

Software written in Julia makes it easy to implement our variants of C-TMLE procedures. We use the software to assess their computational burdens in different scenarios; to compare their performances in simulation studies involving fully synthetic data or partially synthetic data based on a real, large electronic health database; and to showcase their application to the analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the vanilla C-TMLE procedure is unacceptably slow. Judging from the simulation studies, our pre-ordering strategies work well, and so does the SL-C-TMLE procedure.

The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. Super Learner (SL) is a generic ensemble learning algorithm that uses cross-validation to select among a "library" of candidate prediction models. The SL is not restricted to a single prediction model, but uses the strengths of a variety of learning algorithms to adapt to different databases. While the SL has been shown to perform well in a number of settings, it has not been thoroughly evaluated in the large electronic healthcare databases that are common in pharmacoepidemiology and comparative effectiveness research. In this study, we evaluated the ability of the SL to predict treatment assignment using three electronic healthcare databases. We considered a library of algorithms consisting of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional propensity score (hdPS) variable selection algorithm. Predictive performance was assessed using three metrics: the negative log-likelihood, area under the curve (AUC), and time complexity. Results showed that the best individual algorithm, in terms of predictive performance, varied across datasets. The SL was able to adapt to the given dataset and optimize predictive performance relative to any individual learner. Combining the SL with the hdPS was the most consistent prediction method and may be promising for PS estimation and prediction modeling in electronic healthcare databases.
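One common way to combine such a library, sketched below, is to regress the outcome on the cross-validated predictions of the candidate learners under a non-negativity constraint and normalize the resulting weights. This sketch uses squared-error loss for simplicity (the study used the negative log-likelihood), and all names are illustrative:

```python
import numpy as np
from scipy.optimize import nnls

def super_learner_weights(y, cv_preds):
    """Non-negative least-squares ensemble weights for K candidate
    learners, given an (n, K) matrix of cross-validated predictions;
    weights are normalized to sum to one."""
    w, _ = nnls(np.asarray(cv_preds, dtype=float), np.asarray(y, dtype=float))
    total = w.sum()
    if total == 0:  # degenerate case: fall back to a uniform combination
        return np.full(cv_preds.shape[1], 1.0 / cv_preds.shape[1])
    return w / total
```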

Our pivotal estimator, whose definition hinges on the targeted minimum loss estimation (TMLE) principle, actually infers the mean reward under the current estimate of the optimal treatment rule. This data-adaptive statistical parameter is worthy of interest on its own. Our main result is a central limit theorem which enables the construction of confidence intervals on both mean rewards under the current estimate of the optimal treatment rule and under the optimal treatment rule itself. The asymptotic variance of the estimator takes the form of the variance of an efficient influence curve at a limiting distribution, allowing us to discuss the efficiency of inference.
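In symbols (our notation, not the paper's), such a central limit theorem takes the familiar form: if $\psi_n$ denotes the TMLE and $\psi_{r_n}$ the mean reward under the current estimate $r_n$ of the optimal rule, then

```latex
\sqrt{n}\,\big(\psi_n - \psi_{r_n}\big) \rightsquigarrow \mathcal{N}(0, \sigma^2),
\qquad
\sigma^2 = \operatorname{Var}_{P}\, D^*(P)(O),
```

where $D^*(P)$ is the efficient influence curve evaluated at a limiting distribution $P$, which is what makes a discussion of efficiency possible.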

As a by-product, we also derive confidence intervals on two cumulated pseudo-regrets, a key notion in the study of bandit problems. Seen as two additional data-adaptive statistical parameters, they compare the sum of the rewards actually received during the course of the experiment with either the sum of the means of the rewards or the sum of the counterfactual rewards we would have obtained had we used, from the start, the current estimate of the optimal treatment rule to assign treatment.
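The two comparisons just described can be written down directly. A sketch with illustrative per-round arrays (in practice the means and counterfactual rewards are of course unknown and must be estimated):

```python
import numpy as np

def cumulated_pseudo_regrets(rewards, mean_rewards, counterfactual_rewards):
    """Compare the rewards actually received with (i) the means of the
    rewards and (ii) the counterfactual rewards under the current
    estimate of the optimal rule; inputs are per-round arrays, outputs
    are the two cumulated pseudo-regrets after each round."""
    received = np.cumsum(rewards)
    return (np.cumsum(mean_rewards) - received,
            np.cumsum(counterfactual_rewards) - received)
```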

A simulation study illustrates the procedure. One of the cornerstones of the theoretical study is a new maximal inequality for martingales with respect to the uniform entropy integral.

enrollment in the program. Targeted minimum loss-based estimation was used to estimate the mean outcome, while Super Learning was implemented to estimate the required nuisance parameters. Analyses were conducted with the ltmle R package; analysis code is available at an online repository as an R package. Results showed that at 450 days, the probability of in-care survival was 0.93 (95% CI: 0.91, 0.95) for subjects with immediate availability and enrollment, 0.87 (95% CI: 0.86, 0.87) for subjects with immediate availability never enrolling, and 0.91 (95% CI: 0.90, 0.92) for subjects without LREC availability. Immediate program availability without individual enrollment, compared to no program availability, was estimated to decrease survival slightly, albeit significantly, by 4% (95% CI: 0.03, 0.06; p < 0.01). Immediate availability and enrollment resulted in a 7% higher in-care survival compared to immediate availability with non-enrollment after 450 days (95% CI: -0.08, -0.05; p < 0.01). The results are consistent with a fairly small impact of both availability and enrollment in the LREC program on in-care survival.

The canonical gradient of the target parameter at a particular data distribution will depend on the data distribution through an infinite-dimensional nuisance parameter, which can be defined as the minimizer of the expectation of a loss function (e.g., log-likelihood loss). For many models and target parameters, the nuisance parameter can be split into two components: one required for evaluation of the target parameter and one real nuisance parameter. The only smoothness condition we enforce on the statistical model is that these nuisance parameters are multivariate real-valued càdlàg functions with finite supremum and variation norms.
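In symbols (our notation), the nuisance parameter is characterized as a risk minimizer,

```latex
\eta_0 \;=\; \operatorname*{arg\,min}_{\eta}\; E_{P_0}\, L(\eta)(O),
```

with $L$ a loss function such as the log-likelihood loss, the canonical gradient at $P_0$ depending on $P_0$ only through $\eta_0$.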

We propose a general one-step targeted minimum loss-based estimator (TMLE) based on an initial estimator of the nuisance parameters defined by a loss-based super-learner that uses cross-validation to combine a library of candidate estimators. We require this library to contain minimum loss-based estimators minimizing the empirical risk over the parameter space under the additional constraint that the variation norm is bounded by a set constant, across a set of constants whose maximum converges to infinity with sample size. We show that this super-learner is not only asymptotically equivalent to the best performing algorithm in the library, but also that it always converges to the true nuisance parameter values at a rate faster than $n^{-1/4}$. This minimal rate applies to each dimension of the data and even to nonparametric statistical models. We also demonstrate that these constant-specific minimum loss-based estimators can be implemented by minimizing the empirical risk over linear combinations of basis functions under the constraint that the sum of the absolute values of the coefficients is smaller than the constant (e.g., Lasso regression), making our proposed estimators practically feasible.
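The last point can be illustrated concretely: minimizing the empirical risk over linear combinations of basis functions under an $\ell_1$ bound on the coefficients corresponds, by Lagrangian duality, to a Lasso problem for some penalty level. A hedged sketch using scikit-learn, where the penalty `alpha` stands in for the variation-norm bound and all names are ours:

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_bounded_risk_minimizer(X, y, alpha=0.01):
    """Penalized counterpart of empirical-risk minimization over linear
    combinations of basis functions (columns of X) subject to a bound
    on the sum of absolute coefficients."""
    return Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
```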

Based on this rate of convergence of the super-learner of the nuisance parameter, we establish that this one-step TMLE is asymptotically efficient at any data-generating distribution in the model, under very weak structural conditions on the target parameter mapping and the model. We demonstrate our general theorems by constructing such a one-step TMLE of the average causal effect in a nonparametric model and presenting the corresponding efficiency theorem.

In this article, we propose a new group-sequential CARA RCT design and corresponding analytical procedure that admits the use of flexible data-adaptive techniques. The proposed design framework can target general adaptation optimality criteria that may not have a closed-form solution, thanks to a loss-based approach to defining and estimating the unknown optimal randomization scheme. Both in predicting the conditional response and in constructing the treatment randomization schemes, this framework uses loss-based data-adaptive estimation over general classes of functions (which may change with sample size). Because the randomization adaptation is response-adaptive, this innovative flexibility potentially translates into more effective adaptation towards the optimality criterion. To target the primary study parameter, the proposed analytical method provides robust inference of the parameter, despite arbitrarily mis-specified response models, under the most general settings.

Specifically, we establish that, under appropriate entropy conditions on the classes of functions, the resulting sequence of randomization schemes converges to a fixed scheme, and the proposed treatment effect estimator is consistent (even under a mis-specified response model), asymptotically Gaussian, and gives rise to valid confidence intervals of given asymptotic levels. Moreover, the limiting randomization scheme coincides with the unknown optimal randomization scheme when, simultaneously, the response model is correctly specified and the optimal scheme belongs to the limit of the user-supplied classes of randomization schemes. We illustrate the applicability of these general theoretical results with a LASSO-based CARA RCT. In this example, both the response model and the optimal treatment randomization are estimated using a sequence of LASSO logistic models that may increase with sample size. It follows immediately from our general theorems that this LASSO-based CARA RCT converges to a fixed design and yields consistent and asymptotically Gaussian effect estimates, under minimal conditions on the smoothness of the basis functions in the LASSO logistic models. We exemplify the proposed methods with a simulation study.
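A single group-sequential update might look like the following sketch: fit an $\ell_1$-penalized logistic response model on the data accrued so far, then derive a bounded randomization scheme from the predicted arm-specific responses. The function names, the Neyman-type allocation, and the clipping bounds are our illustrative choices, not the paper's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def update_randomization(W, A, Y, clip=(0.1, 0.9)):
    """One update of a CARA randomization scheme: l1-penalized logistic
    response model, then allocation probabilities proportional to the
    arm-specific outcome standard deviations, clipped away from 0/1."""
    X = np.column_stack([W, A])
    fit = LogisticRegression(penalty="l1", solver="liblinear").fit(X, Y)
    n = len(A)
    p1 = fit.predict_proba(np.column_stack([W, np.ones(n)]))[:, 1]
    p0 = fit.predict_proba(np.column_stack([W, np.zeros(n)]))[:, 1]
    s1, s0 = np.sqrt(p1 * (1 - p1)), np.sqrt(p0 * (1 - p0))
    return np.clip(s1 / (s1 + s0 + 1e-12), *clip)
```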

In this article we construct a one-dimensional universal least favorable submodel for which the TMLE only takes one step, and thereby requires minimal extra fitting with data to achieve its goal of solving the efficient influence curve equation. We generalize these to universal least favorable submodels through the relevant part of the data distribution as required for targeted minimum loss-based estimation, and to universal score-specific submodels for solving any other desired equation beyond the efficient influence curve equation. We demonstrate the one-step targeted minimum loss-based estimators based on such universal least favorable submodels for a variety of examples, showing that any of the goals for TMLE we previously achieved with local (typically multivariate) least favorable parametric submodels and an iterative TMLE can also be achieved with our new one-dimensional universal least favorable submodels, resulting in new one-step TMLEs for a large class of estimation problems previously addressed. Finally, remarkably, given a multidimensional target parameter, we develop a universal canonical one-dimensional submodel such that the one-step TMLE, only maximizing the log-likelihood over a univariate parameter, solves the multivariate efficient influence curve equation. This allows us to construct a one-step TMLE based on a one-dimensional parametric submodel through the initial estimator, that solves any multivariate desired set of estimating equations.
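For densities, the one-dimensional universal least favorable submodel can be summarized (in our notation, with $D^*$ the efficient influence curve and $p$ the initial density estimate) by the defining relation

```latex
\frac{d}{d\varepsilon}\,\log p_\varepsilon \;=\; D^*(p_\varepsilon),
\qquad p_0 = p,
\quad\text{i.e.}\quad
p_\varepsilon \;=\; p\,\exp\!\Big( \int_0^\varepsilon D^*(p_x)\,dx \Big),
```

so that the score of the submodel at every $\varepsilon$ is the efficient influence curve at $p_\varepsilon$ itself; this is what lets a single minimization over $\varepsilon$ solve the efficient influence curve equation.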