The collaborative double robust targeted maximum likelihood estimator (C-TMLE) is an extension of targeted minimum loss-based estimators (TMLE) that pursues an optimal strategy for estimation of the nuisance parameter. The original implementation of C-TMLE algorithm uses a greedy forward stepwise selection procedure to construct a nested sequence of candidate nuisance parameter estimators. Cross-validation is then used to select the candidate that minimizes bias in the estimate of the target parameter, rather than basing selection on the fit of the nuisance parameter model. C-TMLE has exhibited superior relative performance in analyses of sparse data, but the time complexity of the algorithm is $\mathcal{O}(p^2)$, where $p$, is the number of covariates available for inclusion in the model. Despite a criterion that allows for early termination, the greedy algorithm does not scale to large scale and high dimensional data.
This article introduces two scalable versions of C-TMLE. Each relies on an easily computed data adaptive pre-ordering of the variables. The time complexity of these scalable algorithms is $\mathcal{O}(p)$, and an early data adaptive stopping rule further reduces computation time without sacrificing statistical performance. We also introduce SL-CTMLE, an approach that uses super learning to select the best variable ordering from a set of ordering strategies. Simulation studies illustrate the performance of the scalable C-TMLEs relative to the original C-TMLE, the augmented inverse probability of treatment weighted estimator (A-IPTW), the probability of treatment weighting (IPTW) estimator, and standard TMLE using an external non-collaborative estimator of the treatment mechanism. Scalable C-TMLEs were also applied to three real-world health insurance claims datasets to estimate an average treatment effect. High-dimensional covariates were generated from the claims data based on high-dimensional propensity score (hdPS) screening. All C-TMLEs provided similar estimates and mean squared errors. Scalable C-TMLE analyses ran ten times faster than the original C-TMLE in larger datasets, making C-TMLE a feasible option for the analysis of large scale high dimensional data.



Included in

Biostatistics Commons