## U.C. Berkeley Division of Biostatistics Working Paper Series

#### Abstract

Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of these estimators is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them to achieve a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the C-TMLE procedure.

The original C-TMLE procedure can be presented as a greedy forward stepwise algorithm. It does not scale well when the number $p$ of covariates increases drastically. This motivates the introduction of a novel template for the C-TMLE procedure in which the covariates are pre-ordered. Its time complexity is $\mathcal{O}(p)$, as opposed to $\mathcal{O}(p^2)$ for the original procedure, a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb for developing other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce an SL-C-TMLE procedure that enables the data-driven choice of the better pre-ordering strategy for the problem at hand. Its time complexity is $\mathcal{O}(p)$ as well.
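To make the gap between the two complexities concrete, the following minimal sketch counts candidate evaluations only. It is an illustration, not the paper's software: the function names are hypothetical, each candidate evaluation is treated as one unit of work, and the cost of computing the pre-ordering itself is ignored. In a greedy forward stepwise search, every remaining covariate is tried at each step, giving $p + (p-1) + \dots + 1 = p(p+1)/2$ evaluations, whereas a pre-ordered search adds covariates in a fixed order and needs only $p$.

```python
def greedy_forward_evaluations(p: int) -> int:
    """Count candidate evaluations in a greedy forward stepwise search.

    Assumption: at each step, every covariate not yet selected is tried
    once, and the best one is added to the model.
    """
    evaluations = 0
    remaining = p
    while remaining > 0:
        evaluations += remaining  # try each remaining covariate once
        remaining -= 1            # the winner is added to the model
    return evaluations


def preordered_evaluations(p: int) -> int:
    """Count candidate evaluations when covariates are added in a fixed order.

    Assumption: the ordering is given, so each step evaluates one candidate.
    """
    return p


if __name__ == "__main__":
    for p in (10, 100, 1000):
        print(p, greedy_forward_evaluations(p), preordered_evaluations(p))
```

For $p = 100$, the greedy count is 5050 against 100 for the pre-ordered search, and the ratio grows linearly with $p$.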

A Julia software package makes it easy to implement our variants of the C-TMLE procedure. We use the software to assess their computational burdens in different scenarios; to compare their performances in simulation studies involving fully synthetic data or partially synthetic data based on a real, large electronic health database; and to showcase their application to the analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the vanilla C-TMLE procedure is unacceptably slow. Judging from the simulation studies, our pre-ordering strategies work well, and so does the SL-C-TMLE procedure.

Biostatistics
