"Scalable Collaborative Targeted Learning for High-dimensional Data" by Cheng Ju, Susan Gruber et al.

U.C. Berkeley Division of Biostatistics Working Paper Series

Title

Scalable Collaborative Targeted Learning for High-dimensional Data

Authors

Cheng Ju, Division of Biostatistics, University of California, Berkeley
Susan Gruber, Harvard Pilgrim Health Care Institute and Harvard Medical School
Samuel D. Lendle, Division of Biostatistics, University of California, Berkeley
Antoine Chambaz, University of California, Berkeley, Modal'X, Université Paris Nanterre, Paris, and MAP5, Université Paris Descartes et CNRS, Paris
Jessica M. Franklin, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School
Richard Wyss, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School
Sebastian Schneeweiss, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School
Mark J. van der Laan, Division of Biostatistics, University of California, Berkeley

Abstract

Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them for the sake of achieving a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the C-TMLE procedure.

The original C-TMLE procedure can be presented as a greedy forward stepwise algorithm. It does not scale well when the number $p$ of covariates increases drastically. This motivates the introduction of a novel template for C-TMLE procedures in which the covariates are pre-ordered. Its time complexity is $\mathcal{O}(p)$ as opposed to the original $\mathcal{O}(p^2)$, a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce an SL-C-TMLE procedure that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is $\mathcal{O}(p)$ as well.
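
The gap between $\mathcal{O}(p^2)$ and $\mathcal{O}(p)$ can be illustrated with a minimal Python sketch that only counts candidate evaluations. It is not the C-TMLE procedure itself and is not taken from the authors' Julia software: the function names, the toy loss `toy_loss`, the ranking `toy_rank`, and the `weights` vector are all hypothetical, and each "fit" stands in for whatever estimation step a real implementation would perform.

```python
from typing import Callable, List, Sequence, Tuple

def greedy_forward_order(
    p: int, loss: Callable[[Sequence[int]], float]
) -> Tuple[List[int], int]:
    """Greedy forward stepwise selection over p covariates.

    At every step, each remaining covariate is tried as the next addition,
    so the number of candidate fits is p + (p - 1) + ... + 1 = p(p + 1)/2.
    """
    selected: List[int] = []
    remaining = list(range(p))
    n_fits = 0
    while remaining:
        best, best_loss = remaining[0], float("inf")
        for j in remaining:
            n_fits += 1  # one candidate fit per remaining covariate
            candidate_loss = loss(selected + [j])
            if candidate_loss < best_loss:
                best, best_loss = j, candidate_loss
        selected.append(best)
        remaining.remove(best)
    return selected, n_fits

def preordered_order(
    p: int, rank: Callable[[int], float]
) -> Tuple[List[int], int]:
    """Pre-ordered variant: rank the covariates once, then add them in order.

    Only one candidate is considered at each step, so there are p fits.
    """
    order = sorted(range(p), key=rank)
    selected: List[int] = []
    n_fits = 0
    for j in order:
        selected.append(j)
        n_fits += 1  # a single candidate fit per step
    return selected, n_fits

# Hypothetical toy problem: each covariate has a fixed "usefulness" weight.
weights = [3, 9, 1, 7, 5, 2, 8, 4, 6, 0]

def toy_loss(subset: Sequence[int]) -> float:
    # Assumed loss: lower is better, and useful covariates lower it more.
    return -float(sum(weights[j] for j in subset))

def toy_rank(j: int) -> float:
    # Assumed pre-ordering: most useful covariate first.
    return -float(weights[j])

if __name__ == "__main__":
    _, greedy_fits = greedy_forward_order(len(weights), toy_loss)
    _, preordered_fits = preordered_order(len(weights), toy_rank)
    print("greedy candidate fits:", greedy_fits)
    print("pre-ordered candidate fits:", preordered_fits)
```

With $p = 10$ the greedy loop evaluates $10 + 9 + \dots + 1 = 55$ candidates while the pre-ordered loop evaluates 10. In this toy setting both return the same ordering, which need not hold in general; the sketch is meant only to show where the quadratic cost comes from.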

A Julia software package makes it easy to implement our variants of C-TMLE procedures. We use the software to assess their computational burdens in different scenarios; to compare their performances in simulation studies involving fully synthetic data or partially synthetic data based on a real, large electronic health database; and to showcase their application to the analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the vanilla C-TMLE procedure is unacceptably slow. Judging from the simulation studies, our pre-ordering strategies work well, and so does the SL-C-TMLE procedure.

Disciplines

Biostatistics

Suggested Citation

Ju, Cheng; Gruber, Susan; Lendle, Samuel D.; Chambaz, Antoine; Franklin, Jessica M.; Wyss, Richard; Schneeweiss, Sebastian; and van der Laan, Mark J., "Scalable Collaborative Targeted Learning for High-dimensional Data" (June 2016). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 352.
https://biostats.bepress.com/ucbbiostat/paper352

Included in

Biostatistics Commons
