Case-control sampling is an extremely common design for generating data to estimate the effects of exposures or treatments on a binary outcome of interest when the proportion of cases (i.e., subjects with binary outcome equal to 1) in the population of interest is low. It yields a biased sample of the target population by sampling a disproportionate number of cases. Case-control studies are also commonly employed to estimate the effects of genetic markers or biomarkers on phenotypes. The typical approach in practice is to fit (conditional) logistic regression models, ignoring the case-control sampling, in order to estimate the conditional odds ratios of being a case, given baseline covariates and the exposure of interest. Although these methods do not rely on knowing the true incidence probability (i.e., the probability of being a case) and provide valid logistic regression model-based estimates of the conditional effect of exposure on the odds ratio scale, they do not provide an estimate of a marginal causal odds ratio or causal relative risk, the causal parameters typically of interest in randomized trials comparing different treatment or exposure levels. By the same argument, these methods do not provide measures of marginal variable importance. In this article we focus on methods for causal inference and variable importance analysis for matched and unmatched case-control studies that rely on knowing the incidence probability, conditional on the matching variable if matching is used. We start by presenting, for both case-control designs, a simple intercept adjustment method that deterministically maps a logistic regression fit (possibly weighted, for matched case-control designs) into a valid model-based fit of the actual conditional probability of being a case, given the covariates. 
The resulting estimate of the conditional probability of being a case has the important property that its standard error is proportional to the incidence probability (divided by the square root of the sample size), so that the obtained precision is good enough for accurately estimating marginal causal relative risks or causal odds ratios even when the probability of being a case is extremely small. Subsequently, we present our general proposed methodology, involving a simple weighting scheme for cases and controls, that maps any estimation method for a parameter developed for prospective sampling from the population of interest into an estimation method based on case-control sampling from this population. For regular case-control designs the weighting relies only on knowing the true population proportion of cases or, equivalently, the true probability of being a case; for matched case-control sampling it also relies on knowing this proportion of cases within each population stratum of the matching variable. We show that this case-control weighting of an efficient estimator for a prospective sample from the target population of interest maps into an efficient estimator for matched and unmatched case-control sampling. We show how application of this generic methodology provides double robust, locally efficient targeted maximum likelihood estimators of the causal relative risk and causal odds ratio for regular and matched case-control sampling. We also illustrate such double robust targeted maximum likelihood estimators in marginal structural models and semi-parametric logistic regression models. 
Finally, we show that case-control studies nested in randomized trials allow estimation of the marginal causal relative risk or odds ratio, based on inverse probability of treatment weighted (IPTW) estimators, without the need to know the incidence probability, and we present the simple implications for observational case-control studies in which this incidence probability is not known but is known to be close to zero. Comparing these methods with the efficient method for the case in which the incidence probability is known shows that, even in randomized trials, knowledge of the incidence probability allows for significantly more precise estimation of causal parameters.
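
As a rough illustration of the two devices described above, the following sketch applies an intercept adjustment and a case-control weighting to a toy unmatched sample with one binary covariate. It is a minimal sketch under stated assumptions, not the article's implementation: the function names `case_control_weights` and `adjust_intercept`, the toy population, and the sample counts are all hypothetical. The weights are one standard form, total mass q0 spread over the cases and 1 - q0 over the controls. The intercept shift is the usual prior-correction identity for unmatched sampling. The sample counts are chosen to match the exact covariate distributions among cases and controls, so the identities hold without sampling noise.

```python
import numpy as np

def case_control_weights(y, q0):
    """Hypothetical helper: normalized case-control weights.

    Cases share total mass q0 and controls share total mass 1 - q0,
    where q0 is the known population incidence probability P(Y = 1).
    """
    y = np.asarray(y)
    n_cases = int(np.sum(y == 1))
    n_controls = int(np.sum(y == 0))
    return np.where(y == 1, q0 / n_cases, (1.0 - q0) / n_controls)

def logit(p):
    return np.log(p / (1.0 - p))

def adjust_intercept(beta0_cc, n_cases, n_controls, q0):
    """Hypothetical helper: shift a logistic intercept fitted on an
    unmatched case-control sample to the population scale."""
    p_star = n_cases / (n_cases + n_controls)  # sample proportion of cases
    return beta0_cc - logit(p_star) + logit(q0)

# Assumed toy population: P(X=1) = 0.4, P(Y=1|X=0) = 0.01, P(Y=1|X=1) = 0.03.
q0 = 0.6 * 0.01 + 0.4 * 0.03  # known incidence probability, 0.018

# Idealized case-control sample whose covariate counts match the exact
# distributions of X among cases (1/3, 2/3) and controls (594/982, 388/982).
y = np.array([1] * 300 + [0] * 982)
x = np.array([0] * 100 + [1] * 200 + [0] * 594 + [1] * 388)

w = case_control_weights(y, q0)

# Weighted moments recover population quantities; unweighted ones do not.
naive_px1 = x.mean()
weighted_px1 = np.sum(w * x)
risk_x1 = np.sum(w * y * (x == 1)) / np.sum(w * (x == 1))
risk_x0 = np.sum(w * y * (x == 0)) / np.sum(w * (x == 0))
relative_risk = risk_x1 / risk_x0

# Intercept adjustment: the saturated logistic fit on the sample has
# intercept equal to the sample log-odds of being a case at X = 0.
beta0_cc = np.log(100 / 594)
beta0_pop = adjust_intercept(beta0_cc, n_cases=300, n_controls=982, q0=q0)
```

In this toy sample the weighted quantities reproduce the assumed population values (P(X=1), the two risks, and their ratio), and the adjusted intercept equals the population log-odds at X = 0. The matched design would additionally require the incidence probability within each stratum of the matching variable, which this sketch does not cover.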


