Background: Molecular epidemiologic studies face a missing data problem as biospecimen data are often collected on only a proportion of subjects eligible for study.

Methods: We reviewed all molecular epidemiologic studies published in Cancer Epidemiology, Biomarkers & Prevention (CEBP) in 2009 to characterize the prevalence of missing data and to determine how the issue was addressed. We considered multiple imputation (MI), a readily available and easily implemented missing data technique, as a possible solution.

Results: Although the majority of studies had missing data, only 16% compared subjects with and without missing data. Furthermore, 95% of the studies with missing data performed a complete-case (CC) analysis, a method that can yield biased and inefficient estimates.

Conclusions: Missing data methods are not customarily incorporated into the analyses of molecular epidemiologic studies. Barriers may include a lack of awareness that missing data exist, particularly when availability of data is part of the inclusion criteria; the need for specialized software; and a perception that the CC approach is the gold standard. Standard MI is a reasonable solution that is valid when the data are missing at random (MAR). If the data are not missing at random (NMAR), we recommend MI over CC when strong auxiliary data are available. MI with the missing data mechanism specified is another alternative when the data are NMAR. In all cases, we recommend taking advantage of MI’s ability to account for the uncertainty of these assumptions.

Impact: Missing data methods are underutilized, which can deleteriously affect the interpretation of results.
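
Illustrative sketch: The contrast between CC analysis and standard MI under MAR can be shown on simulated data. The example below is a minimal sketch and not the analysis performed in the study. The simulated variables (a fully observed auxiliary variable `x` and a partially observed biomarker `y`), the normal linear imputation model, the number of imputations `m = 20`, and the helper names `impute_once` and `rubin_pool` are all assumptions made for illustration. Estimates from the imputed data sets are combined with Rubin's rules.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()              # pooled point estimate
    within = u.mean()             # average within-imputation variance
    between = q.var(ddof=1)       # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

def impute_once(x, y, observed, rng):
    """Fill in missing y once, using a normal linear regression of y on x.

    Regression parameters are drawn from their approximate posterior so
    that imputations reflect parameter uncertainty, not only residual noise.
    """
    X_obs = np.column_stack([np.ones(observed.sum()), x[observed]])
    beta_hat, *_ = np.linalg.lstsq(X_obs, y[observed], rcond=None)
    resid = y[observed] - X_obs @ beta_hat
    df = X_obs.shape[0] - X_obs.shape[1]
    sigma2 = resid @ resid / rng.chisquare(df)       # draw residual variance
    cov = sigma2 * np.linalg.inv(X_obs.T @ X_obs)
    beta = rng.multivariate_normal(beta_hat, cov)    # draw coefficients
    missing = ~observed
    X_mis = np.column_stack([np.ones(missing.sum()), x[missing]])
    y_imp = y.copy()
    y_imp[missing] = X_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), missing.sum())
    return y_imp

# Simulated data (hypothetical): the true mean of y is 1.0.
rng = np.random.default_rng(2009)
n = 2000
x = rng.normal(size=n)
y_full = 1.0 + 2.0 * x + rng.normal(size=n)

# MAR mechanism: the chance that y is observed depends only on observed x.
observed = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
y = np.where(observed, y_full, np.nan)

# Complete-case estimate of the mean of y.
cc_est = y[observed].mean()

# Multiple imputation: m completed data sets, pooled with Rubin's rules.
m = 20
estimates, variances = [], []
for _ in range(m):
    y_imp = impute_once(x, y, observed, rng)
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)
mi_est, mi_var = rubin_pool(estimates, variances)

print(f"true mean 1.0, CC estimate {cc_est:.2f}, "
      f"MI estimate {mi_est:.2f} (SE {np.sqrt(mi_var):.2f})")
```

In this setup subjects with larger `x` are more likely to have `y` measured, so the CC mean is pulled upward, while the MI estimate uses the auxiliary variable to correct for the selective missingness. Under NMAR the imputation model above would no longer be sufficient on its own.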



Included in: Epidemiology Commons