The ability to make scientific findings reproducible is increasingly important in areas where substantive results are the product of complex statistical computations. Reproducibility allows others to verify published findings and to conduct alternate analyses of the same data. A natural question is how one can conduct and distribute reproducible research. This question is relevant both to authors who want to make their research reproducible and to readers who want to reproduce relevant findings reported in the scientific literature. We present a framework in which reproducible research can be conducted and distributed via cached computations, and we describe specific tools for both authors and readers. As a prototype implementation, we introduce three software packages written in the R language. The cacheSweave and stashR packages together provide tools for caching computational results in a key-value style database that can be published to a public repository for readers to download. The SRPM package provides tools for generating and interacting with "shared reproducibility packages" (SRPs), which can facilitate the distribution of the data and code. As a case study, we demonstrate the use of the toolkit on a national study of air pollution exposure and mortality.
Numerical Analysis and Computation
Peng, Roger and Eckel, Sandrah P., "Distributed Reproducible Research Using Cached Computations" (June 2007). Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 147.