The project described was supported by NIEHS Award Number P42ES004705, the Berkeley Superfund Research Program. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIEHS or NIH.


Suppose one observes n i.i.d. copies of a random variable whose probability distribution is known to be an element of a particular statistical model. To define our statistical target, we partition the sample into V equal-size sub-samples and use this partitioning to define V splits into an estimation sample (one of the V sub-samples) and a corresponding complementary parameter-generating sample that is used to generate a target parameter. For each of the V parameter-generating samples, we apply an algorithm that maps the sample into a target parameter mapping, which represents the statistical target parameter generated by that parameter-generating sample. We define our sample-split data-adaptive statistical target parameter as the average of these V sample-specific target parameters. We present an analogue estimator of this type of data-adaptive target parameter and corresponding statistical inference. This general methodology for generating data-adaptive target parameters while still providing valid statistical inference is demonstrated with a number of examples. These examples demonstrate that this methodology presents new opportunities for statistical learning from data that go beyond the usual requirement that the estimand be defined a priori in order to allow for proper statistical inference. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference in problems that are increasingly being addressed by clever, yet ad hoc, pattern-finding methods; that is, the role of statisticians is being supplanted by computer scientists, who derive clever, yet typically ad hoc, methods that "discover" the interesting patterns in data.
The methodology presented in this paper can harness these methods and provide rigorous inference for the patterns, or target parameters, suggested by such procedures. In this way, it brings the practice of learning from data back within the proper domain of rigorous statistical inference. To suggest such potential, and to verify the predictions of the theory, we present simulation studies based upon algorithms that map the parameter-generating sample into the desired estimand. However, the methodology generalizes to situations where even these algorithms are not prespecified.
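
The sample-split construction described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions made only for illustration: the data are an n-by-p numeric array, the hypothetical algorithm `select_column` picks the column with the largest sample mean on the parameter-generating sample, the fold-specific target parameter is the mean of that column, and the standard error averages the fold-specific sample variances and divides by n. The function names, the seed, and the simulated data are all hypothetical.

```python
import numpy as np


def select_column(param_sample):
    """Hypothetical data-adaptive algorithm: index of the column with the largest sample mean."""
    return int(np.argmax(param_sample.mean(axis=0)))


def sample_split_estimate(data, V=5, seed=0):
    """Average of V fold-specific estimates, each for a target chosen on the complementary sample."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Partition the n observations into V (near-)equal estimation samples.
    folds = np.array_split(rng.permutation(n), V)

    estimates, variances, chosen = [], [], []
    for est_idx in folds:
        # The parameter-generating sample is the complement of the estimation sample.
        in_param_sample = np.ones(n, dtype=bool)
        in_param_sample[est_idx] = False
        j = select_column(data[in_param_sample])

        # Estimate the generated target parameter on the estimation sample only.
        x = data[est_idx, j]
        estimates.append(x.mean())
        variances.append(x.var(ddof=1))
        chosen.append(j)

    psi = float(np.mean(estimates))
    # Assumed variance estimator for this sketch: mean fold-specific variance over n.
    se = float(np.sqrt(np.mean(variances) / n))
    return psi, se, chosen


# Simulated data (an assumption of this sketch): three columns, the last with mean 5.
data = np.random.default_rng(1).normal(loc=[0.0, 0.0, 5.0], scale=1.0, size=(200, 3))
psi, se, chosen = sample_split_estimate(data, V=5)
print(psi, se, chosen)
print("approximate 95% interval:", (psi - 1.96 * se, psi + 1.96 * se))
```

Because each fold's target is selected without looking at that fold's estimation sample, the selection step does not bias the fold-specific estimate. This separation is what the sketch is meant to convey; any algorithm could be substituted for `select_column`.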


Biostatistics | Design of Experiments and Sample Surveys | Statistical Models | Statistical Theory | Survival Analysis | Vital and Health Statistics