The fundamental tension in statistical disclosure control (SDC) of microdata is the trade-off between protecting individual respondents and releasing enough information for statistical inference. We consider microdata that include key variables, which contain identifying information, and target variables, which contain sensitive information. Releasing the original data may expose some individuals in the sample to a high risk of disclosure; deleting the key variables is a common approach, but it discards information needed for some statistical analyses. This paper proposes selective multiple imputation of key variables (SMIKe) as an alternative SDC technique between those two extremes, and applies SMIKe to categorical key variables and continuous nonkey variables in the context of the general location model. The keys of sensitive cases and of a mixing set of selected nonsensitive cases are multiply imputed from their posterior predictive distributions, and each set of imputed keys is released to the public with the rest of the data. The size of the mixing set can be used to control the trade-off between information loss and protection. Data analysis is conducted using multiple imputation methods, with some correction necessary in the case of SMIKe. Simulation studies and an application of SMIKe to the 1995 Health and Ways of Living Survey in Alameda County are also presented.
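The selective imputation step described in the abstract can be illustrated with a rough sketch. The code below is not the authors' implementation: the function name `smike_sketch` and all of its parameters are hypothetical, the mixing set is chosen uniformly at random (an assumption; the abstract says only that the cases are "selected"), and only the category probabilities are drawn from a posterior, with the within-category means and the common covariance plugged in as estimates. A full posterior predictive draw would also sample those parameters. The correction to the analysis mentioned in the abstract is not shown.

```python
import numpy as np

def smike_sketch(keys, y, sensitive, mixing_size, n_imputations=5, seed=0):
    """Illustrative selective imputation of one categorical key (hypothetical helper).

    keys: (n,) categorical key; y: (n, p) continuous nonkey variables;
    sensitive: (n,) boolean flags; mixing_size: number of nonsensitive cases
    whose keys are imputed together with the sensitive ones.
    """
    rng = np.random.default_rng(seed)
    keys = np.asarray(keys)
    y = np.asarray(y, dtype=float)
    sensitive = np.asarray(sensitive, dtype=bool)
    n = y.shape[0]
    levels = np.unique(keys)  # sorted category labels

    # Assumption: the mixing set is a simple random sample of nonsensitive cases.
    nonsens = np.flatnonzero(~sensitive)
    mixing = rng.choice(nonsens, size=min(mixing_size, nonsens.size), replace=False)
    impute_idx = np.concatenate([np.flatnonzero(sensitive), mixing])

    # Simplified general location model: y | key = k is normal with a
    # category-specific mean and a covariance shared across categories.
    counts = np.array([(keys == k).sum() for k in levels])
    means = np.vstack([y[keys == k].mean(axis=0) for k in levels])
    resid = y - means[np.searchsorted(levels, keys)]
    cov = resid.T @ resid / max(n - levels.size, 1)
    cov_inv = np.linalg.pinv(cov)

    releases = []
    for _ in range(n_imputations):
        # Draw category probabilities from a Dirichlet posterior (prior weight 1/2
        # per category is an assumption); means and covariance are held fixed.
        pi = rng.dirichlet(counts + 0.5)
        diff = y[impute_idx, None, :] - means[None, :, :]       # (m, K, p)
        maha = np.einsum("mkp,pq,mkq->mk", diff, cov_inv, diff)  # squared distances
        logp = np.log(pi) - 0.5 * maha
        logp -= logp.max(axis=1, keepdims=True)                  # numerical stability
        prob = np.exp(logp)
        prob /= prob.sum(axis=1, keepdims=True)
        draws = np.array([rng.choice(levels, p=row) for row in prob])
        new_keys = keys.copy()
        new_keys[impute_idx] = draws  # all other keys are released unchanged
        releases.append(new_keys)
    return releases, impute_idx
```

In this sketch, a larger `mixing_size` means more released keys are imputed, which makes it harder to tell which records were sensitive but perturbs more of the data; this is the trade-off between protection and information loss that the abstract attributes to the size of the mixing set.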
Biostatistics | Design of Experiments and Sample Surveys
Little, Rod and Liu, Fang, "Selective Multiple Imputation of Keys for Statistical Disclosure Control in Microdata" (January 2003). The University of Michigan Department of Biostatistics Working Paper Series. Working Paper 6.