Simulation studies of the properties of estimators for parameters in population-averaged models for clustered or longitudinal data require suitable algorithms for data generation. The most useful algorithms for generating correlated binary data are those that allow general specifications of the marginal mean and correlation structures while being able to generate clusters of moderate to large size. Such methods, however, cannot reproduce data for all possible multivariate binary distributions. Given a vector of marginal means, they often restrict the range of correlations beyond the natural restrictions that apply to any multivariate binary distribution. Motivated by problems in biostatistics, we compare the algorithms of Emrich and Piedmonte (1991) and Qaqish (2003) with respect to the range restrictions they induce on correlations. Examples include generating longitudinal binary data and generating correlated binary data compatible with specified marginal means and covariance structures for bivariate, overdispersed binomial outcomes. Results show that both algorithms generally have good coverage: Qaqish's method gives a wider range of correlations for longitudinal data with autocorrelated within-subject associations, whereas Emrich and Piedmonte's method gives a wider range for clustered data with exchangeable-type correlations. Practical considerations for generating data with varying cluster sizes, or for subjects in longitudinal studies with missing data, are also discussed.
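The "natural restrictions" on correlations mentioned above can be illustrated in the simplest case of a single pair of binary variables. The sketch below is not taken from either algorithm under comparison; it only evaluates the standard Fréchet-type bounds on the correlation of two Bernoulli variables with given marginal means. The function name `pairwise_corr_bounds` and the example means are illustrative assumptions.

```python
import math


def pairwise_corr_bounds(p1, p2):
    """Return (lower, upper) bounds on the correlation of two Bernoulli
    variables with marginal means p1 and p2.

    These are the natural bounds implied by the requirement that all four
    joint cell probabilities be non-negative. The name and interface are
    hypothetical, chosen for illustration only.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("marginal means must lie strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    # Lower bound: P(Y1 = 1, Y2 = 1) >= max(0, p1 + p2 - 1)
    lower = max(-math.sqrt((p1 * p2) / (q1 * q2)),
                -math.sqrt((q1 * q2) / (p1 * p2)))
    # Upper bound: P(Y1 = 1, Y2 = 1) <= min(p1, p2)
    upper = min(math.sqrt((p1 * q2) / (p2 * q1)),
                math.sqrt((p2 * q1) / (p1 * q2)))
    return lower, upper


if __name__ == "__main__":
    # Equal means allow the full range; unequal means narrow it.
    for p1, p2 in [(0.5, 0.5), (0.2, 0.5), (0.1, 0.9)]:
        lo, hi = pairwise_corr_bounds(p1, p2)
        print(f"p1={p1}, p2={p2}: {lo:.4f} <= rho <= {hi:.4f}")
```

Satisfying these pairwise bounds is necessary but not sufficient for a full multivariate binary distribution to exist, and a particular generation algorithm may narrow the admissible range further, which is the kind of additional restriction the comparison above concerns.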


