The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series

Prevalence Estimation at the Cluster Level for Correlated Binary Data Using Random Partial-Cluster Sampling

Rujin Wang et al. — Mon, 12 Sep 2016 11:35:42 PDT

For clustered data in the medical sciences, disease is present when one or more of the observations in the cluster has the disease condition. This paper focuses on estimation of periodontal disease prevalence defined as the probability that one or more tooth sites have disease in a randomly selected subject. The prohibitive exam time and monetary cost of the full-mouth examination make partial-mouth recording protocols attractive alternative methods to assess chronic periodontitis. In particular, Beck et al. (2006) proposed the random site selection method (RSSM), which pre-specifies a fixed number of tooth sites to be selected randomly from each subject. RSSM could reduce the examination time, but standard estimators that define an individual's disease status solely in terms of selected sites tend to underestimate disease prevalence. We define each mouth as a cluster and disease status (presence or absence) at each tooth site as a binary variable. We describe a prevalence estimator based on the conditional linear family (CLF) of correlated binary distributions under the working assumptions of equal site-level means and exchangeable pairwise correlation for all within-cluster pairs of sites. We derive a variance estimator for the CLF-RSSM prevalence estimator by the delta method. Using simulated data, our prevalence estimator and its variance estimator have small to negligible bias and confidence intervals for prevalence have coverage near the 95% nominal level when the working model is correct. Taking missing teeth into consideration, the CLF-RSSM prevalence estimator has approximately 90% coverage in our simulations. Given more realistic unequal means and a dental correlation structure, the CLF-RSSM prevalence estimator and its standard deviation estimator do not perform well under model misspecification.
While the overall approach to the estimation of disease prevalence at the cluster level using partial cluster sampling is promising, new estimators that incorporate more realistic distributional assumptions of correlated binary data (e.g. tooth surfaces in a mouth) may be needed according to the application.
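
The underestimation by site-based estimators noted in this abstract can be illustrated with a short calculation. The sketch below is not the CLF-RSSM estimator. It only computes, for hypothetical per-subject counts of diseased sites, the probability that a simple random sample of sites contains at least one diseased site (a hypergeometric calculation), and averages that probability over subjects. The function names and the numbers in the usage example (168 sites per mouth, 20 sampled sites) are assumptions made for illustration.

```python
from math import comb


def detection_probability(n_sites, n_diseased, n_sampled):
    """P(at least one diseased site is among n_sampled sites drawn without replacement)."""
    if n_diseased == 0:
        return 0.0
    # Probability that every sampled site is healthy (hypergeometric).
    # comb() returns 0 when fewer healthy sites exist than sites sampled.
    p_all_healthy = comb(n_sites - n_diseased, n_sampled) / comb(n_sites, n_sampled)
    return 1.0 - p_all_healthy


def true_prevalence(diseased_counts):
    """Share of subjects with one or more diseased sites in the full mouth."""
    return sum(1 for k in diseased_counts if k > 0) / len(diseased_counts)


def expected_naive_prevalence(diseased_counts, n_sites, n_sampled):
    """Expected value of the estimator that calls a subject diseased
    only when a sampled site is diseased."""
    total = sum(detection_probability(n_sites, k, n_sampled) for k in diseased_counts)
    return total / len(diseased_counts)


if __name__ == "__main__":
    counts = [0, 0, 1, 5, 40]  # hypothetical diseased-site counts for five subjects
    print("true prevalence:", true_prevalence(counts))
    print("expected naive estimate:", expected_naive_prevalence(counts, 168, 20))
```

Because each subject's detection probability is at most 1, and strictly below 1 whenever some diseased site can be missed, the expected naive estimate cannot exceed the true prevalence. That gap is what a model-based correction has to close.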

Generalizing Evidence from Randomized Trials using Inverse Probability of Sampling Weights

Ashley L. Buchanan et al. — Mon, 14 Sep 2015 12:45:48 PDT

Results obtained in randomized trials may not generalize to specific target populations. In a randomized trial, the treatment assignment mechanism is known, but assuming participants are a random sample from the target population is often dubious. Lack of generalizability can occur when the distribution of treatment effect modifiers in trial participants differs from the distribution in the target population. We consider an inverse probability of sampling weighted (IPSW) estimator for generalizing trial results to a user-specified target population that differs in important clinical or demographic characteristics from the randomized trial. The IPSW estimator is shown to be consistent and asymptotically normal assuming a model for the sampling score (i.e., the probability of participating in the trial) is correctly specified. Expressions for the asymptotic variance and a consistent sandwich-type estimator of the variance are derived. Simulation results comparing the IPSW estimator and a previously proposed stratified estimator show that the estimators perform similarly when the sampling score model includes a binary covariate. However, with a continuous covariate in the sampling score model, the IPSW estimator is less biased and the corresponding Wald confidence interval has better coverage. The IPSW estimator is employed to generalize results from two randomized trials of HIV treatment conducted by the United States (US) National Institutes of Health AIDS Clinical Trials Group to all people currently living with HIV in the US.
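
The weighting idea in this abstract can be sketched in a few lines. The code below is a minimal illustration, not the report's exact estimator: it assumes the sampling scores have already been estimated and are supplied as inputs, and it uses a normalized (Hájek-type) weighted difference in means among trial participants. The function name and the toy data are hypothetical.

```python
def ipsw_effect(outcomes, treated, sampling_scores):
    """Weighted difference in mean outcome (treated minus control) among
    trial participants, weighting each by 1 / (probability of trial participation).

    Assumption: sampling_scores come from a separately fitted model.
    """
    num1 = den1 = num0 = den0 = 0.0
    for y, a, score in zip(outcomes, treated, sampling_scores):
        if not 0.0 < score <= 1.0:
            raise ValueError("sampling scores must lie in (0, 1]")
        weight = 1.0 / score
        if a:
            num1 += weight * y
            den1 += weight
        else:
            num0 += weight * y
            den0 += weight
    if den1 == 0.0 or den0 == 0.0:
        raise ValueError("both treatment arms must be non-empty")
    return num1 / den1 - num0 / den0


if __name__ == "__main__":
    # Toy trial: the first two participants come from a subgroup that is
    # under-represented in the trial (score 0.1) and has a smaller effect.
    y = [1.0, 0.0, 3.0, 0.0]
    a = [1, 0, 1, 0]
    scores = [0.1, 0.1, 0.5, 0.5]
    print("unweighted difference:", (1.0 + 3.0) / 2 - 0.0)
    print("IPSW difference:", ipsw_effect(y, a, scores))
```

In the toy data the weighted estimate moves toward the effect in the under-represented subgroup, which is the adjustment needed when the distribution of an effect modifier differs between the trial and the target population.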

Feature Elimination in Support Vector Machines and Empirical Risk Minimization

Sayan Dasgupta et al. — Thu, 27 Aug 2015 13:50:45 PDT

We develop an approach for feature elimination in support vector machines (and empirical risk minimization), based on recursive elimination of features. We present theoretical properties of this method and show that this is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present case studies to show that the assumptions are met in most practical situations and also present simulation studies to demonstrate performance of the proposed approach.

A Marginalized Zero-Inflated Negative Binomial Regression Model with Overall Exposure Effects

John S. Preisser et al. — Thu, 18 Dec 2014 06:58:32 PST

The zero-inflated negative binomial regression model (ZINB) is often employed in diverse fields such as dentistry, health care utilization, highway safety, and medicine, to examine relationships between exposures of interest and overdispersed count outcomes exhibiting many zeros. The regression coefficients of ZINB have latent class interpretations for a susceptible subpopulation at risk for the disease/condition under study with counts generated from a negative binomial distribution and for a non-susceptible subpopulation that provides only zero counts. The ZINB parameters, however, are not well-suited for estimating overall exposure effects, specifically, in quantifying the effect of an explanatory variable in the overall mixture population. In this paper, a marginalized zero-inflated negative binomial regression (MZINB) model for independent responses is proposed to model the population marginal mean count directly, providing straightforward inference for overall exposure effects based on maximum likelihood estimation. Through simulation studies, the performance of MZINB with respect to test size is compared to marginalized zero-inflated Poisson, Poisson, and negative binomial regression. The MZINB model is applied to data from a randomized clinical trial of three toothpaste formulations to prevent incident dental caries in a large population of Scottish schoolchildren.
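
A small numerical check shows why the overall mean is not read directly off the ZINB latent-class parameters. In a ZINB mixture with zero-inflation probability psi and negative binomial mean mu, the overall mean is (1 - psi) * mu, so an exposure that changes both psi and mu affects the overall mean through both. The sketch below verifies this identity by summing the mixture probability mass function. It is not the MZINB fitting procedure, and the parameterization (mean `mu`, dispersion `size`) and function names are assumptions chosen for illustration.

```python
from math import exp, lgamma, log


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0."""
    log_p = (
        lgamma(y + size) - lgamma(size) - lgamma(y + 1)
        + size * log(size / (size + mu))
        + y * log(mu / (size + mu))
    )
    return exp(log_p)


def zinb_pmf(y, psi, mu, size):
    """Mixture: structural zero with probability psi, NB count otherwise."""
    p = (1.0 - psi) * nb_pmf(y, mu, size)
    return psi + p if y == 0 else p


def zinb_marginal_mean(psi, mu):
    """Closed-form overall mean of the mixture population."""
    return (1.0 - psi) * mu


def zinb_mean_by_summation(psi, mu, size, y_max=2000):
    """Numerical mean obtained by summing y * P(Y = y) up to y_max."""
    return sum(y * zinb_pmf(y, psi, mu, size) for y in range(y_max + 1))


if __name__ == "__main__":
    print(zinb_marginal_mean(0.3, 2.0))
    print(zinb_mean_by_summation(0.3, 2.0, 1.5))
```

The two printed values should agree up to floating-point and truncation error.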

Doubly Robust Learning for Estimating Individualized Treatment with Censored Data

Ying-Qi Zhao et al. — Wed, 10 Dec 2014 07:08:44 PST

Individualized treatment rules recommend treatments based on individual patient characteristics in order to maximize clinical benefit. When the clinical outcome of interest is survival time, estimation is often complicated by censoring. We develop nonparametric methods for estimating an optimal individualized treatment rule in the presence of censored data. To adjust for censoring, we propose a doubly robust estimator which requires correct specification of either the censoring model or survival model, but not both; the method is shown to be Fisher consistent when either model is correct. Furthermore, we establish the convergence rate of the expected survival under the estimated optimal individualized treatment rule to the expected survival under the optimal individualized treatment rule. We illustrate the proposed methods using a simulation study and data from a Phase III clinical trial on non-small cell lung cancer.

sanon: An R Package for Stratified Analysis with Nonparametric Covariable Adjustment

Atsushi Kawaguchi et al. — Wed, 29 Oct 2014 07:15:38 PDT

Kawaguchi et al. (2011) provided methodology and applications for a stratified Mann-Whitney estimator that addresses the same comparison between two randomized groups for a strictly ordinal response variable as the van Elteren test statistic for randomized clinical trials with strata. The sanon package provides the implementation of the method within the R programming environment (R Core Team, 2012). The usage of sanon is illustrated with five examples. The first example is a randomized clinical trial with eight strata and a univariate ordinal response variable. The second example is a randomized clinical trial with four strata, two covariables, and four ordinal response variables. The third example is a crossover design randomized clinical trial with two strata, one covariable, and two ordinal response variables. The fourth example is a randomized clinical trial with seven strata (which are managed as a categorical covariable), three ordinal covariables with missing values, and three ordinal response variables with missing values. The fifth example is a randomized clinical trial with six strata, a categorical covariable with three levels, and three ordinal response variables with missing values.

Latent Supervised Learning for Estimating Treatment Effect Heterogeneity

Susan Wei et al. — Wed, 01 Oct 2014 15:20:50 PDT

It is oft observed in medicine that what works for one patient may not work for another. Determining for whom a treatment works and does not work is of great clinical interest. We propose a methodology to estimate treatment effect heterogeneity, i.e. to ascertain for which subpopulations a treatment is effective or harmful. The model studied assumes the relationship between an outcome of interest (e.g. blood pressure, cholesterol, survival) and a set of covariates (e.g. treatment, age, gender) is modified by a linear combination of a set of features (e.g. gene expression). Specifically a threshold on the linear combination divides the population into two subpopulations with different responses to treatment. Techniques from Latent Supervised Learning, a novel machine learning idea, are applied for model estimation, i.e. estimation of the linear combination and the corresponding threshold. Consistency of the estimator is established. In simulations the proposed methodology demonstrates high classification accuracy in a wide array of settings. Three data analysis examples are presented to illustrate the efficacy and applicability of the proposed methodology.

Latent Supervised Learning for Survival Data

Susan Wei et al. — Tue, 20 Aug 2013 08:52:32 PDT

Latent supervised learning is a machine learning technique for performing binary classification using a surrogate variable for the unobserved training label. We extend latent supervised learning to the case when the surrogate variable is a right-censored survival time. A motivating application for the proposed methodology is to stratify patients into two risk groups given a set of biomarkers. Sieve maximum likelihood estimation is employed for model estimation with special care taken to account for censoring. Consistency of the proposed estimator is established. Simulations show that the proposed estimator is accurate under a range of settings. Applications to real data examples demonstrate its advantages over a competing method; the proposed method produces more significant separation in survival on both training sets and held-out independent test sets.

Feature Elimination in Empirical Risk Minimization and Support Vector Machines

Sayan Dasgupta et al. — Thu, 18 Apr 2013 13:16:51 PDT

We develop an approach for feature elimination in empirical risk minimization and support vector machines, based on recursive elimination of features. We present theoretical properties of this method and show that this is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present case studies to show that the assumptions are met in most practical situations and also present simulation studies to demonstrate performance of the proposed approach.

Latent Supervised Learning

Susan Wei et al. — Mon, 18 Mar 2013 05:13:29 PDT

A new machine learning task is introduced, called latent supervised learning, where the goal is to learn a binary classifier from continuous training labels which serve as surrogates for the unobserved class labels. A specific model is investigated where the surrogate variable arises from a two-component Gaussian mixture with unknown means and variances, and the component membership is determined by a hyperplane in the covariate space. The estimation of the separating hyperplane and the Gaussian mixture parameters forms what shall be referred to as the change-line classification problem. A data-driven sieve maximum likelihood estimator for the hyperplane is proposed, which in turn can be used to estimate the parameters of the Gaussian mixture. The estimator is shown to be consistent. Simulations as well as empirical data show the estimator has high classification accuracy.

Cross-Validation for Nonlinear Mixed Effects Models

Emily Colby et al. — Thu, 14 Mar 2013 06:15:36 PDT

Cross-validation is frequently used for model selection in a variety of applications. However, it is difficult to apply cross-validation to mixed effects models (including the nonlinear mixed effects models) due to the fact that cross-validation requires “out-of-sample” predictions of the outcome variable, which cannot be easily calculated when random effects are present. We describe two novel variants of cross-validation that can be applied to nonlinear mixed effects models. One variant, where out-of-sample predictions are based on post hoc estimates of the random effects, can be used to select the overall structural model. Another variant, where cross-validation seeks to minimize the estimated random effects rather than the estimated residuals, can be used to select covariates to include in the model. We show that these methods produce accurate results in a variety of simulated data sets and apply them to two publicly available population pharmacokinetic data sets.

Parameter Estimation in Cox Proportional Hazard Models with Missing Censoring Indicators

Naomi Brownstein et al. — Mon, 28 Jan 2013 13:56:01 PST

In a prospective cohort study, examining all participants for incidence of the condition of interest may be prohibitively expensive. For example, the “gold standard” for diagnosing temporomandibular disorder (TMD) is a clinical examination by an expert dentist. In a large study, examining all subjects in this manner is infeasible. Instead, it is common to use a cheaper (and less reliable) examination to screen for possible incident cases and perform the “gold standard” examination only on participants who screen positive on this simpler examination. Unfortunately, some subjects may leave the study before receiving the “gold standard” examination. Within the framework of survival analysis, this results in missing censoring indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing censoring indicators. We estimate the probability of being a case for those with no “gold standard” examination using logistic regression. These predicted probabilities are used to generate multiple imputations of each missing case status and estimate the hazard ratios associated with each putative risk factor. The variance introduced by the procedure is estimated using multiple imputation. We simulate data with missing censoring indicators and show that our method performs as well as or better than the competing methods. Finally, we apply the proposed method to analyze data from the OPPERA study.

Reinforcement Learning Trees

Ruoqing Zhu et al. — Thu, 10 Jan 2013 07:20:28 PST

In this paper, we introduce a new type of tree-based regression method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001). The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction process. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach can be adapted to make high-dimensional cuts available at a relatively small computational cost. Second, we propose a variable screening method that progressively mutes noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards a terminal node when the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method. We can show that under the proposed splitting variable selection procedure, the constructed trees are consistent. The error bounds for the proposed RLT are shown to depend on a pre-selected number p₀, where p₀ is an educated guess of the number of strong variables which is usually much smaller than the total number of variables p but at least as large as the true number of strong variables p₁. Hence when p₀ is properly chosen, the error bounds can be significantly improved.

Identification of biologically relevant subtypes via preweighted sparse clustering

Sheila Gaynor et al. — Tue, 04 Dec 2012 04:10:31 PST

Cluster analysis methods are used to identify homogeneous subgroups in a data set. Frequently one applies cluster analysis in order to identify biologically interesting subgroups. In particular, one may wish to identify subgroups that are associated with a particular outcome of interest. Conventional clustering methods often fail to identify such subgroups, particularly when there are a large number of high-variance features in the data set. Conventional methods may identify clusters associated with these high-variance features when one wishes to obtain secondary clusters that are more interesting biologically or more strongly associated with a particular outcome of interest. We describe a modification of the sparse clustering method of Witten and Tibshirani (2010) that can be used to identify such secondary clusters or clusters associated with an outcome of interest. We show that this method can correctly identify such clusters of interest in several simulation scenarios. The method is also applied to a large case-control study of temporomandibular disorder and a breast cancer microarray data set.

Simple and accurate trend tests using a permutation approximation

Yi-Hui Zhou et al. — Tue, 16 Oct 2012 11:17:08 PDT

Permutation is an attractive approach to assess association between two vectors x and y, by comparing the observed statistic to the distribution induced by random permutation of one of the vectors. For a number of “standard” statistics, equivalent testing can be performed by using the sample Pearson correlation. Applications include the standard tests applied in the two-sample problem, simple linear regression, several generalized linear models, linear categorical trend tests, and rank-based association. We describe a simple approximation to the distribution of the correlation under permutation, providing accurate p-values that can be quickly computed for a variety of data types. The approximation may be especially useful in high-throughput applications in which a series of x-vectors is compared to one or more y-vectors.
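
For reference, the permutation distribution that such an approximation targets can be computed exactly when the vectors are very short. The sketch below enumerates every permutation of y and reports the two-sided p-value based on the sample Pearson correlation. It is a brute-force baseline, not the approximation described in the report, and the function names are hypothetical. The number of permutations grows as n!, so this is feasible only for roughly n ≤ 9; larger problems need sampled permutations or an analytic approximation.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def exact_permutation_pvalue(x, y, tol=1e-12):
    """Two-sided p-value: share of permutations of y whose |r| is at least
    as large as the observed |r| (tol guards against rounding ties)."""
    observed = abs(pearson(x, y))
    hits = 0
    total = 0
    for perm in permutations(y):
        total += 1
        if abs(pearson(x, perm)) >= observed - tol:
            hits += 1
    return hits / total


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [2.1, 1.9, 3.5, 4.2, 5.0]
    print("r =", pearson(x, y))
    print("p =", exact_permutation_pvalue(x, y))
```

The identity permutation is always counted, so the p-value is never zero, and the smallest attainable value is bounded below by 1/n!.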

Developing Adaptive Personalized Therapy for Cystic Fibrosis using Reinforcement Learning

Yiyun Tang et al. — Fri, 28 Sep 2012 14:15:29 PDT

Optimal clinical management of inherited chronic diseases, such as Cystic Fibrosis (CF), requires a dynamic approach which updates treatments to cope with the evolving course of illness and to tailor medicines and dosages for individual patients. In this paper, we examine the problem of computing optimal adaptive personalized therapy for CF patients. A temporal difference reinforcement learning method called fitted Q-iteration is utilized to discover the optimal treatment regimen directly from clinical data. We conduct a simulation study of virtual cystic fibrosis patients with Pseudomonas aeruginosa infection and antibiotic therapy with parameters tuned to approximately match published data from CF patients. Our simulation results indicate that reinforcement learning can be an effective tool in developing personalized therapy which optimizes the benefit-risk trade-off in multi-stage decision making and improves long term outcomes in chronic diseases.

Reader Reaction: On Variance Estimation for the Fine-Gray Model

Chenxi Li et al. — Mon, 20 Aug 2012 08:37:33 PDT

Geskus (2011, Biometrics, 67, 39-49) studied estimation of the Fine-Gray model for the cumulative incidence function with left-truncated, right-censored competing risks data. The limiting distribution for an estimator based on weighting inversely using weights involving estimates of the joint distribution of the truncation and censoring times was derived via classical martingale theory with variance estimation based on martingale results. In this note, we demonstrate that martingale theory is not applicable and that other theoretical arguments, like those in Fine and Gray (1999), are needed to rigorously establish the asymptotic properties of the estimators and to construct valid variance estimators. For inverse probability of censoring weighted estimators, the common wisdom is that martingale theory fails because of estimation of the censoring distribution in the weights. For the Fine-Gray model, alternative theoretical developments are needed even with a known censoring distribution.

A Comparison of Methods for Generating Correlated Binary Variates with Specified Marginal Means and Correlations

John S. Preisser Jr. et al. — Wed, 08 Aug 2012 08:00:16 PDT

Simulation studies employed to study properties of estimators for parameters in population-averaged models for clustered or longitudinal data require suitable algorithms for data generation. The most useful algorithms for generating correlated binary data are those that allow general specifications of the marginal mean and correlation structures, while being able to generate clusters of moderate to large size. Such methods, however, cannot reproduce data for all possible multivariate binary distributions. Given a vector of marginal means, they often place restrictions on the range of correlations beyond the natural restrictions applicable to any multivariate binary distribution. Motivated by problems in biostatistics, we compare the algorithms of Emrich and Piedmonte (1991) and Qaqish (2003) with respect to range restrictions induced on correlations. Examples include generating longitudinal binary data and generating correlated binary data compatible with specified marginal means and covariance structures for bivariate, overdispersed binomial outcomes. Results show that both algorithms generally have good coverage with Qaqish's method giving a wider range of correlations for longitudinal data having autocorrelated within-subject associations and Emrich and Piedmonte's method giving a wider range of correlations for clustered data having exchangeable-type correlations. Practical considerations for generating data with varying cluster sizes or for subjects in longitudinal studies with missing data are also discussed.

A Multistage Non‐inferiority Study Analysis Plan to Evaluate Successively More Stringent Criteria for a Clinical Trial with Rare Events

Siying Li et al. — Mon, 18 Jun 2012 05:57:46 PDT

We address a multistage clinical trial designed to assess a sequence of hypotheses in the noninferiority setting with rare events. Three successive hypotheses are used to evaluate whether the new treatment meets the criteria for new drug approval. Sample sizes for a five-stage trial for all hypotheses are calculated using Poisson and logrank sample size methods. Three strategies and corresponding analysis plans are developed to evaluate the sequential hypotheses. Simulations show the design is satisfactory with respect to controlled Type I error, sufficient power, and early success at interim analyses.

Change-Point Models to Estimate the Limit of Detection

Ryan C. May et al. — Mon, 05 Mar 2012 10:22:05 PST

In many biological and environmental studies, measured data is subject to a limit of detection. The limit of detection is generally defined as the lowest concentration of analyte that can be differentiated from a blank sample with some certainty. Data falling below the limit of detection is left-censored, falling below a level that is easily quantified by a measuring device. A great deal of interest lies in estimating the limit of detection for a particular measurement device. In this paper we propose an innovative change-point model to estimate the limit of detection using data from an experiment with known analyte concentrations. Estimation of the limit of detection proceeds by way of a two-stage maximum likelihood method. The proposed methodology is analyzed via simulation, and is applied to copy number data from an HIV pilot study. This method is shown to lead to improved estimation of the limit of detection.

Keywords: Change Point; Linear Calibration Curve; Limit of Detection; Two-Stage Maximum Likelihood