Johns Hopkins University, Dept. of Biostatistics Working Papers
Copyright (c) 2014 Johns Hopkins University. All rights reserved.
http://biostats.bepress.com/jhubiostat
Recent documents in Johns Hopkins University, Dept. of Biostatistics Working Papers
Fri, 29 Aug 2014 02:00:26 PDT
Partially-Latent Class Models (pLCM) for Case-Control Studies of Childhood Pneumonia Etiology
http://biostats.bepress.com/jhubiostat/paper267
Wed, 27 Aug 2014 12:50:27 PDT
In population studies on the etiology of disease, one goal is the estimation of the fraction of cases attributable to each of several causes. For example, pneumonia is a clinical diagnosis of lung infection that may be caused by viral, bacterial, fungal, or other pathogens. The study of pneumonia etiology is challenging because directly sampling from the lung to identify the etiologic pathogen is not standard clinical practice in most settings. Instead, measurements from multiple peripheral specimens are made. This paper considers the problem of estimating the population etiology distribution and the individual etiology probabilities. We formulate the scientific problem in statistical terms as estimating the posterior distribution of mixing weights and latent class indicators under a partially-latent class model (pLCM) that combines heterogeneous measurements with different error rates obtained from a case-control study. We introduce the pLCM as an extension of the latent class model. We also introduce graphical displays of the population data and inferred latent-class frequencies. The methods are illustrated with simulated and real data sets. The paper closes with a brief description of extensions of the pLCM to the regression setting and to the case where conditional independence among the measures is relaxed.
Zhenke Wu et al.
TARGETED MAXIMUM LIKELIHOOD ESTIMATION USING EXPONENTIAL FAMILIES
http://biostats.bepress.com/jhubiostat/paper266
Mon, 02 Jun 2014 10:14:08 PDT
Targeted maximum likelihood estimation (TMLE) is a general method for estimating parameters in semiparametric and nonparametric models. Each iteration of TMLE involves fitting a parametric submodel that targets the parameter of interest. We investigate the use of exponential families to define the parametric submodel. This implementation of TMLE gives a general approach for estimating any smooth parameter in the nonparametric model. A computational advantage of this approach is that each iteration of TMLE involves estimation of a parameter in an exponential family, which is a convex optimization problem for which software implementing reliable and computationally efficient methods exists. We illustrate the method in three estimation problems, involving the mean of an outcome missing at random, the parameter of a median regression model, and the causal effect of a continuous exposure, respectively. We conduct a simulation study comparing different choices for the parametric submodel, focusing on the first of these problems. To the best of our knowledge, this is the first study investigating robustness of TMLE to different specifications of the parametric submodel. We find that the choice of submodel can have an important impact on the behavior of the estimator in finite samples.
Iván Díaz et al.
Estimating population treatment effects from a survey sub-sample
http://biostats.bepress.com/jhubiostat/paper265
Wed, 21 May 2014 08:20:23 PDT
We consider the problem of estimating an average treatment effect for a target population from a survey sub-sample. Our motivating example is generalizing a treatment effect estimated in a sub-sample of the National Comorbidity Survey Replication Adolescent Supplement to the population of U.S. adolescents. To address this problem, we evaluate easy-to-implement methods that account for both non-random treatment assignment and a non-random two-stage selection mechanism. We compare the performance of a Horvitz-Thompson estimator using inverse probability weighting (IPW) and two double robust estimators in a variety of scenarios. We demonstrate that the two double robust estimators generally outperform IPW in terms of mean-squared error even under misspecification of one of the treatment, selection, or outcome models. Moreover, the double robust estimators are easy to implement, providing an attractive alternative to IPW for applied epidemiologic researchers. We demonstrate how to apply these estimators to our motivating example.
Kara E. Rudolph et al.
COX REGRESSION MODELS WITH FUNCTIONAL COVARIATES FOR SURVIVAL DATA
http://biostats.bepress.com/jhubiostat/paper264
Mon, 12 May 2014 10:27:34 PDT
We extend the Cox proportional hazards model to cases when the exposure is a densely sampled functional process, measured at baseline. The fundamental idea is to combine penalized signal regression with methods developed for mixed effects proportional hazards models. The model is fit by maximizing the penalized partial likelihood, with smoothing parameters estimated by a likelihood-based criterion such as AIC or EPIC. The model may be extended to allow for multiple functional predictors, time varying coefficients, and missing or unequally-spaced data. Methods were inspired by and applied to a study of the association between time to death after hospital discharge and daily measures of disease severity collected in the intensive care unit, among survivors of acute respiratory distress syndrome.
Jonathan E. Gellar et al.
LEVERAGING PROGNOSTIC BASELINE VARIABLES TO GAIN PRECISION IN RANDOMIZED TRIALS
http://biostats.bepress.com/jhubiostat/paper263
Mon, 05 May 2014 12:39:07 PDT
In a randomized trial, if baseline variables are correlated with the outcome, then appropriately adjusting for these can improve precision for estimating the average treatment effect. An example is the analysis of covariance (ANCOVA) estimator, which can be applied when the outcome is continuous, the quantity of interest is the difference in mean outcomes comparing treatment versus control, and a linear model with only main effects is used. ANCOVA has been shown to have the following desirable properties: it is guaranteed to be at least as precise as the standard unadjusted estimator, asymptotically, under no parametric model assumptions; furthermore, it is locally, semiparametric efficient. Recently, estimators have been developed that extend these desirable properties to a more general setting that allows: any real-valued outcome (e.g., binary, count, or continuous), contrasts other than the difference in mean outcomes (such as the relative risk or odds ratio), and estimators based on a large class of generalized linear models (including logistic regression) that may include interaction terms. Though the asymptotic properties of these new estimators have been established, they have not yet been applied to data distributions from randomized trials. We evaluate the practical performance of these estimators using simulations based on resampling data from completed randomized trials in HIV and stroke. In some cases, these estimators substantially improve power compared to standard estimators that ignore baseline variables. Given the large potential gains and relatively small costs, these estimators have potential to be useful in analyzing randomized trials. We provide guidance on how to select among many possible estimators, and recommend an estimator that is a practical compromise between computational complexity and statistical efficiency. 
R and SAS code is provided, which allows clinical investigators to assess whether these estimators could be useful in their specific trial contexts.
Elizabeth Colantuoni et al.
INTERADAPT -- AN INTERACTIVE TOOL FOR DESIGNING AND EVALUATING RANDOMIZED TRIALS WITH ADAPTIVE ENROLLMENT CRITERIA
http://biostats.bepress.com/jhubiostat/paper262
Fri, 14 Mar 2014 12:09:58 PDT
The interAdapt R package is designed to be used by statisticians and clinical investigators to plan randomized trials. It can be used to determine if certain adaptive designs offer tangible benefits compared to standard designs, in the context of investigators’ specific trial goals and constraints. Specifically, interAdapt compares the performance of trial designs with adaptive enrollment criteria versus standard (non-adaptive) group sequential trial designs. Performance is compared in terms of power, expected trial duration, and expected sample size. Users can either work directly in the R console, or with a user-friendly shiny application that requires no programming experience. Several added features are available when using the shiny application. For example, the application allows users to immediately download the results of the performance comparison as a csv-table, or as a printable, html-based report.
Aaron Joel Fisher et al.
VARIABLE-DOMAIN FUNCTIONAL REGRESSION FOR MODELING ICU DATA
http://biostats.bepress.com/jhubiostat/paper261
Wed, 05 Feb 2014 09:18:44 PST
We introduce a class of scalar-on-function regression models with subject-specific functional predictor domains. The fundamental idea is to consider a bivariate functional parameter that depends both on the functional argument and on the width of the functional predictor domain. Both parametric and nonparametric models are introduced to fit the functional coefficient. The nonparametric model is theoretically and practically invariant to functional support transformation, or support registration. Methods were motivated by and applied to a study of association between daily measures of the Intensive Care Unit (ICU) Sequential Organ Failure Assessment (SOFA) score and two outcomes: in-hospital mortality, and physical impairment at hospital discharge among survivors. Methods are generally applicable to a large number of new studies that record continuous variables over unequal domains.
Jonathan E. Gellar et al.
ADAPTIVE RANDOMIZED TRIAL DESIGNS THAT CANNOT BE DOMINATED BY ANY STANDARD DESIGN AT THE SAME TOTAL SAMPLE SIZE
http://biostats.bepress.com/jhubiostat/paper260
Fri, 31 Jan 2014 13:12:05 PST
Prior work has shown that certain types of adaptive designs can always be dominated by a suitably chosen, standard, group sequential design. This applies to adaptive designs with rules for modifying the total sample size. A natural question is whether analogous results hold for other types of adaptive designs. We focus on adaptive enrichment designs, which involve preplanned rules for modifying enrollment criteria based on accrued data in a randomized trial. Such designs often involve multiple hypotheses, e.g., one for the total population and one for a predefined subpopulation, such as those with high disease severity at baseline. We fix the total sample size, and consider overall power, defined as the probability of rejecting at least one false null hypothesis. We present adaptive enrichment designs whose overall power at two alternatives cannot simultaneously be matched by any standard design. In some scenarios there is a substantial gap between the overall power achieved by these adaptive designs and that of any standard design. We also prove that such gains in overall power come at a cost. To attain overall power above what is achievable by certain standard designs, it is necessary to increase power to reject some hypotheses and reduce power to reject others. We conclude by showing that the class of adaptive enrichment designs allows certain power tradeoffs that are not available when restricting to standard designs. We illustrate our results in the context of planning a hypothetical, randomized trial of a new antidepressant, using data distributions from Kirsch et al. (2008).
Michael Rosenblum
Joint Estimation of Multiple Graphical Models from High Dimensional Time Series
http://biostats.bepress.com/jhubiostat/paper259
Thu, 26 Dec 2013 07:09:18 PST
In this manuscript the problem of jointly estimating multiple graphical models in high dimensions is considered. It is assumed that the data are collected from n subjects, each of which consists of m non-independent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of the closeness between subjects. A kernel based method for jointly estimating all graphical models is proposed. Theoretically, under a double asymptotic framework, where both (m,n) and the dimension d can increase, the explicit rate of convergence in parameter estimation is provided, thus characterizing the strength one can borrow across different individuals and impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method.
Huitong Qiu et al.
Sparse Median Graphs Estimation in a High Dimensional Semiparametric Model
http://biostats.bepress.com/jhubiostat/paper258
Thu, 26 Dec 2013 07:05:22 PST
In this manuscript a unified framework for conducting inference on complex aggregated data in high dimensional settings is proposed. The data are assumed to be a collection of multiple non-Gaussian realizations with underlying undirected graphical structures. Utilizing the concept of median graphs in summarizing the commonality across these graphical structures, a novel semiparametric approach to modeling such complex aggregated data is provided along with robust estimation of the median graph, which is assumed to be sparse. The estimator is proved to be consistent in graph recovery and an upper bound on the rate of convergence is given. Experiments on both synthetic and real datasets are conducted to illustrate the empirical usefulness of the proposed models and methods.
Fang Han et al.
Soft Null Hypotheses: A Case Study of Image Enhancement Detection in Brain Lesions
http://biostats.bepress.com/jhubiostat/paper257
Wed, 26 Jun 2013 13:01:11 PDT
This work is motivated by a study of a population of multiple sclerosis (MS) patients using dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) to identify active brain lesions. At each visit, a contrast agent is administered intravenously to a subject and a series of images is acquired to reveal the location and activity of MS lesions within the brain. Our goal is to identify and quantify lesion enhancement location at the subject level and lesion enhancement patterns at the population level. With this example, we aim to address the difficult problem of transforming a qualitative scientific null hypothesis, such as "this voxel does not enhance", to a well-defined and numerically testable null hypothesis based on existing data. We call the procedure "soft null hypothesis" testing as opposed to the standard "hard null hypothesis" testing. This problem is fundamentally different from: 1) testing when a quantitative null hypothesis is given; 2) clustering using a mixture distribution; or 3) identifying a reasonable threshold with a parametric null assumption. We analyze a total of 20 subjects scanned at 63 visits (~30Gb), the largest population of such clinical brain images.
Haochang Shou et al.
TRIAL DESIGNS THAT SIMULTANEOUSLY OPTIMIZE THE POPULATION ENROLLED AND THE TREATMENT ALLOCATION PROBABILITIES
http://biostats.bepress.com/jhubiostat/paper256
Tue, 18 Jun 2013 09:04:25 PDT
Standard randomized trials may have lower than desired power when the treatment effect is only strong in certain subpopulations. This may occur, for example, in populations with varying disease severities or when subpopulations carry distinct biomarkers and only those who are biomarker positive respond to treatment. To address such situations, we develop a new trial design that combines two types of preplanned rules for updating how the trial is conducted based on data accrued during the trial. The aim is a design with greater overall power and that can better determine subpopulation specific treatment effects, while maintaining strong control of the familywise Type I error rate. The first component of our design involves response-adaptive randomization, in which the probability of being assigned to the treatment or control arm is updated during the trial to target an optimal allocation. The second component of our design involves enrichment, where the criteria for patient enrollment may be modified to help learn which subpopulations benefit from the treatment. We do a simulation study to compare the power of our design, which we call a response-adaptive enrichment design, to three simpler designs: a standard randomized trial design, a response-adaptive design, and an enrichment design. Our simulation study compares these designs in scenarios that arise from the problem of testing the effectiveness of a hypothetical new antidepressant.
Brandon S. Luber et al.
Structured Functional Principal Component Analysis
http://biostats.bepress.com/jhubiostat/paper255
Tue, 30 Apr 2013 09:35:21 PDT
Motivated by modern observational studies, we introduce a class of functional models that expands nested and crossed designs. These models account for the natural inheritance of correlation structure from sampling design in studies where the fundamental sampling unit is a function or image. Inference is based on functional quadratics and their relationship with the underlying covariance structure of the latent processes. A computationally fast and scalable estimation procedure is developed for ultra-high dimensional data. Methods are illustrated in three examples: high-frequency accelerometer data for daily activity, pitch linguistic data for phonetic analysis, and EEG data for studying electrical brain activity during sleep.
Haochang Shou et al.
PENALIZED FUNCTION-ON-FUNCTION REGRESSION
http://biostats.bepress.com/jhubiostat/paper254
Tue, 23 Apr 2013 09:50:30 PDT
We propose a general framework for smooth regression of a functional response on one or multiple functional predictors. Using the mixed model representation of penalized regression expands the scope of function-on-function regression to many realistic scenarios. In particular, the approach can accommodate a densely or sparsely sampled functional response as well as multiple functional predictors that are observed: 1) on the same or different domains than the functional response; 2) on a dense or sparse grid; and 3) with or without noise. It also allows for seamless integration of continuous or categorical covariates and provides approximate confidence intervals as a by-product of the mixed model inference. The proposed methods are accompanied by easy-to-use and robust software implemented in the pffr function of the R package refund. Methodological developments are general, but were inspired by and applied to a Diffusion Tensor Imaging (DTI) brain tractography dataset.
Andrada E. Ivanescu et al.
OPTIMAL TESTS OF TREATMENT EFFECTS FOR THE OVERALL POPULATION AND TWO SUBPOPULATIONS IN RANDOMIZED TRIALS, USING SPARSE LINEAR PROGRAMMING
http://biostats.bepress.com/jhubiostat/paper253
Tue, 23 Apr 2013 06:49:10 PDT
We propose new, optimal methods for analyzing randomized trials, when it is suspected that treatment effects may differ in two predefined subpopulations. Such subpopulations could be defined by a biomarker or risk factor measured at baseline. The goal is to simultaneously learn which subpopulations benefit from an experimental treatment, while providing strong control of the familywise Type I error rate. We formalize this as a multiple testing problem and show it is computationally infeasible to solve using existing techniques. Our solution involves a novel approach, in which we first transform the original multiple testing problem into a large, sparse linear program. We then solve this problem using advanced optimization techniques. This general method can solve a variety of multiple testing problems and decision theory problems related to optimal trial design, for which no solution was previously available. In particular, we construct new multiple testing procedures that satisfy minimax and Bayes optimality criteria. For a given optimality criterion, our new approach yields the optimal tradeoff between power to detect an effect in the overall population versus power to detect effects in subpopulations. We demonstrate our approach in examples motivated by two randomized trials of new treatments for HIV.
Michael Rosenblum et al.
Homotopic Group ICA for Multi-Subject Brain Imaging Data
http://biostats.bepress.com/jhubiostat/paper252
Thu, 07 Mar 2013 11:53:29 PST
Independent Component Analysis (ICA) is a computational technique for revealing latent factors that underlie sets of measurements or signals. It has become a standard technique in functional neuroimaging. In functional neuroimaging, so-called group ICA (gICA) seeks to identify and quantify networks of correlated regions across subjects. This paper reports on the development of a new group ICA approach, Homotopic Group ICA (H-gICA), for blind source separation of resting state functional magnetic resonance imaging (fMRI) data. Resting state brain functional homotopy is the similarity of spontaneous fluctuations between bilaterally symmetrically opposing regions (i.e. those symmetric with respect to the mid-sagittal plane) (Zuo et al., 2010). The proposed approach improves network estimates by leveraging this known brain functional homotopy. H-gICA increases the potential for network discovery, effectively by averaging information across hemispheres. It is theoretically proven to be identical to standard group ICA when the true sources are both perfectly homotopic and noise-free, while simulation studies and data explorations demonstrate its benefits in the presence of noise. Moreover, compared to commonly applied group ICA algorithms, the structure of the H-gICA input data leads to significant improvement in computational efficiency. A simulation study confirms its effectiveness in homotopic, non-homotopic and mixed settings, as well as on the landmark ADHD-200 dataset. From a relatively small subset of data, several brain networks were found including: the visual, the default mode and auditory networks, as well as others. These were shown to be more contiguous and clearly delineated than the corresponding ordinary group ICA. Finally, in addition to improving network estimation, H-gICA facilitates the investigation of functional homotopy via ICA-based networks.
Juemin Yang et al.
PREDICTING HUMAN MOVEMENT TYPE BASED ON MULTIPLE ACCELEROMETERS USING MOVELETS
http://biostats.bepress.com/jhubiostat/paper251
Thu, 07 Mar 2013 11:52:56 PST
We introduce statistical methods for prediction of types of human movement based on three tri-axial accelerometers worn simultaneously at the hip, left wrist, and right wrist. We compare the individual performance of the three accelerometers using movelets and propose a new prediction algorithm that integrates the information from all three accelerometers. The development is motivated by a study of 20 older subjects who were instructed to perform 15 different types of activities during in-laboratory sessions. The differences in the prediction performance for different activity types among the three accelerometers reveal subtle yet important insights into how the intrinsic physical features of human movements could be effectively utilized in prediction. The proposed integrative movelet method takes into account those findings to augment the prediction accuracy and improve our understanding of human movement measurements.
Bing He et al.
Adaptive, Group Sequential Designs that Balance the Benefits and Risks of Wider Inclusion Criteria
http://biostats.bepress.com/jhubiostat/paper250
Mon, 04 Feb 2013 10:03:35 PST
In designing a Phase III randomized trial, care must be taken in selecting the target population. Advantages of enrolling from a larger population include wider generalizability of results and faster recruitment. However, earlier trials (e.g. Phase II trials) and medical knowledge may provide stronger evidence of a treatment effect for certain subpopulations. This makes a Phase III trial that targets the overall population more risky, since if the treatment only benefits a subpopulation, there may be low power to detect this. We propose new adaptive, group sequential designs aimed at gaining the advantages of wider generalizability and faster recruitment, while mitigating the risks of including a population for which there is greater a priori uncertainty. These designs use preplanned rules for changing the enrollment criteria if the participants from predefined subpopulations are not benefiting from the new treatment. We demonstrate these adaptive designs in the context of a Phase III trial of a new treatment for stroke, and compare them to standard, group sequential designs in terms of expected sample size.
Michael Rosenblum et al.
FAST COVARIANCE ESTIMATION FOR HIGH-DIMENSIONAL FUNCTIONAL DATA
http://biostats.bepress.com/jhubiostat/paper249
Wed, 09 Jan 2013 09:11:16 PST
For smoothing covariance functions, we propose two fast algorithms that scale linearly with the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension J x J with J > 500; the recently introduced sandwich smoother is an exception, but it is not adapted to smooth covariance matrices of large dimensions such as J ≥ 10,000. Covariance matrices of order J = 10,000, and even J = 100,000, are becoming increasingly common, e.g., in 2- and 3-dimensional medical imaging and high-density wearable sensor data. We introduce two new algorithms that can handle very large covariance matrices: 1) FACE: a fast implementation of the sandwich smoother and 2) SVDS: a two-step procedure that first applies singular value decomposition to the data matrix and then smooths the eigenvectors. Compared to existing techniques, these new algorithms are at least an order of magnitude faster in high dimensions and drastically reduce memory requirements. The new algorithms provide instantaneous (few seconds) smoothing for matrices of dimension J = 10,000 and very fast (< 10 minutes) smoothing for J = 100,000. Although SVDS is simpler than FACE, we provide ready-to-use, scalable R software for FACE. When incorporated into the R package refund, FACE improves the speed of penalized functional regression by an order of magnitude, even for data of normal size (J < 500). We recommend that FACE be used in practice for the analysis of noisy and high-dimensional functional data.
Luo Xiao et al.
LONGITUDINAL FUNCTIONAL MODELS WITH STRUCTURED PENALTIES
http://biostats.bepress.com/jhubiostat/paper248
Fri, 02 Nov 2012 11:29:56 PDT
Collection of functional data, including longitudinal observations, is becoming increasingly common in many studies. For example, we use magnetic resonance (MR) spectra collected over a period of time from late stage HIV patients. MR spectroscopy (MRS) produces a spectrum which is a mixture of metabolite spectra, instrument noise and baseline profile. Analysis of such data typically proceeds in two separate steps: feature extraction and regression modeling. In contrast, a recently proposed approach for functional linear models, called partially empirical eigenvectors for regression (PEER) (Randolph, Harezlak and Feng, 2012), incorporates a priori knowledge via a scientifically-informed penalty operator in the regression function estimation process. We extend the scope of PEER to the longitudinal setting with continuous outcomes and longitudinal functional covariates. The method presented in this paper: 1) takes into account external information; and 2) allows for a time-varying regression function. In the proposed approach, we express the time-varying regression function as a linear combination of several time-invariant component functions; the time dependence enters into the regression function through their coefficients. The estimation procedure is easy to implement due to its mixed model equivalence. We derive the precision and accuracy of the estimates and discuss their connection with the generalized singular value decomposition. Real MRS data and simulations are used to illustrate the concepts.
Madan G. Kundu et al.