UPenn Biostatistics Working Papers
Copyright (c) 2016 University of Pennsylvania. All rights reserved.
http://biostats.bepress.com/upennbiostat
Recent documents in UPenn Biostatistics Working Papers
Sun, 27 Mar 2016 01:30:58 PDT

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling
http://biostats.bepress.com/upennbiostat/art46
Fri, 25 Mar 2016 12:37:14 PDT
The ability to simulate correlated binary data is important for sample size calculation and comparison of methods for analysis of clustered and longitudinal data with dichotomous outcomes. One available approach for simulating length n vectors of dichotomous random variables is to sample from the multinomial distribution of all possible length n permutations of zeros and ones. However, the multinomial sampling method has only been implemented in general form (without first making restrictive assumptions) for vectors of length 2 and 3, because specifying the multinomial distribution is very challenging for longer vectors. I overcome this difficulty by presenting an algorithm for simulating correlated binary data via multinomial sampling that can be easily applied to directly compute the multinomial distribution for any n. I demonstrate the approach to simulate vectors of length 4 and 8 in an assessment of power during the planning phases of a study and to assess the choice of working correlation structure in an analysis with generalized estimating equations.
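The multinomial sampling idea can be illustrated for the simplest case, n = 2, where the four cell probabilities follow directly from the two marginal means and the correlation. The sketch below is not the algorithm of the paper (which handles general n); the function names and the example values 0.3, 0.5, and 0.2 are assumptions chosen for illustration.

```python
import math
import random

def bivariate_binary_pmf(p1, p2, rho):
    """Cell probabilities for (Y1, Y2) with means p1, p2 and correlation rho."""
    q1, q2 = 1 - p1, 1 - p2
    # P(Y1=1, Y2=1) = E[Y1*Y2] = p1*p2 + Cov(Y1, Y2)
    p11 = p1 * p2 + rho * math.sqrt(p1 * q1 * p2 * q2)
    pmf = {
        (1, 1): p11,
        (1, 0): p1 - p11,
        (0, 1): p2 - p11,
        (0, 0): 1 - p1 - p2 + p11,
    }
    # A negative cell means no joint distribution has these means and rho.
    if min(pmf.values()) < 0:
        raise ValueError("rho is not feasible for these marginal means")
    return pmf

def sample_pairs(pmf, size, rng):
    """Draw `size` vectors by sampling cells of the multinomial distribution."""
    outcomes = list(pmf)
    weights = [pmf[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=size)

# Illustrative values only (assumed, not taken from the paper).
pmf = bivariate_binary_pmf(0.3, 0.5, 0.2)
rng = random.Random(0)
draws = sample_pairs(pmf, 10000, rng)
```

For longer vectors the same sampling step applies once the 2^n cell probabilities are known; computing those probabilities is the part the paper addresses.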
Justine Shults

Maximum Likelihood Based Analysis of Equally Spaced Longitudinal Count Data with Specified Marginal Means, First-order Antedependence, and Linear Conditional Expectations
http://biostats.bepress.com/upennbiostat/art45
Fri, 25 Mar 2016 12:30:11 PDT
This manuscript implements a maximum likelihood based approach that is appropriate for equally spaced longitudinal count data with over-dispersion, so that the variance of the outcome variable is larger than expected for the assumed Poisson distribution. We implement the proposed method in the analysis of two data sets and make comparisons with the semi-parametric generalized estimating equations (GEE) approach that incorrectly ignores the over-dispersion. The simulations demonstrate that the proposed method has better small sample efficiency than GEE. We also provide code in R that can be used to recreate the analysis results that we provide in this manuscript.
Victoria Gamerman et al.

Statistical estimation of white matter microstructure from conventional MRI
http://biostats.bepress.com/upennbiostat/art44
Wed, 23 Dec 2015 08:26:15 PST
Diffusion tensor imaging (DTI) has become the predominant modality for studying white matter integrity in multiple sclerosis (MS) and other neurological disorders. Unfortunately, the use of DTI-based biomarkers in large multi-center studies is hindered by systematic biases that confound the study of disease-related changes. Furthermore, the site-to-site variability in multi-center studies is significantly higher for DTI than for conventional MRI-based markers. In our study, we apply the Quantitative MR Estimation Employing Normalization (QuEEN) model to estimate the four DTI measures: MD, FA, RD, and AD. QuEEN uses a voxel-wise generalized additive regression model to relate the normalized intensities of one or more conventional MRI modalities to a quantitative modality, such as DTI. We assess the accuracy of the models by comparing the prediction error of estimated DTI images to the scan-rescan error in subjects with two sets of scans. Across the four DTI measures, the performance of the models is not consistent: both MD and RD estimations appear to be quite accurate, while AD estimation is less accurate than MD and RD; the accuracy of FA estimation is poor. Thus, in some cases when assessing white matter integrity, it may be sufficient to acquire conventional MRI sequences alone.
Leah Suttner et al.

Removing inter-subject technical variability in magnetic resonance imaging studies
http://biostats.bepress.com/upennbiostat/art43
Wed, 28 Oct 2015 14:16:00 PDT
Magnetic resonance imaging (MRI) intensities are acquired in arbitrary units, making scans non-comparable across sites and between subjects. Intensity normalization is a first step for the improvement of comparability of the images across subjects. However, we show that unwanted inter-scan variability associated with imaging site, scanner effect and other technical artifacts is still present after standard intensity normalization in large multi-site neuroimaging studies. We propose RAVEL (Removal of Artificial Voxel Effect by Linear regression), a tool to remove residual technical variability after intensity normalization. As proposed by SVA and RUV [Leek and Storey, 2007, 2008, Gagnon-Bartsch and Speed, 2012], two batch effect correction tools largely used in genomics, we decompose the voxel intensities of images registered to a template into a biological component and an unwanted variation component. The unwanted variation component is estimated from a control region obtained from the cerebrospinal fluid (CSF), where intensities are known to be unassociated with disease status and other clinical covariates. We perform a singular value decomposition (SVD) of the control voxels to estimate factors of unwanted variation. We then estimate the unwanted factors using linear regression for every voxel of the brain and take the residuals as the RAVEL-corrected intensities. We assess the performance of RAVEL using T1-weighted (T1-w) images from more than 900 subjects with Alzheimer’s disease (AD) and mild cognitive impairment (MCI), as well as healthy controls from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We compare RAVEL to intensity-normalization-only methods, histogram matching, and White Stripe. 
We show that RAVEL performs best at improving the replicability of the brain regions that are empirically found to be most associated with AD, and that these regions are significantly more present in structures impacted by AD (hippocampus, amygdala, parahippocampal gyrus, entorhinal area, and fornix stria terminalis). In addition, we show that the RAVEL-corrected intensities have the best performance in distinguishing between MCI subjects and healthy subjects by using the mean hippocampal intensity (AUC=67%), a marked improvement compared to results from intensity normalization alone (AUC=63% and 59% for histogram matching and White Stripe, respectively). RAVEL is generalizable to many imaging modalities, and shows promise for longitudinal studies. Additionally, because the choice of the control region is left to the user, RAVEL can be applied in studies of many brain disorders.
Jean-Philippe Fortin et al.

Control-Group Feature Normalization for Multivariate Pattern Analysis Using the Support Vector Machine
http://biostats.bepress.com/upennbiostat/art42
Thu, 24 Sep 2015 14:45:43 PDT
Normalization of feature vector values is a common practice in machine learning. Generally, each feature value is standardized to the unit hypercube or by normalizing to zero mean and unit variance. Classification decisions based on support vector machines (SVMs) or by other methods are sensitive to the specific normalization used on the features. In the context of multivariate pattern analysis using neuroimaging data, standardization effectively up- and down-weights features based on their individual variability. Since the standard approach uses the entire data set to guide the normalization it utilizes the total variability of these features. This total variation is inevitably dependent on the amount of marginal separation between groups. Thus, such a normalization may attenuate the separability of the data in high dimensional space. In this work we propose an alternate approach that uses an estimate of the control-group standard deviation to normalize features before training. We also show that control-based normalization provides better interpretation with respect to the estimated multivariate disease pattern and improves the classifier performance in many cases.
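A minimal sketch of the control-based idea: each feature is scaled by the standard deviation estimated from control subjects only, rather than from the pooled sample. Centering on the control-group mean is an assumption made here for illustration (the abstract mentions only the standard deviation), and the function name and toy data are hypothetical.

```python
from statistics import mean, stdev

def control_group_scale(features, is_control):
    """Center and scale each feature column using control-group statistics only."""
    n_features = len(features[0])
    controls = [row for row, flag in zip(features, is_control) if flag]
    # Assumption: center on the control mean; scale by the control SD.
    centers = [mean([row[j] for row in controls]) for j in range(n_features)]
    scales = [stdev([row[j] for row in controls]) for j in range(n_features)]
    return [
        [(row[j] - centers[j]) / scales[j] for j in range(n_features)]
        for row in features
    ]

# Toy data (assumed): three controls followed by two patients, two features.
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [10.0, 13.0], [12.0, 11.0]]
is_control = [True, True, True, False, False]
Z = control_group_scale(X, is_control)
```

In this toy example the first feature separates the groups strongly; pooled standardization would inflate its scale and shrink that separation, whereas control-based scaling leaves the patient values far from zero. The scaled matrix would then be passed to the classifier of choice.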
Kristin A. Linn et al.

Addressing Confounding in Predictive Models with an Application to Neuroimaging
http://biostats.bepress.com/upennbiostat/art41
Thu, 24 Sep 2015 14:39:04 PDT
Understanding structural changes in the brain that are caused by a particular disease is a major goal of neuroimaging research. Multivariate pattern analysis (MVPA) comprises a collection of tools that can be used to understand complex disease effects across the brain. We discuss several important issues that must be considered when analyzing data from neuroimaging studies using MVPA. In particular, we focus on the consequences of confounding by non-imaging variables such as age and sex on the results of MVPA. After reviewing current practice to address confounding in neuroimaging studies, we propose an alternative approach based on inverse probability weighting. Although the proposed method is motivated by neuroimaging applications, it is broadly applicable to many problems in machine learning and predictive modeling. We demonstrate the advantages of our approach on simulated and real data examples.
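To make the weighting idea concrete, the sketch below computes inverse probability weights for a single discrete confounder, estimating the probability of group membership within each confounder stratum by simple frequencies. This is a simplified stand-in, not the estimator of the paper; the function name and the toy data are assumptions.

```python
from collections import Counter

def ipw_weights(group, confounder):
    """Weight each subject by 1 / P(group | confounder stratum)."""
    stratum_n = Counter(confounder)
    cell_n = Counter(zip(group, confounder))
    # Empirical P(group = g | stratum = c) is cell_n[(g, c)] / stratum_n[c].
    return [stratum_n[c] / cell_n[(g, c)] for g, c in zip(group, confounder)]

# Toy data (assumed): disease (1) is more common among older subjects.
group = [1, 1, 1, 0, 1, 0, 0, 0]
confounder = ["old", "old", "old", "old", "young", "young", "young", "young"]
weights = ipw_weights(group, confounder)

# Weighted size of each (group, stratum) cell after reweighting.
weighted = Counter()
for g, c, w in zip(group, confounder, weights):
    weighted[(g, c)] += w
```

After weighting, each group has the same weighted age distribution, so a classifier trained with these sample weights can no longer separate the groups through age alone.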
Kristin A. Linn et al.

Nonparametric methods for doubly robust estimation of continuous treatment effects
http://biostats.bepress.com/upennbiostat/art39
Tue, 22 Sep 2015 12:25:36 PDT
Continuous treatments (e.g., doses) arise often in practice, but available causal effect estimators require either parametric models for the effect curve or else consistent estimation of a single nuisance function. We propose a novel doubly robust kernel smoothing approach, which requires only mild smoothness assumptions on the effect curve and allows for misspecification of either the treatment density or outcome regression. We derive asymptotic properties and also discuss an approach for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmissions penalties.
Edward H. Kennedy et al.

Regression modeling of longitudinal binary outcomes with outcome-dependent observation times
http://biostats.bepress.com/upennbiostat/art38
Tue, 10 Mar 2015 15:50:39 PDT
Conventional longitudinal data analysis methods assume that outcomes are independent of the data-collection schedule. However, the independence assumption may be violated, for example, when adverse events trigger additional physician visits in between prescheduled follow-ups. Observation times may therefore be associated with outcome values, which may introduce bias when estimating the effect of covariates on outcomes using standard longitudinal regression methods. Existing semi-parametric methods that accommodate outcome-dependent observation times are limited to the analysis of continuous outcomes. We develop new methods for the analysis of binary outcomes, while retaining the flexibility of semi-parametric models. Our methods are based on counting process approaches, rather than relying on possibly intractable likelihood-based or pseudo-likelihood-based approaches, and provide marginal, population-level inference. In simulations, we evaluate the statistical properties of our proposed methods. Comparisons are made to 'naive' GEE approaches that either do not account for outcome-dependent observation times or incorporate weights based on the observation-time process. We illustrate the utility of our proposed methods using data from a randomized controlled trial of interventions designed to improve adherence to warfarin therapy. We show that our method performs well in the presence of outcome-dependent observation times, and provides identical inference to 'naive' approaches when observation times are not associated with outcomes.
Kay See Tan et al.

Statistical Estimation of T1 Relaxation Time Using Conventional Magnetic Resonance Imaging
http://biostats.bepress.com/upennbiostat/art37
Tue, 10 Mar 2015 15:42:32 PDT
Quantitative T_{1} maps estimate T_{1} relaxation times and can be used to assess diffuse tissue abnormalities within normal-appearing tissue. T_{1} maps are popular for studying the progression and treatment of multiple sclerosis (MS). However, their inclusion in standard imaging protocols remains limited due to the additional scanning time and expert calibration required and susceptibility to bias and noise. Here, we propose a new method of estimating T_{1} maps using four conventional MR images, which are intensity-normalized using cerebellar gray matter as a reference tissue and related to T_{1} using a smooth regression model. Using leave-one-out cross-validation, we generate statistical T_{1} maps for 61 subjects with MS. The statistical maps are less noisy than the acquired maps and show similar accuracy. Tests of group differences in normal-appearing white matter across MS subtypes give similar results using both methods, but tests performed using statistical maps are more powerful.
Amanda Mejia et al.

Normalization Techniques for Statistical Inference from Magnetic Resonance Imaging
http://biostats.bepress.com/upennbiostat/art36
Tue, 08 Oct 2013 14:58:28 PDT
While computed tomography and other imaging techniques are measured in absolute units with physical meaning, magnetic resonance images are expressed in arbitrary units that are difficult to interpret and differ between study visits and subjects. Much work in the image processing literature on intensity normalization has focused on histogram matching and other histogram mapping techniques, with little emphasis on normalizing images to have biologically interpretable units. Furthermore, there are no formalized principles or goals for the crucial comparability of image intensities within and across subjects. To address this, we propose a set of criteria necessary for the normalization of images. We further propose simple and robust biologically motivated normalization techniques for multisequence brain imaging that have the same interpretation across acquisitions and satisfy the proposed criteria. We compare the performance of different normalization methods in thousands of images of patients with Alzheimer's Disease, hundreds of patients with multiple sclerosis, and hundreds of healthy subjects obtained in several different studies at dozens of imaging centers.
Russell T. Shinohara et al.

On the Simulation of Longitudinal Discrete Data with Specified Marginal Means and First-Order Antedependence
http://biostats.bepress.com/upennbiostat/art35
Mon, 07 Oct 2013 14:54:29 PDT
We propose a straightforward approach for simulation of discrete random variables with overdispersion, specified marginal means, and product correlations that are plausible for longitudinal data with equal, or unequal, temporal spacings. The method stems from results we prove for variables with first-order antedependence and linearity of the conditional expectations. The proposed approach will be especially useful for assessment of methods such as generalized estimating equations, which specify separate models for the marginal means and correlation structure of measurements on a subject.
Matthew Guerra et al.

Bayesian Methods for Network-Structured Genomics Data
http://biostats.bepress.com/upennbiostat/art34
Tue, 05 Jan 2010 09:18:05 PST
Graphs and networks are common ways of depicting information. In biology, many different processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This information provides a useful supplement to standard numerical genomic data such as microarray gene expression data. Effectively utilizing such information can lead to better identification of biologically relevant genomic features in the context of our prior biological knowledge. In this paper, we present a Bayesian variable selection procedure for network-structured covariates for both Gaussian linear and probit models. The key to our approach is the introduction of a Markov random field prior for the indicator variables that describe which covariates should be included in the model and the use of the Wolff algorithm for Markov Chain Monte Carlo inference. We illustrate the proposed procedure with simulations and with an analysis of genomic data. Finally, we present some other areas of genomics research where novel Bayesian approaches may play important roles.
Stefano Monni et al.

Quasi-Least Squares with Mixed Linear Correlation Structures
http://biostats.bepress.com/upennbiostat/art33
Thu, 08 Oct 2009 12:56:48 PDT
Quasi-least squares (QLS) is a two-stage computational approach for estimation of the correlation parameters in the framework of generalized estimating equations (GEE). We prove two general results for the class of mixed linear correlation structures: namely, that the stage one QLS estimate of the correlation parameter always exists and is feasible (yields a positive definite estimated correlation matrix) for any correlation structure, while the stage two estimator exists and is unique (and therefore consistent) with probability one, for the class of mixed linear correlation structures. Our general results justify the implementation of QLS for particular members of the class of mixed linear correlation structures that are appropriate for the analysis of familial data, with families that vary in size and composition. We describe the familial structures and implement them in an analysis of optical spherical values in the Old Order Amish (OOA). For the OOA analysis, we show that we would suffer a substantial loss in efficiency, if the familial structures were the true structures, but were misspecified as simpler approximate structures. We also provide software for implementation of the familial structures in R. Key-Words: Quasi-least squares; linear correlation structure; mixed correlation structure; familial data.
Jichun Xie et al.

Implementation of quasi-least squares With the R package qlspack
http://biostats.bepress.com/upennbiostat/art32
Wed, 17 Jun 2009 08:25:14 PDT
Quasi-least squares (QLS) is an alternative method for estimating the correlation parameters within the framework of generalized estimating equations (GEE) that has two main advantages over the moment estimates that are typically applied for GEE: (1) it guarantees a consistent estimate of the correlation parameter and a positive definite estimated correlation matrix, for several correlation structures; and (2) it allows for easier implementation of some correlation structures that have not yet been implemented in the framework of GEE. Furthermore, because QLS is a method in the framework of GEE, existing software can be employed within the QLS algorithm for estimation of the correlation and regression parameters. In this manuscript we describe and demonstrate the user-written package qlspack, which allows for implementation of QLS in R. Our package qlspack calls the geepack package (Yan, 2002; Halekoh et al., 2006) to update the estimate of the regression parameter at the current QLS estimate of the correlation parameter; hence, geepack-related functions for standard error estimation can be used after implementing qlspack.
Jichun Xie et al.

A Hidden Markov Random Field Model for Genome-wide Association Studies
http://biostats.bepress.com/upennbiostat/art31
Mon, 05 Jan 2009 06:49:54 PST
Genome-wide association studies (GWAS) are increasingly utilized for identifying novel susceptible genetic variants for complex traits, but there is little consensus on analysis methods for such data. The most commonly used methods include single SNP analysis or haplotype analysis with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction of multiple comparisons often leads to conservative error control and therefore lower statistical power. In this paper, we propose a hidden Markov random field model (HMRF) for GWAS analysis based on a weighted LD graph built from the prior LD information among the SNPs and an efficient iterative conditional mode algorithm for estimating the model parameters. This model effectively utilizes the LD information in calculating the posterior probability that a SNP is associated with the disease. These posterior probabilities can then be used to define a false discovery controlling procedure in order to select the disease-associated SNPs. Simulation studies demonstrated the potential gain in power over single SNP analysis. The proposed method is especially effective in identifying SNPs with borderline significance at the single-marker level that nonetheless are in high LD with significant SNPs. In addition, by simultaneously considering the SNPs in LD, the proposed method can also help to reduce the number of false identifications of disease-associated SNPs. We demonstrate the application of the proposed HMRF model using data from a case-control genome-wide association study of neuroblastoma and identify one new SNP that is potentially associated with neuroblastoma.
Hongzhe Li et al.
Genetics

Analysis of Adverse Events in Drug Safety: A Multivariate Approach Using Stratified Quasi-least Squares
http://biostats.bepress.com/upennbiostat/art29
Sun, 28 Dec 2008 17:06:27 PST
Safety assessment in drug development involves numerous statistical challenges, and yet statistical methodologies and their applications to safety data have not been fully developed, despite a recent increase of interest in this area. In practice, a conventional univariate approach for analysis of safety data involves application of Fisher's exact test to compare the proportion of subjects who experience adverse events (AEs) between treatment groups. This approach ignores several common features of safety data, including the presence of multiple endpoints, longitudinal follow-up, and a possible relationship between the AEs within body systems. In this article, we propose various regression modeling strategies to model multiple longitudinal AEs that are biologically classified into different body systems via the stratified quasi-least squares (SQLS) method. We then analyze safety data from a clinical drug development program at Wyeth Research that compared an experimental drug with a standard treatment using SQLS, which could be a superior alternative to application of Fisher's exact test.
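For reference, the conventional univariate comparison mentioned above can be sketched as a two-sided Fisher's exact test on a 2x2 table of adverse-event counts by treatment group. The function name and the example counts are assumptions for illustration; this shows the baseline approach, not the SQLS method.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are treatment groups; columns are subjects with / without the AE.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )

# Hypothetical counts: 3 of 4 subjects with the AE on the experimental
# drug versus 1 of 4 on the standard treatment.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Applied one AE at a time, a test like this treats each endpoint and each visit as unrelated, which is the limitation the multivariate approach is meant to address.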
Hanjoo Kim et al.

A Network-constrained Empirical Bayes Method for Analysis of Genomic Data
http://biostats.bepress.com/upennbiostat/art28
Wed, 29 Oct 2008 07:07:06 PDT
Empirical Bayes methods are widely used in the analysis of microarray gene expression data in order to identify the differentially expressed genes or genes that are associated with other general phenotypes. Available methods often assume that genes are independent. However, genes are expected to function interactively and to form molecular modules to affect the phenotypes. In order to account for regulatory dependency among genes, we propose in this paper a network-constrained empirical Bayes method for analyzing genomic data in the framework of general linear models, where the dependency of genes is modeled by a discrete Markov random field model defined on a pre-defined biological network. This method provides a statistical framework for integrating the known biological network information into the analysis of genomic data. We present an iterated conditional mode algorithm for parameter estimation and for estimating the posterior probabilities using Gibbs sampling. We demonstrate the application of the proposed methods using simulations and analysis of a human brain aging microarray gene expression data set.
Caiyan Li et al.

%QLS SAS Macro: A SAS macro for Analysis of Longitudinal Data Using Quasi-Least Squares
http://biostats.bepress.com/upennbiostat/art27
Tue, 05 Aug 2008 09:05:29 PDT
Quasi-least squares (QLS) is an alternative computational approach for estimation of the correlation parameter in the framework of generalized estimating equations (GEE). QLS overcomes some limitations of GEE that were discussed in Crowder (Biometrika 82 (1995) 407-410). In addition, it allows for easier implementation of some correlation structures that are not available for GEE. We describe a user-written SAS macro called %QLS, and demonstrate application of our macro using a clinical trial example for the comparison of two treatments for a common toenail infection. %QLS also computes the lower and upper boundaries of the correlation parameter for analysis of longitudinal binary data that were described by Prentice (Biometrics 44 (1988), 1033-1048). Furthermore, it displays a warning message if the Prentice constraints are violated. This warning is not provided in existing GEE software packages and other packages that were recently developed for application of QLS (in Stata, Matlab, and R). %QLS allows for analysis of normal, binary, or Poisson data with one of the following working correlation structures: the first-order autoregressive (AR(1)), equicorrelated, Markov, or tri-diagonal structures. Keywords: longitudinal data, generalized estimating equations, quasi-least squares, SAS.
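The kind of check described above can be sketched as follows: for two binary outcomes with marginal means p1 and p2, the pairwise correlation must lie between bounds determined by those means, and an AR(1) structure with parameter alpha implies correlation alpha^|j-k| between measurements j and k. This is an illustrative Python sketch, not the %QLS macro itself; the function names and example values are assumptions.

```python
import math

def prentice_bounds(p1, p2):
    """Lower and upper bounds on the correlation of two binary variables."""
    q1, q2 = 1 - p1, 1 - p2
    lower = max(
        -math.sqrt(p1 * p2 / (q1 * q2)),
        -math.sqrt(q1 * q2 / (p1 * p2)),
    )
    upper = min(
        math.sqrt(p1 * q2 / (p2 * q1)),
        math.sqrt(p2 * q1 / (p1 * q2)),
    )
    return lower, upper

def ar1_violations(means, alpha):
    """Index pairs (j, k) whose AR(1) correlation falls outside the bounds."""
    bad = []
    for j in range(len(means)):
        for k in range(j + 1, len(means)):
            lo, hi = prentice_bounds(means[j], means[k])
            if not lo <= alpha ** (k - j) <= hi:
                bad.append((j, k))
    return bad

# Illustrative values (assumed): similar means are compatible with
# alpha = 0.5, while widely separated means are not.
ok_pairs = ar1_violations([0.4, 0.5, 0.6], 0.5)
bad_pairs = ar1_violations([0.1, 0.5, 0.9], 0.5)
```

A non-empty result corresponds to the situation in which a warning would be warranted, since no valid joint distribution has those means and that correlation.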
Hanjoo Kim et al.

On the designation of the patterned associations for longitudinal Bernoulli data: weight matrix versus true correlation structure?
http://biostats.bepress.com/upennbiostat/art26
Wed, 02 Jul 2008 09:33:49 PDT
Due to potential violation of standard constraints for the correlation for binary data, it has been argued recently that the working correlation matrix should be viewed as a weight matrix that should not be confused with the true correlation structure. We propose two arguments to support our view to the contrary for the first-order autoregressive AR(1) correlation matrix. First, we prove that the standard constraints are not unduly restrictive for the AR(1) structure that is plausible for longitudinal data; furthermore, for the logit link function the upper boundary value only depends on the regression parameter and the change in covariate values between successive measurements. In addition, for given marginal means and parameter $\alpha$, we provide a general proof that satisfaction of the standard constraints for consecutive marginal means will guarantee the existence of a compatible multivariate distribution with an AR(1) structure. The relative laxity of the standard constraints for the AR(1) structure coupled with the existence of a simple model that yields data with an AR(1) structure bolsters our view that for the AR(1) structure at least, it is appropriate to view this model as a correlation structure versus a weight matrix.
Hanjoo Kim et al.

U-Statistics-based Tests for Multiple Genes in Genetic Association Studies
http://biostats.bepress.com/upennbiostat/art25
Fri, 25 Apr 2008 07:42:00 PDT
Abstract: As our understanding of biological pathways and the genes that regulate these pathways increases, consideration of these biological pathways has become an increasingly important part of genetic and molecular epidemiology. Pathway-based genetic association studies often involve genotyping of variants in genes acting in certain biological pathways. Such pathway-based genetic association studies can potentially capture the highly heterogeneous nature of many complex traits, with multiple causative loci and multiple alleles at some of the causative loci. In this paper, we develop two nonparametric test statistics that consider simultaneously the effects of multiple markers. Our approach, which is based on data-adaptive U-statistics, can handle both qualitative data such as case-control data and quantitative continuous phenotype data. Simulations demonstrate that our proposed methods are more powerful than standard methods, especially when there are multiple risk loci each with small genetic effects. When the number of disease-predisposing genes is small, the data-adaptive weighting of the U-statistics over all the markers produces similar power to commonly used single marker tests. We further illustrate the potential merits of our proposed tests in the analysis of a data set from a pathway-based candidate gene association study of breast cancer and hormone metabolism pathways. Finally, potential applications of the proposed tests to genome-wide association studies are also discussed.
Zhi Wei et al.
Genetics