UPenn Biostatistics Working Papers
Copyright (c) 2015 University of Pennsylvania. All rights reserved.
http://biostats.bepress.com/upennbiostat
Recent documents in UPenn Biostatistics Working Papers (en-us)
Fri, 30 Oct 2015 01:40:37 PDT

Removing inter-subject technical variability in magnetic resonance imaging studies
http://biostats.bepress.com/upennbiostat/art43
Wed, 28 Oct 2015 14:16:00 PDT
Magnetic resonance imaging (MRI) intensities are acquired in arbitrary units, making scans non-comparable across sites and between subjects. Intensity normalization is a first step toward improving the comparability of images across subjects. However, we show that unwanted inter-scan variability associated with imaging site, scanner effects, and other technical artifacts remains after standard intensity normalization in large multi-site neuroimaging studies. We propose RAVEL (Removal of Artificial Voxel Effect by Linear regression), a tool to remove residual technical variability after intensity normalization. Following SVA and RUV [Leek and Storey, 2007, 2008; Gagnon-Bartsch and Speed, 2012], two batch-effect correction tools widely used in genomics, we decompose the voxel intensities of images registered to a template into a biological component and an unwanted-variation component. The unwanted-variation component is estimated from a control region obtained from the cerebrospinal fluid (CSF), where intensities are known to be unassociated with disease status and other clinical covariates. We perform a singular value decomposition (SVD) of the control voxels to estimate factors of unwanted variation. We then regress every voxel of the brain on the estimated factors of unwanted variation and take the residuals as the RAVEL-corrected intensities. We assess the performance of RAVEL using T1-weighted (T1-w) images from more than 900 subjects with Alzheimer's disease (AD) or mild cognitive impairment (MCI), as well as healthy controls, from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We compare RAVEL to intensity-normalization-only methods, histogram matching, and White Stripe.
We show that RAVEL performs best at improving the replicability of the brain regions empirically found to be most associated with AD, and that these regions overlap significantly with structures known to be affected by AD (hippocampus, amygdala, parahippocampal gyrus, entorhinal area, and fornix stria terminalis). In addition, we show that the RAVEL-corrected intensities perform best at distinguishing MCI subjects from healthy subjects using mean hippocampal intensity (AUC = 67%), a marked improvement over intensity normalization alone (AUC = 63% and 59% for histogram matching and White Stripe, respectively). RAVEL is generalizable to many imaging modalities and shows promise for longitudinal studies. Additionally, because the choice of the control region is left to the user, RAVEL can be applied to studies of many brain disorders.
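The correction pipeline described above (SVD of control voxels, then voxel-wise regression on the estimated factors) can be sketched in a few lines of linear algebra. The following toy implementation is not the authors' published software; the function name and arguments are our own, and real use would involve registered, intensity-normalized images:

```python
import numpy as np

def ravel_correct(Y, control_idx, k=1):
    """Toy RAVEL-style correction (illustrative, not the published tool).

    Y           : voxels x subjects matrix of intensity-normalized images
    control_idx : indices of control voxels (e.g., CSF) assumed free of
                  biological signal
    k           : number of factors of unwanted variation to estimate
    """
    # Estimate unwanted-variation factors from the control voxels via SVD.
    Yc = Y[control_idx, :]
    Yc = Yc - Yc.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    W = Vt[:k, :].T                      # subjects x k matrix of factors

    # Regress every voxel on the unwanted factors (plus an intercept) and
    # subtract only the fitted unwanted part, keeping the voxel means.
    X = np.column_stack([np.ones(Y.shape[1]), W])
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    unwanted = (X[:, 1:] @ beta[1:, :]).T
    return Y - unwanted
```

On data with a strong scanner effect shared by all voxels, the corrected control voxels should show markedly reduced variability across subjects.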
Jean-Philippe Fortin et al.

Control-Group Feature Normalization for Multivariate Pattern Analysis Using the Support Vector Machine
http://biostats.bepress.com/upennbiostat/art42
Thu, 24 Sep 2015 14:45:43 PDT
Normalization of feature vector values is a common practice in machine learning. Typically, each feature is standardized to the unit hypercube or normalized to zero mean and unit variance. Classification decisions based on support vector machines (SVMs) or other methods are sensitive to the specific normalization used on the features. In the context of multivariate pattern analysis using neuroimaging data, standardization effectively up- and down-weights features based on their individual variability. Because the standard approach uses the entire data set to guide the normalization, it utilizes the total variability of these features. This total variation is inevitably dependent on the amount of marginal separation between groups. Thus, such a normalization may attenuate the separability of the data in high-dimensional space. In this work, we propose an alternative approach that uses an estimate of the control-group standard deviation to normalize features before training. We also show that control-based normalization provides better interpretation of the estimated multivariate disease pattern and improves classifier performance in many cases.
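The proposed normalization itself is simple to state: center and scale each feature with statistics estimated from the control subjects only, so that between-group separation does not inflate the scaling factor. A minimal sketch (function and argument names are illustrative, not the paper's code):

```python
import numpy as np

def control_scale(X, is_control):
    """Scale each feature by the control-group mean and standard deviation
    (illustrative sketch of control-based normalization)."""
    X = np.asarray(X, dtype=float)
    ctrl = X[np.asarray(is_control)]
    mu = ctrl.mean(axis=0)
    sd = ctrl.std(axis=0, ddof=1)
    return (X - mu) / sd
```

Unlike full-sample standardization, features that separate the groups keep their separation instead of being shrunk by the inflated pooled variance.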
Kristin A. Linn et al.

Addressing Confounding in Predictive Models with an Application to Neuroimaging
http://biostats.bepress.com/upennbiostat/art41
Thu, 24 Sep 2015 14:39:04 PDT
Understanding structural changes in the brain that are caused by a particular disease is a major goal of neuroimaging research. Multivariate pattern analysis (MVPA) comprises a collection of tools that can be used to understand complex disease effects across the brain. We discuss several important issues that must be considered when analyzing data from neuroimaging studies using MVPA. In particular, we focus on the consequences of confounding by non-imaging variables such as age and sex on the results of MVPA. After reviewing current practice to address confounding in neuroimaging studies, we propose an alternative approach based on inverse probability weighting. Although the proposed method is motivated by neuroimaging applications, it is broadly applicable to many problems in machine learning and predictive modeling. We demonstrate the advantages of our approach on simulated and real data examples.
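The inverse probability weighting idea can be illustrated in a toy setting with a single discrete confounder: estimate P(group | confounder) within each confounder stratum and weight each subject by the inverse probability of its observed group, so that the weighted confounder distributions are balanced across groups. This is only a simplified sketch of the general approach; names are hypothetical:

```python
import numpy as np

def ipw_weights(group, confounder):
    """Toy inverse-probability weights for one discrete confounder:
    w_i = 1 / P(group_i | confounder stratum of subject i),
    with probabilities estimated by within-stratum proportions."""
    group = np.asarray(group)
    confounder = np.asarray(confounder)
    w = np.empty(len(group), dtype=float)
    for c in np.unique(confounder):
        in_c = confounder == c
        for g in np.unique(group):
            sel = in_c & (group == g)
            p = sel.sum() / in_c.sum()   # estimated P(group = g | stratum c)
            w[sel] = 1.0 / p
    return w
```

Training a classifier with these sample weights down-weights group-confounder combinations that are over-represented, so the learned pattern is less driven by the confounder.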
Kristin A. Linn et al.

Nonparametric methods for doubly robust estimation of continuous treatment effects
http://biostats.bepress.com/upennbiostat/art39
Tue, 22 Sep 2015 12:25:36 PDT
Continuous treatments (e.g., doses) arise often in practice, but available causal effect estimators require either parametric models for the effect curve or else consistent estimation of a single nuisance function. We propose a novel doubly robust kernel smoothing approach, which requires only mild smoothness assumptions on the effect curve and allows for misspecification of either the treatment density or outcome regression. We derive asymptotic properties and also discuss an approach for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of nurse staffing on hospital readmissions penalties.
Edward H. Kennedy et al.

Regression modeling of longitudinal binary outcomes with outcome-dependent observation times
http://biostats.bepress.com/upennbiostat/art38
Tue, 10 Mar 2015 15:50:39 PDT
Conventional longitudinal data analysis methods assume that outcomes are independent of the data-collection schedule. However, the independence assumption may be violated, for example, when adverse events trigger additional physician visits in between prescheduled follow-ups. Observation times may therefore be associated with outcome values, which may introduce bias when estimating the effect of covariates on outcomes using standard longitudinal regression methods. Existing semi-parametric methods that accommodate outcome-dependent observation times are limited to the analysis of continuous outcomes. We develop new methods for the analysis of binary outcomes that retain the flexibility of semi-parametric models. Our methods are based on counting-process approaches, rather than on possibly intractable likelihood-based or pseudo-likelihood-based approaches, and provide marginal, population-level inference. In simulations, we evaluate the statistical properties of our proposed methods. Comparisons are made to 'naive' GEE approaches that either do not account for outcome-dependent observation times or incorporate weights based on the observation-time process. We illustrate the utility of our proposed methods using data from a randomized controlled trial of interventions designed to improve adherence to warfarin therapy. We show that our method performs well in the presence of outcome-dependent observation times and provides inference identical to the 'naive' approaches when observation times are not associated with outcomes.
Kay See Tan et al.

Statistical Estimation of T1 Relaxation Time Using Conventional Magnetic Resonance Imaging
http://biostats.bepress.com/upennbiostat/art37
Tue, 10 Mar 2015 15:42:32 PDT
Quantitative T_{1} maps estimate T_{1} relaxation times and can be used to assess diffuse tissue abnormalities within normal-appearing tissue. T_{1} maps are popular for studying the progression and treatment of multiple sclerosis (MS). However, their inclusion in standard imaging protocols remains limited due to the additional scanning time, the expert calibration required, and their susceptibility to bias and noise. Here, we propose a new method of estimating T_{1} maps from four conventional MR images, which are intensity-normalized using cerebellar gray matter as a reference tissue and related to T_{1} via a smooth regression model. Using leave-one-out cross-validation, we generate statistical T_{1} maps for 61 subjects with MS. The statistical maps are less noisy than the acquired maps and show similar accuracy. Tests of group differences in normal-appearing white matter across MS subtypes give similar results with both methods, but tests performed using the statistical maps are more powerful.
Amanda Mejia et al.

Normalization Techniques for Statistical Inference from Magnetic Resonance Imaging
http://biostats.bepress.com/upennbiostat/art36
Tue, 08 Oct 2013 14:58:28 PDT
While computed tomography and other imaging techniques are measured in absolute units with physical meaning, magnetic resonance images are expressed in arbitrary units that are difficult to interpret and differ between study visits and subjects. Much work in the image processing literature on intensity normalization has focused on histogram matching and other histogram mapping techniques, with little emphasis on normalizing images to have biologically interpretable units. Furthermore, there are no formalized principles or goals for the crucial comparability of image intensities within and across subjects. To address this, we propose a set of criteria necessary for the normalization of images. We further propose simple and robust biologically motivated normalization techniques for multisequence brain imaging that have the same interpretation across acquisitions and satisfy the proposed criteria. We compare the performance of different normalization methods in thousands of images of patients with Alzheimer's Disease, hundreds of patients with multiple sclerosis, and hundreds of healthy subjects obtained in several different studies at dozens of imaging centers.
Russell T. Shinohara et al.

On the Simulation of Longitudinal Discrete Data with Specified Marginal Means and First-Order Antedependence
http://biostats.bepress.com/upennbiostat/art35
Mon, 07 Oct 2013 14:54:29 PDT
We propose a straightforward approach for simulation of discrete random variables with overdispersion, specified marginal means, and product correlations that are plausible for longitudinal data with equal, or unequal, temporal spacings. The method stems from results we prove for variables with first-order antedependence and linearity of the conditional expectations. The proposed approach will be especially useful for assessment of methods such as generalized estimating equations, which specify separate models for the marginal means and correlation structure of measurements on a subject.
Matthew Guerra et al.

Bayesian Methods for Network-Structured Genomics Data
http://biostats.bepress.com/upennbiostat/art34
Tue, 05 Jan 2010 09:18:05 PST
Graphs and networks are common ways of depicting information. In biology, many different processes are represented by graphs, such as regulatory networks, metabolic pathways, and protein-protein interaction networks. This information provides a useful supplement to standard numerical genomic data such as microarray gene expression data. Effectively utilizing such information can lead to better identification of biologically relevant genomic features in the context of our prior biological knowledge. In this paper, we present a Bayesian variable selection procedure for network-structured covariates for both Gaussian linear and probit models. The key to our approach is the introduction of a Markov random field prior for the indicator variables that describe which covariates should be included in the model, and the use of the Wolff algorithm for Markov chain Monte Carlo inference. We illustrate the proposed procedure with simulations and with an analysis of genomic data. Finally, we present some other areas of genomics research where novel Bayesian approaches may play important roles.
Stefano Monni et al.

Quasi-Least Squares with Mixed Linear Correlation Structures
http://biostats.bepress.com/upennbiostat/art33
Thu, 08 Oct 2009 12:56:48 PDT
Quasi-least squares (QLS) is a two-stage computational approach for estimating the correlation parameters in the framework of generalized estimating equations (GEE). We prove two general results for the class of mixed linear correlation structures: the stage-one QLS estimate of the correlation parameter always exists and is feasible (yields a positive-definite estimated correlation matrix) for any correlation structure, while the stage-two estimator exists and is unique (and therefore consistent) with probability one for the class of mixed linear correlation structures. Our general results justify the implementation of QLS for particular members of the class of mixed linear correlation structures that are appropriate for the analysis of familial data, with families that vary in size and composition. We describe the familial structures and implement them in an analysis of optical spherical values in the Old Order Amish (OOA). For the OOA analysis, we show that we would suffer a substantial loss in efficiency if the true familial structures were misspecified as simpler approximate structures. We also provide software for implementation of the familial structures in R. Keywords: quasi-least squares; linear correlation structure; mixed correlation structure; familial data.
Jichun Xie et al.

Implementation of quasi-least squares with the R package qlspack
http://biostats.bepress.com/upennbiostat/art32
Wed, 17 Jun 2009 08:25:14 PDT
Quasi-least squares (QLS) is an alternative method for estimating the correlation parameters within the framework of generalized estimating equations (GEE) that has two main advantages over the moment estimates typically applied for GEE: (1) it guarantees a consistent estimate of the correlation parameter and a positive-definite estimated correlation matrix for several correlation structures; and (2) it allows for easier implementation of some correlation structures that have not yet been implemented in the framework of GEE. Furthermore, because QLS is a method in the framework of GEE, existing software can be employed within the QLS algorithm for estimation of the correlation and regression parameters. In this manuscript, we describe and demonstrate the user-written package qlspack, which implements QLS in R. Our package qlspack calls the geepack package (Yan, 2002; Halekoh et al., 2006) to update the estimate of the regression parameter at the current QLS estimate of the correlation parameter; hence, geepack-related functions for standard-error estimation can be used after running qlspack.
Jichun Xie et al.

A Hidden Markov Random Field Model for Genome-wide Association Studies
http://biostats.bepress.com/upennbiostat/art31
Mon, 05 Jan 2009 06:49:54 PST
Genome-wide association studies (GWAS) are increasingly utilized to identify novel susceptibility variants for complex traits, but there is little consensus on analysis methods for such data. The most commonly used methods include single-SNP analysis or haplotype analysis with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction for multiple comparisons often leads to conservative error control and therefore lower statistical power. In this paper, we propose a hidden Markov random field (HMRF) model for GWAS analysis based on a weighted LD graph built from the prior LD information among the SNPs, together with an efficient iterated conditional modes algorithm for estimating the model parameters. This model effectively utilizes the LD information in calculating the posterior probability that a SNP is associated with the disease. These posterior probabilities can then be used to define a false-discovery-controlling procedure for selecting the disease-associated SNPs. Simulation studies demonstrate the potential gain in power over single-SNP analysis. The proposed method is especially effective in identifying SNPs with borderline significance at the single-marker level that are nonetheless in high LD with significant SNPs. In addition, by simultaneously considering the SNPs in LD, the proposed method can also help reduce the number of false identifications of disease-associated SNPs. We demonstrate the application of the proposed HMRF model using data from a case-control genome-wide association study of neuroblastoma and identify one new SNP that is potentially associated with neuroblastoma.
Hongzhe Li et al.

Analysis of Adverse Events in Drug Safety: A Multivariate Approach Using Stratified Quasi-least Squares
http://biostats.bepress.com/upennbiostat/art29
Sun, 28 Dec 2008 17:06:27 PST
Safety assessment in drug development involves numerous statistical challenges, yet statistical methodologies and their applications to safety data have not been fully developed, despite a recent increase of interest in this area. In practice, a conventional univariate approach for the analysis of safety data applies Fisher's exact test to compare the proportion of subjects who experience adverse events (AEs) between treatment groups; this approach ignores several common features of safety data, including the presence of multiple endpoints, longitudinal follow-up, and possible relationships between AEs within body systems. In this article, we propose various regression modeling strategies for modeling multiple longitudinal AEs that are biologically classified into different body systems via the stratified quasi-least squares (SQLS) method. We then use SQLS, which can be a superior alternative to Fisher's exact test, to analyze safety data from a clinical drug development program at Wyeth Research that compared an experimental drug with a standard treatment.
Hanjoo Kim et al.

A Network-constrained Empirical Bayes Method for Analysis of Genomic Data
http://biostats.bepress.com/upennbiostat/art28
Wed, 29 Oct 2008 07:07:06 PDT
Empirical Bayes methods are widely used in the analysis of microarray gene expression data to identify differentially expressed genes or genes associated with other phenotypes. Available methods often assume that genes are independent. However, genes are expected to function interactively and to form molecular modules that affect phenotypes. To account for regulatory dependency among genes, we propose in this paper a network-constrained empirical Bayes method for analyzing genomic data in the framework of general linear models, where the dependency of genes is modeled by a discrete Markov random field defined on a predefined biological network. This method provides a statistical framework for integrating known biological network information into the analysis of genomic data. We present an iterated conditional modes algorithm for parameter estimation and estimate the posterior probabilities using Gibbs sampling. We demonstrate the application of the proposed methods using simulations and an analysis of a human brain aging microarray gene expression data set.
Caiyan Li et al.

%QLS SAS Macro: A SAS macro for Analysis of Longitudinal Data Using Quasi-Least Squares
http://biostats.bepress.com/upennbiostat/art27
Tue, 05 Aug 2008 09:05:29 PDT
Quasi-least squares (QLS) is an alternative computational approach for estimating the correlation parameter in the framework of generalized estimating equations (GEE). QLS overcomes some limitations of GEE that were discussed in Crowder (Biometrika 82 (1995) 407-410). In addition, it allows for easier implementation of some correlation structures that are not available for GEE. We describe a user-written SAS macro called %QLS and demonstrate its application using a clinical trial example comparing two treatments for a common toenail infection. %QLS also computes the lower and upper boundaries of the correlation parameter for the analysis of longitudinal binary data that were described by Prentice (Biometrics 44 (1988), 1033-1048). Furthermore, it displays a warning message if the Prentice constraints are violated; this warning is not provided in existing GEE software packages or in other packages recently developed for QLS (in Stata, Matlab, and R). %QLS allows for the analysis of normal, binary, or Poisson data with one of the following working correlation structures: first-order autoregressive (AR(1)), equicorrelated, Markov, or tri-diagonal. Keywords: longitudinal data; generalized estimating equations; quasi-least squares; SAS.
Hanjoo Kim et al.

On the designation of the patterned associations for longitudinal Bernoulli data: weight matrix versus true correlation structure?
http://biostats.bepress.com/upennbiostat/art26
Wed, 02 Jul 2008 09:33:49 PDT
Due to potential violation of the standard constraints on correlations for binary data, it has recently been argued that the working correlation matrix should be viewed as a weight matrix that should not be confused with the true correlation structure. We propose two arguments to support the contrary view for the first-order autoregressive (AR(1)) correlation matrix. First, we prove that the standard constraints are not unduly restrictive for the AR(1) structure that is plausible for longitudinal data; furthermore, for the logit link function, the upper boundary value depends only on the regression parameter and the change in covariate values between successive measurements. In addition, for given marginal means and parameter α, we provide a general proof that satisfaction of the standard constraints for consecutive marginal means guarantees the existence of a compatible multivariate distribution with an AR(1) structure. The relative laxity of the standard constraints for the AR(1) structure, coupled with the existence of a simple model that yields data with an AR(1) structure, bolsters our view that, for the AR(1) structure at least, it is appropriate to view this model as a correlation structure rather than as a weight matrix.
Hanjoo Kim et al.

U-Statistics-based Tests for Multiple Genes in Genetic Association Studies
http://biostats.bepress.com/upennbiostat/art25
Fri, 25 Apr 2008 07:42:00 PDT
As our understanding of biological pathways and the genes that regulate them increases, consideration of these pathways has become an increasingly important part of genetic and molecular epidemiology. Pathway-based genetic association studies often involve genotyping variants in genes acting in certain biological pathways. Such studies can potentially capture the highly heterogeneous nature of many complex traits, with multiple causative loci and multiple alleles at some of the causative loci. In this paper, we develop two nonparametric test statistics that simultaneously consider the effects of multiple markers. Our approach, which is based on data-adaptive U-statistics, can handle both qualitative data, such as case-control data, and quantitative continuous phenotype data. Simulations demonstrate that our proposed methods are more powerful than standard methods, especially when there are multiple risk loci, each with small genetic effects. When the number of disease-predisposing genes is small, the data-adaptive weighting of the U-statistics over all the markers produces power similar to that of commonly used single-marker tests. We further illustrate the potential merits of our proposed tests in the analysis of a data set from a pathway-based candidate-gene association study of breast cancer and hormone metabolism pathways. Finally, potential applications of the proposed tests to genome-wide association studies are also discussed.
Zhi Wei et al.

Incorporation of Genetic Pathway Information into Analysis of Multivariate Gene Expression Data
http://biostats.bepress.com/upennbiostat/art24
Mon, 14 Apr 2008 09:52:09 PDT
Multivariate microarray gene expression data are commonly collected to study genomic responses under ordered conditions, such as over increasing or decreasing dose levels or over time during biological processes. One important question arising from such experiments is how to identify genes that show different expression patterns over treatment dosages or over time, and pathways that are perturbed during a given biological process. In this paper, we develop a hidden Markov random field model for multivariate expression data in order to identify genes and subnetworks that are related to biological processes, where the dependency of the differential expression patterns of genes on the networks is modeled by a Markov random field. Simulation studies indicate that the method is quite effective in identifying genes and modified subnetworks and has higher sensitivity than commonly used procedures that do not use pathway information, with similar observed false discovery rates. We applied the proposed methods to a microarray time-course gene expression study of TrkA- and TrkB-transfected neuroblastoma cell lines and identified genes and subnetworks in the MAPK, focal adhesion, and prion disease pathways that may explain cell differentiation in the TrkA-transfected cell lines.
Zhi Wei et al.

Network-constrained Regularization and Variable Selection for Analysis of Genomic Data
http://biostats.bepress.com/upennbiostat/art23
Mon, 10 Dec 2007 06:49:55 PST
Graphs or networks are common ways of depicting information. In biology in particular, many different biological processes are represented by graphs, such as regulatory networks or metabolic pathways. This kind of a priori information, gathered over many years of biomedical research, is a useful supplement to standard numerical genomic data such as microarray gene expression data. How to incorporate information encoded by known biological networks or graphs into the analysis of numerical data raises interesting statistical challenges. In this paper, we introduce a network-constrained regularization procedure for linear regression analysis that incorporates the information in these graphs into an analysis of the numerical data, where the network is represented as a graph and its corresponding Laplacian matrix. We define a network-constrained penalty function that penalizes the L1-norm of the coefficients but encourages smoothness of the coefficients on the network. An efficient algorithm is also proposed for computing the network-constrained regularization paths, much as the LARS algorithm does for the lasso. We illustrate the methods using simulated data and an analysis of a microarray gene expression data set from glioblastoma.
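The network-constrained penalty can be written down directly: an L1 term for sparsity plus a quadratic form in the graph Laplacian that penalizes differences between the coefficients of linked genes. A minimal sketch evaluating such a penalty with the combinatorial Laplacian (a degree-normalized Laplacian can be substituted; function and argument names are illustrative):

```python
import numpy as np

def network_penalty(beta, A, lam1, lam2):
    """Evaluate lam1 * ||beta||_1 + lam2 * beta' L beta, where L is the
    combinatorial graph Laplacian of adjacency matrix A (illustrative sketch)."""
    beta = np.asarray(beta, dtype=float)
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                  # vertex degrees
    L = np.diag(d) - A                 # combinatorial graph Laplacian
    return lam1 * np.abs(beta).sum() + lam2 * beta @ L @ beta
```

Because beta' L beta equals the sum over edges of A_ij (beta_i - beta_j)^2, the quadratic term vanishes exactly when linked coefficients are equal, which is the smoothness the penalty encourages.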
Caiyan Li et al.

Vertex Clustering in Random Graphs via Reversible Jump Markov Chain Monte Carlo
http://biostats.bepress.com/upennbiostat/art22
Wed, 05 Dec 2007 06:21:34 PST
Networks are a natural and effective tool for studying relational data, in which observations are collected on pairs of units. The units are represented by nodes and their relations by edges. In biology, for example, proteins and their interactions may be the nodes and edges of the network; in social science, people and interpersonal relations. In this paper, we address the question of clustering vertices in networks as a way to uncover homogeneity patterns in data that enjoy a network representation. We use a mixture model for random graphs and propose a reversible jump Markov chain Monte Carlo algorithm to infer its parameters. Applications of the algorithm are given to one simulated data set and three real data sets, which describe friendships among members of a university karate club, social interactions of dolphins, and gap junctions in C. elegans.
Stefano Monni et al.