UPenn Biostatistics Working Papers
Copyright (c) 2014 University of Pennsylvania. All rights reserved.
http://biostats.bepress.com/upennbiostat
Recent documents in UPenn Biostatistics Working Papers
Wed, 29 Jan 2014 10:01:03 PST

Normalization Techniques for Statistical Inference from Magnetic Resonance Imaging
http://biostats.bepress.com/upennbiostat/art36
Tue, 08 Oct 2013 14:58:28 PDT
While computed tomography and other imaging techniques are measured in absolute units with physical meaning, magnetic resonance images are expressed in arbitrary units that are difficult to interpret and differ between study visits and subjects. Much work in the image processing literature on intensity normalization has focused on histogram matching and other histogram mapping techniques, with little emphasis on normalizing images to have biologically interpretable units. Furthermore, there are no formalized principles or goals for the crucial comparability of image intensities within and across subjects. To address this, we propose a set of criteria necessary for the normalization of images. We further propose simple and robust biologically motivated normalization techniques for multisequence brain imaging that have the same interpretation across acquisitions and satisfy the proposed criteria. We compare the performance of different normalization methods in thousands of images of patients with Alzheimer's Disease, hundreds of patients with multiple sclerosis, and hundreds of healthy subjects obtained in several different studies at dozens of imaging centers.
Russell T. Shinohara et al.

On the Simulation of Longitudinal Discrete Data with Specified Marginal Means and First-Order Antedependence
http://biostats.bepress.com/upennbiostat/art35
Mon, 07 Oct 2013 14:54:29 PDT
We propose a straightforward approach for simulation of discrete random variables with overdispersion, specified marginal means, and product correlations that are plausible for longitudinal data with equal, or unequal, temporal spacings. The method stems from results we prove for variables with first-order antedependence and linearity of the conditional expectations. The proposed approach will be especially useful for assessment of methods such as generalized estimating equations, which specify separate models for the marginal means and correlation structure of measurements on a subject.
Matthew Guerra et al.

Bayesian Methods for Network-Structured Genomics Data
http://biostats.bepress.com/upennbiostat/art34
Tue, 05 Jan 2010 09:18:05 PST
Graphs and networks are common ways of depicting information. In biology, many different processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This information provides a useful supplement to standard numerical genomic data such as microarray gene expression data. Effectively utilizing such information can lead to better identification of biologically relevant genomic features in the context of our prior biological knowledge. In this paper, we present a Bayesian variable selection procedure for network-structured covariates for both Gaussian linear and probit models. The key to our approach is the introduction of a Markov random field prior for the indicator variables that describe which covariates should be included in the model and the use of the Wolff algorithm for Markov chain Monte Carlo inference. We illustrate the proposed procedure with simulations and with an analysis of genomic data. Finally, we present some other areas of genomics research where novel Bayesian approaches may play important roles.
Stefano Monni et al.

Quasi-Least Squares with Mixed Linear Correlation Structures
http://biostats.bepress.com/upennbiostat/art33
Thu, 08 Oct 2009 12:56:48 PDT
Quasi-least squares (QLS) is a two-stage computational approach for estimation of the correlation parameters in the framework of generalized estimating equations (GEE). We prove two general results for the class of mixed linear correlation structures: namely, that the stage one QLS estimate of the correlation parameter always exists and is feasible (yields a positive definite estimated correlation matrix) for any correlation structure, while the stage two estimator exists and is unique (and therefore consistent) with probability one, for the class of mixed linear correlation structures. Our general results justify the implementation of QLS for particular members of the class of mixed linear correlation structures that are appropriate for the analysis of familial data, with families that vary in size and composition. We describe the familial structures and implement them in an analysis of optical spherical values in the Old Order Amish (OOA). For the OOA analysis, we show that we would suffer a substantial loss in efficiency, if the familial structures were the true structures, but were misspecified as simpler approximate structures. We also provide software for implementation of the familial structures in R. Key-Words: Quasi-least squares; linear correlation structure; mixed correlation structure; familial data.
Jichun Xie et al.

Implementation of quasi-least squares With the R package qlspack
http://biostats.bepress.com/upennbiostat/art32
Wed, 17 Jun 2009 08:25:14 PDT
Quasi-least squares (QLS) is an alternative method for estimating the correlation parameters within the framework of generalized estimating equations (GEE) that has two main advantages over the moment estimates that are typically applied for GEE: (1) it guarantees a consistent estimate of the correlation parameter and a positive definite estimated correlation matrix, for several correlation structures; and (2) it allows for easier implementation of some correlation structures that have not yet been implemented in the framework of GEE. Furthermore, because QLS is a method in the framework of GEE, existing software can be employed within the QLS algorithm for estimation of the correlation and regression parameters. In this manuscript we describe and demonstrate the user-written package qlspack, which allows for implementation of QLS in R. Our package qlspack calls the geepack package (Yan, 2002; Halekoh et al., 2006) to update the estimate of the regression parameter at the current QLS estimate of the correlation parameter; hence, geepack functions for standard error estimation can be used after implementing qlspack.
Jichun Xie et al.

A Hidden Markov Random Field Model for Genome-wide Association Studies
http://biostats.bepress.com/upennbiostat/art31
Mon, 05 Jan 2009 06:49:54 PST
Genome-wide association studies (GWAS) are increasingly utilized for identifying novel susceptible genetic variants for complex traits, but there is little consensus on analysis methods for such data. The most commonly used methods include single SNP analysis or haplotype analysis with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction for multiple comparisons often leads to conservative error control and therefore lower statistical power. In this paper, we propose a hidden Markov random field (HMRF) model for GWAS analysis based on a weighted LD graph built from the prior LD information among the SNPs and an efficient iterative conditional mode algorithm for estimating the model parameters. This model effectively utilizes the LD information in calculating the posterior probability that a SNP is associated with the disease. These posterior probabilities can then be used to define a false discovery controlling procedure in order to select the disease-associated SNPs. Simulation studies demonstrated the potential gain in power over single SNP analysis. The proposed method is especially effective in identifying SNPs with borderline significance at the single-marker level that nonetheless are in high LD with significant SNPs. In addition, by simultaneously considering the SNPs in LD, the proposed method can also help to reduce the number of false identifications of disease-associated SNPs. We demonstrate the application of the proposed HMRF model using data from a case-control genome-wide association study of neuroblastoma and identify one new SNP that is potentially associated with neuroblastoma.
Hongzhe Li et al.
Genetics

Analysis of Adverse Events in Drug Safety: A Multivariate Approach Using Stratified Quasi-least Squares
http://biostats.bepress.com/upennbiostat/art29
Sun, 28 Dec 2008 17:06:27 PST
Safety assessment in drug development involves numerous statistical challenges, and yet statistical methodologies and their applications to safety data have not been fully developed, despite a recent increase of interest in this area. In practice, a conventional univariate approach for analysis of safety data involves application of Fisher's exact test to compare the proportion of subjects who experience adverse events (AEs) between treatment groups. This approach ignores several common features of safety data, including the presence of multiple endpoints, longitudinal follow-up, and a possible relationship between the AEs within body systems. In this article, we propose various regression modeling strategies to model multiple longitudinal AEs that are biologically classified into different body systems via the stratified quasi-least squares (SQLS) method. We then analyze safety data from a clinical drug development program at Wyeth Research that compared an experimental drug with a standard treatment using SQLS, which could be a superior alternative to application of Fisher's exact test.
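As an illustration of the conventional univariate comparison mentioned above, the following is a minimal sketch of a two-sided Fisher's exact test for a single adverse event in two treatment groups. It is not taken from the paper: the function name `fisher_exact_two_sided` and the counts in the usage note are hypothetical, and the two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is one common convention among several.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are treatment groups; columns are (AE, no AE). Hypothetical helper,
    not code from the paper.
    """
    row1, row2 = a + b, c + d          # group sizes
    col1 = a + c                       # total number of subjects with the AE
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x AEs in group 1 with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)           # smallest feasible count in cell (1, 1)
    hi = min(row1, col1)               # largest feasible count in cell (1, 1)
    # Sum over all tables that are no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
```

For example, `fisher_exact_two_sided(3, 1, 1, 3)` compares 3 of 4 subjects with the AE in one group against 1 of 4 in the other. Note that the test treats one AE at one time point in isolation, which is the limitation the abstract points to.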
Hanjoo Kim et al.

A Network-constrained Empirical Bayes Method for Analysis of Genomic Data
http://biostats.bepress.com/upennbiostat/art28
Wed, 29 Oct 2008 07:07:06 PDT
Empirical Bayes methods are widely used in the analysis of microarray gene expression data in order to identify the differentially expressed genes or genes that are associated with other general phenotypes. Available methods often assume that genes are independent. However, genes are expected to function interactively and to form molecular modules to affect the phenotypes. In order to account for regulatory dependency among genes, we propose in this paper a network-constrained empirical Bayes method for analyzing genomic data in the framework of general linear models, where the dependency of genes is modeled by a discrete Markov random field model defined on a pre-defined biological network. This method provides a statistical framework for integrating the known biological network information into the analysis of genomic data. We present an iterated conditional mode algorithm for parameter estimation and for estimating the posterior probabilities using Gibbs sampling. We demonstrate the application of the proposed methods using simulations and analysis of a human brain aging microarray gene expression data set.
Caiyan Li et al.

%QLS SAS Macro: A SAS macro for Analysis of Longitudinal Data Using Quasi-Least Squares
http://biostats.bepress.com/upennbiostat/art27
Tue, 05 Aug 2008 09:05:29 PDT
Quasi-least squares (QLS) is an alternative computational approach for estimation of the correlation parameter in the framework of generalized estimating equations (GEE). QLS overcomes some limitations of GEE that were discussed in Crowder (Biometrika 82 (1995) 407-410). In addition, it allows for easier implementation of some correlation structures that are not available for GEE. We describe a user-written SAS macro called %QLS, and demonstrate application of our macro using a clinical trial example for the comparison of two treatments for a common toenail infection. %QLS also computes the lower and upper boundaries of the correlation parameter for analysis of longitudinal binary data that were described by Prentice (Biometrics 44 (1988), 1033-1048). Furthermore, it displays a warning message if the Prentice constraints are violated. This warning is not provided in existing GEE software packages or in other packages that were recently developed for application of QLS (in Stata, Matlab, and R). %QLS allows for analysis of normal, binary, or Poisson data with one of the following working correlation structures: the first-order autoregressive (AR(1)), equicorrelated, Markov, or tri-diagonal structures. Keywords: longitudinal data, generalized estimating equations, quasi-least squares, SAS.
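To make the boundary check concrete, here is a minimal sketch of the pairwise correlation limits for two Bernoulli outcomes with means p_i and p_j, which is the standard form of the constraints attributed to Prentice (1988). It is written in Python rather than SAS and is not the %QLS macro's code. The function names `prentice_bounds` and `violates_bounds` are hypothetical, and the step of translating pairwise limits into limits on a structure's correlation parameter (for example, across the lags of an AR(1) structure) is not shown.

```python
import math


def prentice_bounds(p_i, p_j):
    """Return (lower, upper) limits on corr(Y_i, Y_j) for Bernoulli means p_i, p_j.

    Hypothetical helper illustrating the pairwise constraints; not macro code.
    """
    if not (0.0 < p_i < 1.0 and 0.0 < p_j < 1.0):
        raise ValueError("means must lie strictly between 0 and 1")
    q_i, q_j = 1.0 - p_i, 1.0 - p_j
    lower = max(-math.sqrt(p_i * p_j / (q_i * q_j)),
                -math.sqrt(q_i * q_j / (p_i * p_j)))
    upper = min(math.sqrt(p_i * q_j / (p_j * q_i)),
                math.sqrt(p_j * q_i / (p_i * q_j)))
    return lower, upper


def violates_bounds(rho, p_i, p_j):
    """True if the correlation rho is infeasible for the given pair of means."""
    lower, upper = prentice_bounds(p_i, p_j)
    return rho < lower or rho > upper
```

With equal means the limits are -1 and 1, so any correlation is feasible; the limits tighten as the two means move apart, which is when a warning of the kind described above becomes useful.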
Hanjoo Kim et al.

On the designation of the patterned associations for longitudinal Bernoulli data: weight matrix versus true correlation structure?
http://biostats.bepress.com/upennbiostat/art26
Wed, 02 Jul 2008 09:33:49 PDT
Due to potential violation of standard constraints for the correlation for binary data, it has been argued recently that the working correlation matrix should be viewed as a weight matrix that should not be confused with the true correlation structure. We propose two arguments to support our view to the contrary for the first-order autoregressive AR(1) correlation matrix. First, we prove that the standard constraints are not unduly restrictive for the AR(1) structure that is plausible for longitudinal data; furthermore, for the logit link function the upper boundary value only depends on the regression parameter and the change in covariate values between successive measurements. In addition, for given marginal means and parameter $\alpha$, we provide a general proof that satisfaction of the standard constraints for consecutive marginal means will guarantee the existence of a compatible multivariate distribution with an AR(1) structure. The relative laxity of the standard constraints for the AR(1) structure coupled with the existence of a simple model that yields data with an AR(1) structure bolsters our view that for the AR(1) structure at least, it is appropriate to view this model as a correlation structure versus a weight matrix.
Hanjoo Kim et al.

U-Statistics-based Tests for Multiple Genes in Genetic Association Studies
http://biostats.bepress.com/upennbiostat/art25
Fri, 25 Apr 2008 07:42:00 PDT
Abstract: As our understanding of biological pathways and the genes that regulate these pathways increases, consideration of these biological pathways has become an increasingly important part of genetic and molecular epidemiology. Pathway-based genetic association studies often involve genotyping of variants in genes acting in certain biological pathways. Such pathway-based genetic association studies can potentially capture the highly heterogeneous nature of many complex traits, with multiple causative loci and multiple alleles at some of the causative loci. In this paper, we develop two nonparametric test statistics that consider simultaneously the effects of multiple markers. Our approach, which is based on data-adaptive U-statistics, can handle both qualitative data such as case-control data and quantitative continuous phenotype data. Simulations demonstrate that our proposed methods are more powerful than standard methods, especially when there are multiple risk loci each with small genetic effects. When the number of disease-predisposing genes is small, the data-adaptive weighting of the U-statistics over all the markers produces similar power to commonly used single marker tests. We further illustrate the potential merits of our proposed tests in the analysis of a data set from a pathway-based candidate gene association study of breast cancer and hormone metabolism pathways. Finally, potential applications of the proposed tests to genome-wide association studies are also discussed.
Zhi Wei et al.
Genetics

Incorporation of Genetic Pathway Information into Analysis of Multivariate Gene Expression Data
http://biostats.bepress.com/upennbiostat/art24
Mon, 14 Apr 2008 09:52:09 PDT
Abstract: Multivariate microarray gene expression data are commonly collected to study the genomic responses under ordered conditions such as over increasing/decreasing dose levels or over time during biological processes. One important goal of such multivariate gene expression experiments is to identify genes that show different expression patterns over treatment dosages or over time and pathways that are perturbed during a given biological process. In this paper, we develop a hidden Markov random field model for multivariate expression data in order to identify genes and subnetworks that are related to biological processes, where the dependency of the differential expression patterns of genes on the networks is modeled by a Markov random field. Simulation studies indicated that the method is quite effective in identifying genes and the modified subnetworks and has higher sensitivity than the commonly used procedures that do not use the pathway information, with similar observed false discovery rates. We applied the proposed methods to the analysis of a microarray time course gene expression study of TrkA- and TrkB-transfected neuroblastoma cell lines and identified genes and subnetworks on MAPK, focal adhesion and prion disease pathways that may explain cell differentiation in TrkA-transfected cell lines.
Zhi Wei et al.

Network-constrained Regularization and Variable Selection for Analysis of Genomic Data
http://biostats.bepress.com/upennbiostat/art23
Mon, 10 Dec 2007 06:49:55 PST
Graphs or networks are common ways of depicting information. In biology in particular, many different biological processes are represented by graphs, such as regulatory networks or metabolic pathways. This kind of a priori information gathered over many years of biomedical research is a useful supplement to the standard numerical genomic data such as microarray gene expression data. How to incorporate information encoded by the known biological networks or graphs into analysis of numerical data raises interesting statistical challenges. In this paper, we introduce a network-constrained regularization procedure for linear regression analysis in order to incorporate the information from these graphs into an analysis of the numerical data, where the network is represented as a graph and its corresponding Laplacian matrix. We define a network-constrained penalty function that penalizes the $L_1$-norm of the coefficients but encourages smoothness of the coefficients on the network. An efficient algorithm is also proposed for computing the network-constrained regularization paths, much like the Lars algorithm does for the lasso. We illustrate the methods using simulated data and analysis of a microarray gene expression data set of glioblastoma.
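A penalty of the kind described, an $L_1$ term plus a Laplacian smoothness term, can be sketched as follows. This is only an illustration under stated assumptions: it uses the unweighted combinatorial Laplacian L = D - A, for which the quadratic form beta' L beta equals the sum of squared coefficient differences over edges, and the paper's exact Laplacian (for instance a degree-normalized one) may differ. The function name `network_penalty` and the tuning parameters `lam1` and `lam2` are hypothetical.

```python
def network_penalty(beta, edges, lam1, lam2):
    """Evaluate lam1 * ||beta||_1 + lam2 * beta' L beta for an undirected graph.

    beta  : list of regression coefficients, one per node (gene)
    edges : list of (u, v) index pairs, each undirected edge listed once
    Assumes L = D - A (unweighted combinatorial Laplacian), so that
    beta' L beta = sum over edges of (beta[u] - beta[v]) ** 2.
    """
    l1_term = sum(abs(b) for b in beta)
    smooth_term = sum((beta[u] - beta[v]) ** 2 for u, v in edges)
    return lam1 * l1_term + lam2 * smooth_term
```

Setting `lam2 = 0` recovers the ordinary lasso penalty, while a larger `lam2` favours coefficient vectors in which linked genes have similar values.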
Caiyan Li et al.

Vertex Clustering in Random Graphs via Reversible Jump Markov Chain Monte Carlo
http://biostats.bepress.com/upennbiostat/art22
Wed, 05 Dec 2007 06:21:34 PST
Networks are a natural and effective tool to study relational data, in which observations are collected on pairs of units. The units are represented by nodes and their relations by edges. In biology, for example, proteins and their interactions, and, in social science, people and inter-personal relations may be the nodes and the edges of the network. In this paper we address the question of clustering vertices in networks, as a way to uncover homogeneity patterns in data that enjoy a network representation. We use a mixture model for random graphs and propose a reversible jump Markov chain Monte Carlo algorithm to infer its parameters. Applications of the algorithm to one simulated data set and three real data sets, which describe friendships among members of a university karate club, social interactions of dolphins, and gap junctions in C. elegans, are given.
Stefano Monni et al.

A Hidden Spatial-temporal Markov Random Field Model for Network-based Analysis of Time Course Gene Expression Data
http://biostats.bepress.com/upennbiostat/art21
Tue, 02 Oct 2007 12:40:25 PDT
Microarray time course (MTC) gene expression data are commonly collected to study the dynamic nature of biological processes. One important problem is to identify genes that show different expression profiles over time and pathways that are perturbed during a given biological process. While methods are available to identify the genes with differential expression levels over time, there is a lack of methods that can incorporate the pathway information in identifying the pathways being modified/activated during a biological process. In this paper, we develop a hidden spatial-temporal Markov random field (hstMRF)-based method for identifying genes and subnetworks that are related to biological diseases, where the dependency of the differential expression patterns of genes on the networks is modeled over time and over the network of pathways. Simulation studies indicated that the method is quite effective in identifying genes and modified subnetworks and has higher sensitivity than the commonly used procedures that do not use the pathway structure or time dependency information, with similar false discovery rates. Application to a microarray gene expression study of systemic inflammation in humans identified a core set of genes on the KEGG pathways that show clear differential expression patterns over time. In addition, the method confirmed that the Toll-like signaling pathway plays an important role in immune response to endotoxins.
Zhi Wei et al.
Genetics

Variable Selection for Nonparametric Varying-Coefficient Models for Analysis of Repeated Measurements
http://biostats.bepress.com/upennbiostat/art20
Mon, 23 Jul 2007 11:59:35 PDT
Nonparametric varying-coefficient models are commonly used for analysis of data measured repeatedly over time, including longitudinal and functional response data. While many procedures have been developed for estimating the varying coefficients, the problem of variable selection for such models has not been addressed. In this article, we present a regularized estimation procedure for variable selection for such nonparametric varying-coefficient models using basis function approximations and a group smoothly clipped absolute deviation penalty (gSCAD). This gSCAD procedure simultaneously selects significant variables with time-varying effects and estimates unknown smooth functions using basis function approximations. With appropriate selection of the tuning parameters, we have established the oracle property of the procedure and the consistency of the function estimation. The methods are illustrated with simulations and an application to analysis of microarray time-course gene expression data in order to identify the transcription factors that are related to the yeast cell cycle process.
Lifeng Wang et al.

Methodological Issues in the Study of the Effects of Hemoglobin Variability
http://biostats.bepress.com/upennbiostat/art19
Tue, 19 Jun 2007 11:33:33 PDT
We consider estimating the effect of hemoglobin variability on mortality in hemodialysis patients. Causal effects can be defined as comparisons of outcomes under different hypothetical interventions. Defining measures of the effect of hemoglobin variability on clinical outcomes is complicated by the fact that hypothetical interventions on variability used to define its effect inevitably involve manipulation of related variables. We propose a model-based definition of the effect of hemoglobin variability as a parameter for variability in a causal model for the effect of an overall intervention on hemoglobin levels over time. We consider this problem using history-adjusted marginal structural models, and apply this approach to data from a large observational database. We consider issues arising when the variable of interest is endogenous, and consider in principle alternate estimands.
Marshall Joffe et al.

A Markov Random Field Model for Network-based Analysis of Genomic Data
http://biostats.bepress.com/upennbiostat/art18
Thu, 29 Mar 2007 11:59:29 PDT
A central problem in genomic research is the identification of genes and pathways involved in diseases and other biological processes. The genes identified or the univariate test statistics are often linked to known biological pathways through gene set enrichment analysis in order to identify the pathways involved. However, most of the procedures for identifying differentially expressed genes do not utilize the known pathway information in the phase of identifying such genes. In this paper, we develop a Markov random field (MRF)-based method for identifying genes and subnetworks that are related to diseases. Such a procedure models the dependency of the differential expression patterns of genes on the networks using a local discrete MRF model. Simulation studies indicated that the method is quite effective in identifying genes and subnetworks that are related to disease and has higher sensitivity and lower false discovery rates than the commonly used procedures that do not use the pathway structure information. Applications to two breast cancer microarray gene expression datasets identified several subnetworks on several of the KEGG transcriptional pathways that are related to breast cancer recurrence or survival due to breast cancer. The proposed MRF-based model efficiently utilizes the known pathway structures in identifying the differentially expressed genes and the subnetworks that might be related to phenotype. As more biological networks are identified and documented in databases, the proposed method should find more applications in identifying the subnetworks that are related to diseases and other biological processes.
Zhi Wei et al.

Statistical Methods for Inference of Genetic Networks and Regulatory Modules
http://biostats.bepress.com/upennbiostat/art17
Fri, 23 Mar 2007 08:53:41 PDT
Large-scale microarray gene expression data, motif data derived from promoter sequences, genome-wide chromatin immunoprecipitation (ChIP-chip) data, DNA polymorphism data and epigenomic data provide the possibility of constructing genetic networks or biological pathways, especially regulatory networks. In this paper, we review some new statistical methods for inference of genetic networks and regulatory modules, including a threshold gradient descent procedure for inference of Gaussian graphical models, a sparse regression mixture modeling approach for inference of regulatory modules, and the varying coefficient model for identifying regulatory subnetworks by integrating microarray time-course gene expression data and motif or ChIP-chip data. We present the statistical formulations of the problems, statistical methods, and results from analysis of real data sets. Areas of future research are also discussed.
Hongzhe Li

Group SCAD Regression Analysis for Microarray Time Course Gene Expression Data
http://biostats.bepress.com/upennbiostat/art16
Thu, 01 Feb 2007 12:04:56 PST
Since many important biological systems or processes are dynamic systems, it is important to study the gene expression patterns over time on a genomic scale in order to capture the dynamic behavior of gene expression. Microarray technologies have made it possible to measure the gene expression levels of essentially all the genes during a given biological process. In order to determine the transcriptional factors (TFs) involved in gene regulation during a given biological process, we propose a functional response model with varying coefficients to model the transcriptional effects on gene expression levels, together with a group smoothly clipped absolute deviation (SCAD) regression procedure for selecting the transcriptional factors with varying coefficients that are involved in gene regulation during a biological process. Simulation studies indicated that such a procedure is quite effective in selecting the relevant variables with time-varying coefficients and in estimating the coefficients. Application to the yeast cell cycle microarray time course gene expression data set identified 19 of the 21 known transcriptional factors related to the cell cycle process. In addition, we have identified another 52 TFs that also have periodic transcriptional effects on gene expression during the cell cycle process. Compared to simple linear regression analysis at each time point, our procedure identified more known cell cycle related transcriptional factors. The proposed group SCAD regression procedure is very effective for identifying variables with time-varying coefficients, in particular, for identifying the transcriptional factors that are related to gene expression over time. By identifying the transcriptional factors that are related to gene expression variations over time, the procedure can potentially provide more insight into the gene regulatory networks.
Lifeng Wang et al.