Collection of Biostatistics Research Archive
Copyright (c) 2009 All rights reserved.
http://biostats.bepress.com
Recent documents in Collection of Biostatistics Research Archive
Fri, 03 Jul 2009 05:20:22 PDT

Integrative Clustering of Multiple Genomic Data Types using a Joint Latent Variable Model with Application to Breast and Lung Cancer Subtype Analysis
http://biostats.bepress.com/cobra/ps/art56
Sat, 20 Jun 2009 16:19:11 PDT
The molecular complexity of a tumor manifests at the genomic, epigenomic, transcriptomic, and proteomic levels. Genomic profiling of these levels should allow an integrated characterization of tumor etiology. However, there is a shortage of effective statistical and bioinformatic tools for truly integrative data analysis. The standard approach to integrative clustering is separate clustering followed by manual integration. A more statistically powerful approach would incorporate all data types simultaneously and generate a single integrated cluster assignment across data types. We developed a joint latent variable model for integrative clustering. We call the resulting methodology iCluster. iCluster incorporates flexible modeling of the associations between different data types and the variance-covariance structure within data types, while achieving simultaneous data dimension reduction. Likelihood-based inference is obtained through the Expectation-Maximization algorithm. We demonstrate the iCluster algorithm using two examples of joint analysis of copy number and gene expression data, one from breast cancer and one from lung cancer. In both cases, we identified subtypes characterized by concordant DNA copy number changes and gene expression as well as unique profiles specific to one or the other in a completely automated fashion. In addition, the algorithm discovers novel subtypes by combining weak yet consistent alteration patterns across data types.
Ronglai Shen
Computational Biology/Bioinformatics

Simple, Defensible Sample Sizes Based on Cost Efficiency -- With Discussion and Rejoinder
http://biostats.bepress.com/cobra/ps/art55
Wed, 17 Jun 2009 16:12:32 PDT
The conventional approach of choosing sample size to provide 80% or greater power ignores the cost implications of different sample size choices. Costs, however, are often impossible for investigators and funders to ignore in actual practice. Here, we propose and justify a new approach for choosing sample size based on cost efficiency, the ratio of a study's projected scientific and/or practical value to its total cost. By showing that a study's projected value exhibits diminishing marginal returns as a function of increasing sample size for a wide variety of definitions of study value, we are able to develop two simple choices that can be defended as more cost efficient than any larger sample size. The first is to choose the sample size that minimizes the average cost per subject. The second is to choose sample size to minimize total cost divided by the square root of sample size. This latter method is theoretically more justifiable for innovative studies, but also performs reasonably well and has some justification in other cases. For example, if projected study value is assumed to be proportional to power at a specific alternative and total cost is a linear function of sample size, then this approach is guaranteed either to produce more than 90% power or to be more cost efficient than any sample size that does. These methods are easy to implement, based on reliable inputs, and well justified, so they should be regarded as acceptable alternatives to current conventional approaches.
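The two rules described in the abstract above can be illustrated with a small grid search. This is only a sketch under stated assumptions, not the authors' implementation: the cost functions, their dollar figures, and the function names (rule_min_cost_per_subject, rule_min_cost_over_sqrt_n, linear_cost, convex_cost) are hypothetical and chosen purely for illustration.

```python
import math


def rule_min_cost_per_subject(total_cost, candidates):
    """Return the candidate n that minimizes total_cost(n) / n."""
    return min(candidates, key=lambda n: total_cost(n) / n)


def rule_min_cost_over_sqrt_n(total_cost, candidates):
    """Return the candidate n that minimizes total_cost(n) / sqrt(n)."""
    return min(candidates, key=lambda n: total_cost(n) / math.sqrt(n))


# Hypothetical linear cost: a fixed start-up cost plus a constant cost per subject.
def linear_cost(n, fixed=100_000.0, per_subject=500.0):
    return fixed + per_subject * n


# Hypothetical convex cost: recruitment gets more expensive as n grows,
# so the average cost per subject has an interior minimum.
def convex_cost(n, fixed=100_000.0, per_subject=500.0, crowding=4.0):
    return fixed + per_subject * n + crowding * n * n


candidates = range(1, 2001)
n_sqrt_rule = rule_min_cost_over_sqrt_n(linear_cost, candidates)
n_avg_rule = rule_min_cost_per_subject(convex_cost, candidates)
print("square-root rule, linear cost:", n_sqrt_rule)
print("average-cost rule, convex cost:", n_avg_rule)
```

For a linear cost, the square-root rule has a closed form: setting the derivative of (fixed + per_subject * n) / sqrt(n) to zero gives n = fixed / per_subject, which is what the grid search recovers here. The average-cost rule needs a cost function whose per-subject cost eventually rises, since with a purely linear cost the average keeps falling as n grows.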
Peter Bacchetti
General Biostatistics

"Implementation of quasi-least squares With the R package qlspack"
http://biostats.bepress.com/upennbiostat/papers/art32
Wed, 17 Jun 2009 08:25:14 PDT
Quasi-least squares (QLS) is an alternative method for estimating the correlation parameters within the framework of generalized estimating equations (GEE) that has two main advantages over the moment estimates that are typically applied for GEE: (1) it guarantees a consistent estimate of the correlation parameter and a positive definite estimated correlation matrix, for several correlation structures; and (2) it allows for easier implementation of some correlation structures that have not yet been implemented in the framework of GEE. Furthermore, because QLS is a method in the framework of GEE, existing software can be employed within the QLS algorithm for estimation of the correlation and regression parameters. In this manuscript we describe and demonstrate the user-written package qlspack, which implements QLS in R. Our package qlspack calls the geepack package (Yan, 2002; Halekoh et al., 2006) to update the estimate of the regression parameter at the current QLS estimate of the correlation parameter; hence, geepack-related functions for standard error estimation can be used after implementing qlspack.
Jichun Xie
General Biostatistics

Graphical Displays for Clarifying How Allocation Ratio Affects Total Sample Size for the Two Sample Logrank Test
http://biostats.bepress.com/uncbiostat/papers/art12
Tue, 16 Jun 2009 05:46:13 PDT
For time-to-event data, the power of the two sample logrank test for the comparison of two treatment groups can be greatly influenced by the ratio of the number of patients in each of the treatment groups.
Despite the possible loss of power, unequal allocations may be of interest due to a need to collect more data on one of the groups or to considerations related to the acceptability of the treatments to patients. Investigators pursuing such designs may be interested in the cost of the unbalanced design relative to a balanced design with respect to the total number of patients required for the study. We present graphical displays to illustrate the sample size adjustment factor, or ratio of the sample size required by an unequal allocation compared to the sample size required by a balanced allocation, for various survival rates, treatment hazard ratios, and sample size allocation ratios. These graphical displays conveniently summarize information in the literature and provide a useful tool for planning sample sizes for the two sample logrank test.
Benjamin R. Saville
Clinical Trials Statistical Theory and Methods

Two-Stage Phase II Clinical Trials with Heterogeneous Patient Populations
http://biostats.bepress.com/dukebiostat/papers/art5
Fri, 12 Jun 2009 09:13:36 PDT
The patient population for a phase II trial often consists of multiple subgroups with different prognoses. In this case, a popular design approach is to specify the response rate and the prevalence of each subgroup, to calculate the response rate of the whole population as the weighted average of the response rates across subgroups, and to choose a standard phase II design such as Simon's optimal or minimax design to test the response rate for the whole population. Even when the prevalence of each subgroup is accurately specified, the observed prevalence among the patients accrued to the study may be quite different from the specified one because of the small sample size, which is typical of most phase II trials.
In this case, the fixed rejection value for a chosen standard phase II design may be either too conservative (i.e., increasing the false rejection probability of the experimental therapy) if the trial accrues more high-risk patients than expected or too anti-conservative (i.e., increasing the false acceptance probability of the experimental therapy) if the trial accrues more low-risk patients than expected. We can avoid this problem by adjusting the rejection value depending on the observed prevalence from the trial. In this paper, we investigate two flexible design approaches that choose rejection values depending on the observed prevalence, and compare them under various
Sin-Ho Jung
Clinical Trials

Nonparametric Incidence Estimation From Prevalent Cohort Survival Data
http://biostats.bepress.com/cobra/ps/art54
Wed, 01 Apr 2009 11:57:44 PDT
Incidence is an important epidemiologic concept particularly useful in assessing an intervention, quantifying disease risk, and planning health resources. Incident cohort studies constitute the gold standard in estimating disease incidence. However, due to material constraints, data are often collected from prevalent cohort studies whereby diseased individuals are recruited through a cross-sectional survey and followed forward in time. We discuss the identifiability of measures of incidence in the context of prevalent cohort survival studies and derive nonparametric maximum likelihood estimators and their asymptotic properties. The proposed methodology accounts for calendar-time and age-at-onset variation in disease incidence while also addressing common complications arising from the sampling scheme, hence providing flexible and robust estimates. We also discuss age-specific incidence and adjustments for temporal variations in survival.
We apply our methodology to data from the Canadian Study of Health and Aging and provide insight into temporal trends in the incidence of dementia in the Canadian elderly population.
Marco Carone
Epidemiology

A Novel Topology for Representing Protein Folds
http://biostats.bepress.com/cobra/ps/art53
Wed, 25 Mar 2009 11:16:55 PDT
Various topologies for representing three-dimensional protein structures have been advanced for purposes ranging from prediction of folding rates to ab initio structure prediction. Examples include relative contact order, Delaunay tessellations, and backbone torsion angle distributions. Here we introduce a new topology based on a novel means for operationalizing three-dimensional proximities with respect to the underlying chain. The measure involves first interpreting a rank-based representation of the nearest neighbors of each residue as a permutation, then determining how perturbed this permutation is relative to an unfolded chain. We show that the resultant topology provides improved association with folding and unfolding rates determined for a set of two-state proteins under standardized conditions. Furthermore, unlike existing topologies, the proposed geometry exhibits fine-scale structure with respect to sequence position along the chain, potentially providing insights into folding initiation and/or nucleation sites.
Mark R. Segal
Computational Biology/Bioinformatics

Fitting ACE Structural Equation Models to Case-Control Family Data
http://biostats.bepress.com/cobra/ps/art52
Wed, 18 Mar 2009 21:53:44 PDT
Investigators interested in whether a disease aggregates in families often collect case-control family data, which consist of disease status and covariate information for families selected via case or control probands.
Here, we focus on the use of case-control family data to investigate the relative contributions to the disease of additive genetic effects (A), shared family environment (C), and unique environment (E). To this end, we describe an ACE model for binary family data and then introduce an approach to fitting the model to case-control family data. The structural equation model, which has been described previously, combines a general-family extension of the classic ACE twin model with a (possibly covariate-specific) liability-threshold model for binary outcomes. Our likelihood-based approach to fitting involves conditioning on the proband's disease status, as well as setting prevalence equal to a pre-specified value that can be estimated from the data themselves if necessary. Simulation experiments suggest that our approach to fitting yields approximately unbiased estimates of the A, C, and E variance components, provided that certain commonly made assumptions hold. These assumptions include: the usual assumptions for the classic ACE and liability-threshold models; assumptions about shared family environment for relative pairs; and assumptions about the case-control family sampling, including single ascertainment. When our approach is used to fit the ACE model to Austrian case-control family data on depression, the resulting estimate of heritability is very similar to those from previous analyses of twin data.
Kristin N. Javaras
Genetics

Correlated Binary Regression Using Orthogonalized Residuals
http://biostats.bepress.com/cobra/ps/art51
Wed, 11 Mar 2009 14:49:30 PDT
This paper focuses on marginal regression models for correlated binary responses when estimation of the association structure is of primary interest. A new estimating function approach based on orthogonalized residuals is proposed.
This procedure allows a new representation and addresses some of the difficulties of the conditional-residual formulation of alternating logistic regressions of Carey, Zeger & Diggle (1993). The new method is illustrated with an analysis of data on impaired pulmonary function.
Richard C. Zink
Multivariate Analysis Statistical Models Statistical Theory and Methods

Reinforcement Learning Design for Cancer Clinical Trials
http://biostats.bepress.com/uncbiostat/papers/art11
Mon, 09 Mar 2009 16:15:21 PDT
We develop reinforcement learning trials for discovering individualized treatment regimens for life-threatening diseases such as cancer. We use a temporal-difference learning method called Q-learning, which learns an optimal policy from a single training set of finite longitudinal patient trajectories. The Q-function, with time-indexed parameters, can be approximated using support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical models, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known. To support our claims, the methodology's practical utility is illustrated in a simulation analysis. In future research, we will apply this general strategy to studying and identifying new treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer, whose treatment usually includes multiple lines of chemotherapy.
Yufan Zhao
Clinical Trials Disease Modeling Statistical Theory and Methods
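
The idea of learning a policy from a fixed set of finite patient trajectories, as described in the last abstract above, can be sketched with a two-stage backward recursion. This is a toy illustration under stated assumptions and not the authors' procedure: it swaps in plain per-cell averages for the support vector regression or extremely randomized trees named in the abstract, and the states ("low", "high"), treatments ("A", "B"), rewards, and function names are all made up.

```python
from collections import defaultdict


def mean_by_key(pairs):
    """Average the values observed for each key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def batch_q_learning(trajectories, actions):
    """Two-stage backward recursion on (s1, a1, r1, s2, a2, r2) tuples."""
    # Stage 2: Q2(s2, a2) is the mean final reward in each (state, action) cell.
    q2 = mean_by_key(((s2, a2), r2) for _, _, _, s2, a2, r2 in trajectories)

    def best_q2(state):
        # Value of acting optimally at stage 2, over actions seen in this state.
        return max(q2[(state, a)] for a in actions if (state, a) in q2)

    # Stage 1: the target adds the immediate reward to the best stage-2 value,
    # which is how delayed effects of the first treatment are credited.
    q1 = mean_by_key(
        ((s1, a1), r1 + best_q2(s2)) for s1, a1, r1, s2, _, _ in trajectories
    )
    return q1, q2


def greedy_policy(q):
    """Pick, for each state, the action with the largest estimated Q-value."""
    policy = {}
    for (state, action), value in q.items():
        if state not in policy or value > q[(state, policy[state])]:
            policy[state] = action
    return policy


# Hypothetical trajectories: treatment "A" pays nothing at stage 1 but leaves
# the patient in a state where a good stage-2 outcome is still reachable.
trajectories = [
    ("low", "A", 0.0, "low", "A", 1.0),
    ("low", "A", 0.0, "low", "B", 3.0),
    ("low", "B", 1.0, "high", "A", 0.0),
    ("low", "B", 1.0, "high", "B", 1.0),
]
q1, q2 = batch_q_learning(trajectories, actions=("A", "B"))
policy1, policy2 = greedy_policy(q1), greedy_policy(q2)
print(policy1, policy2)
```

In this made-up data set the stage-1 policy prefers "A" even though "B" has the larger immediate reward, because the recursion carries the best achievable stage-2 value back into the stage-1 estimate.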