Collection of Biostatistics Research Archive
Copyright (c) 2014 COBRA All rights reserved.
http://biostats.bepress.com
Recent documents in Collection of Biostatistics Research Archive
Sat, 20 Sep 2014 01:39:35 PDT

Variable selection for zero-inflated and overdispersed data with application to health care demand in Germany
http://biostats.bepress.com/cobra/art110
Fri, 19 Sep 2014 15:32:10 PDT
In health services and outcome research, count outcomes are frequently encountered and often have a large proportion of zeros. The zero-inflated negative binomial (ZINB) regression model has important applications for this type of data. With many possible candidate risk factors, this paper proposes new variable selection methods for the ZINB model. We consider the likelihood function plus a penalty, where the penalties include the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP). An EM (expectation-maximization) algorithm is proposed for estimating the model parameters and conducting variable selection simultaneously. This algorithm consists of estimating penalized weighted negative binomial models and penalized logistic models via the coordinate descent algorithm. Furthermore, statistical properties including the standard error formula are provided. A simulation study shows that the new algorithm not only has more accurate or at least comparable estimation, but is also more robust than the traditional stepwise variable selection. The application is illustrated with a data set on health care demand in Germany. The proposed techniques have been implemented in an open-source R package mpath.
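The building blocks named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mpath implementation: the function names and the mean/dispersion parameterization (`mu`, `theta`, with zero-inflation probability `pi`) are chosen here for illustration. It shows the ZINB probability mass function, the E-step weight that an observed zero is a structural zero, and the soft-thresholding operator that a LASSO coordinate descent update typically relies on.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and dispersion theta
    (variance mu + mu**2 / theta)."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi):
    """ZINB pmf: a structural zero with probability pi,
    otherwise a negative binomial count."""
    base = (1.0 - pi) * nb_pmf(y, mu, theta)
    return pi + base if y == 0 else base

def e_step_weight(y, mu, theta, pi):
    """E-step: posterior probability that an observation is a structural zero.
    Positive counts cannot be structural zeros."""
    if y > 0:
        return 0.0
    return pi / zinb_pmf(0, mu, theta, pi)

def soft_threshold(z, lam):
    """Soft-thresholding operator used in LASSO coordinate descent updates."""
    return math.copysign(max(abs(z) - lam, 0.0), z)
```

In an EM iteration of this kind, the E-step weights would feed a weighted negative binomial fit for the count part and a logistic fit for the zero-inflation part; the penalized versions of those fits are what the abstract describes.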
Zhu Wang et al.

Estimation of the Overall Treatment Effect in the Presence of Interference in Cluster-randomized Trials of Infectious Disease Prevention
http://biostats.bepress.com/harvardbiostat/paper180
Fri, 19 Sep 2014 07:16:14 PDT
Nicole Bohme Carnegie et al.

Targeted Learning of an Optimal Dynamic Treatment, and Statistical Inference for its Mean Outcome
http://biostats.bepress.com/ucbbiostat/paper329
Wed, 03 Sep 2014 10:59:06 PDT
Suppose we observe n independent and identically distributed observations of a time-dependent random variable consisting of baseline covariates, initial treatment and censoring indicator, intermediate covariates, subsequent treatment and censoring indicator, and a final outcome. For example, this could be data generated by a sequentially randomized controlled trial, where subjects are sequentially randomized to a first line and second line treatment, possibly assigned in response to an intermediate biomarker, and are subject to right-censoring. In this article we consider estimation of an optimal dynamic multiple time-point treatment rule defined as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to only respond to a user-supplied subset of the baseline and intermediate covariates. This estimation problem is addressed in a statistical model for the data distribution that is nonparametric beyond possible knowledge about the treatment and censoring mechanism, while still providing statistical inference for the mean outcome under the optimal rule. This contrasts with the current literature, which relies on parametric assumptions. For the sake of presentation, we first consider the case that the treatment/censoring is only assigned at a single time-point, and subsequently, we cover the multiple time-point case. We characterize the optimal dynamic treatment as a statistical target parameter in the nonparametric statistical model, and we propose highly data adaptive estimators of this optimal dynamic regimen, utilizing sequential loss-based super-learning of sequentially defined so-called blip functions, based on newly proposed loss functions.
We also propose a cross-validation selector (among candidate estimators of the optimal dynamic regimens) based on a cross-validated targeted minimum loss-based estimator of the mean outcome under the candidate regimen, thereby aiming directly to select the candidate estimator that maximizes the mean outcome. We also establish that the mean of the counterfactual outcome under the optimal dynamic treatment is a pathwise differentiable parameter under assumptions, and develop a targeted minimum loss-based estimator (TMLE) of this target parameter. We establish asymptotic linearity and statistical inference based on this targeted minimum loss-based estimator under specified conditions. In a sequentially randomized trial the statistical inference essentially only relies upon a second order difference between the estimator of the optimal dynamic treatment and the optimal dynamic treatment to be asymptotically negligible, which may be a problematic condition when the rule is based on multivariate time-dependent covariates. To avoid this condition, we also develop targeted minimum loss-based estimators and statistical inference for data adaptive target parameters that are defined in terms of the mean outcome under the estimate of the optimal dynamic treatment. In particular, we develop a novel cross-validated TMLE approach that provides asymptotic inference under minimal conditions, avoiding the need for any empirical process conditions. For the sake of presentation, in the main part of the article we focus on two-time point interventions, but the results are generalized to general multiple time point interventions in the appendix.
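For the single time-point case mentioned in this abstract, the idea of a blip function can be illustrated with a deliberately simple plug-in sketch. It is not the super-learning or TMLE procedure of the paper: it assumes a randomized binary treatment and one discrete covariate, so that stratum-specific outcome means estimate the conditional mean outcome. The function name `estimate_rule` and the data layout are hypothetical.

```python
from collections import defaultdict

def estimate_rule(data):
    """data: iterable of (v, a, y) with discrete covariate v,
    binary treatment a (0 or 1) and outcome y.
    Returns {v: treatment}, assigning treatment 1 when the estimated blip
    E[Y | A=1, V=v] - E[Y | A=0, V=v] is positive."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(lambda: [0, 0])
    for v, a, y in data:
        sums[v][a] += y
        counts[v][a] += 1
    rule = {}
    for v in sums:
        if counts[v][0] == 0 or counts[v][1] == 0:
            continue  # blip is not estimable without both arms in the stratum
        blip = sums[v][1] / counts[v][1] - sums[v][0] / counts[v][0]
        rule[v] = 1 if blip > 0 else 0
    return rule
```

With continuous or high-dimensional covariates the stratum means would be replaced by a regression estimate of the blip function, which is where the data adaptive estimators described above come in.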
Mark J. van der Laan et al.

Partially-Latent Class Models (pLCM) for Case-Control Studies of Childhood Pneumonia Etiology
http://biostats.bepress.com/jhubiostat/paper267
Wed, 27 Aug 2014 12:50:27 PDT
In population studies on the etiology of disease, one goal is the estimation of the fraction of cases attributable to each of several causes. For example, pneumonia is a clinical diagnosis of lung infection that may be caused by viral, bacterial, fungal, or other pathogens. The study of pneumonia etiology is challenging because directly sampling from the lung to identify the etiologic pathogen is not standard clinical practice in most settings. Instead, measurements from multiple peripheral specimens are made. This paper considers the problem of estimating the population etiology distribution and the individual etiology probabilities. We formulate the scientific problem in statistical terms as estimating the posterior distribution of mixing weights and latent class indicators under a partially-latent class model (pLCM) that combines heterogeneous measurements with different error rates obtained from a case-control study. We introduce the pLCM as an extension of the latent class model. We also introduce graphical displays of the population data and inferred latent-class frequencies. The methods are illustrated with simulated and real data sets. The paper closes with a brief description of extensions of the pLCM to the regression setting and to the case where conditional independence among the measures is relaxed.
Zhenke Wu et al.

Some models and methods for the analysis of observational data
http://biostats.bepress.com/cobra/art109
Tue, 12 Aug 2014 16:21:53 PDT
This article provides a short, concise and essentially self-contained exposition of some of the most important models and methods for the analysis of observational data, and a substantial number of illustrations of their application. Although for the most part our presentation follows P. Rosenbaum’s book, “Observational Studies”, and naturally draws on related literature, it contains original elements and simplifies and generalizes some basic results. The illustrations, based on simulated data, show the methods at work in some detail, highlighting pitfalls and emphasizing certain subjective aspects of the statistical analyses.
José A. Ferreira

Instrumental Variable Estimation in a Survival Context
http://biostats.bepress.com/harvardbiostat/paper179
Tue, 12 Aug 2014 07:33:27 PDT
Eric J. Tchetgen Tchetgen et al.

Likelihood Based Estimation of Logistic Structural Nested Mean Models with an Instrumental Variable
http://biostats.bepress.com/harvardbiostat/paper178
Mon, 04 Aug 2014 07:13:37 PDT
Roland A. Matsouaka et al.

A General Approach to Detect Gene (G)-environment (E) Additive Interaction Leveraging G-E Independence in Case-control Studies
http://biostats.bepress.com/harvardbiostat/paper177
Wed, 30 Jul 2014 09:55:20 PDT
Eric Tchetgen Tchetgen et al.

A Novel Targeted Learning Method for Quantitative Trait Loci Mapping
http://biostats.bepress.com/ucbbiostat/paper328
Fri, 25 Jul 2014 14:21:00 PDT
We present a novel semiparametric method for quantitative trait loci (QTL) mapping in experimental crosses. Conventional genetic mapping methods typically assume parametric models with Gaussian errors and obtain parameter estimates through maximum likelihood estimation. In contrast with univariate regression and interval mapping methods, our model requires fewer assumptions and also accommodates various machine learning algorithms. Estimation is performed with targeted maximum likelihood learning methods. We demonstrate our semiparametric targeted learning approach in a simulation study and a well-studied barley dataset.
Hui Wang et al.

Methods for Exploring Treatment Effect Heterogeneity in Subgroup Analysis: An Application to Global Clinical Trials
http://biostats.bepress.com/cobra/art108
Tue, 22 Jul 2014 20:52:03 PDT
Multi-country randomised clinical trials (MRCTs) are common in the medical literature and their interpretation has been the subject of extensive recent discussion. In many MRCTs, an evaluation of treatment effect homogeneity across countries or regions is conducted. Subgroup analysis principles require a significant test of interaction in order to claim heterogeneity of treatment effect across subgroups, such as countries in an MRCT. As clinical trials are typically underpowered for tests of interaction, overly optimistic expectations of treatment effect homogeneity can lead researchers, regulators and other stakeholders to over-interpret apparent differences between subgroups even when heterogeneity tests are not significant. In this paper we consider some exploratory analysis tools to address this issue. We present three measures derived using the theory of order statistics which can be used to understand the magnitude and the nature of the variation in treatment effects that can arise merely as an artefact of chance. These measures are not intended to replace a formal test of interaction, but instead provide non-inferential visual aids allowing comparison of the observed and expected differences between regions or other subgroups, and are a useful supplement to a formal test of interaction. We discuss how our methodology differs from recently published methods addressing the same issue. A case study of our approach is presented using data from the PLATO study, which was a large cardiovascular MRCT that has been the subject of controversy in the literature. An R package is available from the authors on request.
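The abstract does not define its three measures, so the following is only a generic illustration of how order statistics quantify chance variation, not the authors' method. It assumes that, under a homogeneous treatment effect, the standardised effect estimates of k equally sized subgroups behave like independent standard normal draws, and computes the expected largest value and the expected range by numerical integration. All function names are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_max(k, lo=-10.0, hi=10.0, steps=20000):
    """E[max of k iid N(0,1)] by trapezoidal integration of
    x * k * pdf(x) * cdf(x)**(k - 1), the density of the largest order statistic."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        f = x * k * normal_pdf(x) * normal_cdf(x) ** (k - 1)
        total += f if 0 < i < steps else 0.5 * f
    return total * h

def expected_range(k):
    """Expected range of k iid N(0,1) draws; by symmetry E[min] = -E[max]."""
    return 2.0 * expected_max(k)
```

Under these assumptions the expected gap between the best and worst subgroup grows with the number of subgroups even when the true effect is identical everywhere, which is the kind of chance variation the abstract warns against over-interpreting.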
I. Manjula Schou et al.

Pre-maceration, Saignée and Temperature affect Daily Evolution of Pigment Extraction During Vinification
http://biostats.bepress.com/cobra/art107
Tue, 22 Jul 2014 20:52:01 PDT
Consumer demand for intensely coloured wines necessitates the systematic testing of pigment extraction in Sangiovese, a cultivar poor in easily extractable anthocyanins. Pre-fermentation (absent, cold soak pre-fermentation at 5 °C, cryomaceration by liquid N₂ addition), temperature (20 or 30 °C), and saignée were compared during vinification (800 kg). Concentrations of anthocyanins, non-anthocyanic flavonoids and SO₂-resistant pigments were recorded daily. A semiparametric Bayesian model permitted the kinetic description and the comparison of sigmoidal- and exponential-like curves. In total anthocyanins, saignée at 30 °C yielded a significant gain, later lost at drawing off; cryomaceration had little effect and cold soak no effect at drawing off. Non-anthocyanic flavonoids increased steadily with saignée and at 30 °C. SO₂-resistant pigments were higher, particularly for the higher temperature/saignée combination. Using daily recordings, the model indicates turning points for concentration rise or fall, thus allowing a precise and detailed comparison of the vinification methods.
Ottorino L. Pantani et al.

A Simple Regression-based Approach to Account for Survival Bias in Birth Outcomes Research
http://biostats.bepress.com/harvardbiostat/paper176
Mon, 21 Jul 2014 06:49:06 PDT
Eric J. Tchetgen Tchetgen et al.

A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome
http://biostats.bepress.com/harvardbiostat/paper175
Mon, 21 Jul 2014 06:49:02 PDT
Eric Tchetgen Tchetgen

Entering the Era of Data Science: Targeted Learning and the Integration of Statistics and Computational Data Analysis
http://biostats.bepress.com/ucbbiostat/paper327
Thu, 17 Jul 2014 16:05:26 PDT
This outlook article will appear in Advances in Statistics. It reviews the research of Dr. van der Laan's group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming to only rely on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical and historical perspective on Targeted Learning, relating it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current big data movement.
Mark J. van der Laan et al.

Bayesian Model Averaging:- An Application in Cancer Clinical Trial
http://biostats.bepress.com/cobra/art106
Wed, 16 Jul 2014 22:37:17 PDT
Drawing data-driven conclusions is the most widely accepted approach to any medical research problem. When prior knowledge about the data supporting the problem is limited, automatic selection of variables plays an important role in gaining insight into the study. Bayesian model averaging (BMA) can be considered for automatic variable selection. It can be used as an alternative to stepwise regression. The aim of this paper is to show the application of Bayesian model averaging in medical research, particularly in cancer trials. The method is illustrated on bone marrow transplant data. BMA can be recommended for frequent use in data selection and as a tool for exploratory data analysis. It is a very handy method of choice for data analysis.
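One common way to carry out model averaging for variable selection, which may or may not match the paper's computations, is to approximate posterior model probabilities from BIC values under equal prior model probabilities and then sum the weights of the models containing each variable. The sketch below assumes the BIC of each candidate model has already been computed elsewhere; the function and variable names are hypothetical.

```python
import math

def bma_weights(bics):
    """Approximate posterior model probabilities from BIC values,
    assuming equal prior probabilities for all candidate models."""
    best = min(bics)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

def inclusion_probabilities(models, weights, variables):
    """Posterior inclusion probability of each variable:
    the total weight of the candidate models that contain it."""
    return {v: sum(w for m, w in zip(models, weights) if v in m)
            for v in variables}
```

A variable with inclusion probability near 1 appears in essentially all well-supported models, which is the sense in which model averaging can stand in for stepwise selection.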
Atanu Bhattacharjee

Control Function Assisted IPW Estimation with a Secondary Outcome in Case-Control Studies
http://biostats.bepress.com/harvardbiostat/paper174
Wed, 16 Jul 2014 07:45:51 PDT
Tamar Sofer et al.

Super-Learning of an Optimal Dynamic Treatment Rule
http://biostats.bepress.com/ucbbiostat/paper326
Wed, 02 Jul 2014 13:58:35 PDT
We consider the estimation of an optimal dynamic two time-point treatment rule defined as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to depend only on a user-supplied subset of the baseline and intermediate covariates. This estimation problem is addressed in a statistical model for the data distribution that is nonparametric, beyond possible knowledge about the treatment and censoring mechanisms. We propose data adaptive estimators of this optimal dynamic regime which are defined by sequential loss-based learning under both the blip function and weighted classification frameworks. Rather than selecting an estimation framework and algorithm a priori, we propose combining estimators from both frameworks using a super-learning based cross-validation selector that seeks to minimize an appropriate cross-validated risk. One of the proposed risks directly measures the performance of the mean outcome under the optimal rule. The resulting selector is guaranteed to asymptotically perform as well as the best convex combination of candidate algorithms in terms of loss-based dissimilarity under conditions. We offer simulation results to support our theoretical findings. This work expands upon that of an earlier technical report (van der Laan, 2013) with new results and simulations, and is accompanied by a work which develops inference for the mean outcome under the optimal rule (van der Laan and Luedtke, 2014).
Alexander R. Luedtke et al.

Targeted Learning of the Mean Outcome Under an Optimal Dynamic Treatment Rule
http://biostats.bepress.com/ucbbiostat/paper325
Wed, 02 Jul 2014 13:54:48 PDT
We consider estimation of and inference for the mean outcome under the optimal dynamic two time-point treatment rule defined as the rule that maximizes the mean outcome under the dynamic treatment, where the candidate rules are restricted to depend only on a user-supplied subset of the baseline and intermediate covariates. This estimation problem is addressed in a statistical model for the data distribution that is nonparametric beyond possible knowledge about the treatment and censoring mechanism. This contrasts with the current literature, which relies on parametric assumptions. We establish that the mean of the counterfactual outcome under the optimal dynamic treatment is a pathwise differentiable parameter under conditions, and develop a targeted minimum loss-based estimator (TMLE) of this target parameter. We establish asymptotic linearity and statistical inference for this estimator under specified conditions. In a sequentially randomized trial the statistical inference relies upon a second order difference between the estimator of the optimal dynamic treatment and the optimal dynamic treatment to be asymptotically negligible, which may be a problematic condition when the rule is based on multivariate time-dependent covariates. To avoid this condition, we also develop targeted minimum loss-based estimators and statistical inference for data adaptive target parameters that are defined in terms of the mean outcome under the estimate of the optimal dynamic treatment. In particular, we develop a novel cross-validated TMLE approach that provides asymptotic inference under minimal conditions, avoiding the need for any empirical process conditions. We offer simulation results to support our theoretical findings. This work expands upon that of an earlier technical report (van der Laan, 2013; van der Laan and Luedtke, 2014) with new results and simulations, and is accompanied by a work which explores the estimation of the optimal rule (Luedtke and van der Laan, 2014).
Mark J. van der Laan et al.

Predicting the Future Subject's Outcome via an Optimal Stratification Procedure with Baseline Information
http://biostats.bepress.com/harvardbiostat/paper173
Tue, 01 Jul 2014 07:17:06 PDT
Florence H. Yong et al.

Deductive Derivation and Computerization of Compatible Semiparametric Efficient Estimation
http://biostats.bepress.com/ucbbiostat/paper324
Tue, 10 Jun 2014 13:35:23 PDT
Researchers often seek robust inference for a parameter through semiparametric estimation. Efficient semiparametric estimation currently requires theoretical derivation of the efficient influence function (EIF), which can be a challenging and time-consuming task. If this task can be computerized, it can save considerable human effort, which can be transferred, for example, to the design of new studies. Although the EIF is, in principle, a derivative, simple numerical differentiation to calculate the EIF by a computer masks the EIF's functional dependence on the parameter of interest. For this reason, the standard approach to obtaining the EIF has been the theoretical construction of the space of scores under all possible parametric submodels. This process currently depends on the correctness of conjectures about these spaces, and the correct verification of such conjectures. The correct guessing of such conjectures, though successful in some problems, is a nondeductive process, i.e., is not guaranteed to succeed (e.g., is not computerizable), and the verification of conjectures is generally susceptible to mistakes. We propose a method that can deductively produce semiparametric locally efficient estimators. The proposed method is computerizable, meaning that it requires neither conjecturing nor otherwise theoretically deriving the functional form of the EIF, and is guaranteed to produce the result. The method is demonstrated through an example.
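The remark that the EIF is, in principle, a derivative can be made concrete with a small sketch. It is not the deductive method proposed in the paper; it only shows plain numerical differentiation in the nonparametric model, where the influence function at a point x is the derivative of the functional along a mixture of the empirical distribution with a point mass at x. The variance functional is used as the example, and the function names and step size `eps` are illustrative choices.

```python
def variance_functional(points, weights):
    """Variance of a discrete distribution given its support points
    and their probabilities."""
    mean = sum(w * x for x, w in zip(points, weights))
    return sum(w * (x - mean) ** 2 for x, w in zip(points, weights))

def numerical_influence(functional, sample, x, eps=1e-6):
    """Finite-difference Gateaux derivative of `functional` at the empirical
    distribution of `sample`, in the direction of a point mass at x."""
    n = len(sample)
    base = functional(sample, [1.0 / n] * n)
    # Mixture (1 - eps) * P_n + eps * delta_x as a weighted discrete distribution.
    points = list(sample) + [x]
    weights = [(1.0 - eps) / n] * n + [eps]
    return (functional(points, weights) - base) / eps
```

For the variance the known influence function is (x - mean)^2 minus the variance, so the numerical value can be checked against it. The output is a number for each x rather than a formula, which illustrates the abstract's point that numerical differentiation alone hides the functional form.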
Constantine E. Frangakis et al.