Preprint submitted to Elsevier


The availability of the human genome sequence and progress in sequencing and bioinformatic technologies have enabled genome-wide investigation of somatic mutations in human cancers. This article briefly reviews challenges arising in the statistical analysis of mutational data of this kind. A first challenge is that of designing studies that efficiently allocate sequencing resources. We show that this can be addressed by two-stage designs, and demonstrate via simulations that even relatively small studies can produce lists of candidate cancer genes that are highly informative for future research efforts. A second challenge is to distinguish mutated genes that are selected for by cancer (drivers) from mutated genes that have no role in the development of cancer and simply happened to mutate (passengers). We suggest that this question is best approached as a classification problem and discuss some of the difficulties of more traditional testing-based approaches. A third challenge is to identify biologic processes affected by the driver genes. This can be achieved by gene set analyses, which can reliably identify functional groups and pathways that are enriched for mutated genes even when the individual genes in those pathways or sets are not mutated at sufficient frequencies to provide conclusive evidence that they are drivers.
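
As a rough illustration of the gene set idea, the sketch below computes an upper-tail hypergeometric p-value for the number of mutated genes falling in a set. This is one common, simplified form of enrichment analysis and is not necessarily the approach taken in this article: it assumes every gene is equally likely to be mutated and ignores gene length and background mutation rates. The function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_genes, n_mutated, set_size, set_mutated):
    """Upper-tail hypergeometric probability P(X >= set_mutated).

    n_genes     -- total number of genes studied
    n_mutated   -- number of those genes found mutated
    set_size    -- number of genes in the pathway or gene set
    set_mutated -- number of mutated genes that belong to the set

    Simplifying assumption (not from the source): all genes are
    exchangeable, i.e. equally likely to be mutated by chance.
    """
    upper = min(n_mutated, set_size)
    total = comb(n_genes, set_size)
    # Sum the probability of seeing k mutated genes in the set,
    # for every k at least as large as the observed count.
    tail = sum(
        comb(n_mutated, k) * comb(n_genes - n_mutated, set_size - k)
        for k in range(set_mutated, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes, 200 mutated, a 50-gene pathway
# containing 5 mutated genes (about 0.5 would be expected by chance).
p_value = hypergeom_enrichment_pvalue(20000, 200, 50, 5)
print(f"enrichment p-value: {p_value:.3g}")
```

In this toy setting, no single gene in the pathway needs to be mutated often; the evidence comes from the excess of mutated genes across the set as a whole.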


Bioinformatics | Computational Biology