Both somatic copy number alterations (CNAs) and germline copy number variants (CNVs) that are prevalent in healthy individuals can appear as recurrent changes in comparative genomic hybridization (CGH) analyses of tumors. To identify important cancer genes, CNAs must be distinguished from CNVs. Although the Database of Genomic Variants (Iafrate et al., 2004) contains a list of all known CNVs, there is no standard methodology for using the database effectively.

We develop a prediction model that distinguishes CNVs from CNAs based on the information contained in the Database and several other variables, including a potential CNV’s length, height, closeness to a telomere or centromere, and occurrence in other patients. The models are fitted on data from glioblastoma samples and their corresponding normal samples, which were collected as part of The Cancer Genome Atlas project and hybridized on Agilent 244K arrays. Using the Database alone, CNVs can be correctly identified with about 85% accuracy if the outliers are removed before segmentation and with 72% accuracy if the outliers are included. Additional variables improve the prediction by about 2-3% and 12%, respectively.
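The predictor variables listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the model family or how each variable is computed, so every name here (`Segment`, `overlap_fraction`, `segment_features`, `database_only_call`), the 0.5 overlap threshold, the definition of height as a mean log2 ratio, and the toy coordinates are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chrom: str
    start: int      # half-open interval [start, end)
    end: int
    height: float   # assumed here to be the segment's mean log2 ratio


def overlap_fraction(seg, known_cnvs):
    """Fraction of seg covered by known CNV intervals (chrom, start, end)."""
    clipped = sorted(
        (max(s, seg.start), min(e, seg.end))
        for c, s, e in known_cnvs
        if c == seg.chrom and s < seg.end and e > seg.start
    )
    covered, cursor = 0, seg.start
    for s, e in clipped:
        s = max(s, cursor)          # skip the part already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / (seg.end - seg.start)


def segment_features(seg, known_cnvs, chrom_length, centromere, other_patients):
    """Collect the variables named in the text for one candidate segment."""
    midpoint = (seg.start + seg.end) // 2
    recurrence = sum(
        any(o.chrom == seg.chrom and o.start < seg.end and o.end > seg.start
            for o in segs)
        for segs in other_patients   # one list of segments per other patient
    )
    return {
        "length": seg.end - seg.start,
        "height": seg.height,
        "db_overlap": overlap_fraction(seg, known_cnvs),
        "dist_telomere": min(seg.start, chrom_length - seg.end),
        "dist_centromere": abs(midpoint - centromere),
        "recurrence": recurrence,
    }


def database_only_call(feats, min_overlap=0.5):
    """Hypothetical Database-only rule: call a CNV when overlap is large."""
    return feats["db_overlap"] >= min_overlap


# Synthetic toy data, not taken from the study.
seg = Segment("chr1", 100, 200, 0.8)
known = [("chr1", 50, 150), ("chr1", 140, 180), ("chr2", 0, 1000)]
others = [[Segment("chr1", 150, 300, 0.5)], [Segment("chr1", 400, 500, -0.4)]]
feats = segment_features(seg, known, chrom_length=1000, centromere=500,
                         other_patients=others)
print(feats, database_only_call(feats))
```

A fitted classifier would take the full feature dictionary as input, whereas the Database-only rule uses the overlap alone; comparing the two corresponds to the comparison reported above.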


Bioinformatics | Computational Biology