Copy number variation (CNV) in genomic DNA is linked to a variety of human diseases (including cancer, HIV acquisition, autoimmune and neurodegenerative diseases), and array-based CGH (aCGH) is currently the main technology to locate CNVs. Several methods can analyze aCGH data at the single sample level, but disease-critical genes are more likely to be found in regions that are common or recurrent among samples. Unfortunately, defining recurrent CNV regions remains a challenge. Moreover, the heterogeneous nature of many diseases requires that we search for CNVs that affect only some subsets of the samples (without prior knowledge of which regions and subsets of samples are affected), but this is neglected by current methods.

We have developed two methods to define recurrent CNV regions. Our methods are unique and qualitatively different from existing approaches: they detect both regions over the complete set of arrays and alterations that are common only to some subsets of the samples and, thus, CNV alterations that might characterize previously unknown groups; they use probabilities of alteration as input (not discretized gain/loss calls, which discard uncertainty and variability) and return probabilities of being a shared common region, thus allowing researchers to modify thresholds as needed; the two parameters of the methods have an immediate, straightforward, biological interpretation. Using data from previous studies, we show that we can detect patterns that other methods miss and, by using probabilities, that researchers can modify, as needed, thresholds of immediate interpretability to answer specific research questions.

These methods are a qualitative advance in the location of recurrent CNV regions and will be instrumental in efforts to standardize definitions of recurrent CNVs and cluster samples with respect to patterns of CNV, and ultimately in the search for genomic regions harboring disease-critical genes.


Bioinformatics | Computational Biology