A survey of analysis software for array-comparative genomic hybridisation studies to detect copy number variation
© Henry Stewart Publications 2010
Received: 27 August 2010
Accepted: 27 August 2010
Published: 27 August 2010
Copy number variants (CNVs) are a major source of variation among individuals and populations. Array-based comparative genomic hybridisation (aCGH) is a powerful method used to detect and compare the copy numbers of DNA sequences at high resolution along the genome. In recent years, several informatics tools for accurate and efficient CNV detection and assessment have been developed. This paper briefly reviews most of the well-known algorithms and analysis software, and the limitations of that software.
Keywords: copy number variants, CNV, deletion, insertion, duplication, aCGH
Copy number variants (CNVs) are DNA sequences that are present in different amounts among individuals in a population. Copy number differences can confer a change in gene expression, phenotypic variation, disease susceptibility [1–5], and gene and genome evolution [6, 7]. Repetitive sequences that flank a specific genomic region can further facilitate a duplication or deletion of that region via the mechanism of non-allelic homologous recombination, which can occur when paralogous sequences in the genome mis-pair during meiosis [8–10]. A key method used to study CNVs across individuals is that of array-based comparative genomic hybridisation (aCGH). The goal of aCGH experiments is to detect and compare the copy numbers of DNA sequences at high resolution along the genome. Several informatics tools currently exist for accurate and efficient CNV detection and assessment. These tools assist in automated analysis of array CGH data and user-friendly copy number reporting for individual samples. The goal of the statistical algorithms used in these software programs is to call aberrations reliably, accurately and precisely.
The analysis of CNVs is broken down into several steps, including: (i) pre-processing and normalisation of the raw data; (ii) aligning data with its genome location, conducting segmentation analysis and providing statistical analysis to ensure the reliability of detection; and (iii) post-processing to assign biological meaning to the different states.
(i) Normalisation of the log2 ratios is typically conducted in an attempt to adjust for sources of systematic variation. Since these effects are often not known or measured, most aCGH methodologies incorporate global normalisation techniques, centring the data about the sample mean or median for a given hybridisation. Normalisation remains imperfect, and an accurate estimation of the copy number is unlikely. It is assumed, however, that changes in the observed, normalised log2 ratios correspond directly to changes in the true copy numbers.
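The global centring described above can be illustrated with a minimal sketch. It assumes the log2 ratios of one hybridisation are held in a plain Python list; the function name and the example values are hypothetical and are not taken from any of the surveyed tools.

```python
from statistics import median


def median_centre(log2_ratios):
    """Global normalisation: subtract the sample median from every log2 ratio."""
    centre = median(log2_ratios)
    return [r - centre for r in log2_ratios]


# Hypothetical log2 ratios for one hybridisation, shifted upwards by a
# systematic effect of about 0.30.
raw = [0.30, 0.25, 0.35, 1.30, 0.28, -0.70, 0.32]
normalised = median_centre(raw)
```

After centring, the bulk of the probes sit near zero, while the two outlying probes keep their offsets relative to that baseline.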
(ii) From a statistical perspective, segmentation has received most attention, and many different schemes have been proposed. The two main approaches are: (a) segmenting chromosome-arrayed genotypes into discrete regions, with probes in each region presenting different signal intensity patterns to adjacent regions; and (b) labelling particular segments that are inherently different in copy number from their expected value. Segmentation methods seek to identify the locations of log2 ratio mean change (ie change points or breakpoints) and to estimate the values of those means. All of these segmentation methods provide breakpoint locations but do not identify the associated genomic alterations as gains or losses. Because a primary objective of aCGH analysis is to identify regions of copy number gain and loss, follow-up methods have been proposed to derive these calls from segmentation results. Some of the studies have used a non-parametric estimate of the standard deviation to identify a global threshold for categorising segments [12, 13]. Another approach for identifying gains and losses based on segmentation results entails the combination of identified segments across chromosomes and a subsequent establishment of a no-change baseline.
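As an illustration of categorising segments with a global threshold, the sketch below uses the median absolute deviation (MAD) as a non-parametric estimate of the standard deviation. The choice of MAD, the multiplier k and all names and values are assumptions made for this example. The cited studies may have used a different estimator.

```python
from statistics import median


def mad_sd(values):
    """Robust SD estimate: 1.4826 times the median absolute deviation."""
    m = median(values)
    return 1.4826 * median([abs(v - m) for v in values])


def call_segments(segment_means, probe_residuals, k=2.0):
    """Label each segment mean against a global threshold of k robust SDs."""
    threshold = k * mad_sd(probe_residuals)
    calls = []
    for mean in segment_means:
        if mean > threshold:
            calls.append("gain")
        elif mean < -threshold:
            calls.append("loss")
        else:
            calls.append("no change")
    return calls


# Hypothetical probe-level residuals (probe value minus its segment mean)
# and hypothetical segment means from an earlier segmentation step.
residuals = [0.05, -0.04, 0.02, -0.06, 0.03, -0.01, 0.04, -0.03]
calls = call_segments([0.02, 0.58, -0.45, 0.08], residuals)
```

With these values the robust SD is about 0.05, so the threshold is about 0.10 and only the second and third segments are called.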
The challenge with segmentation methods is that, in order to find the optimal segmentation, all possible change-points need to be evaluated, creating a combinatorial explosion. For example, if there are ten copy number segments positioned across 2,000 probe intensities, then there are 2,000^10 places in which these segments may lie (roughly a 1 followed by 33 zeros, or one decillion). Given that a chromosome may have as many as a hundred change-points, and that today's whole-genome arrays contain over 50,000 intensity values for a given chromosome, the search space can easily exceed 50,000^100 possible change-points.
This seemingly impossible series of calculations is what has led researchers to adopt various heuristics to make this process computationally viable, as was done with circular binary segmentation (CBS), or by avoiding segmentation altogether.
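The size of the search space quoted above can be checked with exact integer arithmetic. The short sketch below is only a worked calculation, and the helper name is hypothetical. The binomial count is added as an assumption-labelled comparison: it counts the distinct ways of placing nine breakpoints between 2,000 ordered probes, which is smaller than 2,000^10 but still far too large to enumerate.

```python
import math


def search_space_digits(n_probes, n_segments):
    """Number of decimal digits in n_probes ** n_segments."""
    return len(str(n_probes ** n_segments))


small_case = 2000 ** 10                      # 1024 followed by 30 zeros
small_digits = search_space_digits(2000, 10)
large_digits = search_space_digits(50000, 100)

# Assumption for comparison: ten segments need nine breakpoints, chosen
# from the 1,999 gaps between consecutive probes.
distinct_placements = math.comb(1999, 9)
```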
(iii) Finally, a post-processing step is necessary to assign biological meaning to the different states.
Table 1. Software for array CGH analysis
- Lowess normalisation, CBS, ratio thresholding
- UCSC Genome Browser
- Cluster multiple samples into groups based on their aberration profiles
- Hidden Markov model (HMM) for detecting probe-based aberrations
- Circular binary segmentation (CBS)
- CBS-type algorithms, log-ratio thresholding etc
- Direct thresholding, moving average thresholding, K-means clustering, HMM and CBS
- Java, R, Perl
- CBS segmentation algorithm
- Log ratio for normalisation
- Objective Bayes HMM (OB-HMM)
- Thresholding, bootstrap-based method, analysis of copy errors (ACE), clustering along chromosomes (CLAC) with false discovery rate (FDR)
- CBS, HMM, BioHMM, comparative genomic hybridisation (CGH) segmentation, gain and loss analysis of DNA (GLAD), wavelet-based smoothing, Smith-Waterman algorithm and analysis of copy errors
- Clustering along chromosomes
- Fisher's exact test
- GO, KEGG pathway
- Adaptive weights smoothing
- CBS algorithm has been implemented in the 'DNAcopy' package
- Limma, GLAD, DNAcopy, tilingArray and aCGH
- Expectation maximisation (EM)
- Minimal common regions (MCRs)
Platforms: L, W, M
Hidden Markov model (HMM) and BioHMM
Fridlyand et al. proposed an unsupervised HMM for identifying copy number changes on chromosomes. Marioni et al. described a new segmentation scheme, BioHMM, which extends the HMM approach of Fridlyand et al. to take account of covariates, such as the distance between adjacent clones or clone quality, that are likely to affect the segmentation.
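The idea of decoding copy number states with an HMM can be sketched as follows. This is a simplified illustration and not the method of Fridlyand et al. or BioHMM: it assumes three states with fixed Gaussian means, a fixed standard deviation and a fixed probability of staying in the same state, whereas the published approaches estimate their parameters from the data and BioHMM lets transitions depend on clone distance. All names and values are hypothetical.

```python
import math


def viterbi_copy_states(log2_ratios, means=(-0.5, 0.0, 0.5), sd=0.2, stay=0.95):
    """Most likely state path (0 = loss, 1 = neutral, 2 = gain) via Viterbi."""
    n_states = len(means)
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n_states)]
                 for i in range(n_states)]

    def log_emit(x, mu):
        # Gaussian log-density; terms shared by all states are dropped.
        return -0.5 * ((x - mu) / sd) ** 2

    # Uniform prior over states for the first probe.
    score = [math.log(1.0 / n_states) + log_emit(log2_ratios[0], mu) for mu in means]
    back = []
    for x in log2_ratios[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit(x, means[j]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical normalised log2 ratios with a four-probe gain in the middle.
ratios = [0.02, -0.05, 0.03, 0.55, 0.48, 0.60, 0.52, 0.01, -0.04, 0.05]
states = viterbi_copy_states(ratios)
```

The high probability of staying in a state discourages spurious switches, which is what makes the decoded path piecewise constant along the chromosome.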
Olshen et al. introduced another sophisticated method for aCGH analysis, CBS. This is a modification of the change-point approach, allowing for ternary splits by connecting the two chromosomal ends. It splits the chromosomes into contiguous regions of equal copy number by modelling discrete copy number gains and losses. It then assesses the significance of the proposed splits by using a permutation reference distribution.
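A single level of this procedure can be sketched as follows. The sketch is a simplified illustration, not the DNAcopy implementation: it searches every arc of probes, compares the arc mean with the mean of the remaining probes, and judges the best split against a permutation reference distribution. The full method applies such splits recursively to each resulting region. The function names, the number of permutations and the example data are assumptions.

```python
import math
import random


def best_arc(values):
    """Return (statistic, i, j) for the arc values[i:j] that differs most from the rest."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave at least one probe outside
            arc_sum = prefix[j] - prefix[i]
            diff = arc_sum / k - (total - arc_sum) / (n - k)
            stat = abs(diff) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if stat > best[0]:
                best = (stat, i, j)
    return best


def cbs_split(values, n_perm=200, seed=0):
    """Best arc and its permutation p-value for one region."""
    rng = random.Random(seed)
    observed, i, j = best_arc(values)
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_arc(shuffled)[0] >= observed:
            exceed += 1
    return i, j, (exceed + 1) / (n_perm + 1)


# Hypothetical log2 ratios: ten neutral probes, six gained probes, ten neutral.
neutral = [0.02, -0.03, 0.01, -0.01, 0.03, -0.02, 0.00, 0.02, -0.01, 0.01]
gained = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53]
start, end, p_value = cbs_split(neutral + gained + neutral)
```

An arc that starts at the first probe or ends at the last probe yields a single breakpoint, while an interior arc yields two, which is how one search covers both kinds of split.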
Several studies have shown that CBS is the most efficient method [16–19]. They have also shown that optimal combination of the smoothing step and the segmentation step may result in improved performance.
Willenbrock and Fridlyand compared three publicly available methods for the analysis of aCGH data: DNAcopy (CBS), the gain and loss analysis of DNA Gaussian-based approach (GLAD) and the 'cluster along chromosomes' (CLAC) approach. They showed that segmentation by any of the three methods aids downstream analyses of aCGH data. They also noted that DNAcopy had the best operational characteristics in terms of its sensitivity and false discovery rate (FDR) for breakpoint detection, but it was not able to identify single clone aberrations. CBS has been implemented in DNAcopy. Applying CBS to the same simulated data set, the authors were able to achieve a 0.06 median FDR with 0.88 sensitivity. Although effective for finding segments, and despite speed optimisations by Venkatraman and Olshen, CBS is not computationally efficient for whole-genome analysis. For example, analysis of Affymetrix 500K data has been shown to take over 20 minutes per sample, and analysis of Illumina 550K data roughly 45 minutes per sample.
Bottom-up agglomerative approach
In contrast to the top-down strategy employed by CBS, Wang et al. introduced a bottom-up agglomerative approach, CLAC, which enjoys better computational efficiency. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters (genome regions with copy number gains/losses) by controlling the FDR at a certain level.
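The contrast between top-down splitting and bottom-up merging can be illustrated with a small sketch. It is not CLAC itself: it simply starts with one cluster per probe and repeatedly merges the pair of adjacent clusters whose means are closest, stopping at a fixed gap. CLAC instead builds a full tree and selects clusters by controlling the FDR. The function name, the stopping rule and the example values are assumptions.

```python
def merge_adjacent(values, max_gap=0.2):
    """Greedy bottom-up merging of adjacent probes into segments."""
    # Each cluster is [start, end, sum] and covers values[start:end].
    clusters = [[i, i + 1, v] for i, v in enumerate(values)]

    def mean(c):
        return c[2] / (c[1] - c[0])

    while len(clusters) > 1:
        gaps = [abs(mean(clusters[k]) - mean(clusters[k + 1]))
                for k in range(len(clusters) - 1)]
        k = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[k] > max_gap:
            break  # the closest neighbours are already too different
        left, right = clusters[k], clusters[k + 1]
        clusters[k:k + 2] = [[left[0], right[1], left[2] + right[2]]]
    return [(c[0], c[1], mean(c)) for c in clusters]


# Hypothetical log2 ratios along one chromosome arm.
segments = merge_adjacent([0.0, 0.02, -0.01, 0.5, 0.52, 0.49, 0.01, 0.0])
```

Each merge only compares neighbouring clusters, so the work per step grows with the number of clusters rather than with the number of possible breakpoint combinations.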
The copy number analysis model (CNAM) is a commercial tool that uses two types of segmentation: univariate (on a per-sample basis) and multivariate (on a multi-sample basis). The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions, possibly useful for association studies.
While most segmentation methods employ parametric models for array CGH data, some non-parametric approaches that are free of distribution assumptions have also shown success in calling gains and losses in array CGH data. Hsu et al. proposed to minimise noise from the array CGH data using wavelets before making inferences on the aberrations. Tibshirani et al. developed a spatial smoothing approach using fused lasso regression for calling gains and losses. The regression framework of fused lasso brings great computational efficiency and can be easily generalised to other analyses involving CGH data. Jong et al. used genetic local search algorithms. Willenbrock et al. used the adaptive weights smoothing method GLADmerge (a modified version of GLAD) to combine segments obtained from GLAD, first within and then across chromosomes, through hierarchical clustering in which clusters of segments are identified from the resultant dendrograms.
An ideal tool for the analysis of aCGH data should allow the user to choose among several of the algorithms. For the end-users, the web-based applications are the most suitable, since they do not require software installation and there are no concerns about the hardware. Some of the available tools are analysis of array-based comparative genomic hybridisation (ADaCGH) and in silico array-CGH (ISACGH) (see Table 1). Some of the tools are implemented in MATLAB or an executable file with a very simple interface which guides the user through the analysis (Table 1). A very helpful feature that exists in some of the tools is the ability to estimate the statistical significance of the detected copy number changes and then rank them accordingly. R (http://www.r-project.org) also has several packages for the analysis of aCGH data.
R is a powerful, yet flexible, statistical computing/programming environment. Its object-oriented programming scheme has made algorithm development easy and flexible and has attracted a huge developer community. R is platform independent, and works on all major computer operating systems. R has several packages for the analysis of aCGH data. These packages are freely available at the Comprehensive R Archive Network (CRAN) section of the website. They include a CBS method (DNAcopy) [16, 29], an unsupervised HMM approach, GLAD, cghMCR, the CLAC method using the hierarchical clustering algorithm, a penalised least-squares regression and the wavelet approach. BioHMM is an integral part of snapCGH, the R library for segmentation, normalisation and processing of aCGH data. This library lets the user apply other segmentation schemes using common input and output data objects. Additionally, snapCGH works seamlessly with limma objects and enables the use of pre-processing (and other) functions therein. RAN-aCGH is an R graphical user interface (GUI) for analysis and visualisation of aCGH data and includes several of the packages in R. There are also a number of web-based applications, such as ADaCGH and ISACGH, for viewing and comparing outputs from multiple algorithms (Table 1).
Discussion and conclusion
Most of the methods do well in detecting the existence and the width of aberrations for large changes and high signal-to-noise ratio. None of the algorithms, however, reliably detected aberrations with small width and low signal-to-noise ratio [35–38]. Several previous studies have compared the performance of these methods, as well as the segmentation schemes [17, 19, 39].
Lockwood et al. reviewed 16 different tools that were used in visualisation or analysis of aCGH data.
Lai et al. compared ten different methods and found that HMM performed poorly, with a high false-positive rate (~0.40-0.60) and low sensitivity (~50-80 per cent) with copy number segments. These authors showed that DNAcopy generally performed better than GLAD and HMM with regard to detection of copy number alterations. Their results also indicated that HMM performed best for small aberrations (given a sufficient signal-to-noise ratio), and that GLAD did better than HMM for wider aberrations. They showed that simple smoothing algorithms such as lowess and wavelets were the fastest, and that HMM and CBS were the slowest. They also noted that only CLAC and the array CGH expression integration tool (ACE) incorporate the FDR, and that some of the segmentation methods, such as CGHseg and CBS, consistently performed well.
Wang compared several different segmentation methods and found that CGHseg appeared to be overly sensitive to outlier measurements, and thus would be more suitable for detecting single gene copy number changes. Her results showed that CLAC was conservative in handling outliers with opposite signs in the same alteration region and therefore tended to break large alteration segments into small blocks. CBS provided clean solutions for segmentation but had difficulty detecting breakpoints whose alteration signals were weak.
The few early methods employed automatically to call gains and losses from aCGH data involved smoothing the log2 ratio vectors followed by applying certain thresholds [38, 42, 43]. A common drawback of these methods was not taking into account the biological covariates, such as the distance between adjacent clones or clone quality, which are likely to affect the segmentation (ie some regions of the genome being densely covered, while others have larger gaps between probes).
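The smoothing-then-thresholding strategy of these early methods can be sketched as follows. The moving median, the window size and the cut-offs of ±0.3 are assumptions chosen for illustration and are not taken from the cited studies.

```python
from statistics import median


def moving_median(values, window=3):
    """Smooth a log2 ratio vector with a centred moving median."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def threshold_calls(values, gain=0.3, loss=-0.3):
    """Label each smoothed value using fixed cut-offs."""
    return ["gain" if v > gain else "loss" if v < loss else "no change"
            for v in values]


# Hypothetical log2 ratios in genome order: one isolated spike (index 2)
# followed by a three-probe gain (indices 5 to 7).
ratios = [0.02, 0.05, 0.90, 0.01, -0.03, 0.45, 0.50, 0.48, 0.02]
smoothed = moving_median(ratios)
calls = threshold_calls(smoothed)
```

The window is defined in probe counts, so probes are treated as equally spaced regardless of their genomic positions, which is the drawback noted above. The isolated spike is also smoothed away, so an aberration confined to a single clone would be missed.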
aCGH analysis has come a long way, and the software packages have become more accurate and user friendly, but we are likely to see even more improvements in these software packages in the future.
This study was supported by NIH 5T15LM009451-03 for A.K.F., 5R01MH81203-2 and 2R01AA11853-11 for J.M.S. and L.D., P30CA46934 for T.P. and R01LM009254 and R01LM008111 for L.E.H.
1. Aitman TJ, Dong R, Vyse TJ, Norsworthy PJ, et al: Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature. 2006, 439: 851-855. 10.1038/nature04489.
2. Fanciulli M, Norsworthy PJ, Petretto E, Dong R, et al: FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nat Genet. 2007, 39: 721-723. 10.1038/ng2046.
3. Gonzalez E, Kulkarni H, Bolivar H, Mangano A, et al: The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science. 2005, 307: 1434-1440. 10.1126/science.1101160.
4. McCarroll ME, Shi Y, Harris S, Puli S, et al: Computational prediction and experimental evaluation of a photoinduced electron-transfer sensor. J Phys Chem B. 2006, 110: 22991-22994. 10.1021/jp065876s.
5. Stranger BE, Forrest MS, Dunning M, Ingle CE, et al: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007, 315: 848-853. 10.1126/science.1136678.
6. Dumas L, Kim YH, Karimpour-Fard A, Cox M, et al: Gene copy number variation spanning 60 million years of human and primate evolution. Genome Res. 2007, 17: 1266-1277. 10.1101/gr.6557307.
7. Fortna A, Kim Y, MacLaren E, Marshall K, et al: Lineage-specific gene duplication and loss in human and great ape evolution. PLoS Biol. 2004, 2: E207. 10.1371/journal.pbio.0020207.
8. Hurles M: How homologous recombination generates a mutable genome. Hum Genomics. 2005, 2: 179-186.
9. Sharp AJ, Cheng Z, Eichler EE: Structural variation of the human genome. Annu Rev Genomics Hum Genet. 2006, 7: 407-442. 10.1146/annurev.genom.7.080505.115618.
10. Shaw CJ, Lupski JR: Implications of human genome architecture for rearrangement-based disorders: The genomic basis of disease. Hum Mol Genet. 2004, 13 (Spec No 1): R57-R64.
11. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, et al: Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal. 2004, 90: 132-153. 10.1016/j.jmva.2004.02.008.
12. Paris PL, Andaya A, Fridlyand J, Jain AN, et al: Whole genome scanning identifies genotypes associated with recurrence and metastasis in prostate tumors. Hum Mol Genet. 2004, 13: 1303-1313. 10.1093/hmg/ddh155.
13. Rossi MR, Gaile D, Laduca J, Matsui S, et al: Identification of consistent novel submegabase deletions in low-grade oligodendrogliomas using array-based comparative genomic hybridization. Genes Chromosomes Cancer. 2005, 44: 85-96. 10.1002/gcc.20218.
14. CNAM: [http://www.helixtree.com/SNP_Variation/CNAM/index.html]
15. Marioni JC, Thorne NP, Tavare S: BioHMM: A heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006, 22: 1144-1146. 10.1093/bioinformatics/btl089.
16. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.
17. Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005, 21: 3763-3770. 10.1093/bioinformatics/bti611.
18. Wang P: Algorithms for Calling Gains and Losses in Array CGH Data. 2009, Humana Press, Hatfield, UK, 556.
19. Willenbrock H, Fridlyand J: A comparison study: Applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 4084-4091. 10.1093/bioinformatics/bti677.
20. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657-663. 10.1093/bioinformatics/btl646.
21. Wang P, Kim Y, Pollack J, Narasimhan B, et al: A method for calling gains and losses in array CGH data. Biostatistics. 2005, 6: 45-58. 10.1093/biostatistics/kxh017.
22. Hsu L, Self SG, Grove D, Randolph T, et al: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics. 2005, 6: 211-226. 10.1093/biostatistics/kxi004.
23. Tibshirani R, Wang P: Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics. 2008, 9: 18-29.
24. Jong K, Marchiori E, Meijer G, Vaart AV, et al: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics. 2004, 20: 3636-3637. 10.1093/bioinformatics/bth355.
25. Hupe P, Stransky N, Thiery JP, Radvanyi F, et al: Analysis of array CGH data: From signal ratio to gain and loss of DNA regions. Bioinformatics. 2004, 20: 3413-3422. 10.1093/bioinformatics/bth418.
26. Diaz-Uriarte R, Rueda OM: ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data. PLoS One. 2007, 2: e737. 10.1371/journal.pone.0000737.
27. Conde L, Montaner D, Burguet-Castell J, Tarraga J, et al: ISACGH: A web-based environment for the analysis of Array CGH and gene expression which includes functional profiling. Nucleic Acids Res. 2007, 35 (Web Server issue): W81-W85.
28. The Comprehensive R Archive Network. [http://www.bioconductor.org/CRAN/]
29. DNAcopy: DNA copy number data analysis. [http://bioconductorfhcrcorg/packages/19/bioc/html/DNAcopyhtml]
30. Aguirre AJ, Brennan C, Bailey G, Sinha R, Feng B, et al: High resolution characterization of the pancreatic adenocarcinoma genome. Proc Natl Acad Sci USA. 2004, 101 (24): 9067-9072. 10.1073/pnas.0402932101.
31. Huang T, Wu B, Lizardi P, Zhao H: Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics. 2005, 21: 3811-3817. 10.1093/bioinformatics/bti646.
32. Smith ML, Marioni JC, Hardcastle TJ, Thorne NP: snapCGH: Segmentation, normalization and processing of aCGH data. Bioconductor Users Guide. 2006.
33. Smyth GK, Michaud J, Scott HS: Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005, 21: 2067-2075. 10.1093/bioinformatics/bti270.
34. Kim S, Kim B-S: RAN-aCGH: R GUI tools for analysis and visualization of an array-CGH experiment. Genomics Informatics. 2007, 5: 137-139.
35. Ferreira BI, Alonso J, Carrillo J, Acquadro F, et al: Array CGH and gene-expression profiling reveals distinct genomic instability patterns associated with DNA repair and cell-cycle checkpoint pathways in Ewing's sarcoma. Oncogene. 2008, 27: 2084-2090. 10.1038/sj.onc.1210845.
36. Hodgson G, Hager JH, Volik S, Hariono S, et al: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nat Genet. 2001, 29: 459-464. 10.1038/ng771.
37. Lassmann S, Weis R, Makowiec F, Roth J, et al: Array CGH identifies distinct DNA copy number profiles of oncogenes and tumor suppressor genes in chromosomal- and microsatellite-unstable sporadic colorectal carcinomas. J Mol Med. 2007, 85: 293-304. 10.1007/s00109-006-0126-5.
38. Pollack JR, Sorlie T, Perou CM, Rees CA, et al: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA. 2002, 99: 12963-12968. 10.1073/pnas.162471999.
39. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet. 2006, 14: 139-148. 10.1038/sj.ejhg.5201531.
40. van Wieringen WN, Belien JA, Vosse SJ, Achame EM, et al: ACE-it: A tool for genome-wide integration of gene dosage and RNA expression data. Bioinformatics. 2006, 22: 1919-1920. 10.1093/bioinformatics/btl269.
41. Picard F, Robin S, Lavielle M, Vaisse C, et al: A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005, 6: 27. 10.1186/1471-2105-6-27.
42. Cheng C, Kimmel R, Neiman P, Zhao LP: Array rank order regression analysis for the detection of gene copy-number changes in human cancer. Genomics. 2003, 82: 122-129. 10.1016/S0888-7543(03)00122-8.
43. Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, et al: CGH-Explorer: A program for analysis of array-CGH data. Bioinformatics. 2005, 21: 821-822. 10.1093/bioinformatics/bti113.