  • Software review

A survey of analysis software for array-comparative genomic hybridisation studies to detect copy number variation

Abstract

Copy number variants (CNVs) constitute a major source of variation among individuals and populations. Array-based comparative genomic hybridisation (aCGH) is a powerful method used to detect and compare the copy numbers of DNA sequences at high resolution along the genome. In recent years, several informatics tools for accurate and efficient CNV detection and assessment have been developed. This paper briefly reviews most of the well-known algorithms, the analysis software that implements them and the limitations of that software.

Background

Copy number variants (CNVs) are DNA sequences that are present in different amounts among individuals in a population. Copy number differences can confer changes in gene expression, phenotypic variation and disease susceptibility,[1–5] and can contribute to gene and genome evolution [6, 7]. Repetitive sequences that flank a specific genomic region can further facilitate a duplication or deletion of that region via the mechanism of non-allelic homologous recombination, which can occur when paralogous sequences in the genome mis-pair during meiosis [8–10]. A key method used to study CNVs across individuals is array-based comparative genomic hybridisation (aCGH). The goal of aCGH experiments is to detect and compare the copy numbers of DNA sequences at high resolution along the genome. Several informatics tools currently exist for accurate and efficient CNV detection and assessment. These tools assist in automated analysis of aCGH data and user-friendly copy number reporting for individual samples. The goal of the statistical algorithms used in these software programs is to call aberrations reliably, accurately and precisely.

The analysis of CNVs is broken down into several steps, including: (i) pre-processing and normalisation of the raw data; (ii) aligning data with its genome location, conducting segmentation analysis and providing statistical analysis to ensure the reliability of detection; and (iii) post-processing to assign biological meaning to the different states.

(i) Normalisation of the log2 ratios is typically conducted in an attempt to adjust for sources of systematic variation. Since these effects are often not known or measured, most aCGH methodologies incorporate global normalisation techniques, centring the data about the sample mean or median for a given hybridisation [11]. Normalisation remains imperfect, and an accurate estimation of the copy number is unlikely. It is assumed, however, that changes in the observed, normalised log2 ratios correspond directly to changes in the true copy numbers.
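As a minimal sketch of the global normalisation described above, the following Python fragment computes per-probe log2 ratios and centres them about the sample median. The function names and the toy intensities are illustrative assumptions; they are not taken from any of the surveyed packages.

```python
import math
from statistics import median

def log2_ratios(test_intensities, reference_intensities):
    """Per-probe log2(test / reference); intensities must be positive."""
    return [math.log2(t / r) for t, r in zip(test_intensities, reference_intensities)]

def median_centre(ratios):
    """Global normalisation: subtract the sample median from every log2 ratio."""
    centre = median(ratios)
    return [x - centre for x in ratios]

# Toy hybridisation (hypothetical values): a uniform dye bias roughly doubles
# the test signal, and the third probe carries a genuine gain on top of it.
test = [220.0, 180.0, 400.0, 210.0, 190.0]
reference = [100.0, 100.0, 100.0, 100.0, 100.0]
normalised = median_centre(log2_ratios(test, reference))
```

After centring, the median log2 ratio is zero, so the systematic offset is removed while the relative difference of the third probe is preserved. The sketch does nothing about intensity-dependent or spatial effects.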

(ii) From a statistical perspective, segmentation has received the most attention, and many different schemes have been proposed. Two main approaches are: (a) segmenting chromosome-arrayed genotypes into discrete regions, such that the probes in each region present a signal intensity pattern different from that of adjacent regions; and (b) labelling particular segments that differ in copy number from their expected value. Segmentation methods seek to identify the locations of log2 ratio mean changes (ie change points or breakpoints) and to estimate the values of those means. These segmentation methods provide breakpoint locations but do not identify the associated genomic alterations as gains or losses. Because a primary objective of aCGH analysis is to identify regions of copy number gain and loss, follow-up methods have been proposed to derive these calls from segmentation results. Some studies have used a non-parametric estimate of the standard deviation to identify a global threshold for categorising segments [12, 13]. Another approach for identifying gains and losses based on segmentation results entails combining the identified segments across chromosomes and then establishing a no-change baseline.
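The idea of a global threshold derived from a non-parametric estimate of the standard deviation can be sketched as follows. This is an illustration only: the use of the median absolute deviation (MAD), the multiplier `k` and the half-open `(start, end)` segment format are assumptions made for the example, not the procedure of the cited studies.

```python
from statistics import median

def mad_sd(values):
    """Robust standard deviation estimate: 1.4826 * median absolute deviation."""
    centre = median(values)
    return 1.4826 * median([abs(v - centre) for v in values])

def classify_segments(ratios, segments, k=2.5):
    """Label each (start, end) segment as 'gain', 'loss' or 'neutral'.

    Assumes the ratios are already centred so that no change sits at zero.
    k is a hypothetical multiplier chosen for this sketch.
    """
    means = [sum(ratios[s:e]) / (e - s) for s, e in segments]
    # Residuals around the segment means estimate the probe-level noise.
    residuals = [ratios[i] - m for (s, e), m in zip(segments, means) for i in range(s, e)]
    threshold = k * mad_sd(residuals)
    labels = []
    for m in means:
        if m > threshold:
            labels.append("gain")
        elif m < -threshold:
            labels.append("loss")
        else:
            labels.append("neutral")
    return labels

ratios = [0.02, -0.03, 0.01, -0.01, 0.03,
          0.52, 0.48, 0.51, 0.49, 0.50,
          -0.60, -0.58, -0.62, -0.60, -0.61]
labels = classify_segments(ratios, [(0, 5), (5, 10), (10, 15)])
```

Because the noise estimate is pooled over all segments, a single threshold is applied genome-wide, which mirrors the "global threshold" wording above.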

The challenge with segmentation methods is that, in order to find the optimal segmentation, all possible change-points need to be evaluated, creating a combinatorial explosion. For example, if there are ten copy number segments positioned across 2,000 probe intensities, then there are $2{,}000^{10}$ places in which these segments may lie (roughly a 1 followed by 33 zeros, or one decillion). Given that a chromosome may have as many as a hundred change-points, and that today's whole-genome arrays contain over 50,000 intensity values for a given chromosome, the search space can easily exceed $50{,}000^{100}$ possible change-points [14].
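The size of these search spaces can be checked with exact integer arithmetic. The short Python fragment below evaluates 2,000 raised to the power 10 and 50,000 raised to the power 100. The final binomial count is an added assumption about what could be counted instead (distinct ways of placing nine breakpoints in the gaps between probes) and is not a figure from the source.

```python
import math

ten_segments = 2000 ** 10      # exact integer arithmetic
whole_genome = 50000 ** 100

print(f"2,000^10 has {len(str(ten_segments))} digits (about {ten_segments:.3e})")
print(f"50,000^100 has {len(str(whole_genome))} digits")

# Stricter count (an assumption for comparison): ways to choose 9 breakpoints
# among the 1,999 gaps between 2,000 consecutive probes.
distinct_segmentations = math.comb(1999, 9)
print(f"C(1999, 9) has {len(str(distinct_segmentations))} digits")
```

The first number has 34 digits, consistent with "a 1 followed by 33 zeros"; the second has 470 digits, which makes exhaustive evaluation infeasible under either way of counting.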

This seemingly impossible series of calculations has led researchers either to adopt heuristics that make the process computationally viable, as was done with circular binary segmentation (CBS), or to avoid segmentation altogether.

(iii) Finally, a post-processing step is necessary to assign biological meaning to the different states.

In this paper, a survey is presented of currently available analysis tools for aCGH data to detect copy number variation. Several of these tools provide user-friendly software, with visualisation tools and links to other databases (Table 1). Comparison of these methods is difficult because array platforms differ in probe type, probe size, resolution and noise level. A series of methods for performing this analysis is described below.

Table 1 Software for array CGH analysis

Hidden Markov model (HMM) and BioHMM

Fridlyand et al.[11] proposed an unsupervised HMM for identifying copy number changes on chromosomes. Marioni et al.[15] described a new segmentation scheme, BioHMM, which extends the HMM approach of Fridlyand et al.[11] to take account of covariates that are likely to affect the segmentation, such as the distance between adjacent clones or clone quality.
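To illustrate how an HMM assigns copy number states to probes, the sketch below runs Viterbi decoding over three hidden states (loss, neutral, gain) with Gaussian emissions. It is a deliberately simplified stand-in: the state means, the shared standard deviation and the probability of staying in a state are fixed, hypothetical values, whereas the unsupervised approach described above estimates its parameters from the data. It also ignores clone distance and quality, which BioHMM takes into account.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi_states(ratios, means=(-0.5, 0.0, 0.5), sd=0.2, stay=0.95):
    """Most likely state path (0 = loss, 1 = neutral, 2 = gain) for a non-empty list."""
    k = len(means)
    switch = (1.0 - stay) / (k - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(k)]
                 for i in range(k)]
    # Uniform initial distribution over the states.
    score = [math.log(1.0 / k) + gaussian_logpdf(ratios[0], m, sd) for m in means]
    back = []
    for x in ratios[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + gaussian_logpdf(x, means[j], sd))
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

path = viterbi_states([0.0] * 5 + [0.5] * 5 + [0.0] * 5)
```

The high self-transition probability is what discourages the path from switching state on a single noisy probe.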

CBS

Olshen et al.[16] introduced another sophisticated method for aCGH analysis, CBS. This is a modification of the change-point approach that allows for ternary splits by connecting the two chromosomal ends. It splits the chromosomes into contiguous regions of equal copy number by modelling discrete copy number gains and losses. It then assesses the significance of the proposed splits by using a permutation reference distribution.
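A single CBS-style step can be sketched as follows: search all arcs of the probe sequence for the one whose mean differs most from the mean of the remaining probes, then judge that maximum against a permutation reference distribution. This is a simplified illustration, assuming a plain standardised difference of means and a brute-force quadratic search; it is not the DNAcopy implementation, and the number of permutations and the seed are arbitrary choices.

```python
import math
import random

def best_arc(values):
    """Return (statistic, i, j) for the arc values[i:j] that differs most from the rest."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            m = j - i
            if m == n:
                continue  # the arc must leave some probes outside
            inside = prefix[j] - prefix[i]
            outside = total - inside
            z = abs(inside / m - outside / (n - m)) / math.sqrt(1.0 / m + 1.0 / (n - m))
            if z > best[0]:
                best = (z, i, j)
    return best

def permutation_p_value(values, n_perm=200, seed=0):
    """Share of shuffled sequences whose best arc is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = best_arc(values)[0]
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_arc(shuffled)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

values = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
stat, start, end = best_arc(values)
p_value = permutation_p_value(values)
```

If the split is judged significant, the same step would be applied recursively to each resulting piece; an arc with two interior endpoints yields three segments, which is the ternary split mentioned above.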

Several studies have shown that CBS is the most efficient method [16–19]. They have also shown that optimal combination of the smoothing step and the segmentation step may result in improved performance.

Willenbrock and Fridlyand [19] compared three publicly available methods for the analysis of aCGH data: DNAcopy (the implementation of CBS), the gain and loss analysis of DNA Gaussian-based approach (GLAD) and the 'cluster along chromosomes' (CLAC) approach. They showed that segmentation by any of the three methods aids downstream analyses of aCGH data. They also noted that DNAcopy had the best operational characteristics in terms of its sensitivity and false discovery rate (FDR) for breakpoint detection, but that it was not able to identify single clone aberrations. Applying CBS to the same simulated data set, the authors were able to achieve a 0.06 median FDR with 0.88 sensitivity. Although CBS is effective for finding segments, and despite speed optimisations by Venkatraman and Olshen,[20] it is not computationally efficient for whole-genome analysis. For example, analysis of Affymetrix 500K data has been shown to take over 20 minutes per sample, and analysis of Illumina 550K data roughly 45 minutes per sample [14].
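Sensitivity and FDR for breakpoint detection can be computed on simulated data once a matching rule is fixed. The sketch below assumes that a called breakpoint matches a true one when the two lie within `tolerance` probes of each other; that rule and the example positions are illustrative assumptions, not the evaluation protocol of the cited comparison.

```python
def breakpoint_metrics(true_bps, called_bps, tolerance=1):
    """Return (sensitivity, FDR) for called breakpoints against known ones."""
    # A true breakpoint is detected if any call lies within the tolerance.
    detected = sum(1 for t in true_bps
                   if any(abs(t - c) <= tolerance for c in called_bps))
    # A call is a false discovery if no true breakpoint lies within the tolerance.
    false_calls = sum(1 for c in called_bps
                      if not any(abs(t - c) <= tolerance for t in true_bps))
    sensitivity = detected / len(true_bps) if true_bps else 0.0
    fdr = false_calls / len(called_bps) if called_bps else 0.0
    return sensitivity, fdr

sensitivity, fdr = breakpoint_metrics([10, 20, 30, 40], [10, 21, 55])
```

In this toy case two of the four true breakpoints are found and one of the three calls is spurious, giving a sensitivity of 0.5 and an FDR of one third.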

Bottom-up agglomerative approach

In contrast to the top-down strategy employed by CBS, Wang et al. introduced a bottom-up agglomerative approach, CLAC,[21] which enjoys better computational efficiency. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters (genome regions with copy number gains/losses) by controlling the FDR at a certain level.
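The bottom-up idea can be illustrated with a much-reduced sketch: start with one cluster per probe and repeatedly merge the pair of neighbouring clusters whose means are closest, stopping when the smallest gap exceeds a cut-off. This is not CLAC itself, which builds a full tree and selects clusters by controlling the FDR; the fixed `max_gap` cut-off is a hypothetical simplification.

```python
def merge_adjacent(values, max_gap=0.25):
    """Agglomerate neighbouring probes into clusters of similar mean log2 ratio."""
    clusters = [[v] for v in values]

    def mean(cluster):
        return sum(cluster) / len(cluster)

    while len(clusters) > 1:
        # Only clusters that are adjacent along the chromosome may merge.
        gaps = [abs(mean(clusters[i]) - mean(clusters[i + 1]))
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

clusters = merge_adjacent([0.0, 0.1, -0.1, 0.9, 1.0, 1.1, 0.0, 0.05])
```

Restricting merges to neighbours keeps every cluster a contiguous genomic region, which is the property that distinguishes this kind of clustering from ordinary hierarchical clustering.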

Multivariate method

The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions, possibly useful for association studies. The copy number analysis model (CNAM) is a commercial tool that uses two types of segmentation: univariate (on a per-sample basis) and multivariate (on a multi-sample basis) [14].

Other methods

While most segmentation methods employ parametric models for array CGH data, some non-parametric approaches that are free of distribution assumptions have also shown success in calling gains and losses in array CGH data. Hsu et al.[22] proposed to minimise noise from the array CGH data using wavelets before making inferences on the aberrations. Tibshirani et al.[23] developed a spatial smoothing approach using fused lasso regression for calling gains and losses. The regression framework of fused lasso brings great computational efficiency and can be easily generalised to other analyses involving CGH data. Jong et al.[24] used genetic local search algorithms and Willenbrock et al.[19] used the adaptive weights smoothing method, GLADmerge (a modified version of GLAD [25]), for combining segments obtained from GLAD, first within and then across chromosomes through hierarchical clustering in which clusters of segments are identified from the resultant dendrograms.
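For the fused lasso, the quantity being minimised can be written down directly. The sketch below evaluates a penalised criterion of the fused lasso type for a candidate fit `beta`: a residual sum of squares, a penalty on the size of each coefficient (favouring a zero baseline) and a penalty on differences between neighbouring coefficients (favouring piecewise-constant fits). The exact scaling and the penalty weights `lam1` and `lam2` are assumptions for illustration, and no solver is included.

```python
def fused_lasso_objective(y, beta, lam1, lam2):
    """Residual sum of squares plus sparsity and fusion penalties."""
    rss = sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    sparsity = sum(abs(b) for b in beta)
    fusion = sum(abs(beta[i] - beta[i - 1]) for i in range(1, len(beta)))
    return rss + lam1 * sparsity + lam2 * fusion

y = [0.0, 0.0, 1.0, 1.0]
step_fit = fused_lasso_objective(y, [0.0, 0.0, 1.0, 1.0], lam1=0.1, lam2=0.5)
flat_fit = fused_lasso_objective(y, [0.0, 0.0, 0.0, 0.0], lam1=0.1, lam2=0.5)
```

With these weights the step-shaped fit scores lower than the flat one, so the single jump is worth its fusion penalty; a larger `lam2` would eventually reverse that preference.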

An ideal tool for the analysis of aCGH data should allow the user to choose among several of the algorithms. For end-users, web-based applications are the most suitable, since they do not require software installation and raise no concerns about hardware. Available tools of this kind include analysis of array-based comparative genomic hybridisation (ADaCGH)[26] and in silico array-CGH (ISACGH)[27] (see Table 1). Other tools are implemented in MATLAB or as an executable file with a very simple interface that guides the user through the analysis (Table 1). A very helpful feature of some of the tools is the ability to estimate the statistical significance of the detected copy number changes and then rank them accordingly. R (http://www.r-project.org) also has several packages for the analysis of aCGH data.

R packages

R is a powerful, yet flexible, statistical computing/programming environment. Its object-oriented programming scheme has made algorithm development easy and flexible and has attracted a huge developer community. R is platform independent, and works on all major computer operating systems. R has several packages for the analysis of aCGH data. These packages are freely available at the Comprehensive R Archive Network (CRAN) section of the website [28]. They include a CBS method (DNAcopy),[16, 29] an unsupervised HMM approach,[11] GLAD,[25] cghMCR,[30] the CLAC method, which uses a hierarchical clustering algorithm,[21] a penalised least-squares regression [31] and the wavelet approach [22]. BioHMM [15] is part of the snapCGH (segmentation, normalisation and processing of aCGH data) R library [32]. This library lets the user apply other segmentation schemes using common input and output data objects. Additionally, snapCGH works seamlessly with limma objects [33] and enables the use of pre-processing (and other) functions therein. RAN-aCGH is an R graphical user interface (GUI) for analysis and visualisation of aCGH data and includes several of the packages in R [34]. There are also a number of web-based applications, such as ADaCGH [26] and ISACGH,[27] for viewing and comparing outputs from multiple algorithms (Table 1).

Discussion and conclusion

Most of the methods do well in detecting the existence and the width of aberrations for large changes and high signal-to-noise ratio. None of the algorithms, however, reliably detected aberrations with small width and low signal-to-noise ratio [35–38]. Several previous studies have compared the performance of these methods, as well as the segmentation schemes [17, 19, 39].

Lockwood et al.[39] reviewed 16 different tools that were used in visualisation or analysis of aCGH data.

Lai et al.[17] compared ten different methods and found that HMM [11] performed poorly, with a high false-positive rate (~0.40-0.60) and low sensitivity (~50-80 per cent) with copy number segments [17]. These authors showed that DNAcopy [29] generally performed better than GLAD [25] and HMM with regard to detection of copy number alterations. Their results also indicated that HMM performed best for small aberrations (given a sufficient signal-to-noise ratio), and that GLAD did better than HMM for wider aberrations [17]. They showed that simple smoothing algorithms such as lowess and wavelets were the fastest, and that HMM and CBS [16] were the slowest. They also noted that only CLAC [21] and the array CGH expression integration tool (ACE)[40] incorporate the FDR, and that some of the segmentation methods, such as CGHseg [41] and CBS, consistently performed well [17].

Wang compared several different segmentation methods and found that CGHseg appeared to be overly sensitive to outlier measurements, and thus would be more suitable for detecting single gene copy number changes [18]. Her results showed that CLAC was conservative in handling outliers with opposite signs in the same alteration region and therefore tended to break large alteration segments into small blocks. CBS provided clean solutions for segmentation but was limited in its ability to detect breakpoints whose alteration signals were weak [18].

The few early methods used to call gains and losses automatically from aCGH data involved smoothing the log2 ratio vectors and then applying certain thresholds [38, 42, 43]. A common drawback of these methods was that they did not take into account biological covariates, such as the distance between adjacent clones or clone quality, which are likely to affect the segmentation (ie some regions of the genome are densely covered, while others have larger gaps between probes).
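The smooth-then-threshold strategy of these early methods can be sketched in a few lines. The example assumes a moving median as the smoother and a fixed symmetric cut-off; the window size and cut-off are hypothetical values, and the cited studies may have used different smoothers and thresholds.

```python
from statistics import median

def moving_median(values, window=5):
    """Smooth with a centred moving median; the window shrinks at the ends."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def call_by_threshold(values, window=5, cutoff=0.3):
    """Label each probe 'gain', 'loss' or 'neutral' from the smoothed log2 ratio."""
    calls = []
    for smoothed in moving_median(values, window):
        if smoothed > cutoff:
            calls.append("gain")
        elif smoothed < -cutoff:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Six neutral probes (one of them an outlier) followed by six gained probes.
values = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
calls = call_by_threshold(values)
```

The smoother suppresses the isolated outlier, but every probe is treated as equally spaced and equally reliable, which is exactly the drawback noted above.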

aCGH analysis has come a long way, and the software packages have become more accurate and user friendly, but we are likely to see even more improvements in these software packages in the future.

References

  1. Aitman TJ, Dong R, Vyse TJ, Norsworthy PJ, et al: Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature. 2006, 439: 851-855. 10.1038/nature04489.

  2. Fanciulli M, Norsworthy PJ, Petretto E, Dong R, et al: FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nat Genet. 2007, 39: 721-723. 10.1038/ng2046.

  3. Gonzalez E, Kulkarni H, Bolivar H, Mangano A, et al: The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science. 2005, 307: 1434-1440. 10.1126/science.1101160.

  4. McCarroll ME, Shi Y, Harris S, Puli S, et al: Computational prediction and experimental evaluation of a photoinduced electron-transfer sensor. J Phys Chem B. 2006, 110: 22991-22994. 10.1021/jp065876s.

  5. Stranger BE, Forrest MS, Dunning M, Ingle CE, et al: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007, 315: 848-853. 10.1126/science.1136678.

  6. Dumas L, Kim YH, Karimpour-Fard A, Cox M, et al: Gene copy number variation spanning 60 million years of human and primate evolution. Genome Res. 2007, 17: 1266-1277. 10.1101/gr.6557307.

  7. Fortna A, Kim Y, MacLaren E, Marshall K, et al: Lineage-specific gene duplication and loss in human and great ape evolution. PLoS Biol. 2004, 2: E207. 10.1371/journal.pbio.0020207.

  8. Hurles M: How homologous recombination generates a mutable genome. Hum Genomics. 2005, 2: 179-186.

  9. Sharp AJ, Cheng Z, Eichler EE: Structural variation of the human genome. Annu Rev Genomics Hum Genet. 2006, 7: 407-442. 10.1146/annurev.genom.7.080505.115618.

  10. Shaw CJ, Lupski JR: Implications of human genome architecture for rearrangement-based disorders: The genomic basis of disease. Hum Mol Genet. 2004, 13 (Spec No 1): R57-R64.

  11. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, et al: Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal. 2004, 90: 132-153. 10.1016/j.jmva.2004.02.008.

  12. Paris PL, Andaya A, Fridlyand J, Jain AN, et al: Whole genome scanning identifies genotypes associated with recurrence and metastasis in prostate tumors. Hum Mol Genet. 2004, 13: 1303-1313. 10.1093/hmg/ddh155.

  13. Rossi MR, Gaile D, Laduca J, Matsui S, et al: Identification of consistent novel submegabase deletions in low-grade oligodendrogliomas using array-based comparative genomic hybridization. Genes Chromosomes Cancer. 2005, 44: 85-96. 10.1002/gcc.20218.

  14. CNAM: [http://www.helixtree.com/SNP_Variation/CNAM/index.html]

  15. Marioni JC, Thorne NP, Tavare S: BioHMM: A heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics. 2006, 22: 1144-1146. 10.1093/bioinformatics/btl089.

  16. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.

  17. Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005, 21: 3763-3770. 10.1093/bioinformatics/bti611.

  18. Wang P: Algorithms for Calling Gains and Losses in Array CGH Data. 2009, Humana Press, Hatfield, UK, 556:

  19. Willenbrock H, Fridlyand J: A comparison study: Applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 4084-4091. 10.1093/bioinformatics/bti677.

  20. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657-663. 10.1093/bioinformatics/btl646.

  21. Wang P, Kim Y, Pollack J, Narasimhan B, et al: A method for calling gains and losses in array CGH data. Biostatistics. 2005, 6: 45-58. 10.1093/biostatistics/kxh017.

  22. Hsu L, Self SG, Grove D, Randolph T, et al: Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics. 2005, 6: 211-226. 10.1093/biostatistics/kxi004.

  23. Tibshirani R, Wang P: Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics. 2008, 9: 18-29.

  24. Jong K, Marchiori E, Meijer G, Vaart AV, et al: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics. 2004, 20: 3636-3637. 10.1093/bioinformatics/bth355.

  25. Hupe P, Stransky N, Thiery JP, Radvanyi F, et al: Analysis of array CGH data: From signal ratio to gain and loss of DNA regions. Bioinformatics. 2004, 20: 3413-3422. 10.1093/bioinformatics/bth418.

  26. Diaz-Uriarte R, Rueda OM: ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data. PLoS One. 2007, 2: e737. 10.1371/journal.pone.0000737.

  27. Conde L, Montaner D, Burguet-Castell J, Tarraga J, et al: ISACGH: A web-based environment for the analysis of Array CGH and gene expression which includes functional profiling. Nucleic Acids Res. 2007, 35 (Web Server issue): W81-W85.

  28. The Comprehensive R Archive Network. [http://www.bioconductor.org/CRAN/]

  29. DNAcopy: DNA copy number data analysis. [http://bioconductorfhcrcorg/packages/19/bioc/html/DNAcopyhtml]

  30. Aguirre AJ, Brennan C, Bailey G, Sinha R, Feng B, et al: High resolution characterization of the pancreatic adenocarcinoma genome. Proc Natl Acad Sci USA. 2004, 101 (24): 9067-9072. 10.1073/pnas.0402932101.

  31. Huang T, Wu B, Lizardi P, Zhao H: Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics. 2005, 21: 3811-3817. 10.1093/bioinformatics/bti646.

  32. Smith ML, Marioni JC, Hardcastle TJ, Thorne NP: snapCGH: Segmentation, normalization and processing of aCGH data. Bioconductor Users Guide. 2006

  33. Smyth GK, Michaud J, Scott HS: Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005, 21: 2067-2075. 10.1093/bioinformatics/bti270.

  34. Kim S, Kim B-S: RAN-aCGH: R GUI tools for analysis and visualization of an array-CGH experiment. Genomics Informatics. 2007, 5: 137-139.

  35. Ferreira BI, Alonso J, Carrillo J, Acquadro F, et al: Array CGH and gene-expression profiling reveals distinct genomic instability patterns associated with DNA repair and cell-cycle checkpoint pathways in Ewing's sarcoma. Oncogene. 2008, 27: 2084-2090. 10.1038/sj.onc.1210845.

  36. Hodgson G, Hager JH, Volik S, Hariono S, et al: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nat Genet. 2001, 29: 459-464. 10.1038/ng771.

  37. Lassmann S, Weis R, Makowiec F, Roth J, et al: Array CGH identifies distinct DNA copy number profiles of oncogenes and tumor suppressor genes in chromosomal- and microsatellite-unstable sporadic colorectal carcinomas. J Mol Med. 2007, 85: 293-304. 10.1007/s00109-006-0126-5.

  38. Pollack JR, Sorlie T, Perou CM, Rees CA, et al: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA. 2002, 99: 12963-12968. 10.1073/pnas.162471999.

  39. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet. 2006, 14: 139-148. 10.1038/sj.ejhg.5201531.

  40. van Wieringen WN, Belien JA, Vosse SJ, Achame EM, et al: ACE-it: A tool for genome-wide integration of gene dosage and RNA expression data. Bioinformatics. 2006, 22: 1919-1920. 10.1093/bioinformatics/btl269.

  41. Picard F, Robin S, Lavielle M, Vaisse C, et al: A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005, 6: 27. 10.1186/1471-2105-6-27.

  42. Cheng C, Kimmel R, Neiman P, Zhao LP: Array rank order regression analysis for the detection of gene copy-number changes in human cancer. Genomics. 2003, 82: 122-129. 10.1016/S0888-7543(03)00122-8.

  43. Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, et al: CGH-Explorer: A program for analysis of array-CGH data. Bioinformatics. 2005, 21: 821-822. 10.1093/bioinformatics/bti113.

Acknowledgements

This study was supported by NIH 5T15LM009451-03 for A.K.F., 5R01MH81203-2 and 2R01AA11853-11 for J.M.S. and L.D., P30CA46934 for T.P. and R01LM009254 and R01LM008111 for L.E.H.

Author information

Corresponding author

Correspondence to Anis Karimpour-Fard.

About this article

Cite this article

Karimpour-Fard, A., Dumas, L., Phang, T. et al. A survey of analysis software for array-comparative genomic hybridisation studies to detect copy number variation. Hum Genomics 4, 421 (2010). https://doi.org/10.1186/1479-7364-4-6-421

  • DOI: https://doi.org/10.1186/1479-7364-4-6-421
