A survey of analysis software for array-comparative genomic hybridisation studies to detect copy number variation

Copy number variants (CNVs) constitute a major source of variation among individuals and populations. Array-based comparative genomic hybridisation (aCGH) is a powerful method used to detect and compare the copy numbers of DNA sequences at high resolution along the genome. In recent years, several informatics tools for accurate and efficient CNV detection and assessment have been developed. This paper briefly reviews most of the well-known algorithms and analysis software, together with the limitations of that software.


Background
Copy number variants (CNVs) are DNA sequences that are present in different amounts among individuals in a population. Copy number differences can confer a change in gene expression, phenotypic variation, disease susceptibility, 1-5 and gene and genome evolution. 6,7 Repetitive sequences that flank a specific genomic region can further facilitate a duplication or deletion of that region via the mechanism of non-allelic homologous recombination, which can occur when paralogous sequences in the genome mis-pair during meiosis. 8-10 A key method used to study CNVs across individuals is array-based comparative genomic hybridisation (aCGH). The goal of aCGH experiments is to detect and compare the copy numbers of DNA sequences at high resolution along the genome. Several informatics tools currently exist for accurate and efficient CNV detection and assessment.
These tools assist in automated analysis of array CGH data and user-friendly copy number reporting for individual samples. The goal of the statistical algorithms used in these software programs is to call aberrations reliably, accurately and precisely.
The analysis of CNVs is broken down into several steps, including: (i) pre-processing and normalisation of the raw data; (ii) aligning data with its genome location, conducting segmentation analysis and providing statistical analysis to ensure the reliability of detection; and (iii) post-processing to assign biological meaning to the different states.
(i) Normalisation of the log2 ratios is typically conducted in an attempt to adjust for sources of systematic variation. Since these effects are often not known or measured, most aCGH methodologies incorporate global normalisation techniques, centring the data about the sample mean or median for a given hybridisation. 11 Normalisation remains imperfect, so the absolute copy number is unlikely to be estimated accurately. It is assumed, however, that changes in the observed, normalised log2 ratios correspond directly to changes in the true copy numbers.
(ii) From a statistical perspective, segmentation has received most attention, and many different schemes have been proposed. Two main approaches are: a) segmenting chromosome-arrayed genotypes into discrete regions, with probes in each region presenting different signal intensity patterns to adjacent regions; and b) labelling particular segments that are inherently different in copy number from their expected value. Segmentation methods seek to identify the locations of log2 ratio mean change (ie change points or breakpoints) and to estimate the values of those means. All of these segmentation methods provide breakpoint locations but do not identify the associated genomic alterations as gains or losses. Because a primary objective of aCGH analysis is to identify regions of copy number gain and loss, follow-up methods have been proposed to derive these calls from segmentation results. Some studies have used a non-parametric estimate of the standard deviation to identify a global threshold for categorising segments. 12,13 Another approach for identifying gains and losses based on segmentation results entails the combination of identified segments across chromosomes and the subsequent establishment of a no-change baseline.
The challenge with segmentation methods is that, in order to find the optimal segmentation, all possible change-points need to be evaluated, creating a combinatorial explosion. For example, if there are ten copy number segments positioned across 2,000 probe intensities, then there are 2000^10 places in which these segments may lie (roughly a 1 followed by 33 zeros, or one decillion). Given that a chromosome may have as many as a hundred change-points, and that today's whole-genome arrays contain over 50,000 intensity values for a given chromosome, the search space can easily exceed 50,000^100 possible change-points. 14 This seemingly impossible series of calculations is what has led researchers to adopt various heuristics to make this process computationally viable, as was done with circular binary segmentation (CBS), or to avoid segmentation altogether.
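The magnitudes quoted above can be checked directly with arbitrary-precision integers. The short sketch below only reproduces the arithmetic in the text; the variable names are illustrative and are not taken from any of the cited tools.

```python
# Upper bound quoted in the text: ten segments placed among 2,000 probes.
small_case = 2000 ** 10

# Whole-genome scale: a hundred change-points among 50,000 intensity values.
large_case = 50_000 ** 100

print(f"{small_case:.3e}")   # 1.024e+33, ie roughly a 1 followed by 33 zeros
print(len(str(large_case)))  # 470 decimal digits
```

Even the smaller figure is far beyond exhaustive enumeration, which is why heuristic search strategies are used in practice.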
(iii) Finally, a post-processing step is necessary to assign biological meaning to the different states.
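As an illustration of how steps (i) to (iii) fit together, the following minimal sketch median-centres a vector of log2 ratios and then labels pre-computed segments using a global threshold derived from a non-parametric estimate of the standard deviation. It is a sketch under stated assumptions rather than the procedure of any cited study: the choice of the median absolute deviation (MAD) as the non-parametric estimator, the multiplier `k = 2.0`, the function names and the toy data are all assumptions, and the segment boundaries are taken as given from whichever segmentation method is used.

```python
from statistics import median

def median_centre(log2_ratios):
    """Global normalisation: subtract the sample median from every log2 ratio."""
    m = median(log2_ratios)
    return [x - m for x in log2_ratios]

def mad_sd(values):
    """Non-parametric SD estimate (assumed here): 1.4826 * median absolute deviation."""
    m = median(values)
    return 1.4826 * median(abs(v - m) for v in values)

def call_segments(log2_ratios, segments, k=2.0):
    """Label each half-open (start, end) segment as gain, loss or no change.

    The cutoff k * SD is a hypothetical global threshold, not a published value.
    """
    centred = median_centre(log2_ratios)
    threshold = k * mad_sd(centred)
    calls = []
    for start, end in segments:
        seg_mean = sum(centred[start:end]) / (end - start)
        if seg_mean > threshold:
            label = "gain"
        elif seg_mean < -threshold:
            label = "loss"
        else:
            label = "no change"
        calls.append((start, end, label))
    return calls

# Toy data: a gained region (probes 6-9) and a lost region (probes 15-17).
ratios = [0.02, -0.03, 0.01, 0.00, -0.02, 0.03, 0.60, 0.55, 0.62, 0.58,
          -0.01, 0.02, 0.00, -0.02, 0.01, -0.70, -0.65, -0.72, 0.01, -0.01]
segments = [(0, 6), (6, 10), (10, 15), (15, 18), (18, 20)]
calls = call_segments(ratios, segments)
```

The MAD is used here because it is barely affected by the aberrant probes themselves, so the threshold reflects the noise of the unchanged majority of the data.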
In this paper, a survey is presented of currently available analysis tools for aCGH data to detect copy number variation. Several of these tools provide user-friendly software, with visualisation tools and links to other databases (Table 1). Comparison of these methods is difficult because array platforms differ in probe type and size, resolution and noise level. A series of methods for performing this analysis are described below.

Hidden Markov model (HMM) and BioHMM
Fridlyand et al. 11 proposed an unsupervised HMM for identifying copy number changes on chromosomes. Marioni et al. 15 described a new segmentation scheme, BioHMM, which extends the HMM approach of Fridlyand et al. 11 to take account of covariates that are likely to affect the segmentation, such as the distance between adjacent clones or clone quality.
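To make the HMM idea concrete, the sketch below decodes the most likely sequence of hidden copy number states for a vector of log2 ratios using the Viterbi algorithm. It is not the method of Fridlyand et al. or BioHMM: those approaches estimate the model from the data, whereas here the three states, their Gaussian means, the common standard deviation and the probability of staying in a state are all fixed, assumed values chosen for illustration.

```python
import math

STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -0.6, "neutral": 0.0, "gain": 0.6}  # assumed emission means
SD = 0.2     # assumed common emission standard deviation
STAY = 0.95  # assumed probability of remaining in the same state

def log_gauss(x, mu, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def log_trans(a, b):
    """Log transition probability; leaving a state is shared equally among the others."""
    return math.log(STAY if a == b else (1 - STAY) / (len(STATES) - 1))

def viterbi(log2_ratios):
    """Return the most probable state path under the fixed model above."""
    # score[s] holds the best log-probability of any path ending in state s.
    score = {s: math.log(1 / len(STATES)) + log_gauss(log2_ratios[0], MEANS[s], SD)
             for s in STATES}
    back = []
    for x in log2_ratios[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log_trans(p, s))
            pointers[s] = best
            score[s] = prev[best] + log_trans(best, s) + log_gauss(x, MEANS[s], SD)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi([0.05, -0.1, 0.0, 0.55, 0.7, 0.6, 0.65, 0.1, -0.05, 0.0])
```

Because switching states is penalised, an isolated outlying probe is absorbed into its neighbours' state, whereas a run of elevated probes is decoded as a gain. A BioHMM-style extension would let the transition probabilities depend on covariates such as the distance between clones; that is not attempted here.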

CBS
Olshen et al. 16 introduced another sophisticated method for aCGH analysis, CBS. This is a modification of the change-point approach, allowing for ternary splits by connecting the two chromosomal ends. It splits the chromosomes into contiguous regions of equal copy number by modelling discrete copy number gains and losses. It then assesses the significance of the proposed splits by using a permutation reference distribution.
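The core idea can be sketched as a single circular split followed by a permutation test. The code below is a simplified illustration and not the DNAcopy implementation: it searches all arcs by brute force, tests only one split rather than recursing into the resulting segments, and the function names, the number of permutations and the toy data are assumptions.

```python
import math
import random

def best_circular_split(x):
    """Return (statistic, i, j) for the arc x[i:j] that differs most from the rest.

    Joining the two ends of the sequence means the complement of x[i:j] is also
    an arc, so an interior arc corresponds to a ternary split of the chromosome.
    """
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the whole sequence is not a split
            arc_sum = prefix[j] - prefix[i]
            diff = arc_sum / k - (total - arc_sum) / (n - k)
            stat = abs(diff) / math.sqrt(1 / k + 1 / (n - k))
            if stat > best[0]:
                best = (stat, i, j)
    return best

def permutation_p_value(x, n_perm=200, seed=0):
    """Compare the observed maximal statistic with its permutation distribution."""
    rng = random.Random(seed)
    observed, i, j = best_circular_split(x)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_circular_split(shuffled)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1), (i, j)

# Toy chromosome: 30 probes with a gained block at positions 10-17.
noise = [0.05, -0.04, 0.02, -0.03, 0.01, -0.02]
data = [noise[t % 6] + (1.0 if 10 <= t < 18 else 0.0) for t in range(30)]
p_value, split = permutation_p_value(data)
```

Since the overall variance of the data is unchanged by shuffling, the statistic is left unscaled; the permutation distribution supplies the reference against which the observed split is judged.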
Several studies have shown that CBS is the most efficient method. 16-19 They have also shown that an optimal combination of the smoothing step and the segmentation step may result in improved performance.
Willenbrock and Fridlyand 19 compared three publicly available methods for the analysis of aCGH data: DNAcopy (CBS), the gain and loss analysis of DNA Gaussian-based approach (GLAD) and the 'cluster along chromosomes' (CLAC) approach, and they showed that segmentation by any of the

Multivariate method
The multivariate method segments all samples simultaneously, finding general copy number regions that may be similar across all samples. This method is preferable for finding very small copy number regions, and for finding conserved regions, possibly useful for association studies. The copy number analysis model (CNAM) is a commercial tool that uses two types of segmentation: univariate (on a per-sample basis) and multivariate (on a multisample basis). 14

Other methods
While most segmentation methods employ parametric models for array CGH data, some non-parametric approaches that are free of distribution assumptions have also shown success in calling gains and losses in array CGH data. Hsu et al. 22 proposed to minimise noise from the array CGH data using wavelets before making inferences on the aberrations. Tibshirani et al. 23 developed a spatial smoothing approach using fused lasso regression for calling gains and losses. The regression framework of fused lasso brings great computational efficiency and can be easily generalised to other analyses involving CGH data. Jong et al. 24 used genetic local search algorithms. Willenbrock et al. 19 used the adaptive weights smoothing method, GLADmerge (a modified version of GLAD 25 ), for combining segments obtained from GLAD, first within and then across chromosomes, through hierarchical clustering in which clusters of segments are identified from the resultant dendrograms.
An ideal tool for the analysis of aCGH data should allow the user to choose among several of the algorithms. For end-users, web-based applications are the most suitable, since they require no software installation and place no demands on the user's hardware. Available web-based tools include analysis of array-based comparative genomic hybridisation (ADaCGH) 26 and in silico array-CGH (ISACGH) 27 (see Table 1). Other tools are implemented in MATLAB or as an executable file with a very simple interface that guides the user through the analysis (Table 1). A very helpful feature of some tools is the ability to estimate the statistical significance of the detected copy number changes and then rank them accordingly. R (http://www.r-project.org) also has several packages for the analysis of aCGH data.

R packages
R is a powerful, yet flexible, statistical computing and programming environment. Its object-oriented programming scheme has made algorithm development easy and flexible and has attracted a huge developer community. R is platform independent, and works on all major computer operating systems. R has several packages for the analysis of aCGH data. These packages are freely available at the Comprehensive R Archive Network (CRAN) section of the website. 28 They include a CBS method (DNAcopy), 16,29 an unsupervised HMM approach, 11 GLAD, 25 cghMCR, 30 the CLAC method using the hierarchical clustering algorithm, 21 a penalised least-squares regression 31 and the wavelet approach. 22 BioHMM 15 is an integral part of the segmentation, normalisation and processing of aCGH data (snapCGH) 32 R library. This library lets the user apply other segmentation schemes using common input and output data objects. Additionally, snapCGH works seamlessly with limma objects 33 and enables the use of pre-processing (and other) functions therein. RAN-aCGH is an R graphical user interface (GUI) for analysis and visualisation of aCGH data and includes several of the packages in R. 34 There are also a number of web-based applications, such as ADaCGH 26 and ISACGH, 27 for viewing and comparing outputs from multiple algorithms (Table 1).

Discussion and conclusion
Most of the methods do well in detecting the existence and the width of aberrations for large changes and high signal-to-noise ratio. None of the algorithms, however, reliably detected aberrations with small width and low signal-to-noise ratio. 35-38 Several previous studies have compared the performance of these methods, as well as the segmentation schemes. 17,19,39 Lockwood et al. 39 reviewed 16 different tools that were used in visualisation or analysis of aCGH data.
Lai et al. 17 compared ten different methods and found that HMM 11 performed poorly, with a high false-positive rate (0.40-0.60) and low sensitivity (50-80 per cent) with copy number segments. 17 These authors showed that DNAcopy 29 generally performed better than GLAD 25 and HMM with regard to detection of copy number alterations. Their results also indicated that HMM performed best for small aberrations (given a sufficient signal-to-noise ratio), and that GLAD did better than HMM for wider aberrations. 17 They showed that simple smoothing algorithms such as lowess and wavelets are the fastest, and that HMM and CBS 16 were the slowest. They also noted that only CLAC 21 and the array CGH expression integration tool (ACE) 40 incorporate the false discovery rate (FDR), and that some of the segmentation methods, such as CGHseg 41 and CBS, consistently performed well. 17
Wang compared several different segmentation methods and found that CGHseg appeared to be overly sensitive to outlier measurements, and thus would be more suitable for detecting single gene copy number changes. 18 Her results showed that CLAC was conservative in handling outliers with opposite signs in the same alteration region and therefore tended to break large alteration segments into small blocks. CBS provided clean solutions for segmentation but was limited in detecting breakpoints whose alteration signals were weak. 18
The few early methods employed automatically to call gains and losses from aCGH data involved smoothing the log2 ratio vectors followed by applying certain thresholds. 38,42,43 A common drawback of these methods was that they did not take into account biological covariates, such as the distance between adjacent clones or clone quality, which are likely to affect the segmentation (ie some regions of the genome being densely covered, while others have larger gaps between probes).
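The early smooth-then-threshold strategy can be sketched in a few lines. The example below is only illustrative: the moving median smoother, the window of five probes and the cutoff of 0.3 are assumptions made for this sketch and are not taken from the cited studies. Note that probes are treated as equally spaced, which is exactly the drawback described above.

```python
from statistics import median

def moving_median(values, window=5):
    """Smooth by taking the median of a window centred on each probe.

    The window is truncated at the ends of the vector.
    """
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def threshold_calls(log2_ratios, window=5, cutoff=0.3):
    """Call each probe from its smoothed value using a fixed, assumed cutoff."""
    smoothed = moving_median(log2_ratios, window)
    return ["gain" if s > cutoff else "loss" if s < -cutoff else "no change"
            for s in smoothed]

# Toy data: one noisy probe (index 1) and a genuine gain (indices 6-10).
ratios = [0.0, 0.9, 0.02, -0.01, 0.01, 0.0, 0.6,
          0.55, 0.62, 0.58, 0.61, 0.0, -0.02, 0.01]
calls = threshold_calls(ratios)
```

In this toy example the isolated high value at index 1 is smoothed away, while the five consecutive elevated probes are called as a gain.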
aCGH analysis has come a long way, and the software packages have become more accurate and user friendly, but we are likely to see even more improvements in these software packages in the future.