A review of software for microarray genotyping

The focus of this review is software for the genotyping of microarray single nucleotide polymorphisms, in particular software for Affymetrix and Illumina arrays. Different statistical principles and ideas have been applied to the construction of genotyping algorithms -- for example, likelihood versus Bayesian modelling, and whether to genotype one or all arrays at a time. The release of new arrays is generally followed by new, or updated, algorithms.


Introduction
The use of microarrays and microarray technology in research is now more than 15 years old and has had a tremendous impact on many aspects of research. Suddenly, it became possible to profile and survey whole genomes and to compare genomes across individuals and species to an extent that was hardly possible before. The perception of the genome changed as genome-wide data became available to everyone.
This review focuses narrowly on software used for genotyping of single nucleotide polymorphisms (SNPs) in connection with SNP microarrays (or 'arrays' for short). There are an estimated ten million or more SNPs in the human genome. 1 For each of these, there are three possible genotypes (assuming diploidy), AA, BB (homozygous) and AB (heterozygous), where A and B denote the two possible alleles. The first commercial SNP array was released in 1996 by Affymetrix (Santa Clara, CA) and targeted about 1,500 human SNPs, 2 a tiny fraction of all SNPs. Since then, many different manufacturers have developed microarrays for genome-wide genotyping, including Affymetrix, Agilent (Santa Clara, CA), Illumina (San Diego, CA) and Nimblegen (Madison, WI), with arrays designed for many different organisms.
SNP arrays have found uses in many research areas and contexts, for example, association mapping, 3 linkage disequilibrium mapping, 4 phasing, 5 inference on demography and ancestry, 6 evolution 7 and loss-of-heterozygosity analysis in cancer. 8 Early usage of SNP arrays sought to estimate loss of heterozygosity in cancer by comparing DNA from germline and tumour cells. 9 In addition, SNP arrays have been used to estimate copy numbers in cancers 10 (similar to the use of comparative genomic hybridisation [CGH] arrays) and copy number variants (CNVs) in populations. 11 The newest arrays from Affymetrix and Illumina both contain probes for CNVs and copy number polymorphisms (CNPs).
Today, SNP microarrays are able to genotype more than a million SNPs simultaneously (Table 1). This large number of SNPs poses a number of statistical, as well as computational, problems and has attracted the attention of many statisticians and bioinformaticians. Interestingly, the problems themselves have led to many new developments in statistics and have fostered what we might term 'informatics of large datasets'.
There are a number of statistical issues that are shared between microarrays, irrespective of the platform, chemistry and design principles. These include:
(i) Normalisation of raw intensities
(ii) Background correction and outlier detection
(iii) Genotyping
The statistical methods applied at each step are, to some extent, transferable between platforms and array types, in particular the parts relating to (i) and (ii). Normalisation of array intensities is important in order to make comparisons across arrays. 12,13 Background correction and outlier detection (individual 'bad' SNPs, as well as 'bad' arrays) are essential for correct interpretation of the data 12,13 (ie to reduce the number of false and missing calls).
A general review of SNP array platforms and their history and use is given by LaFramboise. 14

Software
We focus on the most commonly used platforms, Affymetrix and Illumina, and do not discuss software for normalisation and background correction.

Problem formulation
Both platforms represent a SNP by a number of probes (Affymetrix) or beads (Illumina) for each allele. The probes/beads have different affinities, depending on the DNA sequence they target, and thus produce signals of various strengths (see Figure 1). The newest arrays, Human SNP Array 6.0 from Affymetrix and HumanOmni2.5-Quad BeadChip from Illumina, use three probes and around 20 beads for each allele, respectively. Earlier Affymetrix arrays additionally used mismatch probes, which were designed to capture nonspecific binding. A first step in many algorithms for genotyping is to summarise the probe intensities for each allele and SNP, and a second step is to make a call based on the summarised intensities.
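The two-step structure described above (summarise, then call) can be illustrated with a minimal sketch. It is not the method of any program reviewed here: the median summary, the log2 contrast, the threshold `het_band` and the function names are all assumptions chosen for illustration.

```python
import math
from statistics import median


def summarise_snp(probes_a, probes_b):
    """Step 1: collapse the per-probe (or per-bead) intensities of one SNP
    into a single value per allele. The median is an assumed, outlier-robust
    choice; real algorithms use more elaborate summaries."""
    return median(probes_a), median(probes_b)


def naive_call(a, b, het_band=0.5):
    """Step 2: toy call from the log2 contrast of the summarised intensities.
    het_band is a hypothetical threshold, not a published value."""
    contrast = math.log2(a / b)
    if contrast > het_band:
        return "AA"
    if contrast < -het_band:
        return "BB"
    return "AB"


# Example with made-up intensities for one SNP on one array.
a, b = summarise_snp([1200, 1100, 1350], [210, 190, 260])
call = naive_call(a, b)
```

A fixed threshold of this kind ignores the probe-specific affinities mentioned above, which is one reason why the algorithms discussed below estimate SNP-specific clusters instead.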
SNP calling software for Affymetrix SNP arrays
Following its release of new SNP arrays (called GeneChips), Affymetrix has developed accompanying software that takes into account the properties of the new arrays. The first programs, Modified Partitioning Around Medoids (MPAM) 16 and Dynamic Model (DM), 17 were able to genotype one SNP on one chip at a time. The next generation of software, Robust Linear Model with Mahalanobis Distance Classifier (RLMM), 18 BRLMM 19 (which adds a Bayesian step to RLMM), BRLMM-P 20 (which uses perfect match probes only) and Birdseed, 21 increased accuracy and performance using a multi-chip approach.
One SNP, one chip
For the first SNP arrays, Affymetrix designed software modules (MPAM, DM) to genotype individual SNPs, one array at a time. The DM software 17 was introduced with the release of the 100K GeneChip and is based on statistical modelling of quartets. A quartet consists of match and mismatch probes for the two alleles. This software does not require any normalisation step and does not summarise the probe intensities. A score is assigned for each quartet and the Wilcoxon signed-rank test is used to give a call.
Multi-chip approach
Several groups 3,18,22-28 have designed SNP calling algorithms using a multi-chip approach; the first algorithm was RLMM. 18 This approach requires pre-processing steps, such as array normalisation, in order to compare data across arrays, and summarisation of the probe intensities for each allele. For each SNP, the two allele intensities are then clustered into three clouds, representing the different genotypes across many chips (Figure 2). Affymetrix designed the BRLMM algorithm for the 500K SNP arrays. 19 This algorithm was a significant improvement over the DM algorithm used for the previous arrays. The BRLMM algorithm is an extension of the RLMM software and it uses a Bayesian step to define cluster centres and variances of SNP intensities. Briefly, after normalisation and allelic summarisation, genotypes are clustered using a Bayesian prior on cluster centres and variances and a pre-clustering made by the DM algorithm. The prior is based on a random set of SNPs, with a minimum number of individuals in each cluster. This allows for a better definition of the genotype clusters with few (potentially no) individuals. Further, new arrays can be genotyped using predefined parameters obtained from other arrays.
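Once the three genotype clouds of a SNP have been estimated, a Mahalanobis distance classifier of the kind named in RLMM assigns each array to the nearest cloud, taking the shape of each cloud into account. The sketch below shows only this classification step in two dimensions. The cluster centres and covariances are made-up values on an assumed log2 intensity scale, and the function names are hypothetical.

```python
def mahalanobis_sq(point, centre, cov):
    """Squared Mahalanobis distance in 2-D.
    cov is a symmetric 2x2 covariance ((sxx, sxy), (sxy, syy))."""
    dx = point[0] - centre[0]
    dy = point[1] - centre[1]
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Uses the closed-form inverse of a 2x2 matrix.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def classify(point, clusters):
    """Return the genotype whose cluster is closest in Mahalanobis distance.
    clusters maps genotype -> (centre, covariance)."""
    return min(clusters, key=lambda g: mahalanobis_sq(point, *clusters[g]))


# Illustrative clusters for one SNP in (log2 A, log2 B) space (assumed values).
clusters = {
    "AA": ((11.0, 8.0), ((0.04, 0.00), (0.00, 0.09))),
    "AB": ((10.0, 10.0), ((0.05, 0.01), (0.01, 0.05))),
    "BB": ((8.0, 11.0), ((0.09, 0.00), (0.00, 0.04))),
}
genotype = classify((10.8, 8.3), clusters)
```

In practice the distance to the nearest cluster can also serve as a confidence measure, so that points far from every cloud are left uncalled; that thresholding is omitted here.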
For the SNP5.0 GeneChip, Affymetrix designed a new version of the BRLMM algorithm, named BRLMM-P, as the array does not have mismatch probes. 20 The DM step of BRLMM is replaced by a maximum likelihood-based division into genotype clusters. Further, the prior can be a generic prior common to all SNPs or a SNP-specific prior defined using a set of training data (such as HapMap data).
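The effect of a prior on a cluster centre can be illustrated with a simple conjugate update, in which the prior acts like a number of pseudo-observations. This is a sketch of the general idea only, not the update used by BRLMM or BRLMM-P; `prior_weight` and the numerical values are assumptions.

```python
def posterior_centre(prior_centre, prior_weight, observations):
    """Shrink the observed cluster mean towards a prior centre.

    prior_weight is a hypothetical pseudo-count: the prior counts as that
    many observations. With no observations the prior centre is returned,
    which is how a cluster with no individuals still gets a location."""
    n = len(observations)
    if n == 0:
        return tuple(prior_centre)
    dims = len(prior_centre)
    sample_mean = [sum(obs[d] for obs in observations) / n for d in range(dims)]
    return tuple(
        (prior_weight * prior_centre[d] + n * sample_mean[d]) / (prior_weight + n)
        for d in range(dims)
    )


# A rare BB cluster: no arrays observed, so the prior decides the centre.
empty = posterior_centre((8.0, 11.0), 4, [])
# Four observed arrays pull the centre halfway from the prior to the data.
updated = posterior_centre((8.0, 11.0), 4, [(9.0, 10.0)] * 4)
```

With a generic prior, `prior_centre` would be shared by all SNPs; with a SNP-specific prior it would come from training data for that SNP.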
For SNP6.0, the Broad Institute, in collaboration with Affymetrix, developed the Birdsuite software. 21,29 The novelty comes from relaxing the assumption that all SNPs are diploid and introducing known CNPs. Birdsuite is actually composed of four applications: Canary, Birdseed, Birdseye and Fawkes. Canary can give an estimate of the copy number of known CNPs. Birdseed is genotyping software with use restricted to diploid genomic regions. It is similar to BRLMM. Clusters are predefined using training data and then further optimised. The Birdseye software can detect rare CNVs and genotype SNPs in CNVs. Finally, Fawkes combines the output of the three previous applications to assign a comprehensive genotype (A-null, AA, AB, BB, AAB, ...).
Other software can be used to genotype SNPs from Affymetrix GeneChips, such as Corrected Robust Model with Maximum Likelihood Distance (CRLMM), 23 Genotype calling with Empirical Likelihood (GEL), 24 SNiPer-High Density (SNiPer-HD), 25 Probe-Level Allele-specific Quantization (PLASQ), 26 28 Some work has been done to compare algorithms 30-32 and, generally, the performance of algorithms is compared with HapMap data in the original papers.

SNP calling software for Illumina BeadChips
Illumina has developed its own software to genotype SNPs on the BeadChip array. The software is called GenCall and has not been through the same chain of transformations as the Affymetrix software. The GenCall algorithm was implemented within the BeadStudio application (latest version v3.2.2) 33 but it is now part of the GenomeStudio application (the current version is 1.1.0). It relies on a specific normalisation occurring automatically within the Illumina GenomeStudio software and consists of several steps (including outlier removal and background estimation). The normalised intensities are then summarised, such that each SNP is assigned a pair of values corresponding to each allele. This pair represents the allele intensities in polar coordinates; the R-coordinate represents the copy number of the SNP and the theta-coordinate represents the angle from the x-axis (Figure 1). This is a multi-array approach, using information from all arrays simultaneously.
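The polar representation can be sketched as follows. The exact definitions are assumptions, since the text gives no formula: R is taken as the sum of the two normalised allele intensities and theta as the angle from the x-axis rescaled to lie between 0 and 1. Other conventions (for example, a Euclidean radius) are possible, and the function name is hypothetical.

```python
import math


def to_polar(a, b):
    """Convert normalised allele intensities (a, b) to (R, theta).

    Assumed convention: R = a + b (total signal) and
    theta = (2 / pi) * atan2(b, a), so theta is near 0 for AA,
    near 0.5 for AB and near 1 for BB."""
    if a < 0 or b < 0:
        raise ValueError("intensities must be non-negative")
    r = a + b
    theta = (2.0 / math.pi) * math.atan2(b, a)
    return r, theta


# Equal signal from both alleles sits in the middle of the theta range.
r_het, theta_het = to_polar(0.5, 0.5)
```

Under this convention, genotype calling reduces largely to clustering the theta values of a SNP across arrays, while R carries the information used for copy number analysis.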
The call is made using a cluster file supplied by Illumina, based on a reference set of samples. There is, however, an option to make the call without using the reference set, relying exclusively on the sampled arrays instead. This dichotomy is similar to that of BRLMM (and subsequent Affymetrix software), where a call can be made with pre-defined parameters, corresponding to a reference population. Whether one should use the reference set for genotype calling depends on the number of sampled arrays, the quality of the DNA and the minor allele frequency (MAF) of interest, as the size of the reference set determines the MAF detectable. 34 For SNPs with fewer than three genotype clusters, the locations and variations of the missing genotype clusters are estimated using artificial neural networks. It is also possible to change the call of any SNP manually using Illumina's visualisation tool.
For CNV analysis, Illumina has developed a series of tools which are available as plug-ins to the GenomeStudio genotyping module. Software for estimation of copy numbers (cnvPartition), detection and annotation of homozygosity in single samples (Homozygosity Detector), detection and annotation of chromosomal aberrations in single samples (ChromoZone) and for calculating a likelihood score for strength of loss-of-heterozygosity (LOH Score) is available.
Other methods have been proposed for the BeadChip arrays. Teo et al. designed a multi-array genotype calling algorithm (Illuminus) that does not rely on a reference population. 35 By contrast, Giannoulatou et al. developed a method (GenoSNP) that works entirely within each sample, thereby making the performance independent of sample size and of any outside control samples. 34 Both methods rely on an expectation-maximisation (EM) algorithm. The CRLMM algorithm 23 is also available for Illumina data as a package for R/Bioconductor. 36
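The role of the EM algorithm in such methods can be illustrated with a deliberately simple model: a three-component Gaussian mixture fitted to the theta values of one SNP. This is not the model of Illuminus or GenoSNP; the Gaussian components, the starting values, `min_sd` and the function names are all assumptions made for the sketch.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_genotype_clusters(thetas, means=(0.1, 0.5, 0.9), sd=0.1,
                         n_iter=50, min_sd=0.01):
    """Fit a 3-component 1-D Gaussian mixture by EM and call genotypes.
    Components are ordered AA, AB, BB (theta increasing)."""
    means, sds, weights = list(means), [sd] * 3, [1.0 / 3.0] * 3
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each genotype for each array.
        resp = []
        for x in thetas:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(thetas)
            if nk < 1e-8:
                continue  # empty cluster keeps its current mean and sd
            means[k] = sum(r[k] * x for r, x in zip(resp, thetas)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, thetas)) / nk
            sds[k] = max(math.sqrt(var), min_sd)  # floor avoids collapse
    # Calls use the responsibilities from the final E-step.
    labels = ("AA", "AB", "BB")
    calls = [labels[max(range(3), key=lambda k: r[k])] for r in resp]
    return means, calls


# Made-up theta values for one SNP across eleven arrays.
thetas = [0.05, 0.08, 0.10, 0.12, 0.48, 0.50, 0.52, 0.55, 0.90, 0.93, 0.95]
fitted_means, calls = em_genotype_clusters(thetas)
```

A multi-array method would run such a fit per SNP across arrays, whereas a within-sample method would pool SNPs within one array; the sketch does not distinguish between the two designs.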

Discussion and conclusions
The accuracy of genotype calling is usually reported to be above 99 per cent. This is typically the case when samples and DNA of good quality are available. However, many cancer laboratories are interested in genotyping SNPs and CNPs, as well as estimating copy numbers, from tumour tissue.
Here, there are a number of problems that have not yet fully been overcome: tumour tissue typically contains normal cells that are difficult to remove prior to analysis; also, tumour tissue tends to be heterogeneous, in the sense that different tumour cells have different copy number aberrations. These issues affect the possibility of accurately estimating genotypes and copy numbers, and significantly reduce the accuracy of calling algorithms.
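The effect of normal cell contamination can be made concrete with a simple mixture calculation of the expected B-allele fraction. The model is an idealisation: it assumes that signal is proportional to allele copy number and that the tumour cells are homogeneous. The function and parameter names are hypothetical.

```python
def expected_baf(purity, tumour_b, tumour_total, normal_b=1, normal_total=2):
    """Expected B-allele fraction in a mixture of tumour and normal cells.

    purity       -- fraction of tumour cells in the sample (0 to 1)
    tumour_b     -- copies of allele B in tumour cells
    tumour_total -- total copies of the locus in tumour cells
    normal_b, normal_total -- the same for normal cells (default: AB, diploid)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    b_signal = purity * tumour_b + (1.0 - purity) * normal_b
    total_signal = purity * tumour_total + (1.0 - purity) * normal_total
    if total_signal == 0:
        raise ValueError("no copies of the locus in the sample")
    return b_signal / total_signal


# Germline AB SNP where the tumour has lost the B allele (one copy of A left).
pure_tumour = expected_baf(1.0, tumour_b=0, tumour_total=1)
half_normal = expected_baf(0.5, tumour_b=0, tumour_total=1)
```

With a pure tumour sample the SNP would look like a clear AA call, but with 50 per cent normal cells the expected fraction is one third, which falls between the AA and AB clusters that a diploid calling algorithm expects.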
We have discussed software for genotyping; however, much software has also been developed for further downstream analysis, to accommodate specific questions and needs. 37,38 Software for normalisation and background correction has likewise received much attention. These methods are also generally applicable to other types of arrays, and borrowing of ideas between array types is common.
The future of SNP arrays, in addition to many other microarray types, such as gene (RNA) expression and microRNA expression arrays, is uncertain. For the individual laboratory, the common microarray platforms are still more cost-efficient than the new platforms built on next-generation sequencing (NGS) technologies. However, NGS is already dominating research to an extent that few foresaw five years ago. In addition, it is possible to have samples sequenced through commercial organisations or scientific collaborations.
SNP and other arrays are still in use, however. They have transformed the field of genomics and sparked an intense interest among the statistics and bioinformatics communities to provide solutions to large-scale data problems. These solutions are the foundation for solving the similar large-scale data problems encountered with NGS.