The focus of this review is software for the genotyping of microarray single nucleotide polymorphisms, in particular software for Affymetrix and Illumina arrays. Different statistical principles and ideas have been applied to the construction of genotyping algorithms -- for example, likelihood versus Bayesian modelling, and whether to genotype one or all arrays at a time. The release of new arrays is generally followed by new, or updated, algorithms.
The use of microarrays and microarray technology in research is now more than 15 years old and has had a tremendous impact on many aspects of research. Suddenly, it became possible to profile and survey whole genomes and to compare genomes across individuals and species to an extent that was hardly possible before. The perception of the genome changed as genome-wide data became available to everyone.
This review focuses narrowly on software used for genotyping of single nucleotide polymorphisms (SNPs) in connection with SNP microarrays (or 'arrays' for short). There are an estimated ten million or more SNPs in the human genome. For each of these, there are three possible genotypes (assuming diploidy), AA, BB (homozygous) and AB (heterozygous), where A and B denote the two possible alleles. The first commercial SNP array was released in 1996 by Affymetrix (Santa Clara, CA) and targeted about 1,500 human SNPs, a tiny fraction of all SNPs. Since then, many different manufacturers have developed microarrays for genome-wide genotyping, including Affymetrix, Agilent (Santa Clara, CA), Illumina (San Diego, CA) and Nimblegen (Madison, WI), with arrays designed for many different organisms.
SNP arrays have found uses in many research areas and contexts, for example association mapping, linkage disequilibrium mapping, phasing, inference on demography and ancestry, evolution and loss-of-heterozygosity analysis in cancer. Early usage of SNP arrays sought to estimate loss of heterozygosity in cancer by comparing DNA from germline and tumour cells. In addition, SNP arrays have been used to estimate copy numbers in cancers (similar to the use of comparative genomic hybridisation [CGH] arrays) and copy number variants (CNVs) in populations. The newest arrays from Affymetrix and Illumina both contain probes for CNVs and copy number polymorphisms (CNPs).
Today, SNP microarrays are able to genotype more than a million SNPs simultaneously (Table 1). This large number of SNPs poses a number of statistical, as well as computational, problems and has attracted the attention of many statisticians and bioinformaticians. Interestingly, the problems themselves have led to many new developments in statistics and have fostered what we might term 'informatics of large datasets'. There are a number of statistical issues that are shared between microarrays, irrespective of the platform, chemistry and design principles. These include:
(i) Normalisation of raw intensities
(ii) Background correction and outlier detection
The statistical methods applied at each step are, to some extent, transferable between platforms and array types, in particular the parts relating to (i) and (ii). Normalisation of array intensities is important in order to make comparisons across arrays [12, 13]. Background correction and outlier detection (of individual 'bad' SNPs, as well as 'bad' arrays) are essential for correct interpretation of the data [12, 13] (i.e. to reduce the number of false and missing calls).
A general review of SNP array platforms and their history and use is given by LaFramboise.
We focus on the most commonly used platforms, Affymetrix and Illumina, and do not discuss software for normalisation and background correction.
Both platforms represent a SNP by a number of probes (Affymetrix) or beads (Illumina) for each allele. The probes/beads have different affinities, depending on the DNA sequence they target, and thus produce signals of various strengths (see Figure 1). The newest arrays, Human SNP Array 6.0 from Affymetrix and HumanOmni2.5-Quad BeadChip from Illumina, use three probes and around 20 beads for each allele, respectively. Earlier Affymetrix arrays additionally used mismatch probes, which were designed to capture non-specific binding. Many genotyping algorithms proceed in two steps: first, the probe intensities are summarised for each allele and SNP; second, a call is made based on the summarised intensities.
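The two-step principle (summarise, then call) can be sketched as follows. This is a minimal illustration only and not any vendor's algorithm: the function names, the use of the median as summary and the fixed contrast threshold `het_band` are assumptions made for the example.

```python
from statistics import median


def summarise_allele(probe_intensities):
    """Collapse replicate probe/bead intensities for one allele into one value.

    The median is an illustrative choice of robust summary.
    """
    return median(probe_intensities)


def naive_call(a_probes, b_probes, het_band=0.5):
    """Call a genotype from a single array using a fixed contrast threshold.

    het_band is a hypothetical cut-off; real callers estimate cluster
    boundaries from the data rather than fixing them in advance.
    """
    a = summarise_allele(a_probes)
    b = summarise_allele(b_probes)
    if a + b <= 0:
        return "NoCall"
    contrast = (a - b) / (a + b)  # near +1 for AA, near 0 for AB, near -1 for BB
    if contrast > het_band:
        return "AA"
    if contrast < -het_band:
        return "BB"
    return "AB"


print(naive_call([1000, 1100, 950], [100, 120, 90]))
print(naive_call([500, 520, 480], [510, 490, 500]))
```

A fixed threshold ignores the SNP-specific probe affinities mentioned above, which is one reason the algorithms reviewed below estimate genotype clusters separately for each SNP.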
SNP calling software for Affymetrix SNP arrays
Following its release of new SNP arrays (called GeneChips), Affymetrix has developed accompanying software that takes into account the properties of the new arrays. The first programs, Modified Partitioning Around Medoids (MPAM) and Dynamic Model (DM), were able to genotype one SNP on one chip at a time. The next generation of software, Robust Linear Model with Mahalanobis Distance Classifier (RLMM), BRLMM (which adds a Bayesian step to RLMM), BRLMM-P (which uses perfect match probes only) and Birdseed, increased accuracy and performance using a multi-chip approach.
One SNP, one chip
For the first SNP arrays, Affymetrix designed software modules (MPAM, DM) to genotype individual SNPs, one array at a time. The DM software was introduced with the release of the 100 K GeneChip and is based on statistical modelling of quartets. A quartet consists of match and mismatch probes for the two alleles. This software does not require any normalisation step and does not summarise the probe intensities. A score is assigned to each quartet and the Wilcoxon signed-rank test is used to give a call.
Several groups [3, 18, 22–28] have designed SNP calling algorithms using a multi-chip approach; the first algorithm was RLMM. This approach requires pre-processing steps, such as array normalisation, in order to compare data across arrays, and summarisation of the probe intensities for each allele. For each SNP, the pairs of allele intensities from many chips are then clustered into three clouds, representing the three genotypes (Figure 2).
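To illustrate the classification step of a multi-chip caller, the sketch below assigns a pair of summarised allele intensities to the nearest of three genotype clusters by Mahalanobis distance, the distance that gives RLMM its name. It is a simplified illustration, not the RLMM implementation: the cluster centres and covariance matrices are invented values on a log-intensity scale, and in practice they would be estimated from many arrays.

```python
def mahalanobis_sq(point, centre, cov):
    """Squared Mahalanobis distance of a 2-D point from a cluster centre."""
    dx = point[0] - centre[0]
    dy = point[1] - centre[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # v^T * inverse(cov) * v, with the 2x2 inverse written out explicitly
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def classify(point, clusters):
    """Return (genotype, squared distance) for the closest cluster."""
    distances = {
        genotype: mahalanobis_sq(point, centre, cov)
        for genotype, (centre, cov) in clusters.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical cluster parameters for one SNP (log2 A intensity, log2 B intensity).
clusters = {
    "AA": ((10.0, 7.0), ((0.1, 0.0), (0.0, 0.1))),
    "AB": ((9.0, 9.0), ((0.1, 0.0), (0.0, 0.1))),
    "BB": ((7.0, 10.0), ((0.1, 0.0), (0.0, 0.1))),
}

print(classify((9.8, 7.3), clusters))
print(classify((9.1, 8.8), clusters))
```

The returned distance could also serve as a confidence measure: a point far from all three clusters would be a natural candidate for a missing call.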
Affymetrix designed the BRLMM algorithm for the 500 K SNP arrays. This algorithm was a significant improvement over the DM algorithm used for the previous arrays. The BRLMM algorithm is an extension of the RLMM software and it uses a Bayesian step to define cluster centres and variances of SNP intensities. Briefly, after normalisation and allelic summarisation, genotypes are clustered using a Bayesian prior on cluster centres and variances and a pre-clustering made by the DM algorithm. The prior is based on a random set of SNPs, with a minimum number of individuals in each cluster. This allows for a better definition of genotype clusters that contain few (potentially no) individuals. Further, new arrays can be genotyped using pre-defined parameters obtained from other arrays.
For the SNP5.0 GeneChip, Affymetrix designed a new version of the BRLMM algorithm, named BRLMM-P, as the array does not have mismatch probes. The DM step of BRLMM is replaced by a maximum likelihood-based division into genotype clusters. Further, the prior can be a generic prior common to all SNPs or a SNP-specific prior defined using a set of training data (such as HapMap data).
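The benefit of a prior for sparsely populated clusters can be seen in a one-dimensional sketch. The code below uses the standard conjugate update for a normal mean with known variance, in which the prior acts like `prior_weight` pseudo-observations; it is meant only to show the shrinkage idea and is not the BRLMM update itself. The function name and the numerical values are hypothetical.

```python
def posterior_centre(prior_centre, prior_weight, observations):
    """Posterior mean of a cluster centre under a conjugate normal prior.

    prior_weight plays the role of a pseudo-count: the number of
    observations the prior is considered to be worth.
    """
    n = len(observations)
    if n == 0:
        # An empty cluster falls back entirely on the prior.
        return prior_centre
    sample_mean = sum(observations) / n
    return (prior_weight * prior_centre + n * sample_mean) / (prior_weight + n)


# No minor homozygotes observed: the centre is taken from the prior.
print(posterior_centre(0.0, 4, []))
# Four observations carry the same weight as the prior.
print(posterior_centre(0.0, 4, [1.0, 1.0, 1.0, 1.0]))
# With many observations, the data dominate.
print(posterior_centre(0.0, 4, [1.0] * 396))
```

With no observations the estimate equals the prior centre, and as the number of observations grows the estimate approaches the sample mean, so rare genotype clusters remain defined without the prior overriding well-populated ones.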
For SNP6.0, the Broad Institute, in collaboration with Affymetrix, developed the Birdsuite software [21, 29]. The novelty comes from relaxing the assumption that all SNPs are diploid and introducing known CNPs. Birdsuite is composed of four applications: Canary, Birdseed, Birdseye and Fawkes. Canary estimates the copy number of known CNPs. Birdseed is a genotyping program whose use is restricted to diploid genomic regions. It is similar to BRLMM: clusters are pre-defined using training data and then further optimised. Birdseye detects rare CNVs and genotypes SNPs within CNVs. Finally, Fawkes combines the output of the three previous applications to assign a comprehensive genotype (A-null, AA, AB, BB, AAB,...).
Other software can be used to genotype SNPs from Affymetrix GeneChips, such as Corrected Robust Linear Model with Maximum Likelihood Distance (CRLMM), Genotype calling with Empirical Likelihood (GEL), SNiPer-High Density (SNiPer-HD), Probe-Level Allele-specific Quantization (PLASQ), MAMS (which combines Single-Array Multi-SNP [SAMS] with Multi-Array Single-SNP [MASS]), Chiamo and JAPL. Some work has been done to compare algorithms [30–32] and, generally, the performance of an algorithm is compared with HapMap data in the original paper.
SNP calling software for Illumina BeadChips
Illumina has developed its own software to genotype SNPs on the BeadChip array. The software is called GenCall and has not gone through the same succession of revisions as the Affymetrix software. The GenCall algorithm was implemented within the BeadStudio application (latest version v3.2.2) but it is now part of the GenomeStudio application (the current version is 1.1.0). It relies on a specific normalisation that occurs automatically within the Illumina GenomeStudio software and consists of several steps (including outlier removal and background estimation). The normalised intensities are then summarised, such that each SNP is assigned a pair of values, one for each allele. This pair is then expressed in polar coordinates; the R-coordinate reflects the total signal intensity, and hence the copy number, of the SNP and the theta-coordinate represents the angle from the x-axis (Figure 1). This is a multi-array approach, using information from all arrays simultaneously.
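The polar transformation can be sketched as below. The exact definitions are assumptions made for illustration: R is taken as the sum of the two normalised allele intensities and theta as the angle from the x-axis rescaled to lie between 0 and 1, a convention commonly described for Illumina data. The function name `to_polar` is hypothetical.

```python
import math


def to_polar(x, y):
    """Convert normalised allele intensities (x for A, y for B) to (R, theta).

    Assumed convention: R = x + y; theta = (2 / pi) * atan2(y, x), so that
    theta is 0 on the A axis and 1 on the B axis for non-negative inputs.
    """
    r = x + y
    theta = (2.0 / math.pi) * math.atan2(y, x)
    return r, theta


for label, (x, y) in {"AA-like": (1.0, 0.05),
                      "AB-like": (0.5, 0.5),
                      "BB-like": (0.05, 1.0)}.items():
    r, theta = to_polar(x, y)
    print(label, round(r, 3), round(theta, 3))
```

Under this convention the three genotypes separate along theta (near 0, 0.5 and 1), while deviations in R from its typical level point towards copy number changes.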
The call is made using a cluster file supplied by Illumina, based on a reference set of samples. There is, however, an option to make the call without using the reference set, instead relying exclusively on the sampled arrays. This dichotomy is similar to that of BRLMM (and subsequent Affymetrix software), where a call can be made with pre-defined parameters, corresponding to a reference population. Whether one should use the reference set for genotype calling depends on the number of sampled arrays, the quality of the DNA and the minor allele frequency (MAF) of interest, as the size of the reference set determines the MAF detectable.
For SNPs with fewer than three genotype clusters, the locations and variations of the missing genotype clusters are estimated using artificial neural networks. It is also possible to change the call of any SNP manually using Illumina's visualisation tool. For CNV analysis, Illumina has developed a series of tools which are available as plug-ins to the GenomeStudio genotyping module. Software is available for estimation of copy numbers (cnvPartition), detection and annotation of homozygosity in single samples (Homozygosity Detector), detection and annotation of chromosomal aberrations in single samples (ChromoZone) and for calculating a likelihood score for strength of loss-of-heterozygosity (LOH Score).
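The dependence of the detectable MAF on the size of the reference set can be made concrete with simple arithmetic: among n diploid samples there are 2n allele copies, so an allele seen in `min_copies` copies corresponds to a frequency of min_copies / (2n). The sketch below is a back-of-the-envelope bound under that assumption, not a property of any particular software; the function name and the parameter `min_copies` are hypothetical.

```python
def min_detectable_maf(n_samples, min_copies=1):
    """Smallest allele frequency at which min_copies allele copies are
    expected among n_samples diploid individuals."""
    return min_copies / (2 * n_samples)


print(min_detectable_maf(100))      # one copy among 200 alleles
print(min_detectable_maf(100, 10))  # ten copies, e.g. to populate a cluster
```

In practice a usable genotype cluster needs several carriers rather than a single allele copy, so the practical limit lies well above the one-copy bound.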
Other methods have been proposed for the BeadChip arrays. Teo et al. designed a multi-array genotype calling algorithm (Illuminus) that does not rely on a reference population. By contrast, Giannoulatou et al. developed a method (GenoSNP) that works entirely within each sample, thereby making the performance independent of sample size and of any outside control samples. Both methods rely on an expectation-maximisation (EM) algorithm. The CRLMM algorithm is also available for Illumina data as a package for R/Bioconductor.
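The role of the EM algorithm in such callers can be illustrated with a generic three-component Gaussian mixture fitted to theta values for one SNP across arrays. This is a textbook sketch and neither the Illuminus nor the GenoSNP model; the starting values, the floor `min_sd` on the standard deviation and the function names are assumptions made for the example.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_three_clusters(thetas, n_iter=50, min_sd=0.02):
    """Fit a 3-component 1-D Gaussian mixture by EM and call genotypes."""
    means = [0.05, 0.5, 0.95]  # illustrative starting guesses: AA, AB, BB
    sds = [0.1, 0.1, 0.1]
    weights = [1.0 / 3.0] * 3
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each genotype for each array.
        resp = []
        for t in thetas:
            dens = [w * normal_pdf(t, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            if nk < 1e-8:
                continue  # empty cluster: keep the previous parameters
            means[k] = sum(r[k] * t for r, t in zip(resp, thetas)) / nk
            var = sum(r[k] * (t - means[k]) ** 2
                      for r, t in zip(resp, thetas)) / nk
            sds[k] = max(math.sqrt(var), min_sd)
            weights[k] = nk / len(thetas)
    labels = ("AA", "AB", "BB")
    calls = [labels[max(range(3), key=lambda k: r[k])] for r in resp]
    return means, calls


thetas = [0.03, 0.05, 0.04, 0.06, 0.48, 0.52, 0.50, 0.51,
          0.95, 0.97, 0.94, 0.96]
means, calls = em_three_clusters(thetas)
print([round(m, 3) for m in means])
print(calls)
```

An empty cluster simply keeps its starting parameters here, which is the situation that the priors and reference sets discussed above are designed to handle more carefully.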
Discussion and conclusions
The accuracy of genotype calling is usually reported to be above 99 per cent. This is typically the case when samples and DNA of good quality are available. Many cancer laboratories, however, are interested in genotyping SNPs and CNPs, as well as estimating copy numbers, from tumour tissue. Here, there are a number of problems that have not yet been fully overcome: tumour tissue typically contains normal cells that are difficult to remove prior to analysis; also, tumour tissue tends to be heterogeneous, in the sense that different tumour cells have different copy number aberrations. These issues make it harder to estimate genotypes and copy numbers accurately, and significantly reduce the accuracy of calling algorithms.
We have discussed software for genotyping; however, much software also has been developed for further downstream analysis, to accommodate specific questions and needs [37, 38]. Software for normalisation and background correction has likewise received much attention. These methods are also generally applicable to other types of arrays, and borrowing of ideas between array types is common.
The future of SNP arrays, and of many other microarray types, such as gene (RNA) expression and microRNA expression arrays, is uncertain. For the individual laboratory, the common microarray platforms are still more cost-efficient than the new platforms built on next-generation sequencing (NGS) technologies. NGS is, however, already dominating research to an extent that few foresaw five years ago. In addition, it is possible to have samples sequenced through commercial organisations or scientific collaborations.
SNP and other arrays are still in use, however. They have transformed the field of genomics and sparked an intense interest among the statistics and bioinformatics communities to provide solutions to large-scale data problems. These solutions are the foundation for solving the similar large-scale data problems encountered with NGS.
Kruglyak L, Nickerson DA: Variation is the spice of life. Nat Genet. 2001, 27: 234-236. 10.1038/85776.
Wang DG, Fan JB, Siao CJ, Berno A, et al: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998, 280: 1077-1082. 10.1126/science.280.5366.1077.
Neafsey DE, Schaffner SF, Volkman SK, Park D, et al: Genome-wide SNP genotyping highlights the role of natural selection in Plasmodium falciparum population divergence. Genome Biol. 2008, 9: R171. 10.1186/gb-2008-9-12-r171.
Greenman CD, Bignell G, Butler A, Edkins S, et al: PICNIC: An algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11: 164-175. 10.1093/biostatistics/kxp045.
Bolstad BM, Irizarry RA, Åstrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185.
Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2001, 2: research0032.1-0032.11. 10.1186/gb-2001-2-8-research0032.
Di X, Matsuzaki H, Webster TA, Hubbell E, et al: Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays. Bioinformatics. 2005, 21: 1958-1963. 10.1093/bioinformatics/bti275.
Korn J, Kuruvilla FG, McCarroll SA, Wysoker A, et al: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008, 40: 1253-1260. 10.1038/ng.237.
Giannoulatou E, Yau C, Colella S, Ragoussis J, et al: GenoSNP: A variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics. 2008, 24: 2209-2214. 10.1093/bioinformatics/btn386.
The study was supported by grants from the Danish Strategic Research Council (2101-07-0059) (GEMS consortium), the Danish Cancer Society, the EC project GENICA, the Lundbeck Foundation and the John and Birthe Meyer Foundation.
Authors and Affiliations
Bioinformatics Research Centre, C. F. Mollers Allé 8, Building 1110, DK-8000, Aarhus C, Denmark
Philippe Lamy, Jakob Grove & Carsten Wiuf
Department of Molecular Medicine, Aarhus University Hospital, Skejby, DK-8200, Aarhus N, Denmark
Department of Human Genetics, Aarhus University, The Bartholin Building, Wilhelm Meyers Allé 4, DK-8000, Aarhus C, Denmark