A review of software for microarray genotyping
© Henry Stewart Publications 2011
Received: 1 March 2011
Accepted: 1 March 2011
Published: 1 May 2011
The focus of this review is software for the genotyping of microarray single nucleotide polymorphisms, in particular software for Affymetrix and Illumina arrays. Different statistical principles and ideas have been applied to the construction of genotyping algorithms -- for example, likelihood versus Bayesian modelling, and whether to genotype one or all arrays at a time. The release of new arrays is generally followed by new, or updated, algorithms.
The use of microarrays and microarray technology in research is now more than 15 years old and has had a tremendous impact on many aspects of research. Suddenly, it became possible to profile and survey whole genomes and to compare genomes across individuals and species to an extent that was hardly possible before. The perception of the genome changed as genome-wide data became available to everyone.
This review focuses narrowly on software used for genotyping of single nucleotide polymorphisms (SNPs) in connection with SNP microarrays (or 'arrays' for short). There are an estimated ten million or more SNPs in the human genome. For each of these, there are three possible genotypes (assuming diploidy), AA, BB (homozygous) and AB (heterozygous), where A and B denote the two possible alleles. The first commercial SNP array was released in 1996 by Affymetrix (Santa Clara, CA) and targeted about 1,500 human SNPs, a tiny fraction of all SNPs. Since then, many different manufacturers have developed microarrays for genome-wide genotyping, including Affymetrix, Agilent (Santa Clara, CA), Illumina (San Diego, CA) and Nimblegen (Madison, WI), with arrays designed for many different organisms.
SNP arrays have found uses in many research areas and contexts -- for example, association mapping, linkage disequilibrium mapping, phasing, inference on demography and ancestry, evolution and loss-of-heterozygosity analysis in cancer. Early usage of SNP arrays sought to estimate loss of heterozygosity in cancer by comparing DNA from germline and tumour cells. In addition, SNP arrays have been used to estimate copy numbers in cancers (similar to the use of comparative genomic hybridisation [CGH] arrays) and copy number variants (CNVs) in populations. The newest arrays from Affymetrix and Illumina both contain probes for CNVs and copy number polymorphisms (CNPs).
(i) Normalisation of raw intensities
(ii) Background correction and outlier detection
The arrays that are currently available for the human genome from Affymetrix and Illumina:
Affymetrix
- GeneChip Human Mapping 10 K 2.0 Array
- GeneChip Human Mapping 100 K Set
- GeneChip Human Mapping 500 K Array Set
- Genome-Wide Human SNP Array 5.0
- Genome-Wide Human SNP Array 6.0
Illumina
- HumanCytoSNP-12 DNA Analysis BeadChip
- Human660W-Quad v1 DNA Analysis BeadChip
- Human1M-Duo DNA Analysis BeadChip
The statistical methods applied at each step are, to some extent, transferable between platforms and array types, in particular the parts relating to (i) and (ii). Normalisation of array intensities is important in order to make comparisons across arrays [12, 13]. Background correction and outlier detection (individual 'bad' SNPs, as well as 'bad' arrays) are essential for correct interpretation of the data [12, 13] (i.e. to reduce the number of false and missing calls).
A general review of SNP array platforms and their history and use is given by LaFramboise.
We focus on the most commonly used platforms, Affymetrix and Illumina, and do not discuss software for normalisation and background correction.
SNP calling software for Affymetrix SNP arrays
Following its release of new SNP arrays (called GeneChips), Affymetrix has developed accompanying software that takes into account the properties of the new arrays. The first programs, Modified Partitioning Around Medoids (MPAM) and Dynamic Model (DM), were able to genotype one SNP on one chip at a time. The next generation of software, Robust Linear Model with Mahalanobis Distance Classifier (RLMM), BRLMM (which adds a Bayesian step to RLMM), BRLMM-P (which uses perfect match probes only) and Birdseed, increased accuracy and performance using a multi-chip approach.
One SNP, one chip
For the first SNP arrays, Affymetrix designed software modules (MPAM, DM) to genotype individual SNPs, one array at a time. The DM software was introduced with the release of the 100 K GeneChip and is based on statistical modelling of quartets. A quartet consists of match and mismatch probes for the two alleles. This software does not require any normalisation step and does not summarise the probe intensities. A score is assigned to each quartet and the Wilcoxon signed-rank test is used to give a call.
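The role of the signed-rank test in a per-SNP call can be illustrated with a small sketch. This is not the DM implementation: the idea of feeding the test one score difference per quartet (support for the best genotype model minus support for the runner-up), the function names and the example values are all assumptions made here for illustration. The p-value is computed exactly by enumerating sign patterns, which is only practical for the small number of quartets per SNP.

```python
from itertools import product


def signed_ranks(diffs):
    """Rank the non-zero differences by absolute value, averaging tied ranks."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    return nonzero, ranks


def signed_rank_test(diffs):
    """Return (W+, exact one-sided p-value) for H1: differences tend to be positive."""
    nonzero, ranks = signed_ranks(diffs)
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    if not ranks:
        return 0.0, 1.0
    # Under the null hypothesis every sign pattern is equally likely, so
    # enumerate all of them and count those at least as extreme as observed.
    as_extreme = sum(
        1
        for signs in product((False, True), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, as_extreme / 2 ** len(ranks)


# Hypothetical per-quartet score differences for a single SNP on one array.
quartet_diffs = [0.8, 1.1, 0.4, -0.2, 0.9, 1.3, 0.6]
w_plus, p_value = signed_rank_test(quartet_diffs)
print(f"W+ = {w_plus}, one-sided p = {p_value:.4f}")
```

A small p-value indicates that the quartets consistently favour the same genotype model; a caller built on this idea would leave the SNP uncalled when the p-value exceeds a chosen threshold.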
Affymetrix designed the BRLMM algorithm for the 500 K SNP arrays. This algorithm was a significant improvement over the DM algorithm used for the previous arrays. The BRLMM algorithm is an extension of the RLMM software and uses a Bayesian step to define cluster centres and variances of SNP intensities. Briefly, after normalisation and allelic summation, genotypes are clustered using a Bayesian prior on cluster centres and variances, together with a pre-clustering made by the DM algorithm. The prior is based on a random set of SNPs, with a minimum number of individuals in each cluster. This allows for a better definition of genotype clusters that contain few (potentially no) individuals. Further, new arrays can be genotyped using pre-defined parameters obtained from other arrays.
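The effect of a prior on sparsely populated clusters, followed by a Mahalanobis distance classification of the kind named in RLMM, can be sketched as below. This is a simplified illustration under stated assumptions, not the BRLMM algorithm: only the cluster centre is updated (as a weighted average of prior centre and observed mean), the covariance is shared and fixed, and the function names, the two-dimensional signal space, the prior weight and all numbers are hypothetical.

```python
def shrink_centre(prior_centre, prior_weight, points):
    """Combine a prior centre with the observed mean, weighting by sample count."""
    n = len(points)
    if n == 0:
        # An empty cluster is defined entirely by the prior.
        return tuple(prior_centre)
    dims = len(prior_centre)
    sample_mean = [sum(p[d] for p in points) / n for d in range(dims)]
    return tuple(
        (prior_weight * prior_centre[d] + n * sample_mean[d]) / (prior_weight + n)
        for d in range(dims)
    )


def mahalanobis_sq(point, centre, cov):
    """Squared Mahalanobis distance in 2D; cov is ((sxx, sxy), (syx, syy))."""
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - centre[0]
    dy = point[1] - centre[1]
    return (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det


def call_genotype(point, centres, cov):
    """Assign the genotype whose cluster centre is closest in Mahalanobis distance."""
    return min(centres, key=lambda g: mahalanobis_sq(point, centres[g], cov))


# Hypothetical prior centres in a two-dimensional (A signal, B signal) space.
prior = {"AA": (2.0, 0.5), "AB": (1.5, 1.5), "BB": (0.5, 2.0)}
# Pre-clustered observations for one SNP; no BB individuals were observed.
observed = {
    "AA": [(2.2, 0.4), (2.1, 0.6), (2.3, 0.5)],
    "AB": [(1.4, 1.6)],
    "BB": [],
}
centres = {g: shrink_centre(prior[g], 2.0, observed[g]) for g in prior}
shared_cov = ((0.05, 0.0), (0.0, 0.05))
print(centres)
print(call_genotype((0.6, 1.9), centres, shared_cov))
```

Because the BB centre falls back on the prior, a sample near it can still be called even though the pre-clustering found no BB individuals for this SNP.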
For the SNP5.0 GeneChip, Affymetrix designed a new version of the BRLMM algorithm, named BRLMM-P, as the array does not have mismatch probes. The DM step of BRLMM is replaced by a maximum likelihood-based division into genotype clusters. Further, the prior can be a generic prior common to all SNPs or a SNP-specific prior defined using a set of training data (such as HapMap data).
For SNP6.0, the Broad Institute, in collaboration with Affymetrix, developed the Birdsuite software [21, 29]. The novelty comes from relaxing the assumption that all SNPs are diploid and introducing known CNPs. Birdsuite is actually composed of four applications: Canary, Birdseed, Birdseye and Fawkes. Canary gives an estimate of the copy number of known CNPs. Birdseed is genotyping software whose use is restricted to diploid genomic regions; it is similar to BRLMM, with clusters pre-defined using training data and then further optimised. The Birdseye software can detect rare CNVs and genotype SNPs in CNVs. Finally, Fawkes combines the output of the three previous applications to assign a comprehensive genotype (A-null, AA, AB, BB, AAB,...).
Other software can be used to genotype SNPs from Affymetrix GeneChips, such as Corrected Robust Model with Maximum Likelihood Distance (CRLMM), Genotype calling with Empirical Likelihood (GEL), SNiPer-High Density (SNiPer-HD), Probe-Level Allele-specific Quantization (PLASQ), MAMS (which combines Single-Array Multi-SNP [SAMS] with Multi-Array Single-SNP [MASS]), Chiamo and JAPL. Some work has been done to compare algorithms [30–32] and, generally, the performance of an algorithm is compared with HapMap data in the original paper.
SNP calling software for Illumina BeadChips
Illumina has developed its own software to genotype SNPs on the BeadChip array. The software is called GenCall and has not been through the same chain of transformations as the Affymetrix software. The GenCall algorithm was implemented within the BeadStudio application (latest version v3.2.2) but it is now part of the GenomeStudio application (the current version is 1.1.0). It relies on a specific normalisation occurring automatically within the Illumina GenomeStudio software and consists of several steps (including outlier removal and background estimation). The normalised intensities are then summarised, such that each SNP is assigned a pair of values corresponding to each allele. This pair represents the allele intensities in polar coordinates; the R-coordinate represents the copy number of the SNP and the theta-coordinate represents the angle from the x-axis (Figure 1). This is a multi-array approach, using information from all arrays simultaneously.
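The polar summarisation described above can be sketched as follows. This is a minimal illustration rather than the GenCall implementation: the function name `to_polar`, the choice of the summed intensity as the R-coordinate and the rescaling of the angle to the interval from 0 (pure A signal) to 1 (pure B signal) are assumptions made here, and the intensities are invented.

```python
import math


def to_polar(x, y):
    """Convert normalised allele intensities (x for A, y for B) to (R, theta).

    Assumptions for illustration only: R is the sum of the two intensities and
    theta is the angle from the x-axis rescaled from [0, pi/2] to [0, 1].
    """
    if x < 0 or y < 0:
        raise ValueError("intensities must be non-negative")
    r = x + y
    theta = (2.0 / math.pi) * math.atan2(y, x)
    return r, theta


# Hypothetical normalised intensities for three samples at one SNP.
samples = {"AA-like": (1.00, 0.05), "AB-like": (0.60, 0.60), "BB-like": (0.04, 0.90)}
for label, (x, y) in samples.items():
    r, theta = to_polar(x, y)
    print(f"{label}: R = {r:.2f}, theta = {theta:.2f}")
```

In these coordinates the three genotypes separate mainly along theta, while R carries the overall signal strength, which is why clustering is typically carried out on theta with R used as a quality or copy number indicator.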
The call is made using a cluster file supplied by Illumina, based on a reference set of samples. There is, however, an option to make the call without using the reference set, relying exclusively on the sampled arrays. This dichotomy is similar to that in BRLMM (and subsequent Affymetrix software), where a call can be made with pre-defined parameters corresponding to a reference population. Whether one should use the reference set for genotype calling depends on the number of sampled arrays, the quality of the DNA and the minor allele frequency (MAF) of interest, as the size of the reference set determines the MAF detectable.
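One way to see why the size of a sample set limits the detectable MAF is to ask how likely it is that the rarest genotype cluster is populated at all. The sketch below is a back-of-the-envelope calculation under assumptions that are not part of the review: unrelated samples, Hardy-Weinberg proportions (so minor homozygotes occur with probability MAF squared) and a hypothetical function name and parameters.

```python
from math import comb


def prob_cluster_seen(n_samples, maf, min_members=1):
    """Probability that at least `min_members` of `n_samples` are minor homozygotes.

    Assumes Hardy-Weinberg proportions and independent samples (illustrative only).
    """
    if n_samples < 0 or min_members < 0:
        raise ValueError("counts must be non-negative")
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must lie between 0 and 0.5")
    p_bb = maf ** 2
    # Binomial probability of observing fewer than `min_members` minor homozygotes.
    p_fewer = sum(
        comb(n_samples, k) * p_bb ** k * (1.0 - p_bb) ** (n_samples - k)
        for k in range(min(min_members, n_samples + 1))
    )
    return 1.0 - p_fewer


# Hypothetical sample-set sizes and allele frequencies.
for maf in (0.01, 0.05, 0.20):
    for n in (100, 1000):
        print(f"MAF {maf:.2f}, n = {n}: P(cluster seen) = {prob_cluster_seen(n, maf):.3f}")
```

Under these assumptions the expected number of minor homozygotes is n times MAF squared, so populating all three clusters for rare SNPs requires either a large sample set or parameters borrowed from elsewhere, such as a prior or a supplied cluster file.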
For SNPs with fewer than three genotype clusters, the locations and variations of the missing genotype clusters are estimated using artificial neural networks. It is also possible to change the call of any SNP manually using Illumina's visualisation tool. For CNV analysis, Illumina has developed a series of tools which are available as plug-ins to the GenomeStudio genotyping module. Software is available for estimation of copy numbers (cnvPartition), detection and annotation of homozygosity in single samples (Homozygosity Detector), detection and annotation of chromosomal aberrations in single samples (ChromoZone) and calculation of a likelihood score for strength of loss-of-heterozygosity (LOH Score).
Other methods have been proposed for the BeadChip arrays. Teo et al. designed a multi-array genotype calling algorithm (Illuminus) that does not rely on a reference population. By contrast, Giannoulatou et al. developed a method (GenoSNP) that works entirely within each sample, thereby making the performance independent of sample size and of any outside control samples. Both methods rely on an expectation-maximisation (EM) algorithm. The CRLMM algorithm is also available for Illumina data as a package (crlmm) for R/Bioconductor.
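The general idea of EM-based genotype clustering can be illustrated with a one-dimensional, three-component mixture fitted to angle-like values for one SNP across arrays. This sketch is neither Illuminus nor GenoSNP: it swaps in a plain Gaussian mixture for simplicity, and the starting centres, the lower bound on the standard deviation, the function names and the data are all assumptions made for illustration.

```python
import math

GENOTYPES = ("AA", "AB", "BB")


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def responsibilities(thetas, weights, means, sds):
    """E-step: posterior probability of each cluster for each observation."""
    resp = []
    for t in thetas:
        dens = [w * normal_pdf(t, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(dens)
        if total == 0.0:
            # Numerical underflow in every component: fall back to equal weights.
            resp.append([1.0 / len(dens)] * len(dens))
        else:
            resp.append([d / total for d in dens])
    return resp


def em_genotype_clusters(thetas, n_iter=50, min_sd=0.02):
    """Fit a three-component Gaussian mixture by EM and return (means, calls)."""
    if not thetas:
        raise ValueError("at least one observation is required")
    weights = [1.0 / 3.0] * 3
    means = [0.05, 0.50, 0.95]  # assumed starting centres for AA, AB, BB
    sds = [0.10, 0.10, 0.10]
    for _ in range(n_iter):
        resp = responsibilities(thetas, weights, means, sds)
        # M-step: re-estimate weight, mean and spread of each cluster.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(thetas)
            if nk > 1e-8:  # leave an empty cluster at its previous parameters
                means[k] = sum(r[k] * t for r, t in zip(resp, thetas)) / nk
                var = sum(r[k] * (t - means[k]) ** 2 for r, t in zip(resp, thetas)) / nk
                sds[k] = max(math.sqrt(var), min_sd)
    final = responsibilities(thetas, weights, means, sds)
    calls = [GENOTYPES[max(range(3), key=lambda k: r[k])] for r in final]
    return means, calls


# Hypothetical angle-like values for one SNP measured on nine arrays.
thetas = [0.02, 0.04, 0.03, 0.50, 0.52, 0.48, 0.97, 0.95, 0.99]
means, calls = em_genotype_clusters(thetas)
print(means)
print(calls)
```

A multi-array method of this kind estimates the clusters for each SNP from all arrays jointly, so its behaviour depends on the number of samples, whereas a within-sample method would instead cluster many SNPs from a single array.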
Discussion and conclusions
The accuracy of genotype calling is usually reported to be above 99 per cent. This is typically the case when samples and DNA of good quality are available. Many cancer laboratories, however, are interested in genotyping SNPs and CNPs, as well as estimating copy numbers, from tumour tissue. Here, there are a number of problems that have not yet been fully overcome: tumour tissue typically contains normal cells that are difficult to remove prior to analysis; also, tumour tissue tends to be heterogeneous, in the sense that different tumour cells have different copy number aberrations. These issues make it harder to estimate genotypes and copy numbers accurately, and significantly reduce the accuracy of calling algorithms.
We have discussed software for genotyping; however, much software has also been developed for further downstream analysis, to accommodate specific questions and needs [37, 38]. Software for normalisation and background correction has likewise received much attention. These methods are also generally applicable to other types of arrays, and borrowing of ideas between array types is common.
The future of SNP arrays, and of many other microarray types, such as gene (RNA) expression and microRNA expression arrays, is uncertain. For the individual laboratory, the common microarray platforms are still more cost-efficient than the new platforms built on next-generation sequencing (NGS) technologies. NGS is, however, already dominating research to an extent that few foresaw five years ago. In addition, it is possible to have samples sequenced through commercial organisations or scientific collaborations.
SNP and other arrays are nevertheless still in use. They have transformed the field of genomics and sparked an intense interest among the statistics and bioinformatics communities in providing solutions to large-scale data problems. These solutions are the foundation for solving the similar large-scale data problems encountered with NGS.
The study was supported by grants from the Danish Strategic Research Council (2101-07-0059) (GEMS consortium), the Danish Cancer Society, the EC project GENICA, the Lundbeck Foundation and the John and Birthe Meyer Foundation.
- Kruglyak L, Nickerson DA: Variation is the spice of life. Nat Genet. 2001, 27: 234-236. 10.1038/85776.
- Wang DG, Fan JB, Siao CJ, Berno A, et al: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998, 280: 1077-1082. 10.1126/science.280.5366.1077.
- Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
- Pe'er I, de Bakker PI, Maller J, Yelensky R, et al: Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet. 2006, 38: 663-667. 10.1038/ng1816.
- Niu T: Algorithms for inferring haplotypes. Genet Epidemiol. 2004, 27: 334-347. 10.1002/gepi.20024.
- Price AL, Butler J, Patterson N, Capelli C, et al: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4: e236. 10.1371/journal.pgen.0030236.
- Neafsey DE, Schaffner SF, Volkman SK, Park D, et al: Genome-wide SNP genotyping highlights the role of natural selection in Plasmodium falciparum population divergence. Genome Biol. 2008, 9: R171. 10.1186/gb-2008-9-12-r171.
- Koed K, Wiuf C, Christensen LL, Wikman FP, et al: High-density single nucleotide polymorphism array defines novel stage and location dependent allelic imbalances in human bladder tumors. Cancer Res. 2005, 65: 34-45.
- Lindblad-Toh K, Tanenbaum DM, Daly MJ, Winchester E, et al: Loss-of-heterozygosity analysis of small-cell lung carcinomas using single-nucleotide polymorphism arrays. Nat Biotechnol. 2000, 18: 1001-1005. 10.1038/79269.
- Greenman CD, Bignell G, Butler A, Edkins S, et al: PICNIC: An algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11: 164-175. 10.1093/biostatistics/kxp045.
- Zhang F, Gu W, Hurles ME, Lupski JR: Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet. 2009, 10: 451-481. 10.1146/annurev.genom.9.081307.164217.
- Bolstad BM, Irizarry RA, Åstrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185.
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Model validation, design issues and standard error application. Genome Biol. 2001, 2: research0032.1-0032.11. 10.1186/gb-2001-2-8-research0032.
- LaFramboise T: Single nucleotide polymorphism arrays: A decade of biological, computational and technological advances. Nucleic Acids Res. 2009, 37: 4181-4193. 10.1093/nar/gkp552.
- Gunderson KL, Kuhn KM, Steemers FJ, Ng P, et al: Genotype clustering on HumanHap300 BeadChip™. Pharmacogenomics. 2006, 7: 641-648. 10.2217/14622418.104.22.1681.
- Liu W, Di X, Yang G, Matsuzaki H, et al: Algorithms for large-scale genotyping microarrays. Bioinformatics. 2003, 19: 2397-2403. 10.1093/bioinformatics/btg332.
- Di X, Matsuzaki H, Webster TA, Hubbell E, et al: Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays. Bioinformatics. 2005, 21: 1958-1963. 10.1093/bioinformatics/bti275.
- Rabbee N, Speed TP: A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics. 2006, 22: 7-12. 10.1093/bioinformatics/bti741.
- Affymetrix Inc: BRLMM: An improved genotype calling method for the mapping 500 K array set. 2006, [http://www.affymetrix.com/support/technical/whitepapers/brlmm_whitepaper.pdf] (last accessed 30th April, 2011).
- Affymetrix Inc: BRLMM-P: A genotype calling method for the SNP 5.0 array. 2007, [http://www.affymetrix.com/support/technical/whitepapers/brlmmp_whitepaper.pdf] (last accessed 30th April, 2011).
- Korn J, Kuruvilla FG, McCarroll SA, Wysoker A, et al: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008, 40: 1253-1260. 10.1038/ng.237.
- Lamy P, Andersen CL, Wikman FP, Wiuf C: Genotyping and annotation of Affymetrix SNP arrays. Nucleic Acids Res. 2006, 34: e100. 10.1093/nar/gkl475.
- Carvalho B, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007, 8: 485-499.
- Nicolae DL, Wu X, Miyake K, Cox NJ: GEL: A novel genotype calling algorithm using empirical likelihood. Bioinformatics. 2006, 22: 1942-1947. 10.1093/bioinformatics/btl341.
- Hua J, Craig DW, Brun M, Webster J, et al: SNiPer-HD: Improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays. Bioinformatics. 2006, 23: 57-63.
- LaFramboise T, Weir BA, Zhao X, Beroukhim R, et al: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol. 2005, 1: e65. 10.1371/journal.pcbi.0010065.
- Xiao Y, Segal MR, Yang YH, Yeh R-F: A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays. Bioinformatics. 2007, 23: 1459-1467. 10.1093/bioinformatics/btm131.
- Plagnol V, Cooper JD, Todd JA, Clayton DG: A method to address differential bias in genotyping in large-scale association studies. PLoS Genet. 2007, 3: e74. 10.1371/journal.pgen.0030074.
- McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, et al: Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008, 40: 1166-1174. 10.1038/ng.238.
- Lin S, Carvalho B, Cutler D, Arking D, et al: Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biol. 2008, 9: 1-12.
- Kim J-H, Jung S-H, Hu H-J, Yim S-H, et al: Comparison of the Affymetrix SNP Array 5.0 and oligoarray platforms for defining CNV. Genomics Informatics. 2010, 8: 138-141. 10.5808/GI.2010.8.3.138.
- Vens M, Schillert A, König IR, Ziegler A: Look who is calling: A comparison of genotype calling algorithms. BMC Proc. 2009, 3: S59. 10.1186/1753-6561-3-s7-s59.
- Steemers FJ, Gunderson KL: Whole genome genotyping technologies on the BeadArray platform. Biotechnol J. 2007, 2: 41-49. 10.1002/biot.200600213.
- Giannoulatou E, Yau C, Colella S, Ragoussis J, et al: GenoSNP: A variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics. 2008, 24: 2209-2214. 10.1093/bioinformatics/btn386.
- Teo YY, Inouye M, Small KS, Gwilliam R, et al: A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics. 2007, 23: 2741-2746. 10.1093/bioinformatics/btm443.
- Ritchie ME, Carvalho BS, Hetrick KN, Tavaré S: R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009, 25: 2621-2623. 10.1093/bioinformatics/btp470.
- Aroma.affymetrix: [http://groups.google.com/group/aroma-affymetrix/web/software?version=5&pli=1] (last accessed 30th April, 2011).
- Cheng Li Lab: [http://www.biostat.harvard.edu/complab/dchip/] (last accessed 30th April, 2011).