# A survey of current software for haplotype phase inference

- Michael E. Weale

**Received:** 9 November 2003

**Accepted:** 9 November 2003

**Published:** 1 January 2004

## Abstract

In the past two years, tracking the explosion in data due to ever-improving single nucleotide polymorphism (SNP) maps and cheaper high-throughput genotyping technologies, a bewildering array of new algorithms and relevant software has appeared for haplotype phase inference. There are two alternatives to haplotype inference. The first is to resolve haplotypes completely, either by *in vitro* methods or by typing close pedigrees; this is expensive, and complete resolution is not guaranteed in pedigrees. The second is to ignore haplotype-level analysis in favour of genotype-level analysis, which avoids the danger of treating inferred haplotypes as real but potentially denies the researcher valuable analytic insights. This review attempts a snapshot of this rapidly moving field as it stands at present, and is mainly restricted, given the current predominance of SNP genotyping, to the consideration of diallelic data. For completeness, the review will occasionally refer to algorithms for which no software exists.

## Introduction

Haplotype phase algorithms can be conveniently split into three main types: parsimony, maximum likelihood and Bayesian. The researcher may either want to infer haplotype frequencies in the population, impute the haplotypes possessed by given individuals, or both. In general, parsimony methods most naturally estimate individual haplotypes, maximum likelihood methods most naturally estimate population frequencies and Bayesian methods can do both.

Parsimony algorithms avoid explicit likelihood calculations by minimising a 'costly' constraint. The grandfather of all haplotype phase algorithms (an elderly 13-year-old) is Clark's method [1], a simple iterative procedure inspired by the constraint 'minimise the number of new haplotypes you have to invent'. (To obtain 'HAPINFERX' software, apply to ac347@cornell.edu.) The method can suffer either from having too many solutions or from having none (although the general problem of convergence is a common issue with all haplotype inference algorithms). There is also no guarantee that the global minimum for the 'minimise haplotype number' constraint is reached by Clark's algorithm. This latter problem is fixed in a more recent algorithm [2] ('HAPAR'; apply to lwang@cs.cityu.edu.hk). Phylogenetic parsimony methods have been explored by Daniel Gusfield and colleagues ('GPPH', 'DPPH' and 'BPPH'; http://wwwcsif.cs.ucdavis.edu/~gusfield/). The constraint here is 'minimise the number of ancestral recombination events required to link the new invented haplotypes'. As one might expect, this constraint works well in small, tightly-linked genomic regions and less well in bigger regions [3].
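The iterative idea behind Clark's method can be illustrated with a minimal Python sketch. This is not the 'HAPINFERX' code. It assumes diallelic genotypes coded as the number of copies of allele 1 at each site (0, 1 or 2), haplotypes coded as tuples of 0/1, and no missing data; the function names `complement` and `clark_phase` are hypothetical.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(a in (0, 1) for a in other) else None


def clark_phase(genotypes):
    """Greedy Clark-style phasing (illustrative sketch, assumed 0/1/2 coding)."""
    known = []      # haplotypes accepted so far, in order of discovery
    resolved = {}   # individual index -> (haplotype, haplotype)

    # Step 1: individuals with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = complement(g, h1)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Step 2: explain each ambiguous genotype with a known haplotype plus
    # one newly 'invented' complement; repeat until nothing changes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    return resolved, known


if __name__ == "__main__":
    sample = [(0, 0, 2), (1, 0, 2), (1, 1, 1), (2, 1, 1)]
    phased, haplotypes = clark_phase(sample)
    for index in sorted(phased):
        print(index, phased[index])
```

Both weaknesses noted above are visible in this sketch. If no individual is unambiguous, the known list starts empty and nothing is resolved. When several known haplotypes are compatible with a genotype, the answer depends on the order in which they are tried.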

Because parsimony algorithms avoid explicit likelihood calculations, they do not provide any natural way to measure uncertainty in the estimates. Maximum likelihood and Bayesian methods provide a way around this problem.

Maximum likelihood estimation is predominantly undertaken via Expectation-Maximisation (EM) algorithms. These use an explicit but very simple likelihood model for the data (the so-called 'gene counting' model). Observed (or partially observed) haplotype counts follow a multinomial distribution conditional on the haplotype population frequencies. Random assortment of haplotypes to individuals is assumed (a standard assumption for all algorithms, whether maximum likelihood or Bayesian, working with likelihood functions). The EM algorithm avoids making assumptions about the mutational and recombinatorial relationships of the final set of inferred haplotypes, which some see as an advantage and others as a disadvantage. The citation usually given for the original EM algorithm is Excoffier and Slatkin [4], but see also Hawley and Kidd [5] ('HAPLO'; http://krunch.med.yale.edu/haplo/). Some well-used implementations of the algorithm are: 'EM-decoder' [6] (http://www.people.fas.harvard.edu/~junliu/em/em.htm), 'EH+' [7] (http://www.iop.kcl.ac.uk/IoP/Departments/PsychMed/GEpiBSt/software.shtml), 'GENECOUNTING' [8] (same website as 'EH+') and 'snphap' (D. Clayton; http://www-gene.cimr.cam.ac.uk/clayton/software/). Of these, 'GENECOUNTING' and 'snphap' have the added refinement of allowing for missing data. 'snphap' has the additional refinement that, once haplotype frequencies are estimated, the program swaps from likelihood-based to posterior probability-based imputation and calculates haplotype-pair probabilities -- conditional on the estimated haplotype frequencies -- for all pairs consistent with an individual's genotype. 'snphap' works only on diallelic data. The extension 'hap' (J. H. Zhao, same website as 'EH+' and 'GENECOUNTING') runs the same algorithm but accepts multiallelic data.
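The 'gene counting' EM iteration, and the follow-on step of computing haplotype-pair probabilities conditional on the estimated frequencies, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code of any program named above: genotypes are coded 0/1/2 per site, there are no missing data, all compatible haplotype pairs are enumerated exhaustively (feasible only for a handful of sites), and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def pair_weights(pairs, freq):
    """Probability of each pair under random assortment (factor 2 if distinct)."""
    return [(2.0 if a != b else 1.0) * freq.get(a, 0.0) * freq.get(b, 0.0)
            for a, b in pairs]


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Gene-counting EM for haplotype frequencies (illustrative sketch)."""
    pair_sets = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_sets for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        # E-step: expected haplotype counts given the current frequencies.
        counts = defaultdict(float)
        for pairs in pair_sets:
            w = pair_weights(pairs, freq)
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: new frequencies are expected counts over 2N chromosomes.
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:
            break
    return freq


def pair_posteriors(genotype, freq):
    """Haplotype-pair probabilities conditional on estimated frequencies."""
    pairs = compatible_pairs(genotype)
    w = pair_weights(pairs, freq)
    total = sum(w)
    return {pair: wi / total for pair, wi in zip(pairs, w)}


if __name__ == "__main__":
    sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    estimated = em_haplotype_frequencies(sample)
    print(estimated)
    print(pair_posteriors((1, 1), estimated))
```

In the small example, the two double heterozygotes are the only ambiguous individuals, and the EM iteration assigns them almost entirely to the pair of haplotypes already supported by the homozygotes.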

Bayesian algorithms have the potential to address the issue, missing from EM algorithms, of how to guide the haplotype inference process so as to favour solutions which make sense in terms of an underlying genealogy connecting the haplotypes, via manipulation of the prior. The first proposed Bayesian algorithm, and still one of the best, is that implemented in the 'PHASE' program [9] (http://www.stat.washington.edu/stephens/phase.html). The proposed prior is derived, approximately, from coalescent theory, and ensures that new 'invented' haplotypes look mutationally similar to the others at any one stage of the iterative (Gibbs sampler) stochastic convergence process. The main disadvantage of the original version of 'PHASE' was its plodding speed of convergence for datasets of any reasonable size.

## Extensions

The key developments have been towards improving speed as datasets increase in size, and coping with ever larger genomic regions, where it becomes impossible to infer unbroken haplotypes over the entire region because their estimated frequencies become vanishingly small. For parsimony algorithms, Gusfield shows how to implement a speeded-up version of Clark's algorithm [3]. Eskin and colleagues also illustrate the considerable speed advantages of this simple algorithm in cases where a simple 'block'-like structure of the genome is observed [10] ('HAP'; http://www1.cs.columbia.edu/compbio/hap/).

For Bayesian algorithms, one key idea that has since been implemented in several new extensions is the Partition-Ligation strategy proposed by Niu and colleagues [6]. Here, the genome region is split into a number of smaller regions (either arbitrarily or by some process that attempts to maximise linkage disequilibrium within each region). The haplotype inference method is then applied separately to two adjacent sub-regions and allowed to converge separately. Larger haplotypes are then formed by allowing haplotypes to merge at random across the boundary, using current estimates of their respective frequencies. The haplotype inference method is then applied to the new larger region (and allowed to converge), and separately to another adjacent sub-region. The process repeats until all sub-regions are merged. Niu and colleagues also speeded up convergence steps by 'prior annealing', in which jumps in posterior probability space are allowed to be larger at first, then progressively smaller.
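The partition and ligation steps can be sketched as follows. For brevity the sketch swaps in a simple EM update as the inference method within each region, rather than the Gibbs sampler of Niu and colleagues, and it omits prior annealing. It assumes 0/1/2-coded diallelic genotypes without missing data, fixed-width sub-regions, and progressive left-to-right ligation. The function names and the `min_freq` pruning threshold are hypothetical.

```python
from itertools import product


def pairs_from_candidates(genotype, candidates):
    """Unordered pairs of candidate haplotypes that add up to the genotype."""
    out = []
    for i, a in enumerate(candidates):
        for b in candidates[i:]:
            if all(x + y == g for x, y, g in zip(a, b, genotype)):
                out.append((a, b))
    return out


def em_restricted(genotypes, init_freq, n_iter=100):
    """EM over a fixed candidate set, started from the weights in init_freq."""
    freq = dict(init_freq)
    cands = sorted(freq)
    pair_sets = [pairs_from_candidates(g, cands) for g in genotypes]
    for _ in range(n_iter):
        counts = dict.fromkeys(cands, 0.0)
        explained = 0
        for pairs in pair_sets:
            w = [(2.0 if a != b else 1.0) * freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            if total == 0.0:
                continue  # genotype not explained by the surviving candidates
            explained += 1
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        if explained == 0:
            break
        freq = {h: c / (2 * explained) for h, c in counts.items()}
    return freq


def prune(freq, min_freq):
    """Drop haplotypes whose estimated frequency is below min_freq."""
    return {h: p for h, p in freq.items() if p >= min_freq}


def partition_ligation(genotypes, block_size=2, min_freq=1e-3):
    """Progressive partition-ligation with EM in each region (sketch)."""
    n_sites = len(genotypes[0])
    merged = None
    for start in range(0, n_sites, block_size):
        stop = min(start + block_size, n_sites)
        # Partition: infer haplotypes within the next sub-region on its own.
        block = [g[start:stop] for g in genotypes]
        cands = list(product((0, 1), repeat=stop - start))
        uniform = {h: 1.0 / len(cands) for h in cands}
        block_freq = prune(em_restricted(block, uniform), min_freq)
        if merged is None:
            merged = block_freq
            continue
        # Ligation: join surviving haplotypes across the boundary, weight
        # each join by the product of its parts, and re-run the inference.
        init = {left + right: pl * pr
                for (left, pl), (right, pr)
                in product(merged.items(), block_freq.items())}
        merged = prune(em_restricted([g[:stop] for g in genotypes], init),
                       min_freq)
    return merged


if __name__ == "__main__":
    sample = [(0, 0, 0, 0)] * 3 + [(2, 2, 2, 2)] * 3 + [(1, 1, 1, 1)] * 2
    print(partition_ligation(sample, block_size=2))
```

The saving comes from the pruning step: only haplotypes that survive within each sub-region are carried into the ligation, so the candidate set for the merged region stays far smaller than the full set of possible haplotypes.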

Stephens and Donnelly have implemented these ideas in a new faster 'PHASE' program [11], and Niu and colleagues have implemented them in a new Bayesian algorithm 'HAPLOTYPER' [6] (http://www.people.fas.harvard.edu/~junliu/Haplo/docMain.htm). This latter algorithm abandons the idea of a prior favouring mutational similarity among inferred haplotypes, and instead applies a Dirichlet prior. This prior functions in a similar way to the multinomial in EM algorithms, in that it avoids making assumptions about mutational and recombinatorial relationships among inferred haplotypes. Lin and colleagues have also proposed a separate Bayesian algorithm with a Dirichlet prior [12]. The issue of what constitutes a good prior for a Bayesian model remains unresolved [13]. While the Dirichlet is computationally convenient, there is valuable extra information in the mutational and recombinatorial relationships that should lead to more accurate inferences of haplotypes, provided that the models dealing with both of these phenomena are reasonable. Eronen and colleagues propose a new prior allowing for recombination, designed explicitly for long-range genotype data [14] ('MC-VL'; http://www.cs.helsinki.fi/group/genetics/haplotyping.html). Another promising new algorithm that explicitly incorporates recombination into its strategy and is also designed specifically for long-range genotype data is the 'ELB' algorithm proposed by Excoffier and colleagues [15]. The latest version of 'PHASE' also optionally incorporates a recombination model [16].
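To make the role of the Dirichlet prior concrete, the following Python sketch alternates between sampling a haplotype pair for each individual and sampling haplotype frequencies from the resulting Dirichlet posterior. It is a generic two-block Gibbs sampler written for illustration, not the sampling scheme used by 'HAPLOTYPER' or by Lin and colleagues. It assumes 0/1/2-coded genotypes, no missing data, exhaustive enumeration of compatible pairs, and a symmetric prior with a hypothetical parameter `alpha`.

```python
import random
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def gibbs_dirichlet(genotypes, alpha=1.0, n_sweeps=2000, burn_in=500, seed=0):
    """Posterior mean haplotype frequencies under a symmetric Dirichlet prior."""
    rng = random.Random(seed)
    pair_sets = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_sets for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}
    totals = dict.fromkeys(haps, 0.0)
    for sweep in range(n_sweeps):
        # Sample a haplotype pair for every individual given the frequencies.
        counts = dict.fromkeys(haps, 0)
        for pairs in pair_sets:
            w = [(2.0 if a != b else 1.0) * freq[a] * freq[b] for a, b in pairs]
            a, b = rng.choices(pairs, weights=w)[0]
            counts[a] += 1
            counts[b] += 1
        # Sample frequencies from Dirichlet(alpha + counts) via gamma draws.
        draws = {h: rng.gammavariate(alpha + counts[h], 1.0) for h in haps}
        scale = sum(draws.values())
        freq = {h: d / scale for h, d in draws.items()}
        if sweep >= burn_in:
            for h in haps:
                totals[h] += freq[h]
    kept = n_sweeps - burn_in
    return {h: t / kept for h, t in totals.items()}


if __name__ == "__main__":
    sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    print(gibbs_dirichlet(sample))
```

The prior enters only through the pseudo-count `alpha` added to every haplotype, whatever its sequence. This is the sense in which the Dirichlet makes no assumption about mutational or recombinatorial relationships among haplotypes.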

For EM algorithms, Qin and colleagues have implemented the above partition-ligation ideas in an EM context [17] ('PL-EM', http://www.people.fas.harvard.edu/~junliu/plem/). A very similar algorithm has been proposed by Li and colleagues [18] ('HPlus'; http://qge.fhcrc.org/hplus/). Zhang and colleagues propose an improvement to the speed of the E-step in the EM algorithm [19] ('OSLEM'; http://genome3.cpmc.columbia.edu/cgi-bin/GENOME/oslem/doHaplo.cgi), and Thomas proposes other approximations to increase EM algorithm speed [20] ('GCHap'; http://episun7.med.utah.edu/~alun/gchap/). David Clayton's 'snphap' tackles the large dataset problem by starting with two-locus haplotypes, extending the haplotype one locus at a time, and culling low-frequency haplotypes at an early stage. The effect of these short cuts on the optimality of the final solution is unclear.

A number of researchers have proposed EM algorithms that take advantage of the increased (but not complete) certainty in haplotype phase afforded by simple pedigree data, especially trios. These include Rohde and Fuerst [21] (apply to rohde@mdc-berlin.de for software), Li and Jiang [22] ('PedPhase'; http://www.cs.ucr.edu/~jili/haplotyping.html), Dudbridge [23], and Weale and colleagues [24] ('EMtrio', part of the 'TagIT' package; http://popgen.biol.ucl.ac.uk/software.html). 'EMtrio' is designed to cope with partially missing genotype data, in which one homologous chromosome may be phase-resolved and the other not. The Bayesian 'PHASE' program also allows input of phase-resolved data, but does not handle the above partially-missing situations. A front-end to 'PHASE', called 'PHamily', automatically resolves trio data (H. Ackerman and M. Stephens; http://archimedes.well.ox.ac.uk/pise/). Opinion is divided on whether it is worth the extra genotyping effort to type close relatives to help resolve phase [25–28]. Interest has also focused recently on the use of EM algorithms to infer haplotypes from pooled DNA data [29–31].
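The partial certainty afforded by a trio can be illustrated site by site. The Python sketch below deduces, at each diallelic site, which allele the child received from each parent, and leaves the site unresolved when all three individuals are heterozygous. It is a simple deduction rule given for illustration, not the algorithm of any program cited above. It assumes 0/1/2-coded genotypes with no missing data and no recombination handling; the function names are hypothetical.

```python
def can_transmit(parent_genotype, allele):
    """True if a parent with this 0/1/2 genotype could pass on `allele`."""
    return parent_genotype == 1 or parent_genotype == 2 * allele


def phase_trio(child, father, mother):
    """Return (paternal, maternal) child haplotypes; None marks unresolved sites."""
    paternal, maternal = [], []
    for site, (c, f, m) in enumerate(zip(child, father, mother)):
        if c in (0, 2):
            # Homozygous child: both transmitted alleles are known.
            pat = mat = c // 2
        elif f in (0, 2):
            # Heterozygous child, homozygous father: the father's allele is known.
            pat = f // 2
            mat = 1 - pat
        elif m in (0, 2):
            # Heterozygous child, homozygous mother: the mother's allele is known.
            mat = m // 2
            pat = 1 - mat
        else:
            # All three heterozygous: phase cannot be deduced at this site.
            pat = mat = None
        if pat is not None and not (can_transmit(f, pat) and can_transmit(m, mat)):
            raise ValueError(f"Mendelian inconsistency at site {site}")
        paternal.append(pat)
        maternal.append(mat)
    return tuple(paternal), tuple(maternal)


if __name__ == "__main__":
    print(phase_trio(child=(1, 1, 1, 0),
                     father=(0, 2, 1, 0),
                     mother=(1, 1, 1, 1)))
```

Sites left as `None` are the ones that still require statistical inference, which is where the EM approaches described above take over.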

Regardless of which method of haplotype inference is used, it is generally recognised that any subsequent analyses using such haplotypes (e.g. association tests against phenotype) should ideally take account of the uncertainty associated with these inferred haplotypes. There has also been a considerable amount of recent literature on this subject, which is not reviewed here. One promising program that allows for this is 'BLADE' [32, 33] (Version 2: http://www.fas.harvard.edu/~junliu/TechRept/03folder/bladev2.tgz).

Despite the assertions of some, it is currently not clear which one of these alternative methods and their extensions will provide the most reliable estimates. All the rival algorithms tend to do well when datasets and genomic regions are small; all do badly when they are large. One prudent measure is to check the results of different methods against each other for consistency. The program 'HIT' brings together four well-used algorithms for this purpose (including two EM algorithms, 'PHASE', and 'HAPLOTYPER'; apply to wangx@udel.edu). The 'HapScope' package [34] (ftp://ftp1.nci.nih.gov/pub/Hap Scope) incorporates versions of both 'PHASE' and 'snphap'. When consistency breaks down in the larger datasets, the way forward is still unclear. The key issue will not be to find a better haplotype inference method *per se*, but rather to find a better strategy for partitioning large genomic regions into manageable sub-regions without losing useful linkage disequilibrium information along the way.
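A consistency check between two methods can be as simple as comparing the haplotype pair each assigns to every individual. The Python sketch below is illustrative only and is not taken from 'HIT' or 'HapScope'. It assumes each method's output has already been converted to a list of haplotype pairs in the same individual order; the function name `phase_concordance` is hypothetical.

```python
def phase_concordance(calls_a, calls_b):
    """Fraction of individuals given the same unordered haplotype pair,
    plus the indices of the individuals on which the two methods disagree."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both methods must report the same individuals")
    discordant = [i for i, (a, b) in enumerate(zip(calls_a, calls_b))
                  if sorted(a) != sorted(b)]
    rate = 1.0 - len(discordant) / len(calls_a)
    return rate, discordant


if __name__ == "__main__":
    method_a = [((0, 0), (1, 1)), ((0, 1), (0, 1)), ((0, 0), (1, 1))]
    method_b = [((1, 1), (0, 0)), ((0, 1), (0, 1)), ((0, 1), (1, 0))]
    print(phase_concordance(method_a, method_b))
```

The pairs are sorted before comparison because the order of the two haplotypes within an individual carries no meaning.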

## References

1. Clark AG: 'Inference of haplotypes from PCR-amplified samples of diploid populations'. Mol Biol Evol. 1990, 7: 111-122.
2. Wang L, Xu Y: 'Haplotype inference by maximum parsimony'. Bioinformatics. 2003, 19: 1773-1780. 10.1093/bioinformatics/btg239.
3. Gusfield D: 'Haplotype inference by pure parsimony'. Combinatorial Pattern Matching, Proceedings: Lecture Notes in Computer Science. 2003, 2676: 144-155. 10.1007/3-540-44888-8_11.
4. Excoffier L, Slatkin M: 'Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population'. Mol Biol Evol. 1995, 12: 921-927.
5. Hawley ME, Kidd KK: 'HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes'. J Hered. 1995, 86: 409-411.
6. Niu T, Qin ZS, Xu X, et al: 'Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms'. Am J Hum Genet. 2002, 70: 157-169. 10.1086/338446.
7. Zhao JH, Curtis D, Sham PC: 'Model-free analysis and permutation tests for allelic associations'. Hum Hered. 2000, 50: 133-139. 10.1159/000022901.
8. Zhao JH, Lissarrague S, Essioux L, et al: 'GENECOUNTING: Haplotype analysis with missing genotypes'. Bioinformatics. 2002, 18: 1694-1695. 10.1093/bioinformatics/18.12.1694.
9. Stephens M, Smith NJ, Donnelly P: 'A new statistical method for haplotype reconstruction from population data'. Am J Hum Genet. 2001, 68: 978-989. 10.1086/319501.
10. Eskin E, Halperin E, Karp R: 'Large scale reconstruction of haplotypes from genotype data'. Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology. 2003, ACM Press, New York, NY, 104-113. (RECOMB-2003)
11. Stephens M, Donnelly P: 'A comparison of Bayesian methods for haplotype reconstruction from population genotype data'. Am J Hum Genet. 2003, 73: 1162-1169. 10.1086/379378.
12. Lin S, Cutler DJ, Zwick ME, et al: 'Haplotype inference in random population samples'. Am J Hum Genet. 2002, 71: 1129-1137. 10.1086/344347.
13. Morris A, Pedder A, Ayres K: 'Linkage disequilibrium assessment via log-linear modeling of SNP haplotype frequencies'. Genet Epidemiol. 2003, 25: 106-114. 10.1002/gepi.10254.
14. Eronen L, Geerts F, Toivonen H: 'A Markov Chain approach to reconstruction of long haplotypes'. Pacific Symposium on Biocomputing. 2003, see [http://www-smi.stanford.edu/projects/helix/psb04]
15. Excoffier L, Laval G, Balding D: 'Gametic phase estimation over large genomic regions using an adaptive window approach'. Human Genomics. 2003, 1: 7-19.
16. Li N, Stephens M: 'Modelling linkage disequilibrium and identifying recombination hotspots using SNP data'. Genetics. 2003.
17. Qin ZS, Niu T, Liu JS: 'Partition-ligation-expectation-maximization algorithm for haplotype inference with single nucleotide polymorphisms'. Am J Hum Genet. 2002, 71: 1242-1247. 10.1086/344207.
18. Li SS, Khalid N, Carlson C, et al: 'Estimating haplotype frequencies and standard errors for multiple single nucleotide polymorphisms'. Biostatistics. 2003, 4: 513-522. 10.1093/biostatistics/4.4.513.
19. Zhang P, Sheng H, Morabia A, et al: 'Optimal step length EM algorithm (OSLEM) for the estimation of haplotype frequency and its application in lipoprotein lipase genotyping'. BMC Bioinformatics. 2003, 4: 3. 10.1186/1471-2105-4-3.
20. Thomas A: 'GCHap: Fast MLEs for haplotype frequencies by gene counting'. Bioinformatics. 2003, 19: 2002-2003. 10.1093/bioinformatics/btg254.
21. Rohde K, Fuerst R: 'Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information'. Hum Mutat. 2001, 17: 289-295. 10.1002/humu.26.
22. Li J, Jiang T: 'Efficient inference of haplotypes from a genotype on a pedigree'. J Bioinf Comp Bio. 2003, 1: 41-69. 10.1142/S0219720003000204.
23. Dudbridge F: 'Pedigree disequilibrium tests for multilocus haplotypes'. Genet Epidemiol. 2003, 25: 115-121. 10.1002/gepi.10252.
24. Weale ME, Depondt C, Macdonald SJ, et al: 'Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: Implications for linkage-disequilibrium gene mapping'. Am J Hum Genet. 2003, 73: 551-565. 10.1086/378098.
25. Fallin D, Schork NJ: 'Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data'. Am J Hum Genet. 2000, 67: 947-959. 10.1086/303069.
26. Becker T, Knapp M: 'Efficiency of haplotype frequency estimation when nuclear family information is included'. Hum Hered. 2002, 54: 45-53. 10.1159/000066692.
27. Schaid DJ: 'Relative efficiency of ambiguous vs. directly measured haplotype frequencies'. Genet Epidemiol. 2002, 23: 426-443. 10.1002/gepi.10184.
28. Cheng R, Ma JZ, Wright FA, et al: 'Nonparametric disequilibrium mapping of functional sites using haplotypes of multiple tightly linked single-nucleotide polymorphism markers'. Genetics. 2003, 164: 1175-1187.
29. Wang S, Kidd KK, Zhao HY: 'On the use of DNA pooling to estimate haplotype frequencies'. Genet Epidemiol. 2003, 24: 74-82. 10.1002/gepi.10195.
30. Ito T, Chiku S, Inoue E, et al: 'Estimation of haplotype frequencies, linkage-disequilibrium measures and combination of haplotype copies in each pool by use of pooled DNA data'. Am J Hum Genet. 2003, 72: 384-398. 10.1086/346116.
31. Yang YN, Zhang JS, Hoh J, et al: 'Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA'. Proc Natl Acad Sci USA. 2003, 100: 7225-7230. 10.1073/pnas.1237858100.
32. Liu JS, Sabatti C, Teng J, et al: 'Bayesian analysis of haplotypes for linkage disequilibrium mapping'. Genome Res. 2001, 11: 1716-1724. 10.1101/gr.194801.
33. Lu X, Niu TH, Liu JS: 'Haplotype information and linkage disequilibrium mapping for single nucleotide polymorphisms'. Genome Res. 2003, 13: 2112-2117. 10.1101/gr.586803.
34. Zhang J, Rowe WL, Struewing JP, et al: 'HapScope: A software system for automated and visual analysis of functionally annotated haplotypes'. Nucleic Acids Res. 2002, 30: 5213-5221. 10.1093/nar/gkf654.