A survey of current software for haplotype phase inference

In the past two years, tracking the explosion in data due to ever-improving single nucleotide polymorphism (SNP) maps and cheaper high-throughput genotyping technologies, a bewildering array of new algorithms and relevant software have appeared for haplotype phase inference. The alternatives to haplotype inference are to resolve haplotypes completely, either by in vitro methods or by typing close pedigrees, which is expensive and is not guaranteed in pedigrees, or to ignore haplotype-level analysis in favour of genotype-level analysis, which avoids the danger of treating inferred haplotypes as real but denies the researcher, potentially, any valuable analytic insights. This review attempts a snapshot of this rapidly moving field as it stands at present, and is mainly restricted, given the current predominance of SNP genotyping, to the consideration of diallelic data. For completeness, the review will occasionally refer to algorithms for which no software exists.


Introduction
Haplotype phase algorithms can be conveniently split into three main types: parsimony, maximum likelihood and Bayesian. The researcher may either want to infer haplotype frequencies in the population, impute the haplotypes possessed by given individuals, or both. In general, parsimony methods most naturally estimate individual haplotypes, maximum likelihood methods most naturally estimate population frequencies and Bayesian methods can do both.
Parsimony algorithms avoid explicit likelihood calculations by minimising a 'costly' constraint. The grandfather of all haplotype phase algorithms (an elderly 13 year old) is Clark's method, 1 a simple iterative procedure inspired by the constraint 'minimise the number of new haplotypes you have to invent'. (To obtain 'HAPINFERX' software, apply to ac347@cornell.edu.) The method can either suffer from having too many solutions or from having none (although the general problem of convergence is a common issue with all haplotype inference algorithms). There is also no guarantee that the global minimum for the 'minimise haplotype number' constraint is reached by Clark's algorithm. This latter problem is fixed in a more recent algorithm 2 ('HAPAR'; apply to lwang@cs.cityu.edu.hk). Phylogenetic parsimony methods have been explored by Daniel Gusfield and colleagues ('GPPH', 'DPPH' and 'BPPH'; http://wwwcsif.cs.ucdavis. edu/, gusfield/). The constraint here is 'minimise the number of ancestral recombination events required to link the new invented haplotypes'. As one might expect, this constraint works well in small, tightly-linked genomic regions and less well in bigger regions. 3 Because parsimony algorithms avoid explicit likelihood calculations, they do not provide any natural way to measure uncertainty in the estimates. Maximum likelihood and Bayesian methods provide a way around this problem.
Maximum likelihood estimation is predominantly undertaken via Expectation -Maximisation (EM) algorithms. These use an explicit but very simple likelihood model for the data (the so-called 'gene counting' model). Observed (or partially observed) haplotype counts follow a multinomial distribution conditional on the haplotype population frequencies. Random assortment of haplotypes to individuals is assumed (a standard assumption for all algorithms, whether maximum likelihood or Bayesian, working with likelihood functions). The EM algorithm avoids making assumptions about the mutational and recombinatorial relationships of the final set of inferred haplotypes, which some see as an advantage and others as a disadvantage. The original EM algorithm citation here is usually Excoffier and Slatkin, 4 but see also Hawley and Kidd 5 ('HAPLO'; http://krunch.med.yale.edu/haplo/). Some wellused implementations of the algorithm are: 'EM-decoder' 6 (http://www.people.fas.harvard.edu/, junliu/em/em.htm), 'EHþ ' 7 (http://www.iop.kcl.ac.uk/IoP/Departments/ PsychMed/GEpiBSt/software.shtml), 'GENECOUNTING' 8 (same website as 'EHþ ') and 'snphap' (D. Clayton; http:// www-gene.cimr.cam.ac.uk/clayton/software/). Of these, 'GeneCounting' and 'snphap' have the added refinement of allowing for missing data. 'snphap' has the additional refinement that, once haplotype frequencies are estimated, the program swaps from likelihood-based to posterior probabilitybased imputation and calculates haplotype-pair probabilitiesconditional on the estimated haplotype frequencies -for all pairs consistent with an individual's genotype. 'snphap' works only on diallelic data. The extension 'hap' ( J. H. Zhao, same website as 'EHþ ' and 'GeneCounting') runs the same algorithm but accepts multiallelic data.
Bayesian algorithms have the potential to address the issue, missing from EM algorithms, of how to guide the haplotype inference process so as to favour solutions which make sense in terms of an underlying genealogy connecting the haplotypes, via manipulation of the prior. The first proposed Bayesian algorithm, and still one of the best, is that implemented in the 'PHASE' program 9 (http://www.stat.washington.edu/ stephens/phase.html). The proposed prior is derived, approximately, from coalescent theory, and ensures that new 'invented' haplotypes look mutationally similar to the others at any one stage of the iterative (Gibbs sampler) stochastic convergence process. The main disadvantage of the original version of 'PHASE' was its plodding speed of convergence for datasets of any reasonable size.

Extensions
The key developments have been towards improving speed as datasets increase in size, and coping with ever larger genomic regions, where it becomes impossible to infer unbroken haplotypes over the entire region because their estimated frequencies become vanishingly small. For parsimony algorithms, Gusfield shows how to implement a speeded up version of Clarke's algorithm. 3 Eskin and colleagues also illustrate the considerable speed advantages of this simple algorithm in cases where a simple 'block'-like structure of the genome is observed 10 ('HAP'; http://www1.cs.columbia.edu/ compbio/hap/).
For Bayesian algorithms, one key idea that has since been implemented in several new extensions is the Partition-Ligation strategy proposed by Niu and colleagues. 6 Here, the genome region is split into a number of smaller regions (either arbitrarily or by some process that attempts to maximise linkage disequilibrium within each region). The haplotype inference method is then applied separately to two adjacent sub-regions and allowed to converge separately. Larger haplotypes are then formed by allowing haplotypes to merge at random across the boundary, using current estimates of their respective frequencies. The haplotype inference method is then applied to the new larger region (and allowed to converge), and separately to another adjacent sub-region. The process repeats until all sub-regions are merged. Niu and colleagues also speeded up convergence steps by 'prior annealing', in which jumps in posterior probability space are allowed to be larger at first, then progressively smaller.
Stephens and Donnelly have implemented these ideas in a new faster 'PHASE' program, 11 and Niu and colleagues have implemented them in a new Bayesian algorithm 'HAPLO-TYPER' 6 (http://www.people.fas.harvard.edu/, junliu/ Haplo/docMain.htm). This latter algorithm abandons the idea of a prior favouring mutational similarity among inferred haplotypes, and instead applies a Dirichlet prior. This prior functions in a similar way to the multinomial in EM algorithms, in that it avoids making assumptions about mutational and recombinatorial relationships among inferred haplotypes. Lin and colleagues have also proposed a separate Bayesian algorithm with a Dirichlet prior. 12 The issue of what constitutes a good prior for a Bayesian model remains unresolved. 13 While the Dirichlet is computationally convenient, there is valuable extra information in the mutational and recombinatorial relationships that should lead to more accurate inferences of haplotypes, provided that the models dealing with both of these phenomena are reasonable. Eronen and colleagues propose a new prior allowing for recombination, designed explicitly for long-range genotype data 14 ('MC-VL'; http:// www.cs.helsinki.fi/group/genetics/haplotyping.html). Another promising new algorithm that explicitly incorporates recombination into its strategy and is also designed specifically for long-range genotype data is the 'ELB' algorithm proposed by Excoffier and colleagues. 15 The latest version of 'PHASE' also optionally incorporates a recombination model. 16 For EM algorithms, Qin and colleagues have implemented the above partition-ligation ideas into an EM context 17 ('PL-EM', http://www.people.fas.harvard.edu/, junliu/ plem/). A very similar algorithm has been proposed by Li and colleagues 18 ('HPlus'; http://qge.fhcrc.org/hplus/). Zhang and colleagues propose an improvement to the speed of the E-step in the EM algorithm 19 ('OSLEM'; http://genome3. cpmc.columbia.edu/cgi-bin/GENOME/oslem/doHaplo.cgi), and Thomas proposes other approximations to increase EM algorithm speed 20 ('GCHap'; http://episun7.med.utah.edu/ , alun/gchap/). David Clayton's 'snphap' tackles the large data set problem by starting with two-locus haplotypes, extending the haplotype one locus at a time, and culling low-frequency haplotypes at an early stage. The effect of these short cuts on the optimality of the final solution is unclear.
A number of researchers have proposed EM algorithms that take advantage of the increased (but not complete) certainty in haplotype phase afforded by simple pedigree data, especially trios. These include Rohde and Fuerst 21 (apply to rohde@mdc-berlin.de for software), Li and Jiang 22 ('PedPhase'; http://www.cs.ucr.edu/, jili/haplotyping.html), Dudbridge, 23 and Weale and colleagues 24 ('EMtrio', part of the 'TagIT' package; http://popgen.biol.ucl.ac.uk/software. html). 'EMtrio' is designed to cope with partially missing genotype data, in which one homologous chromosome may be phase-resolved and the other not. The Bayesian 'PHASE' program also allows input of phase-resolved data, but does not handle the above partially-missing situations. A front-end to 'PHASE', called 'PHamily', automatically resolves trio data (H. Ackerman and M. Stephens; http://archimedes.well.ox.ac. uk/pise/). Opinion is divided on whether it is worth the extra genotyping effort to type close relatives to help resolve phase. 25 -28 Interest has also focused recently on the use of EM algorithms to infer haplotypes from pooled DNA data. 29 -31 Regardless of which method of haplotype inference is used, it is generally recognised that any subsequent analyses using such haplotypes (eg association tests against phenotype) should ideally take account of the uncertainty associated with these inferred haplotypes. There has also been a considerable amount of recent literature on this subject, which is not reviewed here. One promising program that allows for this is 'BLADE' 32,33 (Version 2: http://www.fas.harvard.edu/ , junliu/TechRept/03folder/bladev2.tgz).
Despite the assertions of some, it is currently not clear which one of these alternative methods and their extensions will provide the most reliable estimates. All the rival algorithms tend to do well when datasets and genomic regions are small; all do badly when they are large. One prudent measure is to check the results of different methods against each other for consistency. The program 'HIT' brings together four wellused algorithms for this purpose (including two EM algorithms, 'PHASE', and 'HAPLOTYPER'; apply to wangx@udel.edu). The 'HapScope' package 34 (ftp://ftp1.nci.nih.gov/pub/Hap Scope) incorporates versions of both 'PHASE' and 'snphap'. When consistency breaks down in the larger datasets, the way forward is still unclear. The key issue will not be to find a better haplotype inference method per se, but rather to find a better strategy for partitioning large genomic regions into manageable sub-regions without losing useful linkage disequilibrium information along the way.