A survey of current software for linkage analysis
© Henry Stewart Publications 2003
Received: 18 August 2003
Accepted: 18 August 2003
Published: 1 November 2003
There is now a wide choice of software available for linkage analysis. The most well-known packages are briefly reviewed here. The package with the most extensive range of analyses is GENEHUNTER, but for many of its functions there are other programs with better performance. These include FASTLINK and VITESSE for parametric analysis, ALLEGRO and MERLIN for non-parametric analysis, and SOLAR for variance-components analysis. The computational limits of current approaches can be extended with SIMWALK2 and the promising new SUPERLINK program. Directions for future work include improved user interfaces and consensus formats for data input and exchange.
Keywords: software, linkage analysis, programming
There is now a wide choice of methods and software available for mapping genes by linkage. Although the method of analysis is often determined by the experimental design, there is less guidance regarding the most appropriate software. Here, the most well-known packages for linkage analysis will be briefly reviewed and some directions and standards for future work will be suggested.
At one extreme, linkage analysis is applied to a small number of large pedigrees in which the trait exhibits a strongly Mendelian mode of inheritance. Methods for this type of data are usually termed 'parametric' because an explicit penetrance model defining the relationship between genotype and disease must be specified. The most flexible package for these analytical methods remains FASTLINK [1, 2], which is functionally equivalent to the original LINKAGE package [3]. For most pedigree structures, whether one applies single- or multi-point analysis of a disease or quantitative trait, VITESSE is a faster package [4, 5]; however, FASTLINK continues to be more efficient for pedigrees containing inbreeding loops.
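As a minimal illustration of what a parametric lod score measures, the sketch below handles only the simplest case: fully informative, phase-known meioses, where the likelihood depends on the recombination fraction alone. The function names and the counts in the usage note are hypothetical. Packages such as FASTLINK and VITESSE evaluate the full pedigree likelihood under the specified penetrance model, which this sketch does not attempt.

```python
import math


def _log10_term(count: int, prob: float) -> float:
    """Return count * log10(prob), treating 0 * log10(0) as 0."""
    if count == 0:
        return 0.0
    if prob == 0.0:
        return float("-inf")
    return count * math.log10(prob)


def lod_score(recombinants: int, nonrecombinants: int, theta: float) -> float:
    """Two-point lod score: log10 L(theta) - log10 L(0.5).

    Assumes every meiosis is phase-known and fully informative,
    so L(theta) = theta^R * (1 - theta)^(N - R).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    n = recombinants + nonrecombinants
    log_l_theta = (_log10_term(recombinants, theta)
                   + _log10_term(nonrecombinants, 1.0 - theta))
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(recombinants: int, nonrecombinants: int):
    """Return (theta_hat, lod at theta_hat), with theta_hat capped at 0.5."""
    n = recombinants + nonrecombinants
    if n == 0:
        return 0.5, 0.0
    theta_hat = min(recombinants / n, 0.5)
    return theta_hat, lod_score(recombinants, nonrecombinants, theta_hat)
```

For example, `max_lod(1, 9)` gives an estimated recombination fraction of 0.1 and a lod score of roughly 1.6, below the conventional threshold of 3.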
At the other extreme, linkage analysis is also applied to a large number of small pedigrees with unknown mode of inheritance. 'Non-parametric' allele-sharing methods are usually preferred here, for which the most well-known program is GENEHUNTER [6, 7]. GENEHUNTER contains an extensive set of linkage and association tests and, as such, is a de facto standard for statistical genetics analysis [8]. A disadvantage of this position is that any new program will aspire to improve on GENEHUNTER, so that for many of its functions there are now other programs with better performance. An important example is ALLEGRO [9], which is faster for most pedigree structures, includes a wider range of scoring functions and computes more accurate significance levels for non-parametric statistics. The latter feature is also available in GENEHUNTER-PLUS [10], but this is only available for version 1.3 of GENEHUNTER and so does not access the speed-ups available in later versions.
Another recent competitor is MERLIN [11], which employs a still faster algorithm that is particularly useful in dense marker maps, for which the number of recombinations allowed between markers can be constrained. The range of analyses is similar to that of GENEHUNTER. MERLIN also provides the linear-model lod score available in ALLEGRO, but not the exponential model. MERLIN does not calculate parametric lod scores -- which are available in GENEHUNTER and ALLEGRO -- but for non-parametric analysis, error checking and haplotyping, it will often be the fastest program. All three of these programs handle X-linked data, although in GENEHUNTER this, too, is available only in version 1.3.
An alternative approach for an unknown mode of inheritance is to perform parametric analysis over a range of models and then adjust the best lod score for this optimisation. This approach is implemented in MFLINK [12]. In small pedigrees, there seems to be little to choose between this approach and the allele-sharing methods discussed above [13]; however, currently MFLINK can only perform two-point analysis.
A promising new model is implemented in SUPERLINK [14]. Fishelson and Geiger show that the algorithms used by FASTLINK and GENEHUNTER are instances of a more general model, under which a more efficient order of computation is determined at run-time according to the input pedigree. For parametric linkage analysis, some impressive speed-ups over VITESSE have been reported. Future versions will include allele sharing and other statistics (M. Fishelson, personal communication).
Quantitative traits are commonly analysed by regression or by variance-components methods. Haseman-Elston regression is a sib-pair method available in GENEHUNTER with heuristic adjustments for general pedigrees. Recently, the regression framework has been extended to more general pedigrees [15], and this is implemented in MERLIN. This approach now has comparable power to variance-components methods, with less dependence on trait normality and some computational advantages. MERLIN and GENEHUNTER also provide rank-based tests (confusingly also termed 'non-parametric'), which are appropriate for non-normally distributed traits. Again, note that for GENEHUNTER the test is a sib-pair method, with heuristic adjustments for general pedigrees, whereas for MERLIN the test is immediately applicable to general pedigrees.
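The idea behind sib-pair regression can be sketched in a few lines. The example below follows the original Haseman-Elston formulation, in which the squared trait difference of each sib pair is regressed on the estimated proportion of alleles shared identical by descent (IBD) at a locus; a clearly negative slope suggests linkage. It is an illustrative sketch only: the function names and data are hypothetical, and it reproduces neither the heuristic adjustments in GENEHUNTER nor the extended regression implemented in MERLIN.

```python
from statistics import mean
from typing import Sequence, Tuple


def ibd_proportion(p0: float, p1: float, p2: float) -> float:
    """Expected proportion of alleles shared IBD, given P(IBD = 0, 1, 2)."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.5 * p1 + p2


def haseman_elston(trait_pairs: Sequence[Tuple[float, float]],
                   ibd_proportions: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of squared sib-pair trait differences
    on IBD sharing. Returns (intercept, slope)."""
    if len(trait_pairs) != len(ibd_proportions):
        raise ValueError("one IBD proportion is needed per sib pair")
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x = list(ibd_proportions)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("IBD sharing does not vary across pairs")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope
```

With three toy pairs, `haseman_elston([(0.0, 2.0), (1.0, 2.0), (3.0, 3.0)], [0.0, 0.5, 1.0])` yields a slope of -4: pairs that share more alleles IBD have more similar trait values. A real analysis would also test the slope for significance, which is omitted here.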
Variance-components methods are more powerful than regression, provide parameter estimates and easily accommodate a wide range of null hypotheses; the cost is stronger dependence on trait normality and higher computational burden. Implementations are available in MERLIN, provided that no dominance variance is assumed, and in GENEHUNTER. Another very flexible package for variance-components model fitting is SOLAR [16]. MERLIN is currently the only program that can perform multipoint variance-components analysis on the X chromosome. ALLEGRO also contains undocumented implementations of various quantitative trait methods.
Exact multipoint analysis is limited either by the number of markers that can be included (FASTLINK, VITESSE) or the pedigree size (GENEHUNTER, ALLEGRO, MERLIN). With current microsatellite markers, large pedigrees usually contain enough information from a small number of markers for current software to be adequate. This will change with the move to automated single nucleotide polymorphism typing for linkage studies [17], so it is becoming more important to have software that can handle large numbers of markers in large pedigrees. Currently, this is only generally possible through the approximation methods of SIMWALK2, which nevertheless has good reported accuracy [18]. Although the program has many tuning parameters, the MEGA2 utility program provides a reasonably easy route to a default analysis which is suitable in most cases [19]. More efficient approximation methods are an area of current research, for example MORGAN [20], which currently only allows fully penetrant recessive traits but shows promise for more general models.
Modern computing favours graphical user interfaces (GUIs), which allow mouse-driven input; but these are conspicuously absent from linkage software. Descendants of LINKAGE have essentially no user interface, although the terminal-based tool LCP is available to set up analysis scripts; GENEHUNTER and SOLAR run their own interactive command shells, whereas ALLEGRO and MERLIN use a single command with optional arguments and auxiliary input files. On the plus side, all of these interfaces are amenable to scripting -- for example to allow one to repeat the same analysis on multiple input files -- but the single-command interface of ALLEGRO and MERLIN is easily the most convenient to use in scripts. With the availability of Java, HTML and TCL as cross-platform languages for GUI development, it is hoped that future versions of these packages will incorporate simpler user interfaces, as well as scriptable back ends.
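To illustrate why a single-command interface is convenient in scripts, the following sketch builds one command line per data set and can optionally run them in sequence. The helper names are hypothetical, and the flags in the usage note (`-d`, `-p`, `-m`, `--npl`) are written in the style of MERLIN's command line as an assumption; they should be checked against the documentation of the installed version before use.

```python
import subprocess
from typing import Iterable, List, Sequence, Tuple


def build_commands(program: str,
                   stems: Iterable[str],
                   file_flags: Sequence[Tuple[str, str]],
                   extra_args: Sequence[str] = ()) -> List[List[str]]:
    """Build one argument list per data set.

    stems      -- file name stems, e.g. "chr1" for chr1.dat, chr1.ped, ...
    file_flags -- (flag, suffix) pairs, e.g. ("-p", ".ped")
    extra_args -- options appended unchanged to every command
    """
    commands = []
    for stem in stems:
        cmd = [program]
        for flag, suffix in file_flags:
            cmd += [flag, f"{stem}{suffix}"]
        cmd += list(extra_args)
        commands.append(cmd)
    return commands


def run_all(commands: Iterable[Sequence[str]]) -> List[int]:
    """Run each command in turn and collect the exit codes."""
    return [subprocess.run(list(cmd), check=False).returncode
            for cmd in commands]
```

For instance, `build_commands("merlin", ["chr1", "chr2"], [("-d", ".dat"), ("-p", ".ped"), ("-m", ".map")], ["--npl"])` produces two argument lists, one per chromosome, which `run_all` would execute one after the other. Programs with an interactive shell need their commands fed on standard input instead, which is workable but less direct.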
The LINKAGE input file format is recognised by many programs but is by no means universal. MEGA2 is a useful utility for converting between formats, but even this requires an additional map file which duplicates information contained in the locus file. It is hoped that the LINKAGE format, however imperfect, will eventually be recognised by all programs that perform linkage analysis, without the need for supplementary conversion scripts.
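For readers unfamiliar with the format, the sketch below parses pedigree records in the common LINKAGE-style layout: pedigree ID, individual ID, father ID, mother ID, sex, then the locus data. It assumes, purely for illustration, a single affection-status column followed by marker genotypes given as pairs of alleles; in practice the column layout is defined by the accompanying locus file, which this sketch does not read. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Person:
    pedigree: str
    ident: str
    father: str      # "0" conventionally marks a founder
    mother: str
    sex: int         # 1 = male, 2 = female
    affection: int   # 0 = unknown, 1 = unaffected, 2 = affected
    genotypes: List[Tuple[str, str]]


def parse_pedigree_line(line: str) -> Person:
    """Parse one whitespace-delimited pedigree record."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("marker alleles must come in pairs")
    # Pair up consecutive alleles: (a1, a2), (a3, a4), ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return Person(fields[0], fields[1], fields[2], fields[3],
                  int(fields[4]), int(fields[5]), genotypes)


def parse_pedigree(text: str) -> List[Person]:
    """Parse all non-blank lines of a pedigree file's contents."""
    return [parse_pedigree_line(line)
            for line in text.splitlines() if line.strip()]
```

A three-line file describing two founders and one affected child with two markers would yield three `Person` records, each carrying two genotype pairs.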
GENEHUNTER, ALLEGRO, MERLIN and SOLAR can all output multipoint identical-by-descent (IBD) distributions, which are valuable for gaining insights into the segregation patterns in pedigrees. None can input this information, however: it is not possible, say, to calculate the IBD distribution under the recombination restrictions of MERLIN and then use this to obtain an exponential-model lod score from ALLEGRO. Furthermore, sometimes different analyses result in the same distribution, and it is inefficient to recompute it each time. With some caveats, it is possible to avoid this recomputation in SOLAR, but simple input of IBD, haplotype and recombination information would still generally be a useful feature for future versions.
This survey has necessarily been cursory, and there is a wealth of other good linkage software available. Two internet sites provide useful lists of available software. A comprehensive list of statistical genetics software can be found at http://www.nslij-genetics.org/soft/, with links to their sources. This list continues to be mirrored at its previous site, http://linkage.rockefeller.org/soft/. It is perhaps over-inclusive, containing a number of obsolete programs, and it makes no recommendations. By contrast, the collection at http://www.hgmp.mrc.ac.uk/Registered/Menu/linkage.html contains only the most popular programs, but provides executable files, browsable documentation and a web-based graphical interface for the most common applications.
1. Cottingham RW, Idury RM, Schäffer AA: 'Faster sequential genetic linkage computations'. Am J Hum Genet. 1993, 53: 252-263.
2. Schäffer AA, Gupta SK, Shriram K, Cottingham RW: 'Avoiding recomputation in linkage analysis'. Hum Hered. 1994, 44: 225-237. 10.1159/000154222.
3. Lathrop GM, Lalouel JM: 'Easy calculations of lod scores and genetic risks on small computers'. Am J Hum Genet. 1984, 36: 460-465.
4. O'Connell JR, Weeks DE: 'The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype and set-recoding and fuzzy inheritance'. Nat Genet. 1995, 11: 402-408. 10.1038/ng1295-402.
5. O'Connell JR: 'Rapid multipoint linkage analysis via inheritance vectors in the Elston-Stewart algorithm'. Hum Hered. 2001, 51: 226-240. 10.1159/000053346.
6. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES: 'Parametric and non-parametric linkage analysis: A unified multipoint approach'. Am J Hum Genet. 1996, 58: 1347-1363.
7. Markianos K, Daly MJ, Kruglyak L: 'Efficient multipoint linkage analysis through reduction of inheritance space'. Am J Hum Genet. 2001, 68: 963-977. 10.1086/319507.
8. Nyholt DR: 'GENEHUNTER: Your 'one-stop shop' for statistical genetic analysis?'. Hum Hered. 2002, 53: 2-7. 10.1159/000048598.
9. Gudbjartsson DF, Jonasson K, Frigge M, Kong A: 'Allegro, a new computer program for multipoint linkage analysis'. Nat Genet. 2000, 25: 12-13. 10.1038/75514.
10. Kong A, Cox NJ: 'Allele-sharing models: LOD scores and accurate linkage tests'. Am J Hum Genet. 1997, 61: 1179-1188. 10.1086/301592.
11. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: 'Merlin - rapid analysis of dense genetic maps using sparse gene flow trees'. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
12. Curtis D, Sham PC: 'Model-free linkage analysis using likelihoods'. Am J Hum Genet. 1995, 57: 703-716.
13. Sham PC, Lin MW, Zhao JH, Curtis D: 'Power comparison of parametric and non-parametric linkage tests in small pedigrees'. Am J Hum Genet. 2000, 66: 1661-1668. 10.1086/302888.
14. Fishelson M, Geiger D: 'Exact genetic linkage computations for general pedigrees'. Bioinformatics. 2002, 18 (Suppl 1): S189-S198. 10.1093/bioinformatics/18.suppl_1.S189.
15. Sham PC, Purcell S, Cherny SS, Abecasis GR: 'Powerful regression-based quantitative-trait linkage analysis of general pedigrees'. Am J Hum Genet. 2002, 71: 238-253. 10.1086/341560.
16. Almasy L, Blangero J: 'Multipoint quantitative trait linkage analysis in general pedigrees'. Am J Hum Genet. 1998, 62: 1198-1211. 10.1086/301844.
17. Matise TC, Sachidanandam R, Clark AG, et al: 'A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set'. Am J Hum Genet. 2003, 73: 271-284. 10.1086/377137.
18. Sobel E, Lange K: 'Descent graphs in pedigree analysis: Applications to haplotyping, location scores, and marker-sharing statistics'. Am J Hum Genet. 1996, 58: 1323-1337.
19. Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE: 'Mega2, a data-handling program for facilitating genetic linkage and association analyses'. Am J Hum Genet. 1999, 65 (Suppl): A436-
20. George AW, Wijsman EM, Thompson EA: 'Detecting disease genes via a new Markov chain Monte Carlo approach for multipoint linkage analysis'. Genet Epidemiol. 2002, 23: 283-