Software for quantitative trait analysis

This paper provides a brief overview of software currently available for the genetic analysis of quantitative traits in humans. Programs that implement variance components, Markov Chain Monte Carlo (MCMC), Haseman-Elston (H-E) and penetrance model-based linkage analyses are discussed, as are programs for measured genotype association analyses and quantitative trait transmission disequilibrium tests. The software compared includes LINKAGE, FASTLINK, PAP, SOLAR, SEGPATH, ACT, Mx, MERLIN, GENEHUNTER, Loki, Mendel, SAGE, QTDT and FBAT. Where possible, the paper provides URLs for acquiring these programs through the internet, details of the platforms for which the software is available and the types of analyses performed.


Introduction
Localisation and characterisation of quantitative trait loci (QTLs) and causal polymorphisms influencing complex phenotypes are of major importance in statistical genetic analyses. Important steps in this process are linkage analyses of QTLs and association studies correlating phenotypes and genotypes. These investigations have been greatly facilitated by the development of a variety of computer software allowing for the fast and efficient analysis of quantitative traits. This paper surveys software programs currently available for quantitative trait analysis. Given the rapid development of new programs, as well as the inevitable obsolescence of others, the focus here is necessarily limited to the software most widely used in the analysis of quantitative traits in humans. Also beyond the scope of this review are programs commonly used in studies of non-human model organisms and in species of agricultural importance, but designed primarily for the analysis of inbred, F2, half-sib and other specialised pedigrees.
The text of this paper has been organised by method, and is broadly divided into linkage methods and association methods. The linkage methods are further divided into parametric model-based, variance components and Haseman-Elston (H-E) methods. The association methods are categorised into measured genotype and transmission disequilibrium-based methods. Table 1 lists all of the software discussed in this paper and provides details regarding the methods implemented, the platforms for which software is available and a link to an online site for the program. The text provides a way to survey the available offerings by methodological approach, and Table 1 a way to search by program name. All of the programs discussed are freely available over the internet, with the exception of SAGE, for which there is a charge.

Linkage
The most commonly used methods for quantitative trait linkage analysis in humans are the variance components and H-E approaches. It is also possible to use parametric, model-based methods for quantitative trait linkage analysis. These require the specification of allele frequencies at the trait locus and genotype-specific trait means. The LINKAGE, FASTLINK, Mendel, PAP and SAGE packages can be used for model-based linkage and joint segregation-linkage analysis of quantitative traits. As with analyses of discrete disease traits using these programs, large and complex family structures are easily accommodated, but computing time for multipoint linkage analyses increases exponentially with the number of genotyped markers analysed, due to the use of the Elston-Stewart algorithm. 39 A new Java-based jPAP is now available, although some of the functions of the original PAP have yet to be implemented in it.
At their simplest, variance components approaches model the covariance among family members as a function of unspecified aggregate additive genetic effects, effects due to a hypothetical gene in the region being tested for linkage, and a residual component that is uncorrelated among individuals and is sometimes described as an environmental component. 1 has an epistasis option in its oligogenic multipoint linkage routine. SEGPATH makes it very easy to include a spouse correlation. SEGPATH, SOLAR, Mendel and ACT allow multivariate linkage analysis, in which the correlations between genetic and environmental components for multiple traits can be estimated. Mx has specialised routines for the analysis of twin data (its original function), although it has now been expanded to accommodate nuclear families. The MCMC-based approach implemented in Loki also estimates the variance due to a QTL, but additionally estimates the number of QTLs influencing the trait and their allele frequencies. 23 This model is easily expanded to incorporate dominance effects, epistatic and gene-environment interactions and other, more complex models. Whereas most QTL linkage routines provide an LOD (logarithm of odds) score as a measure of the evidence in favour of a trait-influencing locus in the region being tested, Loki reports a posterior probability of there being a QTL in the region.
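The basic variance components decomposition described at the start of this section can be illustrated with a short sketch. It assumes the commonly used formulation in which the trait covariance between relatives i and j is the QTL variance weighted by their estimated IBD sharing at the test location, plus the aggregate additive genetic variance weighted by twice their kinship coefficient, plus a residual variance on the diagonal only. The function name and the example matrices are hypothetical and are not taken from any of the programs reviewed here.

```python
def vc_covariance(kinship, ibd, var_qtl, var_poly, var_env):
    """Expected trait covariance matrix under a simple variance components model.

    Assumed formulation (not tied to any specific program):
        cov[i][j] = ibd[i][j] * var_qtl + 2 * kinship[i][j] * var_poly
                    + var_env if i == j
    kinship : matrix of kinship coefficients (0.5 on the diagonal, no inbreeding)
    ibd     : matrix of estimated proportions of alleles shared IBD at the
              location being tested (1.0 on the diagonal)
    """
    n = len(kinship)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cov[i][j] = ibd[i][j] * var_qtl + 2.0 * kinship[i][j] * var_poly
            if i == j:
                # The residual component is uncorrelated among individuals.
                cov[i][j] += var_env
    return cov


# Hypothetical sib pair sharing half of their alleles IBD at the test location.
sib_kinship = [[0.5, 0.25], [0.25, 0.5]]
sib_ibd = [[1.0, 0.5], [0.5, 1.0]]
sib_cov = vc_covariance(sib_kinship, sib_ibd, var_qtl=0.2, var_poly=0.3, var_env=0.5)
```

In a linkage analysis the three variances would be estimated by maximising the likelihood of the observed trait values given this covariance structure; the sketch only shows how the expected covariance is assembled.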
One of the greatest differences between specific implementations of the variance component linkage method is the source of the multipoint identity by descent (IBD) matrices that are used to estimate QTL-specific variance in a linkage analysis. GENEHUNTER and MERLIN use a Lander-Green algorithm, 41 for which computing time becomes exponential with the number of non-founders in a pedigree. Generally, families larger than 20 or 25 individuals cannot be analysed in these programs without breaking up the pedigrees into smaller units. By using a sparse binary tree, as opposed to a full binary tree, MERLIN is able to accommodate larger families than GENEHUNTER. Mendel uses either the Elston-Stewart or the Lander-Green algorithm, depending on pedigree size. SOLAR uses an extension of the Fulker-Cardon interval approach to estimating multipoint IBDs. 38,42 This allows both pedigrees of unlimited complexity and an unlimited number of genotyped markers. This approximation performs well for markers that are individually quite informative (such as short tandem repeats), but it is not suitable for marker sets in which the markers are individually less informative (such as single nucleotide polymorphisms). Markov Chain Monte Carlo (MCMC) methods are used to estimate IBDs in Loki. These methods are also approximate, but are more precise than the Fulker-Cardon interval approach. Computation time for MCMC IBD estimation is linear in both the number of markers and the size of pedigree, making it suitable for use with large complex pedigrees and unlimited numbers of markers. It is more computationally intensive than the interval approach, however, and may require weeks of computing time in the case of pedigrees of 100 individuals or more.
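The pedigree size limit of Lander-Green-based programs can be made concrete with a rough count of the inheritance vectors that must be considered. The sketch below assumes the usual description of the algorithm, with two meiosis bits per non-founder and, optionally, one bit saved per founder through founder symmetry. The function names are hypothetical, and the actual memory and time requirements of GENEHUNTER and MERLIN depend on implementation details not covered here.

```python
def inheritance_vector_bits(n_nonfounders, n_founders=0):
    """Number of bits in the inheritance vector for one pedigree.

    Assumption: 2 bits per non-founder (one per parental meiosis), reduced by
    one bit per founder when founder symmetry is exploited. Pass n_founders=0
    to obtain the unreduced count.
    """
    return max(2 * n_nonfounders - n_founders, 0)


def n_inheritance_vectors(n_nonfounders, n_founders=0):
    """Number of distinct inheritance vectors, which grows as 2 ** bits."""
    return 2 ** inheritance_vector_bits(n_nonfounders, n_founders)


# Hypothetical pedigrees: the count doubles with every additional bit.
for nonfounders, founders in [(2, 2), (8, 4), (16, 6)]:
    print(nonfounders, founders, n_inheritance_vectors(nonfounders, founders))
```

Because each additional non-founder multiplies the count by four, moderately sized pedigrees quickly become impractical, which is consistent with the need to split larger families described above.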
A number of programs (SOLAR, ACT, SEGPATH and SAGE) are set up to use IBD matrices that are generated once per study and then stored, making it possible to import IBD matrices from a variety of sources if they are converted to the proper program-specific format.
Because a model with no QTL effects, with all genetic effects in an unspecified aggregate genetic component, forms the basis of comparison for the likelihood ratio test of linkage, most variance components programs also provide an overall estimate of the trait heritability.
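The comparison between the two models can be sketched as follows. The sketch assumes that each model has already been fitted by maximum likelihood and that natural-log likelihoods are available; the LOD score is then the log-likelihood difference converted to base 10, and the heritability is the share of the trait variance attributed to the aggregate additive genetic component in the model without a QTL. The function names and numerical values are hypothetical.

```python
import math


def lod_score(loglik_qtl, loglik_null):
    """LOD score from natural-log likelihoods of the QTL and no-QTL models."""
    return (loglik_qtl - loglik_null) / math.log(10)


def heritability(var_additive, var_residual):
    """Proportion of trait variance due to aggregate additive genetic effects
    in the model with no QTL component."""
    total = var_additive + var_residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_additive / total


# Hypothetical fitted values for one trait and one chromosomal location.
example_lod = lod_score(loglik_qtl=-1000.0, loglik_null=-1004.6)
example_h2 = heritability(var_additive=0.4, var_residual=0.6)
```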
The H-E linkage method, at its most basic, models the squared difference in siblings' trait values as a function of their IBD allele sharing at a particular chromosomal location. 43 There have been a wide variety of extensions to the general H-E method. The 'revisited' H-E uses the mean corrected cross-product of the siblings' trait values. 44 This was found to be less powerful than the original H-E in some cases, 45 which led to the development of a variety of 'weighted' H-E tests using functions of the original and revisited H-E. 46,47 Most recently, the H-E model has been extended to model the full variance-covariance matrix within a family. 48 This latest version of the H-E is very similar to a variance component approach, the primary difference being that the various components are generally estimated by regression rather than maximum likelihood. Regression approaches should be computationally more efficient, whereas maximum likelihood approaches are, in theory, more powerful, although this difference is likely to be negligible in practice. Regression-based approaches may also be more robust to non-normality of the trait distribution. As with a variance components approach, the latest version of the H-E, in which the full variance-covariance matrix is modelled, is easily extended to include epistatic interactions, gene-environment interactions and so on. The original H-E linkage approach is implemented in SAGE and GENEHUNTER. The latest expansion of the H-E is also available in SAGE. The MERLIN REGRESS routine implements an H-E extension developed by Sham et al. 49 that uses squared trait sums and differences between relative pairs.
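The original H-E regression can be sketched in a few lines. This is a minimal illustration, not the implementation used in SAGE, GENEHUNTER or MERLIN: it regresses the squared sib-pair trait difference on the estimated proportion of alleles shared IBD by ordinary least squares, and a negative slope is taken as evidence for linkage. The function name and the input layout are hypothetical, and significance testing is omitted.

```python
def haseman_elston(sib_pairs):
    """Original Haseman-Elston regression for sib pairs.

    sib_pairs : list of (trait_sib1, trait_sib2, ibd_proportion) tuples,
                where ibd_proportion is the estimated proportion of alleles
                shared IBD at the location being tested (0.0 to 1.0).
    Returns (slope, intercept) of the least-squares regression of the squared
    trait difference on IBD sharing.
    """
    y = [(t1 - t2) ** 2 for t1, t2, _ in sib_pairs]
    x = [pi for _, _, pi in sib_pairs]
    n = len(sib_pairs)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("IBD sharing does not vary across pairs")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: pairs sharing more alleles IBD have more similar traits.
pairs = [(1.2, 3.4, 0.0), (0.8, 2.1, 0.5), (1.5, 1.9, 1.0), (2.2, 0.1, 0.0)]
slope, intercept = haseman_elston(pairs)
```

The 'revisited' H-E would replace the squared difference with the mean-corrected cross-product of the siblings' trait values as the dependent variable, in which case a positive slope would be expected under linkage.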

Association
The commonly used association methods for quantitative traits generally fall into two main categories: measured genotype approaches and transmission disequilibrium approaches. The measured genotype approach 40,50 is a fixed-effects model in which genotype-specific trait means are estimated. An additive model, in which the heterozygote trait mean is constrained to be halfway between the means of the two homozygotes, provides a single degree of freedom test. This can be implemented through a covariate that takes the values of -1 and +1 for opposing homozygotes and 0 for heterozygotes. Thus, any quantitative trait program that permits the use of covariates can be used to test a measured genotype model. These include PAP, ACT, SEGPATH, SOLAR, SAGE, Mendel and MERLIN. Non-additive models are also easily investigated through the use of different codings of the covariate. Similar analyses could be carried out with any regression program, of course, but the packages listed above have the advantage of dealing with non-independence among family members. Any standard regression routine in a statistics package would be appropriate for measured genotype analyses in unrelated individuals but would provide incorrect p values in the case of family data. In general, linkage programs that permit the use of covariates can be used to perform linkage analyses conditional on measured genotype as a way of testing the contribution of an associated marker to a linkage signal. 51 Note that measured genotype analyses are susceptible to population stratification. That is, if there are subgroups in the data that have different trait means, any marker that differs in allele frequency between those subgroups may show association, regardless of whether or not it is in linkage disequilibrium with a QTL. Such population substructure can be detected through analyses of unlinked markers by programs such as Structure. 52
Transmission disequilibrium-based tests (TDT) were originally developed to provide a test for discrete trait association in the presence of linkage that was robust to population stratification. 53,54 The TDT has been expanded to accommodate quantitative traits in a variety of ways. 55-57 Essentially, the quantitative trait TDT methods test whether the trait mean in offspring differs according to whether a particular allele was or was not transmitted by a parent heterozygous for that allele. The various methods differ by whether they require assumptions regarding the trait distribution and by the size of the families they can accommodate, from parent-child trios to arbitrary pedigrees. TDT analyses in FBAT require no assumptions regarding the trait distribution and are performed in nuclear families. Larger pedigrees can be used, but they will be decomposed into nuclear families for analysis and an empirical estimate of the variance of the test statistic will be used to account for the non-independence between nuclear families. The program QTDT 33 implements a variety of quantitative trait TDT tests, including those described by Allison 55 and by Rabinowitz. 56 SOLAR has an extended pedigree-compatible TDT. The gamete competition model 58 is a generalisation of the quantitative trait TDT that is applicable to general pedigrees, implemented in Mendel.
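Returning to the measured genotype approach described earlier in this section, the additive covariate coding can be sketched as follows. The sketch assumes a biallelic marker with genotypes given as pairs of allele labels, and it uses a plain least-squares slope that would only be appropriate for unrelated individuals; for family data, the coded covariate would instead be supplied to one of the packages listed above. All function names and data are hypothetical.

```python
def additive_code(genotype, allele="A"):
    """Additive covariate: +1 and -1 for opposing homozygotes, 0 for heterozygotes.

    genotype : pair of allele labels at a biallelic marker, e.g. ("A", "a")
    allele   : the allele whose homozygote is coded +1 (an arbitrary choice)
    """
    return list(genotype).count(allele) - 1


def dominant_code(genotype, allele="A"):
    """One possible non-additive coding: 1 for carriers of the allele, else 0."""
    return 1 if allele in genotype else 0


def least_squares_slope(codes, traits):
    """Slope of the regression of trait on coded genotype.

    Suitable for unrelated individuals only; it ignores the non-independence
    among family members.
    """
    n = len(codes)
    mean_x = sum(codes) / n
    mean_y = sum(traits) / n
    sxx = sum((x - mean_x) ** 2 for x in codes)
    if sxx == 0:
        raise ValueError("genotype codes do not vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(codes, traits))
    return sxy / sxx


# Hypothetical unrelated individuals.
genotypes = [("A", "A"), ("A", "a"), ("a", "a"), ("a", "A")]
traits = [12.1, 10.3, 8.2, 9.9]
codes = [additive_code(g) for g in genotypes]
additive_effect = least_squares_slope(codes, traits)
```

Under the additive coding the slope estimates half the difference between the two homozygote means, which is why the model yields a single degree of freedom test.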

Conclusions
No review of this type can be completely comprehensive. The programs outlined above all have additional capabilities and unique subroutines that could not be detailed here. Many of these programs are still under active development, with new features being added all the time. Hopefully, this paper will have provided enough information about each program that the reader will be able to discern which ones are appropriate for their needs and merit further investigation through the internet links and references provided in Table 1. Of course, Table 1 itself is not a complete catalogue of the available software, although the most widely used packages have been included. There are many more quantitative trait analysis programs than can be feasibly discussed within a single brief paper, and there are new programs appearing constantly. There are several websites that maintain general lists of genetic analysis software; perhaps the most comprehensive of these is the genetic analysis software list started by Wentian Li at Columbia University which is now maintained at: http://www.nslij-genetics.org/soft/ with a mirror site at: http://linkage.rockefeller.edu/soft.