A review of the 'Statistical Analysis for Genetic Epidemiology' (SAGE) software package
© Henry Stewart Publications 2004
Received: 22 July 2004
Accepted: 22 July 2004
Published: 1 November 2004
The Erratum to this article has been published in Human Genomics 2005 2:77
The 'Statistical Analysis for Genetic Epidemiology' (S.A.G.E.) software package is an integrated, comprehensive package of computer programs designed to perform many of the different analyses required in the study of genetic epidemiology. It offers a graphical user interface for most platforms and, unlike many programs available in the public domain, is flexible both in accepting many types of input files and in allowing the user to choose among output files. All of the programs accept the same data files and together provide the means to perform familial correlation, segregation, linkage and association analyses, as well as many of the ancillary analyses that help achieve these goals. Many, but not all, of the same or similar analyses can be performed (with more difficulty) using publicly available freeware. The primary limitations of S.A.G.E. at present are the lack of software for estimating haplotypes and for identifying probable double recombinants in linkage analysis. S.A.G.E. is continually being extended and upgraded, however, with automatic downloading of the latest version always available to users.
Keywords: familial correlations, segregation, linkage and association analyses, heritability, transmission disequilibrium test, allele frequency
The 'Statistical Analysis for Genetic Epidemiology' (S.A.G.E.) software package Version 1.0 was introduced in 1987. Since then it has changed and developed, becoming almost unrecognisable, although its function has remained the same: to give genetic epidemiologists the tools they need for the analysis of family and pedigree data. Version 5.0 supports a full graphical user interface (GUI), with dialogue boxes and pulldown menus, on Windows, Digital Unix, Solaris and Linux platforms. Formatting of the data and the naming of variables (including marker loci and alleles) are virtually unrestricted. Reasonable default values of all options are indicated, but the user maintains wide flexibility in the analyses that can be performed. This combination is unavailable in any other software with similar functions. S.A.G.E. is continually being extended and upgraded, with automatic downloading of the latest version always available to users. There is an annual licence fee, which varies according to the number of analyses that can be simultaneously performed, but substantial academic discounts apply. Many (certainly not all) of its functions are available as freeware, but S.A.G.E. offers the advantage, otherwise unavailable to human geneticists and epidemiologists, of an integrated package of programs with a modern GUI and wide flexibility that all accept the same data files. The following is a list of programs currently available and a brief description of what each one does. This is followed by a partial description of how a 'FUNCTION' utility expands the capabilities of these various programs. Finally, the most commonly used freeware available, having similar (although not exactly the same) functions, will be listed.
PEDINFO provides many useful descriptive statistics on pedigree data, including means, variances and histograms of family, sibship and pedigree sizes and counts of each type of relative pair. It identifies consanguineous matings, marriage loops and marriage rings. This allows the user quickly to describe the data that have undergone any particular analysis.
FCOR calculates multivariate familial correlations with their asymptotic standard errors without assuming multivariate normality of the traits across family members. It calculates familial correlations for all relative pair types available in the sample pedigrees. The covariance between any two correlation estimates is available, making it a simple matter to test whether any two correlations, obtained from family data, are significantly different. There is an option to output a file with a tabular structure for the correlations and their standard errors, making it easy to format results into tables for publication.
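The test of whether two correlations differ can be sketched as a Wald-type z test built from the two estimates, their standard errors and the covariance between them. This is only an illustration under the assumption that the estimates are asymptotically normal; it is not FCOR's own code, and the function name and the input values are hypothetical.

```python
import math

def correlation_difference_test(r1, r2, se1, se2, cov12):
    """Wald z test of H0: the two underlying correlations are equal.

    r1, r2   -- the two correlation estimates
    se1, se2 -- their asymptotic standard errors
    cov12    -- the covariance between the two estimates
    """
    # Var(r1 - r2) = Var(r1) + Var(r2) - 2 Cov(r1, r2)
    var_diff = se1 ** 2 + se2 ** 2 - 2.0 * cov12
    if var_diff <= 0.0:
        raise ValueError("variance of the difference must be positive")
    z = (r1 - r2) / math.sqrt(var_diff)
    # Two-sided p value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical values, e.g. a sib-sib versus a parent-offspring correlation.
z, p_value = correlation_difference_test(0.40, 0.25, 0.05, 0.06, 0.001)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

Ignoring the covariance term would treat the two estimates as independent, which is generally not the case when they come from the same pedigrees.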
SEGREG fits and tests Mendelian segregation models in the presence of residual familial correlations. The trait analysed can be continuous (for which regressive models or the finite polygenic mixed model can be used), binary (for which a multivariate logistic model or the finite polygenic mixed model can be used) or a binary disease trait with variable age of onset (using the finite polygenic mixed model). In this last case, it is possible to include in the likelihood information about the prevalence of the disease (even if that information is imprecise). This program can also be used for commingling analysis, to predict the major genotype of any pedigree member and to automatically prepare penetrance files needed for model-based linkage analysis.
MARKERINFO detects Mendelian inconsistencies of markers in pedigree data. By default, it assumes that all markers are codominant and error-free, but there is the flexibility to allow for markers that exhibit dominance or that are not error-free. The output is designed to help the user find the source of any inconsistency, even if it can only be detected by examining more than one nuclear family in the pedigree.
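The simplest kind of inconsistency, one visible within a single parent-offspring trio, can be sketched as follows. The sketch assumes a codominant, error-free marker with complete genotypes; the function name and allele labels are hypothetical, and it does not attempt the multi-family detection that MARKERINFO is described as providing.

```python
def trio_is_consistent(father, mother, child):
    """Return True if the child could have received one allele from each parent.

    Each genotype is a pair of allele labels, e.g. ("1", "3").
    Assumes a codominant, error-free marker and no missing data.
    """
    a, b = child
    # Either allele may have come from either parent.
    return (a in father and b in mother) or (b in father and a in mother)

# Hypothetical genotypes at one marker.
print(trio_is_consistent(("1", "2"), ("3", "4"), ("1", "3")))  # compatible trio
print(trio_is_consistent(("1", "2"), ("3", "4"), ("1", "2")))  # no maternal allele
```

A pedigree-wide check would apply this test to every trio and, beyond that, would need genotype elimination to catch inconsistencies that only appear when several nuclear families are considered jointly.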
FREQ estimates allele frequencies from marker data on related individuals with known pedigree structure and, provided the markers are codominant, automatically generates marker locus description files used by GENIBD, MLOD and other S.A.G.E. programs. This is done dynamically, the program searching for all the different alleles that occur in the sample for each marker.
GENIBD generates both single- and multi-point identity-by-descent (IBD) distributions for pairs of related individuals, using a variety of algorithms tuned for different types of pedigrees. Exact methods can be used for small pedigrees with loops, and simulation methods are available for large extended pedigrees without loops. IBD sharing can also be interpolated between markers using either the Haldane or the Kosambi map function. The output also contains maternal and paternal IBD sharing values that can be used to assess parent-of-origin effects.
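For reference, the two map functions named above convert a genetic map distance into a recombination fraction. The sketch below uses the standard textbook forms, with distance supplied in centimorgans; the function names are hypothetical, and how GENIBD applies the map functions during interpolation is not specified here.

```python
import math

def haldane_theta(distance_cm):
    """Haldane map function (no interference): theta = (1 - exp(-2d)) / 2."""
    d = distance_cm / 100.0  # centimorgans to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def kosambi_theta(distance_cm):
    """Kosambi map function (moderate interference): theta = tanh(2d) / 2."""
    d = distance_cm / 100.0  # centimorgans to Morgans
    return 0.5 * math.tanh(2.0 * d)

for cm in (1, 10, 50):
    print(cm, round(haldane_theta(cm), 4), round(kosambi_theta(cm), 4))
```

Both functions give nearly the same recombination fraction for short distances and approach 0.5 as the distance grows; they differ mainly at intermediate distances, where the Kosambi function gives the larger value.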
RELTEST indicates how to reclassify pairs of relatives according to their true relationship using full multi-point genome scan data. The method is based on a Markov process model of IBD allele-sharing along chromosomes. This program currently analyses four different types of putative pairs: full sib pairs, half sib pairs, parent-offspring pairs and unrelated marital pairs. A summary file is produced that contains the pairs to be reclassified, together with their mean allele-sharing statistic, parent-offspring statistic and, for each individual, the percentage of marker data that is missing. This last feature enables the user to know whether the suggested reclassification should be made, because it can be unreliable if based on data from less than half the genome.
SIBPAL is designed for the analysis of sib pairs, or larger sibships, to detect linkage. In the case of binary traits, mean and proportion tests are performed for affected pairs, unaffected pairs and discordant pairs, using probabilistic estimates of their allele sharing. In the case of quantitative traits (including binary traits as 0,1 variables), the various forms of Haseman-Elston regression [10–12] are available. Analyses can use either single- or multi-point IBD information, and models allow for multiple genetic loci (including epistatic interactions) and covariate effects. Asymptotic p values can be validated by obtaining p values from the appropriate permutation distribution.
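The original form of Haseman-Elston regression can be sketched as an ordinary least-squares fit of the squared sib-pair trait difference on the estimated proportion of alleles shared IBD, with a negative slope suggesting linkage. This is only the simplest variant, written for independent sib pairs; the function name and the data are hypothetical, and the later forms, weighting schemes, covariates and permutation p values offered by SIBPAL are not represented.

```python
def haseman_elston_fit(pairs):
    """Least-squares fit of squared trait difference on IBD sharing.

    pairs -- sequence of (trait_sib1, trait_sib2, ibd_proportion) tuples,
             one per independent sib pair.
    Returns (slope, intercept); a negative slope suggests linkage.
    """
    pairs = list(pairs)
    y = [(t1 - t2) ** 2 for t1, t2, _ in pairs]  # squared trait differences
    x = [pi for _, _, pi in pairs]               # estimated IBD proportions
    n = len(pairs)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("IBD proportions must vary across pairs")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: pairs sharing more alleles IBD have more similar traits.
example = [(1.0, 3.0, 0.0), (1.0, 2.0, 0.5), (2.0, 2.0, 1.0)]
slope, intercept = haseman_elston_fit(example)
print(slope, intercept)
```

A one-sided test of the slope against zero would complete the analysis; with sibships larger than two, the pairs are no longer independent and the standard error needs adjusting.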
LODPAL performs analyses based on the lod score formulation for affected sib pairs. The current implementation is of the general conditional logistic model, including the one-parameter model that allows for the inclusion of all affected relative pairs, covariates and epistatic interactions. There is also an option to include discordant and/or unaffected pairs in the analysis.
LODLINK performs model-based lod score calculations for two-point linkage between a main trait and each of a set of markers. The main trait may be a marker or any other trait that exhibits Mendelian transmission. In the latter case, an output file from SEGREG, which includes trait allele frequencies and individual specific penetrance probabilities, can be used as input. LODLINK uses the genotype/phase elimination algorithms [17, 18], together with other enhancements, to perform linkage calculations. Maximised lods are converted to p values, both as upper bounds and based on asymptotic theory. Tests of sex and locus heterogeneity can be performed, the latter based on predefined groups of families or using a mixture model.
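To make the quantity being computed concrete, the sketch below evaluates a two-point lod score in the textbook special case of fully informative, phase-known meioses. It is an assumption-laden illustration only: the function name and counts are hypothetical, and the general pedigree likelihoods that LODLINK handles (unknown phase, incomplete penetrance, missing data) are not covered.

```python
import math

def two_point_lod(theta, recombinants, meioses):
    """Lod score at recombination fraction theta for phase-known meioses.

    lod(theta) = log10[ theta^k (1 - theta)^(n - k) / 0.5^n ],
    where k recombinants are observed among n informative meioses.
    """
    k, n = recombinants, meioses
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if not 0 <= k <= n:
        raise ValueError("recombinants must lie between 0 and meioses")
    if theta == 0.0 and k > 0:
        return float("-inf")  # any recombinant excludes theta = 0
    log_lik = (n - k) * math.log10(1.0 - theta)
    if k > 0:
        log_lik += k * math.log10(theta)
    return log_lik - n * math.log10(0.5)

# Hypothetical data: 2 recombinants in 10 meioses, evaluated on a grid.
for theta in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(theta, round(two_point_lod(theta, 2, 10), 3))
```

In this special case the lod score is maximised at the observed recombinant proportion k/n (capped at 0.5), and it is zero at theta = 0.5 by construction.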
MLOD performs exact multi-point model-based lod score linkage analysis on small pedigrees of arbitrary structure and an approximate analysis on large pedigrees without loops using a Markov Chain Monte Carlo technique. Again, an output file from SEGREG can be used as input to describe the underlying trait locus inheritance model.
ASSOC analyses the association between a continuous and/or binary trait and covariates, which can include marker phenotypes that have been transformed into quantitative covariates, from pedigree data in the presence of familial correlations. It performs likelihood ratio tests and obtains maximum likelihood estimates assuming, in the case of continuous traits, multivariate normality after either of two transformations (George-Elston or Box-Cox), whose parameters can be simultaneously estimated with all the other model parameters. These parameters include polygenic heritability and further familial correlations. Likelihoods can also be corrected for single ascertainment.
TDTEX implements several asymptotic and exact versions of the transmission disequilibrium test (TDT) for testing linkage between a marker and a disease locus in the presence of allelic association or linkage disequilibrium. The exact tests are useful in cases where few data are available or where there are many alleles at the marker locus. Different types of tests are available, including an exact test and a Monte Carlo randomisation test, as well as several exact and asymptotic marginal homogeneity tests.
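For a marker with two alleles, the basic TDT reduces to comparing how often heterozygous parents transmit each allele to affected offspring. The sketch below shows the asymptotic (McNemar-type) statistic alongside an exact binomial version, which illustrates why exact tests matter when counts are small. The function name and counts are hypothetical, and the multi-allelic and marginal homogeneity tests offered by TDTEX are not represented.

```python
import math

def biallelic_tdt(b, c):
    """TDT for a two-allele marker.

    b -- transmissions of allele 1 from heterozygous parents
    c -- transmissions of allele 2 from heterozygous parents
    Returns (chi_square, asymptotic_p, exact_p).
    """
    n = b + c
    if n == 0:
        raise ValueError("no informative transmissions")
    chi_square = (b - c) ** 2 / n
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    asymptotic_p = math.erfc(math.sqrt(chi_square / 2.0))
    # Exact two-sided binomial test of equal transmission (probability 0.5).
    lower_tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    exact_p = min(1.0, 2.0 * lower_tail)
    return chi_square, asymptotic_p, exact_p

# Hypothetical counts: allele 1 transmitted 15 times, allele 2 five times.
chi_square, asymptotic_p, exact_p = biallelic_tdt(15, 5)
print(chi_square, round(asymptotic_p, 4), round(exact_p, 4))
```

With only 20 informative transmissions the exact p value is noticeably larger than the asymptotic one, so the chi-square approximation is somewhat optimistic here.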
AGEON fits an age-of-onset distribution to sibship data comprising both affected and unaffected sibs, allowing for covariates that can influence the mean, variance or skewness of the onset distribution. It then calculates two new traits that can be used to achieve more power in Haseman-Elston regression linkage analysis: disease susceptibility allowing for age and a measure of age of onset.
FUNCTION is an all-purpose utility that calculates new variables for analysis, eg trimmed, winsorised, mean and/or variance-adjusted variables (the adjustment being done separately for user-defined subclasses); quantitative variables defined on the basis of marker genotypes (dominant, additive and recessive allele indicators); and a transmitted allele indicator that allows ASSOC to perform pedigree TDT analysis.
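The genotype-based variables mentioned above can be illustrated with one common coding convention. The exact definitions used by FUNCTION are not given in this review, so the coding below (allele count, carrier status, homozygote status) is an assumption, and the function name is hypothetical.

```python
def allele_indicators(genotype, allele):
    """Code a genotype as quantitative covariates for a chosen allele.

    genotype -- pair of allele labels, e.g. ("A", "B")
    allele   -- the allele of interest
    Assumed coding: additive = number of copies (0, 1 or 2),
    dominant = 1 if at least one copy, recessive = 1 if two copies.
    """
    copies = sum(1 for a in genotype if a == allele)
    return {
        "additive": copies,
        "dominant": int(copies >= 1),
        "recessive": int(copies == 2),
    }

# Hypothetical genotypes at one marker.
for genotype in (("A", "A"), ("A", "B"), ("B", "B")):
    print(genotype, allele_indicators(genotype, "A"))
```

Covariates coded this way can be entered into a regression-type association analysis in the same manner as any other quantitative covariate.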
Some freeware is often used to perform many of the functions performed by S.A.G.E. PAP can be used for segregation analysis, but is based on the usual mixed major gene/polygenic model, rather than on regressive, or finite polygenic mixed, models. It can also simulate phenotypes and estimate expected lod scores. PEDCHECK and PREST can be used to find Mendelian inconsistencies in a way similar to MARKERINFO, but without the detailed marker by marker and family by family output. RELCHECK and RELPAIR can be used to infer relationships within families, comparable to RELTEST, but consider more relationships and also consider pairs of persons across different families. Several linkage programs, including LINKAGE, MERLIN, GENEHUNTER, GENEHUNTER-PLUS, SOLAR, ALLEGRO, FASTLINK, VITESSE, SIMWALK2 and SUPERLINK, collectively perform analyses comparable to the S.A.G.E. linkage programs GENIBD, SIBPAL, LODPAL, LODLINK and MLOD, and have been reviewed recently. The S.A.G.E. programs do not currently perform haplotype analysis or identify probable double recombinants, as do the SIMWALK2 and GENEHUNTER programs. FBAT and PBAT, and PDT, perform association TDT-type analyses on nuclear and extended pedigrees, respectively, comparable to the TDT kind of analysis ASSOC can perform when FUNCTION is used to generate transmitted allele indicators. FISHER can calculate polygenic heritability under a slightly more stringent distributional assumption than that used by ASSOC. EDT has some of the functions of TDTEX, but is based on logistic regression analysis using asymptotic results. Finally, GAP, GAS and ACT are general program packages like S.A.G.E., but each is much more limited in scope. All of these programs, and many others, are listed on the Rockefeller University website.
- Keen KJ, Elston RC: Robust asymptotic sampling theory for correlations in pedigrees. Stat Med. 2003, 22: 3229-3247. 10.1002/sim.1559.
- Bonney GE: On the statistical determination of major gene mechanisms in continuous human traits: Regressive models. Am J Med Genet. 1984, 18: 731-749.
- Fernando RL, Stricker C, Elston RC: The finite polygenic mixed model: An alternative formulation for the mixed model of inheritance. Theor Appl Genet. 1994, 88: 573-580.
- Karunaratne P, Elston RC: A multivariate logistic model (MLM) for analyzing binary family data. Am J Med Genet. 1998, 76: 428-437.
- McLean CJ, Morton NE, Elston RC, et al: Skewness in commingled distributions. Biometrics. 1976, 32: 695-699. 10.2307/2529760.
- Boehnke M: Allele frequency estimation from data on relatives. Am J Hum Genet. 1991, 48: 22-25.
- Idury RM, Elston RC: A faster and more general hidden Markov model algorithm for multipoint likelihood calculations. Hum Hered. 1997, 47: 197-202.
- Olson JM: Relationship estimation by Markov-process models in a sib-pair linkage study. Am J Hum Genet. 1999, 64: 1464-1472. 10.1086/302360.
- Haseman JK, Elston RC: The investigation of linkage between a quantitative trait and a marker locus. Behav Genet. 1972, 2: 3-19. 10.1007/BF01066731.
- Elston RC, Buxbaum S, Jacobs KB, Olson JM: Haseman and Elston revisited. Genet Epidemiol. 2000, 19: 1-17. 10.1002/1098-2272(200007)19:1<1::AID-GEPI1>3.0.CO;2-E.
- Shete S, Jacobs KB, Elston RC: Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: Weighting sums and differences. Hum Hered. 2003, 55: 79-85. 10.1159/000072312.
- Tiwari HK, Elston RC: Linkage of multilocus components of variance to polymorphic markers. Ann Hum Genet. 1997, 61: 253-261.
- Risch N: Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet. 1990, 46: 222-228.
- Olson JM: A general conditional-logistic model for affected-relative-pair linkage studies. Am J Hum Genet. 1999, 65: 1760-1769. 10.1086/302662.
- Goddard KA, Witte JS, Suarez BK, et al: Model-free linkage analysis with covariates confirms linkage of prostate cancer to chromosomes 1 and 4. Am J Hum Genet. 2001, 68: 1197-1206. 10.1086/320103.
- Lange K, Boehnke M: Extensions to pedigree analysis: V. Optimal calculation of Mendelian likelihoods. Hum Hered. 1983, 33: 291-301. 10.1159/000153393.
- Goradia TM, Lange K, Miller PL, et al: Fast computation of genetic likelihoods on human pedigree data. Hum Hered. 1992, 42: 42-62. 10.1159/000154045.
- Morton NE: The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood types. Am J Hum Genet. 1956, 8: 80-96.
- Smith CAB: Testing for heterogeneity of recombination fraction values in human genetics. Ann Hum Genet. 1963, 27: 175-182. 10.1111/j.1469-1809.1963.tb00210.x.
- George VT, Elston RC: Generalized modulus power transformations. Commun Statist. 1988, 17: 2933-2952. 10.1080/03610928808829781.
- Box GEP, Cox DR: An analysis of transformations. J R Stat Soc. 1964, 26: 211-252.
- Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
- Bickeböller H, Clerget-Darpoux F: Statistical properties of the allelic and genotypic transmission/disequilibrium test for multiallelic markers. Genet Epidemiol. 1995, 12: 865-870. 10.1002/gepi.1370120656.
- Pericak-Vance MA, Elston RC, Conneally PM, et al: Age-of-onset heterogeneity in Huntington disease families. Am J Med Genet. 1983, 14: 49-59. 10.1002/ajmg.1320140109.
- Zhu X, Olson JM, Schnell AH, Elston RC: Genetic Analysis Workshop 10: Model-free age-of-onset methods applied to the linkage of bipolar disorder. Genet Epidemiol. 1997, 14: 711-716. 10.1002/(SICI)1098-2272(1997)14:6<711::AID-GEPI27>3.0.CO;2-S.
- Hanson RL, Knowler WC: Analytic strategies to detect linkage to a common disorder with genetically determined age of onset: Diabetes mellitus in Pima Indians. Genet Epidemiol. 1998, 15: 299-315. 10.1002/(SICI)1098-2272(1998)15:3<299::AID-GEPI7>3.0.CO;2-#.
- George V, Tiwari HK, Zhu X, et al: A test of transmission/disequilibrium for quantitative traits in pedigree data, by multiple regression. Am J Hum Genet. 1999, 65: 236-246. 10.1086/302444.
- Dudbridge F: A survey of current software for linkage analysis. Hum Genomics. 2003, 1: 63-65.