Software review | Open | Published:
PBAT: A comprehensive software package for genome-wide association analysis of complex family-based studies
Human Genomicsvolume 2, Article number: 67 (2005)
The PBAT software package (v2.5) provides a unique set of tools for complex family-based association analysis at a genome-wide level. PBAT can handle nuclear families with missing parental genotypes, extended pedigrees with missing genotypic information, analysis of single nucleotide polymorphisms (SNPs), haplotype analysis, quantitative traits, multivariate/longitudinal data and time to onset phenotypes. The data analysis can be adjusted for covariates and gene/environment interactions. Haplotype-based features include sliding windows and the reconstruction of the haplotypes of the probands. PBAT's screening tools allow the user successfully to handle the multiple comparisons problem at a genome-wide level, even for 100,000 SNPs and more. Moreover, PBAT is computationally fast. A genome scan of 300,000 SNPs in 2,000 trios takes 4 central processing unit (CPU)-days. PBAT is available for Linux, Sun Solaris and Windows XP.
Genetic association studies take advantage of the fact that we can measure genotypes directly via either protein electro-phoretic or molecular genetic methods. The goal is to explain the variation in the disease trait of interest using an individual's genotype as a genetic marker. There are two basic types of study design that are used in genetic association analysis: standard (population-based, case-control or cohort) and family-based. Analytical methods appropriate for these two designs are quite different. The family-based design is attractive for many reasons. For one, the design protects against a finding of spurious association, due to population admixture or stratification. The reason for robustness is that the analysis uses parental genotypes to determine the distribution of the test statistic. The analysis cannot be biased by admixture or stratification because the case and control alleles are drawn from the same subjects; therefore, they have the same genetic background. The other key advantage of family-based studies is the way the multiple testing problem can be handled. Using the conditional mean model approach, [1–3] the data are first analysed in a 'screening step'. The analysis of the screening step does not bias the significance level of sub-sequently computed tests. In this screening step, the scientist can look at all possible associations between the markers and traits and select a subset of 'promising' marker - trait combinations -- typically five combinations . Only the selected subset is then put forward to the hypothesis-testing step.
A general paradigm for testing the association between a response variable (disease trait) and a predictor (genotype as a marker) is a regression analysis, since this can accommodate all types of outcomes and all types of predictors. Although regression analysis has many advantages and is widely used in epidemiological investigations, it does require specifying a model for how the trait depends upon the genotype. If the model is incorrect, the power may be reduced. Depending upon study design and analysis, there may also be consequences for the validity. Cordell and Clayton  have described a unified approach to performing genetic association analysis with nuclear families (or case/control data) in a regression context. Case-parent trios are analysed via conditional logistic regression using the case and three pseudo-controls derived from the untransmitted parental alleles. The beauty of the method is that it can be performed using standard statistical software and that additional effects, such as parent-of-origin, effects can be included. The major drawback is that, to date, the technique has not been adapted to include extended pedigrees without splitting them up into nuclear families.
A large number of computer programs are available for family-based association tests, including AFBAC,  QTDT,  FBAT, [7–11] TRANSMIT  and PDT . These software packages primarily focus on the computation of various test statistics, whereas the PBAT software package also exhibits pre- and post-analysis features. The PBAT software can be downloaded from http://www.biostat.harvard.edu/~clange/default.htm.
PBAT is an interactive software package that provides tools for the design and data analysis of family-based association studies. It is available for Windows XP, Linux and UNIX operating systems. The newest version of PBAT (v2.5) includes many features that were not available in earlier versions,  such as haplotype analysis tools that can be invoked using batch mode or user interface, more flexible specifications in power calculations and allowance for discrete trait distribution when applicable. In particular, PBAT incorporates the features of the family-based tests of association (FBAT) package http://www.biostat.harvard.edu/fbat/fbat.htm but provides many additional options for designing association/linkage studies and analysing data with multiple continuous traits. Perhaps the most striking feature, which gives PBATa unique advantage over most available software in the field, is its implementation of the screening techniques -- that is, the conditional mean model approach [1, 2] -- that allow the user to handle the multiple comparison problem at a genome-wide level . Further advantages of PBAT are the analytical power and sample size calculations for family-based association tests [15, 16]. PBAT is especially well suited for quantitative traits while possibly accounting for important predictors.
The cornerstone of the package is the unified approach to FBAT, introduced by Rabinowitz and Laird  and Laird et al. . FBAT builds on the original Transmission Disequilibrium Test (TDT) method,  in which alleles transmitted to affected offspring are compared with the expected distribution of alleles among offspring. It has been generalised so that tests of different genetic models, tests of different sampling designs, tests involving different disease phenotypes, tests with missing parents and tests of different null hypotheses are all in the same framework. In particular, the FBAT statistic is based on a linear combination of offspring genotypes and traits:
where V = Var(S) and Tij represents the coded phenotype (ie the phenotype adjusted for any covariates) of the j-th offspring in family i. Xij denotes the offspring's coded genotype at the locus being tested. It depends on the genetic model under consideration.
The expected distribution is derived using Mendel's law of segregation and conditioning on the sufficient statistics for any nuisance parameters under the null hypothesis, the null hypothesis being 'no linkage and no association' or 'no association, in the presence of linkage'.
PBAT provides methods for a wide range of situations that arise in family-based association studies using FBAT statistics. More specifically, there are two main components: tools for the planning of family-based association studies and data analysis tools. In terms of study planning, PBAT computes the power for study designs that consist of different family types with varying numbers of offspring, under different ascertainment conditions and allowing for missing parental genotypes. The data analysis tools available in PBAT provide options to test linkage or association in the presence of linkage, using (bi-allelic or multi-allelic) marker or haplotype data, single or multiple traits (eg measurements recorded repeatedly over time) that may be quantitative, qualitative or time-to-onset, with nuclear families as well as extended pedigrees. PBAT easily handles covariates and gene/covariate interactions in all computed FBAT statistics. Furthermore, PBAT can also be used for post-study power calculations and construction of the most powerful test statistic. For situations in which multiple traits and markers are given, PBAT's screening tools reduce the large pool of traits and markers and select the most promising combinations in terms of the FBAT statistic.
Using PBAT's screening tools the present authors have shown that genome-wide association studies using families are realisable in terms of data analysis . The key concept of the implemented screening techniques is the conditional mean model approach, [1, 2] for which the data space is partitioned into two independent testing sets. This allows one to control the type I error rates and to overcome one of the most important statistical hurdles when analysing genome-wide association studies with thousands of markers: the multiple comparison problem. The screening technique maintains its protective character for extended datasets with a few hundred thousand SNPs. It should be noted that, in general, adding more SNPs comes at the cost of power loss when corrections for multiple testing need to be applied (eg Bonferroni-type corrections to control type I error). These screening methods are hardly affected by adding 'non-causal' SNPs. In addition, they are robust against effects of population stratification and admixture, since the final decision in the screening process is based on FBATs, which guard against these confounding factors. Finally, PBAT's screening tools are most successful in detecting common disease susceptibility loci. This is particularly attractive in the light of the HapMap project,  which aims to describe the common patterns of genetic variation in humans.
The problem of detecting rare disease-associated SNPs remains; however, this is a general problem rather than a problem specifically related to the screening techniques of PBAT. Applying the authors' screening tools using the haplo-type features of PBAT (eg using sliding windows acknowledging the linkage disequilibrium structures present in the data) may be more beneficial. This is work in progress. TRAN-SMIT  is another program for transmission disequilibrium testing that uses marker haplotypes based on several closely linked markers. By contrast with PBAT, however, TRANSMIT leads to elevated false-positive rates in the presence of population admixture and does not handle quantitative traits . Moreover, it has no built-in functions for performing screening on a genome-wide level.
PBAT's data analysis tools have been extensively validated. These include the data analysis tools using univariate and multivariate traits,  multivariate/longitudinal FBAT models,  time-to-onset traits (Su; personal communication), haplotype analysis (Randolph; personal communication) and genomic screening . PBAT is under constant development. Future developments include refined screening tools and guidelines that apply to haplotype-based genomic screening, power calculations for haplotype analysis and further effort towards a PBAT compendium of commands and an extensive documentation for its users.
Lange C, DeMeo DL, Silverman E, et al: Using the non-informative families in family-based association tests: A powerful new testing strategy. Am J Hum Genet. 2003, 73: 801-811. 10.1086/378591.
Lange C, Lyon H, DeMeo DL, et al: A new powerful non-parametric two-stage approach for testing multiple phenotypes in family-based association studies. Hum Hered. 2003, 56: 10-17. 10.1159/000073728.
Van Steen K, McQueen MB, Herbert A, et al: Genomic screening in family based association testing for quantitative traits. Nat Genet. 2005, (under review)
Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: Application to HLA in Type 1 diabetes. Am J Hum Genet. 2002, 70: 124-141. 10.1086/338007.
Thomson G: Mapping disease genes: Family-based association studies. Am J Hum Genet. 1995, 57: 487-498.
Abecasis GR, Cardon LR, Cookson WOC: A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000, 66: 279-292. 10.1086/302698.
Horvath S, Laird N: Discordant sibship test for disequili-brium/transmission: No need for parental data. Am J Hum Genet. 1998, 63: 1886-1897. 10.1086/302137.
Horvath S, Xin X, Laird NM: The family based association test method: Computing means and variances for general statistics. Technical Report. 2000, [http://www.biostat.harvard.edu/fbat/fbattechreport.ps]
Horvath S, Xu X, Laird NM: The family based association test method: Strategies for studying general genotype-phenotype associations. Eur J Hum Genet. 2001, 9: 301-306. 10.1038/sj.ejhg.5200625.
Laird NM, Horvath S, Xu X: Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000, 19 (1): S36-S42. 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M.
Lake S, Blacker D, Laird NM: Family-based tests of association in the presence of linkage. Am J Hum Genet. 2000, 67: 1515-1525. 10.1086/316895.
Clayton D: A generalization of the transmission/disequilibrium test for uncertain-haplotype transmission. Am J Hum Genet. 1999, 65: 1170-1177. 10.1086/302577.
Martin ER, Monks SA, Warren LL, et al: A test for linkage and association in general pedigrees: the pedigree disequilibrium test (PDT). Am J Hum Genet. 2000, 67: 146-154. 10.1086/302957.
Lange C, DeMeo DL, Silverman EK, et al: PBAT: Tools for family-based association studies. Am J Hum Genet. 2004, 74: 367-369. 10.1086/381563.
Lange C, Laird NM: Power calculations for a general class of family-based association tests: Dichotomous traits. Am J Hum Genet. 2002, 67: 575-584.
Lange C, DeMeo D, Laird NM: Power calculations for a general class of family-based association tests: Quantitative traits. Am J Hum Genet. 2002, 71: 575-584. 10.1086/342406.
Rabinowitz D, Laird NM: A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000, 50: 227-233. 10.1159/000022920.
Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: The insulin gene region and insulin- dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 65: 578-580.
The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
Horvath S, Xu X, Lake SL, et al: Family-based tests for associating haplotypes with general phenotype data: Application to asthma genetics. Genet Epidemiol. 2004, 26: 61-69. 10.1002/gepi.10295.
DeMeo DL, Lange C, Silverman EK, et al: Univariate and multivariate family-based association analysis of the IL-13 ARG130GLN polymorphism in the Childhood Asthma Management Program. Genet Epidemiol. 2002, 23: 335-348. 10.1002/gepi.10182.
Lange C, Van Steen K, Andrew T, et al: A family-based association test for repeatedly measured quantitative traits adjusting for unknown environmental and/or polygenic effects. Stat Appl Genet Mol Biol. 2004, 1 (1): Article 17-[http://www.bepress.com/sagmb/vol3/iss1/art17]