Genetic association tests: a method for the joint analysis of family and case-control data
- Courtney Gray-McGuire^{1, 2},
- Murielle Bochud^{3},
- Robert Goodloe^{4} and
- Robert C. Elston^{1}
DOI: 10.1186/1479-7364-4-1-2
© Henry Stewart Publications 2009
Received: 24 August 2009
Accepted: 24 August 2009
Published: 1 October 2009
Abstract
With the trend in molecular epidemiology towards both genome-wide association studies and complex modelling, the need for large sample sizes to detect small effects and to allow for the estimation of many parameters within a model continues to increase. Unfortunately, most methods of association analysis have been restricted to either a family-based or a case-control design, resulting in the lack of synthesis of data from multiple studies. Transmission disequilibrium-type methods for detecting linkage disequilibrium from family data were developed as an effective way of preventing the detection of association due to population stratification. Because these methods condition on parental genotype, however, they have precluded the joint analysis of family and case-control data, although methods for case-control data may not protect against population stratification and do not allow for familial correlations. We present here an extension of a family-based association analysis method for continuous traits that will simultaneously test for, and if necessary control for, population stratification. We further extend this method to analyse binary traits (and therefore family and case-control data together) and accurately to estimate genetic effects in the population, even when using an ascertained family sample. Finally, we present the power of this binary extension for both family-only and joint family and case-control data, and demonstrate the accuracy of the association parameter and variance components in an ascertained family sample.
Keywords
ascertainment correction, family-based association, linkage disequilibrium
Introduction
For much of the past three decades, linkage analysis has been the primary tool for the initial exploration of complex diseases believed to have an underlying genetic aetiology and has resulted in many large cohorts of family data with DNA samples available. Unfortunately, however, the ability of linkage analysis to localise potentially segregating susceptibility or protective genotypes has been limited to, at best, regions of 5-10 centimorgans (cM) in length and, at worst, 20 cM in length [1]. This limitation has led to a rise in popularity of methods for detecting allelic (or gametic) association in candidate genes, in candidate linkage regions or genome-wide. This allelic association, coupled with linkage, allows for much more precise localisation of regions housing disease genes because, if it is due to linkage disequilibrium (LD), it will span a much shorter distance within the genome than is usually found by linkage analysis. With this rise in association studies, there has been a trend toward the collection of unrelated case-control samples, often with the abandonment of large family studies. Certainly, these samples are much easier to obtain than are family samples, but they also have certain limitations, even within the context of recent genome-wide association successes [2, 3]. Further, allelic association can be due to factors other than LD (which we define as the combination of allelic association and linkage) or pleiotropy (a marker allele itself being involved in the aetiology of the disease) [4].
Population stratification, which exists when multiple strata within a given sample differ with respect to either the underlying trait distribution or the marker genotype distribution (and which leads to spurious association when it occurs with respect to both), is a commonly cited cause of false-positive findings in case-control association studies (eg Knowler et al. [5]) and the most likely cause in genetic epidemiological studies. This threat of increased type I error rate has led to the development of many methods that guard against the effects of population stratification. The first two general classes use unlinked loci and can both be subsumed under the term 'genomic control': (1) test for population stratification using unlinked regions of the genome; (2) allow for the population stratification, as estimated from unlinked regions of the genome, when performing an analysis of allelic association. The third general class guards against population stratification by using non-transmitted alleles as controls (ie a case-control design perfectly matched for ethnicity by appropriately using family data). While these methods are effective in controlling for population stratification, they each have their limitations with respect to power, efficiency and flexibility.
The main limitation of genomic control methods [6–8] is the requirement for genotypes at many loci unlinked to the disease allele. In the context of a genome-wide association scan, the choice of the best regions to use as a 'control' is difficult, as there is no guarantee that the markers being used are indeed unlinked to a disease gene. Applying this method to a candidate gene study suffers from the same limitation, but also requires significant additional cost and labour to type enough (and how many is enough?) additional loci.
Summary of TDT-type methods and their respective features
Incorporation of: | ||||||||
---|---|---|---|---|---|---|---|---|
Method/Reference | Missing parents | Multiple alleles | Parental phenotypes | Quantitative traits | Extended pedigrees | Different family structures | Multiple markers | Covariates |
Curtis (1997)^{S1} | * | * | ||||||
S-TDT (Spielman and Ewens, 1998)^{S2} | * | * | * | |||||
DAT (Boehnke and Langefeld, 1998)^{S3} | * | * | ||||||
SDT (Horvath and Laird, 1998)^{S4} | * | * | * | |||||
NFS (Whittemore and Tu, 2000)^{S5} | * | * | * | * | * | |||
TRANSMIT (Clayton, 1999)^{S6} | * | * | * | * | ||||
RC-TDT (Knapp, 1999)^{S7} | * | |||||||
1-TDT (Sun et al., 1999)^{S8} | * | * | ||||||
Martin et al. (1997)^{S9} | * | * | * | |||||
George et al. (1999)^{S10} | * | * | * | * | * | |||
P-TDT (Abecasis et al., 2000)^{S11} | * | * | * | * | * | |||
Bickeboller and Clerget-Darpoux(1995)^{S12} | * | |||||||
Spielman and Ewens (1996)^{S13} | * | |||||||
Purcell et al. (2005)^{S14} | * | * | * | |||||
TDT(max) (Morris, 1997)^{S15} | * | |||||||
Lazzeroni and Lange (1998)^{S16} | * | * | ||||||
Monks and Kaplan(2000)^{S17} | * | * | * | * | ||||
Xiong et al. (1998)^{S18} | * | * | ||||||
Fan and Jung(2002)^{S19} | * | * | * | * | ||||
TDT(Q1) - TDT(Q5) (Allison, 1997)^{S20} | * | * | ||||||
Rabinowitz (1997)^{S21} | * | * | * | |||||
Allison et al. (1999)^{S22} | * | * | * | |||||
Sun et al. (2000)^{S23} | * | * | * | * | ||||
Schaid and Rowlands (2000)^{S24} | * | * | * | |||||
Waldman et al. (1999)^{S25} | * | * | ||||||
Sinsheimer et al. (2000)^{S26} | * | * | * | * | * | |||
Kistner and Weinberg (2004)^{S27} | * | * | * | |||||
QTDT (Abecasis et al., 2000)^{S28} | * | * | * | * | ||||
Zhu and Elston(2001)^{S29} | * | * | * | * | ||||
PDT (Martin, 2000)^{S30} | * | * | ||||||
Goring and Terwilliger(2000)^{S31} | * | * | ||||||
Clayton and Jones(1999)^{S32} | * | * | ||||||
ETDT (Sham and Curtis, 1995)^{S33} | * | |||||||
TDT-EX (Cleves et al., 1997)^{S34} | * | * | ||||||
Fulker (1999)^{S35} | * | * | * | |||||
Fan et al. (2002)^{S36} | * | * | * |
Novel methodological approaches for the analysis of LD in family data include a class of variance component approaches and what are termed family-based association tests (FBATs). Fulker first proposed a test for both between-family association (or, more appropriately, 'among-family', as we typically expect more than two families), which models the phenotypic means given the marker locus genotypes, and within-family association (linkage) by using identity-by-descent status in modelling the sib-pair variance-covariance structure [16]. It was shown that the within-family component provides an estimate of the additive genetic effect unaffected by population stratification. Sham et al. [17] extended this method to incorporate larger sibships, dominance variance and multi-allelic markers. It was further extended to sibships with or without parental genotypes, and to multi-generational pedigree data, by Abecasis et al. [18]. FBAT is a unified approach to family-based tests of association that 'compares tests for association to their conditional distributions given the minimal sufficient statistics under the null hypothesis for the genetic model, sampling plan and population admixture' [19], in two steps: (1) building a test statistic that is sensitive to the co-variation of the trait and marker; and (2) finding the distribution of the test statistic under the null hypothesis. Broadly speaking, the test statistic is the 'covariance between a function of the genotype and a function of the trait' [20], the dependent variable being the offspring genotype. While the first step gives great flexibility in the choice of test statistic, the second is designed to ensure correct type I error rates (ie validity), regardless of population admixture, genetic model or ascertainment scheme [21].
These approaches are broad, in that they can handle different genetic models, different family structures (including extended pedigrees) and disease phenotypes (qualitative or quantitative, single or multiple). As with the original TDT, however, only heterozygous parents are informative in this framework; non-family data cannot be included and, in the case of FBAT, even if one does have a random sample, the effect size of the allele of interest is not estimated. This can lead to a dramatic loss of effective sample size and therefore potential power and/or precision when compared with an unconditional method such as that presented here and demonstrated in our previous work [22]. Other methods more robust to these particular limitations have been recently proposed for assessing quantitative traits in family-based samples [23] and binary traits in case-control samples, including related individuals [24, 25]. Neither of these methods, however, includes an ascertainment correction (central to pooling family and case-control samples), nor do they estimate family or cluster effects. Further, the former does not allow for the inclusion of case-control data and the latter does not allow for the inclusion of covariates.
Based on the limitations of the existing strategies for testing LD, we present an alternative two-stage family-based association test in which we combine attributes of two existing methods, first to test whether population stratification is present and then to test appropriately for, and estimate the effect of, LD between a marker and a continuous trait. We further offer extensions of this method that can be applied to binary traits and hence allow an analysis of case-control data together with family data. We illustrate the power of this method for various sample sizes and structures, specifically for joint family and population-based samples that cannot be analysed by existing methods. We also extend this method to allow for the accurate estimation of association parameters and residual variance components from ascertained family data, and demonstrate, via simulation, that this method is effective in controlling ascertainment bias.
Methods
Continuous traits
in which the number of A_{1} alleles, along with other covariates, is a predictor of phenotype. In this model, z_{ i } is coded such that the allelic effect of substituting A_{2} for A_{1} is $\frac{1}{2}\delta $. The random components include p_{ i }, a random polygenic effect; f_{ i } and ${f}_{i}^{\prime}$, random nuclear family effects; m_{ i }, a random marital effect; s_{ i }, a random sibship effect; and ε_{ i }, a random residual individual effect. In addition, the generalised power transformation (h) [28], applied to both the trait and its predictors, when simultaneously estimated under a model that assumes normality of the residuals, helps assure both linearity and normality, thus making the model robust to non-independence (as can be the case in large pedigrees). There are two random nuclear family effects f_{ i } and ${f}_{i}^{\prime}$ in model (1) because each individual is potentially a member of two different nuclear families, one in which we include the individual's parents and siblings and one in which we include the individual's spouse and children. All the random effects in the model are assumed to be mutually independent and, after the transformation, normally distributed with zero means and variances ${\sigma}_{p}^{2}$, ${\sigma}_{f}^{2}={\sigma}_{f'}^{2}$, ${\sigma}_{m}^{2}$, ${\sigma}_{s}^{2}$ and ${\sigma}_{\epsilon}^{2}$, such that $V\left[h\left({y}_{i}\right)\right]={\sigma}_{p}^{2}+{\sigma}_{f}^{2}+{\sigma}_{f'}^{2}+{\sigma}_{m}^{2}+{\sigma}_{s}^{2}+{\sigma}_{\epsilon}^{2}$ for families with more than two generations, and $V\left[h\left({y}_{i}\right)\right]={\sigma}_{p}^{2}+{\sigma}_{f}^{2}+{\sigma}_{m}^{2}+{\sigma}_{s}^{2}+{\sigma}_{\epsilon}^{2}$ for families with only two generations. It is important to note that the total variance V[h(y_{i})] is made the same for all individuals by adjusting the residual variance ${\sigma}_{\epsilon}^{2}$ separately for each person (see Elston et al. [27] for details).
This model has recently been further extended to allow for each person to have more than two nuclear family effects, as can occur when there are half-sibships in the data, and other kinds of common environmental cluster effects.
As currently implemented in the S.A.G.E. program ASSOC, the likelihood is maximised numerically over all parameters, and standard errors are determined by numerical double differentiation of the log likelihood. Also, p-values, based on the likelihood ratio or a Wald test, can be calculated for the transformation parameters, any of the variance components and any regression coefficients. They are two-sided for all transformation parameters and regression coefficients, and one-sided for all variance components.
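The distinction between two-sided and one-sided p-values can be illustrated with a minimal Python sketch. This is not the ASSOC implementation; the function names are hypothetical, a standard normal reference is assumed for the Wald statistic and a chi-square distribution with one degree of freedom for the likelihood ratio statistic, and halving the likelihood ratio p-value for a variance component is our assumption about how a one-sided boundary test might be formed.

```python
import math


def wald_p_two_sided(estimate, std_error):
    """Two-sided Wald p-value (e.g. for a regression coefficient)."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))


def wald_p_one_sided(estimate, std_error):
    """Upper-tail Wald p-value (e.g. for a variance component >= 0)."""
    z = estimate / std_error
    # P(Z > z) for a standard normal Z
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def lrt_p_1df(loglik_full, loglik_reduced, one_sided=False):
    """Likelihood ratio p-value against chi-square with 1 df.

    With one_sided=True the p-value is halved (assumption: the null
    value lies on the boundary of the parameter space).
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    p = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail
    return 0.5 * p if one_sided else p
```

For instance, an estimate 1.96 standard errors from zero gives a two-sided p-value of about 0.05 and a one-sided p-value of about 0.025.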
This method is meant to follow existing evidence of linkage because it does not control for population stratification. With the growing popularity of genome-wide and candidate gene association studies, however, there are likely to be many instances in which linkage is not known a priori. For this reason, we suggest - rather than automatically resorting to cumbersome genomic control methods or a less powerful TDT-type design - using a two-stage procedure to (1) test for a stratification effect and then (2) test for allelic association. If there is no stratification, then the association resulting from model (1) can be interpreted as LD effects. If there is stratification, then one can use the same regression model framework to perform a test like those mentioned above (TDT and FBAT) that conditions on parental genotype.
where η_{ i } is the random effect comprising all of the familial, sibling, marital, polygenic and individual specific errors outlined above. George et al. [29] gave details of how the indicator variable x_{1i} is constructed to form a TDT-type test by substituting it for z_{ i } in regression model (1). We point out that, because it includes components of a TDT-type test, it requires family data. The variable x_{2i} is formed analogously for the other allele of an SNP. In the case of a multiallelic marker, all the other alleles can be pooled into a single allele for this purpose. To test for a stratification effect, we first test the null hypothesis that the genotypic effect is half the difference of the transmitted allele effects; that is, ${\beta}_{2}-{\beta}_{1}=\frac{1}{2}\delta $. If we do not reject this null hypothesis at some liberal significance level such as p = 0.2, we infer that there is no evidence of stratification, set β_{2} = β_{1} = 0 and estimate the allele A_{1} effect by $\frac{1}{2}\delta $. If there is any evidence of stratification, we set δ = 0 and estimate the allele A_{1} effect by β_{2} - β_{1}. Thus, once either β_{2} = β_{1} = 0 or δ = 0 is imposed, as appropriate, we return to a framework in which we simultaneously estimate the effect of allele A_{1}, the residual variance components and one or more transformation parameters. We can use asymptotic results to obtain confidence intervals for all parameter estimates in the usual way, and the method can be extended to estimate genotype effects rather than allele effects. While other approaches like the principal component approach proposed by Zhu et al. [30] work well within this regression framework and are potentially more informative when many SNPs are available, this new approach is a viable option, even if only one or a few SNPs are typed (ie in the case of a candidate gene study).
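The branching logic of the two-stage procedure can be sketched in Python as follows. This is an illustration only: the function and argument names are hypothetical, the p-value for the stratification test is assumed to have been obtained elsewhere, and in practice the model would be refitted under the chosen constraint rather than reusing the unconstrained estimates as done here.

```python
def two_stage_allele_effect(delta_hat, beta1_hat, beta2_hat,
                            p_stratification, threshold=0.2):
    """Pick the estimator of the allele A1 effect.

    p_stratification is the p-value for H0: beta2 - beta1 = delta / 2.
    threshold is the liberal significance level (0.2 in the text).
    """
    if p_stratification >= threshold:
        # H0 not rejected: no evidence of stratification,
        # so constrain beta1 = beta2 = 0 and use delta / 2.
        return {"stratified": False, "effect": 0.5 * delta_hat}
    # H0 rejected: evidence of stratification,
    # so constrain delta = 0 and use beta2 - beta1.
    return {"stratified": True, "effect": beta2_hat - beta1_hat}
```

With hypothetical estimates δ = 0.4, β_{1} = 0.25 and β_{2} = 0.5, a stratification p-value of 0.5 returns an effect of 0.2 (half of δ), whereas a p-value of 0.01 returns 0.25 (β_{2} - β_{1}).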
Extension to binary traits
We have shown that the iterative maximisation procedure currently implemented in our software (ASSOC) is quite robust to these initial estimates, regardless of family size or misspecified analysis model [32]. We do note, however, that the ease of maximisation and the accuracy of estimates depend on both the sample size and the number of parameters estimated. In general, we recommend at least 20 observations per parameter estimated to ensure accuracy (which can be assessed based on standard errors we provide).
In matrix (7), N_{ k } is the number of persons in the k^{ th } pedigree, p is the total number of regression coefficients in equation (4), including the intercept, and β_{ j } represents any one of them.
Combining case-control and family data
One of the benefits of the regression framework outlined above is the flexibility to include families of any size or structure. This is vital, given, as mentioned above, that the magnitude of the effects associated with any given gene for a common complex disease is likely to be small. Certainly, provided we are only interested in hypothesis testing (we will discuss parameter estimation later), unmatched case-control data can be easily included as single person pedigrees with only an individual-specific variance. In this framework, however, matched case-control data can also be included by simply specifying the matched pairs as members of the same cluster (a cluster, of course, being a special case of a pedigree). We include in the model a cluster-specific variance component ${\sigma}_{c}^{2}$, such that $V\left[\text{logit}\left({y}_{i}\right)\right]={\sigma}_{c}^{2}+{\sigma}_{\epsilon}^{2}$, and then adjust the residual variance ${\sigma}_{\epsilon}^{2}$ so as to keep the total variance the same for all individuals. This approach does not limit the case-control cluster size or composition, as does conditional logistic regression.
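The adjustment of the residual variance can be illustrated with a short Python sketch. The function name and the numerical values are hypothetical; the only idea taken from the text is that the residual variance absorbs whatever is left of a common total variance after the shared components that apply to a given person.

```python
def residual_variance(total_variance, shared_components):
    """Residual variance that keeps the total variance fixed.

    shared_components lists the variance components (e.g. the
    cluster-specific variance) that apply to this individual.
    """
    resid = total_variance - sum(shared_components)
    if resid <= 0:
        raise ValueError("shared components exceed the total variance")
    return resid


# Member of a matched case-control cluster: cluster variance applies.
matched = residual_variance(1.0, [0.25])
# Unmatched individual (single-person pedigree): no shared component.
unmatched = residual_variance(1.0, [])
```

Here the matched individual has a residual variance of 0.75 and the unmatched individual 1.0, so both have the same total variance of 1.0.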
Correcting for ascertainment
The underlying assumption of the method outlined above is that the sampling units (families, individuals, case-control clusters) represent a random sample from the same population. This is often not the case - particularly when families were sampled for a linkage study - and cannot be the case for case-control samples. The sample association and variance component estimates are thus not representative of the population values. We therefore present an ascertainment correction specifically for family data (and briefly address an extension to case-control data in the discussion).
where L(P) is the final likelihood, L(P_{ All }) is the likelihood for the whole sample on the assumption of random sampling of families and L(P_{ PSF }) is the likelihood for the family members in the PSF, similarly on the assumption of random sampling. (For single ascertainment, only the probands are included in the PSF.) Maximising this likelihood (8) leads to consistent estimators of all the parameters [34].
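A toy Python sketch may help to show how such a correction behaves. Equation (8) is not reproduced here, so the sketch assumes the usual conditional form L(P) = L(P_{ All })/L(P_{ PSF }). It further assumes, purely for illustration, that family members are independent with a common probability p of being affected and that each family was ascertained through affected probands only; the model in this paper allows for familial correlations, which the sketch ignores. All function names are hypothetical.

```python
import math


def corrected_loglik(p, families):
    """log L(All) - log L(PSF) summed over families.

    families: list of (n_members, n_affected, n_probands), where the
    probands are affected members forming the PSF.
    """
    total = 0.0
    for n, r, k in families:
        log_all = r * math.log(p) + (n - r) * math.log(1.0 - p)
        log_psf = k * math.log(p)  # PSF holds k affected probands
        total += log_all - log_psf
    return total


def maximise_on_grid(families, step=0.001):
    """Crude grid search for the p that maximises the corrected likelihood."""
    grid = [i * step for i in range(1, int(round(1.0 / step)))]
    return max(grid, key=lambda p: corrected_loglik(p, families))


# Three hypothetical families, each with one affected proband.
families = [(4, 2, 1), (5, 1, 1), (3, 2, 1)]
p_hat = maximise_on_grid(families)
naive = sum(r for _, r, _ in families) / sum(n for n, _, _ in families)
```

In this toy setting the corrected estimate is close to 2/9 (affected non-probands divided by all non-probands), whereas the uncorrected proportion affected is 5/12, which overstates p because every family was required to contain an affected proband.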
Power calculations for family data
Total variance of the non-major gene component of the continuous liability underlying the binary trait and the proportion of that variance represented by each variance component for each model
The columns Polygenic to Random give the simulated proportion of the total variance for each variance component; a dash indicates that the component was not simulated.
Model name | Total variance | Polygenic | Familial | Sibling | Marital | Random |
---|---|---|---|---|---|---|
P | 0.3125 | 0.200 | - | - | - | 0.800 |
FP | 0.3750 | 0.167 | 0.167 | - | - | 0.667 |
SP | 0.3750 | 0.167 | - | 0.167 | - | 0.667 |
MP | 0.3750 | 0.167 | - | - | 0.167 | 0.667 |
SMP | 0.4375 | 0.143 | - | 0.143 | 0.143 | 0.571 |
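The proportions in this table are consistent with each non-residual component having a variance of 0.0625 and the random component a variance of 0.25 (for example, 0.3125 × 0.200 = 0.0625). These absolute values are inferred from the table rather than stated in the text, so the following Python sketch, with a hypothetical function name, should be read as a consistency check only.

```python
def variance_proportions(components):
    """Return the total variance and each component's share of it."""
    total = sum(components.values())
    shares = {name: value / total for name, value in components.items()}
    return total, shares


# Inferred absolute variances for model FP (assumption, see lead-in).
fp_total, fp_shares = variance_proportions(
    {"polygenic": 0.0625, "familial": 0.0625, "random": 0.25}
)
# Inferred absolute variances for model SMP.
smp_total, smp_shares = variance_proportions(
    {"polygenic": 0.0625, "sibling": 0.0625, "marital": 0.0625, "random": 0.25}
)
```

Under these inferred values the FP model has a total variance of 0.375 with shares of roughly 0.167, 0.167 and 0.667, and the SMP model a total of 0.4375 with shares of roughly 0.143 for each shared component and 0.571 for the random component.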
Creation of a random sample was achieved by simply collecting individuals (and thus their entire pedigree) from the simulated population in the order in which they were encountered until the desired sample size (1,000 individuals) was met. All replicates were analysed using both the simulated correlation model and an 'incorrect' correlation model. For example, if data were simulated to have both a familial and polygenic effect, they were analysed under a model (denoted as FP) including both effects and one (denoted P) that included only a polygenic effect. Names for all the models investigated are enumerated in Table 1.
Type I error was calculated as the number of replicates simulated under the null hypothesis meeting a recommended cut-off point for genome-wide association studies by the Wellcome Trust of α = 5 × 10^{-7} [35]. Power was calculated as the number of replicates meeting the same criterion but simulated under the alternate hypothesis.
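As a minimal sketch of this calculation in Python, the empirical rejection rate can be taken as the number of replicates meeting the cut-off divided by the total number of replicates. The function name and the example p-values are hypothetical, and whether the boundary value counts as a rejection is our assumption.

```python
def rejection_rate(p_values, alpha=5e-7):
    """Proportion of replicates whose p-value meets the cut-off.

    Applied to replicates simulated under the null hypothesis this is
    the empirical type I error; under the alternative it is the power.
    """
    if not p_values:
        raise ValueError("no replicates supplied")
    rejected = sum(1 for p in p_values if p <= alpha)
    return rejected / len(p_values)


# Four hypothetical replicates, two of which meet the cut-off.
example_rate = rejection_rate([1e-8, 0.01, 3e-7, 0.5])
```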
Sample size estimation for joint family and case-control data
In addition to the simulations outlined above, in order to demonstrate the usefulness of this method for the joint analysis of family and population or case-control data, we analytically estimated, for a combination of unrelated individuals (50 per cent cases, 50 per cent controls), nuclear families and/or extended pedigrees, the number of individuals required to detect a given effect size at a fixed type I error rate and power.
This is an extension of Nick et al., [36] who gave approximate results for exactly two founders and a dominant mode of inheritance, and assumes the quantitative trait locus and marker variants are in perfect LD. We derived $\text{var}\left(\widehat{\beta}\right)$ more generally for n_{ fi } founders, for both additive and dominant inheritance, as well as for relative pair specific correlations. We also allowed for incomplete LD by applying a 1/(0.8)-fold factor (equivalent to r^{2} = 0.8, D' ≈ 0.9).
For these calculations, we made some simplifying but conservative assumptions. First, we assumed that founder pairs have a correlation of 0 and that parent-offspring correlations (ρ_{ po }) and sib-sib correlations (ρ_{ ss }) correspond to a residual heritability of 2 ρ_{ po } = 2 ρ_{ ss } and that grandparent-grandchild pairs have a residual correlation of ρ_{ gg } corresponding to a residual heritability of 4 ρ_{ gg }. We further assumed, for simplicity, that all persons with the same genotype at the disease locus have the same disease risk and that LD between the locus and the closest SNP, assuming the same allele frequencies at the two loci, is given by r^{2} . Finally, we imposed the type I error recommended for genome-wide association studies by the Wellcome Trust of α = 5 × 10^{-7}, [35] and assumed a fixed power equivalent to a sample of 500 cases and 500 controls (0.92 for an additive model and 0.86 for a dominant model), and a locus-specific heritability of ${h}_{ls}^{2}$ - see equation (11) - of 0.05. We did this for samples of nuclear families only, extended pedigrees only and mixtures of both, for various sample sizes, and then, demonstrated the approximate linearity of the trend in sample size needed to detect the same effect given a fixed power and type I error.
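Two of the simplifying relationships used in these calculations can be written out in a short Python sketch: the inflation of the required sample size by 1/r^{2} for incomplete LD, and the residual heritabilities implied by the relative-pair correlations. The function names are hypothetical, and the example sample size of 1,000 is chosen only for illustration.

```python
def inflate_for_incomplete_ld(n_perfect_ld, r2=0.8):
    """Sample size needed when marker and trait locus have LD r^2 < 1."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_perfect_ld / r2


def heritability_from_first_degree(rho):
    """Residual heritability implied by a parent-offspring or sib-sib correlation."""
    return 2.0 * rho


def heritability_from_grandparental(rho_gg):
    """Residual heritability implied by a grandparent-grandchild correlation."""
    return 4.0 * rho_gg


# A design needing 1,000 individuals under perfect LD needs about
# 1,250 when r^2 = 0.8.
n_needed = inflate_for_incomplete_ld(1000, r2=0.8)
```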
Accuracy of association and variance component estimates
In addition to generating random family samples (RAND), we also generated a sample of singly ascertained families (ASC) by assigning each family a probability of entering the sample based on the number of affected members in the family: P(family enters sample) = N_{ a }/N, where N_{ a } is the number of affected members in the family and N is the number of family members. Each simulation output file was parsed and, if a family had an affected member, the above probability was calculated and, based on the appropriate Bernoulli distribution, the family was either ascertained or not until the desired sample size was met.
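The ascertainment step can be sketched in Python as follows. This is an illustration under stated assumptions and not the simulation code used for the paper: families are represented simply as (family size, number affected) pairs, the function name is hypothetical, and sampling stops as soon as the number of individuals collected reaches the target.

```python
import random


def ascertain_families(families, target_size, rng):
    """Select families with probability N_a / N until the target is met.

    families: iterable of (n_members, n_affected) in simulation order.
    rng: a random.Random instance, so that runs are reproducible.
    """
    sample, n_individuals = [], 0
    for n_members, n_affected in families:
        if n_individuals >= target_size:
            break
        if n_affected == 0:
            continue  # families without affected members cannot enter
        # Bernoulli trial with success probability N_a / N
        if rng.random() < n_affected / n_members:
            sample.append((n_members, n_affected))
            n_individuals += n_members
    return sample


rng = random.Random(2009)
population = [(4, 1), (5, 0), (3, 2), (6, 3), (4, 4)] * 200
asc_sample = ascertain_families(population, target_size=1000, rng=rng)
```

Families with a larger fraction of affected members are more likely to enter the sample, which is why the uncorrected estimates from such a sample would not reflect the population values.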
Accuracy of the association parameter as ln odds of being affected given two copies of the disease allele versus one copy for a sample size of 1,000 individuals
Model* | Statistic | Nuclear RAND | Nuclear ASC | Extended RAND | Extended ASC |
---|---|---|---|---|---|
FP-FP | Est | 2.529 | 2.479 | 2.511 | 2.561 |
FP-FP | rMSE | 0.1709 | 0.1210 | 0.1530 | 0.2030 |
FP-P | Est | 2.537 | 2.509 | 2.517 | 2.524 |
FP-P | rMSE | 0.1789 | 0.1510 | 0.1591 | 0.1661 |
P-FP | Est | 2.780 | 2.655 | 2.763 | 2.722 |
P-FP | rMSE | 0.1775 | 0.0529 | 0.1603 | 0.1196 |
P-P | Est | 2.780 | 2.648 | 2.768 | 2.724 |
P-P | rMSE | 0.1775 | 0.0458 | 0.1655 | 0.1212 |
Results
Type I error and power in family data
Under both additive and dominant models, the association method we present for detecting diallelic trait loci has stable type I error rates of less than 0.05 (mean = 0.0452) for the RAND sample of both the nuclear families and extended pedigrees.
Sample size estimation in joint family and case-control data
Note that the equivalence of samples is shown, assuming (1) a common minor allele frequency (q = 0.5), for which the family data are not as informative as are the case-control data, and (2) an allele frequency under which the family sample is fairly informative (q = 0.1). As expected, the nuclear family sample (assuming an additive model with q = 0.5) requires the fewest additional unrelated individuals to detect a given effect, and the extended pedigree sample (assuming a dominant model with q = 0.5) requires the most additional unrelated individuals. The extended pedigree sample, under a dominant model with q = 0.5 or 0.1, performed similarly, as did the nuclear family sample under a dominant model with q = 0.5 or 0.1. The extended pedigree and nuclear family samples (assuming an additive model) require approximately the same number of additional unrelated persons to detect the given effect size.
Estimation using family samples
Accuracy of the association parameter
The estimates of the association parameter (expressed as the ln odds of two copies of the disease allele versus one copy) were, on average, 2.615 for the nuclear family sample and 2.636 for the extended family sample - not too dissimilar to the simulated value of 2.48. The RAND and ASC samples had similar averages of 2.648 and 2.603, respectively. Note that we purposely generated the data under a (probit) model different from the (logit) model used to analyse the data, to illustrate the robustness of the analysis model, and that the accuracy of the ascertainment correction is seen in the small difference in parameter estimates between the RAND and ASC samples. The average estimate for the ascertained extended families (2.633) was overestimated by a factor of 1.06, a slightly larger deviation from the simulated value than seen in the nuclear family samples, which had an average of 2.573 - only 1.03 times the simulated value (and the closest to it). The rMSE averaged over all models was 0.210 and all estimates were within a factor of 1.15 of the simulated value (Table 2).
Accuracy of the association parameter as ln odds of being affected given two copies of the disease allele versus no copies for a sample size of 1,000 individuals
Model* | Statistic | Nuclear RAND | Nuclear ASC | Extended RAND | Extended ASC |
---|---|---|---|---|---|
FP-FP | Est | 5.058 | 4.958 | 5.022 | 5.122 |
FP-FP | rMSE | 0.2936 | 0.3936 | 0.3295 | 0.2296 |
FP-P | Est | 5.074 | 5.018 | 5.034 | 5.048 |
FP-P | rMSE | 0.2777 | 0.3336 | 0.3176 | 0.3036 |
P-FP | Est | 5.560 | 5.310 | 5.526 | 5.444 |
P-FP | rMSE | 0.1010 | 0.5696 | 0.3536 | 0.4357 |
P-P | Est | 5.560 | 5.296 | 5.536 | 5.448 |
P-P | rMSE | 0.3195 | 0.5836 | 0.3437 | 0.4316 |
Accuracy of the variance components
Accuracy of variance components as proportions of the total variance, N = 1,000
Parameter | Model | Statistic | Nuclear RAND | Nuclear ASC | Extended RAND | Extended ASC |
---|---|---|---|---|---|---|
Marital | MP-MP | Est | 0.079 | 0.068 | 0.0775 | 0.0696 |
Marital | MP-MP | rMSE | 0.0877 | 0.0990 | 0.0894 | 0.0693 |
Marital | SMP-SMP | Est | 0.1743 | 0.0636 | 0.1291 | 0.0555 |
Marital | SMP-SMP | rMSE | 0.0316 | 0.0781 | 0.0141 | 0.0866 |
Sibling | SP-SP | Est | 0.0574 | 0.0669 | 0.0549 | 0.0713 |
Sibling | SP-SP | rMSE | 0.1095 | 0.1000 | 0.1122 | 0.0959 |
Sibling | SMP-SMP | Est | 0.0549 | 0.0554 | 0.057 | 0.0579 |
Sibling | SMP-SMP | rMSE | 0.0883 | 0.1118 | 0.0860 | 0.1090 |
Polygenic | FP-FP | Est | 0.0896 | 0.0604 | 0.1711 | 0.1388 |
Polygenic | FP-FP | rMSE | 0.0775 | 0.1068 | 0.0000 | 0.0283 |
Polygenic | MP-MP | Est | 0.0741 | 0.0655 | 0.0775 | 0.0643 |
Polygenic | MP-MP | rMSE | 0.0927 | 0.1015 | 0.0894 | 0.1030 |
Polygenic | P-P | Est | 0.063 | 0.0805 | 0.0602 | 0.0755 |
Polygenic | P-P | rMSE | 0.1039 | 0.1196 | 0.1068 | 0.1249 |
Polygenic | SMP-SMP | Est | 0.2169 | 0.0617 | 0.1759 | 0.0559 |
Polygenic | SMP-SMP | rMSE | 0.0742 | 0.0800 | 0.0332 | 0.0860 |
Polygenic | SP-SP | Est | 0.0962 | 0.0723 | 0.0782 | 0.0603 |
Polygenic | SP-SP | rMSE | 0.0707 | 0.0949 | 0.0889 | 0.1068 |
Familial | FP-FP | Est | 0.1133 | 0.0122 | 0.033 | 0.0139 |
Familial | FP-FP | rMSE | 0.0539 | 0.3521 | 0.1342 | 0.1530 |
Familial | P-FP | Est | 0.0198 | 0.0032 | 0.048 | 0.0102 |
Familial | P-FP | rMSE | 0.0200 | 0.0032 | 0.0480 | 0.0100 |
The accuracy of the variance component estimates was affected by the sampling scheme, as expected. The RAND samples resulted in estimates closest to the simulated population values, but ASC samples yielded estimates reasonably reflective of the population values as well.
Discussion
The prediction of the future of genetic studies of complex disease is ever changing, but what remains true is that we must have methods of analysis that are both powerful and flexible. Whether searching for common genes with small effect or rare genes with large effect, we shall need large samples that are likely to come only from combining family, population-based and case-control data and we must have methods that analyse these combinations. In fact, the use of family samples was recently highlighted by Visscher et al. [2], showing that including related individuals results in only a small loss of power but large gains in terms of quality control, flexibility of tests to be performed and ability to control for population stratification. Our results support these assertions and we further recommend that association methods must account for environmental covariates (which are certain to play a role in complex diseases) and must not be restricted by, but rather be effective in controlling for, population stratification. These tools will be powerful in aiding both genome-wide association and candidate gene studies.
We have presented here a method to test and estimate the association between an allele or genotype and a continuous or binary trait, as well as approaches to combining family and case-control data that are powerful as well as robust to ascertainment. We also present a two-stage procedure to determine the need for a test that is robust to stratification. A purist would argue that a two-stage approach could affect type I error rate. The important thing to note, however, is that this decision should be made on the basis of the significance, not the magnitude, of the difference in the two estimates of marker effect, β_{2} - β_{1} versus $\frac{1}{2}\delta $, because a study whose sample size is powered to detect a small effect will automatically be powered to detect the small biases that stratification could induce.
We further present a method for correcting for ascertainment and accurately estimating association parameters, as well as variance components, even in ascertained family data. Two things should be pointed out, however. First, we examined only single ascertainment. When a more complex scheme is used to collect families, such that most of the sample is in the PSF and/or the PSF is undefined, the estimates for the association parameter and the variance components will reflect only the effect in the sample. Note, however, that the test for association is still valid and it is only the parameter estimates that are affected. Secondly, when combining data from a case-control sample and an ascertained family sample, for the parameter estimates from this method to be reflective of the population from which the samples were drawn, certain assumptions must be met: (1) the cases in the population-based data should have been phenotyped in a manner similar to the cases in the family data; (2) there must be appropriate correction for ascertainment; and (3) the non-cases or 'controls', although matched, should apart from this also be a random sample - if they are a completely random sample from the same population, it is possible to estimate a relative risk, while if they are a random sample of those showing absence of the phenotype of interest, only an odds ratio can be estimated. If the phenotype is sufficiently rare such that choosing controls based on absence of the trait of interest is essentially the same as random sampling, then the relative risk and odds ratio will be essentially the same.
Because this is not the case for common complex diseases, we suggest, and will investigate further in future studies, two other ways of combining case-control and family data for accurate estimation: (1) express the likelihood for the case-control data in terms of odds ratios, which are functions of the parameters in the pedigree likelihood, and constrain the maximum likelihood for them such that the marginal probability of disease, given a set of regressors, is finite; [37] and (2) multiply the likelihood by a factor that summarises any information we have about the prevalence of disease independent of the sample data. This factor would be expressed as $\mu^{R}(1-\mu)^{N-R}$, where μ is the prevalence of the disease, expressed as a function of the parameters in the full likelihood at particular values of the covariates in the model, and R reflects our external information about the number of affected persons in a population of size N. For example, if we have an estimate of μ, $\widehat{\mu}$, and its standard error (s.e.), we can estimate reasonable values for N and R by noting that $\text{s.e.}=\sqrt{\widehat{\mu}\left(1-\widehat{\mu}\right)/N}$, and hence $N=\widehat{\mu}\left(1-\widehat{\mu}\right)/{\left(\text{s.e.}\right)}^{2}$ and $R=N\widehat{\mu}$. It is known that constraining likelihood maximisation so that the estimated disease prevalence is equal to its true prevalence can be equivalent to a correction for single ascertainment [38]. These two options offer simple solutions for 'non-traditional' samples and will be examined in future work.
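As a worked illustration of option (2), the sketch below converts an external prevalence estimate and its standard error into the quantities N and R, and evaluates the logarithm of the factor μ^R (1 - μ)^(N - R). The function names and the numerical values (a prevalence of 0.05 with standard error 0.01) are hypothetical and serve only to show the arithmetic; how the factor enters the full pedigree likelihood is not shown here.

```python
import math


def pseudo_counts(mu_hat, se):
    """Return (N, R) implied by a prevalence estimate and its standard error.

    Uses s.e. = sqrt(mu_hat * (1 - mu_hat) / N), so that
    N = mu_hat * (1 - mu_hat) / s.e.^2 and R = N * mu_hat.
    """
    n = mu_hat * (1.0 - mu_hat) / se ** 2
    r = n * mu_hat
    return n, r


def log_prevalence_factor(mu, n, r):
    """Log of mu^R * (1 - mu)^(N - R), to be added to a log-likelihood."""
    return r * math.log(mu) + (n - r) * math.log(1.0 - mu)


if __name__ == "__main__":
    # Hypothetical external information: prevalence 5% with s.e. 1%.
    n, r = pseudo_counts(mu_hat=0.05, se=0.01)
    print(f"N = {n:.2f}, R = {r:.2f}")
    # The factor is largest when the model-implied prevalence equals R / N.
    for mu in (0.03, 0.04, 0.05, 0.06, 0.07):
        print(f"mu = {mu:.2f}: log factor = {log_prevalence_factor(mu, n, r):.3f}")
```

A smaller standard error yields a larger N, so more precise external information pulls the model-implied prevalence more strongly towards the external estimate.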
The general method described in this paper, which is currently being implemented in the program package S.A.G.E., is more flexible than other TDT-type methods and more efficient (in the practical sense) than genomic control methods. Further, we have shown the power of this method for binary traits in various types of family, population-based and combined samples at a constant type I error rate. While we concede that a population-based sample could sometimes detect a smaller effect size than the respective family-based samples, as mentioned earlier, these scenarios assume the same degree of heterogeneity and sporadic cases in all samples after correction for ascertainment. We know that this is not likely to be the case, as family samples are designed to decrease greatly the number of sporadic cases and, at least to some extent, reduce the amount of heterogeneity in the sample in a manner that makes appropriate ascertainment correction difficult. Further, for most complex phenotypes, family samples of at least the size examined here (and usually much larger) already exist and, as shown in Figures 2 and 3, can drastically reduce the number of population-based samples needed to detect even very small effects. Other benefits of family data, such as an increased ability to assess the effects of shared environment and parent-of-origin effects and to detect errors, among many others, are beyond the scope of this paper, but must also be considered. Finally, while having to correct for ascertainment is one of the reasons often cited for using population-based versus family data, we have demonstrated that, in principle, our method can be used to estimate fairly accurately the effect size of a given allele of interest for a given population, even if using an ascertained sample.
For situations where most of the sample is in the PSF (and hence likelihood (8) contains little information), or the PSF is ill-defined, we suggest constraining the likelihood to give an accurate estimate of the disease prevalence. Future investigation will determine the accuracy of estimates obtained in this manner.
Declarations
Acknowledgements
This work was supported in part by a US Public Health Service resource grant (RR03655) from the National Center for Research Resources, research grant (GM28356) from the National Institute of General Medical Sciences, Cancer Center support grant (P30CAD43703) and Transdisciplinary Research in Energetic and Cancer grant (U54CA116867), both from the National Cancer Institute, and training grant (HL07567) from the National Heart, Lung and Blood Institute, as well as from the Swiss National Foundation for Science (PROSPER: 3200BO-111362/1 and 3233BO-111361/1). Some of the results of this paper were obtained by using the program package S.A.G.E., which is supported by a US Public Health Service Resource grant (RR03655) from the NCRR.
References
- Cordell HJ: 'Sample size requirements to control for stochastic variation in magnitude and location of allele-sharing linkage statistics in affected sibling pairs'. Ann Hum Genet. 2001, 65: 491-502. 10.1046/j.1469-1809.2001.6550491.x.
- Visscher PM, Andrew T, Nyholt DR: 'Genome-wide association studies of quantitative traits with related individuals: Little (power) lost but much to be gained'. Eur J Hum Genet. 2008, 16: 387-390. 10.1038/sj.ejhg.5201990.
- Altshuler D, Clark AG: 'Genetics. Harvesting medical information from the human family tree'. Science. 2005, 307: 1052-1053. 10.1126/science.1109682.
- Elston RC: 'Linkage and association to genetic markers'. Exp Clin Immunogenet. 1995, 12: 129-140.
- Knowler WC, Williams RC, Pettitt DJ, Steinberg AG: 'Gm3;5,13,14 and type 2 diabetes mellitus: An association in American Indians with genetic admixture'. Am J Hum Genet. 1988, 43: 520-526.
- Pritchard JK, Rosenberg NA: 'Use of unlinked genetic markers to detect population stratification in association studies'. Am J Hum Genet. 1999, 65: 220-228. 10.1086/302449.
- Gorroochurn P, Heiman GA, Hodge SE, Greenberg DA: 'Centralizing the non-central chi-square: A new method to control for population stratification in genetic case-control association studies'. Genet Epidemiol. 2006, 30: 277-289. 10.1002/gepi.20143.
- Devlin B, Roeder K: 'Genomic control for association studies'. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
- Spielman RS, McGinnis RE, Ewens WJ: 'Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM)'. Am J Hum Genet. 1993, 52: 506-516.
- Rubinstein P, Walker M, Carpenter C, Carrier C, et al: 'Genetics of HLA disease associations: The use of the haplotype relative risk (HRR) and the 'haplo-delta' (Dh) estimates in juvenile diabetes from three racial groups'. Hum Immunol. 1981, 3: 384-
- Abecasis GR, Cardon LR, Cookson WO, Sham PC, et al: 'Association analysis in a variance components framework'. Genet Epidemiol. 2001, 21 (Suppl 1): S341-S346.
- Curtis D, Sham PC: 'A note on the application of the transmission disequilibrium test when a parent is missing'. Am J Hum Genet. 1995, 56: 811-812.
- Abel L, Muller-Myhsok B: 'Maximum-likelihood expression of the transmission/disequilibrium test and power considerations'. Am J Hum Genet. 1998, 63: 664-667. 10.1086/301975.
- Tu IP, Whittemore AS: 'Power of association and linkage tests when the disease alleles are unobserved'. Am J Hum Genet. 1999, 64: 641-649. 10.1086/302253.
- Muller-Myhsok B, Abel L: 'Genetic analysis of complex diseases'. Science. 1997, 275: 1328-1329.
- Fulker DW, Cherny SS, Sham PC, Hewitt JK: 'Combined linkage and association sib-pair analysis for quantitative traits'. Am J Hum Genet. 1999, 64: 259-267. 10.1086/302193.
- Sham PC, Cherny SS, Purcell S, Hewitt JK: 'Power of linkage versus association analysis of quantitative traits, by use of variance-components models, for sibship data'. Am J Hum Genet. 2000, 66: 1616-1630. 10.1086/302891.
- Abecasis GR, Cardon LR, Cookson WO: 'A general test of association for quantitative traits in nuclear families'. Am J Hum Genet. 2000, 66: 279-292. 10.1086/302698.
- Rabinowitz D, Laird N: 'A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information'. Hum Hered. 2000, 50: 211-223. 10.1159/000022918.
- Laird NM, Horvath S, Xu X: 'Implementing a unified approach to family-based tests of association'. Genet Epidemiol. 2000, 19 (Suppl 1): S36-S42.
- Horvath S, Xu X, Laird NM: 'The family based association test method: Strategies for studying general genotype - phenotype associations'. Eur J Hum Genet. 2001, 9: 301-306. 10.1038/sj.ejhg.5200625.
- Gray-McGuire C, Song Y, Sinha R, Won S, et al: 'Comparison of family based association methods and designs for genome-wide association scans'. Proceedings of the Genetic Analysis Workshop 15. 2006, 14-18. [http://www.geneticepi.org]
- Chen WM, Abecasis GR: 'Family-based association tests for genomewide association scans'. Am J Hum Genet. 2007, 81: 913-926. 10.1086/521580.
- Thornton T, McPeek MS: 'Case-control association testing with related individuals: A more powerful quasi-likelihood score test'. Am J Hum Genet. 2007, 81: 321-337. 10.1086/519497.
- Bourgain C, Hoffjan S, Nicolae R, Newman D, et al: 'Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus'. Am J Hum Genet. 2003, 73: 612-626. 10.1086/378208.
- George VT, Elston RC: 'Testing the association between polymorphic markers and quantitative traits in pedigrees'. Genet Epidemiol. 1987, 4: 193-201. 10.1002/gepi.1370040304.
- Elston RC, George VT, Severtson F: 'The Elston-Stewart algorithm for continuous genotypes and environmental factors'. Hum Hered. 1992, 42: 16-27. 10.1159/000154043.
- George V, Elston RC: 'Generalized modulus power transformation'. Commun Stat Theory Methods. 1988, 17: 2933-2952. 10.1080/03610928808829781.
- George V, Tiwari HK, Zhu X, Elston RC: 'A test of transmission/disequilibrium for quantitative traits in pedigree data, by multiple regression'. Am J Hum Genet. 1999, 65: 236-245. 10.1086/302444.
- Zhu X, Li S, Cooper RS, Elston RC: 'A unified association analysis approach for family and unrelated samples correcting for stratification'. Am J Hum Genet. 2008, 82: 352-365. 10.1016/j.ajhg.2007.10.009.
- Zhu X, Elston RC, Bielefeld RA: 'Testing disease marker association in pedigree data'. Proceedings of the Annual Meeting of the American Statistical Association. 1997, 34-43.
- Gray-McGuire C: 'Assessment of a variance component method for binary phenotype data: Model misspecification and effects of ascertainment'. 2004, Case Western Reserve University, Cleveland, OH, USA, (Thesis)
- Gourieroux C, Monfort A: 'Pseudo maximum likelihood methods'. Handbook of Statistics. 1993, 11: 335-362.
- Ginsburg E, Malkin I, Elston RC: 'Sampling correction in linkage analysis'. Genet Epidemiol. 2004, 27: 87-96. 10.1002/gepi.20008.
- Wellcome Trust Case Control Consortium: 'Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls'. Nature. 2007, 447: 661-678. 10.1038/nature05911.
- Nick TG, George V, Elston RC, Wilson AF: 'Statistical validity for testing associations between genetic markers and quantitative traits in family data'. Genet Epidemiol. 1995, 12: 145-161. 10.1002/gepi.1370120204.
- Prentice RL, Pyke R: 'Logistic disease incidence models and case-control studies'. Biometrika. 1979, 66: 403-411. 10.1093/biomet/66.3.403.
- Burton PR: 'Comment on "Ascertainment adjustment in complex diseases"'. Genet Epidemiol. 2002, 23: 214-218. 10.1002/gepi.10199.