Population genetic analysis of ascertained SNP data
© Henry Stewart Publications 2004
Received: 16 January 2004
Accepted: 16 January 2004
Published: 1 March 2004
The large single nucleotide polymorphism (SNP) typing projects have provided an invaluable data resource for human population geneticists. Almost all of the available SNP loci, however, have been identified through a SNP discovery protocol that will influence the allelic distributions in the sampled loci. Standard methods for population genetic analysis based on the available SNP data will, therefore, be biased. This paper discusses the effect of this ascertainment bias on allelic distributions and on methods for quantifying linkage disequilibrium and estimating demographic parameters. Several recently developed methods for correcting for the ascertainment bias will also be discussed.
Keywords: single nucleotide polymorphisms; ascertainment bias; statistical analysis; linkage disequilibrium; demographic parameters
Many of the resources previously allocated to genomic sequencing have recently been devoted to the typing and discovery of single nucleotide polymorphisms (SNPs), resulting in a rapid increase in the amount of publicly available SNP data. In August 2003, the public dbSNP database at NCBI contained 268,374 SNPs with allele frequency information (Build 116). In August 2002 (Build 106), it had contained only 47,577 SNPs. In one year, the number of SNPs in dbSNP with frequency information increased more than five-fold.
The major objective of most SNP typing and discovery studies is to develop a resource for genetic mapping studies [2, 3]. The large SNP datasets will be invaluable in both pedigree-based studies and association mapping studies [4–6]. They also provide, however, a remarkable resource for human population genetic analysis. Population geneticists will be interested in estimating recombination rates and levels of linkage disequilibrium [7–10], as well as parameters relating to the demographics and ancestry of human populations, using the available SNP data [11, 12]. In addition, large SNP datasets can be used to scan the genome for regions that may have been targeted by selection [13–15]. SNPs targeted by selection are presumably more likely to be disease associated [16, 17].
Unfortunately, most of the standard analytical methods usually used for population genetic inferences are not applicable to the majority of the SNP data. Almost all available population genetic methods assume that the analysed loci have been sampled randomly among the pool of all loci. Most SNP loci, however, were originally identified through an SNP discovery process that tends to select loci with particular allelic distributions [18–21]. This introduces an ascertainment bias which, if uncorrected, will bias parameter estimates and lead to false inferences [22, 23].
The aim of this review is to discuss the effects of the ascertainment bias for some common SNP discovery protocols and also to discuss some recently developed methods for correcting the ascertainment bias problem. If not otherwise stated, it will be assumed that SNPs are effectively unlinked. This will usually be a reasonable assumption for datasets containing multiple SNPs scattered throughout the genome. The case of linked SNPs can similarly be dealt with, however, and is discussed elsewhere [11, 23, 24].
There are probably more SNP discovery protocols (ascertainment schemes) than there are research groups involved in SNP discovery, and it is unlikely that any particular method of addressing the problem of ascertainment bias is appropriate for all of them. Most ascertainment schemes, however, share the feature that SNPs are originally discovered in a relatively small sample and are subsequently typed in a larger sample for the purpose of population genetic inferences. Small samples have a smaller probability of containing rare alleles than large samples. The effect is, therefore, that in the final typed sample there is an excess of SNP loci with common alleles and a deficiency of loci with rare alleles. This deficiency of rare alleles in the typed sample is a common feature of many SNP datasets. Here, the term ascertainment sample is used to denote the sample used originally to discover the SNPs, while typed sample is used to denote the final sample used for population genetic inferences.
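The dependence of SNP discovery on sample size can be made concrete with a small calculation. The sketch below assumes a biallelic site whose minor allele has population frequency p and an ascertainment sample of d gene copies drawn independently from the population; the function name is illustrative and not taken from the source.

```python
def discovery_probability(p: float, d: int) -> float:
    """Probability that a biallelic site with allele frequency p is
    polymorphic (both alleles present) in a random sample of d gene copies."""
    # The site is missed when all d copies carry the same allele.
    return 1.0 - p ** d - (1.0 - p) ** d


# Compare a shallow (d = 2) and a deeper (d = 20) discovery sample.
for p in (0.01, 0.05, 0.2, 0.5):
    shallow = discovery_probability(p, 2)
    deep = discovery_probability(p, 20)
    print(f"p = {p:4.2f}  d = 2: {shallow:.4f}  d = 20: {deep:.4f}")
```

Under these assumptions, an allele at 1 per cent frequency is discovered in a two-chromosome sample with probability of about 0.02, whereas an allele at 50 per cent frequency is discovered with probability 0.5, which is the source of the excess of common alleles in the typed sample.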
The ascertainment sample usually consists of two or more gene copies from a panel. A panel is a group of individuals whose DNA has been used in the SNP discovery process. SNPs are usually originally identified from an alignment of sequences or a collection of sequences, the SNP discovery alignment. In some cases, all individuals in a panel have been typed in the ascertainment sample and are represented in the SNP discovery alignment. In such cases the depth (d) of the SNP discovery alignment is equal to twice the number of diploid individuals in the panel (n_p). Very often, however, only a subset of the panel haplotypes (gene copies) have been typed for each SNP in the ascertainment sample (d < n_p). This may occur, for example, if the SNP discovery process is based on data obtained from shotgun sequencing. Although one may know how many sequences were included in the alignment for each SNP that was discovered, one will not know the true depth of the alignment because the sequences have been sampled with replacement from the panel sequences, ie the alignment in the ascertainment sample for a particular SNP may contain the same sequence more than once. Furthermore, the information regarding the depth of the alignment for each SNP may have been lost through time. This may occur, for example, because the number of sequences in the alignment has increased through time and no records have been kept regarding the number of sequences on which the SNP discovery process was based. SNP discovery protocols may, therefore, differ in the assumptions one can make regarding the depth of the ascertainment sample. Ascertainment schemes may also vary depending on whether singletons or low-frequency SNPs have been eliminated directly, on various aspects relating to the SNP verification process (eg re-sequencing) and on the method used for base-calling.
Finally, the effect of the ascertainment bias will differ depending on whether the sequences used for SNP discovery are a subset of the data in the typed sample or whether there is no overlap between these two sets of data.
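The gap between the number of sequences in an alignment and its true depth under sampling with replacement can be illustrated as follows. This is a minimal sketch that assumes each sequence in the alignment is drawn uniformly and independently from the panel chromosomes, which real shotgun data need not satisfy; the function and argument names are hypothetical.

```python
def expected_distinct(panel_chromosomes: int, reads: int) -> float:
    """Expected number of distinct panel chromosomes represented among
    `reads` sequences drawn uniformly with replacement from the panel."""
    n_chrom = panel_chromosomes
    # A given chromosome is missed by every read with probability (1 - 1/N)^reads.
    return n_chrom * (1.0 - (1.0 - 1.0 / n_chrom) ** reads)


# Example: a panel of 24 diploid individuals (48 chromosomes).
for reads in (2, 5, 10, 48):
    print(reads, round(expected_distinct(48, reads), 2))
```

When the number of sequences is small relative to the panel, the expected true depth is close to the number of sequences; the discrepancy grows as the alignment deepens.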
Effect on the frequency spectrum
The distribution of X in Equation (1) gives the expected frequency spectrum for this model. The particular version shown in Equation (1) assumes that it is known which allelic type is mutant and which allelic type is ancestral.
Sherry et al. used a similar expression to test 'goodness of fit' of a standard neutral model to data of population frequencies of Alu elements, in a case where the Alu elements had originally been detected in a single diploid genome.
The frequency spectrum when it is not known which allele is ancestral, the so-called folded frequency spectrum, is given by Pr(X = x) + Pr(X = n - x).
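The ascertainment effect on the unfolded spectrum can be illustrated numerically. The sketch below assumes the standard neutral model, under which the expected number of sites with x mutant copies in a sample of n is proportional to 1/x, and assumes that a locus is retained only if it is variable in an ascertainment sample drawn as a random subset of d of the n typed chromosomes. The function name and the choice n = 20 are illustrative, and the expression is a reconstruction under these assumptions rather than a transcription of Equation (1).

```python
from math import comb


def ascertained_spectrum(n: int, d: int) -> list[float]:
    """Return Pr(X = x | locus ascertained) for x = 1, ..., n - 1, where the
    ascertainment sample is a random subset of d of the n typed chromosomes."""
    weights = []
    for x in range(1, n):
        # Probability that the d sampled chromosomes are not all of one type.
        p_variable = 1.0 - (comb(x, d) + comb(n - x, d)) / comb(n, d)
        # Neutral expectation (proportional to 1/x) times ascertainment probability.
        weights.append(p_variable / x)
    total = sum(weights)
    return [w / total for w in weights]


for d in (2, 5, 20):
    spectrum = ascertained_spectrum(20, d)
    print(f"d = {d:2d}: Pr(X = 1) = {spectrum[0]:.3f}, Pr(X = 10) = {spectrum[9]:.3f}")
```

With d = n the ascertainment has no effect and the neutral spectrum is recovered; as d decreases, the proportion of singletons falls and intermediate frequency classes gain weight.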
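Folding can be sketched in a few lines. The example assumes an unfolded spectrum stored as a dictionary from mutant allele count x to probability, and uses the neutral 1/x spectrum for n = 20 purely as illustrative input. When n is even, the class x = n/2 coincides with n - x and is counted once, so the summed expression applies to x different from n - x.

```python
def fold(unfolded: dict[int, float], n: int) -> dict[int, float]:
    """Fold an unfolded spectrum over x = 1..n-1 into minor allele counts
    x = 1..floor(n/2) by summing the classes x and n - x."""
    folded = {}
    for x in range(1, n // 2 + 1):
        if x == n - x:
            # The middle class is its own mirror image; do not double count.
            folded[x] = unfolded[x]
        else:
            folded[x] = unfolded[x] + unfolded[n - x]
    return folded


n = 20
harmonic = sum(1.0 / k for k in range(1, n))
neutral = {x: (1.0 / x) / harmonic for x in range(1, n)}
folded = fold(neutral, n)
print({x: round(p, 3) for x, p in folded.items()})
```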
and Y and Z are the number of mutant alleles in the ascertainment sample and the pooled sample, respectively.
Figures 1b and 2b show the unfolded and folded frequency spectrum, respectively, in a sample of size n = 20. Notice that the effect of the ascertainment bias is even stronger than in the case where the ascertainment sample sequences were a subset of the typed sample sequences.
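The case of non-overlapping samples can be sketched numerically using the quantities Y and Z defined above. The sketch assumes the standard neutral model for the pooled sample of n + d chromosomes, so that Pr(Z = z) is proportional to 1/z, a hypergeometric split of the Z mutant copies between the ascertainment sample (Y) and the typed sample (X = Z - Y), and inclusion of a locus only when 0 < Y < d. The function name is illustrative and the code is a reconstruction under these assumptions, not a transcription of the source's equation.

```python
from math import comb


def spectrum_disjoint(n: int, d: int) -> list[float]:
    """Return Pr(X = x | locus ascertained) for x = 0, ..., n when the
    ascertainment sample (size d) and typed sample (size n) do not overlap."""
    m = n + d  # size of the pooled sample
    joint = [0.0] * (n + 1)
    for x in range(n + 1):
        for y in range(1, d):  # ascertained loci have 0 < Y < d
            z = x + y
            p_z = 1.0 / z  # neutral weight for Z = z (unnormalised)
            # Hypergeometric probability that y of the z mutant copies fall
            # in the ascertainment sample.
            p_y = comb(z, y) * comb(m - z, d - y) / comb(m, d)
            joint[x] += p_z * p_y
    total = sum(joint)
    return [v / total for v in joint]


spectrum = spectrum_disjoint(20, 2)
print([round(p, 3) for p in spectrum])
```

Unlike the subset case, the classes X = 0 and X = n have positive probability here, because a locus that is variable in the ascertainment sample may be invariable in the typed sample.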
Effect on inferences of demographic parameters
With the exception of population growth parameters, the effect of the ascertainment bias on inferences regarding demographic parameters has not been extensively analysed in the literature. Population growth has the effect of skewing the frequency spectrum towards an excess of rare alleles. Because the effect on the frequency spectrum of most ascertainment schemes is in the opposite direction (towards an increase in the number of intermediate frequency alleles), the effect of the ascertainment bias will be to reduce or eliminate the evidence for population growth. For example, Nielsen found that there was little or no evidence for population growth in a dataset of 39 SNP loci. The lack of evidence for population growth was probably caused by the effects of the ascertainment scheme originally used to discover the SNPs. Polanski and Kimmel found that estimates of population growth rates are extremely sensitive to the exact details of the ascertainment scheme. They noted that even if just a small number of SNPs with low frequency polymorphisms have been eliminated due to presumed sequencing errors, this can substantially alter estimates of population growth rates.
The ascertainment scheme also has a profound effect on inferences regarding population structure. Wakeley et al. showed that under a model of human demographics, the effect of an ascertainment bias would be to overestimate the rate of migration between populations. The complex ascertainment scheme considered by Wakeley et al. would preferentially select SNPs in genomic fragments with very old coalescent times. Because of the older coalescent times, these fragments of the genome have had a greater opportunity for migration in their genealogical history than fragments with very recent coalescent times, leading to ascertainment bias towards estimates of lower population subdivision.
If only one population is represented in the ascertainment sample, the effective population size of this population will be overestimated relative to the population size of other populations included in the typed sample. This is an issue that has been explored extensively for restriction fragment length polymorphism (RFLP) data [32–35]. Most of the available human RFLP data are based on polymorphisms that were originally identified in European populations. Initial analysis of these data led to the conclusion that the effective population size of Europeans is as large, or larger, than the effective population size of Africans. Most other data, however, such as mitochondrial DNA data, have shown that the effective population size of Africans in fact is much larger than the effective population size of Europeans [32, 36]. Once discovered, the higher African heterozygosity, which had been obscured by ascertainment bias, became an important feature of 'out of Africa' theories. To date, this is probably the best practical example of how ascertainment bias may lead to erroneous conclusions in human genetics.
In all cases, it is assumed that the ascertainment sample is a subset of the typed sample and that a locus is included in the analysis if there is variability in the ascertainment sample pooled from the two populations. Notice (see Figure 3a) that the frequency spectrum, when the typed sample is used as the ascertainment sample, is not very different from the frequency spectrum expected from a panmictic population, although there is a slight shift towards fewer ancestral alleles of low frequency. Because the ascertainment scheme is based on variability in any of the populations, one of the populations may now be invariable (X = 0 or X = 20), while the other population is variable. In this case, the expected value of the popular measure of population subdivision, FST, is approximately 0.12.
If the ascertainment sample consists of two chromosomes from each population (Figure 3b), the frequency spectrum is further skewed towards including more high frequency ancestral alleles. In this case, the expected value of FST is approximately 0.17. The ascertainment scheme has increased the level of expected heterozygosity both within and between populations, but the combined effect is to increase the value of FST in this case. If ascertainment is based on a sample from only one population, the ascertainment population (Figure 3c), the frequency spectra of the two populations will differ because invariable loci may not exist in the population from which the data have been ascertained. The expected value of FST is now 0.13. Furthermore, among the variable site patterns, the ascertainment population has relatively more mutant alleles of intermediate frequency. The expected heterozygosity will be higher in the ascertainment population than in the other population.
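For readers unfamiliar with the measure, FST for a single biallelic locus in two populations can be sketched as the proportional reduction in heterozygosity within populations relative to the pooled population. The sketch below treats the allele frequencies as known and weights the two populations equally; it is a simple heterozygosity-based definition chosen for illustration, not the estimator or the calculation behind the expected values quoted above.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """FST = (H_T - H_S) / H_T for one biallelic locus with allele
    frequencies p1 and p2 in two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # heterozygosity of the pooled population
    if h_total == 0.0:
        raise ValueError("locus is invariable in both populations")
    # Mean within-population heterozygosity.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


print(fst_two_populations(0.5, 0.5))  # identical frequencies
print(fst_two_populations(0.0, 0.5))  # one population invariable
print(fst_two_populations(0.0, 1.0))  # fixed difference
```

The second call shows why ascertainment schemes that admit loci invariable in one population matter: such loci contribute relatively large single-locus values.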
Effect on linkage disequilibrium
The effect of the SNP discovery protocol on measures of linkage disequilibrium (LD) was examined by Nielsen and Signorovitch and by Clark et al. The effect depends on the measure of LD and the exact details of the ascertainment protocol. For a protocol in which the ascertainment sample is a subset of the typed sample and identical for all loci, the standardised linkage disequilibrium coefficient, D', will be underestimated [28, 38]. Another measure of LD, the squared allele frequency correlation coefficient, r², however, will be overestimated in the presence of this type of ascertainment bias. In both cases, the effect is quite strong. For example, if the population recombination rate (ρ = 2N_eR, where N_e = effective population size and R = recombination rate) equals 1, r² is increased 2.5 times if n = 100 and d = 5. Akey et al. also showed that in the case of population subdivision, where only one or a subset of the populations is represented in the ascertainment sample, the ascertainment bias may be even more pronounced for the populations not represented in the ascertainment sample.
By contrast, the ascertainment protocol has much less effect on Hudson's composite likelihood estimator of ρ. Typically, the bias in the estimate is less than 20 per cent and can be almost negligible, depending on the exact details of the ascertainment scheme. In general, if ascertainment for all loci is based on the same set of chromosomes, the ascertainment bias will be towards lower values of ρ.
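The two LD measures respond differently to allele frequencies, which helps explain why an ascertainment scheme that removes rare alleles shifts them in opposite directions. The sketch below computes both from the standard definitions, assuming known haplotype and allele frequencies for two biallelic loci; the function and argument names are illustrative.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> tuple[float, float]:
    """Return (|D'|, r^2) for two biallelic loci, given the frequency p_ab of
    the haplotype carrying alleles A and B and the allele frequencies p_a, p_b."""
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    disequilibrium = p_ab - p_a * p_b
    # Largest magnitude of D compatible with the allele frequencies.
    if disequilibrium >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(disequilibrium) / d_max
    r2 = disequilibrium ** 2 / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d_prime, r2


# A rare allele (10 per cent) found only on B-bearing haplotypes.
print(ld_measures(0.1, 0.1, 0.5))
# Two common alleles in complete association.
print(ld_measures(0.5, 0.5, 0.5))
```

In the first case |D'| equals 1 while r² is only about 0.11; in the second both equal 1. Pairs involving rare alleles therefore tend to have high |D'| and low r², so depleting them lowers the former and raises the latter.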
Correcting ascertainment bias
It should by now be clear that appropriate population genetic analysis of ascertained SNP data is problematic in the absence of methods for correcting for ascertainment bias. Fortunately, it is possible in many cases to correct the ascertainment bias relatively easily, if reliable information is available regarding the details of the ascertainment scheme.
where n_j is the observed number of loci with allele frequency j.
Such an approach has been used to correct estimates of ρ based on Hudson's (2001) estimator. Application of this procedure produces approximately unbiased estimates of ρ. Polanski and Kimmel used this method to estimate population growth rates from SNP data. They showed that, under a model of exponential population growth, as the effect of the ascertainment bias increases, the power to reject the hypothesis of no population growth decreases, even after correction of the ascertainment bias. In this case, an SNP discovery protocol in which loci with high frequency alleles have been chosen preferentially has caused a reduction in power compared with randomly sampled SNPs. The reason is that most of the information regarding population growth comes from rare alleles. In general, using ascertainment protocols that enrich the data with respect to common alleles can lead either to a decrease or to an increase in power, depending on the specifics of the models and parameters being estimated.
The methods for correcting the likelihood function and estimating the frequency spectrum can easily be extended to the case where low frequency alleles have been eliminated directly and to more complicated ascertainment schemes involving linked SNPs. Wakeley et al. considered data in which SNPs have been ascertained on the basis of the number of SNPs occurring on the genomic fragment on which they are located. They used methods similar to Equation (5) to estimate population growth rates and migration rates between human populations.
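One way to correct an observed frequency spectrum is to weight each observed count n_j by the inverse of the probability that a locus with j mutant copies is ascertained, and then renormalise. The sketch below assumes that the ascertainment sample is a random subset of d of the n typed chromosomes and that a locus is kept if it is variable in that subset; the function names and the toy counts are illustrative, and the weighting is offered as an assumption consistent with the description above rather than as the source's exact equation.

```python
from math import comb


def ascertainment_probability(j: int, n: int, d: int) -> float:
    """Probability that a locus with j mutant copies among n typed
    chromosomes is variable in a random subset of d of them."""
    return 1.0 - (comb(j, d) + comb(n - j, d)) / comb(n, d)


def corrected_spectrum(counts: dict[int, int], n: int, d: int) -> dict[int, float]:
    """Estimate the frequency spectrum before ascertainment from observed
    counts n_j (keys j = 1..n-1) by inverse-probability weighting."""
    weights = {j: c / ascertainment_probability(j, n, d) for j, c in counts.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


# Toy data for n = 4, d = 2: counts that an ascertained neutral spectrum would give.
observed = {1: 6, 2: 4, 3: 2}
print(corrected_spectrum(observed, n=4, d=2))
```

For the toy counts, the corrected proportions are 6/11, 3/11 and 2/11, which stand in the ratio 1 : 1/2 : 1/3 expected under the standard neutral model, although the observed counts do not.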
In theory, if detailed records regarding the SNP discovery protocols are being kept, corrections of the ascertainment bias are always possible. Even in the case where some information regarding the ascertainment scheme has been lost, such as the allele frequencies in the ascertainment samples, it may be possible to recover approximately unbiased parameter estimates and valid confidence intervals by statistical modelling of the ascertainment process. In cases where there is little or no overlap between the ethnicities of the individuals included in the typed sample and the ascertainment samples, however, corrections can only be made in parametric models describing the genetic relationship between the populations. In such cases, it will typically be difficult or impossible to use classical non-parametric methods for statistical inference.
Conclusions and recommendations
The size of the panel and the ethnicity of the panel members.
The details of the protocol used to sample sequences from the panel, ie full sampling or sampling with or without replacement; the ascertainment sample sizes; and, for linked SNPs, information regarding independent or correlated sampling of SNPs in the ascertainment sample.
Details regarding base-calling and elimination of rare alleles.
In many cases, this information is available or can be reconstructed. If not, this will in most cases preclude valid population genetic inferences based on the SNP data.
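The items listed above could be stored alongside each SNP in a structured record. The following is a hypothetical sketch of such a record; every field name is invented for illustration and does not correspond to any existing database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AscertainmentRecord:
    """Hypothetical per-SNP record of how the SNP was discovered."""
    snp_id: str
    panel_size: int                      # number of diploid individuals in the panel
    panel_ethnicities: list[str]         # ethnicity of the panel members
    sampling_protocol: Optional[str]     # "full", "with_replacement" or "without_replacement"
    ascertainment_sample_size: Optional[int]  # None if the record has been lost
    correlated_sampling: Optional[bool] = None     # for linked SNPs
    base_calling_method: Optional[str] = None
    min_allele_count_retained: Optional[int] = None  # rare-allele elimination threshold

    def is_complete(self) -> bool:
        """True if the fields needed to model the ascertainment are present."""
        return (
            self.sampling_protocol is not None
            and self.ascertainment_sample_size is not None
            and self.base_calling_method is not None
        )


record = AscertainmentRecord(
    snp_id="example-snp",
    panel_size=24,
    panel_ethnicities=["unspecified"],
    sampling_protocol="with_replacement",
    ascertainment_sample_size=None,
)
print(record.is_complete())
```

A record with a missing ascertainment sample size, as in the example, would be flagged as incomplete, signalling that the ascertainment process must be modelled statistically or reconstructed before analysis.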
Much work is still needed on SNP ascertainment bias corrections. For example, there is still a need for standard methods for estimating levels of population subdivision from SNP data corresponding to the classical FST estimator. Such an estimator would be useful, for example, in studies aimed at detecting (possibly selected) genomic regions with extreme FST values. Researchers may also find corrections to estimates of the linkage disequilibrium coefficient (D') and its derivatives useful.
The large SNP datasets provide a population genetic resource that is unlikely to be rivalled for many years by data obtained using direct sequencing. Much effort is being devoted in the human genetics and population genetics communities to estimating ancestral and demographic parameters and parameters relating to recombination and mutation from human population genetic data. There is no reason why the large SNP datasets should not be used in this effort. Before this can happen, however, details regarding ascertainment schemes must be made publicly available in greater detail. For example, information regarding the ethnicity of a panel is not sufficient without detailed information regarding how ascertainment samples were constructed from the panel when not all panel members have been sequenced for each SNP. Information regarding base-calling, used to assess the probability of unintentionally eliminating a low frequency allele, should also be available. Making this type of information available in databases, in a fashion that facilitates proper statistical analysis, is a major bioinformatics task that should be given a very high priority by the human population genetics community.
References
- Smigielski EM, Sirotkin K, Minghong W, et al: 'dbSNP: A database of single nucleotide polymorphisms'. Nucl Acids Res. 2000, 28: 352-355. 10.1093/nar/28.1.352.
- Collins FS, Guyer MS, Chakravarti A: 'Variations on a theme: Cataloging human DNA sequence variation'. Science. 1997, 278: 1580-1581. 10.1126/science.278.5343.1580.
- Collins FS, Brooks LD, Chakravarti A: 'A DNA polymorphism discovery resource for research on human genetic variation'. Genome Res. 1998, 8: 1229-1231.
- Risch N, Merikangas K: 'The future of genetic studies of complex human diseases'. Science. 1996, 273: 1516-1517. 10.1126/science.273.5281.1516.
- Brookes AJ: 'The essence of SNPs'. Gene. 1999, 234: 177-186. 10.1016/S0378-1119(99)00219-X.
- Kruglyak L: 'Prospects for whole-genome linkage disequilibrium mapping of common disease genes'. Nat Genet. 1999, 22: 139-144. 10.1038/9642.
- Reich DE, Cargill M, Bolk S, et al: 'Linkage disequilibrium in the human genome'. Nature. 2001, 411: 199-204. 10.1038/35075590.
- Dawson E, Abecasis GR, Bumpstead S, et al: 'A first-generation linkage disequilibrium map of human chromosome 22'. Nature. 2002, 418: 544-548. 10.1038/nature00864.
- Gabriel SB, Schaffner SF, Nguyen H, et al: 'The structure of haplotype blocks in the human genome'. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
- Clark AG, Nielsen R, Signorovitch J, et al: 'Linkage disequilibrium and inference of ancestral recombination in 538 single nucleotide polymorphism clusters across the human genome'. Am J Hum Genet. 2003, 73: 285-300. 10.1086/377138.
- Wakeley J, Nielsen R, Liu-Cordero SN, et al: 'The discovery of single nucleotide polymorphisms--and inferences about human demographic history'. Am J Hum Genet. 2001, 69: 1332-1347. 10.1086/324521.
- Cavalli-Sforza LL, Feldman MW: 'The application of molecular genetic approaches to the study of human evolution'. Nat Genet. 2003, 33: 266-275. 10.1038/ng1113.
- Sunyaev SR, Lathe WC, Ramensky VE, et al: 'SNP frequencies in human genes: An excess of rare alleles and differing modes of selection'. Trends Genet. 2000, 16: 335-337. 10.1016/S0168-9525(00)02058-8.
- Akey JM, Zhang G, Zhang K, et al: 'Interrogating a high-density SNP map for signatures of natural selection'. Genome Res. 2002, 12: 1805-1814. 10.1101/gr.631202.
- Sabeti PC, Reich DE, Higgins JM, et al: 'Detecting recent positive selection in the human genome from haplotype structure'. Nature. 2002, 419: 832-837. 10.1038/nature01140.
- Reich DE, Cargill M, Bolk S, et al: 'Linkage disequilibrium in the human genome'. Nature. 2001, 411: 199-204. 10.1038/35075590.
- Sunyaev SR, Ramensky VE, Koch I, et al: 'Prediction of deleterious human alleles'. Hum Mol Genet. 2001, 10: 591-597. 10.1093/hmg/10.6.591.
- Taillon-Miller P, Gu Z, Li Q, et al: 'Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms'. Genome Res. 1998, 8: 748-754.
- Wang DG, Fan JB, Siao CJ, et al: 'Large-scale identification, mapping, and genotyping of single nucleotide polymorphisms in the human genome'. Science. 1998, 280: 1077-1082. 10.1126/science.280.5366.1077.
- Picoult-Newberg L, Ideker TE, Pohl MG, et al: 'Mining SNPs from EST databases'. Genome Res. 1999, 9: 167-174.
- Altshuler D, Pollara VJ, Cowles CR, et al: 'A SNP map of the human genome generated by reduced representation shotgun sequencing'. Nature. 2000, 407: 513-516. 10.1038/35035083.
- Eberle MA, Kruglyak L: 'An analysis of strategies for discovery of single nucleotide polymorphisms'. Genet Epidemiol. 2000, 19: S29-S35. 10.1002/1098-2272(2000)19:1+<::AID-GEPI5>3.0.CO;2-P.
- Nielsen R: 'Estimation of population parameters and recombination rates using single nucleotide polymorphisms'. Genetics. 2000, 154: 931-942.
- Kuhner MK, Beerli P, Yamamoto J, et al: 'Usefulness of single nucleotide polymorphism data for estimating population parameters'. Genetics. 2000, 156: 439-447.
- Kingman JFC: 'The coalescent'. Stochast Proc Appl. 1982, 13: 235-248. 10.1016/0304-4149(82)90011-4.
- Hudson RR: 'Testing the constant-rate neutral model with protein sequence data'. Evolution. 1983, 37: 203-217. 10.2307/2408186.
- Tajima F: 'Statistical method for testing the neutral mutation hypothesis by DNA polymorphism'. Genetics. 1989, 123: 585-595.
- Nielsen R, Signorovitch J: 'Correcting for ascertainment biases when analyzing SNP data: Applications to the estimation of linkage disequilibrium'. Theor Pop Biol. 2003, 63: 245-255. 10.1016/S0040-5809(03)00005-4.
- Polanski A, Kimmel M: 'New explicit expressions for relative frequencies of single nucleotide polymorphisms with application to statistical inference on population growth'. Genetics. 2003, 165: 427-436.
- Sherry ST, Harpending HC, Batzer MA, et al: 'Alu evolution in human populations: Using the coalescent to estimate effective population size'. Genetics. 1997, 147: 1977-1982.
- Slatkin M, Hudson RR: 'Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations'. Genetics. 1991, 129: 555-562.
- Mountain JL, Cavalli-Sforza LL: 'Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms'. Proc Natl Acad Sci USA. 1994, 91: 6515-6519. 10.1073/pnas.91.14.6515.
- Rogers AR, Jorde LB: 'Ascertainment bias in estimates of average heterozygosity'. Am J Hum Genet. 1996, 58: 1033-1041.
- Urbanek M, Goldman D, Long JC: 'The apportionment of dinucleotide repeat diversity in Native Americans and Europeans: A new approach to measuring gene identity reveals asymmetric patterns of divergence'. Mol Biol Evol. 1996, 13: 943-953. 10.1093/oxfordjournals.molbev.a025662.
- Eller E: 'Effects of ascertainment bias on recovering human demographic history'. Hum Biol. 2001, 73: 411-428. 10.1353/hub.2001.0034.
- Bowcock AM, Ruiz-Linares A, Tomfohrde J, et al: 'High resolution of human evolutionary trees with polymorphic microsatellites despite a constraint in allele length'. Nature. 1994, 368: 455-457. 10.1038/368455a0.
- Hudson RR: 'Gene genealogies and the coalescent process'. Oxford Surveys in Evolutionary Biology. Edited by: P.H. Harvey and L Partridge. 1990, Oxford University Press, New York, NY, 7: 1-44.
- Akey JM, Zhang K, Xiong M, et al: 'The effect of single nucleotide polymorphism identification strategies on estimates of linkage disequilibrium'. Mol Biol Evol. 2003, 20: 232-242. 10.1093/molbev/msg032.
- Hudson RR: 'Two-locus sampling distributions and their application'. Genetics. 2001, 159: 1805-1817.
- Weir BS, Cockerham CC: 'Estimating F-statistics for the analysis of population structure'. Evolution. 1984, 38: 1358-1370. 10.2307/2408641.