Geographic stratification of linkage disequilibrium: a worldwide population study in a region of chromosome 22
Human Genomics, volume 1, Article number: 399 (2004)
Recent studies of haplotype diversity in a number of genomic regions have suggested that long stretches of DNA are preserved in the same chromosome, with little evidence of recombination events. The knowledge of the extent and strength of these haplotypes could become a powerful tool for future genetic analysis of complex traits. Different patterns of linkage disequilibrium (LD) have been found when comparing individuals of African and European descent, but little is known about the worldwide population stratification. Thus, the study of haplotype composition and the pattern of LD from a global perspective is relevant for elucidating their geographical stratification, as it may have implications in the future analysis of complex traits. We have typed 12 single nucleotide polymorphisms in a chromosome 22 region -- previously described as having high LD levels in European populations -- in 39 different world populations. Haplotype structure has a clear continental structure with marked heterogeneity within some continents (Africa, America). The pattern of LD among neighbouring markers exhibits a strong clustering of all East Asian populations on the one hand and of Western Eurasian populations (including Europe) on the other, revealing only two major LD patterns, but with some very specific outliers due to specific demographic histories. Moreover, it should be taken into account that African populations are highly heterogeneous. The present results support the existence of a wide (but not total) communality in LD patterns in human populations from different continental regions, despite differences in their demographic histories, as population factors seem to be less relevant compared with genomic forces in shaping the patterns of LD.
Future genetic analysis using single nucleotide polymorphisms (SNPs) will take advantage of the structure of the human genome in regions with high linkage disequilibrium (LD) and low haplotype number, in order to hasten and optimise gene mapping based on genetic association; find relatively frequent genetic variants associated with complex diseases; and define individual responses to drugs.
For these purposes, extensive knowledge of the patterns of LD in the human genome is required. It has been suggested that LD in humans could be organised as a pattern of blocks of variable length within which limited diversity is found, separated by regions with low LD. This structure could have been produced by a number of possible mechanisms, one of which is recombination hotspots [1–3]. The International HapMap Project intends to create a map of haplotypes in four different populations in order to define sets of highly informative tag SNPs for future use. It is still unclear, however, to what extent a unique and general genome haplotype map exists or whether population structure is a main modifier of a putative human-wide pattern. The level of population structure affecting LD is also unclear: it could range from differences between large continental groups to specificities of single populations with particular demographic histories. In fact, variable population stratification of LD for single loci has been found [4–6] and it is consistently observed that LD in non-African populations extends over longer physical distances than in Africans.
Here, we present a worldwide study of LD and haplotype structure in a region of chromosome 22, including 12 SNPs (Table 1), spanning 1.78 megabases (Mb), in which strong LD has been described in some European populations (English, Centre d'Étude du Polymorphisme Humain [CEPH] families, Estonians). Although it may seem that the distance between SNPs is beyond the usual range of LD and haplotype structure, this study focuses on the regions with the highest LD described along the entire length of chromosome 22. The analysis was performed on a total of 1,110 unrelated individuals from 39 different populations across the world. Our results contribute to the understanding of the differences in LD patterns that exist among populations, mainly defining wide regional areas with very high similarities, and the recognition of specific populations that might demonstrate special features.
Materials and methods
We have selected 12 SNPs, previously ascertained and typed by Dawson et al. in European (English and Estonian) populations in a high LD region in chromosome 22 (National Center for Biotechnology Information [NCBI] Build 34; 3,984,769 base pairs (bp) to 41,628,504 bp) (Table 1). The SNPs were identified through previous discovery efforts and are available on the Wellcome Trust Sanger Institute website (http://www.sanger.ac.uk). These SNPs cannot be considered as tag SNPs, but are markers flanking groups of SNPs with the highest LD in the region.
The analysis was performed on a total of 1,110 unrelated individuals: 1,063 worldwide purified genomic DNA samples from the Human Genome Diversity Project-CEPH Human Genome Diversity Cell Line Panel and 47 purified genomic DNA samples from Catalan individuals. The set of populations under study covered most of human genetic diversity, as reported by Rosenberg et al. As some original population samples were small, some of the geographically closest populations were pooled. Tuscans and North Italians were grouped as Continental Italians (CIT); Dai, Lahu, Miaozu, Naxi, She, Tujia and Yizu populations were combined as South Chinese (SCH); and Daur, Hezhen, Mongolian, Oroqen, Tu, Uygur and Xibo populations were grouped as North Chinese (NCH). The total number of populations studied was thus 39 (see Table 2). Genotyping data for 70 unrelated English individuals, produced by Dawson et al. and available on the Wellcome Trust Sanger Institute website, were also included in the present analysis for the selected markers.
Twelve SNPs were successfully genotyped using TaqMan technology from Applied Biosystems (AB). The Assay by Design service was used to design probes and primers. Each 5 μl polymerase chain reaction (PCR) mix contained 10 ng of genomic DNA, 0.125 μl of a 40× mix of primers and 6-carboxy-fluorescein (FAM)- and VIC-labelled TaqMan minor groove binder (MGB) probes, and 2.5 μl of TaqMan Universal PCR Master Mix. Amplification conditions were as follows: 50°C, 2 minutes; 95°C, 10 minutes; followed by 40 cycles of 94°C, 15 seconds and 60°C, 1 minute, in an ABI Prism 7900 HT (AB). Fluorescence in each well was measured after PCR and the results were analysed using Sequence Detection System (SDS) version 2.1 (AB).
Haplotypes and LD
Haplotype frequencies were estimated from genotype frequencies using the expectation-maximisation (EM) algorithm, as implemented in Arlequin software. It should be noted that this has been described as a high LD region and thus, even in samples with a small number of chromosomes (for example, of fewer than 50 individuals), power and accuracy in estimating haplotype frequencies are acceptable according to simulations. Haplotypes estimated at a frequency lower than a single chromosome were not considered. Besides haplotype diversity, the fraction of haplotypes not found (FNF statistic) was also computed as a measure of haplotype variation; it can be interpreted as the fraction of haplotypes not found in the population and is defined as
FNF = (Kmax - Kh) / (Kmax - Kmin)
where Kh is the number of haplotypes found in the sample, Kmin is the minimum number of haplotypes that can be found in total LD (that is, two in the case of biallelic markers such as SNPs) and Kmax is the maximum possible number of different haplotypes expected under linkage equilibrium, given the size and allele frequencies of the population, and thus corrects for fixed loci (see Mateu et al.).
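The EM estimation of haplotype frequencies mentioned above can be illustrated with a minimal two-locus sketch. This is not Arlequin's implementation, which handles many loci at once; the function name `em_two_locus` and the genotype coding (copies of allele 1 at each SNP) are assumptions made for illustration only. With two biallelic SNPs, the only phase-ambiguous genotype is the double heterozygote, so the E-step reduces to splitting those individuals between the two possible haplotype pairs.

```python
from collections import Counter

HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def em_two_locus(genotypes, n_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), where each g is the number of copies
    (0, 1 or 2) of allele 1 at that SNP. Hypothetical coding.
    """
    n_chrom = 2 * len(genotypes)
    fixed = Counter()  # haplotype counts from phase-unambiguous genotypes
    n_dh = 0           # number of double heterozygotes (phase unknown)
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_dh += 1
            continue
        a = [0, 0] if g1 == 0 else [1, 1] if g1 == 2 else [0, 1]
        b = [0, 0] if g2 == 0 else [1, 1] if g2 == 2 else [0, 1]
        fixed[(a[0], b[0])] += 1
        fixed[(a[1], b[1])] += 1

    freqs = {h: 0.25 for h in HAPS}  # start from equal frequencies
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries 00/11
        # rather than 01/10, given the current frequency estimates.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts divided by chromosome number.
        new = {
            (0, 0): (fixed[(0, 0)] + n_dh * w) / n_chrom,
            (1, 1): (fixed[(1, 1)] + n_dh * w) / n_chrom,
            (0, 1): (fixed[(0, 1)] + n_dh * (1 - w)) / n_chrom,
            (1, 0): (fixed[(1, 0)] + n_dh * (1 - w)) / n_chrom,
        }
        delta = max(abs(new[h] - freqs[h]) for h in HAPS)
        freqs = new
        if delta < tol:
            break
    return freqs


# Toy data: two homozygotes and one double heterozygote.
example = em_two_locus([(0, 0), (2, 2), (1, 1)])
```

In a high LD region most of the probability mass falls on a few haplotypes, so the ambiguous genotypes are resolved almost deterministically, which is consistent with the remark above that estimation remains acceptable in small samples.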
Correspondence and principal component analyses were performed using SPSS software version 9.0. For LD measures, we computed D' and r2 for each pair of markers using the Arlequin software package. Correspondence analysis provided a method for representing frequency data in a Euclidean space, so that the results could be visually examined for structure.
A description of the patterns of haplotype diversity within and among populations allows for an initial approach to the comprehension of the haplotype structure, its variation and diversity, and the global and regional similarities. Of the total of 4,096 different possible haplotypes, 531 were found. The number of shared haplotypes found in two or more populations was 182, a non-negligible fraction for such a wide genomic region. The most frequent haplotype was present in 118 chromosomes (5 per cent), all from European and Asian populations. In Africa, all of the haplotypes found at high frequency were population specific. The most common haplotypes found in Native Americans were present at very low frequencies elsewhere, a fact that can be explained by a bottleneck in the original settlement. We found a non-negligible fraction of fixed SNPs, mainly in Native Americans and Oceanians (see Table 2), which may be the result of the SNPs having been ascertained in Europeans and of genetic drift.
Table 2 shows, for each population and as an average for geographical regions, different descriptive parameters: haplotype diversity (Dh), observed number of haplotypes (Kh), number of haplotypes expected under equilibrium (Kmax), fraction of haplotypes not found (FNF), number of haplotypes shared between two or more populations and the number of nonpolymorphic SNPs. These figures are intended to present a comparative approximation of the amount of variation. Oceanian and Native American populations show the lowest haplotype diversities, with a high fraction of fixed SNPs. Asians and Europeans show high and similar haplotype diversities, with slightly lower values in Africans, even if fixation mainly affects a single population, the San from Namibia (with a low sample size and a high proportion of fixed loci). The fraction of chromosomes in a population harbouring haplotypes shared with other populations is lowest by far in Africans, but it is very high in Oceanians and Native Americans, which thus have a communality of haplotypes with Eurasian populations.
A measure of haplotype variability in the region could be obtained using the FNF statistic, which only depends on the number of polymorphic SNPs, and thus is not affected by the fixation of alleles in some SNPs. The number of different haplotypes expected under linkage equilibrium (given the sample size and allele frequencies) was compared with the number of observed haplotypes in each population. The resulting fraction (that is, the FNF; see Table 2) would be expected to increase when the SNP diversity is high and the number of observed haplotypes is low. The lowest mean value of FNF (and, thus, the highest richness of haplotypes) was found in Africans, with several geographical groups showing heterogeneity among single populations, mainly in Oceania and America.
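The comparison between expected and observed haplotype numbers can be sketched as follows. The form FNF = (Kmax - Kh) / (Kmax - Kmin) is the one implied by the definitions of Kh, Kmin and Kmax in the Methods. How Kmax was actually obtained is not stated in the text; the sketch below assumes it is the expected number of distinct haplotypes in a sample of n chromosomes when alleles at different SNPs combine independently, and both function names are hypothetical.

```python
from itertools import product


def expected_haplotypes_le(allele_freqs, n_chrom):
    """Expected number of distinct haplotypes seen in n_chrom chromosomes
    under linkage equilibrium (an assumed way of obtaining Kmax).

    allele_freqs: frequency of one allele at each polymorphic SNP.
    Fixed SNPs should be left out, so they do not affect the result.
    """
    expected = 0.0
    for alleles in product((0, 1), repeat=len(allele_freqs)):
        # Under equilibrium a haplotype frequency is the product of its
        # allele frequencies.
        p = 1.0
        for allele, freq in zip(alleles, allele_freqs):
            p *= freq if allele else 1.0 - freq
        # Probability that this haplotype appears at least once.
        expected += 1.0 - (1.0 - p) ** n_chrom
    return expected


def fnf(k_h, k_max, k_min=2):
    """Fraction of haplotypes not found: 0 when every expected haplotype
    is observed, 1 when only the minimum number (total LD) is observed."""
    if k_max <= k_min:
        raise ValueError("k_max must exceed k_min")
    return (k_max - k_h) / (k_max - k_min)


# Hypothetical population: 4 polymorphic SNPs, 50 chromosomes, 5 haplotypes.
k_max = expected_haplotypes_le([0.3, 0.5, 0.4, 0.2], 50)
value = fnf(5, k_max)
```

Because only polymorphic SNPs enter the calculation of Kmax, a population with many fixed SNPs is not automatically assigned a high FNF, which matches the property described above.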
In order to describe the patterns of haplotype variation and the similarities of populations based on their haplotype composition, a correspondence analysis was performed on the haplotype frequencies for each of the 40 populations, considering the haplotypes shared by at least two populations. The results for the first three dimensions are plotted in Figure 1. As expected, they show that Africans are the main source of variation (as revealed by the first dimension). The second dimension separates the five Native American populations from the rest; thus, Native Americans are the second most important source of global genetic variation for haplotype composition, even if in this case most of the haplotypes are shared with other populations. Finally, the third dimension differentiates East Asians and Oceanians from the rest of Eurasian populations. The most interesting feature is the continental clustering, with strong similarities among populations, mainly in two clusters: Europe, Middle East/North Africa and Central/South Asia on the one hand, and East Asia and Oceania on the other; there is higher heterogeneity within Africa (with many unshared haplotypes) and America (with most of the haplotypes shared).
LD decays with physical distance, but the pattern of decay shows strong differences among genomic regions with different recombination rates. For each pair of markers, we computed D' and r2, the two most common measures of LD [16–18]. Both statistics produced equivalent results in all of the performed analyses. Henceforth, therefore, only r2 results are shown.
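For reference, the two LD statistics can be computed from a haplotype frequency and the two allele frequencies with the standard definitions (D' after Lewontin, r2 after Hill and Robertson). The sketch below is a generic illustration, not the Arlequin code used in the study, and the function name `ld_measures` is an assumption.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D' scales |D| by the largest value it could take given the
    # allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")
    # r2 is the squared correlation between the alleles at the two markers.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, d_prime, r2


# Only three of the four haplotypes are present: D' is 1 but r2 is not.
d, d_prime, r2 = ld_measures(p_ab=0.3, p_a=0.3, p_b=0.6)
```

When one of the markers is fixed in a population, both statistics are undefined (returned here as NaN), which is why populations with many fixed SNPs yield missing LD values.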
In order to describe the similarities in the LD pattern among populations, a principal component analysis was performed upon measures of LD between adjacent pairs of markers. For every pair of populations, Pearson's correlation was calculated between the r2 values of LD between adjacent pairs of markers. The result was a correlation matrix among populations, which was summarised in a principal component analysis. Seven populations were excluded because of their high number of fixed SNPs and, thus, the missing LD measures (the populations with more than three missing values of r2 were not considered; therefore, a total number of 33 populations were included in this analysis). Results for the first two components (Figure 2) revealed, as in the case of haplotype structure, two clusters, one corresponding to Central and West Eurasia, explaining 42 per cent of the variance (a North African population showed an African position) and the other corresponding to East Asia (18.8 per cent of the variance). African populations were scattered in the plot, with different LD patterns among them.
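The procedure described above, correlating the adjacent-marker r2 profiles of every pair of populations and then extracting principal components from the resulting correlation matrix, can be sketched in a few lines. The study used SPSS; the version below is only an illustration that finds the first component by power iteration. The population labels and r2 values are invented, and all function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(r2_by_pop):
    """Population-by-population correlation of adjacent-marker r2 values."""
    names = sorted(r2_by_pop)
    matrix = [[pearson(r2_by_pop[a], r2_by_pop[b]) for b in names]
              for a in names]
    return names, matrix


def first_component(matrix, n_iter=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power
    iteration. The uneven start vector avoids being orthogonal to the
    leading eigenvector in simple symmetric cases."""
    n = len(matrix)
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return eigenvalue, v


# Invented r2 profiles over five adjacent marker pairs.
r2_by_pop = {
    "EUR1": [0.8, 0.6, 0.1, 0.7, 0.2],
    "EUR2": [0.7, 0.5, 0.2, 0.8, 0.1],
    "EAS1": [0.2, 0.7, 0.8, 0.1, 0.6],
    "EAS2": [0.1, 0.8, 0.7, 0.2, 0.5],
}
names, corr = correlation_matrix(r2_by_pop)
eigenvalue, loadings = first_component(corr)
explained = eigenvalue / len(names)  # share of total variance
```

Since the trace of a correlation matrix equals the number of populations, dividing an eigenvalue by that number gives the proportion of variance explained by the component, the quantity reported as percentages above. Populations with similar LD profiles receive loadings of the same sign.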
We tested the statistical significance (Table 3, above the diagonal) of the previous correlation coefficients (between LD measures in contiguous SNPs) for pairs of populations within each geographical region (Table 3, below the diagonal). The significance of the probabilities was established using the rigid and conservative Bonferroni correction. We also calculated the correlations using the whole LD matrix, establishing the significance through the non-parametric Mantel test, with similar results to those found using just the diagonal values; however, the amount of noise for LD at large distances precludes its use. Oceanic and American groups were excluded because of lack of comparative data due to fixed SNPs. There are, with some exceptions, very strong correlations among populations within regional groups, except for Africans -- a further consequence of the genetic heterogeneity among African populations. In Europeans, all correlations were extremely significant, except for the Adygei in the Caucasus. In Central/South Asia, the pattern was less clear, having a larger diversity within some populations (such as Sindhi) and showing non-significant correlations with the rest. Finally, East Asian populations formed a tight cluster with very strong similarities in most of the comparisons. When performing the same analysis with one population from each region, correlations were much smaller, as expected (Table 3E); nevertheless, a cluster became evident with populations from West Eurasia (from Europe, the Middle East and Central Asia). The results of the correlations confirmed and quantified the principal component analysis in Figure 2.
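The two significance procedures named above can be sketched as follows. The Bonferroni step simply divides the nominal significance level by the number of tests. The Mantel test compares two whole LD matrices by permuting the marker order of one of them and asking how often the permuted correlation reaches the observed one. This is a generic illustration under assumed names and toy values, not the software used in the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order=None):
    """Off-diagonal upper-triangle values, optionally after relabelling
    rows and columns with the same permutation."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(m1, m2, n_perm=999, seed=0):
    """One-sided Mantel test: observed correlation and permutation p-value."""
    rng = random.Random(seed)
    observed = pearson(upper_triangle(m1), upper_triangle(m2))
    order = list(range(len(m1)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # permute marker labels of the first matrix
        if pearson(upper_triangle(m1, order), upper_triangle(m2)) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


# Toy 5-marker r2 matrices for two hypothetical populations.
pop_a = [[1.0, 0.8, 0.5, 0.3, 0.1],
         [0.8, 1.0, 0.7, 0.4, 0.2],
         [0.5, 0.7, 1.0, 0.6, 0.3],
         [0.3, 0.4, 0.6, 1.0, 0.9],
         [0.1, 0.2, 0.3, 0.9, 1.0]]
pop_b = [[1.0, 0.7, 0.4, 0.3, 0.2],
         [0.7, 1.0, 0.6, 0.5, 0.1],
         [0.4, 0.6, 1.0, 0.5, 0.2],
         [0.3, 0.5, 0.5, 1.0, 0.8],
         [0.2, 0.1, 0.2, 0.8, 1.0]]
r_obs, p_value = mantel_test(pop_a, pop_b)
threshold = bonferroni_threshold(0.05, 10)  # e.g. 10 pairwise comparisons
```

A permutation test is used for whole matrices because their entries are not independent: each marker contributes to several pairwise LD values, so the usual parametric test of a correlation coefficient does not apply.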
The genetic diversity in humans has been used for decades to understand population history, but in recent years there has been a growing interest in ascertaining the extent of variation for other purposes -- mainly for the genetic analysis of complex traits through methods based on LD. The most frequently used method is the comparison between patients and control populations (association studies), with approaches ranging from a single candidate SNP to a whole genome scan. In fact, knowledge of genetic stratification is of interest to obtain reliable results in association studies, as it may help to answer questions such as: i) how different is the haplotype composition between populations or, in other words, how well would SNPs that account for the most common haplotypes (tagging SNPs) in one population work in other populations as tag SNPs? and ii) how different are the LD patterns, that is, if an association found in one population is not replicated in another, could it be due to differences in the LD pattern between the two populations?
The genetic diversification of humans is mainly the consequence of the specific demographic history of humans as a species and the particular history of each regional group or single population. It is of interest, therefore, to evaluate not only the differences and stratification of genetic variation in terms of allele and haplotype frequencies, but also of LD patterns, which have been less explored. In fact, the diversity observed is the result of the interplay between the genome (mainly recombination) and demographic factors (mainly drift); if the former was the only player, there would be a single LD map of the human genome. In this case, the variation that might exist would depend on the relative importance of population-specific historical factors.
There is a fundamental problem in most studies of genetic variation: how were the variants ascertained? The present worldwide analysis of common SNPs identified in European samples, even if extreme frequencies have been avoided, has an ascertainment bias, with alleles being fixed in other populations. As well as this well-known ascertainment bias, there will be a further ascertainment bias associated with the specific populations in which LD structure is described, a fact that will be more pronounced if differences in LD among populations are strong. As discussed below, this is not the case, and population factors are minor compared with genomic factors in shaping the patterns of LD. Recently, extensive simulations have demonstrated that ascertainment bias is an important problem to consider in the interpretation of LD estimates. Despite the availability of the SNPs required to build a haplotype map for European populations and the existence of statistical tools for correcting the ascertainment bias, an identification effort and allele frequency estimates of markers in other continental groups are essential.
Besides the ascertainment bias problems, it is clear that the diversity observed through both haplotype structure and LD patterns in worldwide populations does indeed reflect some effects of population events and demographic history. One example is the high frequency of fixed SNPs in Amerindians, which could be explained by a founder effect experienced by these populations. In addition, several studies have shown high levels of population substructure in Africa, which results in the observed divergent patterns of LD among African populations.
The analysis of haplotype composition has shown that widely scattered geographical groups are highly homogeneous. This is the case for populations in Europe, the Middle East/North Africa and Central/South Asia on the one hand and East Asia and the Pacific Rim on the other. More heterogeneity is observed in Africa (with high diversity and low haplotype sharing) and the Americas (with low diversity and very high haplotype sharing). The analysis of haplotype composition and differentiation among populations shows that differences in diversity are not strong and that the extent of haplotype sharing is high for all populations except Africans. Thus, although there are differences in haplotype frequencies that might be of anthropological interest, haplotype distribution shows remarkable constancy within large geographical groups, and their variation does not hamper the use of genetic strategies for looking for common sets of haplotypes.
Interestingly, the LD pattern presents a comparable picture, with very similar patterns for both the East Asian populations and for most West Eurasian populations. No doubt there is a single, shared LD structure for populations belonging to each group and, since LD structure is crucial for gene mapping based on genetic association, this suggests that there are good reasons to accept a common pattern in these two regions, with a unique LD structure for each -- having been shaped by a common demographic history. Nevertheless, some populations show divergent patterns. These are rather small populations with particular demographic histories. In the latter cases, the LD pattern cannot be ascertained from a common pattern. It is thus evident that for most Eurasian populations just two reference populations (from Europe and the Far East) could give a general framework of variation.
For the Americas and Oceania, differences in haplotype frequencies have not erased the clear genetic communality with Asian populations. Additional care has to be taken with populations that have had a special demographic history -- a fact that is generally known in anthropological genetics and that would prevent consideration of these populations as part of an analysis of general populations in terms of their LD composition.
For Africa, the picture is more complex, as haplotypes are more diverse, with less sharing and significant differences in the LD pattern. Within the continent of Africa, it does not seem to be appropriate to use or infer information across populations, and a larger effort is required to fully ascertain the LD variation within the continent.
Although further analysis would be needed in order to ascertain the precise extent of portability of tagging SNPs across specific populations, the present results support the existence of a wide (but not total) communality in LD patterns in human populations from different continental regions, despite differences in their demographic histories.
Goldstein DB: Islands of linkage disequilibrium. Nat Genet. 2001, 29: 109-111. 10.1038/ng1001-109.
Jeffreys AJ, Kauppi L, Neumann R: Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 2001, 29: 217-222. 10.1038/ng1001-217.
Stumpf MP: Haplotype diversity and the block structure of linkage disequilibrium. Trends Genet. 2002, 18: 226-228. 10.1016/S0168-9525(02)02641-0.
Tishkoff SA, Dietzsch E, Speed W, et al: Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science. 1996, 271: 1380-1387. 10.1126/science.271.5254.1380.
Mateu E, Calafell F, Lao O, et al: Worldwide genetic analysis of the CFTR region. Am J Hum Genet. 2001, 68: 103-117. 10.1086/316940.
Reich DE, Schaffner SF, Daly MJ, et al: Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet. 2002, 32: 135-142. 10.1038/ng947.
Gabriel SB, Schaffner SF, Nguyen H, et al: The structure of haplotype blocks in the human genome. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.
Dawson E, Abecasis GR, Bumpstead S, et al: A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002, 418: 544-548. 10.1038/nature00864.
Mullikin JC, Hunt SE, Cole CG, et al: An SNP map of human chromosome 22. Nature. 2000, 407: 516-520. 10.1038/35035089.
Cann HM, de Toma C, Cazes L, et al: A human genome diversity cell line panel. Science. 2002, 296: 261-262.
Rosenberg NA, Pritchard JK, Weber JL, et al: Genetic structure of human populations. Science. 2002, 298: 2381-2385. 10.1126/science.1078311.
Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995, 12: 921-927.
Schneider S, Roessli D, Excoffier L: Arlequin ver 2.0: A software for population genetic data analysis. 2000, Genetics and Biometry Laboratory, University of Geneva, Geneva, Switzerland.
Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000, 67: 947-959. 10.1086/303069.
Greenacre M: Correspondence analysis in medical research. Stat Methods Med Res. 1992, 1: 97-117. 10.1177/096228029200100106.
Lewontin RC: The interaction of selection and linkage. I. General considerations: Heterotic models. Genetics. 1964, 49: 49-67.
Hill WG, Robertson A: The effects of inbreeding at loci with heterozygote advantage. Genetics. 1968, 60: 615-628.
Pritchard JK, Przeworski M: Linkage disequilibrium in humans: Models and data. Am J Hum Genet. 2001, 69: 1-14. 10.1086/321275.
Bertranpetit J, Calafell F, Comas D, et al: Structure of linkage disequilibrium in humans: Genome factors and populations stratification. Cold Spring Harbor Symposia on Quantitative Biology. 2003, 68: 79-88. 10.1101/sqb.2003.68.79.
Calafell F, Bertranpetit J: Principal component analysis of gene frequencies and the origin of Basques. Am J Phys Anthropol. 1994, 93: 201-215. 10.1002/ajpa.1330930205.
Akey J, Zhang K, Xiong M, et al: The effect of single nucleotide polymorphism identification strategies on estimates of linkage disequilibrium. Mol Biol Evol. 2003, 20: 232-242. 10.1093/molbev/msg032.
Carlson CS, Eberle MA, Rieder MJ, et al: Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet. 2003, 33: 518-521. 10.1038/ng1128.
Tishkoff SA, Williams SM: Genetic analysis of African populations: Human evolution and complex disease. Nat Rev Genet. 2002, 3: 611-621. 10.1038/ni0702-611.
This study was supported by the European Project QLG2-CT-2002-00916 and by the Ministerio de Ciencia y Tecnologia of the Spanish Government (BNC2001-0772). We are particularly grateful to Dr Ian Dunham and Dr Lon Cardon for their advice on marker selection and analysis. We would especially like to thank M. Feldman for his suggestions on data analysis, Mònica Vallés for technical support and Michelle Gardner for help in the correction of the manuscript.