Robustness of the inference of human population structure: A comparison of X-chromosomal and autosomal microsatellites
Human Genomics volume 1, Article number: 87 (2004)
In this paper, data on 20 X-chromosomal microsatellite polymorphisms from the HGDP-CEPH cell line panel are used to infer human population structure. Inferences from these data are compared to those obtained from autosomal microsatellites. Some of the major features of the structure seen with 377 autosomal markers are generally visible with the X-linked markers, although the latter provide less resolution. Differences between the X-chromosomal and autosomal results can be explained without requiring major differences in demographic parameters between males and females. The dependence of the partitioning on the number of individuals sampled from each region and on the number of markers used is discussed.
Differences in patterns of human genetic variation across genetic systems (autosomes, the X and Y chromosomes and the mitochondrial genome) can be attributed to two main sources: (1) differences between males and females in demographic parameters such as population size and migration rate; and (2) differences across systems in the mechanism of inheritance. Past studies have examined differences between evolution in males and females by comparing autosomal data with data from the non-recombining portion of the Y chromosome (NRY) and from mitochondrial DNA (mtDNA);[1–3] however, the X chromosome has generally not been utilised in these studies.
Unlike the NRY and mtDNA, the X chromosome undergoes recombination and contains numerous independent markers. Additionally, selection, if present in the uniparental systems, will affect every locus; on the X chromosome, however, it affects only those loci that are closely linked to selected sites. Consequently, the differences in variation between autosomes and the X chromosome may be more directly ascribed to male/female demographic differences than can those between autosomes and the uniparental systems.
Here the HGDP-CEPH Human Genome Diversity Cell Line Panel is used to test whether individual multilocus genotypes defined by X-linked markers produce different inferences about population structure from those obtained using autosomal genotypes. Results from X-chromosomal analysis of molecular variance (AMOVA) and cluster analysis (as implemented in structure) are compared with those found using the same techniques on 377 autosomal markers. These comparisons are used to study the extent to which the differences between the X-chromosomal findings and the results reported by Rosenberg et al. stem from differences in the mechanism of inheritance and from the smaller amount of information available on the X chromosome. This analysis leads to conclusions about the robustness of population structure inference with respect to the number of microsatellite markers and the number of individuals sampled per region.
Methods and results
The 1,056 individuals (677 males, 379 females) analysed by Rosenberg et al. and Zhivotovsky et al., drawn from 52 populations in seven regional groups, were typed for X-linked markers. The X-chromosomal data were compared to autosomal data from these same individuals [6, 7].
The loci studied on the X chromosome consist of 20 polymorphic microsatellites -- 4 di-, 2 tri- and 14 tetranucleotide repeats -- from Marshfield Screening Set #10, with 5.2 per cent missing data. Three of the markers were pseudoautosomal (tetranucleotide DXYS218, tetranucleotide DXS9900 and dinucleotide DXYS154), so that the males were not hemizygous but homo- or heterozygous at these loci.
In both the autosomal data and the X-chromosomal data, markers are sufficiently widely spaced that, within individual populations, linkage disequilibrium as estimated by homozygosity-based statistics is generally not observed (results not shown). Thus, these loci can be treated as independent markers.
Heterozygosity in the seven regions, computed using the unbiased estimator, ranged as follows: 0.57 (America), 0.64 (Oceania), 0.67 (East Asia), 0.71 (Central/South Asia), 0.72 (Europe), 0.74 (Middle East) and 0.78 (Africa). Of the 237 total alleles in the data, 34 were confined to a single population; 29 of these 'private' alleles appeared only once in the sample. Of the 208 alleles found more than once in the sample, 7.2 per cent were exclusive to one of the seven geographical regions listed above.
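The text does not spell out which unbiased estimator was used. A minimal sketch, assuming the usual sample-size-corrected form H = n/(n - 1) × (1 - Σ p_i²), where n is the number of sampled allele copies at a locus and p_i are the sample allele frequencies, is given below. The function name is illustrative only. For the non-pseudoautosomal X-linked loci, each male would contribute one allele copy and each female two.

```python
from collections import Counter


def unbiased_heterozygosity(alleles):
    """Sample-size-corrected expected heterozygosity at one locus.

    `alleles` is a flat list of observed allele copies (e.g. repeat
    lengths). Assumed form: n / (n - 1) * (1 - sum of squared frequencies).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled allele copies")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def mean_heterozygosity(loci):
    """Average the per-locus estimates over a list of allele lists."""
    values = [unbiased_heterozygosity(a) for a in loci]
    return sum(values) / len(values)
```

Regional values such as those quoted above would then correspond to an average over loci, although the exact averaging scheme used in the study is not stated here.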
AMOVA for the X chromosome
It has generally been observed that the within-population component of genetic variation, W = 1 - F st, is the largest component of human genetic diversity [1, 6, 10, 11]. Using Genetic Data Analysis (GDA) and assuming Hardy-Weinberg proportions within populations, the variance of allelic indicator variables was partitioned in the same manner as for autosomal loci from the same individuals. For the 17 non-pseudoautosomal X-chromosomal markers, the within-group variance component accounted for 87-93 per cent of variation among individuals (Table 1). Note that these values are generally smaller than the corresponding autosomal values in Table 1 of Rosenberg et al. (Table 2).
This observation may be explained by a faster rate of genetic drift for X-chromosomal markers than for autosomal markers. Because populations contain fewer copies of the X chromosome than of any given autosome, drift may proceed more rapidly for X-chromosomal markers, leading to greater X-chromosomal differentiation across populations and larger among-population and among-region variance components.
This argument can be investigated using Slatkin's[13, 14] formulation of F st in a set of d populations, each with constant 'effective population size' of N individuals. Consider a marker for which t0 and t1 are the mean coalescence times for two alleles from the same population and from different populations, respectively, and for which the mean coalescence time for two alleles chosen from any two populations is t = t0/d + (d-1)t1 /d. Assuming mutation rates are small, Slatkin obtained:
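In the notation just defined, and writing \bar{t} for the overall mean coalescence time (the quantity the text calls t), Slatkin's small-mutation-rate result has the standard form below. It is given as a reconstruction from the cited formulation and is numbered (1) to match the later references in the text.

```latex
\[
F_{st} = \frac{\bar{t} - t_0}{\bar{t}},
\qquad
\bar{t} = \frac{t_0}{d} + \frac{(d-1)\,t_1}{d}
\tag{1}
\]
```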
Suppose that the d populations diverged simultaneously at time Q in the past, from an ancestral population also with effective population size N, where Q is measured in the same units as t0, t1 and t. Noting that t1 = t0 + Q and substituting u for (d - 1)Q/d, (1) gives:
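The substitution can be followed step by step. The sketch below assumes Slatkin's form F_st = (\bar{t} - t_0)/\bar{t}, with \bar{t} the overall mean coalescence time, and labels the result (2) to match the references that follow in the text.

```latex
\[
\bar{t} = \frac{t_0}{d} + \frac{(d-1)(t_0 + Q)}{d}
        = t_0 + \frac{(d-1)\,Q}{d}
        = t_0 + u ,
\]
\[
F_{st} = \frac{\bar{t} - t_0}{\bar{t}} = \frac{u}{t_0 + u}
\tag{2}
\]
```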
The value of t0, in units of generations or years, is proportional to the effective population size, and therefore differs across marker systems. Let F aut and F X denote autosomal and X-chromosomal values of F st , and let T aut and T X denote autosomal and X-chromosomal values of t0. Following a similar calculation to that of Pérez-Lezaun et al., to determine the relationship between F aut and F X , one can equate expressions for u obtained from autosomal and X-chromosomal versions of (2):
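Assuming the divergence-model expression F_st = u/(t_0 + u), solving for u gives u = F_st t_0/(1 - F_st). Because u depends only on the divergence time and the number of populations, it is shared by both marker systems, so the equated expressions would read as follows (a reconstruction; the original equation number is not recoverable here).

```latex
\[
\frac{F_{aut}\,T_{aut}}{1 - F_{aut}} = \frac{F_{X}\,T_{X}}{1 - F_{X}}
\]
```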
The within-population components of genetic variation, W aut = 1 - F aut and W X = 1- F X for autosomal and X-chromosomal loci, respectively, satisfy:
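Assuming the relation F_aut T_aut/(1 - F_aut) = F_X T_X/(1 - F_X) implied by equating the two expressions for u, substituting F = 1 - W on each side and solving for W_X gives the following sketch of the relation (original equation numbers not recoverable here).

```latex
\[
\frac{(1 - W_{aut})\,T_{aut}}{W_{aut}} = \frac{(1 - W_{X})\,T_{X}}{W_{X}}
\quad\Longrightarrow\quad
W_{X} = \frac{W_{aut}\,T_{X}}{T_{aut} - W_{aut}\,(T_{aut} - T_{X})}
\]
```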
Write N = N f + N m, where N f and N m are the effective population sizes of females and males, respectively, and let r = N f /N denote the female fraction of the effective population size. Using the expressions for autosomal and X-chromosomal effective population sizes, N aut = 4r(1 - r)N and N X = 9r(1 - r)N/[2(2 - r)],[16–18] together with the fact that T aut /N aut = T X /N X, (3-6) can be simplified.
Restricting attention to (6) leads to:
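The simplification rests on a single ratio, which follows directly from the two effective-size expressions just quoted (for 0 < r < 1, so that the common factor r(1 - r)N cancels):

```latex
\[
\frac{T_{X}}{T_{aut}} = \frac{N_{X}}{N_{aut}}
= \frac{9r(1-r)N \,/\, [2(2-r)]}{4r(1-r)N}
= \frac{9}{8(2-r)}
\]
```

At r = 1/2 this ratio is 3/4, and at r = 7/8 it equals 1.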
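The display can be reconstructed under two assumptions drawn from the surrounding text: that the divergence model gives W_X = W_aut T_X / [T_aut - W_aut (T_aut - T_X)], and that T_X/T_aut = 9/[8(2 - r)]. Dividing numerator and denominator by T_aut and clearing the factor 8(2 - r) yields the form below, labelled (7) to match the next reference in the text.

```latex
\[
W_{X} = \frac{9\,W_{aut}}{16 - 8r - (7 - 8r)\,W_{aut}}
\tag{7}
\]
```

This form reproduces both special cases stated in the text: W_X = 3W_aut/(4 - W_aut) at r = 1/2, and W_X = W_aut at r = 7/8.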
In terms of the relative rate of drift in females compared with males, denoted as z = [1/(2N f )]/[1/(2N m )] = N m /N f , (7) gives:
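Since z = N m /N f and r = N f /(N f + N m), it follows that r = 1/(1 + z). Assuming the r-form of the transformation is W_X = 9W_aut/[16 - 8r - (7 - 8r)W_aut], as the stated effective sizes imply, substituting r = 1/(1 + z) and multiplying through by (1 + z) gives the following reconstruction, labelled (8) by continuation of the text's numbering.

```latex
\[
W_{X} = \frac{9\,(1+z)\,W_{aut}}{8 + 16z - (7z - 1)\,W_{aut}}
\tag{8}
\]
```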
Note that there are two special cases of interest (Table 2). At r = 1/2 (z = 1), drift proceeds at the same rate in males and females, so that W X = 3W aut /(4 - W aut ). At r = 7/8 = 0.875 (z = 1/7 ≈ 0.143), the slow speed of drift in females compared with males reduces the drift rate of X chromosomes exactly enough to counteract the increase in X-chromosomal drift rate that results from their smaller number in the population. In other words, W X = W aut . Because the hypothesis W X = W aut can be rejected at P = 0.05 for 11 of the 13 groupings of data in Table 2, the hypothesis z = 1/7 can also be rejected.
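The two special cases can be checked numerically. The sketch below assumes the transformation has the form W_X = 9W_aut/[16 - 8r - (7 - 8r)W_aut], which is what the stated effective population sizes imply. The function names are illustrative and not taken from the study.

```python
def r_from_z(z):
    """Female fraction r = N_f / (N_f + N_m), given z = N_m / N_f."""
    return 1.0 / (1.0 + z)


def w_x_from_w_aut(w_aut, r):
    """Map an autosomal within-population component to its X-chromosomal
    expectation under the constant-size divergence model (assumed form)."""
    return 9.0 * w_aut / (16.0 - 8.0 * r - (7.0 - 8.0 * r) * w_aut)


if __name__ == "__main__":
    w = 0.93  # illustrative autosomal value, not taken from the tables
    # Equal drift in males and females: expect 3W / (4 - W).
    print(w_x_from_w_aut(w, r_from_z(1.0)), 3.0 * w / (4.0 - w))
    # z = 1/7, i.e. r = 7/8: expect W_X equal to W_aut.
    print(w_x_from_w_aut(w, r_from_z(1.0 / 7.0)), w)
```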
For each of the 13 datasets, the values of r, the female fraction of effective population size, were varied from 0 to 1. At each choice of r, the transformation in (7) was applied and P-values for the two-sided Wilcoxon test between the list of 377 transformed autosomal within-population variance components and the within-population variance components observed at the 17 non-pseudoautosomal X-linked markers were obtained (Figure 1a-c).
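The scan over r can be sketched as follows. Two assumptions are made: the transformation is W_X = 9W_aut/[16 - 8r - (7 - 8r)W_aut], and, because the two lists differ in length, the two-sided Wilcoxon test is the unpaired rank-sum test. The P-value here uses a normal approximation with a tie correction and no continuity correction, so it will not exactly match output from a statistics package. All names are illustrative.

```python
import math


def w_x_from_w_aut(w_aut, r):
    """Assumed form of the autosomal-to-X transformation."""
    return 9.0 * w_aut / (16.0 - 8.0 * r - (7.0 - 8.0 * r) * w_aut)


def rank_sum_p(xs, ys):
    """Two-sided rank-sum P-value (normal approximation, tie-corrected)."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        size = j - i
        tie_term += size ** 3 - size
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0.0:
        return 1.0  # all values tied: no evidence of a difference
    z_stat = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z_stat) / math.sqrt(2.0))


def scan_r(w_aut_values, w_x_values, steps=100):
    """Return (r, P) pairs for r = 0, 1/steps, ..., 1."""
    results = []
    for k in range(steps + 1):
        r = k / steps
        transformed = [w_x_from_w_aut(w, r) for w in w_aut_values]
        results.append((r, rank_sum_p(transformed, w_x_values)))
    return results
```

The value of r with the largest P-value can then be read off with `max(scan_r(a, x), key=lambda pair: pair[1])`, where `a` and `x` stand for the per-locus autosomal and X-chromosomal within-population components.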
At r = 0.5, when drift proceeds at the same rate in males and females, significant P-values (P < 0.05) were found for Africa, Eurasia (treated both as one region and as three regions), Europe, Central/South Asia and East Asia. Therefore, for the remaining seven of the 13 samples, the differences in autosomal and X-chromosomal F st values can be explained by assuming that N m = N f and by using the smaller effective population size of X chromosomes alone. Because Rosenberg et al. found that repeat size affected divergence, Wilcoxon tests were also performed between transformations of the 274 autosomal tetranucleotide repeats and the 14 X-linked tetranucleotides, and similar results were obtained whether or not the two pseudoautosomal tetranucleotides were included in the analysis (not shown).
The values of r on the interval [0,1] that produced the largest P-value (see Figure 1a-c) are also reported in Table 2. America was the only sample where the value of r corresponding to the maximal P-value was greater than 0.5 (r = 0.66 resulted in P = 1.00 for this case).
Of special interest is the fact that five of the six samples with significant P-values (P < 0.05) at r = 0.5 recorded significant P-values as r varied over the whole range [0,1]. For example, as r varies from 0 to 1 in Figure 1b, the P-value resulting from the autosomal transformation for Eurasia (treated as one region) ranges from 4.95 × 10^-4 (r = 0) to 1.97 × 10^-10 (r = 1). The single exception to this pattern was Africa, where the P-value decreased monotonically as r increased and P < 0.05 for r ≥ 0.06.
For these six groupings of the dataset (Africa, Eurasia both as one and as three regions, Europe, Central/South Asia and East Asia), the divergence model with constant effective population size is likely to provide a poorer approximation, as it does not account for population growth or migration (Figure 1d-f).
Multidimensional scaling analysis
Geographic groups of populations are revealed by multidimensional scaling of pairwise F st values (Figure 2): sub-Saharan Africa, (western) Eurasia (which includes Europe, the Middle East, and Central/South Asia), East Asia, Oceania and America. Three populations from Eurasia (Uygur, Hazara, Brahui) and three populations from East Asia (Dai, Cambodian, Han from North China) overlap in the plot. The American populations show much greater within-region genetic differentiation than other continental groups, with the Mayan population (labelled as 4 in Figure 2) deviating somewhat from the rest of the American samples. These results agree with the analysis of the same populations using autosomal microsatellite markers.
X-chromosomal population structure
The structure program identifies subgroupings with distinctive allele frequencies and places individuals into K clusters, where K is defined beforehand by the user and can be varied across independent runs of the program. An individual's membership of a particular cluster is presented as a number between 0 and 1, with membership coefficients summing to 1 across all K clusters.
As is true of autosomal allele frequencies, X-chromosomal allele frequencies are strongly correlated across regions (Table 3). Thus, as was done for the autosomal genotypes from the same individuals, the correlated allele frequencies model implemented in structure was used with runs of the same number of iterations as those used to analyse the autosomal data.
America and Africa were the two essentially discrete regions generated at K = 2 for the X-chromosomal dataset (Figure 3). To compare results with Rosenberg et al., K was increased from 2 to 6 incrementally. At K = 3, Eurasian populations were somewhat identified and the Mozabites were observed to have substantial membership with Africans, as may be expected from their location in Algeria. At K = 4, the X-chromosomal data show noticeably different structure from the autosomal data (see Figure 1 of Rosenberg et al.), as East Asia does not separate as a genetic cluster with good resolution. The next distinct cluster appears at K = 6, where the Oceanic, American and African regions are observed; Eurasia and East Asia separate less obviously, but still appear differentiated from each other.
The X chromosome polymorphisms produced similar clustering to the autosomes, but with less resolution. This raises the question of how the resolution of clusters depends on the number of markers available to study. Figure 4 shows that when the same amounts of data are used, the autosomal and X-chromosomal loci are largely in agreement. Clustering from 20 markers on either autosome 5 or autosome 11 (Figure 4) revealed results very similar to those found with the X-chromosomal dataset. (These particular autosomes were chosen because exactly 20 microsatellites had been typed on them.) For these chromosomes, at K = 6, only American, African and (in the case of chromosome 5) Oceanic populations appear distinctly. Furthermore, a sample of 20 markers spread across all of the autosomes yielded similar results, with the Kalash appearing as a distinct group, but with the Oceanic cluster absent. The Kalash, also seen distinctly in Figure 4 from the markers on chromosome 11, formed the sixth cluster in Rosenberg et al. and was the only major cluster in that study that did not match a major geographical region.
The 377 autosomal markers in the HGDP-CEPH Human Genome Diversity Cell Line Panel data[4, 6, 7] comprise the largest multilocus dataset presently available for studying globally distributed populations. Of interest in studies of population structure is the number of loci needed for clustering. Also considered here is the required number of sampled individuals [6, 20].
Rosenberg et al. found that inference of membership coefficients is most successful with at least 150 markers, and this is corroborated in Figure 5. It is also seen in Figure 5 that the addition of more individuals to a subset of the entire autosomal dataset (which contains 377 markers and 1,056 individuals) did not improve population structure inference as much as did the addition of loci. Data from individuals are used to estimate allele frequencies, which can be done fairly accurately with a small number of individuals; however, as structure uses distinctive genotypic combinations for the construction of clusters, and multilocus combinations are more likely to be distinctive to particular groups than are single-locus types, additional loci can contribute more information to cluster analysis than can the addition of more individuals to the sample (Figure 6).
Oceania appears in Figure 6 as a distinct cluster with only ten loci and between 35 and 100 individuals per region. Because the Oceanic populations together contain 39 individuals, increasing the number of individuals beyond 35 per region meant that every Melanesian and Papuan was included in the subset run of structure. Thus, the distinctive allele frequencies of these populations identify this particular genetic cluster, despite the use of only ten loci.
The same techniques as those of Rosenberg et al. and Zhivotovsky et al. were used to analyse genetic structure as inferred from 20 microsatellite markers on the X chromosome. Multidimensional scaling (Figure 2) did not reveal major departures from the patterns exhibited by the autosomal data. As was also observed on the autosomes, America and Oceania are the regions exhibiting the lowest heterozygosity (0.57 and 0.64, respectively) on the X chromosome.
Seielstad et al. used a migration model to attribute differences in F st across genetic systems to a difference in male and female migration rates. By contrast, a divergence model was used here and it was found that the differences observed in F st values can, in many cases, be explained by the smaller effective population size of X chromosomes compared with autosomes. This is similar to what was observed by Jorde et al., who reported higher G st values in Y-chromosome restriction-site polymorphisms and mtDNA compared with autosomal systems, and found that this difference was expected because of the lower effective population size of the uniparentally inherited portions of the genome. In those regions here where the smaller number of X chromosomes does not provide a sufficient explanation (Africa, Eurasia, Europe, Central/South Asia and East Asia), the assumptions of the divergence model, especially that of constant population size, may be responsible for the disagreement.
Upon closer examination of these differences in observed F st , the data here provide some support for the idea that genetic drift occurs faster in females than in males, or, equivalently, that the female effective population size is smaller than that of males. Many factors could potentially explain this observation; a larger correlation in females between reproductive success in parents and offspring or a smaller generation time in females may increase the rate of drift in females compared with that in males.
The use of X-chromosomal data revealed clustering similar to that obtained using autosomal data, but with less resolution (Figures 3-5). In America, Africa and Oceania, inferred clusters corresponded closely with predefined populations using both the autosomal and X-chromosomal loci, but the pattern of admixture observed by Rosenberg et al. is not exactly the same as that revealed by the X chromosome, owing to the reduced resolution of clusters.
Note that the Oceanic (Melanesian and Papuan) populations in Figure 3 appear most similar to the African populations for 2 ≤ K ≤ 4, and then appear as their own genetic cluster at K = 6. This contrasts with the results of Wilson et al., whose analysis of 23 X-linked microsatellites using structure showed the Oceanic population combining with the Chinese population at K = 3. A possible explanation for the results here may be a migration from Africa to Oceania separate from the primary migration out of Africa to other regions.
While choosing representative individuals from various populations is an important factor in the success of studies concerned with inference of population structure, the robustness of structure is much more dependent on the number of microsatellite markers used (Figures 5 and 6). In common with Rosenberg et al., it is observed here that ancestry inference is most successful with at least 150 loci (Figure 5). Bamshad et al.  reported that correct assignment to the continent of origin with a mean accuracy of at least 90 per cent required a minimum of 60 loci and reached 99-100 per cent accuracy when more than 100 loci were used.
In contrast to this study, Bamshad et al. considered a sample correctly assigned if the cluster with the greatest membership coefficient for an individual was the same as the predefined assignment. The criterion here compares the membership coefficients across all K clusters calculated when using structure on a subset of the data, with assignment made based on the full dataset. Thus, it is a measure of how well the results with smaller amounts of data match those with larger datasets, rather than a measure of 'correct assignment'. The difference in these criteria is likely to account for the smaller amount of genetic data regarded as sufficient by Bamshad et al. The similarity coefficient C may be more sensitive to differences in membership coefficients between two runs and can be viewed as a conservative measure of similarity for the runs: visual similarity between graphs of estimated membership coefficients (Figure 6) can be achieved even with fairly small values of C (Figure 5). In Figure 6, for example, the plot using 100 loci and a maximum of 200 individuals per region is quite similar to the plot of the full data, while the similarity coefficient between the structure runs of that particular subset and the entire dataset is 0.379. C does not make use of the 'correct' predefined structure, and, thus, unlike the criterion used by Bamshad et al., is unaffected by errors among the predefined labels.
While most studies to date have lacked the power to make strong inferences about population structure (due to the very recent availability of datasets with individuals assayed for large numbers of loci), future studies should choose an appropriate number both of individuals per region and of loci for these analyses. Note, however, that the sampling scheme may affect the estimated structure. For example, finer distinctions among populations of interest become visible when individuals who are more distantly related to those populations are omitted from analysis.
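The 'correct assignment' criterion attributed to Bamshad et al. can be illustrated with a short sketch. It assumes that cluster indices have already been matched to the predefined groups, which structure itself does not guarantee, and the function name is hypothetical. The definition of the similarity coefficient C is not given in this text, so it is not reproduced here.

```python
def assignment_accuracy(memberships, predefined):
    """Fraction of individuals whose largest membership coefficient falls
    in the cluster matching their predefined group.

    memberships: one list of K coefficients per individual (summing to 1).
    predefined:  one cluster index per individual; indices are assumed to
                 be aligned with the predefined groups beforehand.
    """
    if len(memberships) != len(predefined):
        raise ValueError("one predefined label is needed per individual")
    correct = 0
    for coeffs, label in zip(memberships, predefined):
        best = max(range(len(coeffs)), key=lambda k: coeffs[k])
        if best == label:
            correct += 1
    return correct / len(predefined)
```

Under this criterion an individual with coefficients (0.51, 0.49) counts the same as one with (0.99, 0.01), which is one reason a comparison across all K coefficients can be more demanding.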
Although differences between the population structure based on the autosomes and X-linked loci may be expected due to differences in male and female demography, the differences between the results here and those of Rosenberg et al.  were largely due to the smaller number of X chromosomes in a population compared with autosomes, and to the smaller amount of data available from the X chromosome. From these results, it might be inferred that sex-biased demographic processes have not had a great influence on human population structure.
Jorde LB, Watkins WS, Bamshad MJ, et al: 'The distribution of human genetic diversity: A comparison of mitochondrial, autosomal and Y-chromosome data'. Am J Hum Genet. 2000, 66: 979-988. 10.1086/302825.
Oota H, Settheetham-Ishida W, Tiwawech D, et al: 'Human mtDNA and Y-chromosome variation is correlated with matrilocal versus patrilocal residence'. Nature Genet. 2001, 29: 20-21. 10.1038/ng711.
Seielstad MT, Minch E, Cavalli-Sforza LL: 'Genetic evidence for a higher female migration rate in humans'. Nature Genet. 1998, 20: 278-280. 10.1038/3088.
Cann HM, de Toma C, Cazes L, et al: 'A human genome diversity cell line panel'. Science. 2002, 296: 261-262.
Pritchard JK, Stephens M, Donnelly P: 'Inference of population structure using multilocus genotype data'. Genetics. 2000, 155: 945-959.
Rosenberg NA, Pritchard JK, Weber JL, et al: 'Genetic structure of human populations'. Science. 2002, 298: 2381-2385. 10.1126/science.1078311.
Zhivotovsky LA, Rosenberg NA, Feldman MW: 'Features of evolution and expansion of modern humans, inferred from genome-wide microsatellite markers'. Am J Hum Genet. 2003, 72: 1171-1186. 10.1086/375120.
Sabatti C, Risch N: 'Homozygosity and linkage disequilibrium'. Genetics. 2002, 160: 1707-1719.
Weir B: Genetic Data Analysis II. 1996, Sinauer Press, Sunderland, MA
Barbujani G, Magagni A, Minch E, et al: 'An apportionment of human DNA diversity'. Proc Natl Acad Sci USA. 1997, 94: 4516-4519. 10.1073/pnas.94.9.4516.
Lewontin RC: 'The apportionment of human diversity'. Evol Biol. 1972, 6: 381-398.
Lewis PO, Zaykin DV: 'Genetic Data Analysis: Computer program for the analysis of allelic data'. 2001, [http://lewis.eeb.uconn.edu/lewishome/software.html]
Slatkin M: 'Inbreeding coefficients and coalescence times'. Genet Res. 1991, 58: 167-175. 10.1017/S0016672300029827.
Slatkin M: 'A measure of population subdivision based on microsatellite allele frequencies'. Genetics. 1995, 139: 457-462.
Pérez-Lezaun A, Calafell F, Seielstad M, et al: 'Population genetics of Y-chromosome short tandem repeats in humans'. J Mol Evol. 1997, 45: 265-270. 10.1007/PL00006229.
Ewens WJ: Population Genetics. 1969, Methuen & Co., London, UK
Hartl DL, Clark AG: Principles of Population Genetics. 1997, Sinauer Press, Sunderland, MA
Nordborg M, Krone S: 'Separation of time scales and convergence to the coalescent in structured populations'. Modern Developments in Theoretical Population Genetics. Edited by: Slatkin M, Veuille M. 2002, Oxford University Press, Oxford, UK
Rosenberg NA, Pritchard JK, Weber JL, et al: 'Response to comment on 'Genetic structure of human populations''. Science. 2003, 300: 1877-
Bamshad MJ, Wooding S, Watkins WS, et al: 'Human population genetic structure and inference of group membership'. Am J Hum Genet. 2003, 72: 578-589. 10.1086/368061.
Edwards AWF: 'Human genetic diversity: Lewontin's fallacy'. Bioessays. 2003, 25: 798-801. 10.1002/bies.10315.
Helgason A, Hrafnkelsson B, Gulcher JR, et al: 'A populationwide coalescent analysis of Icelandic matrilineal and patrilineal genealogies: Evidence for a faster evolutionary rate of mtDNA lineages than Y chromosomes'. Am J Hum Genet. 2003, 72: 1370-1388. 10.1086/375453.
Wilson JF, Weale ME, Smith AC, et al: 'Population genetic structure of variable drug response'. Nature Genet. 2001, 29: 265-269. 10.1038/ng761.
Disotell TR: 'Human evolution: the southern route to Asia'. Curr Biol. 1999, 9: R925-R928. 10.1016/S0960-9822(00)80106-2.
Rosenberg NA: 'Distruct: A program for the graphical display of population structure'. Mol Ecol Notes. 2004, [http://www.blackwell-synergy.com/links/doi/10.1046/j.1471-8286.2003.00566.x/full]
This research was supported in part by NIH Grants GM28428 and GM28016. Sohini Ramachandran is also supported by a NDSEG fellowship. Noah A. Rosenberg is supported by an NSF Postdoctoral Fellowship in Biological Informatics.