Robustness of the inference of human population structure: A comparison of X-chromosomal and autosomal microsatellites
© Henry Stewart Publications 2004
Received: 24 October 2003
Accepted: 24 October 2003
Published: 1 January 2004
In this paper, data on 20 X-chromosomal microsatellite polymorphisms from the HGDP-CEPH cell line panel are used to infer human population structure. Inferences from these data are compared to those obtained from autosomal microsatellites. Some of the major features of the structure seen with 377 autosomal markers are generally visible with the X-linked markers, although the latter provide less resolution. Differences between the X-chromosomal and autosomal results can be explained without requiring major differences in demographic parameters between males and females. The dependence of the partitioning on the number of individuals sampled from each region and on the number of markers used is discussed.
Keywords: AMOVA, Bayesian inference, clustering, human evolution, population divergence, X chromosome
Differences in patterns of human genetic variation across genetic systems -- such as autosomes, the X and Y chromosomes and the mitochondrial genome -- can be attributed to two main sources: (1) differences between males and females in demographic parameters such as population size and migration rate; and (2) differences across systems in the mechanism of inheritance. Past studies have reported on differences between evolution in males and females by comparing autosomal data with data from the non-recombining portion of the Y chromosome (NRY) and from mitochondrial DNA (mtDNA);[1–3] however, the X chromosome has generally not been utilised in these studies.
Unlike the NRY and mtDNA, the X chromosome undergoes recombination and contains numerous independent markers. Additionally, selection, if present in the uniparental systems, will affect every locus; on the X chromosome, however, it affects only those loci that are closely linked to selected sites. Consequently, the differences in variation between autosomes and the X chromosome may be more directly ascribed to male/female demographic differences than those between autosomes and the uniparental systems.
Here the HGDP-CEPH Human Genome Diversity Cell Line Panel is used to test whether individual multilocus genotypes defined by X-linked markers produce different inferences about population structure from those obtained using autosomal genotypes. Results from X-chromosomal analysis of molecular variance (AMOVA) and cluster analysis (as implemented in structure) are compared with those found using the same techniques on 377 autosomal markers. These comparisons are used to study the extent to which the differences between the present findings and the results reported by Rosenberg et al. stem from differences in the mechanism of inheritance and from the smaller amount of information available on the X chromosome. This analysis leads to conclusions about the robustness of population structure inference with respect to the number of microsatellite markers and the number of individuals sampled per region.
Methods and results
The 1,056 individuals (677 males, 379 females) analysed by Rosenberg et al. and Zhivotovsky et al., drawn from 52 populations in seven regional groups, were typed for X-linked markers. The X-chromosomal data were compared to autosomal data from these same individuals [6, 7].
The loci studied on the X chromosome consist of 20 polymorphic microsatellites -- 4 di-, 2 tri- and 14 tetranucleotide repeats -- from Marshfield Screening Set #10, with 5.2 per cent missing data. Three of the markers were pseudoautosomal (tetranucleotide DXYS218, tetranucleotide DXS9900 and dinucleotide DXYS154), so that the males were not hemizygous but homo- or heterozygous at these loci.
In both the autosomal data and the X-chromosomal data, markers are sufficiently widely spaced that, within individual populations, linkage disequilibrium as estimated by homozygosity-based statistics is generally not observed (results not shown). Thus, these loci can be treated as independent markers.
Heterozygosity in the seven regions, computed using the unbiased estimator, ranged as follows: 0.57 (America), 0.64 (Oceania), 0.67 (East Asia), 0.71 (Central/South Asia), 0.72 (Europe), 0.74 (Middle East) and 0.78 (Africa). Of the 237 total alleles in the data, 34 were confined to a single population; 29 of these 'private' alleles appeared only once in the sample. Of the 208 alleles found more than once in the sample, 7.2 per cent were exclusive to one of the seven geographical regions listed above.
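As an illustration of the heterozygosity calculation, the following is a minimal sketch that assumes the unbiased estimator takes the standard form H = n/(n - 1) × (1 - Σ p_i²), where n is the number of sampled allele copies at a locus and p_i are the sample allele frequencies. The function names and the toy allele sizes are hypothetical and are not taken from the HGDP-CEPH data. For non-pseudoautosomal X-linked loci, a male would contribute one allele copy and a female two.

```python
from collections import Counter

def unbiased_heterozygosity(alleles):
    """Unbiased expected heterozygosity at one locus.

    alleles: list of allele labels, one entry per sampled chromosome copy.
    Assumes the standard form H = n/(n-1) * (1 - sum(p_i^2)).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled allele copies")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)

def mean_heterozygosity(loci):
    """Average the per-locus estimate over loci (missing data already removed)."""
    values = [unbiased_heterozygosity(a) for a in loci]
    return sum(values) / len(values)

# Hypothetical toy data: allele sizes at two loci (illustration only).
toy_loci = [
    [120, 120, 124, 128],
    [200, 204, 204, 204, 208, 208],
]
print(round(mean_heterozygosity(toy_loci), 3))
```

Averaging per-locus estimates, as done here, is one common convention; other weighting schemes are possible when loci differ greatly in sample size.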
AMOVA for the X chromosome
Table: Analysis of molecular variance (AMOVA) for 17 (non-pseudoautosomal) X-chromosomal markers. Columns: number of regions; number of populations; variance components (%), including among populations within regions.
Table 2. Results for two-sided Wilcoxon tests with r = 0.875 and r = 0.5.
Columns: number of regions; number of populations; P-value for two-sided Wilcoxon test with H0: W_X = W_aut (r = 0.875, z = 0.143); P-value for two-sided Wilcoxon test with H0: W_X = 3W_aut/(4 - W_aut) (r = 0.5, z = 1); value of r that produces the highest P-value.
P-values in extraction order (row and column labels lost): 5.43 × 10^-5; 1.01 × 10^-3; 2.42 × 10^-4; 9.04 × 10^-3; 4.44 × 10^-6; 7.69 × 10^-4; 7.72 × 10^-10; 2.53 × 10^-7; 3.63 × 10^-9; 1.22 × 10^-6; 1.15 × 10^-6; 5.49 × 10^-5; 6.08 × 10^-3; 5.47 × 10^-10; 3.60 × 10^-8; 2.43 × 10^-10; 1.19 × 10^-8.
This observation may be explained by a faster rate of genetic drift for X-chromosomal markers than for autosomal markers. Because populations contain fewer copies of X chromosomes than of any given autosome, drift may proceed more rapidly for X-chromosomal markers, leading to greater X-chromosomal differentiation across populations and larger among-population and among-region variance components.
Writing N = N_f + N_m, where N_f and N_m are the effective population sizes of females and males, respectively, r = N_f/N is the female fraction of the effective population size. Using the expressions for autosomal and X-chromosomal effective population sizes, N_aut = 4r(1 - r)N and N_x = 9r(1 - r)N/[2(2 - r)],[16–18] together with the fact that T_aut/N_aut = T_x/N_x, equations (3)-(6) can be simplified.
Note that there are two special cases of interest (Table 2). At r = 1/2 (z = 1), drift proceeds at the same rate in males and females, so that W_X = 3W_aut/(4 - W_aut). At r = 7/8 = 0.875 (z = 1/7 ≈ 0.143), the slower drift in females compared with males reduces the drift rate of X chromosomes exactly enough to counteract the increase in X-chromosomal drift rate that results from their smaller number in the population; in other words, W_X = W_aut. Because the hypothesis W_X = W_aut can be rejected at P = 0.05 for 11 of the 13 groupings of data in Table 2, the hypothesis z = 1/7 can also be rejected.
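The two special cases can be checked numerically from the effective-size expressions quoted in the text. The sketch below computes the ratio N_x/N_aut = 9/[8(2 - r)], which equals 3/4 at r = 1/2 and 1 at r = 7/8. Two parts are assumptions rather than results taken from the text: the relation z = (1 - r)/r, chosen because it reproduces the quoted pairs (r = 1/2, z = 1) and (r = 7/8, z = 1/7), and the general transformation W_X = ρW_aut/(1 - W_aut + ρW_aut) with ρ = N_x/N_aut, chosen because it reduces to both stated special cases. Since equations (3)-(6) are not reproduced in this section, the general form may not match them term by term.

```python
from fractions import Fraction

def n_aut(r, N=1):
    """Autosomal effective size, 4 r (1 - r) N, with r = N_f / N."""
    return 4 * r * (1 - r) * N

def n_x(r, N=1):
    """X-chromosomal effective size, 9 r (1 - r) N / [2 (2 - r)]."""
    return 9 * r * (1 - r) * N / (2 * (2 - r))

def size_ratio(r):
    """N_x / N_aut = 9 / [8 (2 - r)]; N and r (1 - r) cancel."""
    return Fraction(9) / (8 * (2 - r))

def z_from_r(r):
    """Assumed relation z = (1 - r) / r, matching the two values in the text."""
    return (1 - r) / r

def predicted_w_x(w_aut, r):
    """Assumed transformation: W_X = rho W_aut / (1 - W_aut + rho W_aut)."""
    rho = size_ratio(r)
    return rho * w_aut / (1 - w_aut + rho * w_aut)

half, seven_eighths = Fraction(1, 2), Fraction(7, 8)
w = Fraction(9, 10)  # hypothetical autosomal value, for illustration only

print(size_ratio(half))                            # 3/4
print(size_ratio(seven_eighths))                   # 1
print(z_from_r(seven_eighths))                     # 1/7
print(predicted_w_x(w, half) == 3 * w / (4 - w))   # True
print(predicted_w_x(w, seven_eighths) == w)        # True
```

Exact rational arithmetic is used so that the equalities hold without floating-point tolerance; passing floats for r also works but then comparisons should allow for rounding.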
At r = 0.5, when drift proceeds at the same rate in males and females, significant P-values (P < 0.05) were found for Africa, Eurasia (treated both as one region and as three regions), Europe, Central/South Asia and East Asia. Therefore, for the remaining seven of the 13 samples, the differences in autosomal and X-chromosomal F_st values can be explained by assuming that N_m = N_f and by using the smaller effective population size of X chromosomes alone. Because Rosenberg et al. found that repeat size affected divergence, Wilcoxon tests were also performed between transformations of the 274 autosomal tetranucleotide repeats and the 14 X-linked tetranucleotides, and similar results were obtained whether or not the two pseudoautosomal tetranucleotides were included in the analysis (not shown).
The values of r on the interval [0,1] that produced the largest P-value (see Figure 1a-c) are also reported in Table 2. America was the only sample where the value of r corresponding to the maximal P-value was greater than 0.5 (r = 0.66 resulted in P = 1.00 for this case).
Of special interest is the fact that five of the six samples with significant P-values (P < 0.05) at r = 0.5 recorded significant P-values as r varied over the whole range [0,1]. For example, as r varies from 0 to 1 in Figure 1b, the P-value resulting from the autosomal transformation for Eurasia (treated as one region) ranges from 4.95 × 10^-4 (r = 0) to 1.97 × 10^-10 (r = 1). The single exception to this pattern was Africa, where the P-value decreased monotonically as r increased and P < 0.05 for r ≥ 0.06.
For these six groupings of the dataset (Africa, Eurasia both as one and as three regions, Europe, Central/South Asia and East Asia), the divergence model with constant effective population size is likely to provide a poorer approximation, as it does not account for population growth or migration (Figure 1d-f).
Multidimensional scaling analysis
X-chromosomal population structure
The structure program identifies subgroupings with distinctive allele frequencies and places individuals into K clusters, where K is defined beforehand by the user and can be varied across independent runs of the program. An individual's membership of a particular cluster is presented as a number between 0 and 1, with membership coefficients summing to 1 across all K clusters.
Table: Correlation coefficients of allele frequencies. Below the diagonal: correlations for 237 X-chromosomal alleles. Above the diagonal: correlations for 4,682 autosomal alleles.
The 377 autosomal markers in the HGDP-CEPH Human Genome Diversity Cell Line Panel data[4, 6, 7] comprise the largest multilocus dataset presently available for studying globally distributed populations. Of interest in studies of population structure is the number of loci needed for clustering. Also considered here is the required number of sampled individuals [6, 20].
Oceania appears in Figure 6 as a distinct cluster with only ten loci and between 35 and 100 individuals per region. Because the Oceanic populations together contain 39 individuals, increasing the number of individuals beyond 35 per region meant that every Melanesian and Papuan was included in the subset run of structure. Thus, the distinctive allele frequencies of these populations identify this particular genetic cluster, despite the use of only ten loci.
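The effect of the per-region cap on small regions can be sketched as follows. This assumes that subsets are formed by drawing at most a fixed number of individuals at random from each region, which is an assumption about the sampling scheme rather than a description of it. The function name, the individual IDs and the Africa count are hypothetical; only the Oceania count of 39 comes from the text.

```python
import random
from collections import defaultdict

def subsample_by_region(region_of, max_per_region, rng):
    """Keep at most max_per_region randomly chosen individuals per region.

    region_of: dict mapping individual ID -> region label.
    Returns a sorted list of retained individual IDs.
    """
    by_region = defaultdict(list)
    for ind, region in region_of.items():
        by_region[region].append(ind)
    kept = []
    for region, members in by_region.items():
        members = sorted(members)
        if len(members) <= max_per_region:
            kept.extend(members)  # small regions are included whole
        else:
            kept.extend(rng.sample(members, max_per_region))
    return sorted(kept)

# Hypothetical IDs; only the Oceania count (39) comes from the text.
region_of = {f"OCE{i:03d}": "Oceania" for i in range(39)}
region_of.update({f"AFR{i:03d}": "Africa" for i in range(120)})  # 120 is made up

rng = random.Random(0)
for cap in (35, 50):
    kept = subsample_by_region(region_of, cap, rng)
    n_oce = sum(1 for ind in kept if region_of[ind] == "Oceania")
    print(cap, n_oce)  # prints "35 35" then "50 39"
```

Once the cap reaches 39, every Oceanic individual is retained in every subset, so further increases in the cap change only the sampling of the larger regions.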
The same techniques as those of Rosenberg et al. and Zhivotovsky et al. were used to analyse genetic structure as inferred from 20 microsatellite markers on the X chromosome. Multidimensional scaling (Figure 2) did not reveal major departures from the patterns exhibited by the autosomal data. As was also observed on the autosomes, America and Oceania are the regions exhibiting the lowest heterozygosity (0.57 and 0.64, respectively) on the X chromosome.
Seielstad et al. used a migration model to attribute differences in F_st across genetic systems to a difference in male and female migration rates. By contrast, a divergence model was used here and it was found that the differences observed in F_st values can, in many cases, be explained by the smaller effective population size of X chromosomes compared with autosomes. This is similar to what was observed by Jorde et al., who reported higher G_st values in Y-chromosome restriction-site polymorphisms and mtDNA compared with autosomal systems, and found that this difference was expected because of the lower effective population size of the uniparentally inherited portions of the genome. In those regions here where the smaller number of X chromosomes does not provide a sufficient explanation (Africa, Eurasia, Europe, Central/South Asia and East Asia), the assumptions of the divergence model, especially that of constant population size, may be responsible for the disagreement.
Upon closer examination of these differences in observed F_st, the data here provide some support for the idea that genetic drift occurs faster in females than in males, or, equivalently, that the female effective population size is smaller than that of males. Many factors could potentially explain this observation; a larger correlation in females between reproductive success in parents and offspring or a smaller generation time in females may increase the rate of drift in females compared with that in males.
The use of X-chromosomal data revealed clustering similar to that obtained using autosomal data, but with less resolution (Figures 3-5). In America, Africa and Oceania, inferred clusters corresponded closely with predefined populations using both the autosomal and X-chromosomal loci, but the pattern of admixture observed by Rosenberg et al.  is not exactly the same as that revealed by the X chromosome, due to reduced resolution of clusters.
Note that the Oceanic (Melanesian and Papuan) populations in Figure 3 appear most similar to the African populations for 2 ≤ K ≤ 4, and then appear as their own genetic cluster at K = 6. This contrasts with the findings of Wilson et al., whose analysis of 23 X-linked microsatellites using structure showed the Oceanic population combining with the Chinese population at K = 3. A possible explanation for the results here may be a migration from Africa to Oceania separate from the primary migration out of Africa to other regions.
While choosing representative individuals from various populations is an important factor in the success of studies concerned with inference of population structure, the robustness of structure is much more dependent on the number of microsatellite markers used (Figures 5 and 6). In common with Rosenberg et al., it is observed here that ancestry inference is most successful with at least 150 loci (Figure 5). Bamshad et al.  reported that correct assignment to the continent of origin with a mean accuracy of at least 90 per cent required a minimum of 60 loci and reached 99-100 per cent accuracy when more than 100 loci were used.
In contrast to this study, Bamshad et al. considered a sample correctly assigned if the cluster with the greatest membership coefficient for an individual was the same as the predefined assignment. The criterion here compares the membership coefficients across all K clusters calculated when using structure on a subset of the data with those calculated from the full dataset. Thus, it is a measure of how well the results with smaller amounts of data match those with larger datasets, rather than a measure of 'correct assignment'. The difference in these criteria is likely to account for the smaller amount of genetic data regarded as sufficient by Bamshad et al. The similarity coefficient C may be more sensitive to differences in membership coefficients between two runs and can be viewed as a conservative measure of similarity for the runs: visual similarity between graphs of estimated membership coefficients (Figure 6) can be achieved even with fairly small values of C (Figure 5). In Figure 6, for example, the plot using 100 loci and a maximum of 200 individuals per region is quite similar to the plot of the full data, while the similarity coefficient between the structure runs of that particular subset and the entire dataset is 0.379. C does not make use of the 'correct' predefined structure and thus, unlike the criterion used by Bamshad et al., is unaffected by errors among the predefined labels.
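The contrast between the two criteria can be illustrated with a small sketch. The first function implements the largest-membership rule described above for Bamshad et al. The second is a stand-in for a run-to-run similarity: the definition of C is not given in this section, so the Frobenius-norm form used here, 1 - ||Q1 - Q2|| / sqrt(||Q1 - S|| × ||Q2 - S||) with every entry of S equal to 1/K, is an assumption. The membership matrices are hypothetical toy values, and their cluster columns are assumed to be aligned already.

```python
import math

def max_membership_accuracy(q, labels):
    """Fraction of individuals whose largest membership coefficient
    falls in their predefined cluster (largest-membership criterion)."""
    hits = sum(
        1 for row, lab in zip(q, labels)
        if max(range(len(row)), key=row.__getitem__) == lab
    )
    return hits / len(q)

def frobenius(a, b):
    """Frobenius norm of the difference between two equal-shaped matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )

def similarity(q1, q2):
    """Assumed similarity between two runs (columns already aligned).

    Undefined if either matrix equals S, the matrix with all entries 1/K.
    """
    k = len(q1[0])
    s = [[1.0 / k] * k for _ in q1]
    return 1.0 - frobenius(q1, q2) / math.sqrt(frobenius(q1, s) * frobenius(q2, s))

# Hypothetical membership coefficients for four individuals, K = 2.
full = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
subset = [[0.7, 0.3], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]
labels = [0, 0, 1, 1]  # predefined cluster index for each individual

print(max_membership_accuracy(subset, labels))
print(round(similarity(full, subset), 3))
```

In this toy case every individual is 'correctly assigned' under the largest-membership rule, yet the assumed similarity between the subset and full runs is low, which mirrors the point that a coefficient-based comparison is the more conservative of the two.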
While most studies to date have lacked the power to make strong inferences about population structure (due to the very recent availability of datasets with individuals assayed for large numbers of loci), future studies should choose an appropriate number both of individuals per region and of loci for these analyses. Note, however, that the sampling scheme may affect the estimated structure. For example, finer distinctions among populations of interest become visible when individuals who are more distantly related to those populations are omitted from analysis.
Although differences between the population structure based on the autosomes and X-linked loci may be expected due to differences in male and female demography, the differences between the results here and those of Rosenberg et al.  were largely due to the smaller number of X chromosomes in a population compared with autosomes, and to the smaller amount of data available from the X chromosome. From these results, it might be inferred that sex-biased demographic processes have not had a great influence on human population structure.
This research was supported in part by NIH Grants GM28428 and GM28016. Sohini Ramachandran is also supported by an NDSEG fellowship. Noah A. Rosenberg is supported by an NSF Postdoctoral Fellowship in Biological Informatics.
1. Jorde LB, Watkins WS, Bamshad MJ, et al: 'The distribution of human genetic diversity: A comparison of mitochondrial, autosomal and Y-chromosome data'. Am J Hum Genet. 2000, 66: 979-988. 10.1086/302825.
2. Oota H, Settheetham-Ishida W, Tiwaweck D, et al: 'Human mtDNA and Y-chromosome variation is correlated with matrilocal versus patrilocal residence'. Nature Genet. 2001, 29: 20-21. 10.1038/ng711.
3. Seielstad MT, Minch E, Cavalli-Sforza LL: 'Genetic evidence for a higher female migration rate in humans'. Nature Genet. 1998, 20: 278-280. 10.1038/3088.
4. Cann HM, de Toma C, Cazes L, et al: 'A human genome diversity cell line panel'. Science. 2002, 296: 261-262.
5. Pritchard JK, Stephens M, Donnelly P: 'Inference of population structure using multilocus genotype data'. Genetics. 2000, 155: 945-959.
6. Rosenberg NA, Pritchard JK, Weber JL, et al: 'Genetic structure of human populations'. Science. 2002, 298: 2381-2385. 10.1126/science.1078311.
7. Zhivotovsky LA, Rosenberg NA, Feldman MW: 'Features of evolution and expansion of modern humans, inferred from genome-wide microsatellite markers'. Am J Hum Genet. 2003, 72: 1171-1186. 10.1086/375120.
8. Sabatti C, Risch N: 'Homozygosity and linkage disequilibrium'. Genetics. 2002, 160: 1707-1719.
9. Weir B: Genetic Data Analysis II. 1996, Sinauer Press, Sunderland, MA.
10. Barbujani G, Magagni A, Minch E, et al: 'An apportionment of human DNA diversity'. Proc Natl Acad Sci USA. 1997, 94: 4516-4519. 10.1073/pnas.94.9.4516.
11. Lewontin RC: 'The apportionment of human diversity'. Evol Biol. 1972, 6: 381-398.
12. Lewis PO, Zaykin DV: 'Genetic Data Analysis: Computer program for the analysis of allelic data'. 2001, [http://lewis.eeb.uconn.edu/lewishome/software.html]
13. Slatkin M: 'Inbreeding coefficients and coalescence times'. Genet Res. 1991, 58: 167-175. 10.1017/S0016672300029827.
14. Slatkin M: 'A measure of population subdivision based on microsatellite allele frequencies'. Genetics. 1995, 139: 457-462.
15. Pérez-Lezaun A, Calafell F, Seielstad M, et al: 'Population genetics of Y-chromosome short tandem repeats in humans'. J Mol Evol. 1997, 45: 265-270. 10.1007/PL00006229.
16. Ewens WJ: Population Genetics. 1969, Methuen & Co., London, UK.
17. Hartl DL, Clark AG: Principles of Population Genetics. 1997, Sinauer Press, Sunderland, MA.
18. Nordborg M, Krone S: 'Separation of time scales and convergence to the coalescent in structured populations'. Modern Developments in Theoretical Population Genetics. Edited by: Slatkin M, Veuille M. 2002, Oxford University Press, Oxford, UK.
19. Rosenberg NA, Pritchard JK, Weber JL, et al: 'Response to comment on 'Genetic structure of human populations''. Science. 2003, 300: 1877-
20. Bamshad MJ, Wooding S, Watkins WS, et al: 'Human population genetic structure and inference of group membership'. Am J Hum Genet. 2003, 72: 578-589. 10.1086/368061.
21. Edwards AWF: 'Human genetic diversity: Lewontin's fallacy'. Bioessays. 2003, 25: 798-801. 10.1002/bies.10315.
22. Helgason A, Hrafnkelsson B, Gulcher JR, et al: 'A populationwide coalescent analysis of Icelandic matrilineal and patrilineal genealogies: Evidence for a faster evolutionary rate of mtDNA lineages than Y chromosomes'. Am J Hum Genet. 2003, 72: 1370-1388. 10.1086/375453.
23. Wilson JF, Weale ME, Smith AC, et al: 'Population genetic structure of variable drug response'. Nature Genet. 2001, 29: 265-269. 10.1038/ng761.
24. Disotell TR: 'Human evolution: the southern route to Asia'. Curr Biol. 1999, 9: R925-R928. 10.1016/S0960-9822(00)80106-2.
25. Rosenberg NA: 'Distruct: A program for the graphical display of population structure'. Mol Ecol Notes. 2004, [http://www.blackwell-synergy.com/links/doi/10.1046/j.1471-8286.2003.00566.x/full]