Robustness of the inference of human population structure: A comparison of X-chromosomal and autosomal microsatellites

In this paper, data on 20 X-chromosomal microsatellite polymorphisms from the HGDP-CEPH cell line panel are used to infer human population structure. Inferences from these data are compared to those obtained from autosomal microsatellites. Some of the major features of the structure seen with 377 autosomal markers are generally visible with the X-linked markers, although the latter provide less resolution. Differences between the X-chromosomal and autosomal results can be explained without requiring major differences in demographic parameters between males and females. The dependence of the partitioning on the number of individuals sampled from each region and on the number of markers used is discussed.


Introduction
Differences in patterns of human genetic variation across genetic systems (such as autosomes, the X and Y chromosomes and the mitochondrial genome) can be attributed to two main sources: (1) differences between males and females in demographic parameters such as population size and migration rate; and (2) differences across systems in the mechanism of inheritance. Past studies have reported on differences between evolution in males and females by comparing autosomal data with data from the non-recombining portion of the Y chromosome (NRY) and from mitochondrial DNA (mtDNA); 1-3 however, the X chromosome has generally not been utilised in these studies.
Unlike the NRY and mtDNA, the X chromosome undergoes recombination and contains numerous independent markers. Additionally, selection, if present in the uniparental systems, will affect every locus; on the X chromosome, however, it affects only those loci that are closely linked to selected sites. Consequently, the differences in variation between autosomes and the X chromosome may be more directly ascribed to male/female demographic differences than those between autosomes and the uniparental systems.
Here the HGDP-CEPH Human Genome Diversity Cell Line Panel 4 is used to test whether individual multilocus genotypes defined by X-linked markers produce different inferences about population structure from those obtained using autosomal genotypes. Results from X-chromosomal analysis of molecular variance (AMOVA) and cluster analysis (as implemented in structure 5) are compared with those found using the same techniques on 377 autosomal markers. 6 These comparisons are used to study the extent to which the differences between the X-chromosomal findings and the results reported by Rosenberg et al. 6 stem from differences in the mechanism of inheritance and from the smaller amount of information available on the X chromosome. This analysis leads to conclusions about the robustness of population structure inference with respect to the number of microsatellite markers and the number of individuals sampled per region.

Data
The 1,056 individuals (677 males, 379 females) analysed by Rosenberg et al. 6 and Zhivotovsky et al., 7 who derive from 52 populations in seven regional groups, were typed for X-linked markers. The X-chromosomal data were compared to autosomal data from these same individuals. 6,7 The loci studied on the X chromosome consist of 20 polymorphic microsatellites (4 di-, 2 tri- and 14 tetranucleotide repeats) from Marshfield Screening Set #10, with 5.2 per cent missing data. Three of the markers were pseudoautosomal (tetranucleotide DXYS218, tetranucleotide DXS9900 and dinucleotide DXYS154), so that the males were not hemizygous but homo- or heterozygous at these loci.
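Because males are hemizygous at the non-pseudoautosomal loci, the number of allele copies available per locus differs from the autosomal case. The following is a small sketch of that arithmetic using the sample counts given above. The function name `x_copies` is illustrative only, and missing data are ignored.

```python
# Sample composition reported in the text.
N_MALES = 677
N_FEMALES = 379


def x_copies(n_males: int, n_females: int, pseudoautosomal: bool) -> int:
    """Allele copies expected at one X-linked locus, ignoring missing data."""
    if pseudoautosomal:
        # Both sexes carry two copies in the pseudoautosomal region.
        return 2 * (n_males + n_females)
    # Males are hemizygous (one copy); females carry two copies.
    return n_males + 2 * n_females


print(x_copies(N_MALES, N_FEMALES, pseudoautosomal=False))  # 1435
print(x_copies(N_MALES, N_FEMALES, pseudoautosomal=True))   # 2112
```

For comparison, an autosomal locus typed in the same 1,056 individuals would also yield 2,112 copies, the same as a pseudoautosomal locus.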
In both the autosomal data and the X-chromosomal data, markers are sufficiently widely spaced that, within individual populations, linkage disequilibrium as estimated by homozygosity-based statistics 8 is generally not observed (results not shown). Thus, these loci can be treated as independent markers.

Genetic diversity
Heterozygosity in the seven regions, computed using the unbiased estimator, 9 ranged as follows: 0.57 (America), 0.64 (Oceania), 0.67 (East Asia), 0.71 (Central/South Asia), 0.72 (Europe), 0.74 (Middle East) and 0.78 (Africa). Of the 237 total alleles in the data, 34 were confined to a single population; 29 of these 'private' alleles appeared only once in the sample. Of the 208 alleles found more than once in the sample, 7.2 per cent were exclusive to one of the seven geographical regions listed above.
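As an illustration of how a per-locus heterozygosity of this kind can be computed, the sketch below assumes the commonly used small-sample-corrected estimator n/(n-1) * (1 - sum of squared allele frequencies), where n is the number of allele copies sampled. Whether this matches the cited estimator in every detail is an assumption, and the function name and toy allele list are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def unbiased_heterozygosity(alleles: Iterable[Hashable]) -> float:
    """Expected heterozygosity at one locus with an n/(n-1) correction.

    `alleles` lists every sampled allele copy (e.g. repeat lengths).
    Assumed form: n/(n-1) * (1 - sum_i p_i^2).
    """
    counts = Counter(alleles)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two allele copies")
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Toy locus (not from the HGDP-CEPH data): three alleles in 10 copies.
toy_locus = [12] * 5 + [14] * 3 + [16] * 2
print(round(unbiased_heterozygosity(toy_locus), 4))  # 0.6889
```

A regional value would then be obtained by averaging such per-locus estimates across loci.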

AMOVA for the X chromosome
It has generally been observed that the within-population component of genetic variation, $W = 1 - F_{st}$, is the largest component of human genetic diversity. 1,6,10,11 Using Genetic Data Analysis (GDA) 12 and assuming Hardy-Weinberg proportions within populations, the variance of allelic indicator variables was partitioned in the same manner as was done for autosomal loci from the same individuals. 6 For the 17 non-pseudoautosomal X-chromosomal markers, the within-group variance component accounted for 87-93 per cent of variation among individuals (Table 1). Note that these values are generally smaller than the corresponding autosomal values in Table 1 of Rosenberg et al. 6 (Table 2).
This observation may be explained by a faster rate of genetic drift for X-chromosomal markers than for autosomal markers. Because populations contain fewer copies of the X chromosome than of any given autosome, drift may proceed more rapidly for X-chromosomal markers, leading to greater X-chromosomal differentiation across populations and larger among-population and among-region variance components. This argument can be investigated using Slatkin's 13,14 formulation of $F_{st}$ in a set of $d$ populations, each with constant 'effective population size' of $N$ individuals. Consider a marker for which $t_0$ and $t_1$ are the mean coalescence times for two alleles from the same population and from different populations, respectively, and for which the mean coalescence time for two alleles chosen from any two populations is $\bar{t} = t_0/d + (d - 1)t_1/d$. Assuming mutation rates are small, Slatkin 13 obtained:

$$F_{st} = \frac{\bar{t} - t_0}{\bar{t}}. \tag{1}$$

Suppose that the $d$ populations diverged simultaneously at time $Q$ in the past, from an ancestral population also with effective population size $N$, where $Q$ is measured in the same units as $t_0$, $t_1$ and $\bar{t}$. Noting that $t_1 = t_0 + Q$ and substituting $u$ for $(d - 1)Q/d$, (1) gives:

$$F_{st} = \frac{u}{t_0 + u}. \tag{2}$$

The value of $t_0$, in units of generations or years, is proportional to the effective population size, and therefore differs across marker systems. Let $F_{aut}$ and $F_X$ denote autosomal and X-chromosomal values of $F_{st}$, and let $T_{aut}$ and $T_X$ denote autosomal and X-chromosomal values of $t_0$.
Following a similar calculation to that of Pérez-Lezaun et al., 15 to determine the relationship between $F_{aut}$ and $F_X$, one can equate expressions for $u$ obtained from autosomal and X-chromosomal versions of (2):

$$\frac{F_{aut} T_{aut}}{1 - F_{aut}} = \frac{F_X T_X}{1 - F_X}, \tag{3}$$

$$F_X = \frac{F_{aut} T_{aut}}{T_X + F_{aut}(T_{aut} - T_X)}. \tag{4}$$

The within-population components of genetic variation, $W_{aut} = 1 - F_{aut}$ and $W_X = 1 - F_X$ for autosomal and X-chromosomal loci, respectively, satisfy:

$$\frac{(1 - W_{aut}) T_{aut}}{W_{aut}} = \frac{(1 - W_X) T_X}{W_X}, \tag{5}$$

$$W_X = \frac{W_{aut} T_X}{T_{aut} - W_{aut}(T_{aut} - T_X)}. \tag{6}$$

Writing $N = N_f + N_m$, where $N_f$ and $N_m$ are effective population sizes of females and males, respectively, $r = N_f/N$ is the female fraction of the effective population size. Using the expressions for autosomal and X-chromosomal effective population sizes, $N_{aut} = 4r(1 - r)N$ and $N_X = 9r(1 - r)N/[2(2 - r)]$, 16-18 together with the fact that $T_{aut}/N_{aut} = T_X/N_X$, (3-6) can be simplified.
Restricting attention to (6) leads to:

$$W_X = \frac{9 W_{aut}}{8(2 - r) - (7 - 8r) W_{aut}}. \tag{7}$$

In terms of the relative rate of drift in females compared with males, denoted as $z = (1 - r)/r$, (7) gives:

$$W_X = \frac{9(1 + z) W_{aut}}{8(1 + 2z) - (7z - 1) W_{aut}}. \tag{8}$$

Note that there are two special cases of interest (Table 2). At $r = 1/2$ ($z = 1$), drift proceeds at the same rate in males and females, so that $W_X = 3W_{aut}/(4 - W_{aut})$. At $r = 7/8 = 0.875$ ($z = 1/7 \approx 0.143$), the slow speed of drift in females compared with males reduces the drift rate of X chromosomes exactly enough to counteract the increase in X-chromosomal drift rate that results from their smaller number in the population. In other words, $W_X = W_{aut}$. The fact that the hypothesis $W_X = W_{aut}$ can be rejected at $P = 0.05$ for 11 of the 13 groupings of data in Table 2 means that the hypothesis $z = 1/7$ can also be rejected.
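The mapping from an autosomal within-population component to its X-chromosomal expectation can be checked numerically. The sketch below is a minimal illustration under the stated divergence model. It assumes that $W = t_0/(t_0 + u)$ with $t_0$ proportional to effective population size, and uses the effective sizes $N_{aut} = 4r(1 - r)N$ and $N_X = 9r(1 - r)N/[2(2 - r)]$ given in the text. The function names are hypothetical.

```python
def size_ratio(r: float) -> float:
    """N_aut / N_X for female fraction r; the r(1 - r)N factors cancel."""
    return 8.0 * (2.0 - r) / 9.0


def expected_w_x(w_aut: float, r: float) -> float:
    """X-chromosomal within-population component implied by w_aut.

    Assumes W = t0 / (t0 + u) with the same u for both marker systems,
    so that (1 - W_X) / W_X = (N_aut / N_X) * (1 - W_aut) / W_aut.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    k = size_ratio(r)
    return w_aut / (w_aut + k * (1.0 - w_aut))


def drift_ratio(r: float) -> float:
    """z = N_m / N_f = (1 - r) / r, the female-to-male ratio of drift rates."""
    return (1.0 - r) / r


w = 0.9
print(expected_w_x(w, 0.5), 3 * w / (4 - w))  # equal drift in both sexes
print(expected_w_x(w, 0.875), w)              # W_X equals W_aut
print(drift_ratio(0.875))                     # 1/7
```

The two printed pairs correspond to the special cases discussed above: at $r = 1/2$ the result agrees with $3W_{aut}/(4 - W_{aut})$, and at $r = 7/8$ the transformation leaves $W_{aut}$ unchanged.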
For each of the 13 datasets, the values of $r$, the female fraction of effective population size, were varied from 0 to 1. At each choice of $r$, the transformation in (7) was applied and P-values for the two-sided Wilcoxon test between the list of 377 transformed autosomal within-population variance components and the within-population variance components observed at the 17 non-pseudoautosomal X-linked markers were obtained (Figure 1a-c).
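The scan over $r$ can be sketched as follows. This is an illustrative outline, not the analysis code used for the paper. The rank-sum test uses a normal approximation without tie or continuity corrections, the autosomal-to-X mapping assumes the divergence model with effective sizes $N_{aut} = 4r(1 - r)N$ and $N_X = 9r(1 - r)N/[2(2 - r)]$, and the input values are synthetic placeholders rather than HGDP-CEPH variance components.

```python
import math


def expected_w_x(w_aut: float, r: float) -> float:
    """Autosomal-to-X mapping under the divergence model (N_aut/N_X = 8(2-r)/9)."""
    k = 8.0 * (2.0 - r) / 9.0
    return w_aut / (w_aut + k * (1.0 - w_aut))


def rank_sum_p(x, y) -> float:
    """Two-sided Wilcoxon rank-sum P-value (normal approximation, mid-ranks)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n, m = len(x), len(y)
    rank_total_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_total_x += mid_rank * sum(1 for q in range(i, j) if pooled[q][1] == 0)
        i = j
    mean = n * (n + m + 1) / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (rank_total_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def scan_r(w_aut_values, w_x_values, steps: int = 100):
    """Return (r, P) pairs for r = 0, 1/steps, ..., 1."""
    out = []
    for s in range(steps + 1):
        r = s / steps
        transformed = [expected_w_x(w, r) for w in w_aut_values]
        out.append((r, rank_sum_p(transformed, w_x_values)))
    return out


# Synthetic inputs: 377 "autosomal" values and 17 "X-linked" values that
# were generated from them with r = 0.3, so the scan should recover 0.3.
toy_aut = [0.85 + 0.14 * i / 376 for i in range(377)]
toy_x = [expected_w_x(toy_aut[i], 0.3) for i in range(4, 377, 23)]
best_r, best_p = max(scan_r(toy_aut, toy_x), key=lambda pair: pair[1])
print(best_r, best_p)
```

With real data, the $r$ giving the largest P-value is only a point summary. The full curve of P-values against $r$ shows how sharply the data discriminate among values of $r$.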
At $r = 0.5$, when drift proceeds at the same rate in males and females, significant P-values ($P < 0.05$) were found for Africa, Eurasia (treated both as one region and as three regions), Europe, Central/South Asia and East Asia. Therefore, for the remaining seven of the 13 samples, the differences in autosomal and X-chromosomal $F_{st}$ values can be explained by assuming that $N_m = N_f$ and by using the smaller effective population size of X chromosomes alone. Because Rosenberg et al. 19 found that repeat size affected divergence, Wilcoxon tests were also performed between transformations of the 274 autosomal tetranucleotide repeats and the 14 X-linked tetranucleotides, and similar results were obtained whether or not the two pseudoautosomal tetranucleotides were included in the analysis (not shown). The values of $r$ on the interval [0,1] that produced the largest P-value (see Figure 1a-c) are also reported in Table 2.

Table 1. Analysis of molecular variance (AMOVA) for 17 (non-pseudoautosomal) X-chromosomal markers. Ninety-five per cent confidence intervals (in parentheses) were calculated using 1,000 bootstraps across loci. The World-B97 sample 6,19 consists of 14 populations that were chosen in order to approximate the sample of Barbujani [...]
America was the only sample where the value of $r$ corresponding to the maximal P-value was greater than 0.5 ($r = 0.66$ resulted in $P = 1.00$ for this case).
Of special interest is the fact that five of the six samples with significant P-values ($P < 0.05$) at $r = 0.5$ recorded significant P-values as $r$ varied over the whole range [0,1]. For example, as $r$ varies from 0 to 1 in Figure 1b, the P-value resulting from the autosomal transformation for Eurasia (treated as one region) ranges from $4.95 \times 10^{-4}$ ($r = 0$) to $1.97 \times 10^{-10}$ ($r = 1$). The single exception to this pattern was Africa, where the P-value decreased monotonically as $r$ increased and $P < 0.05$ for $r \ge 0.06$.
For these six groupings of the dataset (Africa, Eurasia both as one and as three regions, Europe, Central/South Asia and East Asia), the divergence model with constant effective population size is likely to provide a poorer approximation, as it does not account for population growth or migration 7 (Figure 1d-f).

Table 2. Results for two-sided Wilcoxon tests with $r = 0.875$ and $r = 0.5$. For 11 of the 13 datasets, the observed within-population variance components on the autosomes are significantly different ($P < 0.05$) from those observed using X-linked markers. For seven of the 13 groupings, the observed differences can be explained by accounting for the smaller effective population size of X chromosomes compared with autosomes ($r = 0.5$). For six regions where a value of $r$ is not given in the rightmost column, no value of $r$ produces a high P-value.

[Table 2 column headings: Sample; Number of regions; Number of populations; Value of r that produces highest P-value]

[...] Figure 2) deviating somewhat from the rest of the American samples. These results agree with the analysis of the same populations using autosomal microsatellite markers. 7

X-chromosomal population structure
The structure 5 program identifies subgroupings with distinctive allele frequencies and places individuals into K clusters, where K is defined beforehand by the user and can be varied across independent runs of the program. An individual's membership of a particular cluster is presented as a number between 0 and 1, with membership coefficients summing to 1 across all K clusters.
As is true of autosomal allele frequencies, X-chromosomal allele frequencies are strongly correlated across regions (Table 3). Thus, as was done for the autosomal genotypes from the same individuals, 6 the correlated allele frequencies model implemented in structure 5 was used with runs of the same number of iterations as those used to analyse the autosomal data.
America and Africa were the two essentially discrete regions generated at $K = 2$ for the X-chromosomal dataset (Figure 3). To compare results with Rosenberg et al., 6 $K$ was increased from 2 to 6 incrementally. At $K = 3$, Eurasian populations were partially distinguished and the Mozabites were observed to have substantial membership with Africans, as may be expected from their location in Algeria. At $K = 4$, the X-chromosomal data show noticeably different structure from the autosomal data (see Figure 1 of Rosenberg et al. 6), as East Asia does not separate as a genetic cluster with good resolution. The next distinct cluster appears at $K = 6$, where the Oceanic, American and African regions are observed; Eurasia and East Asia separate less obviously, but still appear differentiated from each other.

The X chromosome polymorphisms produced similar clustering to the autosomes, but with less resolution. This raises the question of how the resolution of clusters depends on the number of markers available to study. Figure 4 shows that when the same amounts of data are used, the autosomal and X-chromosomal loci are largely in agreement. Clustering from 20 markers on either autosome 5 or autosome 11 (Figure 4) revealed results very similar to those found with the X-chromosomal dataset. (These particular autosomes were chosen because exactly 20 microsatellites had been typed on them.) For these chromosomes, at $K = 6$, only American, African and (in the case of chromosome 5) Oceanic populations appear distinctly. Furthermore, a sample of 20 markers spread across all of the autosomes yielded similar results, with the Kalash appearing as a distinct group, but with the Oceanic cluster absent. The Kalash (also seen distinctly in Figure 4 from the markers on chromosome 11) formed the sixth cluster in Rosenberg et al. 6 and was the only major cluster in that study that did not match a major geographical region.

Robustness
The 377 autosomal markers in the HGDP-CEPH Human Genome Diversity Cell Line Panel data 4,6,7 comprise the largest multilocus dataset presently available for studying globally distributed populations. Of interest in studies of population structure is the number of loci needed for clustering. Also considered here is the required number of sampled individuals. 6 Rosenberg et al. 6 found that inference of membership coefficients is most successful with at least 150 markers, and this is corroborated in Figure 5. It is also seen in Figure 5 that the addition of more individuals to a subset of the entire autosomal dataset (which contains 377 markers and 1,056 individuals) did not improve population structure inference as much as did the addition of loci. Data from individuals are used to estimate allele frequencies, which can be done fairly accurately with a small number of individuals; however, as structure uses distinctive genotypic combinations for the construction of clusters, and multilocus combinations are more likely to be distinctive to particular groups than are single-locus types, 21 additional loci can contribute more information to cluster analysis than can the addition of more individuals to the sample (Figure 6).
Oceania appears in Figure 6 as a distinct cluster with only ten loci and between 35 and 100 individuals per region. Because the Oceanic populations together contain 39 individuals, increasing the number of individuals beyond 35 per region meant that every Melanesian and Papuan was included in the subset run of structure. Thus, the distinctive allele frequencies of these populations identify this particular genetic cluster, despite the use of only ten loci.

Discussion
The same techniques as those of Rosenberg et al. 6 and Zhivotovsky et al. 7 were used to analyse genetic structure as inferred from 20 microsatellite markers on the X chromosome. Multidimensional scaling (Figure 2) did not reveal major departures from the patterns exhibited by the autosomal data. As was also observed on the autosomes, America and Oceania are the regions exhibiting the lowest heterozygosity on the X chromosome (0.57 and 0.64, respectively).
Seielstad et al. 3 used a migration model to attribute differences in $F_{st}$ across genetic systems to a difference in male and female migration rates. By contrast, a divergence model was used here and it was found that the differences observed in $F_{st}$ values can, in many cases, be explained by the smaller effective population size of X chromosomes compared with autosomes. This is similar to what was observed by Jorde et al., 1 who reported higher $G_{st}$ values in Y-chromosome restriction-site polymorphisms and mtDNA compared with autosomal systems, and found that this difference was expected because of the lower effective population size of the uniparentally inherited portions of the genome. In those regions here where the smaller number of X chromosomes does not provide a sufficient explanation (Africa, Eurasia, Europe, Central/South Asia and East Asia), the assumptions of the divergence model, especially that of constant population size, may be responsible for the disagreement.
Upon closer examination of these differences in observed $F_{st}$, the data here provide some support for the idea that genetic drift occurs faster in females than in males, or, equivalently, that the female effective population size is smaller than that of males. Many factors could potentially explain this observation; a larger correlation in females between reproductive success in parents and offspring or a smaller generation time in females 22 may increase the rate of drift in females compared with that in males.
The use of X-chromosomal data revealed clustering similar to that obtained using autosomal data, but with less resolution (Figures 3-5). In America, Africa and Oceania, inferred clusters corresponded closely with predefined populations using both the autosomal and X-chromosomal loci, but the pattern of admixture observed by Rosenberg et al. 6 is not exactly the same as that revealed by the X chromosome, owing to the reduced resolution of clusters. [The Oceanic populations in] Figure 3 appear most similar to the African populations for $2 \le K \le 4$, and then appear as their own genetic cluster at $K = 6$. This contrasts with the results of Wilson et al., 23 whose analysis of 23 X-linked microsatellites using structure showed the Oceanic population combining with the Chinese population at $K = 3$. A possible explanation for the results here may be a migration from Africa to Oceania separate from the primary migration out of Africa to other regions. 24

While choosing representative individuals from various populations is an important factor in the success of studies concerned with inference of population structure, the robustness of structure is much more dependent on the number of microsatellite markers used (Figures 5 and 6). In common with Rosenberg et al., 6 it is observed here that ancestry inference is most successful with at least 150 loci (Figure 5). Bamshad et al. 20 reported that correct assignment to the continent of origin with a mean accuracy of at least 90 per cent required a minimum of 60 loci and reached 99-100 per cent accuracy when more than 100 loci were used.
In contrast to this study, Bamshad et al. 20 considered a sample correctly assigned if the cluster with the greatest membership coefficient for an individual was the same as the predefined assignment. The criterion here compares the membership coefficients across all $K$ clusters calculated when using structure on a subset of the data with those obtained from the full dataset. Thus, it is a measure of how well the results with smaller amounts of data match those with larger datasets, rather than a measure of 'correct assignment'. The difference in these criteria is likely to account for the smaller amount of genetic data regarded as sufficient by Bamshad et al. 20 The similarity coefficient $C$ may be more sensitive to differences in membership coefficients between two runs and can be viewed as a conservative measure of similarity for the runs: visual similarity between graphs of estimated membership coefficients (Figure 6) can be achieved even with fairly small values of $C$ (Figure 5). In Figure 6, for example, the plot using 100 loci and a maximum of 200 individuals per region is quite similar to the plot of the full data, while the similarity coefficient 6 between the structure runs of that particular subset and the entire dataset is 0.379. $C$ does not make use of the 'correct' predefined structure and thus, unlike the criterion used by Bamshad et al., 20 is unaffected by errors among the predefined labels.
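To make the contrast between the two kinds of criterion concrete, the sketch below compares two matrices of membership coefficients directly, without reference to predefined labels. It is a stand-in measure for illustration only: the definition of the similarity coefficient $C$ is not given in this text, so the normalised distance used here, and the function name, are assumptions rather than the published coefficient. Cluster labels are arbitrary between runs, so the comparison is maximised over relabellings.

```python
import itertools
import math


def run_similarity(q_a, q_b) -> float:
    """Similarity in [0, 1] between two membership matrices.

    q_a and q_b are lists of rows (one per individual), each row holding
    K membership coefficients that sum to 1. The score is
    1 - sqrt(sum of squared differences / (2 * n_individuals)),
    taking the best value over all relabellings of q_b's clusters.
    This is a hypothetical stand-in, not the coefficient C of the paper.
    """
    n = len(q_a)
    k = len(q_a[0])
    best = 0.0
    for perm in itertools.permutations(range(k)):
        sq = sum(
            (q_a[i][c] - q_b[i][perm[c]]) ** 2
            for i in range(n)
            for c in range(k)
        )
        best = max(best, 1.0 - math.sqrt(sq / (2.0 * n)))
    return best


full_run = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
relabelled = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]  # same result, labels swapped
print(run_similarity(full_run, relabelled))  # 1.0
```

Exhaustive relabelling is practical for the values of $K$ considered here (at most 6, giving 720 permutations). A measure of this kind penalises every difference in membership coefficients, which is why it can be low even when the dominant cluster of each individual is unchanged.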
While most studies to date have lacked the power to make strong inferences about population structure (due to the very recent availability of datasets with individuals assayed for large numbers of loci), future studies should choose an appropriate number both of individuals per region and of loci for these analyses. Note, however, that the sampling scheme may affect the estimated structure. For example, finer distinctions among populations of interest become visible when individuals who are more distantly related to those populations are omitted from analysis. 6 Although differences between the population structure based on the autosomes and X-linked loci may be expected due to differences in male and female demography, the differences between the results here and those of Rosenberg et al. 6 were largely due to the smaller number of X chromosomes in a population compared with autosomes, and to the smaller amount of data available from the X chromosome. From these results, it might be inferred that sex-biased demographic processes have not had a great influence on human population structure.

Figure 5. Grid of similarity coefficients. Each square in the grid represents a similarity coefficient 6 between 0 and 1 (0 corresponds to white, 1 to black) that measures the similarity between a structure run with a subset of the autosomal dataset and the entire dataset used by Rosenberg et al. 6 The subsets varied in the number of loci used (labelled on the vertical axis) and the number of individuals used per region (on the horizontal axis). For region $i$, $N$ individuals per region corresponds to $\min(N, S_i)$ individuals, where $S_i$ is the total sample size of region $i$. Thus, five individuals per region corresponds to 35 individuals overall, ten per region to 70, 15 per region to 105, 20 per region to 140, 25 per region to 175, 35 per region to 245, 50 per region to 339, 75 per region to 489, 100 per region to 639 and 200 per region to 1,005 individuals total. Ten runs of each subset of the data were performed and the median similarity coefficient between the best subset run and ten runs of the entire dataset was used to generate a given square. Those values below the white line have a similarity coefficient of 50 per cent or higher with the entire dataset. Using 150 loci or more, and 200 or more individuals per region, runs had similarity coefficients ranging from 0.87 to 0.98.
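The capping rule used for the horizontal axis of Figure 5, in which a region contributes min(N, S_i) individuals, can be sketched as below. The regional sample sizes are placeholders, not the actual HGDP-CEPH counts: only the Oceanic total of 39 and the overall total of 1,056 come from the text, and the remaining values were chosen solely so that the totals agree with those listed in the caption.

```python
# Placeholder regional sample sizes S_i. Only "oceania" (39) and the overall
# total (1,056) are taken from the text; the other six values are hypothetical.
PLACEHOLDER_SIZES = {
    "oceania": 39,
    "region_2": 120,
    "region_3": 140,
    "region_4": 160,
    "region_5": 166,
    "region_6": 180,
    "region_7": 251,
}


def subset_total(per_region_cap: int, sizes: dict) -> int:
    """Total individuals when each region contributes min(cap, S_i)."""
    return sum(min(per_region_cap, s) for s in sizes.values())


for cap in (5, 35, 50, 75, 100, 200):
    print(cap, subset_total(cap, PLACEHOLDER_SIZES))
```

Once the cap exceeds 39, the Oceanic sample is included in full. This is why totals stop growing by seven individuals per unit increase in the cap.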