- Primary research
- Open Access
Impact of human population history on distributions of individual-level genetic distance
© Henry Stewart Publications 2005
- Received: 2 December 2004
- Accepted: 2 December 2004
- Published: 1 March 2005
Summaries of human genomic variation shed light on human evolution and provide a framework for biomedical research. Variation is often summarised in terms of one or a few statistics (eg FST and gene diversity). Now that multilocus genotypes for hundreds of autosomal loci are available for thousands of individuals, new approaches are applicable. Recently, trees of individuals and other clustering approaches have demonstrated the power of an individual-focused analysis. We propose analysing the distributions of genetic distances between individuals. Each distribution, or common ancestry profile (CAP), is unique to an individual, and does not require a priori assignment of individuals to populations. Here, we consider a range of models of population history and, using coalescent simulation, reveal the potential insights gained from a set of CAPs. Information lies in the shapes of individual profiles -- sometimes captured by variance of individual CAPs -- and the variation across profiles. Analysis of short tandem repeat genotype data for over 1,000 individuals from 52 populations is consistent with dramatic differences in population histories across human groups.
- human population genetic structure
- genetic similarity
- short tandem repeats (STRs)
- multilocus genotypes
The collective human gene pool, consisting of the genomes of all living people, has much to reveal regarding human population history. Until recently, surveys of human genetic variation have been sparse, in that hundreds or thousands of individuals have been studied for a small number of genetic regions (eg blood groups, human leukocyte antigens (HLA), mitochondrial DNA, Y chromosome [1–3]) and a few individuals have been studied for a large fraction of the genome (eg through the Human Genome Project). In the past few years, however, larger sets of individuals have been studied for hundreds of genetic regions and, concomitantly, new data analysis tools have been developed. With new data and new tools, we are rapidly gaining a more precise understanding of how genetically similar individuals are, and of how that similarity corresponds to other dimensions of human variation.
Summaries of human genetic variation
Most differences between genomes take the form of single nucleotide polymorphisms (SNPs) rather than DNA insertions, deletions or multiplications. For the autosomes, two DNA sequences chosen at random appear to differ, on average, at about one in every 1,000-1,500 nucleotide sites [7–9]. This level of diversity corresponds to between 2 and 3.2 million nucleotide differences between individual genomes and is about one order of magnitude lower than the diversity detected within Drosophila (fruitfly) populations.
Numerous studies have indicated that the number of differences between human genomes varies greatly depending on the pair of genomes considered. The most striking and consistent pattern is the higher level of genetic diversity in Africa than in other regions and the relatively low levels of diversity in the Americas. Zhao and colleagues, in examining a 10 kilobase (kb) non-coding region, found an average of 8.5 differences between African samples and an average of 8.2 differences between non-African samples. Yu and colleagues found a somewhat lower level of nucleotide diversity (π) of 0.076 per cent among Africans and 0.047 per cent among non-Africans. As indicated in the summary of short tandem repeat (STR) data by Rosenberg et al., diversity within African groups (average heterozygosity = 0.774) tends to be slightly higher than diversity within Middle Eastern (0.756), European (0.751) and Central and South Asian (0.752) populations.
Those groups are, in turn, somewhat more diverse than are the East Asian populations (heterozygosity = 0.723), which, in their turn, are more diverse than the Oceanic (0.683) and Native American (0.599) populations. All differences in heterozygosity for pairs of continents are significant at p < 0.00001, except for Europe versus the Middle East (p = 0.0058), Europe versus Central/South Asia (p = 0.7182) and the Middle East versus Central/South Asia (p = 0.0554) (Noah Rosenberg, personal communication).
Human genetic variation is often summarised in terms of hierarchical population genetic structure. In 1972, Lewontin estimated, using blood group and protein polymorphism data, that about 6.3 per cent of genetic variation was explained by differences among seven groups that he termed 'races'. Differences between members of the same population accounted for 85.4 per cent of the total genetic variation. The remaining 8.3 per cent was accounted for by the variation between populations, within each of the seven 'races'. In recent years, geneticists have replicated Lewontin's finding using independent regions of the genome: most estimates of FST (between-group variation) have ranged from 0.05-0.15 [4, 11–14]. These estimates indicate that two individuals affiliated with different racially or ethnically identified groups are only slightly more likely to differ at a given neutrally evolving locus than are two individuals affiliated with the same group. A large proportion of human genetic variation is found within racially, ethnically or linguistically identified groups. Notable exceptions, reflecting smaller effective population sizes, include the mitochondrial genome and Y chromosome SNPs, with recent estimates of between-group variation ranging from 0.3 to 0.4 [11, 15].
A focus on the individual
Although a combination of heterozygosity and genetic distance estimates for a set of populations may provide a fairly accurate summary of genetic variation, these statistics describe variation within the data most completely when population histories are relatively simple. One way to summarise population genetic structure in greater detail is to focus on individuals rather than on populations. For some species, summarisation at the individual level may reveal substructure that is hidden by population level summaries. In addition, the focus on the individual takes away the emphasis on the group labels. This change of emphasis can be particularly important when individuals have multiple group affinities. Finally, individual-based approaches have the potential to provide information about within- and between-group variations simultaneously.
Several research groups have used trees of individuals to summarise genetic variation [19–21]. Such trees provide much greater detail than do population trees, but are limited to a relatively small number of individuals and are not readily summarised. A number of algorithms, including those implemented via the immanc, BayesAss and structure software, allow one to assign, with a given probability, an individual (or portion of its genome) to a particular population. The BayesAss approach is valuable in estimating migration rates, inbreeding coefficients and recent immigrant ancestry simultaneously, but does require a priori assignment of individuals to populations. The structure algorithm identifies clusters of genetically similar individuals without a priori population assignment. The approach provides information regarding within-group variation, to the extent that individuals are inferred to be members of multiple clusters. The combination of structure analysis and distruct, a program that generates a graphical representation of population structure, provides a valuable exploratory tool. The many available approaches to estimation of relatedness between individuals also focus directly on individuals, but are most successfully applied in the context of relatively large, random mating populations [25–29].
Common ancestry profiles
We introduce an exploratory, individual-focused approach that complements population level analyses, trees of individuals, relatedness estimation and assignment/clustering algorithms. Distributions of genetic distances between individuals, here termed common ancestry profiles (CAPs), emphasise the shared ancestry of all members of a species and provide a detailed description of genetic variation without the need for a priori assignment of individuals to populations. Like distruct, the approach provides a visual representation of genetic variation, thereby constituting an exploratory data analysis tool. The profiles enable us to visualise how genetically similar an individual is to others in the context of linguistic, social or geographic variation. In addition, the approach brings together genealogical and population level perspectives.
The total set of genetic distances among individuals can be partitioned in a manner analogous to a partitioning of variance: individual heterozygosity (the fraction of an individual's loci that is heterozygous) represents within-individual variation; comparisons among individuals of a population represent within-population, between-individual variation; comparisons of individuals of different populations of a region represent within-region, between-population variation; and comparisons of individuals of different regions represent between-region variation. The CAPs can, therefore, provide a graphical display of an often misinterpreted breakdown of total genetic variation into its components.
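The partitioning described above can be sketched as a simple classification of each pairwise comparison by the labels the two individuals share. This is only an illustration: the dictionary keys (`id`, `population`, `region`), the function name and the toy region labels are hypothetical and are not taken from the original analysis.

```python
def comparison_level(a, b):
    """Classify a comparison by the labels shared by two sampled individuals.

    Each individual is a dict with hypothetical keys 'id', 'population'
    and 'region' (these names are assumptions made for this sketch).
    """
    if a["id"] == b["id"]:
        return "within-individual"
    if a["population"] == b["population"]:
        return "within-population"
    if a["region"] == b["region"]:
        return "within-region, between-population"
    return "between-region"

# Toy individuals; the labels are illustrative only.
pima_1 = {"id": 1, "population": "Pima", "region": "America"}
pima_2 = {"id": 2, "population": "Pima", "region": "America"}
surui_1 = {"id": 3, "population": "Surui", "region": "America"}
french_1 = {"id": 4, "population": "French", "region": "Europe"}

print(comparison_level(pima_1, pima_1))    # within-individual
print(comparison_level(pima_1, pima_2))    # within-population
print(comparison_level(pima_1, surui_1))   # within-region, between-population
print(comparison_level(pima_1, french_1))  # between-region
```

Grouping genetic distances by these four levels gives the components that a CAP displays graphically.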
As illustrated in Figure 1, in practice CAPs can vary quite dramatically across individuals. The overall profile for the individual affiliated with the Pima population (Figure 1a) is more skewed than that of the French individual (Figure 1b) because the Pima individual is genetically much more similar to other Pima individuals than to non-Pima individuals (Figure 1c), while the French individual is, on average, almost as similar to non-French individuals as to other French individuals (Figure 1d).
Going one step further, by combining individual profiles across all individuals of each population, we see variation across 52 previously described populations (Figure 2) [4, 30]. Several of the population samples from the Americas, as well as the Melanesian sample, reveal relatively broad distributions, even though individuals known to be closely related to others in the sample set have been removed. The Colombian and Melanesian profiles, for instance, reveal a number of pairs of genetically very similar individuals. Several of these pairs represent first- or second-degree relatives according to relpair analysis.
A CAP can be informative by indicating possible duplicated samples, by indicating closely related individuals within a dataset or as a graphical display of the partitioning of variance. The power of the approach is greatest, however, in cases where we have expectations for the shape of a profile. For a random mating population with all sampled individuals distantly related, a Central Limit Theorem argument leads to the expectation of a normal distribution for an individual CAP. Expectations under more realistic models of population history, however, are essential if we are to accurately interpret a set of profiles.
In order to facilitate interpretation of CAPs, we have considered a set of simple models of population history, simulating genetic variation in the context of those models using a coalescent approach. Through simulation, we have explored the impact of population isolation and of gene flow on CAPs, and the potential for inferring recent or long-term gene flow from a set of CAPs. In light of these simulations, we have evaluated a set of individual and group CAPs derived from STR variation at 377 loci.
where i, j, k, l represent distinct alleles. The similarity estimate meets the criteria for transformation to a distance, in that it is everywhere non-negative definite.
Estimates are averaged across loci to generate an approximation of distance for a pair of individuals. The distance measure ranges in value from 0 to 1.0, with 0 indicating the comparison of two individuals identical and homozygous at all loci and 1.0 indicating no overlap of alleles. The self-distance (the distance between an individual and him- or herself) for individuals heterozygous at all loci is 0.5. More generally, the distance between an individual and him- or herself (or a monozygotic twin) is h/2, where h is the fraction of loci heterozygous in that individual.
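The definition of the similarity measure is not reproduced in this text, so the following is only a minimal sketch under an assumed allele-sharing similarity: the fraction of the four cross-individual allele comparisons at a locus that match. This assumption was chosen because it reproduces the properties stated above (distance 0 for identical homozygotes, 0.5 for a heterozygote compared with itself and 1.0 when no alleles are shared). The function names and toy genotypes are hypothetical.

```python
from itertools import product

def locus_distance(g1, g2):
    """Distance at one locus between two diploid genotypes (pairs of alleles).

    ASSUMPTION: similarity is the fraction of the four cross-individual
    allele comparisons that match; distance is 1 minus that similarity.
    """
    matches = sum(a == b for a, b in product(g1, g2))
    return 1.0 - matches / 4.0

def individual_distance(ind1, ind2):
    """Average the per-locus distances over loci typed in both individuals."""
    per_locus = [locus_distance(g1, g2) for g1, g2 in zip(ind1, ind2)
                 if g1 is not None and g2 is not None]
    return sum(per_locus) / len(per_locus)

# Toy genotypes: STR alleles coded as repeat counts, one tuple per locus.
homozygote = [(10, 10), (12, 12)]
heterozygote = [(10, 11), (12, 14)]
mixed = [(10, 10), (12, 14)]   # heterozygous at half of the loci
unrelated = [(8, 9), (15, 16)]

print(individual_distance(homozygote, homozygote))      # 0.0
print(individual_distance(heterozygote, heterozygote))  # 0.5
print(individual_distance(mixed, mixed))                # 0.25
print(individual_distance(heterozygote, unrelated))     # 1.0
```

Under this assumed measure, the self-distance of the `mixed` individual is half of its heterozygous fraction (0.5), in line with the behaviour described for the self-distance.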
An individual CAP consists of the distribution of estimates for a single focal individual compared with a set of other individuals. We represent individual CAPs as binned relative frequencies, with the range divided into a set of equal-sized bins. The set of comparison individuals may be a geographically global sample, a set of cases or controls in a medical context or any other set of interest. A group CAP consists of genetic distance estimates between all pairs of a set of individuals. Conditional on the genotype of the individual, an individual CAP consists of a set of independent distances. Conditional on the genotypes for a group of individuals, a CAP for that group consists of a set of non-independent distances.
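As an illustration of the definitions above, individual and group CAPs can be built as binned relative frequencies from any pairwise distance function. This is a sketch, not the authors' implementation: the function names are hypothetical, and the toy "individuals" are plain numbers with absolute difference standing in for a genetic distance.

```python
from itertools import combinations

def cap_histogram(distances, n_bins=100, lo=0.0, hi=1.0):
    """Binned relative frequencies of distances over equal-sized bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for d in distances:
        # Clamp so that d == hi falls in the last bin rather than out of range.
        idx = min(int((d - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(distances)
    return [c / total for c in counts]

def individual_cap(focal, others, distance, n_bins=100):
    """CAP for one focal individual versus a set of comparison individuals."""
    return cap_histogram([distance(focal, o) for o in others], n_bins)

def group_cap(individuals, distance, n_bins=100):
    """CAP over all unordered pairs of a set of individuals."""
    pairs = combinations(individuals, 2)
    return cap_histogram([distance(a, b) for a, b in pairs], n_bins)

# Toy stand-in: individuals are numbers, distance is absolute difference.
toy_distance = lambda a, b: abs(a - b)
profile = individual_cap(0.0, [0.05, 0.15, 0.15, 0.95], toy_distance, n_bins=10)
print(profile[0], profile[1], profile[9])  # 0.25 0.5 0.25
```

A group CAP for n individuals uses all n(n - 1)/2 pairs, which is why its distances are not independent of one another.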
Models of population history
We considered a range of models to explore the impact of sample size and population history on CAPs. The basic model included two populations of effective size 1,000 that diverged 2,000 generations ago. We investigated the sensitivity of the CAPs and summary statistics to: (a) sample size (n = 25, 50 and 100 individuals per population); (b) time of population divergence (t = 1,000, 2,000 and 5,000 generations in the past); and (c) rate and timing of gene flow following divergence. We investigated both continuous gene flow (migration from the time of divergence onwards at a rate of 0.5 or 2.0 migrants per generation) and recent gene flow. The recent gene flow model represents population divergence 2,000 generations ago followed by isolation for 1,900 generations, and then gene flow at a rate of 2.0 migrants per generation during the past 100 generations. We investigated both symmetrical and asymmetrical gene flow models. Results presented here are for asymmetrical models unless otherwise stated.
Using coalescent-based simulation of the above population histories, we generated genotypes for each of n sampled individuals per population. We assumed a single-step mutation model with a range constraint in order to best approximate evolution at STR, or microsatellite, loci. Five hundred unlinked loci were modelled for each sampled individual. We assumed a mutation rate of 0.0005/generation/locus, on the order of published estimates of effective mutation rate for STR loci. Given an average of 10.8 (± 0.2) alleles for 377 human STR loci, we assumed a range constraint of 15 repeat alleles; that is, stepwise mutation generated novel alleles until a total of 15 alleles had been generated, at which point mutation generated only new copies of existing alleles.
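The mutation model can be illustrated with a forward-in-time toy along a single lineage. This is a sketch under stated assumptions rather than the coalescent simulator used in the study: the range constraint is approximated here by 15 allowed repeat states with reflecting boundaries, and all names are hypothetical. Only the mutation rate and the limit of 15 alleles come from the text.

```python
import random

MU = 0.0005     # mutation rate per locus per generation (from the text)
N_STATES = 15   # range constraint: at most 15 distinct repeat alleles

def mutate(allele, rng):
    """One single-step mutation on allele states 0..N_STATES-1.

    ASSUMPTION: a step that would leave the allowed range is reflected back
    inside, so a mutation at a boundary yields a copy of an existing allele.
    """
    step = rng.choice((-1, 1))
    new = allele + step
    if new < 0 or new >= N_STATES:
        new = allele - step
    return new

def evolve(allele, generations, rng, mu=MU):
    """Apply stepwise mutation along one lineage for a number of generations."""
    for _ in range(generations):
        if rng.random() < mu:
            allele = mutate(allele, rng)
    return allele

rng = random.Random(1)
start = N_STATES // 2
# Two lineages, each evolving for 2,000 generations from a common ancestor.
a = evolve(start, 2000, rng)
b = evolve(start, 2000, rng)
print(a, b)
```

With a rate of 0.0005 per generation, each lineage is expected to experience about one mutation in 2,000 generations, so most loci differ by only a few repeat units over this time scale.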
CAP summary statistics
where d is the number of bins and x_i is the weight in bin i. We also calculated expected heterozygosity for each population sample and for the combined (overall) sample of n0 + n1 individuals for comparison with mean individual genetic distance. Summary statistics were calculated for four sets of CAPs: (a) 'overall', (b) 'within', (c) 'between' and (d) 'cryptic' -- an individual drawn at random from the reference and non-reference populations is compared with other individuals of a random sample. Individual CAPs are presented as binned relative frequencies, with the range divided into 100 equal-sized bins.
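The formula for the statistic defined in terms of d and x_i is not reproduced in this text. The sketch below assumes a raggedness statistic in the style of Harpending's index, namely the sum of squared differences between the weights of adjacent bins, alongside the mean and standard deviation of the distances. Both the form of the statistic and the function names are assumptions made for illustration.

```python
from statistics import mean, stdev

def summarise(distances):
    """Mean and sample standard deviation of a set of genetic distances."""
    return mean(distances), stdev(distances)

def raggedness(weights):
    """Sum of squared differences between adjacent bin weights.

    ASSUMPTION: Harpending-style raggedness, with zero-weight bins padded at
    both ends so that the first and last bins also contribute.
    """
    padded = [0.0] + list(weights) + [0.0]
    return sum((b - a) ** 2 for a, b in zip(padded, padded[1:]))

smooth = [0.1, 0.2, 0.4, 0.2, 0.1]   # unimodal, smooth profile
ragged = [0.4, 0.0, 0.3, 0.0, 0.3]   # same total weight, jagged profile
print(raggedness(smooth) < raggedness(ragged))  # True
```

Under this assumed definition, a profile built from few comparisons tends to be more jagged and therefore scores higher than a smooth profile with the same total weight.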
For models with gene flow, we summarised the distributions in greater detail by calculating the average weight (summarising over individuals) in three particular genetic distance bins. These genetic distance bins correspond to: (1) average genetic distance when reference individuals are compared with other individuals within the reference population; (2) average genetic distance when reference individuals are compared with other individuals from the non-reference population; and (3) the mid-point between these two bins, which corresponds to the average genetic distance when reference individuals from a population are compared with individuals within the reference population with mixed ancestry.
We analysed the CEPH-HGDP multilocus STR genotype data generated by Rosenberg and colleagues in collaboration with the Marshfield Genotyping Service. That dataset includes 377 STR loci tested in 1,056 individuals from 52 human populations. Although many more populations have been typed for a small number of genetic markers, including the classical markers (eg blood groups), mtDNA and the Y chromosome, the Rosenberg STR dataset remains the richest published set in terms of the number of individuals typed for a relatively large number of markers. Each individual in the dataset is associated with a population (identified in a variety of ways in the contexts of a number of separate research projects) and a geographical region. We use those labels when referring to particular individuals by population or region.
We considered four pairs of populations in greater detail in light of the simulation results: two pairs of geographically proximate populations and two pairs of geographically distant populations. In each case, we calculated summary statistics and CAPs for an individual versus: (1) other individuals of his or her local (reference) population; (2) other individuals in the comparison (non-reference) population; and (3) all other individuals in both the local and comparison population. Individual CAPs are presented as binned relative frequencies and are summarised as described above.
CAPs and summary statistics: Basic population structure model
Summaries of individual common ancestry profiles (CAPs) derived from data simulated via two-population models
Impact of sample size
As indicated in Table 3, reducing the sample size from 100 to 25 individuals per population does not significantly change the average or standard deviation of individual CAPs, consistent with the average being linear in the data. Raggedness decreases with sample size for all comparison groups, although within-population CAPs are the most ragged for all sample sizes.
Impact of divergence time
Table 3 indicates the impact of population divergence time on individual CAPs. As expected, the average genetic distance for between-population comparisons increases with earlier population divergence. Earlier divergence therefore leads to greater separation between the within-population and between-population genetic distance peaks of a CAP.
Impact of gene flow
Impact of gene flow on individual common ancestry profiles (CAPs) derived from coalescent simulations, for Nem = 0, Nem = 0.5 and Nem = 2.0
Additionally, as expected, the frequency of high genetic distances decreases with gene flow. Recent migration following a relatively long period of population isolation leads to a slight increase in both average and standard deviation of the individual CAPs. Summary statistics for within-population CAPs (for the reference population that receives immigrants) reveal higher average genetic distance for models with gene flow and higher standard deviation of genetic distance for models with recent migration. Additionally, migration leads to CAPs with multiple peaks for within-population comparisons.
The overall CAP (based on genetic distance) for 1,013 individuals (512,578 pairs) is leptokurtic and slightly positively skewed (Figure 1a, 'global'), with a median of 0.771 (mean of 0.772) and a 5th to 95th percentile range of 0.732-0.816.
Two individual CAPs (Pima 1043 and French 516, each versus all other individuals in the dataset) illustrate the potential for variation across individual CAPs (Figures 1a and 1b). The overall distribution for the French individual (Figure 1b) is approximately normal, reflecting the overlap of the different CAPs shown in Figure 1d. The CAP for the Pima individual, however, is less symmetrical. The first peak from the left in the overall Pima 1043 CAP (Figure 1a) represents the comparison of Pima 1043 with other Pima, the second peak reflects comparison with non-Pima individuals in the Americas and the third small peak represents comparison with individuals outside the Americas (Figure 1c).
Group, or population, CAPs for 52 human populations are summarised in Figure 2. The distributions for the indigenous populations of the Americas and Oceania have the highest variances: pairs of individuals from the samples of populations in those regions have the broadest range of similarity estimates (Figure 2). All population samples from regions outside of the Americas and Oceania have similar levels of between-individual variation in terms of both mean and variance. There is a geographical trend, however, in that the genetic distance estimates for pairs of individuals from Africa are highest, followed by pairs from the Middle East and Europe. Pairs within East Asian populations tend to be slightly more similar to one another than pairs within African, Middle Eastern, European or Central/South Asian populations. Note that these distributions are dependent on the population labelling of individuals. We can compare the mean genetic distance across all pairs (dt = 0.772) to the mean genetic distance between individuals within populations (dp = 0.740) to obtain an estimate of between-population variation; (dt - dp)/dt = 0.041 is an example of a ratio of differences recently discussed at length by Rousset. The estimate is analogous to a standard FST, except that here within-individual variation is not considered.
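The arithmetic behind two of the reported figures can be checked directly: the number of unordered pairs among 1,013 individuals, and the ratio (dt - dp)/dt computed from the reported mean distances. The function names below are hypothetical; the input values are those given in the text.

```python
def n_pairs(n):
    """Number of unordered pairs among n individuals."""
    return n * (n - 1) // 2

def between_population_fraction(d_total, d_within):
    """Ratio of differences (d_total - d_within) / d_total."""
    return (d_total - d_within) / d_total

print(n_pairs(1013))                                     # 512578
print(round(between_population_fraction(0.772, 0.740), 3))  # 0.041
```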
Maximum genetic distance between any pair of individuals drawn from each pair of geographical regions
Summaries of common ancestry profiles (CAPs) for four population pairs
Population pairs and sample sizes: Surui, Karitiana (n1 = 14, n2 = 12); Burusho, Kalash (n1 = 25, n2 = 25); Pima, Mbuti (n1 = 16, n2 = 15); Papuan, Biaka (n1 = 33, n2 = 17).
CAPs are novel, graphical representations of within- and between-group variation from the perspective of the individual. Like population or individual trees and other clustering algorithms, CAPs provide insight into population genetic structure. Through simulation, we have generated expectations for CAPs for two-population models and evaluated the sensitivity of those expectations to sample size, divergence time and gene flow between the two populations. Simulations demonstrated that, for simple population histories, sample size has little influence on summary statistics characterising the distributions. This finding is particularly relevant for studies where population structure is cryptic, so that sample sizes of subpopulations are unknown. Sensitivity to sample size was considered in the context of complete isolation between populations. It is likely that more complex models including gene flow would lead to greater sensitivity to sample size.
The simulation study of the impact of divergence time on CAPs revealed that the average between-population genetic distance differs from the average within-population genetic distance to a greater extent for populations that diverged earlier in time. This finding is consistent with expectations for the change in FST or population genetic distance over time. Ongoing gene flow has a different impact on CAPs than does recent gene flow following isolation: such recent gene flow leads to much broader distributions. Simulations presented above focused on two-population models, including unidirectional gene flow. Overall CAPs (where individuals in a reference population are compared with individuals in both the reference and the non-reference populations) generated given such models consist of three categories of genetic distance estimates. If the focal individual (for whom the CAP is generated) is an individual in the recipient population, these genetic distances correspond to the following comparisons: the focal individual versus individuals of the reference population with little immigrant ancestry; the focal individual versus individuals of mixed ancestry (in the reference population); and, finally, the focal individual versus individuals of the non-reference population. The magnitude and positions of the peaks resulting from these comparisons change as the amount of gene flow increases (Figure 4), suggesting that CAPs are informative regarding the rate of gene flow between populations. The more recent the population divergence, the smaller the difference between the 'within' and 'between' genetic distances and, consequently, the less potential for recognising gene flow. In situations where populations have diverged very recently, using a larger number of markers reduces the variance of a CAP and may, therefore, provide additional insight.
Simulations were designed to explore the impact of sample size and population processes on CAPs generated from STR multilocus genotypes. CAPs generated from SNP multilocus genotypes might differ from the STR-based profiles. Given the high heterozygosity of STR polymorphisms relative to SNPs, CAPs based on STRs are more likely to reach 'saturation' than are those based on SNPs. That is, the divergence between individuals is likely to approach an upper limit that depends on the mutation rate and range constraint, as well as population history. CAPs based on tens of thousands of SNPs may be more informative if recurrent mutation is rare. SNPs, however, are more likely to be subject to an ascertainment bias than are STRs. Given the impact of ascertainment bias on estimates of heterozygosity and the correlation between heterozygosity and individual genetic distance (Table 3), such bias is very likely to influence CAPs.
Simulations presented in this paper assume randomly mating populations; however, there is extensive evidence for non-random mating, consanguinity and complex social structure (including matriarchy and patriarchy) in many human populations [39–41]. Given the potential for such demographic and sociocultural processes to influence individual genetic distances in real populations, models including more complex mating systems deserve further investigation. Other demographic factors, including population growth and population bottlenecks, are also likely to influence the shape of CAPs. Further simulations are required to assess the impact on CAPs of such demographic processes.
CAPs for 1,013 human individuals
CAPs generated from CEPH-HGDP STR multilocus genotypes are consistent with known patterns of human genetic variation. The overall CAP (based on the individual genetic distance measure) for humans is slightly skewed in a positive direction (Figure 1a, 'global'). In light of the simulations, we can conclude that this positive skewness reflects subdivision within the species. If mating is random with respect to genomes, the variance of genetic distance is expected to be low. That is, most pairs of individuals are similarly divergent. Higher levels of substructure correspond to higher CAP variances. The concentration of genetic distance in a relatively narrow range (Figure 1a, 'global') is consistent with a generally low level of human population substructure (low FST); for pairs of individuals separated by more than three generations (ie most pairs), the genetic distance is very close to the overall average. Exceptions are in the lower tail of the distribution that includes pairs of closely related individuals. These exceptions include pairs of individuals in small populations that have undergone substantial random genetic drift, for instance during the peopling of the Americas. Heterozygosity is relatively low in indigenous populations of the Americas, and two 'unrelated' individuals from such a population are far more similar than are two individuals chosen at random from anywhere else in the world. FST estimates, because they reflect an average difference between groups, mask some of the between-population variation. The analyses presented here highlight the variation not captured by summary statistics.
The highest genetic distance value overall is 0.861, for a pair of individuals including one affiliated with the Mbuti population and one affiliated with the Pima population. These individuals are also among the most geographically distant from one another if we measure geographical distance along a migration pathway out of Africa, east through Eurasia and then into the Americas. Population subdivision within Africa has been so high that the two most genetically dissimilar individuals in Africa are more dissimilar than any two individuals outside of Africa, but not so high that those two individuals are the most dissimilar overall. As indicated in Table 5, the region with the greatest divergence between individuals (Africa) is also the region with highest heterozygosity. The pairs of individuals with the largest genetic distances vary depending on the distance metric (results not shown).
Many of the population samples included in the CEPH-HGDP panel were included for anthropological interest. These populations are often small, more isolated than most ethnic/linguistic groups and considered to be the indigenous peoples of a region. They can be considered valuable with regard to understanding human genetic variation, in that they probably represent the extremes in terms of effective size and degree of isolation and, therefore, individual genetic distance.
In some cases, population profiles indicate deviations from simple models of population history. Profiles for several populations of the Americas and Oceania are much broader than those of other regions (Figure 2), possibly reflecting population substructure. As noted above, samples for the CEPH-HGDP cell line panel are distributed with indication that the Karitiana, Surui, Mayan and Pima samples include relative pairs. The data analyses described above do not include known relative pairs; however, reanalysis including sets of closely related individuals led to more highly skewed profiles (results not shown).
CAPs analysis revealed that the CEPH-HGDP sample set includes 13 duplicate samples. Such detection of duplicate samples is best carried out using a distance measure that gives a distance of 0 between two genetically identical (or almost identical, if occasional genotyping errors have occurred) individuals.
The four population pairs considered in detail illustrate the diversity of human CAPs. The CAPs (Figure 5), summarised in Table 6, can be interpreted in light of the simulations. Average genetic distances are consistent with high effective size for both the Biaka and the Mbuti, intermediate effective sizes for the Kalash, Burusho, Pima and the Papua New Guineans and low effective sizes for the Surui and the Karitiana. Figure 5 reveals older divergence for the geographically distant population pairs (given the difference between the average 'within' and 'between' genetic distances) compared with geographically proximate population pairs. For geographically distant population pairs, unimodal, distinct CAPs for the 'within' and 'between' comparisons indicate lack of gene flow. Overlapping 'within' and 'between' CAPs for geographically proximate populations are consistent with more recent divergence of these groups. The Kalash and Burusho, for example, seem to have similar effective sizes and to have diverged relatively recently. The second peak of the 'overall' comparison corresponds to a high genetic distance, indicating the presence of some particularly distant Burusho individuals. The Karitiana 'within' CAP is also bimodal, with one peak having a lower than average genetic distance. This peak could correspond to comparisons between local Karitiana and other local Karitiana (and the other peak corresponds to local Karitiana compared with Karitiana with recent migrant ancestry). Alternatively, the first peak may reflect inbreeding within the Karitiana population. The other two comparisons, Pima versus Mbuti and Papuan versus Biaka, reveal very high levels of variability in the African populations, consistent with previous analyses of these and other data.
The pattern of human population genomic variation is relevant in a number of research and education contexts. As noted above, the pattern reflects -- and therefore may provide insight into -- population history. In medical genetics, knowledge of any genetic substructure of a set of probands may inform research decisions. In forensics, an understanding of patterns of genetic variation is becoming increasingly relevant as institutions attempt to infer racial or ethnic affiliation of individuals using DNA data. In secondary and undergraduate education, the discussion of race and genetics has typically been highly superficial. As publicly available genetic data accrue, a basic understanding of human population genetic variation becomes an increasingly important component of public education.
When the research goal is to take into account cryptic population subdivision, as in case-control studies, genomic controls [43–45] or clustering approaches (eg structure) are appropriate; however, these approaches may not always reveal a small fraction of individuals that stands out from the rest in terms of genetic distance. The structure approach, for instance, is sensitive to sample size. The CAPs approach may more readily reveal anomalies such as duplicate samples and closely related pairs of individuals.
CAPs vary both within and across geographically and socially defined groups. The profiles indicate that some population labels serve as better proxies for genetic similarity than do others. That is, some linguistic or social groups consist of individuals much more genetically similar to one another than to individuals of other groups, while other groups do not. Describing variation on an absolute scale can be valuable, in that it naturally emphasises the continuity of the measurement. The individual-based CAPs approach also emphasises the shared ancestry of all humans: all pairs of individuals fall on a continuum of genetic distance. Finally, although the CAPs approach does not require a priori information regarding an individual's affiliation with one group or another, the approach does allow us to explore hypotheses regarding the correspondence between genetic and non-genetic dimensions of human variation.
CAPs can be considered as genomic versions of pairwise difference distributions for single DNA sequence loci [46, 47]. These genomic profiles enable us to consider both within- and between-group variation simultaneously and to complement traditional summary statistics in revealing differences among individuals in the variances of individual CAPs (eg Figures 1 and 5). Although most information regarding population genetic structure is captured in a sufficiently hierarchical analysis of variance, CAPs reveal, in addition, information at the genealogical level. While CAPs are no replacement for traditional population genetic summary statistics, direct estimation of gene flow (eg BayesAss) or direct inference of degree of relationship (eg relpair), they serve as a valuable exploratory tool and as an independent check of estimates derived using other methods.
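As a rough illustration of how the variance of individual CAPs can differ among individuals, the sketch below summarises each row of a pairwise distance matrix by its mean and variance. The matrix values and the function name `profile_summaries` are hypothetical, and the use of the population variance is an assumption made for the example rather than the summary used in this study.

```python
from statistics import mean, pvariance


def profile_summaries(matrix):
    """Return (mean, variance) of each individual's distances to all others."""
    summaries = []
    for i, row in enumerate(matrix):
        # Drop the diagonal: an individual's distance to itself is not part
        # of its profile.
        profile = [d for j, d in enumerate(row) if j != i]
        summaries.append((mean(profile), pvariance(profile)))
    return summaries


# Hypothetical symmetric distance matrix: individuals 0 and 1 are a closely
# related pair; all other pairs are equally distant.
matrix = [
    [0.00, 0.05, 0.50, 0.50],
    [0.05, 0.00, 0.50, 0.50],
    [0.50, 0.50, 0.00, 0.50],
    [0.50, 0.50, 0.50, 0.00],
]

summaries = profile_summaries(matrix)
for i, (m, v) in enumerate(summaries):
    print(f"individual {i}: mean {m:.3f}, variance {v:.4f}")
```

Here the two related individuals have lower mean distance and higher variance than the others, which is the kind of difference among profiles that a single population-level statistic would average away.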
We thank Alec Knight, Neil Risch, Noah Rosenberg and an anonymous reviewer for helpful discussion and suggestions regarding a previous version of this manuscript. Research was supported in part by NIH grant GM28428 and NSF grant BCS 9905574 to J.L.M.
- Cavalli-Sforza LL, Piazza A, Menozzi P: History and Geography of Human Genes. 1994, Princeton University Press, Princeton, NJ.
- Salas A, Richards M, De la Fe T, et al: The making of the African mtDNA landscape. Am J Hum Genet. 2002, 71: 1082-1111. 10.1086/344348.
- Underhill PA, Passarino G, Lin AA, et al: The phylogeography of Y chromosome binary haplotypes and the origins of modern human populations. Ann Hum Genet. 2001, 65: 43-62. 10.1046/j.1469-1809.2001.6510043.x.
- Rosenberg NA, Pritchard JK, Weber JL, et al: Genetic structure of human populations. Science. 2002, 298: 2381-2385. 10.1126/science.1078311.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
- Venter JC, Adams MD, Myers EW, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
- Li WH, Sadler LA: Low nucleotide diversity in man. Genetics. 1991, 129: 513-523.
- Zhao Z, Jin L, Fu YX, et al: Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proc Natl Acad Sci USA. 2000, 97: 11354-11358.
- Yu N, Zhao Z, Fu YX, et al: Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol Biol Evol. 2001, 18: 214-222. 10.1093/oxfordjournals.molbev.a003795.
- Lewontin R: The apportionment of human diversity. Evolutionary Biology. Edited by: Hecht MK, Steere WS. 1972, Plenum, New York, NY, 6:
- Jorde LB, Watkins WS, Bamshad MJ, et al: The distribution of human genetic diversity: A comparison of mitochondrial, autosomal, and Y-chromosome data. Am J Hum Genet. 2000, 66: 979-988. 10.1086/302825.
- Barbujani G, Magagni A, Minch E, Cavalli-Sforza LL: An apportionment of human DNA diversity. Proc Natl Acad Sci USA. 1997, 94: 4516-4519. 10.1073/pnas.94.9.4516.
- Romualdi C, Balding D, Nasidze IS, et al: Patterns of human diversity, within and among continents, inferred from biallelic DNA polymorphisms. Genome Res. 2002, 12: 602-612. 10.1101/gr.214902.
- Excoffier L, Hamilton G: Comment on genetic structure of human populations. Science. 2003, 300: 1877, author reply 1877-
- Wilder JA, Kingan SB, Mobasher Z, et al: Global patterns of human mitochondrial DNA and Y-chromosome structure are not influenced by higher migration rates of females versus males. Nat Genet. 2004, 36: 1122-1125. 10.1038/ng1428.
- Long JC, Kittles RA: Human genetic diversity and the nonexistence of biological races. Hum Biol. 2003, 75: 449-471. 10.1353/hub.2003.0058.
- Edwards AW: Human genetic diversity: Lewontin's fallacy. Bioessays. 2003, 25: 798-801. 10.1002/bies.10315.
- Cavalli-Sforza LL, Edwards AW: Phylogenetic analysis. Models and estimation procedures. Am J Hum Genet. 1967, 19: 233-257.
- Bowcock AM, Ruiz-Linares A, Tomfohrde J, et al: High resolution of human evolutionary trees with polymorphic microsatellites. Nature. 1994, 368: 455-457. 10.1038/368455a0.
- Mountain JL, Cavalli-Sforza LL: Multilocus genotypes, a tree of individuals, and human evolutionary history. Am J Hum Genet. 1997, 61: 705-718. 10.1086/515510.
- Shriver MD, Kennedy GC, Parra EJ, et al: The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Hum Genomics. 2004, 1: 274-286.
- Rannala B, Mountain JL: Detecting immigration by using multilocus genotypes. Proc Natl Acad Sci USA. 1997, 94: 9197-9201. 10.1073/pnas.94.17.9197.
- Wilson GA, Rannala B: Bayesian inference of recent migration rates using multilocus genotypes. Genetics. 2003, 163: 1177-1191.
- Rosenberg NA: Distruct: A program for the graphical display of population structure. Mol Ecol Notes. 2004, 4: 137-138.
- Queller DC, Goodnight KF: Estimating relatedness using genetic markers. Evolution. 1989, 43: 258-275. 10.2307/2409206.
- Lynch M, Ritland K: Estimation of pairwise relatedness with molecular markers. Genetics. 1999, 152: 1753-1766.
- Li CC, Weeks DE, Chakravarti A: Similarity of DNA fingerprints due to chance and relatedness. Hum Hered. 1993, 43: 45-52. 10.1159/000154113.
- Weeks DE, Young A, Li CC: DNA profile match probabilities in a subdivided population: When can subdivision be ignored?. Proc Natl Acad Sci USA. 1995, 92: 12031-12035. 10.1073/pnas.92.26.12031.
- Milligan BG: Maximum-likelihood estimation of relatedness. Genetics. 2003, 163: 1153-1167.
- Cann HM, de Toma C, Cazes L, et al: A human genome diversity cell line panel. Science. 2002, 296: 261-262.
- Epstein MP, Duren WL, Boehnke M: Improved inference of relationship for pairs of individuals. Am J Hum Genet. 2000, 67: 1219-1231.
- Rousset F: Inbreeding and relatedness coefficients: What do they measure?. Heredity. 2002, 88: 371-380. 10.1038/sj.hdy.6800065.
- Johnson RA, Wichern DW: Applied Multivariate Statistical Analysis. 1988, Prentice-Hall, Englewood Cliffs, NJ.
- Excoffier L, Novembre J, Schneider S: SIMCOAL: A general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J Hered. 2000, 91: 506-509. 10.1093/jhered/91.6.506.
- Zhivotovsky LA, Rosenberg NA, Feldman MW: Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am J Hum Genet. 2003, 72: 1171-1186. 10.1086/375120.
- Ramakrishnan U, Mountain JL: Precision and accuracy of divergence time estimates from STR and SNPSTR variation. Mol Biol Evol. 2004, 21: 1960-1971. 10.1093/molbev/msh212.
- Harpending HC: Signature of ancient population growth in a low-resolution mitochondrial DNA mismatch distribution. Hum Biol. 1994, 66: 591-600.
- Mountain JL, Cavalli-Sforza LL: Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms. Proc Natl Acad Sci USA. 1994, 91: 6515-6519. 10.1073/pnas.91.14.6515.
- Storz JF, Ramakrishnan U, Alberts SC: Determinants of effective population size for loci with different modes of inheritance. J Hered. 2001, 92: 497-502. 10.1093/jhered/92.6.497.
- Hussain R, Bittles AH: The prevalence and demographic characteristics of consanguineous marriages in Pakistan. J Biosoc Sci. 1998, 30: 261-275. 10.1017/S0021932098002612.
- Bittles AH: Endogamy, consanguinity and community genetics. J Genet. 2002, 81: 91-98. 10.1007/BF02715905.
- Cho MK, Sankar P: Forensic genetics and ethical, legal and social implications beyond the clinic. Nat Genet. 2004, 36: S8-12. 10.1038/ng1594.
- Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 997-1004. 10.1111/j.0006-341X.1999.00997.x.
- Pritchard JK, Rosenberg NA: Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999, 65: 220-228. 10.1086/302449.
- Reich DE, Goldstein DB: Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol. 2001, 20: 4-16. 10.1002/1098-2272(200101)20:1<4::AID-GEPI2>3.0.CO;2-T.
- Slatkin M, Hudson RR: Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991, 129: 555-562.
- Rogers AR, Harpending H: Population growth makes waves in the distribution of pairwise genetic differences. Mol Biol Evol. 1992, 9: 552-569.