Impact of human population history on distributions of individual-level genetic distance

Summaries of human genomic variation shed light on human evolution and provide a framework for biomedical research. Variation is often summarised in terms of one or a few statistics (eg FST and gene diversity). Now that multilocus genotypes for hundreds of autosomal loci are available for thousands of individuals, new approaches are applicable. Recently, trees of individuals and other clustering approaches have demonstrated the power of an individual-focused analysis. We propose analysing the distributions of genetic distances between individuals. Each distribution, or common ancestry profile (CAP), is unique to an individual, and does not require a priori assignment of individuals to populations. Here, we consider a range of models of population history and, using coalescent simulation, reveal the potential insights gained from a set of CAPs. Information lies in the shapes of individual profiles -- sometimes captured by variance of individual CAPs -- and the variation across profiles. Analysis of short tandem repeat genotype data for over 1,000 individuals from 52 populations is consistent with dramatic differences in population histories across human groups.


Introduction
The collective human gene pool, consisting of the genomes of all living people, has much to reveal regarding human population history. Until recently, surveys of human genetic variation have been sparse, in that hundreds or thousands of individuals have been studied for a small number of genetic regions (eg blood groups, human leukocyte antigens (HLA), mitochondrial DNA, Y chromosome 1-3 ) and a few individuals have been studied for a large fraction of the genome (eg through the Human Genome Project). In the past few years, however, larger sets of individuals have been studied for hundreds of genetic regions 4 and, concomitantly, new data analysis tools have been developed. 5 With new data and new tools, we are rapidly gaining a more precise understanding of how genetically similar individuals are, and of how that similarity corresponds to other dimensions of human variation.

Summaries of human genetic variation
Most differences between genomes take the form of single nucleotide polymorphisms (SNPs) rather than DNA insertions, deletions or multiplications. 6 For the autosomes, two DNA sequences chosen at random appear to differ at an average of about one in every 1,000-1,500 nucleotide sites. [7][8][9] This level of diversity corresponds to between 2 and 3.2 million nucleotide differences between individual genomes and is about one order of magnitude lower than the diversity detected within Drosophila (fruit fly) populations. 7 Numerous studies have indicated that the number of differences between human genomes varies greatly depending on the pair of genomes considered. The most striking and consistent pattern is the higher level of genetic diversity in Africa than in other regions and the relatively low levels of diversity in the Americas. Zhao and colleagues, in examining a 10 kilobase (kb) non-coding region, found an average of 8.5 differences between African samples and an average of 8.2 differences between non-African samples. 8 Yu and colleagues found a somewhat lower level of nucleotide diversity (π) of 0.076 per cent among Africans and 0.047 per cent among non-Africans. 9 As indicated in the summary of short tandem repeat (STR) data by Rosenberg et al., diversity within African groups (average heterozygosity = 0.774) tends to be slightly higher than diversity within Middle Eastern (0.756), European (0.751) and Central and South Asian (0.752) populations. 4
Human genetic variation is often summarised in terms of hierarchical population genetic structure. In 1972, Lewontin estimated, using blood group and protein polymorphism data, that about 6.3 per cent of genetic variation was explained by differences among seven groups that he termed 'races'. 10 Differences between members of the same population accounted for 85.4 per cent of the total genetic variation. The remaining 8.3 per cent was accounted for by the variation between populations, within each of the seven 'races'. 10 In recent years, geneticists have replicated Lewontin's finding using independent regions of the genome: most estimates of $F_{ST}$ (between-group variation) have ranged from 0.05 to 0.15. 4,11-14 These estimates indicate that two individuals affiliated with different racially or ethnically identified groups are only slightly more likely to differ at a given neutrally evolving locus than are two individuals affiliated with the same group. A large proportion of human genetic variation is found within racially, ethnically or linguistically identified groups. Notable exceptions, reflecting smaller effective population sizes, include the mitochondrial genome and Y chromosome SNPs, with recent estimates of between-group variation ranging from 0.3 to 0.4. 11,15
Although human genetic variation has often been summarised using single statistics such as $F_{ST}$, such single statistics are an inadequate and potentially misleading summary of our species' diversity. 16,17 $F_{ST}$ is most straightforwardly interpreted if the underlying population history is of a single population that instantaneously divides into a number of equally sized, panmictic subpopulations, each of which remains at the same size throughout the subsequent time. Human history is far from fitting such a model. Genetic distances, often represented in the form of population trees, 18 provide a more detailed representation of structure. 1 Recently, Long and Kittles used a sequential model-fitting approach to infer structure, generating a tree relating a set of human populations. 16 The latter study highlights the hierarchical and uneven structure of human genetic variation (see their Figure 2D).

A focus on the individual
Although a combination of heterozygosity and genetic distance estimates for a set of populations may provide a fairly accurate summary of genetic variation, these statistics describe variation within the data most completely when population histories are relatively simple. One way to summarise population genetic structure in greater detail is to focus on individuals rather than on populations. For some species, summarisation at the individual level may reveal substructure that is hidden by population-level summaries. In addition, the focus on the individual takes away the emphasis on the group labels. This change of emphasis can be particularly important when individuals have multiple group affinities. Finally, individual-based approaches have the potential to provide information about within- and between-group variations simultaneously.
Several research groups have used trees of individuals to summarise genetic variation. 19-21 Such trees provide much greater detail than do population trees, but are limited to a relatively small number of individuals and are not readily summarised. A number of algorithms, including those implemented via the immanc, 22 BayesAss 23 and structure 5 software, allow one to assign, with a given probability, an individual (or portion of its genome) to a particular population. The BayesAss approach is valuable in estimating migration rates, inbreeding coefficients and recent immigrant ancestry simultaneously, but does require a priori assignment of individuals to populations. The structure algorithm identifies clusters of genetically similar individuals without a priori population assignment. The approach provides information regarding within-group variation, to the extent that individuals are inferred to be members of multiple clusters. The combination of structure analysis and distruct, 24 a program that generates a graphical representation of population structure, provides a valuable exploratory tool. The many available approaches to estimation of relatedness between individuals also focus directly on individuals, but are most successfully applied in the context of relatively large, random mating populations. [25][26][27][28][29]

Common ancestry profiles
We introduce an exploratory, individual-focused approach that complements population-level analyses, trees of individuals, relatedness estimation and assignment/clustering algorithms. Distributions of genetic distances between individuals, here termed common ancestry profiles (CAPs), emphasise the shared ancestry of all members of a species and provide a detailed description of genetic variation without the need for a priori assignment of individuals to populations. Like distruct, the approach provides a visual representation of genetic variation, thereby constituting an exploratory data analysis tool. The profiles enable us to visualise how genetically similar an individual is to others in the context of linguistic, social or geographic variation. In addition, the approach brings together genealogical and population-level perspectives.
The total set of genetic distances among individuals can be partitioned in a manner analogous to a partitioning of variance: individual heterozygosity (the fraction of an individual's loci that is heterozygous) represents within-individual variation; comparisons among individuals of a population represent within-population, between-individual variation; comparisons of individuals of different populations of a region represent within-region, between-population variation; and comparisons of individuals of different regions represent between-region variation. The CAPs can, therefore, provide a graphical display of an often misinterpreted breakdown of total genetic variation into its components.
As illustrated in Figure 1, in practice CAPs can vary quite dramatically across individuals. The overall profile for the individual affiliated with the Pima population (Figure 1a) is more skewed than that of the French individual (Figure 1b). Going one step further, by combining individual profiles across all individuals of each population, we see variation across 52 previously described populations (Figure 2). 4,30 Several of the population samples from the Americas, as well as the Melanesian sample, reveal relatively broad distributions, even though individuals known to be closely related to others in the sample set have been removed. The Colombian and Melanesian profiles, for instance, reveal a number of pairs of genetically very similar individuals. Several of these pairs represent first- or second-degree relatives according to relpair 31 analysis.
A CAP can be informative by indicating possible duplicated samples, by indicating closely related individuals within a dataset or as a graphical display of the partitioning of variance. The power of the approach is greatest, however, in cases where we have expectations for the shape of a profile. For a random mating population with all sampled individuals distantly related, a Central Limit Theorem argument leads to the expectation of a normal distribution for an individual CAP. Expectations under more realistic models of population history, however, are essential if we are to accurately interpret a set of profiles.
In order to facilitate interpretation of CAPs, we have considered a set of simple models of population history, simulating genetic variation in the context of those models using a coalescent approach. Through simulation, we have explored the impact of population isolation and of gene flow on CAPs, and the potential for inferring recent or long-term gene flow from a set of CAPs. In light of these simulations, we have evaluated a set of individual and group CAPs derived from STR variation at 377 loci.

Methods
CAPs
A CAP consists of the distribution of genetic distance for a set of pairs of individuals. Although any measure of dissimilarity between individuals might be applied, we focus on $d_{xy}$, the probability of non-identity in state for two alleles, one from individual $x$ and one from individual $y$, chosen at random from a particular genomic location. Here, the probability of identity in state is simply the probability that two alleles are of identical type. 32 $d_{xy}$ is equivalent to $1 - s_{xy}$, where $s_{xy}$ is the probability of identity in state of two alleles, one from individual $x$ and one from individual $y$. Under simple models of population history, $s_{xy} = rp + (1 - r)p^2$, where $p$ is the allele frequency in a base population and $r$ is a relatedness measure or 'probability of identity-by-descent'. 32 In many cases, however, population history is more complex or we are interested in identity-by-descent of individuals in the base population. We therefore proceed without consideration of a base population and estimate $s_{xy}$ (and hence $d_{xy}$) directly from multilocus genotypes. $d_{xy}$ is estimated as:

$$\hat{d}_{xy} = 1 - \hat{s}_{xy}$$

where $\hat{s}_{xy}$ is calculated as follows for each locus:

$$\hat{s}_{xy} = \begin{cases} 1 & \text{for } (x = ii,\ y = ii) \\ 0.5 & \text{for } (x = ij \text{ and } y = ij) \text{ or } (x = ii \text{ and } y = ij) \text{ or vice versa} \\ 0.25 & \text{for } (x = ij \text{ and } y = ik) \\ 0 & \text{for } (x = ii \text{ and } y = jj) \\ 0 & \text{for } (x = ij \text{ and } y = kl) \end{cases}$$

where $i$, $j$, $k$, $l$ represent distinct alleles. The similarity estimate meets the criteria for transformation to a distance, in that it is everywhere non-negative definite. 33 Estimates are averaged across loci to generate an approximation of distance for a pair of individuals. The distance measure ranges in value from 0 to 1.0, with 0 indicating the comparison of two individuals identical and homozygous at all loci and 1.0 indicating no overlap of alleles. $d_{xx}$ (the distance between an individual and him- or herself) for individuals heterozygous at all loci is 0.5.
More generally, the distance between an individual and him- or herself (or a monozygotic twin) is $d_{xy} = 0.5h$, where $h$ is the fraction of loci heterozygous in that individual.
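The per-locus rule above amounts to comparing every allele of one individual with every allele of the other and counting matches. A minimal Python sketch of that reading follows. The function names, the representation of a genotype as a pair of allele labels and the skipping of loci with missing data (`None`) are assumptions made for illustration and are not taken from the original analysis.

```python
from itertools import product


def locus_similarity(gx, gy):
    """Probability that one allele drawn at random from genotype gx and one
    drawn at random from genotype gy are identical in state."""
    matches = sum(a == b for a, b in product(gx, gy))
    return matches / (len(gx) * len(gy))


def genetic_distance(x, y):
    """Average of (1 - similarity) over loci typed in both individuals.

    x and y are equal-length lists of genotypes, each a pair of allele
    labels; None marks a missing genotype (an assumption of this sketch).
    """
    per_locus = [
        1.0 - locus_similarity(gx, gy)
        for gx, gy in zip(x, y)
        if gx is not None and gy is not None
    ]
    if not per_locus:
        raise ValueError("no locus is typed in both individuals")
    return sum(per_locus) / len(per_locus)


# Self-comparison: half of the four loci are heterozygous, so h = 0.5
individual = [(1, 2), (3, 3), (4, 5), (6, 6)]
print(genetic_distance(individual, individual))  # 0.25, i.e. 0.5 * h
```

Counting matches over the four allele pairs reproduces each case listed in the text (1, 0.5, 0.25 and 0) without enumerating genotype classes explicitly.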

Models of population history
We considered a range of models to explore the impact of sample size and population history on CAPs. The basic model included two populations of effective size 1,000 that diverged 2,000 generations ago. We investigated the sensitivity of the CAPs and summary statistics to: (a) sample size (n = 25, 50 and 100 individuals per population); (b) time of population divergence (t = 1,000, 2,000 and 5,000 generations in the past); and (c) rate and timing of gene flow following divergence. We investigated both continuous gene flow (continuous gene flow following divergence at the rate of 0.5 or 2.0 migrants per generation) and recent gene flow. The recent gene flow model represents population divergence 2,000 generations ago followed by isolation for 1,900 generations, and then gene flow at a rate of 2.0 migrants per generation during the past 100 generations. We investigated both symmetrical and asymmetrical gene flow models. Results presented here are for asymmetrical models unless otherwise stated.

Coalescent simulations
Using coalescent-based simulation 34 of the above population histories, we generated genotypes for each of n sampled individuals per population. We assumed a single-step mutation model with a range constraint in order to best approximate evolution at STR, or microsatellite, loci. Five hundred unlinked loci were modelled for each sampled individual. We assumed a mutation rate of 0.0005/generation/locus, on the order of published estimates of effective mutation rate for STR loci. 35 Given an average of 10.8 (0.2) alleles for 377 human STR loci, 36 we assumed a range constraint of 15 repeat alleles; that is, stepwise mutation generated novel alleles until a total of 15 alleles had been generated, at which point mutation generated only new copies of existing alleles.
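The mutation component of such a simulation can be illustrated with a short Python sketch. This is not the simulator used in the study: it only shows a single-step mutation applied along one lineage, and it enforces the range constraint by reflecting steps at the boundaries of a fixed set of 15 allele states, which is one simple reading of the constraint described above. The names `N_ALLELES`, `MU`, `mutate` and `evolve` are hypothetical.

```python
import random

N_ALLELES = 15  # range constraint stated in the text
MU = 0.0005     # mutations per locus per generation, as stated in the text


def mutate(allele, rng):
    """Apply one single-step mutation to an allele indexed 0..N_ALLELES-1.

    A step that would leave the allowed range is reflected back into it
    (an assumption of this sketch).
    """
    step = rng.choice((-1, 1))
    new = allele + step
    if new not in range(N_ALLELES):
        new = allele - step
    return new


def evolve(allele, generations, rng):
    """Carry an allele forward along one lineage, mutating with probability MU
    in each generation."""
    for _ in range(generations):
        if rng.random() < MU:
            allele = mutate(allele, rng)
    return allele


rng = random.Random(1)
descendant = evolve(7, 2000, rng)
print(descendant in range(N_ALLELES))  # True
```

With these values, a lineage followed for 2,000 generations is expected to carry 2,000 x 0.0005 = 1 mutation per locus, which gives a feel for how much divergence the basic model can generate.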

CAP summary statistics
CAPs of simulated and empirical data were analysed similarly for pairs of populations. An individual 'overall' CAP consists of the distribution of genetic distances between a focal individual in the reference population and all other individuals of the two populations. An individual 'within' CAP consists of the distribution of distances between the focal individual and the $n_0 - 1$ other individuals of the reference population. An individual 'between' CAP consists of comparisons between the focal individual and the $n_1$ individuals of the non-reference population. CAPs were generated for each sampled individual of a reference population. The following summary statistics were calculated using all such CAPs of the reference population sample: average and standard deviation (across individuals) of the average $d_{xy}$ between a focal individual and others, and average and standard deviation (across individuals) of the standard deviation of $d_{xy}$ for each individual. The average $d_{xy}$ captures the central tendency of the distributions, while its standard deviation indicates variation across individuals in that tendency. The summaries of the standard deviations of individual profiles capture the spread of individual profiles and the variation across individuals in that spread. We calculated a raggedness statistic, r, following Harpending, 37 for each profile, where d is the number of bins and $x_i$ is the weight in bin i. We also calculated expected heterozygosity for each population sample and for the combined (overall) sample of $n_0 + n_1$ individuals for comparison with mean individual genetic distance. Summary statistics were calculated for four sets of CAPs: (a) 'overall', (b) 'within', (c) 'between' and (d) 'cryptic', in which an individual drawn at random from the reference and non-reference populations is compared with other individuals of a random sample. Individual CAPs are presented as binned relative frequencies, with the range divided into 100 equal-sized bins.
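The summaries described above can be sketched in Python as follows. Several details are assumptions rather than statements from the text: the raggedness function uses the sum of squared differences between weights in adjacent bins (the usual form of Harpending's statistic) with a zero-weight bin added at each end, standard deviations are population (not sample) standard deviations, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def binned_cap(distances, n_bins=100, lo=0.0, hi=1.0):
    """Relative frequency of distances in n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for d in distances:
        # A distance equal to hi is placed in the last bin
        i = min(int((d - lo) / width), n_bins - 1)
        counts[i] += 1
    return [c / len(distances) for c in counts]


def raggedness(weights):
    """Sum of squared differences between adjacent bin weights, with an empty
    bin assumed on either side of the profile."""
    padded = [0.0] + list(weights) + [0.0]
    return sum((b - a) ** 2 for a, b in zip(padded, padded[1:]))


def summarise(caps):
    """caps maps each focal individual to its list of distances to others.

    Returns the average and standard deviation (across individuals) of the
    per-individual mean, and of the per-individual standard deviation.
    """
    means = [mean(v) for v in caps.values()]
    sds = [pstdev(v) for v in caps.values()]
    return {
        "mean_of_means": mean(means),
        "sd_of_means": pstdev(means),
        "mean_of_sds": mean(sds),
        "sd_of_sds": pstdev(sds),
    }


# A spiky profile is more ragged than a flat one with the same total weight
print(raggedness([0.5, 0.0, 0.5, 0.0]) > raggedness([0.25, 0.25, 0.25, 0.25]))  # True
```

Because raggedness responds to changes between neighbouring bins only, a smooth bimodal profile can score lower than a noisy unimodal one.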
For models with gene flow, we summarised the distributions in greater detail by calculating the average weight (summarising over individuals) in three particular genetic distance bins. These genetic distance bins correspond to: (1) average genetic distance when reference individuals are compared with other individuals within the reference population; (2) average genetic distance when reference individuals are compared with other individuals from the non-reference population; and (3) the mid-point between these two bins, which corresponds to the average genetic distance when reference individuals from a population are compared with individuals within the reference population with mixed ancestry.
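A minimal Python sketch of this three-bin summary is given below. It assumes that each location is represented by the single bin containing the relevant value, whereas the original analysis may have used a wider window of bins around each location. The function names and the example profiles are hypothetical.

```python
def bin_index(value, n_bins=100, lo=0.0, hi=1.0):
    """Index of the equal-width bin on [lo, hi] that contains value."""
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def three_bin_weights(binned_caps, mean_within, mean_between, n_bins=100):
    """Average weight, over individuals, in the bins located at the mean
    within-population distance, at the mid-point, and at the mean
    between-population distance (returned in that order)."""
    targets = (mean_within, (mean_within + mean_between) / 2.0, mean_between)
    n = len(binned_caps)
    return tuple(
        sum(cap[bin_index(t, n_bins)] for cap in binned_caps) / n
        for t in targets
    )


# Two illustrative ten-bin profiles (made-up weights)
caps = [
    [0, 0, 0, 0, 0.6, 0.1, 0.3, 0, 0, 0],
    [0, 0, 0, 0, 0.4, 0.3, 0.3, 0, 0, 0],
]
low, mid, high = three_bin_weights(caps, 0.45, 0.65, n_bins=10)
print(round(low, 3), round(mid, 3), round(high, 3))  # 0.5 0.2 0.3
```

Weight in the middle bin is the quantity of interest here, since it is expected to grow as more individuals of mixed ancestry appear in the reference population.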

Data analysis
We analysed the CEPH-HGDP 30 multilocus STR genotype data generated by Rosenberg and colleagues in collaboration with the Marshfield Genotyping Service. 4 That dataset includes 377 STR loci tested in 1,056 individuals from 52 human populations. Although many more populations have been typed for a small number of genetic markers, including the classical markers (eg blood groups), mtDNA and the Y chromosome, the Rosenberg STR dataset remains the richest published set in terms of the number of individuals typed for a relatively large number of markers. Each individual in the dataset is associated with a population (identified in a variety of ways in the contexts of a number of separate research projects) and a geographical region. 4 We use those labels when referring to particular individuals by population or region.
We eliminated data for 13 individuals representing duplicate samples (see Table 1). Thirty individuals from four populations from the Americas are known to be closely related to other individuals in the sample (see Table 2). We carried out analyses both with and without these related individuals. We report results for the reduced dataset unless otherwise stated. Individual CAPs are presented as binned relative frequencies and are summarised as described above.

Simulations
CAPs and summary statistics: Basic population structure model. We illustrate the simulation results with CAPs for ten individuals generated under the basic population model (Figure 3). These CAPs represent three categories of comparison: 'overall' (an individual from a reference population is compared with others in the reference and non-reference populations); 'within' (an individual from a reference population is compared with others in the reference population); and 'between' (an individual from the reference population is compared with individuals from the non-reference population). The overall CAPs have two peaks (Figure 3a). Figures 3b and 3c reveal the components underlying those two peaks: lower genetic distance for within-population comparisons (Figure 3c) and a higher genetic distance for between-population comparisons (Figure 3b). Table 3 reveals that average genetic distance across individuals is highest for the between-group CAPs (0.537), intermediate for the overall CAPs (0.501) and lowest for within-population CAPs (0.469). The overall CAPs have the highest standard deviations (0.036), indicating higher within-CAP variance than for the within-population and between-population CAPs (0.01 and 0.015, respectively). Heterozygosity estimates were highest for the overall category. Average raggedness, which increases with rapid changes in bin frequencies, was highest for the within-population comparisons, despite the smoothness of those distributions (Figure 3c). The raggedness statistic fails to capture the multimodal nature of the overall CAPs.
Table 1. List of pairs of CEPH-HGDP samples 30
Impact of sample size. As indicated in Table 3, reducing the sample size from 100 to 25 individuals per population does not significantly change the average or standard deviation of individual CAPs, consistent with the average being linear in the data. Raggedness decreases with sample size for all comparison groups, although within-population CAPs are the most ragged for all sample sizes.
Impact of divergence time. Table 3 indicates the impact of population divergence time on individual CAPs. As expected, the average genetic distance for between-population comparisons increases with earlier population divergence. Earlier divergence therefore leads to greater separation between the within-population and between-population genetic distance peaks of a CAP.
Impact of gene flow. A summary of CAPs for populations with asymmetric gene flow is presented in Table 4. We illustrate the simulation results with example CAPs for sets of ten individuals (Figure 4). These CAPs are 'cryptic', in that any simulated population structure is ignored and a subsample of individuals is drawn without consideration of population affiliation. Overall, the average genetic distance increases with increasing gene flow, as does raggedness. The standard deviation of individual genetic distance distributions decreases, as does the sample heterozygosity, with increasing gene flow. These results reflect the appearance of weight in the central region of the CAP distribution (Figure 4b), between the average genetic distance for the within-population comparisons and the average genetic distance for the between-population comparisons. Intermediate peaks correspond to comparisons between focal individuals in the reference population (which receives gene flow) and migrant individuals (individuals with a high proportion of immigrant ancestry) of the reference population. The weight in this intermediate portion of the distribution (averaged over 50 randomly selected CAPs) increases with gene flow (Table 4).
Table 2. List of individuals removed from analysis because of known close relationship (within two degrees) to another individual included in CEPH-HGDP short tandem repeat dataset. 4,30 Additional pairs of individuals indicated as possible close relatives by common ancestry profile analysis were not removed. Population identification codes (IDs) drawn from Noah Rosenberg (http://www.cmb.usc.edu/people/noahr//diversitycodes.txt).

Review PRIMARY RESEARCH
Additionally, as expected, the frequency of high genetic distances decreases with gene flow. Recent migration following a relatively long period of population isolation leads to a slight increase in both average and standard deviation of the individual CAPs. Summary statistics for within-population CAPs (for the reference population that receives immigrants) reveal higher average genetic distance for models with gene flow and higher standard deviation of genetic distance for models with recent migration. Additionally, migration leads to CAPs with multiple peaks for within-population comparisons.

CAPs derived from coalescent simulations of two populations of effective size 1,000 that diverged 2,000 generations ago, generating 500 short tandem repeat loci (mutation rate: 0.0005/locus/generation; range constraint: 15, stepwise mutation model). Harpending. 37 f Average weight of distribution (across individuals) in each of three sets of bins corresponding to peak at lower genetic distance (1), peak at higher genetic distance (3) and midpoint between these two peaks (2). See text for further details.

Data analysis
The overall CAP (based on genetic distance, $d_{xy}$) for 1,013 individuals (512,578 pairs) is leptokurtic and slightly positively skewed (Figure 1a, 'global'), with a median of 0.771 (mean of 0.772) and 5th to 95th percentile range of 0.732-0.816. Two individual CAPs (Pima 1043 and French 516, each versus all other individuals in the dataset) illustrate the potential for variation across individual CAPs (Figures 1a and 1b). The overall distribution for the French individual (Figure 1b) is approximately normal, reflecting the overlap of the different CAPs shown in Figure 1d. The CAP for the Pima individual, however, is less symmetrical. The first peak from the left in the overall Pima 1043 CAP (Figure 1a) represents the comparison of Pima 1043 with other Pima, the second peak reflects comparison with non-Pima individuals in the Americas and the third small peak represents comparison with individuals outside the Americas (Figure 1c).
Group, or population, CAPs for 52 human populations are summarised in Figure 2. The distributions for the indigenous populations of the Americas and Oceania have the highest variances: pairs of individuals from the samples of populations in those regions have the broadest range of similarity estimates (Figure 2). All population samples from regions outside of the Americas and Oceania have similar levels of between-individual variation in terms of both mean and variance. There is a geographical trend, however, in that the genetic distance estimates for pairs of individuals from Africa are highest, followed by pairs from the Middle East and Europe. Pairs within East Asian populations tend to be slightly more similar to one another than pairs within African, Middle Eastern, European or Central/South Asian populations. Note that these distributions are dependent on the population labelling of individuals. We can compare the mean genetic distance across all pairs ($d_t = 0.772$) to the mean genetic distance between individuals within populations ($d_p = 0.740$) to obtain an estimate of between-population variation; $(d_t - d_p)/d_t = 0.041$ is an example of a ratio of differences recently discussed at length by Rousset. 32 The estimate is analogous to a standard $F_{ST}$, except that here within-individual variation is not considered.
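The ratio quoted above can be checked directly from the two reported means. The short Python sketch below does only that arithmetic; the function name is hypothetical.

```python
def between_population_fraction(d_total, d_within):
    """Ratio of differences: (d_t - d_p) / d_t."""
    return (d_total - d_within) / d_total


# Means reported in the text: d_t = 0.772 (all pairs), d_p = 0.740 (within populations)
print(round(between_population_fraction(0.772, 0.740), 3))  # 0.041
```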
Table 5 reports the largest genetic distance for any two individuals from each pair of geographical regions. The two most genetically dissimilar individuals in the dataset ($d_{xy} = 0.861$) are an individual from Africa (Mbuti) and one from the Americas (Pima). The two most different individuals in Africa (a Yoruba/Mbuti comparison with $d_{xy} = 0.846$) are more different than any two individuals outside of Africa (a Han/Druze comparison with $d_{xy} = 0.825$), consistent with our understanding of the high level of genetic diversity and population substructure within Africa. Mean genetic distance can be directly compared with degree of relationship in a small number of cases. CAPs of individuals in 19 populations were consistent with a relationship of degree 1 (siblings or parent-offspring pairs). Genetic distance ($d_{xy}$) varied dramatically across these putative first-degree relative pairs (0.630-0.411). In fact, the two most dissimilar Surui individuals ($d_{xy} = 0.419$) in the sample were estimated to be more similar than two putative first-degree relative pairs in African populations (one pair of Mbuti individuals and one pair of San individuals).
CAPs (Figure 5) and summary statistics (Table 6) vary across the Surui/Karitiana, Burusho/Kalash, Pima/Mbuti and Papuan/Biaka comparisons. Average within-population genetic distances are highest for the Biaka and Mbuti, intermediate for the Kalash, Burusho, Pima and Papuan and lowest for the Surui and Karitiana. The overall CAPs are bimodal for the Surui/Karitiana comparison and the Pima/Mbuti comparison. By contrast, the Burusho/Kalash comparison is unimodal, except for a small peak representing comparisons between more distant individuals. The Papuan/Biaka comparison is intermediate with two overlapping peaks.

Discussion
CAPs are novel, graphical representations of within- and between-group variation from the perspective of the individual. Like population or individual trees and other clustering algorithms, CAPs provide insight into population genetic structure. Through simulation, we have generated expectations for CAPs for two-population models and evaluated the sensitivity of those expectations to sample size, divergence time and gene flow between the two populations. Simulations demonstrated that, for simple population histories, sample size has little influence on summary statistics characterising the distributions. This finding is particularly relevant for studies where population structure is cryptic, so that sample sizes of subpopulations are unknown. Sensitivity to sample size was considered in the context of complete isolation between populations. It is likely that more complex models including gene flow would lead to greater sensitivity to sample size.
The simulation study of the impact of divergence time on CAPs revealed that the average between-population genetic distance differs from the average within-population genetic distance to a greater extent for populations that diverged earlier in time. This finding is consistent with expectations for the change in $F_{ST}$ or population genetic distance over time.
Ongoing gene flow has a different impact on CAPs than does recent gene flow following isolation: such recent gene flow leads to much broader distributions. Simulations presented above focused on two-population models, including unidirectional gene flow. Overall CAPs (where individuals in a reference population are compared with individuals in both the reference and the non-reference populations) generated given such models consist of three categories of genetic distance estimates. If the focal individual (for whom the CAP is generated) is an individual in the recipient population, these genetic distances correspond to the following comparisons: the focal individual versus individuals of the reference population with little immigrant ancestry; the focal individual versus individuals of mixed ancestry (in the reference population); and, finally, the focal individual versus individuals of the non-reference population. The magnitude and positions of the peaks resulting from these comparisons change as the amount of gene flow increases (Figure 4), suggesting that CAPs are informative regarding the rate of gene flow between populations. The more recent the population divergence, the lower the difference between the 'within' and 'between' genetic distances and, consequently, the less potential for recognising gene flow. In situations where populations have diverged very recently, using a larger number of markers reduces the variance of a CAP and may, therefore, provide additional insight.
Simulations were designed to explore the impact of sample size and population processes on CAPs generated from STR multilocus genotypes. CAPs generated from SNP multilocus genotypes might differ from the STR-based profiles. Given the high heterozygosity of STR polymorphisms relative to SNPs, CAPs based on STRs are more likely to reach 'saturation' than are those based on SNPs. That is, the divergence between individuals is likely to approach an upper limit that depends on the mutation rate and range constraint, as well as population history. CAPs based on tens of thousands of SNPs may be more informative if recurrent mutation is rare. SNPs, however, are more likely to be subject to an ascertainment bias than are STRs. Given the impact of ascertainment bias on estimates of heterozygosity 38 and the correlation between heterozygosity and individual genetic distance (Table 3), such bias is very likely to influence CAPs. Simulations presented in this paper assume randomly mating populations; however, there is extensive evidence for non-random mating, consanguinity and complex social structure (including matriarchy and patriarchy) in many human populations. 39-41 Given the potential for such demographic and sociocultural processes to influence individual genetic distances in real populations, models including more complex mating systems deserve further investigation. Other demographic factors, including population growth and population bottlenecks, are also likely to influence the shape of CAPs. Further simulations are required to assess the impact on CAPs of such demographic processes.

Table 5. Maximum genetic distance ($d_{xy}$) between any pair of individuals drawn from each pair of geographical regions.

CAPs for 1,013 human individuals
CAPs generated from CEPH-HGDP STR multilocus genotypes are consistent with known patterns of human genetic variation. 16 The overall CAP (based on the individual genetic distance measure, d_xy) for humans is slightly skewed in a positive direction (Figure 1a, 'global'). In light of the simulations, we can conclude that this positive skewness reflects subdivision within the species. If mating is random with respect to genomes, the variance of d is expected to be low. That is, most pairs of individuals are similarly divergent. Higher levels of substructure correspond to higher CAP variances. The concentration of genetic distance in a relatively narrow range (Figure 1a, 'global') is consistent with a generally low level of human population substructure (low FST); for pairs of individuals separated by more than three generations (ie most pairs), the genetic distance is very close to the overall average. Exceptions are in the lower tail of the distribution that includes pairs of closely related individuals. These exceptions include pairs of individuals in small populations that have undergone substantial random genetic drift, for instance during the peopling of the Americas. Heterozygosity is relatively low in indigenous populations of the Americas, 35 and two 'unrelated' individuals from such a population are far more similar than are two individuals chosen at random from anywhere else in the world. FST estimates, because they reflect an average difference between groups, mask some of the between-population variation. 16 The analyses presented here highlight the variation not captured by summary statistics.
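The relationship between substructure and the shape of a profile can be illustrated with a small numerical sketch. The Python below is illustrative only: the function names (`cap`, `cap_summary`) and the toy distance matrix are assumptions introduced here and are not part of the analysis reported in this paper. Given a symmetric matrix of pairwise distances, it extracts one individual's profile and reports its mean, variance and skewness.

```python
from statistics import mean, pvariance


def cap(dist, i):
    """Return individual i's profile: its distances to every other individual."""
    return [d for j, d in enumerate(dist[i]) if j != i]


def cap_summary(dist, i):
    """Return (mean, variance, skewness) of individual i's profile."""
    profile = cap(dist, i)
    m = mean(profile)
    var = pvariance(profile, m)  # population variance of the profile
    if var == 0:
        skew = 0.0  # a flat profile has no skew
    else:
        third_moment = sum((d - m) ** 3 for d in profile) / len(profile)
        skew = third_moment / var ** 1.5
    return m, var, skew


# Hypothetical toy data: individuals 0-2 are similar to one another,
# individual 3 is equally distant from all of them.
dist = [
    [0.0, 0.2, 0.2, 0.8],
    [0.2, 0.0, 0.2, 0.8],
    [0.2, 0.2, 0.0, 0.8],
    [0.8, 0.8, 0.8, 0.0],
]

for i in range(len(dist)):
    print(i, cap_summary(dist, i))
```

In this toy matrix, the profile of individual 0 is positively skewed because a single distant individual sits in its upper tail, whereas individual 3 is equally distant from everyone and its profile has zero variance.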
The highest genetic distance value overall is 0.861, for a pair of individuals including one affiliated with the Mbuti population and one affiliated with the Pima population. These individuals are also among the most geographically distant from one another if we measure geographical distance along a migration pathway out of Africa, east through Eurasia and then into the Americas. Population subdivision within Africa has been so high that the two most genetically dissimilar individuals in Africa are more dissimilar than any two individuals outside of Africa, but not so high that those two individuals are the most dissimilar overall. As indicated in Table 5, the region with the greatest divergence between individuals (Africa) is also the region with highest heterozygosity. The pairs of individuals with the largest genetic distances vary depending on the distance metric (results not shown). Many of the population samples included in the CEPH-HGDP panel were included for anthropological interest. These populations are often small, more isolated than most ethnic/linguistic groups and considered to be the indigenous peoples of a region. They can be considered valuable with regard to understanding human genetic variation, in that they probably represent the extremes in terms of effective size and degree of isolation and, therefore, individual genetic distance.
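A summary of the kind reported in Table 5 can be derived from a pairwise distance matrix together with a region label for each individual. The following Python sketch makes that step explicit under stated assumptions: the function name `max_distance_by_region_pair`, the region labels and the toy distances are hypothetical and are not the CEPH-HGDP values.

```python
from itertools import combinations


def max_distance_by_region_pair(dist, regions):
    """Map each unordered pair of region labels to (max distance, i, j)."""
    best = {}
    for i, j in combinations(range(len(regions)), 2):
        key = tuple(sorted((regions[i], regions[j])))
        if key not in best or dist[i][j] > best[key][0]:
            best[key] = (dist[i][j], i, j)
    return best


# Hypothetical toy data: two individuals per region.
regions = ["Africa", "Africa", "America", "America"]
dist = [
    [0.00, 0.70, 0.80, 0.90],
    [0.70, 0.00, 0.75, 0.80],
    [0.80, 0.75, 0.00, 0.30],
    [0.90, 0.80, 0.30, 0.00],
]

result = max_distance_by_region_pair(dist, regions)
for pair, (d, i, j) in sorted(result.items()):
    print(pair, d, (i, j))
```

The toy values are chosen to mimic the qualitative pattern described above: the largest within-region distance occurs in one region, but the largest distance overall involves individuals from two different regions.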
In some cases, population profiles indicate deviations from simple models of population history. Profiles for several populations of the Americas and Oceania are much broader than those of other regions (Figure 2), possibly reflecting population substructure. As noted above, samples for the CEPH-HGDP cell line panel are distributed with indication that the Karitiana, Surui, Maya and Pima samples include relative pairs. The data analyses described above do not include known relative pairs; however, reanalysis including sets of closely related individuals led to more highly skewed profiles (results not shown).
CAPs analysis revealed that the CEPH-HGDP sample set includes 13 duplicate samples. Such detection of duplicate samples is best carried out using a distance measure that gives a distance of 0 between two genetically identical (or almost identical, if occasional genotyping errors have occurred) individuals.
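Duplicate screening of this kind can be sketched in a few lines. The Python below uses a simple allele-sharing distance as a stand-in, chosen only because it returns 0 for identical multilocus genotypes; it is not necessarily the distance measure used in this paper. The function names, the `tolerance` parameter, the sample identifiers and the genotypes are all hypothetical, and missing data are not handled.

```python
from collections import Counter
from itertools import combinations


def allele_sharing_distance(g1, g2):
    """Average over loci of 1 - (number of shared alleles) / 2.

    Each genotype is a list of (allele, allele) pairs, one pair per locus,
    with the loci in the same order for both individuals.
    """
    total = 0.0
    for locus1, locus2 in zip(g1, g2):
        # Multiset intersection counts alleles shared regardless of order.
        shared = sum((Counter(locus1) & Counter(locus2)).values())
        total += 1 - shared / 2
    return total / len(g1)


def find_duplicates(genotypes, tolerance=0.0):
    """Return pairs of sample IDs whose distance is at most `tolerance`."""
    return [
        (a, b)
        for a, b in combinations(sorted(genotypes), 2)
        if allele_sharing_distance(genotypes[a], genotypes[b]) <= tolerance
    ]


# Hypothetical STR genotypes (allele sizes in repeat units) at three loci.
genotypes = {
    "S1": [(10, 12), (8, 8), (15, 17)],
    "S2": [(12, 10), (8, 8), (17, 15)],  # same as S1, alleles listed in another order
    "S3": [(10, 11), (8, 9), (14, 17)],
}

print(find_duplicates(genotypes))
```

Raising `tolerance` slightly above 0 would allow for occasional genotyping errors, at the cost of also flagging very close relatives; the appropriate threshold depends on the number of loci and is left open here.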
The four population pairs considered in detail illustrate the diversity of human CAPs. The CAPs (Figure 5), summarised in Table 6, can be interpreted in light of the simulations. Average genetic distances are consistent with high effective size for both the Biaka and the Mbuti, intermediate effective sizes for the Kalash, Burusho, Pima and the Papua New Guineans and low effective sizes for the Surui and the Karitiana. Figure 5 reveals older divergence for the geographically distant population pairs (given the difference between the average 'within' and 'between' genetic distances) compared with geographically proximate population pairs. For geographically distant population pairs, unimodal, distinct CAPs for the 'within' and 'between' comparisons indicate lack of gene flow. Overlapping 'within' and 'between' CAPs for geographically proximate populations are consistent with more recent divergence of these groups. The Kalash and Burusho, for example, seem to have similar effective sizes and to have diverged relatively recently. The second peak of the 'overall' comparison corresponds to a high genetic distance, indicating the presence of some particularly distant Burusho individuals. The Karitiana 'within' CAP is also bimodal, with one peak having a lower than average genetic distance. This peak could correspond to comparisons between local Karitiana and other local Karitiana (and the other peak corresponds to local Karitiana compared with Karitiana with recent migrant ancestry). Alternatively, the first peak may reflect inbreeding within the Karitiana population. The other two comparisons, Pima versus Mbuti and Papuan versus Biaka, reveal very high levels of variability in the African populations, consistent with previous analyses of these and other data.
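The comparison of 'within' and 'between' distances used in this interpretation can be written out as a short sketch. The Python below is an illustration under assumptions: the function names (`split_profile`, `within_between_gap`), the population labels and the toy distances are hypothetical and do not reproduce the values in Table 6. It splits a focal individual's profile by population label and reports the difference between the mean 'between' and mean 'within' distances.

```python
from statistics import mean


def split_profile(dist, pops, focal):
    """Split the focal individual's profile into 'within' and 'between' lists."""
    within, between = [], []
    for j, d in enumerate(dist[focal]):
        if j == focal:
            continue  # skip the self-comparison
        (within if pops[j] == pops[focal] else between).append(d)
    return within, between


def within_between_gap(dist, pops, focal):
    """Mean 'between' distance minus mean 'within' distance for one individual.

    Raises statistics.StatisticsError if either category is empty.
    """
    within, between = split_profile(dist, pops, focal)
    return mean(between) - mean(within)


# Hypothetical toy data: three individuals in population A, two in B.
pops = ["A", "A", "A", "B", "B"]
dist = [
    [0.00, 0.30, 0.32, 0.60, 0.62],
    [0.30, 0.00, 0.31, 0.61, 0.63],
    [0.32, 0.31, 0.00, 0.59, 0.60],
    [0.60, 0.61, 0.59, 0.00, 0.20],
    [0.62, 0.63, 0.60, 0.20, 0.00],
]

print(split_profile(dist, pops, 0))
print(within_between_gap(dist, pops, 0))
```

In the framework described above, a large gap with non-overlapping 'within' and 'between' values would be read as older divergence without gene flow, while a gap near zero would be consistent with recent divergence; the sketch computes only the gap and leaves that interpretation to the analyst.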

Applications
The pattern of human population genomic variation is relevant in a number of research and education contexts. As noted above, the pattern reflects, and therefore may provide insight into, population history. In medical genetics, knowledge of any genetic substructure of a set of probands may inform research decisions. In forensics, an understanding of patterns of genetic variation is becoming increasingly relevant as institutions attempt to infer racial or ethnic affiliation of individuals using DNA data. 42 In secondary and undergraduate education, the discussion of race and genetics has typically been highly superficial. As publicly available data regarding genetic information accrue, a basic understanding of human population genetic variation becomes an increasingly important component of public education.
When the research goal is to take into account cryptic population subdivision, as in case-control studies, genomic controls 43-45 or clustering approaches (eg structure) are appropriate; however, these approaches may not always reveal a small fraction of individuals that stands out from the rest in terms of genetic distance. The structure approach, for instance, is sensitive to sample size. 4 The CAPs approach may more readily reveal anomalies such as duplicate samples and closely related pairs of individuals.
CAPs vary both within and across geographically and socially defined groups. The profiles indicate that some population labels serve as better proxies for genetic similarity than do others. That is, some linguistic or social groups consist of individuals much more genetically similar to one another than to individuals of other groups, while other groups do not. Emphasis on absolute description of variation can be valuable, in that continuity of the measurement is naturally emphasised. The individual-based CAPs approach also emphasises the shared ancestry of all humans: all pairs of individuals fall into the continuum of genetic distance. Finally, although the CAPs approach does not require a priori information regarding an individual's affiliation with one group or another, the approach does allow us to explore hypotheses regarding the correspondence between genetic and non-genetic dimensions of human variation.
CAPs can be considered as genomic versions of pairwise difference distributions for single DNA sequence loci. 46,47 These genomic profiles enable us to consider both within- and between-group variation simultaneously and to complement traditional summary statistics in revealing differences among individuals in the variances of individual CAPs (eg Figures 1 and 5). Although most information regarding population genetic structure is captured in a sufficiently hierarchical analysis of variance, 16 CAPs reveal, in addition, information at the genealogical level. While CAPs are no replacement for traditional population genetic summary statistics, direct estimation of gene flow (eg BayesAss 23) or direct inference of degree of relationship (eg relpair 31), they serve as a valuable exploratory tool and as an independent check of estimates derived using other methods.