Inference of ancestry: constructing hierarchical reference populations and assigning unknown individuals

The ability to infer personal genetic ancestry is being increasingly utilised in certain medical and forensic situations. Herein, the unsupervised Bayesian clustering algorithm structure is employed to analyse 377 autosomal short tandem repeats typed on 1,056 individuals from the Centre d'Etude du Polymorphisme Humain Human Diversity Panel. Individuals of known geographical origin were hierarchically classified into a framework of increasingly homogeneous clusters to serve as reference populations into which individuals of unknown ancestry can be assigned. The groupings were characterised by the geographical affinities of cluster members and the accuracy of these procedures was verified using several genetic indices. Fine-scale substructure was detectable beyond the broad population level classifications that previously have been explored in this dataset. Metrics indicated that within certain lines, the strongest structuring signals were detected at the leaves of the hierarchy where lineage-specific groupings were identified. The accuracy of unknown assignment was assessed at each level of the hierarchy using a 'leave one out' strategy in which each individual was stripped of cluster membership and then re-assigned using the supervised Bayesian clustering algorithm implemented in GeneClass2. Although most clusters at all levels of resolution experienced highly accurate assignment, a decline was observed in the finer levels due to the mixed membership characteristics of some individuals. The parameters defined by this study allowed for assignment of unknown individuals to genetically defined clusters with measured likelihood. Shared ancestry data can then be inferred for the unknown individual.


Introduction
Hypervariable microsatellite markers, situated across the autosomes, have been shown to produce stronger resolution for high-level differentiation of populations when compared with biallelic markers. 1 Several expanded studies have demonstrated the usefulness and accuracy with which multi-locus microsatellites can define genetic groupings that correspond well with geographical and other proxy designations; 2-7 however, the resolution of such studies has been variable. Using 377 autosomal loci and Bayesian clustering methods, Rosenberg et al. 6 demonstrated genetic differentiation among major continents and the ability within certain localities to identify a subpopulation as a single genetic grouping from other geographically adjacent populations.
Although the confounding effects of homoplasy have provided conflicting analyses at times, microsatellites are generally considered to be highly informative genetic markers for high-resolution population differentiation studies. Using extensive multi-locus genotypes from a worldwide population sampling, within-population variance (0.930-0.950) 6 dominates the total variance of the world population. Although it has been demonstrated that homoplastic mutations increase the likelihood of common identical by state alleles among unrelated individuals, thereby reducing variance for the individual, adjusted estimates for within-population variance (0.812-0.854) 8 still exceed between-population variance. This indicates an overall similarity between populations (as defined by current geopolitical or other proxy designations) and strong variance within such populations. Within-population variance is expected to decrease, however, when examining populations based on hierarchical genetic similarities rather than proxy definitions. Further, a systematic hierarchical analysis of the genetic composition of subpopulations would allow groups that have strong genetic homogeneity to be identified and reveal relationships that are probably due to extended familial ties. These relationships may persist across geopolitical borders, but are expected to produce a genetic framework for describing the biological links detected between members of the total dataset. This framework represents a set of reference populations with which individuals with unknown personal ancestry can be compared and assigned to a most-likely population. Personal origin at a certain level of resolution and potential shared ancestry data can be inferred for the unknown individual by the characterisation of the members of its matching reference cluster.
Using the Centre d'Etude du Polymorphisme Humain (CEPH) Human Diversity Panel dataset, we describe and validate its decomposition into fine-scale resolution reference populations to which unknown individuals can be assigned with measured likelihood, revealing relevant ancestral information on a more recent time scale for the unknown individual.

Subjects and methods
The microsatellite data consisted of 377 autosomal loci typed on 1,056 individuals from 52 populations worldwide. Data correlated to samples' geographical origins were obtained from the CEPH Human Diversity Panel resource. This has previously been used to address different hypotheses in several other analyses. 1,6,9-13 Samples were excluded from analysis according to reported instances of mislabelling. 6,14 Analysis of the dataset was accomplished using structure v. 2.0. 4 In order to obtain fine-scale resolution clusters, a hierarchical breakdown of samples was performed on each cluster and then subcluster, assembling increasingly homogeneous groupings of individuals until the lowest level could not be decomposed further. The substructure of the dataset was treated as unknown; all runs in structure were performed with the 'no pop info' parameter and no proxy designations were applied during the hierarchical analysis of the samples. For all runs, the structure algorithm was applied with a burn-in of 10^3 and with 10^5 data collection steps. The admixture and correlated allele frequencies models 15 and an infer-alpha prior of 1 (for use in determining K) were used in all runs. While structure accounts for the correlation of alleles due to divergence from a common ancestral population, correlations induced by gene flow across unrelated populations are ignored. Using the current dataset, Fu et al. 12 found high correlations among populations which could lead to overestimation of genetic differentiation among the populations. A new mixture model was proposed which estimates correlation explicitly due to shared history and gene flow from which future analyses are likely to benefit. Also within structure, missing data are ignored when estimating allele frequencies and admixture proportions for individuals in populations at each update. Admixture proportion estimates are less accurate for individuals having missing data, but the exclusion of such individuals is not recommended unless most of the genotype data are missing. 16 Considering that in this set 98 per cent of the samples had ≥ 90 per cent complete data, it was determined to include the total complement of individuals for this investigation. Alternate methodologies for managing missing data have been explored in the literature. Yang et al. 13 found structure performed better in a dataset with 3.27 per cent missing data overall, when subjects were restricted to those having only complete data. A separate analysis package, BAPS 2, handles missing allelic points through data augmentation. 11,17 Future applications of the current dataset would probably benefit from similar strategies.
A fundamental difficulty in using unsupervised clustering methods such as structure is that the user must specify the number of clusters (K) into which to partition the data and then determine which solution best represents the data. To address this in our analysis, a selection method was developed, defining criteria with which to select an appropriate K for the data at each level of analysis. The strategy for selecting K was two-fold, representing a balance between maximising the posterior probability of the solution while taking into account the similarity of the solutions produced for independent runs with identical input and parameters (see Appendix). Alternate analysis packages offer related Bayesian methods that avoid this difficulty, where allele frequencies, individual assignments and K are estimated simultaneously. 11,17,18 Possibilities exist for future comparative analysis of results using the different available methods.
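The selection criteria themselves are given in the Appendix and are not reproduced here. Purely as an illustration of the two-fold idea, the sketch below assumes that each replicate run is summarised by its estimated log probability of the data and by one hard cluster label per individual, that similarity between runs is measured with the Rand index, and that a K is admissible only when the mean pairwise similarity reaches an arbitrary threshold. The function names, the similarity measure and the threshold are all assumptions of this sketch and not the published procedure.

```python
from itertools import combinations
from statistics import mean


def rand_index(labels_a, labels_b):
    """Fraction of individual pairs on which two hard clusterings agree.

    The measure ignores how clusters are numbered, so two runs that differ
    only by label switching score 1.0.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def select_k(runs_by_k, min_similarity=0.95):
    """Pick the K with the highest mean log probability among consistent Ks.

    runs_by_k maps K to a list of (ln_prob, labels) tuples, one per
    replicate run (hypothetical input format). min_similarity is an
    arbitrary illustrative threshold.
    """
    best_k, best_score = None, float("-inf")
    for k, runs in runs_by_k.items():
        sims = [rand_index(a[1], b[1]) for a, b in combinations(runs, 2)]
        if sims and mean(sims) < min_similarity:
            continue  # replicate runs disagree too much at this K
        score = mean(ln_prob for ln_prob, _ in runs)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

In this sketch a K with a higher posterior probability is still rejected when its replicate runs partition the individuals inconsistently, which is one simple way of expressing the balance described above.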

Genetic characterisation of clusters
At each level of analysis, several types of quantitative genetic data were collected as each parent cluster broke into multiple child clusters to demonstrate that partitions applied to the data were productive in assembling close genetically-related individuals and excluding others. Measures collected for this dataset included F_ST values, intra- and inter-cluster allele-sharing statistics and average gene diversity (H) of clusters. 19 H was also used to characterise genetic affinity within small clusters suspected to contain members from extended family groups. Intra-population allele-sharing statistics were collected for each individual i as the mean of the number of shared alleles between i and i_1 to i_n within a cluster. Likewise, inter-population measures were calculated as the mean number of shared alleles between i and i_1 to i_n in all sibling clusters. Inter-population allele-sharing statistics were calculated using sibling clusters only, as individuals in these clusters were more closely related to the cluster of interest than individuals in more distantly differentiated clusters, producing a stricter measure with which to indicate between-cluster allele sharing.
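The allele-sharing statistics described above can be sketched as follows. The data layout (one tuple of two alleles per locus, None for a missing locus) and the per-locus definition of sharing (size of the multiset intersection of the two genotypes, so 0, 1 or 2) are assumptions made for this illustration; the text does not spell out either detail.

```python
from collections import Counter
from statistics import mean


def shared_alleles(geno_a, geno_b):
    """Total number of alleles shared over loci typed in both individuals."""
    total = 0
    for locus_a, locus_b in zip(geno_a, geno_b):
        if locus_a is None or locus_b is None:
            continue  # skip loci with missing data in either individual
        total += sum((Counter(locus_a) & Counter(locus_b)).values())
    return total


def mean_sharing(individual, others):
    """Mean number of shared alleles between one individual and a group."""
    return mean(shared_alleles(individual, other) for other in others)


def sharing_statistics(clusters, name):
    """Return (intra, inter) allele-sharing values for one cluster.

    clusters maps a cluster name to a list of genotypes; every cluster
    other than `name` is treated as a sibling cluster.
    """
    members = clusters[name]
    siblings = [g for other, genos in clusters.items()
                if other != name for g in genos]
    intra = [mean_sharing(ind, members[:i] + members[i + 1:])
             for i, ind in enumerate(members)]
    inter = [mean_sharing(ind, siblings) for ind in members]
    return intra, inter
```

The sketch needs at least two members in the cluster of interest and at least one sibling individual, since a mean over an empty group is undefined.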

Unknown sample assignment
Following the decomposition of the total dataset, all clusters were fixed at each level of resolution, giving discrete hierarchical population assignments to each individual. Using a 'leave one out' method, a member of the dataset was stripped of its cluster designation and then used as an unknown individual to be assigned to the set of reference populations in order to test the ability to assign individuals to the reference populations with accuracy. Population assignments were made by an analytical Bayesian assignment algorithm implemented in GeneClass2 20 using Rannala and Mountain priors. 21,22 As with the admixture model implemented in structure, this algorithm allows for partial assignment of individuals to several populations, based on the differential assignment of chromosomal fragments to clusters. In GeneClass2, the probability of an individual being assigned to a cluster is the product of the probability of membership of each locus under consideration. 22 Success was achieved if the majority of the individual's assignment was to the original cluster into which it was placed by structure; failure occurred if it was assigned into a different cluster. Each individual was tested in this manner at each level of resolution to approximate the ability accurately to assign a truly unknown test sample to the cluster likely to have the most shared ancestry.
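The 'leave one out' loop can be illustrated with a deliberately simplified stand-in for the assignment step. The sketch below is not GeneClass2 and does not implement the Rannala and Mountain prior: it assumes flat pseudo-count smoothing of allele frequencies, Hardy-Weinberg genotype probabilities multiplied over loci, and assignment to the single highest-scoring cluster. The data layout and all function names are hypothetical.

```python
import math
from collections import Counter


def allele_counts(samples, n_loci):
    """Per-cluster, per-locus allele counts from (label, genotype) pairs."""
    counts = {}
    for label, genotype in samples:
        per_locus = counts.setdefault(
            label, [Counter() for _ in range(n_loci)])
        for j, locus in enumerate(genotype):
            if locus is not None:  # ignore missing loci
                per_locus[j].update(locus)
    return counts


def log_likelihood(genotype, per_locus, alleles_per_locus, pseudo=1.0):
    """Sum over loci of the log genotype probability in one cluster."""
    total = 0.0
    for j, locus in enumerate(genotype):
        if locus is None:
            continue
        denom = sum(per_locus[j].values()) + pseudo * alleles_per_locus[j]
        a, b = locus
        pa = (per_locus[j][a] + pseudo) / denom
        pb = (per_locus[j][b] + pseudo) / denom
        total += math.log(pa * pb if a == b else 2 * pa * pb)
    return total


def assign(genotype, counts, alleles_per_locus):
    """Return the cluster label with the highest log-likelihood."""
    return max(counts, key=lambda label: log_likelihood(
        genotype, counts[label], alleles_per_locus))


def leave_one_out(samples):
    """Fraction of individuals re-assigned to their original cluster."""
    n_loci = len(samples[0][1])
    alleles_per_locus = [
        len({a for _, g in samples if g[j] is not None for a in g[j]})
        for j in range(n_loci)
    ]
    hits = 0
    for i, (label, genotype) in enumerate(samples):
        reference = samples[:i] + samples[i + 1:]  # strip the test sample
        counts = allele_counts(reference, n_loci)
        hits += (assign(genotype, counts, alleles_per_locus) == label)
    return hits / len(samples)
```

Removing the test individual before the reference frequencies are counted is the essential step: without it, the individual's own alleles would bias the score towards its original cluster.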

Geographical/lineage-specific characterisation of reference clusters
The decomposition of the dataset was documented over five rounds of analysis, representing progressively tighter levels of resolution. Using the defined selection criteria to choose K, the total dataset initially broke into seven major clusters, representing geographically distinct continental designations and two isolated populations: 1. were completely isolated from all other populations at fine levels of resolution in the hierarchy, even if they were geographically proximal to other sampled groups. 5. Purported extended family groups showing strong genetic homogeneity were extracted from among population lineages through the recursive application of the hierarchical analysis (Table 1). Further characterisation of the geographical and lineage-specific decomposition is presented in the Supplemental Material (see below), in the context of the seven major lines identified after the first round of analysis.

Genetic characterisation of clusters
Data for several genetic indices were collected to document the extent of differentiation of clusters as they were identified through the hierarchy. Figure 1 illustrates mean F_ST values measured over the course of the hierarchy for each of the seven major lines. These data points are summary values for all subclusters that branched out from the initial main line defined at the second tier of analysis. In the course of the decomposition, F_ST values remained high. Although at certain midpoints the Middle East/European/Pakistani, African and East Asian clusters demonstrated weaker measures of differentiation between the proposed clusters, they experienced their highest F_ST ratings towards the end of the hierarchy. Substructure detected at each tier was tested by 10^4 permutation steps (p < 0.00001).
H was also measured and summarised as a mean value for the seven main lines at each level in the hierarchy (Figure 2). Generally, these measures illustrated a decrease in the diversity of the composition of the clusters over the course of the decomposition.
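For readers who want to reproduce a diversity measure of this kind, the sketch below computes average gene diversity with a commonly used unbiased per-locus estimator, n/(n - 1) × (1 - Σ p²), where n is the number of gene copies sampled at the locus and p are the allele frequencies. Whether this exact estimator matches the one used for H in this study is an assumption; the data layout (a tuple of two alleles per locus, None for missing data) is also hypothetical.

```python
from collections import Counter


def gene_diversity(genotypes):
    """Average over loci of n/(n-1) * (1 - sum of squared allele frequencies)."""
    n_loci = len(genotypes[0])
    per_locus = []
    for j in range(n_loci):
        alleles = [a for g in genotypes if g[j] is not None for a in g[j]]
        n = len(alleles)
        if n < 2:
            continue  # the estimator is undefined for fewer than two copies
        sum_sq = sum((count / n) ** 2 for count in Counter(alleles).values())
        per_locus.append(n / (n - 1) * (1 - sum_sq))
    return sum(per_locus) / len(per_locus)
```

A cluster in which every individual carries the same allele at every locus scores 0, so a falling value down the hierarchy corresponds to increasingly homogeneous clusters.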
Intra- and inter-cluster allele sharing data were collected as each parent cluster divided into K child clusters and analysed in relative frequency histograms. Using window-smoothing techniques, the resulting curves were observed as approximately normal (Figure 3). Median within- and between-population measurements for each curve were observed as point estimates indicating the extent of differentiation of the newly defined subpopulation from its sibling clusters. Figure 4 summarises the mean number of pairwise matches observed within and between clusters over the course of the hierarchy for all seven major lines. An increase in intra-cluster allele sharing was observed over the course of the hierarchy for all lines. Inter-cluster allele sharing also increased, but, on average, intra-cluster statistics were always greater.
Table 1. Quantitative description of purported family lineages identified in the CEPH Human Diversity Panel. Recursive hierarchical analysis allowed identification of the tightly related subgroups found in the sampling of the major geographic areas of the world.

Observations were also made as to the productivity of population separation by analysing the degree of differentiation of the inter- and intra-cluster distributions. The null hypothesis that the allele sharing distributions for the sampled populations were the same was tested at a level of significance (α) of 0.05. Measured β-values quantified the probability of incorrectly accepting the null hypothesis when allele sharing distributions for the inter- and intra-cluster populations were distinct. The β-values are an indicator of the extent of genetic divergence of the newly formed sibling clusters. Examination of four of the seven main lines (the Biaka Pygmy, Oceanic, American and Kalash clusters and their subclusters) shows that, on average, there was excellent differentiation of the intra-cluster peak from the inter-cluster peak throughout the hierarchy. The African population demonstrated a fairly high β-value with its initial division, indicating weaker differentiation of the intra- and inter-cluster distributions. The East Asian line demonstrated strong separation from its sibling clusters with the initial division; with succeeding rounds of analysis, however, the mean β-value indicated less genetic distinction between subsequently divided populations. Similarly, genetic differentiation was weak in the Middle Eastern/European/Pakistani main line, but increased over the course of the hierarchy. Subclusters having similarly diverse genetic composition were the result of partitions where β-values were high.
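The exact test behind the β-values is not specified in this section. One plausible reading, sketched here under stated assumptions, treats both allele-sharing distributions as normal, takes the inter-cluster distribution as the null, places a one-sided critical value at its upper α quantile, and reports β as the share of the intra-cluster distribution that falls below that value. The one-sided form and the function name are assumptions of this sketch.

```python
from statistics import NormalDist, mean, stdev


def beta_value(intra, inter, alpha=0.05):
    """Approximate type II error for separating intra from inter sharing.

    intra and inter are lists of per-individual mean allele-sharing values.
    A value near 0 means the two curves are well separated; a value near
    1 - alpha means they overlap almost completely.
    """
    null = NormalDist(mean(inter), stdev(inter))
    alternative = NormalDist(mean(intra), stdev(intra))
    critical = null.inv_cdf(1 - alpha)  # upper alpha quantile of the null
    return alternative.cdf(critical)
```

Under this reading, identical distributions give β = 1 - α = 0.95, while fully separated peaks give a β close to zero, matching the interpretation of high and low β-values in the text.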

Unknown sample assignment
To approximate the ability to assign unknown individuals to the structured reference clusters, a 'leave one out' method was utilised and results from these tests were quantified (Table 2). In the first three levels of the hierarchy, nearly all samples were re-assigned to their original clusters. In subsequent levels, a drop in success of re-assignment was observed; however, a large proportion of clusters in all levels of the hierarchy continued to perform well (> 90 per cent success rate). In observing the rate of success over the course of the hierarchy for the total dataset, it was acknowledged that the mis-assignment of particular individuals might indicate that more stable genetic groupings exist than those defined in the original decomposition of the dataset. When mis-assigned individuals' cluster membership classifications were adjusted to those defined in re-assignment tests by GeneClass2, 20 an overall increase in success rate was noted at all levels of resolution in subsequent re-assignment procedures. This step was termed 'post-clustering adjustment'.

Discussion
A method has been described integrating supervised and unsupervised Bayesian clustering algorithms in which unknown individuals can be assigned to likely reference populations for the purpose of inferring personal ancestry data. This demonstrated the successful assembly of informative clusters, wherein the most closely related individuals were assembled and more diverse individuals were excluded, even at the finest levels of resolution. The differential termination of the main lines at various levels of resolution was due to differential conclusion of analysis for the main lines in the course of the decomposition.

Review PRIMARY RESEARCH
Geographical and genetic evaluations confirm the efficacy of the procedure.

Geographical/lineage-specific characterisation of reference clusters
Several trends were observed when cluster composition was analysed by correlating geographical origins to cluster members. In all populations sampled, small groups were identified that exhibited markedly increased levels of homogeneity (Table 1). These clusters were generally identified deep in the hierarchy as they separated from larger population groupings. They are suspected to contain members of extended family groups collected in the sampling, which form a further level of genetic resolution within already highly differentiated populations.
At broader levels of resolution, it was observed that many of the populations that experienced high levels of differentiation were either isolated by distance in the sampling (Yakut, Russian, Mozabite etc) or were populations likely to have experienced prolonged periods of genetic isolation (Basque, Biaka, Orcadian, American etc). This was evident in the

Figure 3. Normalised pairwise mean allele sharing within clusters and among sibling clusters. It is common for substantial overlap to be observed between the distributions, but intra-population measures are typically greater than inter-population measures. 23 (a) Within Karitiana, among other sibling clusters from the Americas. This population subdivision defined by structure demonstrated strong population differentiation, even at finer levels of resolution, such that the distributions were completely separated. (b) Within Israel - Negev, among European and other Middle Eastern sibling clusters. Other population divisions defined by structure showed more overlap between the distributions, indicating that although there was strong homogeneity within the intra-cluster population, similar alleles were found in sibling clusters, further indicating that the populations were more closely related. The degree of overlap was quantified to describe the relatedness of the samples being partitioned. [Axes: pairwise mean number of shared alleles and relative frequency; curves: intra-cluster and inter-cluster.]

Inference of ancestry
Table 2. Re-assignment characteristics of unknown samples to reference clusters defined in the hierarchical decomposition of a dataset by structure. Re-assignment success rates were high over the first several levels of resolution in the analysis. The decrease in re-assignment success over the course of the hierarchy was primarily attributed to an increase in the number of samples exhibiting mixed membership properties among multiple clusters. Despite the decline in overall success, a large proportion of the defined clusters at each tier demonstrated highly successful re-assignment rates. These high-performance clusters have the potential to provide highly informative assignments for truly unknown individuals in ancestry testing procedures. Additionally, post-clustering adjustments, wherein certain outlier samples were systematically relocated, had the effect of increasing success rates at all levels of resolution.
Densely sampled Pakistan populations exhibited some population substructure down to fine levels of resolution. Population-specific clusters were anchored by the Burusho, Brahui and Hazara, but otherwise it was difficult to detect clear genetic distinctions among the other tribal groups. The density of sampling when compared with other world regions may prevent the separation of distinct populations observed elsewhere. This situation was also observed in southern China, where sampling was dense. It is possible that the different hierarchical properties observed within populations can be attributed to the incongruent sampling schemes observed within the data, where sampling of closely-related populations detects a genetic gradient that diminishes the ability to resolve populations according to proxy designations. By contrast, some broad population-level groupings have a high degree of within-population substructure due to sampling of populations that are distantly isolated from one another, or due to thorough sampling of population isolates that are likely to have accumulated distinct allele frequencies. Further population structure is observed within some clusters as a result of the sampling of related individuals, causing stratification even at fine levels of population resolution.
Meanwhile, the substructure of other populations, having individuals with mixed membership in multiple populations and other transitional attributes, is more difficult to resolve. This may not be due to intrinsic population characteristics but rather to dense sampling of geographically contiguous populations, likely to have more admixed properties among the sampled members. The sampling from the Americas shows a low degree of within-population variance and high between-population differentiation. 6 Although this has been demonstrated in this dataset previously, the representation of the Americas is fragmentary and would be expected to produce strong quantitative differentiation by virtue of the geographical isolation of the sampled populations and the sampling of population isolates, as well as the existence of multiple family groups within these populations. When contrasted with the dense representation of populations in East or Central Asia, the quantitative measures showing weaker differentiation of subpopulations are put into context.
In some populations, distinct genetic characteristics were detected despite the tight geographical proximity of the samples. Among the three Middle Eastern groups sampled, the Druze (cluster S4A-4) were the most distinct; the Palestinians (cluster S4B-1) and Negev Bedouins (cluster S4A-6) also formed some exclusive associations (see Supplemental Material). Although geographically close, these populations have been genetically isolated from one another as demonstrated in the ability to identify genetic differentiation.
Common trends observed in the decomposition of the sampled world populations suggest population substructure characteristics that may be observed in the hierarchical decomposition of other sample datasets. Initial broad breaks in the hierarchy were successful in isolating genetically disparate populations that were typically separated by large physical distances. Subsequent rounds of analysis further down the hierarchy were attempted on populations that had stronger genetic similarities than in previous tiers. This resulted in divisions that exhibited more overlap and mixed characteristics between the newly formed subclusters. Additionally, mixed membership among subclusters was most often observed in densely sampled areas among populations that were geographically intermediate to distant populations. Population isolates often anchored highly differentiated clusters that contained various portions of more heterogeneous groups. Regions that were represented by distantly sampled populations showed clearer differentiation among subgroups. Through hierarchical analysis, however, it was possible, even within geographically proximal populations, to completely isolate certain populations into their own unique subcluster. Additionally, smaller groups of individuals that show strong genetic homogeneity can be identified from larger population groups, and probably exhibit the presence of closer familial relationships. A hierarchical approach to detecting population structure allowed fine-resolution clusters to be defined, representing more recent relationships from among the dataset. The genetic clusters defined at all levels of resolution can serve as a template with which unknown samples can be compared and assigned for the purposes of ancestry testing. The potential for highly informative assignments exists due to the genetic, geographical and lineage-specific composition of many of the clusters identified in the total dataset.

Genetic characterisation of clusters
Descriptive genetic values identified through the progression of the hierarchy demonstrated that the imposed population partitions were productive in assembling closely-related members and excluding genetically dissimilar individuals from the constructed reference populations.
F_ST values were useful in observing trends in substructure detected at the various levels of the hierarchy (Figure 1). All of the main lines were characterised either by high F_ST values at all points in the hierarchy or values that increased over the course of the decomposition. Even at the highest levels of resolution, F_ST values for the Oceanic and American samplings indicated discrete subpopulations, probably due to increased levels of relatedness among the sampled individuals. Although Middle East/European/Pakistani, African and East Asian clusters demonstrated weaker differentiation between the proposed clusters at certain midpoints in the decomposition, they experienced their highest F_ST ratings at the leaves of the hierarchy. This was attributed to the strong genetic diversity seen within the populations in the early cluster divisions. Initial partitions within these lines were productive in imposing separations between geographically sensible populations, even though the within-population diversity was still strong. Continued hierarchical analysis formed groupings that preserved cohesion among strongly related individuals, thus leading to the stronger F_ST values detected at the leaves of the hierarchy. This analysis revealed the substructure properties of each of the seven major lines examined in this study and also supported the composition of the subpopulations over the hierarchy, demonstrating, on average, strong F_ST values showing high significance levels by permutation tests.
Complementary to the findings indicated by the F_ST metrics obtained over the hierarchy, H measures also showed a decrease in diversity and an increase in homogeneity within the newly defined subclusters at each level of resolution (Figure 2). As with the other metrics, this provided an empirical validation of the proposed structure of the hierarchy, even at the finest levels of resolution.
Within- and between-cluster allele sharing data, and β-values measuring the differentiation of inter- and intra-cluster distributions, were collected as each parent cluster divided into K child clusters as a means of expressing quantitatively the productivity of the imposed partitions (Figure 4). For all lines, the mean number of intra-cluster shared alleles increased over the course of the hierarchy, but the increase was less pronounced in the East Asian group. High variance within populations probably contributed to the difficulty in resolving substructure within this population compared with others analysed. On average, however, the mean number of shared alleles within a cluster did increase with new population subdivisions for all seven major lines. This indicated that the subdivisions introduced at each new tier were productive in assembling closely-related individuals and excluding others. As previously indicated, the within-population median is usually greater than the between-population point estimator. 23 On average, this was observed in the analysis. This gives an indication that the individuals within clusters are more genetically similar than individuals found outside of their own cluster, thus supporting the composition of the clusters constructed over the hierarchy.
β-values were also used to gauge the extent of population differentiation detected among newly defined clusters. A total overlap of inter- and intra-cluster distributions indicated that the population division into K child clusters produced a new subpopulation definition that was genetically indistinct from its sibling clusters, suggesting that substructure could not be detected in the sampling. A total separation of the distributions represented the achievement of ultimate productivity in defining subpopulations that demonstrated strong separation from sibling clusters, indicating the K divisions of the parent cluster were most productive in the defined partitions (Figure 3). Examination of four of the seven main lines (the Biaka Pygmy, Oceanic, American and Kalash clusters and their subclusters) showed that, on average, there was excellent separation of the intra-cluster peak from the inter-cluster peak throughout the hierarchy. This demonstrated the presence of strong subpopulations having distinctive allele sharing characteristics from other closely-related and recently differentiated groups. This supported the population partitions determined by structure even at the most specific levels of resolution where familial groups were extracted. The African population exhibited a fairly high β-value with its initial division. This was due to strong diversity among samples within populations. Further rounds of analysis helped to differentiate these groups into subpopulations that were more genetically homogeneous and excluded others into sibling clusters. The East Asian line demonstrated strong separation from its sibling clusters with the initial division. With succeeding rounds of analysis, however, the mean β-value indicated weaker differentiation between subsequently divided populations. As analysis progressed down the hierarchy for this line, the populations were increasingly similar, making it more difficult to obtain distinct subpopulations with distinct genetic composition. Similar properties were observed in the Middle Eastern/European/Pakistani main line.
These analyses provided quantitative support to the composition of the reference clusters defined over the hierarchy of the decomposition of the sampling. In general, it was observed that the partitions defined were productive in assembling individuals of close genetic similarity and excluding others. In many lines, the strongest genetic substructure was observed beyond the broad levels of resolution, revealing distinct genetic groupings and previously unrecognised extended family groupings to which unknown individuals have the potential to be assigned. Ancestry for the unknown can then be inferred from a characterisation of the matching cluster contents.

Unknown sample assignment
In performing re-assignment tests using GeneClass2, 20 several trends were observed, revealing characteristics of the dataset and its decomposition. The rate of success was high in early levels of analysis and, as it declined over the course of the hierarchy, an attempt was made correctly to attribute the causes to particular factors. The size of clusters decreased through the progression of the decomposition analysis. To test if small reference cluster size adversely affected the re-assignment tests, correlation coefficients were calculated to assess the relationship of success in re-assignment to cluster size. These indicated only a weak positive association (Table 2). Further examination showed that generally smaller clusters had highly successful re-assignment rates, probably due to decreased diversity within the groups suspected to contain members from a common lineage (Table 1). Among other quantities tested, the strongest correlation was observed between success rate of re-assignment and the degree of mixed membership observed for individuals in clusters. Membership coefficients calculated by structure indicated the probability of assignment of an individual to the newly defined child clusters. Individuals having membership in multiple clusters, rather than a strong signal in a single cluster, were more likely to fail on re-assignment.

Ekins et al.

Marked improvements in re-assignment rates at all levels of resolution were seen subsequent to post-clustering adjustment. Many of the individuals that achieved more stable population cluster definitions had mixed membership signals in the early stages of the hierarchy. Post-clustering adjustment allowed these individuals to be placed in a more stable grouping, as reflected in increased re-assignment success rates; however, some samples were still mis-assigned after this adjustment. As the hierarchy progressed, partitions in clusters were made among increasingly similar groups, creating a situation where individuals tended to have mixed membership in multiple clusters. Thus, when complex relationships existed among the individuals, precise population assignments to a single group were less representative of the properties of the dataset. The degree of mixed membership among multiple clusters for an individual was a good predictor of re-assignment success. The re-assignment characteristics of the various geographical samples were explored. Regional characteristics were seen in individuals that were mis-assigned in this test, both in terms of the frequency of mis-assignment and in the degree of mis-assignment relative to the original cluster placement (see Supplemental Material).
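The relationship between mixed membership and re-assignment success can be illustrated with a small sketch. It assumes that the strength of an individual's signal is summarised by its largest membership coefficient and that a Pearson coefficient is used; the text does not state which correlation coefficient was calculated, and the membership rows and outcomes below are invented toy values, not data from the study.

```python
from math import sqrt

def max_membership(q_row):
    """Largest membership coefficient for one individual (one row of Q)."""
    return max(q_row)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Toy membership coefficients (hypothetical) for five individuals, K = 2.
q_rows = [[0.98, 0.02], [0.55, 0.45], [0.90, 0.10], [0.51, 0.49], [0.97, 0.03]]
# 1 = re-assigned to the original cluster, 0 = mis-assigned (hypothetical).
reassigned_ok = [1, 0, 1, 0, 1]

signal_strength = [max_membership(row) for row in q_rows]
r = pearson(signal_strength, reassigned_ok)
```

In this toy case the individuals with near-equal membership in two clusters are the ones that fail, so the coefficient is strongly positive.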
We have outlined a new approach to the challenge of inferring ancestry for individuals of unknown origin. The two-stage method integrates the novel use of an unsupervised algorithm to construct a hierarchical framework of reference clusters with a supervised algorithm to perform cluster assignment of the unknown, as suggested by Baudouin et al. 22 Unknown individuals can be assigned to any level of resolution desired, with the potential for high probability assignment to a highly informative cluster defined in the hierarchy. How informative these cluster assignments are varies with the lineage and geographical specificity of cluster members and the confidence in the membership of the cluster definitions. The confidence in cluster composition can be estimated by the stability observed when performing self-assignment procedures. Many clusters were shown to have stable group membership, in that all individuals were successfully re-assigned to the same cluster, where others showed more volatility. The probability of cluster composition was estimated for each cluster in the hierarchical framework by observing the proportion of successful re-assignments when the defined population structure was subjected to self-assignment tests. Together with the likelihood calculated by GeneClass2 20 in the assignment of the unknown, this estimation contributed multiplicatively to the probability of unknown sample assignments to its matching reference cluster. This allowed appropriate weighting of unknown assignments to reference clusters that have less stable composition of cluster members. Confidence scores for assignment of the unknown obtained in this manner were generally comparable to the original probability of assignment estimated by structure at the time of the individual's placement into the hierarchical framework.
Although some reference clusters will not allow confident assignment of unknowns, due to a low probability of reference cluster membership, many clusters that have the potential for highly confident and informative assignment of unknown samples exist at all levels of resolution in the hierarchy (Table 2).
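The multiplicative weighting described above can be sketched as follows. The function names, the cluster labels and the likelihood of 0.9 are hypothetical, chosen only for illustration; the sketch assumes that cluster stability is simply the fraction of a cluster's members returned to that cluster in the self-assignment test.

```python
from collections import defaultdict

def cluster_stability(original, reassigned):
    """Proportion of each cluster's members re-assigned to the same cluster."""
    total = defaultdict(int)
    kept = defaultdict(int)
    for orig_label, new_label in zip(original, reassigned):
        total[orig_label] += 1
        if orig_label == new_label:
            kept[orig_label] += 1
    return {label: kept[label] / total[label] for label in total}

def assignment_confidence(likelihood, stability):
    """Weight the assignment likelihood of an unknown by cluster stability."""
    return likelihood * stability

# Toy self-assignment outcome (hypothetical labels): one member of A drifts to B.
original = ["A", "A", "A", "A", "B", "B"]
reassigned = ["A", "A", "A", "B", "B", "B"]
stability = cluster_stability(original, reassigned)

# An unknown matched to cluster A with an assumed assignment likelihood of 0.9.
confidence = assignment_confidence(0.9, stability["A"])
```

Under these assumptions an unknown matched to the less stable cluster A is down-weighted, while the same likelihood against cluster B would be left unchanged.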

Conclusion
With the entire hierarchical structure definition as a parameter, unknown individuals have the potential to be assigned to highly resolved cluster definitions that represent specific localities and also likely family groups. The presented level of resolution gives new insight to characteristics of this dataset that were not revealed in previous analyses using different techniques. As in this investigation, Rosenberg et al. 6 used structure with the admixture model. Using an alternate clustering strategy, Corander et al. 11,17 utilised mixture modelling and laid conditions on the geographical sampling of individuals such that subjects from the same site were assigned as a group to the product clusters. At the global and population levels of resolution, much of the broad structure detected was similar in all studies (Table 3). With the hierarchical strategy presented here, further resolution of the Mandenka, Orcadian and Russian populations generally into a single cluster was observed, while the Adygei were not seen as a genetically distinct group as observed elsewhere. Fine-scale substructure was detected beyond broad population level classifications using the current approach, which identified lineage-specific and extended family level groupings, many of which were detected at the leaves of the hierarchy (see Tables 1 and 3). Such specific groupings, quantitatively supported by strong differentiation measures from genetic indices, can serve as candidate reference populations to which unknown individuals can be assigned with measured likelihood and highly informative shared ancestry data can be learned.
It is anticipated that these findings will be of significance to individuals that have an interest in delineating more recent genetic relationships that surpass the broad population level classifications that are regularly explored. High-resolution genetic groupings and unknown assignment strategies may be valuable in such global applications as disease linkage studies, parameters for extended familial studies, for use as proxy medical classifications to assess epidemiological risk and in forensic applications. 24

Geographic characterisation of subclusters
Clusters at each level of analysis were characterised by individual members' lineage and geographical origin. Cluster composition was depicted on a Mercator map projection, with latitude/longitude plot points for each individual represented proportionally on the map. Ellipses representing the geographical coverage of each cluster, weighted by sample density, were constructed and assisted in identifying the populations most heavily represented in the proposed clusters. The length and angles of the major and minor axes were calculated from a covariance matrix as two standard deviations from the calculated geographic centroid, weighted by sample density.
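A minimal sketch of such an ellipse calculation is given here. It makes several assumptions that the text does not confirm: coordinates are treated as planar (x, y) values that are already projected, the weights are per-site sample counts, and the weighted covariance uses a population-style normalisation. The function name and the sampling sites are hypothetical.

```python
from math import atan2, degrees, sqrt

def weighted_ellipse(points, weights, n_sd=2.0):
    """Centroid, semi-axis lengths and major-axis angle of a coverage ellipse.

    points  : list of (x, y) coordinates, one per sampling site
    weights : number of samples at each site
    n_sd    : number of standard deviations spanned by each semi-axis
    """
    w_sum = sum(weights)
    cx = sum(w * x for (x, _), w in zip(points, weights)) / w_sum
    cy = sum(w * y for (_, y), w in zip(points, weights)) / w_sum

    # Weighted covariance matrix [[sxx, sxy], [sxy, syy]] about the centroid.
    sxx = sum(w * (x - cx) ** 2 for (x, _), w in zip(points, weights)) / w_sum
    syy = sum(w * (y - cy) ** 2 for (_, y), w in zip(points, weights)) / w_sum
    sxy = sum(w * (x - cx) * (y - cy) for (x, y), w in zip(points, weights)) / w_sum

    # Closed-form eigenvalues of a symmetric 2 x 2 matrix.
    half_trace = (sxx + syy) / 2.0
    root = sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    var_major = half_trace + root
    var_minor = max(half_trace - root, 0.0)  # guard against rounding below zero

    # Orientation of the major axis, measured from the x-axis.
    angle = degrees(0.5 * atan2(2.0 * sxy, sxx - syy))
    return (cx, cy), n_sd * sqrt(var_major), n_sd * sqrt(var_minor), angle

# Four hypothetical sampling sites with equal sample counts.
sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]
centroid, major, minor, angle = weighted_ellipse(sites, [1, 1, 1, 1])
```

Raising the weight of one site pulls both the centroid and the axes towards it, which is the intended effect of weighting by sample density.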

Biaka Pygmy
The Biaka Pygmy lineage was a highly specific genetic grouping, excluding all other geographically proximal African populations in its initial association. Further rounds of analysis produced four small subgroups containing two to five members that demonstrated increased levels of homogeneity (Table 1).

Sub-Saharan Africa
All other sub-Saharan African individuals, with some mixture from Middle Eastern individuals, clustered together in a main line separate from the Biaka Pygmy. This set was found to be best partitioned at K = 3. In the resulting subclusters, the Mbuti Pygmy population (cluster S1A-3) and all individuals of the San lineage (cluster S1A-2) formed two completely isolated genetic clusters (Figure S1A). The Bantu in Kenya and the geographically distant Yoruba and Mandenka belonged to the same genetic cluster at this level in the hierarchy (cluster S1A-1). At the next level of resolution, most of the Mandenka (cluster S1B-2) were distinguished into their own subgroup (Figure S1B). Among the geographical subgroups, five small clusters were identified ranging in size from two to three individuals, all showing decreased gene diversity (Table 1).

Oceania
Oceanic populations (Figure S2) demonstrated a sharp division between the geographically isolated subpopulation of Papua New Guinea (cluster S2-1) and Melanesians on the Bougainville Islands (cluster S2-2). The genetic sampling of these populations is highly structured and allows for an unequivocal division, not only between the major geographical regions, but also within the resulting subgroups. Three small subgroups were identified within the geographical clusters, demonstrating increased homogeneity (Table 1).

The Americas
Likewise, the American sampling was found to be highly structured. Of the five populations represented, the majority of samples were uniformly assigned to clusters with others from the same origin population with little evidence of gene flow (Figure S3). Only the Colombian and Karitiana had a slight proportion of samples clustering with the Maya (cluster S3-1), the rest forming their own exclusive association (cluster S3-2). The Pima population (cluster S3-3) was completely distinct, and even the geographically proximal Karitiana (cluster S3-5) and Surui (cluster S3-4) showed highly differentiated population characteristics. Small subgroups demonstrating decreased diversity were detected within each of the geographical subpopulations, as expected with the explicit familial sampling strategy for these localities. 9 Twenty-four such groups were identified in the American sampling (Table 1).

Table 3. Multi-level structure partitions of genetically distinguishable geography-based groupings identified in the sampling. Geographically-defined groups at global and population levels, in which the strong majority of individuals were uniformly and uniquely assigned to a single cluster, were considered genetically distinguishable from others in the sampling. Lineage level structure was identified when small subgroups of individuals demonstrating elevated indications of relatedness (see Table 1) were found within geographically-defined groups.

Resolution      Geographic grouping
Global level    Africa, a,b Oceania, a,b the Americas, a,b Middle East/Europe, a,b East Asia a,b
A north-south geographical cline in cluster membership was detected in the initial separation of the Pakistani sampling

Figure S1. Geographical coverage of hierarchical genetic subgroups detected in Sub-Saharan Africa. (a) K = 3 was found as the best representation of substructure at this level in the hierarchy. Clusters were specific to geography and population, such that the San and Mbuti formed distinct clusters, excluding individuals from all other populations. (b) The next level in the hierarchy allowed cluster S1A-1 to be subdivided into K = 2 genetic groups, where the majority of Mandenka form an exclusive genetic grouping from the remaining population representatives.

Figure S2. Geographical coverage of hierarchical genetic subgroups detected in Oceania. A near-complete separation of the geographically isolated Papuans and Melanesians was obtained with K = 2 subgroups. Further genetically homogeneous subgroups were found in cluster 4-2, probably indicating the presence of extended family groups within the sampling (Table 1).

Regional classification analysis of misassigned unknown individuals
In order to classify and describe failures in re-assignment for pseudo-unknown individuals, each individual was traced from its original cluster definition to its re-assigned cluster. Figure S6 depicts mis-assigned individuals at each tier, in terms of their degree of mis-assignment as it relates to the average gene diversity of the two clusters involved. The degree of mis-assignment was quantified by the number of tiers that must be regressed in order for the two clusters involved to converge. Initially, it was observed that many mis-assignments took place when an individual from a smaller cluster with more restricted genetic membership was re-assigned to a sibling cluster that had broader, more inclusive genetic statistics. The units on the x-axis quantify the proportional difference of the average genetic diversity index between the originally assigned and mis-assigned clusters. Points observed to the far right of the y-axis (x = 0) were individuals that were re-assigned to clusters that had more genetically diverse members than their initially defined cluster, while points to the left of the y-axis (x = 0) were re-assigned to clusters with lower than average gene diversity. World regions displayed distinct re-assignment characteristics over the course of the hierarchy. Clustering of symbols representing different world populations was observed at all levels of resolution. Individuals originating in Pakistan accounted for all of the mis-assignments in level two. In subsequent tiers, Pakistanis were mis-assigned to clusters that converged distantly in the hierarchy with the originally assigned clusters. In all tiers, Pakistanis were among the most numerous mis-assigned individuals and were re-assigned to clusters that had lower H measures than their original cluster. Mis-assigned individuals were similarly numerous among Middle Easterners. These individuals were mis-assigned to clusters that had both lower and higher H indices than the initial clusters.
East Asian individuals were also mis-assigned with similar frequency; however, East Asians were more frequently re-assigned to sibling or closer-degree clusters having higher H statistics. European individuals were mis-assigned with lower frequency until level 5, where discrepancies were more numerous. Europeans were generally seen to move to sibling or second-degree clusters with similar gene diversity measures. All other regional groups experienced high success rates in the re-assignment process. When mis-assignments were detected, they were most often characterised by an individual being re-assigned to a cluster with a low degree of disparity from the original, which had much wider genetic diversity than its initially designated cluster. African, American, Oceanic and the Pakistani Kalash individuals re-assigned relatively well throughout the hierarchy. Following post-clustering adjustments, individuals that were re-assigned and became more stable in their cluster assignments were included in previously high-degree mis-assigned classifications, as seen in Figure S6.
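The two quantities plotted for each mis-assigned individual can be sketched under stated assumptions. Each cluster is represented here by a hypothetical tuple giving its path from the root of the hierarchy, so the degree of mis-assignment is the number of tiers climbed before the two paths meet. Normalising the diversity difference by the original cluster's H is an assumption; the text only calls it a proportional difference.

```python
def misassignment_degree(original_path, reassigned_path):
    """Number of tiers to regress before the two clusters converge."""
    shared = 0
    for a, b in zip(original_path, reassigned_path):
        if a != b:
            break
        shared += 1
    return max(len(original_path), len(reassigned_path)) - shared

def diversity_shift(h_original, h_reassigned):
    """Proportional change in average gene diversity (positive = more diverse)."""
    return (h_reassigned - h_original) / h_original

# Hypothetical cluster paths: root -> main line -> subcluster.
sibling = misassignment_degree(("root", "A", "A1"), ("root", "A", "A2"))
distant = misassignment_degree(("root", "A", "A1"), ("root", "B", "B1"))
shift = diversity_shift(0.70, 0.77)
```

A re-assignment to a sibling cluster gives a degree of 1, and a positive shift places the point to the right of x = 0, matching the reading of the axes given above.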

Figure S5. Geographical coverage of hierarchical genetic subgroups detected in East Asia. (a) The initial division at K = 2 introduced a north-south partition. (b) Subsequent analysis allowed the Yakut to be nearly completely isolated. (c) Division in the southern cluster (S5A-2) produced four distinct clusters, one with a southern orientation (S5C-1), a westerly group (S5C-2), a completely isolated Lahu cluster (S5C-3) and a strongly anchored Japanese cluster (S5C-4). (d-f) Further analysis allowed the separation of a distinctly Japanese group (S5F-2) and mixed membership clusters in the remainder of the groupings.
It was desirable to produce a monotonic value that can be used to compare sets of independent runs at a given K. In order to characterise this process, it was necessary to define how individuals are classified into clusters from the data obtained from structure in the N × K Q-matrix. Previous investigators have used minimum membership coefficient thresholds to assign individuals to particular clusters within a run, with values ranging from ≥ 0.25 to ≥ 0.75. 5,15 As previously applied, for some analyses it is most useful to dispense with lesser admixture estimates in ancillary clusters and bin an individual completely to a single cluster, using the maximum membership coefficient, as the best forced fit for that individual.
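The two binning rules mentioned above can be sketched as follows, assuming the Q-matrix is held as a list of rows with one membership coefficient per cluster. The function names and the toy matrix are hypothetical, and ties are resolved in favour of the lowest cluster index, which is a choice made here for illustration only.

```python
def hard_assign(q_matrix):
    """Bin each individual to the cluster with its maximum membership coefficient."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_matrix]

def threshold_assign(q_matrix, threshold):
    """Assign an individual only if its best coefficient reaches the threshold."""
    assigned = []
    for row in q_matrix:
        best = max(range(len(row)), key=row.__getitem__)
        assigned.append(best if row[best] >= threshold else None)
    return assigned

# Toy N x K matrix (N = 3 individuals, K = 3 clusters); rows sum to 1.
q = [
    [0.70, 0.20, 0.10],
    [0.30, 0.30, 0.40],
    [0.05, 0.90, 0.05],
]
forced = hard_assign(q)
filtered = threshold_assign(q, 0.5)
```

The second individual illustrates the difference: the forced fit places it in the third cluster, whereas a threshold of 0.5 leaves it unassigned.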
Once the cluster for an individual i was defined by its maximum membership coefficient value, it was associated with the other individuals from that run who were also placed into the same cluster, termed set G_1. In a subsequent independent run of structure, the maximum membership coefficient of individual i again determined its cluster assignment and the individuals with which it grouped in this second run were defined as the set G_2. The set of individuals shared between sets G_1 and G_2 was determined and termed G_m.
The count of individuals of this set, N_Gm, quantified the number of individuals from G_1 that individual i also clustered with in G_2. A similarity score for individual i, S_i, measured the proportion of individuals in common that it clustered with across the independent runs, giving a measure of how well individual i adhered to other individuals from its original cluster G_1. N_G1 is the count of individuals in set G_1.
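A minimal sketch of the per-individual score follows, reading the definitions above as S_i = N_Gm / N_G1. It assumes each run is a list of hard cluster labels indexed by individual and that individual i is counted as a member of its own sets, which keeps N_G1 at least 1; the text does not say whether i itself is included.

```python
def co_members(assignments, i):
    """Individuals sharing the cluster of individual i in one run (i included)."""
    return {j for j, label in enumerate(assignments) if label == assignments[i]}

def similarity_score(run1, run2, i):
    """S_i: fraction of i's run-1 cluster mates that also group with i in run 2."""
    g1 = co_members(run1, i)
    g2 = co_members(run2, i)
    g_m = g1 & g2
    return len(g_m) / len(g1)

# Two toy runs over five individuals; the cluster labels are arbitrary per run.
run1 = [0, 0, 0, 1, 1]
run2 = [1, 1, 0, 0, 0]
s0 = similarity_score(run1, run2, 0)
s2 = similarity_score(run1, run2, 2)
s3 = similarity_score(run1, run2, 3)
```

Because only co-membership is compared, the score does not depend on how the clusters happen to be numbered in each run.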
Figure A1. Example of modes detected for a subset of the total dataset, n = 10 runs at K = 4. Three modes were explored by the structure algorithm in this dataset. The Pr(K) values for runs r_1 to r_n and an n × n matrix of similarity scores S for the set of runs are calculated. A symbol (specified in the key) is assigned to each run r and plotted on the x-axis at the −log10(Pr[K]) value for runs r_1 to r_n, versus S representing the level of similarity calculated between r and r_1 to r_n. This method expands upon previous systems for selecting a repeatable K in the sample space. Rosenberg et al. 6 also generated an n × n matrix of similarity values and took the mean of all values in the matrix to assess levels of similarity for runs generated at various K. This method was found to be restrictive, in that certain K values were rejected due to a single disparate run that skewed the overall similarity below the selected similarity threshold (0.97) for acceptance of the K. For the n × n matrix generated for this dataset, seven of the ten runs produced a similarity score of S = 1 relative to one another. The remaining three solutions were ≤ 80 per cent similar to the other seven, and had a much lower likelihood. Screening K based on the mean of all S scores (0.868 versus the minimum defined S threshold 0.97) would disqualify this K from consideration as a viable representation of the structure of the dataset. Because a major mode of high probability and high similarity can be identified within the dataset, this illustrates the selection of a statistically likely and empirically stable substructure of the dataset. Key: major mode of highest likelihood, composed of 7 runs with S = 1 and similar high likelihood for solutions; lower likelihood mode, composed of two runs; minor mode, composed of a single run with the lowest likelihood.

A measure of the total similarity, S_t, of the cluster assignments for all individuals across two independent applications of structure was estimated from the mean of all values of S_i for the total number of individuals, N_t, in the full dataset.
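The total similarity can be sketched as the mean of the per-individual scores. As in any such sketch, the details are assumptions: runs are lists of hard cluster labels, each individual counts as a member of its own cluster, and the two requirements on comparable runs are enforced with simple checks. The toy runs are invented.

```python
def total_similarity(run1, run2):
    """S_t: mean over individuals of |G_1 & G_2| / |G_1|, normalised by run 1."""
    if len(run1) != len(run2):
        raise ValueError("both runs must cover the same individuals")
    if len(set(run1)) != len(set(run2)):
        raise ValueError("both runs must use the same number of clusters K")
    n = len(run1)
    total = 0.0
    for i in range(n):
        g1 = {j for j in range(n) if run1[j] == run1[i]}
        g2 = {j for j in range(n) if run2[j] == run2[i]}
        total += len(g1 & g2) / len(g1)
    return total / n

# Two toy runs at K = 2 that disagree on the placement of one individual.
run_a = [0, 0, 0, 0, 1, 1]
run_b = [0, 0, 0, 1, 1, 1]
forward = total_similarity(run_a, run_b)   # normalised by run_a cluster sizes
backward = total_similarity(run_b, run_a)  # normalised by run_b cluster sizes
```

In this toy case the two directions give 0.75 and about 0.778, so the statistic depends on which run supplies the normalising cluster sizes, while the values remain close.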
For comparison, independent runs must meet two criteria for this method to produce meaningful estimates of the similarity of the solutions. The two solutions must comprise the same individuals and the number of clusters, K, must be consistent for the two solutions being compared. Additionally, this calculation is directional, resulting in a different value for S_t when N_G2 is used as the normalising factor in equation (2). The results are asymmetrical but are generally comparable values. This summary statistic can be used in post hoc analysis of structure runs to determine the best K with which to fit the input data.

Figure A2. Flowchart for multi-level determination of K and detection of hierarchical fine-level substructure for a dataset. The total dataset was initially decomposed using structure into K clusters according to the best run, as determined by the selection criteria. This produced K subgroups that were subsequently subjected to the decomposition process, and so on recursively until the finest substructure for each subgroup line was detected, as signalled by K = 1 for the best solution.
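The recursive decomposition summarised in Figure A2 can be sketched in outline. The `best_partition` callback is a hypothetical stand-in for running structure over a range of K and applying the selection criteria; here it is replaced by a toy rule that splits a list of numbers at its widest gap, purely so that the recursion can be run. Nothing in this sketch calls structure itself.

```python
def decompose(individuals, best_partition, label="root"):
    """Recursively split a group until the best solution is a single cluster."""
    groups = best_partition(individuals)
    node = {"label": label, "members": list(individuals), "children": []}
    if len(groups) <= 1:  # K = 1: finest substructure reached for this line
        return node
    for k, group in enumerate(groups, start=1):
        node["children"].append(decompose(group, best_partition, f"{label}-{k}"))
    return node

def leaves(node):
    """Collect (label, members) for the terminal clusters of the hierarchy."""
    if not node["children"]:
        return [(node["label"], node["members"])]
    found = []
    for child in node["children"]:
        found.extend(leaves(child))
    return found

def toy_partition(values):
    """Toy stand-in: split at the widest gap if that gap exceeds 5."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return [ordered]
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[widest] <= 5:
        return [ordered]
    return [ordered[: widest + 1], ordered[widest + 1 :]]

tree = decompose([1, 2, 3, 50, 51, 100], toy_partition)
reference_clusters = leaves(tree)
```

The leaves of the resulting tree play the role of the finest reference clusters, and every internal node remains available as a coarser level of resolution.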