Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation

Understanding the distribution of human genetic variation is an important foundation for research into the genetics of common diseases. Some of the alleles that modify common disease risk are themselves likely to be common and, thus, amenable to identification using gene-association methods. A problem with this approach is that the large sample sizes required for sufficient statistical power to detect alleles with moderate effect make gene-association studies susceptible to false-positive findings as the result of population stratification [1,2]. Such type I errors can be eliminated by using either family-based association tests or methods that sufficiently adjust for population stratification [3-5]. These methods require the availability of genetic markers that can detect and, thus, control for sources of genetic stratification among populations. In an effort to investigate population stratification and identify appropriate marker panels, we have analysed 11,555 single nucleotide polymorphisms in 203 individuals from 12 diverse human populations. Individuals in each population cluster to the exclusion of individuals from other populations using two clustering methods. Higher-order branching and clustering of the populations are consistent with the geographic origins of populations and with previously published genetic analyses. These data provide a valuable resource for the definition of marker panels to detect and control for population stratification in population-based gene identification studies. Using three US resident populations (European-American, African-American and Puerto Rican), we demonstrate how such studies can proceed, quantifying proportional ancestry levels and detecting significant admixture structure in each of these populations.


Introduction
Substantial progress has been made using genetic markers to elucidate the evolutionary histories of populations, yet this work has primarily been accomplished using large numbers of individuals and small numbers of genetic markers. 6 patterns of population stratification. 8,9 Such studies can facilitate the exploration of the genetic structure that may exist among and within populations and also provide a valuable source of ancestry informative markers (AIMs) to quantify and adjust for this structure in gene-identification studies.

Results
Here, we analysed 11,555 single nucleotide polymorphism (SNP) markers in 12 population samples using a new microarray-genotyping platform called Whole Genome Sampling Amplification (WGSA; Affymetrix, Santa Clara, CA). Populations were selected to represent a broad spectrum of world variation (Table 1). Four populations stand out as having similarly elevated heterozygosity (Burunge, Spanish, Indian and Altaian). Four groups (Nahua, Quechua, Nasioi and Mbuti) have lower levels of variability, while the two East Asian populations and the Mende are intermediate. These results are largely consistent with expectations for populations known to have experienced restrictions in population size (eg Mbuti, Nasioi, Nahua and Quechua), reducing levels of genetic variability relative to other populations; 9,10 however, ascertainment bias in terms of the population(s) in which markers were first discovered precludes making strong statements about differences in variability using SNP data. 11 In addition, ascertainment bias, such as that resulting from both a limited representation of populations and small numbers of individuals in discovery panels, can lead to deviations in linkage disequilibrium estimates, 12 artefactually elevated FST levels (see Ronald and Akey in this issue of Human Genomics) and higher derived allele frequencies in non-African than in African populations. 13 Despite the general importance of considering ascertainment bias on a number of population genetic parameter estimates, there is no evidence or theory that predicts problems from ascertainment bias on estimates of measures of individual relatedness or deviations from Hardy-Weinberg equilibrium (HWE). Randomly mating populations are expected to show genotype frequencies that are consistent with HWE expectations. The results of tests for HWE are presented as the proportion of loci that have deviations from equilibrium expectations (Table 1).
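A per-SNP test of HWE can be illustrated with a short sketch. The text does not state which test statistic was used, so the one-degree-of-freedom chi-square goodness-of-fit test below is an assumption, and the function name and the genotype counts are hypothetical.

```python
import math
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square goodness-of-fit test for HWE at one biallelic SNP.

    Assumption: a 1-d.f. chi-square test; the article does not name its test.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 0.0, 1.0  # monomorphic locus: no deviation is testable
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: two samples, each in HWE, with allele frequencies 0.7 and 0.3.
pop1 = (49, 42, 9)
pop2 = (9, 42, 49)
pooled = tuple(a + b for a, b in zip(pop1, pop2))
chi2_pooled, p_pooled = hwe_chi_square(*pooled)
```

In this toy case each sample fits HWE on its own, while the pooled sample shows a heterozygote deficit; this is the kind of deviation that pooling differentiated populations is expected to produce.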
Some populations show slightly higher or lower proportions of significant results. The most notable deviation is seen when all the populations are combined, at which point over half (56 per cent) of the SNPs show significant HWE deviations. These deviations highlight the importance of taking population structure into account in gene-association studies. Extensive admixture structure is created by combining these samples, leading to HWE deviations and, presumably, allelic associations among unlinked markers at loci showing large frequency differences across populations. 3 The proportion of the total genetic variation due to differences among populations was estimated using FST. Figure 1 shows a histogram of the FST distribution, with the autosomal SNPs plotted separately from the X-linked SNPs. The average level of FST for autosomal SNPs (0.148) is within the range of previously published FST estimates (5-15 per cent), confirming the well-known fact that most variability in human populations is observed within populations. 4-6,14,15 The average FST observed for the X-chromosomal SNPs (0.224) is substantially higher than that for the autosomal SNPs (p < 0.0001), which is consistent with both the smaller effective population size and the higher levels of natural selection for X-chromosome genes. 16-18 It is also notable that these distributions are not described well by averages, since they are highly skewed and have long tails, highlighting the fact that unlinked loci can have different evolutionary histories. 19 For calculating heterozygosity and FST, population divisions were assumed to be known and individuals were grouped using ethnic and geographical information. Given the large number of markers in our dataset, population genetic analyses can be performed at the level of the individual, making no presumption of group membership. 18,20
Two methods were used to investigate clustering among individuals: neighbour-joining trees 21 and principal coordinates (PCs) analysis, using the allele-sharing distance (ASD) 22 for all pairwise combinations of individuals. Figure 2 shows a neighbour-joining tree of individuals, constructed with the ASD measure matrix, using 11,078 autosomal SNPs. The root of the tree, based on the combined ape outgroup, is located between the Mende and Mbuti. This supports an African origin for modern humans. The next group to diverge from the main trunk is the East African Burunge. Most populations have population-specific branches of substantial length, the largest being the Melanesians and the indigenous Americans. In addition to the Burunge, the South Asian Indians and the Altaians have relatively short population-specific branches, consistent with gene flow between these groups and other populations. The largest internal branch separates the three African from the non-African populations, and the next group to diverge is the Spanish, followed by the South Asian Indians. No clear separation of the upper and lower caste populations is seen here (but see Figure 3).
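The pairwise distance matrix underlying both the tree and the PC analysis can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are coded as 0, 1 or 2 copies of one allele (None for a missing call), and the per-locus distance is taken as the fraction of alleles not shared, which is one common formulation of the ASD; the exact definition used in the study is the one given in reference 22. The function names are hypothetical.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 copies of one allele; None = missing call


def allele_sharing_distance(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Mean per-locus fraction of alleles NOT shared by two individuals.

    Assumed coding: for a biallelic SNP the number of shared alleles is
    2 - |ga - gb|, so the per-locus distance is |ga - gb| / 2.
    Loci with a missing call in either individual are skipped.
    """
    total = 0.0
    used = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue
        total += abs(ga - gb) / 2.0
        used += 1
    if used == 0:
        raise ValueError("no loci genotyped in both individuals")
    return total / used


def asd_matrix(genotypes: Sequence[Sequence[Genotype]]) -> List[List[float]]:
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = allele_sharing_distance(genotypes[i], genotypes[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist
```

A matrix of this form is the input expected by neighbour-joining and principal coordinates routines; no group labels are needed at any point.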
Although trees provide a useful means of illustrating relationships among populations or individuals, they are limited by the assumption of bifurcating topologies. PC analysis is an alternative analytical method, which lacks this assumption. Figure 3a shows the first three PC axes for all populations. As with the tree, individuals from one population cluster tightly, to the exclusion of individuals from other populations. The first PC axis shows a separation of the African and non-African populations, with the Burunge being closer to the non-Africans than either of the other two African populations. The second PC axis shows the indigenous Americans and Melanesians to be on opposite sides of the axis. On the tree, the two indigenous American populations are separated into monophyletic clusters, while the PC analysis shows overlapping clusters. When focusing on the Eurasian populations (Figure 3b), there is a clinal relationship across all three PC axes for these populations, which is consistent with their geographic positions from Spanish in the lower left to Japanese in the upper right. Notable is the near separation of the Indian sample into lower and upper caste, with the upper caste individuals positioned closer to the Spanish. 23,24 Additionally, the Altaians are intermediate between the East Asians and the Europeans, a finding that is consistent with Y-chromosomal studies showing Central Asian origins for components of the European gene pool. 25 Another way of exploring the PC analysis results is to examine the pairwise plots of the PC components. As the first four components were significant using the broken stick test, not all can be plotted in three-dimensional space. The six possible pairwise plots are presented in Figures 4a-f.
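The broken stick criterion for retaining PC axes can be sketched as below. Under the broken stick model, the expected share of variance for the k-th of p axes is (1/p) times the sum of 1/i for i from k to p, and axes are retained, in order, while their observed share exceeds that expectation. The function names and the eigenvalues are hypothetical, and the software used in the study may differ in details such as the handling of negative eigenvalues, which are simply dropped here.

```python
from typing import List, Sequence


def broken_stick_expectations(p: int) -> List[float]:
    """Expected proportion of variance for axes 1..p under the broken stick model."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]


def count_significant_axes(eigenvalues: Sequence[float]) -> int:
    """Number of leading axes whose variance share exceeds the broken stick value.

    Assumption: non-positive eigenvalues are discarded before the comparison.
    """
    positive = sorted((e for e in eigenvalues if e > 0), reverse=True)
    if not positive:
        return 0
    total = sum(positive)
    expected = broken_stick_expectations(len(positive))
    count = 0
    for eigenvalue, threshold in zip(positive, expected):
        if eigenvalue / total > threshold:
            count += 1
        else:
            break  # stop at the first axis that fails the criterion
    return count


# Hypothetical eigenvalues, not taken from the study.
toy_eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]
n_axes = count_significant_axes(toy_eigenvalues)
```

With these toy values the first two axes exceed their broken stick expectations and the third does not, so two axes would be retained.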
In addition to these 12 geographically well-defined population samples, we have analysed three cosmopolitan samples collected in the USA (African-Americans, European-Americans and Puerto Ricans). These populations are known to have been subject to both within-continent and among-continent admixture in the recent past. We estimated the individual biogeographical ancestry levels for each person in these three samples. These maximum likelihood estimates of proportional ancestry (Figure 5) show a greater tendency for the European-American subjects to cluster together, by comparison with the other two population samples. The African-Americans and Puerto Ricans both show relatively high levels of variability in individual ancestry levels, with most of the non-African ancestry in the African-Americans being from Europe. The Puerto Ricans show some individuals with more indigenous American ancestry, as well as substantial West African ancestry.
Another test for the presence of admixture structure is based on correlations in individual ancestry indices calculated from independent (unlinked) panels of markers. 26 We tested for significant correlations using two types of individual indices, PCs (Table 2) and biogeographical ancestry (Table 3), calculated separately from the even and odd chromosomal SNPs. To do this, we divided the SNPs into two groups: all of the SNPs on even chromosomes in one group and all of the SNPs on odd chromosomes in the other group. Unless there is structure which is related to the axes of ancestry measured by these indices within a population, no significant relationship between the two estimates is expected. 25 The correlation results on the PC components show that only three (upper caste Indian, Altaian and Nasioi) of the 12 world populations show evidence of population structure. The combined Indian sample (upper caste and lower caste together) also shows significant correlations, while the combined East Asian (Japanese and Chinese) population does not. By contrast, all three cosmopolitan samples tested (African-American, European-American and Puerto Rican) show significant correlations between the even and odd chromosome PC analyses. Significant correlations are also seen for the estimates of biogeographical ancestry in these three populations (Table 3). It is notable that there are high correlations not only in the African-American and Puerto Rican samples, but also in the European-American sample, indicating the presence of admixture structure in a population generally assumed to be homogeneous. 27
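The even/odd chromosome comparison can be sketched in two steps: partition the autosomal SNPs by chromosome parity, then correlate the per-individual index obtained from one partition with that obtained from the other. The sketch below uses Spearman's rank correlation with average ranks for ties; the function names, the SNP identifiers and the ancestry values are hypothetical, and the significance testing reported in Tables 2 and 3 is not reproduced.

```python
import math
from typing import List, Sequence, Tuple


def split_by_chromosome_parity(
    snps: Sequence[Tuple[int, str]]
) -> Tuple[List[str], List[str]]:
    """Split (chromosome number, SNP id) pairs into odd and even marker sets."""
    odd = [snp_id for chrom, snp_id in snps if chrom % 2 == 1]
    even = [snp_id for chrom, snp_id in snps if chrom % 2 == 0]
    return odd, even


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values given the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-individual ancestry estimates from the two marker sets.
odd_estimates = [0.10, 0.40, 0.90, 0.50]
even_estimates = [0.20, 0.30, 0.80, 0.60]
rho = spearman(odd_estimates, even_estimates)
```

In a sample without ancestry-related structure the two sets of estimates should be uncorrelated, so a rho close to zero is the null expectation against which the reported correlations are judged.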

Discussion
The large number of markers used in these analyses has provided an opportunity to assess genetic variation at the level of the individual in a number of populations from around the world. These multilocus genotype data on a large panel of SNPs provide a new level of resolution in the distribution of variation within and among populations. Individuals cluster into groups comprising other individuals from their own or closely-related populations when diverse groups from around the world are analysed. 28,29 By contrast, when samples from the more cosmopolitan US resident populations are analysed, clustering patterns are less discrete. Substantial levels of variation in ancestry are observed within the African-American and Puerto Rican samples, while smaller, but significant, admixture structure is evident in the European-American sample. Whether the admixture structure in the European-American sample is the result of intra- or intercontinental gene flow is an important outstanding question. Thus, although discrete clustering of individuals may be useful in describing some of the variation in diverse, well-defined population samples, continuous measures, such as biogeographical ancestry or PC indices, are required to describe the same axes of population structure in populations that have experienced recent admixture.

SNP genotyping
WGSA technology was used to genotype individuals in this study using the GeneMapping 10K Array Xba 131 (Affymetrix Inc., Santa Clara, CA). Details of this method have been published elsewhere; 29,30 in brief, fractions of the genome are obtained by restriction enzyme (Xba I) digestion of genomic DNA, ligated with adaptors and subsequently amplified with a universal primer that is directed to the linker. The amplified target (a smear of polymerase chain reaction products of 400 to 800 base pairs [bps] in length) is fragmented, labelled with terminal transferase and biotin-ddATP and hybridised overnight to synthetic microarrays. 31,32 Genotypes are called by interpreting signals from allele-specific probes using a model-based algorithm. The accuracy of this method is in excess of 99.5 per cent. SNPs were chosen from The SNP Consortium (TSC) database on the basis of their predicted location on 400-800 bp fragments generated by in silico digestion of human genome sequences with various restriction enzymes. Predicted SNPs were then assayed against a panel of 108 individuals from diverse populations. If two individuals were observed with each of the three genotypes, and the clustering patterns were acceptable, the SNP was considered to be confirmed and retained as part of the panel.

Samples
The population samples used in this study were collected under Internal Review Board approvals from the various institutions involved. The Mbuti population samples were collected in the Ituri Forest, the Mende samples from Sierra Leone. The Cushitic-speaking Burunge samples were collected in Tanzania but are thought to be of Ethiopian descent. The Spanish samples were collected in Valencia in Eastern Spain. The Nasioi were collected in Bougainville, Melanesia. The Altaian samples were collected in Siberia, Russia. The upper and lower caste groups were both sampled from Vishakapatnam, Andhra Pradesh, India. The Chinese (NA17011-NA17020) and Japanese (NA17051-NA17060) samples are from US residents, curated at the Coriell Institute. Quechua were sampled in Lima (n = 9) or Cerro de Pasco, Peru, at 4,338 meters (n = 11). In the former case, the subjects were highland natives, as both parents and grandparents were born on the Altiplano. Quechua subjects were selected to represent a subgroup of subjects with the lowest possible European admixture from a larger total sample of n = 71. Similarly, the Nahua, who were sampled in the city of Tlapa, Guerrero, Mexico, were also selected as a subset of individuals showing low European ancestry, as measured with an independent set of markers. African-Americans (subset of 42 from NA17100-17199) and European-Americans (subset of 42 from NA17200-NA17285) are represented by samples curated at the Coriell Institute. For these analyses, one individual initially classified as 'Caucasian' (Coriell Institute Cat#NA17205) was excluded from the European-American sample, as he/she clusters with South Asians, which, in combination with the lack of monophyletic clustering of South Asians and Spanish in this study, highlights the inappropriateness of the category 'Caucasian' in biomedical research. The Puerto Ricans are women born in Puerto Rico and living in New York City at the time of data collection.

Statistical analyses
FST was calculated using Weir and Cockerham's unbiased estimator. 33 Pairwise individual genetic distances were estimated using the ASD. 22 The tree of individuals, based on the ASD distance, was constructed using the neighbour-joining method, 21 using the Molecular Evolutionary Genetics Analysis software package (MEGA version 2.1). 34 The PC analysis was carried out using NTSYS software (Rohlf, F.J. [1992], NTSYS-pc version 1.70). The statistical significance of PC axes was determined using the broken stick model, resulting in four significant axes. These axes together explain 23.6 per cent of the total variation (12.6 per cent by the first axis, 5.5 per cent by the second, 3.5 per cent by the third and 1.9 per cent by the fourth). All pairwise PC axis plots for these four axes are presented in the online supplementary information. The STRUCTURE 2.0 8 computer program was used to infer the presence of genetic structure in the sample. The analysis was performed both with and without the admixture model for K = 2 to K = 6, the model previously having been determined to show the highest posterior probabilities for these data. A total of 25,000 simulation iterations were run for the burn-in period; 75,000 additional iterations were run to get parameter estimates. Biogeographical ancestry estimates were calculated for the 42 African-American subjects, the 41 European-American subjects and the 20 Puerto Rican women, using the maximum likelihood algorithm previously described, 26 whereby the allele frequencies for the three parental populations were taken to be indigenous American (Nahua and Quechua averaged together), West African (Mende) and European (Spanish). For testing the correlation between subsets of markers, autosomal SNPs were divided by chromosome into those on odd and even chromosomes, respectively. Spearman's correlation coefficient was then calculated for the four significant PCs and the three ancestral components between the odd and even estimates of these statistics. As has been demonstrated, estimates, even for highly admixed populations, will be uncorrelated unless there is substantial non-random mating in the population that is related to ancestry. 35

Table 3. Correlation coefficients for comparisons between biogeographical ancestry estimates from the even and odd chromosome marker sets.
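The idea behind a maximum likelihood estimate of individual ancestry from three parental populations can be sketched as follows. The algorithm actually used is the one described in reference 26; the sketch below swaps in a simple grid search over the ancestry proportions and assumes independent loci, binomial sampling of alleles and fixed parental allele frequencies. All names and values are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Proportions = Tuple[float, float, float]


def log_likelihood(
    genotypes: Sequence[Optional[int]],
    parental_freqs: Sequence[Proportions],
    m: Proportions,
) -> float:
    """Log-likelihood of one individual's genotypes given ancestry proportions m.

    genotypes[l] is the count (0, 1, 2) of the reference allele at locus l,
    or None if missing; parental_freqs[l] holds that allele's frequency in
    the three parental populations, in the same order as m.
    """
    eps = 1e-9  # keeps log() finite when an expected frequency is 0 or 1
    total = 0.0
    for g, freqs in zip(genotypes, parental_freqs):
        if g is None:
            continue
        p = sum(weight * f for weight, f in zip(m, freqs))
        p = min(max(p, eps), 1.0 - eps)
        # The binomial coefficient does not depend on m and is omitted.
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def estimate_ancestry(
    genotypes: Sequence[Optional[int]],
    parental_freqs: Sequence[Proportions],
    steps: int = 20,
) -> Proportions:
    """Grid search over proportions (m1, m2, m3) that sum to 1, in 1/steps units."""
    best_m: Proportions = (1.0, 0.0, 0.0)
    best_ll = -math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j
            m = (i / steps, j / steps, k / steps)
            ll = log_likelihood(genotypes, parental_freqs, m)
            if ll > best_ll:
                best_ll, best_m = ll, m
    return best_m


# Hypothetical frequencies for three parental populations at 30 loci.
toy_freqs = [(0.9, 0.1, 0.1), (0.1, 0.9, 0.1), (0.1, 0.1, 0.9)] * 10
toy_genotypes = [2, 0, 0] * 10  # pattern typical of the first population
estimate = estimate_ancestry(toy_genotypes, toy_freqs)
```

A grid search is used here only because it is easy to verify; with real marker panels a numerical optimiser over the same likelihood would be the more practical choice.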