Measuring and using admixture to study the genetics of complex diseases
© Henry Stewart Publications 2003
Received: 25 August 2003
Accepted: 25 August 2003
Published: 1 November 2003
Admixture is an important evolutionary force that can and should be used in efforts to apply genomic data and technology to the study of complex disease genetics. Admixture linkage disequilibrium (ALD) is created by the process of admixture and, in recently admixed populations, extends for substantial distances (of the order of 10 to 20 cM). The amount of ALD generated depends on the level of admixture, ancestry information content of markers and the admixture dynamics of the population, and thus influences admixture mapping (AM). The authors discuss different models of admixture and how these can have an impact on the success of AM studies. Selection of markers is important, since markers informative for parental population ancestry are required and these are uncommon. Rarely does the process of admixture result in a population that is uniform for individual admixture levels, but instead there is substantial population stratification. This stratification can be understood as variation in individual admixtures and can be both a source of statistical power for ancestry-phenotype correlation studies as well as a confounder in causing false-positives in gene association studies. Methods to detect and control for stratification in case/control and AM studies are reviewed, along with recent studies showing individual ancestry-phenotype correlations. Using skin pigmentation as a model phenotype, implications of AM in complex disease gene mapping studies are discussed. Finally, the article discusses some limitations of this approach that should be considered when designing an effective AM study.
Genetic analysis of phenotypes and diseases has traditionally followed two approaches: family-based linkage analysis and population-based association studies. While in linkage analysis it is the co-segregation of alleles in families that is measured, population-based studies use non-random associations between phenotypes and alleles in populations to identify causative genes. Linkage analysis has proven to be immensely successful as a means of identifying genes for a number of single gene diseases with simple Mendelian inheritance (eg see OMIM database). Complex diseases are multifactorial, polygenic and often characterised by late age of onset, incomplete penetrance, locus heterogeneity and environmental exposures and, despite significant efforts, have not been amenable to family-based mapping.
Linkage disequilibrium (LD) is an important aspect of genetic association studies and is generated in a population through mutation, selection, drift, non-random mating and admixture. Allelic associations due to LD are significant and are correlated with physical distance within small genomic regions but decay over time due to recombination [2–4]. LD-based association studies have been successful in both fine scale mapping [5, 6] and initial disease gene mapping in homogeneous populations that have undergone recent bottlenecks (eg Hirschsprung disease in Mennonites, Bardet-Biedl syndrome in Bedouins). Allelic associations can result either from direct functional effects of the alleles tested or indirectly through non-random associations between the allele measured and nearby functional alleles. Since functional alleles in most genes are still unknown and are indeed an object of the research, LD is an important feature of how genes can be screened for alleles that alter disease risk. Thus, there has been substantial focus on the extent of LD across the genome and the definition of statistical methods for disease gene mapping using LD [9–11]. In large cosmopolitan populations, however, LD may be difficult to detect when the mutation is old, since the amount of remaining LD may be small. Additionally, false-positive associations due to population stratification are important confounders in LD-based association studies.
Intermixture between previously isolated populations leads to the creation of admixed populations. The process of admixture itself creates LD between all loci, linked and unlinked, that have different allele frequencies in the parental populations. The magnitude of admixture linkage disequilibrium (ALD) in an admixed population depends on the allele frequency differential between the parental populations, the level of admixture, the admixture dynamics, the time since admixture and the recombination rate between the loci. While ALD between unlinked markers decays rapidly (within two to four generations), ALD between linked markers decays more slowly. Because ALD decreases exponentially with genetic distance, the high ALD between closely linked markers can be distinguished from ALD generated at unlinked loci. Thus, if the parental populations differ in a trait or disease due to different frequencies of risk alleles, it should be possible to identify the loci containing these alleles using admixture mapping (AM) [12–14].
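As a rough numerical illustration of this decay, the following Python sketch assumes the simplest case: a single admixture event followed by random mating, with no LD within either parental population. Under those assumptions the initial ALD between two loci is D0 = m(1 − m)δAδB, where m is the admixture proportion and δA and δB are the allele frequency differences at the two loci, and D is reduced by a factor (1 − θ) in each generation. The function names and example values are illustrative only and are not taken from the studies cited.

```python
def initial_ald(m, delta_a, delta_b):
    """ALD created by one admixture event (assumed model, see lead-in).

    m        -- admixture proportion from one parental population
    delta_a  -- allele frequency difference between parentals at locus A
    delta_b  -- allele frequency difference between parentals at locus B
    """
    return m * (1.0 - m) * delta_a * delta_b


def ald_after(d0, theta, generations):
    """ALD remaining after a number of generations of random mating.

    theta is the recombination fraction between the two loci
    (0.5 for unlinked loci).
    """
    return d0 * (1.0 - theta) ** generations


if __name__ == "__main__":
    d0 = initial_ald(m=0.2, delta_a=0.5, delta_b=0.5)
    for t in (1, 2, 4, 10):
        unlinked = ald_after(d0, theta=0.5, generations=t)
        linked = ald_after(d0, theta=0.05, generations=t)  # roughly 5 cM apart
        print(t, round(unlinked, 5), round(linked, 5))
```

Running the loop shows ALD at unlinked loci halving in every generation while ALD between loci about 5 cM apart persists, which is the contrast that AM exploits.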
Many US residents can trace their genetic ancestry to more than one continent. The European colonial period that started in the late 1400s brought together in the New World populations that had been geographically isolated, namely, Europeans, West Africans and Native Americans. Given the recent and common origin of all human populations, this admixture had only a small average effect on the gene pools of these new populations. In other words, for most genomic regions, the pre-colonial (or parental) populations had similar allele frequencies and, at these, admixture was of little consequence. At some other loci, however, there had been some change in allele frequency in the time since the separation of parental populations and it is at these loci where admixture has had an important effect. Since populations like African Americans, African Caribbeans and Mexican Americans were formed in the recent past, allelic associations in these populations that were created by admixture extend over large distances. Admixed populations represent a useful resource for mapping complex-disease genes by using this long-range ALD, which requires fewer markers to screen the genome than other populations or approaches. Understanding the genetic consequences of admixture is important because it can be both a confounding factor and a source of statistical power in gene identification studies.
Table 1. Diseases with possible genetic components based on ethnic differences in disease rates and hence amenable to admixture mapping
Relative risk ratio
South Asians (central adiposity)
Pacific Islanders, Aboriginal Australians
Non-insulin dependent diabetes (NIDDM)
South Asians, West Africans, Peninsular Arabs, Pacific Islanders and Native Americans
African Americans, West Africans
Coronary heart disease
West African men
End-stage renal disease
Native Americans and African populations
African Americans, Hispanic Americans
European Americans, Chinese, Japanese
Africans and African Americans
Chinese, Japanese, African Americans, Turkmens, Uzbeks, Native Siberians, New Zealand Maoris
Admixture-based methods rely on using suitable markers and estimates of allele frequencies from appropriately identified parental populations. Since ALD is of recent origin and extends over large distances, fewer markers are required for AM studies. Markers informative for ancestry have been used in several contexts and have been referred to as 'ideal', 'private' and 'unique'. Informativeness of such markers can be measured as the allele frequency differential (δ), which is the absolute value of the difference in the frequency of a particular allele between populations [12, 24]. Microsatellites and insertion/deletion polymorphisms with δ > 0.3 were recently called 'ethnic-difference markers' (EDMs) suitable for mapping by admixture linkage disequilibrium (MALD). Additionally, markers with high δ and very high log likelihood allelic ratio (LLAR) between populations have been designated 'population specific alleles' (PSAs). This report followed from earlier work where markers with large allele frequency difference were identified to be appropriate for admixture studies [27, 28], and most (> 95 per cent) of the arbitrarily identified biallelic markers had δ < 50 per cent. Thus, the authors proposed that ideal PSAs should have δ > 50 per cent and also indicated that for multiallelic loci, a composite δ could be estimated as one half the summation of the absolute value of allelic frequency differences for all alleles at that locus. It has also been shown that markers with lower δ values, of approximately 30 per cent, can provide up to 80 per cent power for detecting associations at distances of 5 cM with a large enough sample size (N = 1,000).
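The two definitions of δ given above can be written out as a short Python sketch. The function names and the dictionary representation of allele frequencies are choices made here for illustration; alleles absent from one population are assumed to have frequency zero there.

```python
def delta(p1, p2):
    """Allele frequency differential for one allele in two populations."""
    return abs(p1 - p2)


def composite_delta(freqs1, freqs2):
    """Composite delta for a multiallelic locus.

    freqs1, freqs2 -- dicts mapping allele label to frequency in each
    parental population. The result is one half of the sum of absolute
    frequency differences over all alleles seen in either population.
    """
    alleles = set(freqs1) | set(freqs2)
    return 0.5 * sum(
        abs(freqs1.get(a, 0.0) - freqs2.get(a, 0.0)) for a in alleles
    )


if __name__ == "__main__":
    # Hypothetical microsatellite with three alleles.
    pop1 = {"120": 0.6, "124": 0.3, "128": 0.1}
    pop2 = {"120": 0.1, "124": 0.3, "128": 0.6}
    print(composite_delta(pop1, pop2))
```

For a biallelic marker the composite value reduces to the simple δ, since the two alleles contribute equal absolute differences.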
Pfaff et al. suggested referring to markers suitable for admixture studies as 'ancestry informative markers' (AIMs), given that the central feature of these markers is the ancestry information content (f). The present authors agree that the term AIM more accurately describes these markers and does so using language that is less likely to be misunderstood and misinterpreted [14, 17, 28]. Marker information content 'f' denotes the locus-specific Fst and is a value representative of the differentiation between two populations at a single locus. This is equivalent to Wahlund's standardised variance for allele frequency. Simulation studies for estimating the information content of markers with varying levels of f have shown that for 1,000 markers with average information content for ancestry at 40 per cent between two ancestral subpopulations, approximately 80 per cent of the information about ancestry can be extracted from an initial genome screen [13, 29]. After initial identification of regions showing admixture, more markers can be typed in these regions to increase extraction of information to nearly 100 per cent.
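For a biallelic marker, the standardised variance described above can be computed directly from the two parental allele frequencies. The sketch below assumes that the two populations are weighted equally, in which case f = δ² / (4p̄(1 − p̄)), where p̄ is the mean of the two frequencies. The function name is hypothetical, and other weightings would give somewhat different values.

```python
def locus_fst(p1, p2):
    """Standardised variance of allele frequency for a biallelic marker.

    p1, p2 -- frequency of the same allele in the two parental
    populations, which are given equal weight (an assumption).
    """
    p_bar = (p1 + p2) / 2.0
    if p_bar <= 0.0 or p_bar >= 1.0:
        # The marker is monomorphic in both populations.
        return 0.0
    variance = ((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2) / 2.0
    return variance / (p_bar * (1.0 - p_bar))


if __name__ == "__main__":
    for p1, p2 in [(1.0, 0.0), (0.8, 0.3), (0.5, 0.5)]:
        print(p1, p2, round(locus_fst(p1, p2), 4))
```

A marker fixed for different alleles in the two populations gives f = 1, and a marker with identical frequencies gives f = 0.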
It is well established, however, that only 5-15 per cent of the total genetic variation results from differences among human populations [30–32]. Moreover, most alleles are shared between populations, and alleles common in one population are also common in other populations. Thus, most genetic markers are unaffected by admixture and it is imperative to choose markers that show high levels of δ (and f) between the parental populations. Recent studies by several groups have focused on identifying panels of markers suitable for admixture studies. One notable study screened 744 microsatellite markers for composite δ values and LLAR in four different populations and identified a genome spanning set of 315 markers (average spacing 10 cM, δ ≥ 0.3) for mapping in African Americans and 214 markers (average spacing of 16 cM, δ ≥ 0.25) for mapping in Hispanics. A DNA pooling method was used to identify 151 AIMs (microsatellites and short insertion/deletion polymorphisms), with δ > 0.3 for mapping in Mexican American populations to distinguish between European-American and Native-American contributions. Ninety-seven AIMs were identified for mapping in African-American populations that show limited variation within Africa. The authors' group has reported AIMs over the past few years [14, 17, 26, 35, 36]. Additional resources are available for obtaining marker frequency, and genotype and haplotype information, from The SNP Consortium (TSC; http://snp.cshl.org), the National Center for Biotechnology Information's 'dbSNP' website (http://www.ncbi.nlm.nih.gov/SNP), the Marshfield Database (http://research.marshfieldclinic.org/genetics/Default.htm) and the ongoing HapMap project.
Since the amount of ALD created is proportional to the level of admixture in a population, it is important to review briefly studies on admixture levels across populations. Those populations that are likely to be useful for admixture studies include African Americans, Mexican Americans, Cubans and Puerto Ricans in the USA, African Caribbeans, various Latin American populations, various groups in Central and South America and the Caribbean islands, Anglo Indians in India and 'coloured' populations of South Africa. Various statistical approaches have been used to estimate admixture proportions in these populations and have been reviewed in detail elsewhere. These include a least squares method, a weighted least squares method [16, 38, 39] and likelihood methods [38, 40]. A recent review of admixture studies and admixture proportions of various Latin American populations is provided by Sans. African Americans are a well-studied group with substantial European and West African contributions and a smaller Native American contribution [27, 35, 42, 43]. A survey of current literature indicates that European admixture ranges from 3.5 per cent in the Gullah Sea Islanders of South Carolina to 28 per cent in New Orleans. Admixture estimates in African-American populations can be highly variable across the USA, which is likely to reflect local variation in the demographic histories and social norms.
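The idea behind the least squares estimators mentioned above can be sketched for the simplest case of two parental populations. The sketch assumes that the allele frequency in the admixed population at each marker is m·p1 + (1 − m)·p2 and chooses the m that minimises the squared deviations across markers. It is unweighted, ignores sampling error in the frequencies, and is not the exact estimator of any of the studies cited; the function name is hypothetical.

```python
def least_squares_admixture(p_admixed, p_parent1, p_parent2):
    """Unweighted least squares estimate of the contribution of parent 1.

    Each argument is a sequence of frequencies of the same allele at the
    same markers, in the admixed and the two parental populations.
    """
    numerator = 0.0
    denominator = 0.0
    for ph, p1, p2 in zip(p_admixed, p_parent1, p_parent2):
        diff = p1 - p2
        numerator += (ph - p2) * diff
        denominator += diff * diff
    if denominator == 0.0:
        raise ValueError("markers do not differ between parental populations")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical frequencies at three markers.
    parent1 = [0.9, 0.1, 0.6]
    parent2 = [0.2, 0.7, 0.5]
    admixed = [0.38, 0.55, 0.52]
    print(round(least_squares_admixture(admixed, parent1, parent2), 3))
```

Markers with a large frequency difference dominate both sums, which is one way of seeing why markers with high δ are preferred for admixture estimation.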
US Hispanics form a complex socio-political conglomerate including Puerto Ricans, Cubans, Spanish Americans and Mexican Americans. Various groups from Central and South America can also be studied using ancestry AM. The proportional contributions from parental Europeans are estimated to be the largest, followed by a substantial Native American ancestry and varying amounts of West African ancestry [16, 17, 44]. In a sample of Mexican Americans from Arizona, the admixture estimates obtained using a weighted least squares method showed 29 ± 4 per cent Native American, 68 ± 5 per cent European and 3 ± 2 per cent West African contribution. A recent study reports the following estimates for a Hispanic population from the San Luis Valley, Colorado: 62.7 ± 2.1 per cent European, 34.1 ± 1.9 per cent Native American and 3.2 ± 1.5 per cent West African. In Puerto Ricans from New York City, the estimates obtained were 53.3 ± 2.8 per cent European, 29.1 ± 2.3 per cent West African and 17.6 ± 2.4 per cent Native American. In a separate Mexican-American population sample from California, European ancestry was estimated to be 60 per cent and Native American contribution was estimated at 40 per cent. As with African-American populations, there is substantial variation across populations. From these results, it is evident that, when studying any new admixed population sample, it is important to accurately determine the proportional contributions and not to rely on previously obtained estimates from a similar population. Additionally, it is instructive to have information on the levels of stratification related to admixture that are present in the population under consideration.
Traits and diseases more prevalent in one population than in others are amenable to admixture analysis and some examples are listed in Table 1. Most of the diseases shown in this Table have a complex aetiology affected by multiple genes and environmental factors. Earlier studies [45, 46] focused on admixed populations as units of analysis in exploring relationships between ancestry and phenotypes. These authors showed that non-insulin-dependent (Type 2) diabetes mellitus prevalence is correlated with admixture proportions among a selection of populations with varying levels of Native American ancestry. Data like these provide compelling evidence for frequency differences in risk modifying alleles, but such data have not been collected for many diseases. Another related approach is to test for individual admixture-phenotype correlations within an admixed population. Correlations between ancestry and phenotypes have been detected and reported by various authors [14, 17–19, 44, 45, 47].
Theoretical and experimental studies have explored the parameters that characterise and affect admixture studies [15, 24, 28, 35, 42, 50, 51]. The acronym MALD was proposed [28, 50] to designate the mapping method proposed originally by Chakraborty and Weiss, which exploited the long range allelic associations created through ALD. Parameters critical for MALD include the genetic distance between markers and disease locus (θ); number of generations since admixture (t); proportion of admixture (m) from one parental population; the allele frequency differential (δ) between parental populations; and sample size (N) [12, 28, 52]. Simulation studies suggest that sample sizes of 200-300 patients, typed for 200-300 evenly spaced markers, each having allele frequency differentials >0.3, have a >95 per cent chance of locating the causative gene, when there has been no new admixture from the parental population in the last four generations and no other sources of population structure or sample heterogeneity [28, 50].
Other approaches proposed for using admixture include a method based on the transmission disequilibrium test (TDT) that assesses excess transmission of alleles derived from high-risk ancestors to affected offspring of parents who are heterozygous at the marker locus, containing one allele from each of two ancestral populations. A second TDT-based likelihood approach was developed that compared the transmission of haplotypes with non-transmission in affected offspring in an admixed population following a multipoint method. It obtained a likelihood statistic to determine the significance of various models under different scenarios.
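To make the logic of a transmission-based test concrete, the sketch below computes the classical TDT statistic, (b − c)² / (b + c), from counts of heterozygous parents who did or did not transmit a given allele to an affected offspring. It is shown here with the allele of high-risk ancestry as the allele of interest. This is the standard single-marker form of the test, used only as an illustration; the admixture-specific and multipoint methods described above are more elaborate, and the function name is hypothetical.

```python
def tdt_statistic(transmitted, not_transmitted):
    """Classical TDT chi-square statistic (1 degree of freedom).

    transmitted     -- number of heterozygous parents who transmitted the
                       allele of interest to an affected offspring
    not_transmitted -- number of heterozygous parents who transmitted the
                       other allele instead
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - not_transmitted) ** 2 / total


if __name__ == "__main__":
    # Hypothetical counts: 30 transmissions against 10 non-transmissions.
    print(tdt_statistic(30, 10))
```

Because only heterozygous parents contribute and each parent serves as his or her own control, the statistic is not inflated by differences in ancestry between cases and controls.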
One fundamental limitation of MALD, as initially described and in its early extensions, is the effect of stratification in causing false-positive associations [12, 24, 28]. The TDT is one means of correcting for this stratification. Another is by conditioning on parental admixture. Marker data at all loci are combined to estimate ancestry of alleles at each locus. When allelic ancestry at marker loci is known, this approach is analogous to a linkage analysis, hence the term AM is more appropriate than MALD for describing this method and to distinguish it from LD approaches [13, 14, 29]. The underlying variation in ancestry of chromosomes of mixed descent is modelled to extract all of the information about linkage that is generated by admixture. For example, where a locus is assumed to account for variation in skin pigmentation between two parental groups, eg West Africans and Europeans, individuals can be classified according to whether they have 0, 1 or 2 alleles of West African descent at this locus. By comparing these three groups for mean pigmentation level, holding all other factors constant, variation in pigmentation can be observed depending upon the number of alleles of West African ancestry in an individual. Controlling for parental admixture eliminates association of the trait with ancestry at unlinked loci. By removing the background effects of ancestry, it is possible to observe the locus-specific effects on a trait/disease [14, 17]. Allelic ancestry at a locus is inferred from the marker by using the conditional probability of each allelic state given the ancestry-specific allele frequencies. A complex hierarchical model with many nuisance parameters is used to model the distribution of admixture in the population. This is implemented using the ADMIXMAP program (at http://www.lshtm.ac.uk/eph.eu/GeneticEpidemiologyGroup/htm), which follows a Bayesian approach with Markov chain simulation, and incorporates the admixture of each individual's parents and the random variation of ancestry on chromosomes inherited from each of the parents in the model [13, 14, 29].
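The step of inferring allelic ancestry from ancestry-specific allele frequencies can be illustrated, in a much reduced form, with Bayes' rule for a single observed allele at a single marker. The sketch below takes the admixture proportions as the prior probability of each ancestry and the ancestry-specific allele frequencies as the likelihood. It is not the ADMIXMAP model, which combines information across linked markers in a hierarchical model fitted by Markov chain simulation; the function name and example values are hypothetical.

```python
def allele_ancestry_posterior(admixture, allele_freqs):
    """Posterior probability of each ancestry for one observed allele.

    admixture    -- prior probability of each ancestry (sums to 1), for
                    example the admixture proportions of the gamete
    allele_freqs -- frequency of the observed allele in each parental
                    population, in the same order
    """
    joint = [m * p for m, p in zip(admixture, allele_freqs)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("observed allele has zero probability under the model")
    return [j / total for j in joint]


if __name__ == "__main__":
    # Hypothetical marker: allele frequency 0.9 in West Africans and 0.1
    # in Europeans, in a gamete with 80 per cent West African admixture.
    posterior = allele_ancestry_posterior([0.8, 0.2], [0.9, 0.1])
    print([round(p, 3) for p in posterior])
```

The more the allele frequencies differ between parental populations, the further the posterior moves from the prior, which is another way of expressing the information content of a marker.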
Variation in individual admixture introduces population stratification, which in turn can inflate the number of significant associations that are observed [53, 55, 56] and is a potential confounder in association studies [29, 57–59]. Various statistical approaches have been developed to detect and control for stratification within a population sample [14, 15, 17, 42, 60–62]. For example, the Dt/D0 test examines the relationship between the observed LD and the predicted ALD between unlinked marker pairs for detecting structure within the sample. Using individual ancestry as a conditioning variable in analysis of variance tests, it is possible to eliminate association of the trait with unlinked alleles [14, 17]. The Bayesian approaches implemented by McKeigue et al. and Pritchard et al. [13, 61] offer an advantage over classical maximum likelihood based methods [44, 63] by allowing for missing genotype and ancestry data and modelling admixture hierarchically. Methods have been developed to control for parental admixture and to account for uncertainty in the estimation of biogeographical ancestry (BGA).
Diseases showing ancestry-phenotype correlation
Test statistic reported
Non-insulin-dependent diabetes mellitus (NIDDM)
Mexican Americans and Pima Indians
Amerindian ancestry with NIDDM
Kendall's τ = 0.848 ± 0.221 [p = 8.1 × 10⁻⁵]
Amerindian ancestry with NIDDM
0.943c (p < 0.001)
1) Body mass index (BMI)
European admixture with BMI, plasma glucose, 2-hour glucose
0.455b (95% CI: 0.301-0.688)
2) Plasma glucose
Native American ancestry with NIDDM prevalence
Skin pigmentation (reflectometry)
1) African Americans
Melanin index versus % African ancestry
1) 0.21a, (p < 0.0001)
2) 0.16a (p < 0.0001)
* Mapped phenotype to two loci: TYR and OCA as candidates which influence normal pigmentation variation
3) European Americans
3) 0.001a (p = NS)
Systemic lupus erythematosus (SLE)
Caribbeans (without Indian or Chinese ancestry)
SLE and African Ancestry
(95% CI: 1.7-485 after SES adjustmentb)
African admixture (ADM):
1) Insulin sensitivity (SI),
1) with SI
1) (p < 0.001)a
2) Fasting insulin (FA),
2) with FA
2) (p < 0.01)a
3) Acute insulin response (AIR)
3) with AIR
3) (p < 0.001)a
Spanish admixture with large VO2 at high altitudes
Bone mineral density (BMD)
Puerto Ricans from New York
0.065a (p = 0.042)
European admixture with lower BMD
Skin pigmentation (Lightness index)
Hispanics from the San Luis Valley, Colorado
0.0821a (p < 0.001)
Proportional European ancestry with increased Lightness
We thank Dr Paul McKeigue and Dr Esteban Parra for helpful discussions on the subject. We also acknowledge helpful comments from an unknown reviewer. This work was supported in part by grants from NIH/NIDDK (DK53958) and NIH/NHGRI (HG02154) to M.D.S.