Cosmopolitan linkage disequilibrium maps

Linkage maps have been invaluable for the positional cloning of many genes involved in severe human diseases. Standard genetic linkage maps have been constructed for this purpose from the Centre d'Etude du Polymorphisme Humain and other panels, and have been widely used. Now that attention has shifted towards identifying genes predisposing to common disorders using linkage disequilibrium (LD) and maps of single nucleotide polymorphisms (SNPs), it is of interest to consider a standard LD map which is somewhat analogous to the corresponding map for linkage. We have constructed and evaluated a cosmopolitan LD map by combining samples from a small number of populations using published data from a 10-megabase region on chromosome 20. In support of a pilot study, which examined a number of small genomic regions with a lower density of markers, we have found that a cosmopolitan map, which serves all populations when appropriately scaled, recovers 91 to 95 per cent of the information within population-specific maps. Recombination hot spots appear to have a dominant role in shaping patterns of LD. The success of the cosmopolitan map might be attributed to the co-localisation of hot spots in all populations. Although there must be finer scale differences between populations due to other processes (mutation, drift, selection), the results suggest that a whole-genome standard LD map would indeed be a useful resource for disease gene mapping.


Introduction
Linkage disequilibrium (LD) is the non-random association of alleles at linked loci reflecting the presence of ancestral haplotype segments. The extent of LD varies across the genome, and its pattern is dominated by recombination and the number of generations over which this has taken place. Chromosome segments with high LD and low haplotype diversity are punctuated by narrow regions with variable rates of recombination, some of which are best described as 'hot spots'. 1,2 Patterns of LD are also influenced by other processes, such as genetic drift, selection and mutation, but recombination and time have the greatest impact on long-range LD structure. With the advent of panels of single nucleotide polymorphisms (SNPs) typed at high density in a number of chromosome regions, 3,4 it has become possible to determine the LD structure for whole chromosomes.
Maniatis et al. 5 describe an approach for determining patterns of LD which computes distances between adjacent SNPs in the map, measured in LD units (LDUs). Locations on the LDU map are somewhat analogous to the centimorgan (cM) locations of linkage maps. Unlike linkage maps, however, LDU map patterns also reflect duration since the last major population bottleneck, or, more correctly, the partly cumulative effects of multiple historical bottlenecks. Population differences in duration account for some substantial variations in LD patterns between populations.
To the extent that LD patterns reflect recombination, it is observed that recombination hot spots correspond to regions where LD breaks down rapidly within a unit physical distance (a high LDU/kilobase [kb] ratio). By contrast, regions with a low LDU/kb ratio have been shown to correspond to recombination cold areas and regions of relatively low haplotype diversity. 6,7 When the LDU locations for markers are plotted against the corresponding kb locations, a series of plateaux and steps is revealed, the latter typically spanning very few kb in high-density SNP maps. This supports the finding of Jeffreys et al., that much recombination in the genome is 'intensely punctate'. 1 In a study by McVean et al., LD data were used to estimate local recombination rates. 8 Their approach, based on an approximation to the coalescent, identifies recombination hot spots and assigns a uniform recombination rate across intervening regions. An LD map differs, in that it accommodates variation in the extent of LD among different genomic regions and between populations, reflecting differences in population history and duration.
The discovery that a proportion of the genome comprises blocks of low haplotype diversity prompted the International HapMap project 9 to examine a range of populations in the hope of determining panels of haplotype tag SNPs that will be useful for disease gene mapping. There is a high degree of divergence in haplotype composition and diversity between different populations, 10 however, complicating the use of tag SNPs for multiple populations. By contrast, and despite the underlying diversity in haplotype structure, long-range patterns of LD tend to be conserved and differences due to duration can be modelled. For this purpose, Lonjou et al. proposed the development of a standard 'cosmopolitan' LD map. 11 The data of Gabriel et al. 12 and other samples for small chromosome regions were used to construct a cosmopolitan LD map for each region, and this was compared with population-specific maps. The authors concluded that a cosmopolitan map, when linearly scaled to reflect population duration, recovers 95 per cent of the information contained in the individual population maps. Differences in duration to a major bottleneck, or to cumulative effects of bottlenecks, are accommodated using 'scaling factors' which are population specific. The observed difference in scaling was particularly striking when contrasting African-derived populations with the non-African groups. This presumably reflects the dominant effect of the bottleneck imposed by the 'out of Africa' migrations which were undertaken by the ancestors of modern non-African populations.
LD maps are constructed in such a way that one LDU corresponds to one 'swept radius', 13 which is the extent of LD on the kb scale that is useful for disease gene localisation. Wide variations in the extent of LD in different genomic regions make uniform spacing of SNPs on the kb or cM scale inadequate for coverage of a disease candidate region. By contrast, even spacing of SNPs on the LDU scale incorporates relatively more markers in regions where LD is declining rapidly within a unit physical distance, ensuring more complete coverage.
If a standard LDU map is constructed for a whole chromosome, a direct assessment of the number of LDUs requiring coverage in a chromosome region for any population might be made. With this in mind, we constructed and examined the properties and utility of a standard cosmopolitan LDU map in a sample of SNPs typed in a large genomic region and four populations. If the utility of such a map is supported, the development of standard LD maps for every chromosome and a database of population-specific scaling factors would be a useful resource for disease gene mapping. Applications would include the determination of adequate screening densities and spacing of SNPs, and the provision of maps for multilocus disease gene mapping by association, for which LD maps have been shown to have high power. 14

Data sample
We used the dataset described by Ke et al., 4 which comprises a sample spanning 10,098 kb of chromosome 20q12-13.2 typed at high density (5,954 SNPs in total) in four populations. These data are available from: http://www.sanger.ac.uk/HGP/Chr20/ld-hmg/. The sample has an average density of one SNP per 1.7 kb. Markers were typed in four populations, comprising 97 African-Americans (AF), 96 UK Caucasians (CA), 47 Utah Centre d'Etude du Polymorphisme Humain (CEPH) founders (CE) and 42 East Asians (Japanese and Chinese; AS). We tested all SNPs in the data sample provided at the above website address and found no departures from Hardy-Weinberg equilibrium. 15 In order to construct a cosmopolitan LD map, we reduced each population sample of diplotypes to the corresponding haplotype counts (alleles were coded 1 or 2 and the haplotype designations were therefore 11, 12, 21 and 22). Pairwise SNP haplotype frequencies, converted into counts, were computed for each population sample following the algorithm described by Hill. 16 The haplotype counts were combined across populations by summation among matching locus pairs. To achieve this, we ensured that all markers shared among more than one population were coded consistently, so that, for example, an allele coded as '1' in one population was coded the same in all others.
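The combination step, summing haplotype counts among matching locus pairs, can be sketched as follows. This is a minimal illustration only: the function name, the dictionary layout and the SNP labels are assumptions made for the example, and the sketch presumes that allele coding has already been made consistent across populations.

```python
from collections import defaultdict

HAPLOTYPES = ("11", "12", "21", "22")

def combine_haplotype_counts(populations):
    """Sum pairwise haplotype counts across populations.

    `populations` is a list of dicts, one per population, each mapping a
    SNP pair (snp_a, snp_b) to a dict of haplotype -> count.
    Pairs typed in only one population are carried over unchanged.
    """
    combined = defaultdict(lambda: {h: 0.0 for h in HAPLOTYPES})
    for population in populations:
        for pair, counts in population.items():
            for haplotype in HAPLOTYPES:
                combined[pair][haplotype] += counts.get(haplotype, 0.0)
    return dict(combined)

# Hypothetical counts for two populations (not taken from the data sample).
pop_a = {("snp1", "snp2"): {"11": 10, "12": 2, "21": 3, "22": 5}}
pop_b = {
    ("snp1", "snp2"): {"11": 4, "12": 1, "21": 0, "22": 7},
    ("snp2", "snp3"): {"11": 6, "12": 6, "21": 6, "22": 6},
}
cosmopolitan = combine_haplotype_counts([pop_a, pop_b])
```

Counts are held as floats because haplotype frequencies estimated from diplotypes and converted into counts need not be whole numbers.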
From the haplotype counts, pairwise association probabilities ($\rho$) and the corresponding information ($K_\rho$) were computed following Collins et al. 17 For a given pair of SNPs, the $\rho$ metric takes values between zero (no association) and one (complete association). For the special case of association between SNP pairs, $\rho$ is equal to the absolute value of $D'$. The computation of $\rho$ in the more general case of disease and marker association, where disease haplotypes are identified, has also been described. 18 We constructed cosmopolitan LD maps from the SNP pairwise association data at different marker densities to examine whether map distances are additive. We then selected a single mean density of 6 kb for evaluating the cosmopolitan map against population-specific alternatives. The 6 kb spacing was selected because, to cover the human genome at this density, approximately 500,000 SNPs would be required, corresponding to the initial target of the International HapMap Consortium. 9 Unlike Ke et al., 4 we selectively removed SNPs to achieve uniform spacing on the kb map, thereby avoiding large gaps. The following algorithm was applied to achieve this.
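For SNP pairs, where the association measure equals the absolute value of $D'$, the calculation from the four haplotype counts can be sketched as below. This uses the standard textbook definition of $D'$ and is not the LDMAP implementation; the function name is an assumption, and the information $K_\rho$ is not computed here.

```python
def abs_d_prime(n11, n12, n21, n22):
    """Return |D'| for a SNP pair from counts of haplotypes 11, 12, 21, 22."""
    total = n11 + n12 + n21 + n22
    if total <= 0:
        raise ValueError("haplotype counts must sum to a positive number")
    p11 = n11 / total
    p_a = (n11 + n12) / total  # frequency of allele 1 at the first SNP
    p_b = (n11 + n21) / total  # frequency of allele 1 at the second SNP
    d = p11 - p_a * p_b        # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0             # a monomorphic SNP carries no association
    return abs(d) / d_max
```

With counts of 40, 10, 10 and 40, for instance, $D$ is 0.15 against a maximum of 0.25, giving a value of 0.6.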
Starting from the end of the map closest to the p telomere, the first typed SNP was chosen as the 'starting SNP', and two other SNPs were identified on either side of a position a selected number of kb away. The SNP closest to that position was chosen. The chosen SNP then became the new 'starting SNP' and the process was continued along the whole map. The length of the region (10,098 kb) was then divided by the number of SNPs selected to calculate the average density over the region. The process was repeated using a range of positions at various kb distances until the desired mean density was achieved.
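The selection procedure described above can be sketched in a few lines. The function names, the tie-breaking rule (the nearer-to-start SNP wins when two candidates are equidistant) and the stopping rule (selection ends when the target position passes the last typed SNP) are assumptions of this sketch rather than details given in the text.

```python
import bisect

def thin_snps(positions_kb, step_kb):
    """Select SNPs at roughly uniform kb spacing from sorted positions."""
    if step_kb <= 0:
        raise ValueError("step_kb must be positive")
    if not positions_kb:
        return []
    current = 0                      # index of the current 'starting SNP'
    chosen = [current]
    n = len(positions_kb)
    while True:
        target = positions_kb[current] + step_kb
        # First SNP at or beyond the target position.
        right = bisect.bisect_left(positions_kb, target, current + 1)
        if right >= n:
            break                    # target lies beyond the last typed SNP
        left = right - 1             # last SNP before the target position
        left_is_closer = (
            left > current
            and target - positions_kb[left] <= positions_kb[right] - target
        )
        current = left if left_is_closer else right
        chosen.append(current)
    return [positions_kb[i] for i in chosen]

def choose_step(positions_kb, region_length_kb, desired_kb, candidate_steps):
    """Try several step sizes; return (step, mean_spacing, selected SNPs)
    for the step whose mean spacing is closest to the desired value."""
    best = None
    for step in candidate_steps:
        selected = thin_snps(positions_kb, step)
        mean_spacing = region_length_kb / len(selected)
        if best is None or abs(mean_spacing - desired_kb) < abs(best[1] - desired_kb):
            best = (step, mean_spacing, selected)
    return best

# Hypothetical positions in kb, for illustration only.
example_positions = [0, 1, 2.5, 4, 6.2, 7, 9, 12, 12.5, 18]
example_selection = thin_snps(example_positions, 6)
```

Because the realised mean spacing generally differs from the step used, `choose_step` mirrors the repetition over a range of distances until the mean density is close to the one wanted.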

LD map construction
LD maps 5 are constructed using the 'Malecot equation', $\rho = (1 - L) M e^{-\epsilon d} + L$, which describes the decline in association $\rho$ as a function of distance $d$. This has the same form, with different parameters, as the equation derived by Malecot for isolation by distance. 18 The population genetics theory underlying the derivation and application of this model for LD is described by Morton et al. 13 The parameters of the model are as follows: $M$ is the maximum association at zero distance, reflecting association at the last major bottleneck; $L$ reflects the residual association at large distance together with the bias in $\rho$, which is related to sample size; and $\epsilon$ is the exponential decline of $\rho$ with distance in kb. The parameters $\epsilon$ and $M$ are estimated by fitting multiple pairwise values of $\rho$ and corresponding information, $K_\rho$, using composite likelihood. We used the predicted $L$ ($L_p$), 13 which depends on sample size, rather than estimating the $L$ parameter. Lonjou et al. 11 found that the local effect of block structure can inflate the estimate of $L$, leading to distortions in the LD map through the creation of 'holes' between adjacent SNPs. The LDMAP program (http://cedar.genetics.soton.ac.uk/pub/PROGRAMS/LDMAP/, with the online manual at: http://cedar.genetics.soton.ac.uk/public_html/helpld.html) 5 computes $\epsilon$ for each interval between pairs of SNPs. The parameters are estimated so that the composite likelihood is maximised and the length of the $i$th interval, in LDU, is given by $\epsilon_i d_i$. The treatment of map holes, where $\epsilon_i d_i$ is assigned an upper limit value of 3 LDUs, has been described. 6 The authors established this upper limit by maximum likelihood and found a strong correlation between the location of holes and the recombination rate. The identification of these intervals is clearly important for determining regions which require further SNP typing.
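To make the model concrete, the sketch below evaluates the Malecot equation and accumulates interval lengths into LDU locations, capping each interval at 3 LDUs as described for holes. It assumes the per-interval decline parameters have already been estimated (for example by LDMAP); the function names and the numerical values are illustrative assumptions, and no likelihood fitting is attempted.

```python
import math

def malecot_rho(d, epsilon, M, L):
    """Predicted association at distance d: (1 - L) * M * exp(-epsilon * d) + L."""
    return (1 - L) * M * math.exp(-epsilon * d) + L

def ldu_locations(kb_positions, epsilons, hole_limit=3.0):
    """Cumulative LDU location of each SNP.

    `epsilons[i]` is the decline parameter for the interval between
    SNP i and SNP i + 1; each interval contributes epsilon_i * d_i,
    capped at `hole_limit` LDUs.
    """
    if len(epsilons) != len(kb_positions) - 1:
        raise ValueError("need exactly one epsilon per interval")
    locations = [0.0]
    for i, epsilon in enumerate(epsilons):
        d = kb_positions[i + 1] - kb_positions[i]
        locations.append(locations[-1] + min(epsilon * d, hole_limit))
    return locations

# Hypothetical three-SNP map: the second interval exceeds the cap (a hole).
example_map = ldu_locations([0, 10, 20], [0.05, 1.0])
```

In this example the first interval contributes 0.5 LDU and the second would contribute 10 LDUs uncapped, so it is recorded as a hole of 3 LDUs.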
LD units have been shown to be additive in high-density SNP maps for a wide range of marker densities. 4 There is, however, likely to be some loss of additivity in regions of the map where holes are present. One of the additional advantages of combining samples to create a cosmopolitan map is that the increased sample size and addition of markers in poorly covered regions reduces the number of holes.

Tests on marker density
As the full data sample contained markers typed at a very high density, we were able to construct cosmopolitan maps using sub-samples at reduced densities by means of the method described. This allowed us to examine the stability of map lengths and the effect of the number of holes. Maps with mean SNP densities of 6 kb, 8 kb, 10 kb, 12 kb and 15 kb were created for this purpose. The average minor allele frequency remained stable across the different density samples at approximately 0.23.

Tests on cosmopolitan maps
LD maps were created for each population and a cosmopolitan map was constructed from combined data, as described. The fit of the multiple pairwise data to the cosmopolitan map was examined and the corresponding error variances were computed for the population-specific LDU maps. The relative efficiency of the cosmopolitan map can be judged by comparing error variances. The composite log likelihood has the form $\ln lk = -\sum_i K_{\rho_i} (\hat{\rho}_i - \rho_i)^2 / 2$, where $\hat{\rho}_i$ is the observed association between the $i$th pair of SNPs and $\rho_i$ is estimated from the Malecot model replacing distance $d$ in kb with the corresponding distance in LD units. 5,18 Therefore, the fit of the population-specific pairwise data to any map (kb, population-specific LDU or cosmopolitan LDU) can be evaluated. The error variances for a given population for the kb map ($V_{kb}$), the population-specific LDU map ($V_{LDU}$) and the cosmopolitan map ($V_{COS}$) follow Lonjou et al. 11 The degrees of freedom are computed as $N - (m - 1) - r$, 6 where $N$ is the number of pairs, $m$ is the number of loci (therefore, there are $m - 1$ intervals in which $\epsilon$ may be estimated) and $r$ is the number of additional parameters estimated.
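The two quantities defined in this paragraph, the composite log likelihood and the degrees of freedom, are simple to evaluate once observed and predicted associations are in hand. The sketch below is illustrative only; the function names and the example values are assumptions, and the error variance formulae of Lonjou et al. are not reproduced.

```python
def composite_ln_lk(observed, predicted, information):
    """Composite log likelihood: -sum(K_i * (observed_i - predicted_i)^2) / 2."""
    if not (len(observed) == len(predicted) == len(information)):
        raise ValueError("inputs must have one entry per SNP pair")
    weighted_squares = (
        k * (obs - pred) ** 2
        for obs, pred, k in zip(observed, predicted, information)
    )
    return -sum(weighted_squares) / 2

def degrees_of_freedom(n_pairs, n_loci, n_extra_params):
    """N - (m - 1) - r: pairs, minus map intervals, minus other parameters."""
    return n_pairs - (n_loci - 1) - n_extra_params

# Hypothetical values for two SNP pairs.
example_ln_lk = composite_ln_lk([0.8, 0.5], [0.7, 0.5], [100, 50])
```

A perfect fit gives a composite log likelihood of zero; pairs with larger information contribute more heavily to any misfit.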
We define $N_i$ and $N_c$ as the number of pairs in the $i$th population sample and cosmopolitan data sample, respectively. The number of SNP markers in the $i$th population sample and cosmopolitan sample is, respectively, $m_i$ and $m_c$.
The error variances for the kb, LDU and cosmopolitan maps are computed as: $V_{kb}$, where $\epsilon$ and $M$ are estimated; $V_{LDU}$, where $M$ is estimated and $\epsilon$ is estimated in each map interval; and $V_{COS}$, where $m_c - 1$ intervals in the cosmopolitan map have been previously computed using the proportion of data represented by the $i$th population sample, as $N_i/N_c$, and $\epsilon$ and $M$ are estimated.
The relative efficiency (RE) of the cosmopolitan maps was calculated to determine how much of the information was recovered; $\mathrm{RE} = V_{LDU}/V_{COS}$. Values of $\epsilon$ obtained when the population-specific data are fitted to the cosmopolitan map are each divided by the $\epsilon$ for the cosmopolitan map to obtain the scaling factors.
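Both ratios, and the linear rescaling of a cosmopolitan map that the scaling factor implies, can be written out directly. This is a minimal sketch; the function names are assumptions and the numbers in the example are hypothetical rather than values from the tables.

```python
def relative_efficiency(v_ldu, v_cos):
    """RE of the cosmopolitan map: V_LDU / V_COS."""
    return v_ldu / v_cos

def scaling_factor(epsilon_population, epsilon_cosmopolitan):
    """Ratio of the decline parameter fitted for one population on the
    cosmopolitan map to that of the cosmopolitan map itself."""
    return epsilon_population / epsilon_cosmopolitan

def rescale_map(cosmopolitan_ldu_locations, factor):
    """Linearly rescale cosmopolitan LDU locations for one population."""
    return [location * factor for location in cosmopolitan_ldu_locations]

# Hypothetical inputs for illustration.
example_re = relative_efficiency(0.95, 1.0)
example_factor = scaling_factor(1.5, 1.2)
example_scaled = rescale_map([0.0, 1.0, 2.5], 2.0)
```

A scaling factor above one lengthens the map, as expected for a population with a longer duration since its last major bottleneck.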

Results
Table 1 describes the characteristics of cosmopolitan maps constructed at different marker densities. The 6 kb cosmopolitan map contains 1,691 markers, of which 405 are represented in a single population, 74 in two populations, 181 in three populations and 1,031 are found in all four. The LDU map lengths vary over the range 187-204 LDUs and the number of intervals where the limit of 3 LDUs has been assigned is between 2 and 7. Given the large number of loci in the maps (670-1,691), the relatively small number of intervals in this category suggests that there is good marker coverage at all densities and that the LD map is well characterised. Figure 1 illustrates the conservation of the LD map contours under the different density selections.
From Table 1, it is evident that maps with more holes tend to be somewhat longer, although this is not always the case, for example in the comparison of the 12 kb and 15 kb mean spacing maps. The effect of holes on map length is difficult to predict, since addition of markers within a recombination-intense segment creates more intervals and may therefore generate additional holes, indicating LDU = 3 as an underestimate. Presumably, however, in most cases the limit is an overestimate and addition of further markers in the interval will reduce map length. Given that the 6 kb cosmopolitan map corresponds to approximately 500,000 SNPs in the genome and has only two holes, it seems the most suitable to use as a basis for evaluating cosmopolitan maps.
The fit of the multiple pairwise data to the kb map is described in Table 2. The mean swept radii for the four populations are in the range 80-105 kb, with the AF sample having the least extensive LD. This is consistent with a previous study. 11
The population-specific LD maps (Table 3, Figure 2) give LDU maps of length 204-272 LDUs for the individual populations, with the longest LD map for the AF population, again reflecting the greater time for recombination to erode LD. The cosmopolitan map is somewhat shorter (187 LDUs) but the number of holes in the population-specific maps is considerably higher (in the range 9-17 compared with only 2 in the cosmopolitan map). The discrepancy suggests that increased SNP typing in the intervals with holes may reduce the overall length of the population-specific maps. To test this, we rebuilt LD maps for the four populations but included all of the SNPs typed in each population sample within each interval containing a hole. We were then able to examine the effect on map length under conditions where some additional typing is possible. There were 51 holes for the four populations.
Table 3. Linkage disequilibrium unit maps constructed for each population.
Table 4 examines the fit of the population-specific data to the cosmopolitan map. The $\epsilon$ parameters for each population, divided by that for the cosmopolitan map ($\epsilon = 1.152$, Table 3), give the appropriate scaling factors (Table 5) to apply to the cosmopolitan map so that it can represent any population. For the AF population, the scaling factor is 1.33, suggesting that the cosmopolitan map should be lengthened by 33 per cent. Relative to the Caucasian population, the scaling factor would be 1.43 (Table 4). This figure compares with previous scaling factor estimates 11 in the range 1.63-1.84, obtained in much smaller genomic regions.
There is considerably greater similarity in the $\epsilon$ parameter estimates for the other populations which do not have recent African ancestry.
Table 5 gives the REs of the kb and cosmopolitan maps as ratios of the corresponding error variances. The RE of the kb map, compared with the individual population LD maps, varies between 50 and 69 per cent, the relatively low figure being expected, because kb maps do not reflect patterns of LD. By contrast, there is good support for the use of the cosmopolitan map to represent LD patterns for all populations, when $\epsilon$ and $M$ are estimated. The relative efficiency is between 91 and 95 per cent, suggesting a small loss of information. Figure 3 illustrates the utility of a cosmopolitan map by comparing the AF LD map with a cosmopolitan map transformed by the appropriate scaling factor. The rescaled map is somewhat shorter, but the relatively large number of holes in the AF population map suggests that this map may be somewhat inflated in overall length.

Discussion
This study supports and extends previous findings 11 which demonstrated that careful modelling of LD patterns in human populations reveals both similarities and differences that have biological interpretation and reflect differences in population history. Recent studies have shown the importance of narrow, intense, recombination hot spots in determining LD patterns and the tendency for these to be co-localised in all human populations, which accounts for the remarkable convergence in the overall LD 'plateau' and 'step' structure. We have shown that the predominant source of differences between the maps arises from population duration, and that appropriate linear scaling of a 'standard' map recovers most of the information in population-specific maps. The loss of information, after appropriate scaling for a given population, is between 5 and 9 per cent. A cosmopolitan LD map, which combines data over a range of populations, has advantages for map integration, through being derived from a larger data sample and therefore having fewer regions of uncertainty (holes). Such a map may also have a larger number of loci and is, therefore, of potentially higher resolution. The integration of standard LD maps with genetic and physical chromosome maps will be useful for disease gene mapping by association. Ke et al. 19 described the integrated location database, LDB2000 (http://cedar.genetics.soton.ac.uk/public_html/), which gives locations on alternative scales for genes and polymorphisms. The integrated maps specify kb locations, obtained from the current human genome sequence, together with male and female genetic maps in cM and cytogenetic band assignments. Once LD maps are integrated, it will be possible to infer locations for all markers on all scales, by interpolation.
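Inferring a location on one scale from a location on another can be illustrated with a simple interpolation between flanking markers that have been placed on both scales. The text states only that interpolation is used; the choice of linear interpolation, the function name and the example values are assumptions of this sketch.

```python
import bisect

def interpolate_location(x, known_x, known_y):
    """Linearly interpolate a location on scale y (for example LDU) from a
    location x on another scale (for example kb), given markers with known
    locations on both scales. `known_x` must be strictly increasing."""
    if len(known_x) != len(known_y) or len(known_x) < 2:
        raise ValueError("need at least two markers located on both scales")
    if not known_x[0] <= x <= known_x[-1]:
        raise ValueError("x lies outside the mapped region")
    j = bisect.bisect_right(known_x, x)
    if j == len(known_x):
        return known_y[-1]           # x coincides with the last marker
    i = j - 1
    x0, x1 = known_x[i], known_x[j]
    y0, y1 = known_y[i], known_y[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical markers: kb locations and their LDU locations.
example_ldu = interpolate_location(15, [10, 20, 30], [0.0, 1.0, 1.2])
```

Within a plateau of the LDU map, interpolated locations change little with kb position, whereas within a step a small kb difference maps to a large LDU difference.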
The LD map might therefore be useful for local improvement of the resolution of the genetic linkage map, which is currently unreliable below approximately 1 cM, reflecting the relatively small numbers of meioses in the linkage families. Genetic linkage maps have greatly facilitated the multipoint linkage mapping of many major disease genes. The importance of exploiting linkage map locations for this purpose, rather than using the physical (kb) map, is maximal for candidate regions where there is a substantial deviation from the genome-wide average of approximately 1 cM per megabase. As an illustration, in the hemochromatosis gene region, 20 a low recombination rate was initially misinterpreted as a small physical distance and this may have delayed the positional cloning of the gene. It is likely that the correct characterisation of LD patterns will be of equal, or perhaps greater, importance for association mapping. A study by Maniatis et al. 14 has shown substantially increased power and precision for multilocus modelling of disease-marker association when using an LD rather than a kb map. Localisation within an LD map was shown to increase power by approximately 38 per cent on average, compared with a kb map, in extensive simulation studies. To apply association mapping effectively for complex traits, it is not sufficient to construct a high-resolution recombination map, 8 since association mapping requires the characterisation of LD patterns. Although recombination hot spots are co-localised among populations, differences in duration mean that the steps have different heights, and this is particularly apparent when comparing African-derived populations with other populations. Furthermore, a recombination map does not accommodate the effects of other processes that shape LD patterns, such as drift, selection and mutation.
Methods to construct high-resolution LD maps are likely to improve. It is fortunate that a substantial body of data is now available for a number of populations (http://www.hapmap.org/); this will allow whole-chromosome LD map construction to be undertaken. The samples from HapMap offer high marker density and, therefore, LD maps with well-defined LD structure can be constructed, permitting further evaluation of the utility of LD maps and integrated maps for disease gene mapping. Maps constructed at relatively low densities 6 remain useful, in that they provide credible starting estimates for LDU distances in a candidate region which may be revised through the addition of more markers and larger sample sizes.
Recent evidence points to a substantial loss of power through SNP selection to retain haplotype diversity. 21 A proposed multistage design, in which tested models and marker density change adaptively, uses an LD map to guide the addition of SNPs into a candidate region, rather than selective removal. In the multistage design, stage 1 is a genome scan by linkage or association at relatively low resolution, stage 2 refines an identified candidate region at moderate marker densities and the final stage requires functional tests on SNPs at high resolution, with the goal of identifying one or more causal polymorphisms. Appropriately scaled LD maps will be valuable throughout, but particularly in the early stages, to identify poorly covered regions which require additional SNPs, thereby reducing the probability of failing to screen a critical region.