Functional single nucleotide polymorphism-based association studies

Association studies hold great promise for the elucidation of the genetic basis of diseases. Studies based on functional single nucleotide polymorphisms (SNPs) or on linkage disequilibrium (LD) represent two main types of designs. LD-based association studies can be comprehensive for common causative variants, but they perform poorly for rare alleles. Conversely, functional SNP-based studies are efficient because they focus on the SNPs with the highest a priori chance of being associated. Our poor ability to predict the functional effect of SNPs, however, hampers attempts to make these studies comprehensive. Recent progress in comparative genomics, and evidence that functional elements tend to lie in conserved regions, promises to change the landscape, permitting functional SNP association studies to be carried out that comprehensively assess common and rare alleles. SNP genotyping technologies are already sufficient for such studies, but studies will require continued genomic sequencing of multiple species, research on the functional role of conserved sequences and additional SNP discovery and validation efforts (including targeted SNP discovery to identify the rare alleles in functional regions). With these resources, we expect that comprehensive functional SNP association studies will soon be possible.


Introduction
Association studies of common, complexly inherited human diseases have the potential to provide us with insights into causes of enormous human suffering. 1 While thousands of such studies have been published (typically using single nucleotide polymorphisms [SNPs]), only a handful of these findings have been clearly and consistently replicated. While some findings are doubtless real, 2 debate continues over most. There are only a small number of genetic variants that have been clearly and consistently associated with a common disease, many of which are listed in Table 1.

Types of association studies
Researchers typically weigh comprehensiveness against efficiency when designing an association study. A highly comprehensive study would assess every variant in the region(s) under study, regardless of type, location and allele frequency. A highly efficient study would be designed to reduce costs, including genotyping and/or multiple testing costs. Genotyping costs can be saved by determining which SNPs are in linkage disequilibrium (LD). For example, if you knew that two SNPs were in complete LD in the specific population of interest, you would only need to genotype one to assess them both. Multiple testing costs can be reduced by only looking at SNPs with a high a priori chance of being associated. Note that, as multiple testing correction should account for the effective number of independent tests performed, genotyping only one of two SNPs in complete LD does not reduce multiple testing costs; if the SNPs are in complete LD, only one effective independent test is being performed, regardless of whether one or two SNPs are genotyped (in this situation, a Bonferroni correction based on the number of SNPs genotyped is overly conservative). As 'per SNP' genotyping costs continue to fall, it seems likely that multiple testing costs will become the predominant concern in efficiency. Therefore, we discuss efficiency in terms of the a priori likelihood for an SNP to be associated with the phenotype studied.
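The point about effective independent tests can be illustrated with a minimal sketch. It assumes a deliberately crude rule that is not taken from this review: SNPs whose genotypes are (near-)perfectly correlated are grouped, and each group counts as one test. Methods used in practice estimate the effective number of tests from the full correlation structure; the genotype vectors, the function names and the 0.999 cut-off below are all hypothetical.

```python
from itertools import combinations

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 minor-allele counts)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) * (a - mean_x) for a in x)
    var_y = sum((b - mean_y) * (b - mean_y) for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    return cov * cov / (var_x * var_y)

def effective_tests(genotypes, r2_threshold=0.999):
    """Count groups of SNPs linked by r^2 >= threshold; each group is one effective test."""
    names = list(genotypes)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if r_squared(genotypes[a], genotypes[b]) >= r2_threshold:
            parent[find(a)] = find(b)  # merge the two groups
    return len({find(name) for name in names})

# Hypothetical genotypes for eight individuals; snp1 and snp2 are in complete LD.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 0, 2, 1],
    "snp2": [0, 1, 2, 0, 1, 0, 2, 1],
    "snp3": [2, 0, 0, 1, 0, 1, 0, 2],
}
alpha = 0.05
m_eff = effective_tests(genotypes)
print("SNPs genotyped:", len(genotypes), "effective tests:", m_eff)
print("per-test threshold using effective tests:", alpha / m_eff)
print("per-test threshold using SNP count:", alpha / len(genotypes))
```

With these made-up data, three SNPs are genotyped but only two effective tests are counted, so dividing the significance level by the number of SNPs would be stricter than necessary.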
Different types of large-scale association studies and the balance they strike are shown in Figure 1, although, obviously, many studies are hybrids of these types. These approaches, which have been applied to candidate genes, regions and recently to the whole genome, 21,76 are discussed in detail below, along with another technique (re-sequencing), which can currently only be applied on a small scale. Additional techniques that may be useful in 'special' populations, such as isolated founder and admixed populations, are discussed elsewhere. 77

Re-sequencing
When there is strong a priori evidence that a gene may be involved in a disease, it is possible to sequence that gene in cases and controls. 43,80,81 This requires no prior knowledge of variants in the region and allows researchers to evaluate comprehensively all variants in a gene, regardless of their allele frequency. Usually, it is necessary to group the very rare variants (< 1 per cent) for power considerations. 43,80,81 While this approach is now possible for one or a few candidate genes, it is by no means comprehensive across the genome, and dramatic reductions in sequencing costs are necessary for its implementation on a large scale. 82-84
Figure 1. Association study approaches: Efficiency versus comprehensiveness. Studies vary in their efficiency (the a priori likelihood of a tested single nucleotide polymorphism [SNP] being associated with a disease), which has an impact on genotyping and multiple testing costs. Highly efficient designs (as defined by multiple testing costs) are shown on the right, with less efficient designs on the left. Studies also vary in comprehensiveness, both in terms of the allele frequency spectrum assessed (A) and the extent to which the region under study is assessed (B). Highly comprehensive studies extend from top to bottom. The efficiency (or comprehensiveness) for a specific study type relative to another in this figure is certainly not meant to be quantitative but merely indicative of the direction (bigger or smaller). This figure is applicable to large-scale studies of candidate genes, regions or the whole genome. Different functional SNP approaches are represented in blue, while non-functional approaches are represented in green. Re-sequencing is currently only feasible for examining one or a few candidate genes and is therefore not depicted. (A) Using linkage disequilibrium (LD) approaches, rare alleles are less likely to be tagged and hence the rare allele region is not covered. Since non-synonymous SNPs (nsSNPs) are assessed directly, association with rare alleles can be readily detected; however, this is limited by the availability of these SNPs. The light colour in the rare allele region is to indicate that coverage is dependent on SNP discovery. In this figure, we consider the most obvious functional SNPs, the nsSNPs. We presume the efficiency of the other functional categories may be significantly lower. (B) Typically, there is a trade-off between efficiency and comprehensiveness. One may limit the study to nsSNPs in order to have high efficiency at the cost of comprehensiveness. Further increase in the efficiency (and decrease in comprehensiveness) can be achieved by focusing only on nsSNPs predicted or known to have a functional consequence. Similarly, it has been proposed that a study utilising SNPs that tag the highest number of other SNPs (ie SNPs in high LD regions) would be more efficient (but less comprehensive) than a study aiming at LD coverage of the full genome. 75

LD
Given the high rate of LD in the genome, many variants do not need to be directly genotyped in order to be assessed. They may instead be assessed by genotyping another SNP in high LD. The goal of LD-based ('tagging') approaches is to test a sufficient number of common SNPs so that SNPs that are not directly tested are assessed through their high correlation with the genotyped SNPs. This can create efficiency in genotyping but does not reduce multiple testing costs (as discussed previously, multiple testing corrections should account for the effective number of independent tests, rather than the number of SNPs genotyped). Additionally, the efficiency of the approach is modest, since there is a low a priori chance that a specific assessed SNP is associated with disease. By focusing only on regions with high LD (in which a single SNP is likely to tag several other SNPs), one improves the efficiency because there is an increased likelihood for any assessed SNP (ie for one test) to be tagging a functional SNP that is associated with the phenotype of interest. 75 Tagging allows most common SNPs to be comprehensively assessed in linkage regions, 85 [...] of allele frequencies because it tends to work poorly on rare polymorphisms. 88-92 Given the clear importance of rare polymorphisms (Box 1), this presents a substantial drawback. While some analytical work suggests that long haplotypes may be used to achieve a degree of 'tagging' of the rare allele, this comes with a dramatic multiple testing cost. 106 The adequate assessment of rare alleles requires direct interrogation.
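The tagging idea can be sketched as a greedy selection, shown here as one simple way to choose tag SNPs rather than the procedure used in any of the cited studies. The pairwise r² values, the SNP names and the 0.8 cut-off are hypothetical choices made for illustration.

```python
def covered_by(snp, candidates, pair_r2, threshold):
    """Return the candidates that `snp` tags: itself plus partners with r^2 >= threshold."""
    return {
        other for other in candidates
        if other == snp or pair_r2.get(frozenset((snp, other)), 0.0) >= threshold
    }

def pick_tags(snps, pair_r2, threshold=0.8):
    """Greedy selection: repeatedly genotype the SNP that tags the most untagged SNPs."""
    untagged = set(snps)
    tags = []
    while untagged:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(snps),
                   key=lambda s: len(covered_by(s, untagged, pair_r2, threshold)))
        tags.append(best)
        untagged -= covered_by(best, untagged, pair_r2, threshold)
    return tags

# Hypothetical pairwise r^2 values; unlisted pairs are treated as uncorrelated.
pair_r2 = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("D", "E")): 0.30,
}
tags = pick_tags(["A", "B", "C", "D", "E"], pair_r2)
print(tags)  # ['A', 'D', 'E']
```

In this toy example one SNP stands in for a block of three highly correlated SNPs, whereas the two weakly correlated SNPs (as rare alleles often are) each have to be genotyped directly.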

Functional SNPs
Functional variants are the most likely to be associated with diseases (in fact, non-functional variants should only be associated secondary to LD); therefore, genotyping studies using only functional SNPs are relatively efficient. Since these variants are directly assessed, these studies are comprehensive in terms of allele frequency, covering rare and common variants present in the databases or discovered during focused SNP discovery. Our poor ability to predict functional SNPs, however, means that this approach is generally far from comprehensive in terms of coverage of the region under study. Nevertheless, by focusing on the most obvious classes of potentially functional SNPs, such as those causing non-synonymous changes in proteins, researchers have had notable successes with association studies in candidate genes 107 or linkage regions. 3,22 It is now possible to apply this method on a genome-wide scale, 75,108 which increases comprehensiveness with some reduction in efficiency.

Extending the (potentially) functional SNP approach
There are many attractive features of the functional SNP approach, including its efficiency and ability to assess rare and common alleles. Additionally, a positive association automatically provides a candidate causative polymorphism.
A major criticism of the functional approach is its lack of comprehensiveness, 96 and extending the coverage has been difficult, given our poor ability to predict functional SNPs. We can, however, broadly define functional SNPs as SNPs in any class predicted to have an above-average chance of having a functional effect. Recent progress in comparative genomics is likely to dramatically increase the comprehensiveness of this approach.

Box 1. Common variant/common disease versus rare variant/common disease
For the purposes of this review, we use the standard definition of a polymorphism as a variant whose minor allele frequency (MAF) is above 1 per cent, and define common alleles/polymorphisms as those with MAF > 10 per cent, rare alleles/polymorphisms as those with MAF 1-10 per cent and very rare alleles/variants as those with MAF < 1 per cent. In the past decade, there has been substantial debate over the importance of common alleles versus rare alleles (or even very rare variants) in common, complex human diseases. Theoretical work has been used to argue all points of view: that causative common disease alleles are most likely common alleles, or rare alleles, or very rare alleles. 93-95 One key argument for common alleles relies on the perceived greater practical difficulties in studying rare alleles rather than common alleles. First, analysis methods are particularly sensitive to genotyping errors of rare alleles and rare alleles have been particularly prone to genotyping errors. 96,97 Recent improvements in genotyping technologies, however, dramatically lessen these concerns. 98,99 Secondly, rare alleles are more likely to be population specific and therefore are more likely to generate spurious associations due to population substructure. Again, improvements, this time to analytical methods, allow us to detect and adjust for these artifacts. 100,101 Thirdly, it has been argued that the power to detect associations with rare alleles appears low when compared with that to detect common alleles. While this is certainly true if one assumes the same genotypic relative risk, this assumption is arbitrary, and if one instead uses another arbitrary assumption of equal population attributable risk, then the power to detect rare alleles would be significantly better than that for common alleles. Probably, a more reasonable approach is to consider a specific genetic effect size (eg defined by the logarithm of the odds (LOD) score in sibling-pair analysis) of a locus and assume that causative alleles generate this specific effect size. 102 Given this assumption, the power to detect common and rare alleles is fairly similar (data not shown). Finally, rare alleles are difficult to 'tag' and therefore need to be assessed directly, creating two problems: alleles must be in databases in order to be assessed, and genotyping all of the rare alleles in the genome would be an effort at least an order of magnitude larger than that contemplated for the linkage disequilibrium (LD)-based approach for common alleles. These concerns, while substantial, may be addressed by single nucleotide polymorphism (SNP) discovery and focusing genotyping efforts on rare SNPs that are also potentially functional.
One theoretical argument for rare alleles is that purifying selection should keep the frequency of deleterious functional alleles low. Indeed, in a study of approximately 30,000 non-synonymous SNPs, we confirmed previous observations that SNPs predicted by PolyPhen 103,104 to be damaging have significantly lower allele frequencies than SNPs predicted to be benign. This effect is largely due to an enrichment of damaging SNPs in the MAF < 10 per cent category. 105 Perhaps the strongest argument comes from an examination of Table 1, which indicates that both common and rare alleles are important. In light of these data, it is clearly essential for common disease association studies to investigate rare, as well as common, alleles.
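The contrast drawn in Box 1 between assuming equal genotypic relative risk and assuming equal population attributable risk can be made concrete with a small sketch. It uses Levin's formula for the population attributable risk, PAR = p(RR - 1) / (1 + p(RR - 1)), and treats the frequency p as the prevalence of the risk factor, which is a simplification introduced here and not the authors' model. The function name and the example values are hypothetical.

```python
def relative_risk_for_par(par, frequency):
    """Invert Levin's formula: the relative risk that yields a given PAR at a given frequency."""
    if not 0.0 < par < 1.0 or not 0.0 < frequency <= 1.0:
        raise ValueError("par must be in (0, 1) and frequency in (0, 1]")
    return 1.0 + par / ((1.0 - par) * frequency)

# Same (hypothetical) attributable risk of 5 per cent for a common and a rare variant.
par = 0.05
for label, frequency in (("common, 30 per cent", 0.30), ("rare, 2 per cent", 0.02)):
    rr = relative_risk_for_par(par, frequency)
    print(f"{label}: relative risk of about {rr:.2f}")
```

Under equal attributable risk the rare variant must carry a much larger relative risk than the common one, which is why that assumption favours the detection of rare alleles, whereas equal relative risk favours common alleles.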
Below, we address some traditional functional elements (non-synonymous, splicing and promoter SNPs), as well as functional sequences emerging from the study of genome conservation.

Non-synonymous
The most obvious class of potentially functional SNPs is those causing non-synonymous changes in proteins (nsSNPs). Over 60 per cent of known Mendelian disease mutations and almost all the consistent, common disease mutations in Table 1 involve nsSNPs. 109 While there is a clear ascertainment bias for studying and confirming associations with nsSNPs, they are inarguably important in disease.
Additional evidence that many nsSNPs are functional and subject to selection comes from candidate gene sequencing studies, which find that 60 per cent of the expected number of nsSNPs are missing. 110,111 Furthermore, nsSNPs have lower minor allele frequencies than do synonymous SNPs. 110,111 When we examined all coding SNPs currently in the SNP database (dbSNP), we also found a dearth of nsSNPs; these are expected to comprise two-thirds of coding SNPs 111 but instead comprised less than one-half (20,463 nsSNPs out of 42,387 coding SNPs, or 48 per cent). The deficiency of nsSNPs was even more notable when the analysis was limited to conserved coding regions, in which only about one-third of SNPs were non-synonymous (8,828 of 23,397, or 38 per cent). (SNP definitions were derived from the Ensembl database, and conserved regions were as defined previously. 112 )
Large-scale studies of nsSNPs maintain high efficiency while allowing reasonable coverage. 75 One could choose to further increase efficiency (and decrease comprehensiveness) by limiting a study only to nsSNPs with a high predicted likelihood of being damaging. A substantial proportion of such SNPs have already been implicated in human disease. 103,113

Splicing
Perhaps the next most obvious class of potentially functional variants is SNPs around splice junctions. Mutations that affect splicing account for 15 per cent of mutations in Mendelian diseases and hence are likely to play some role in common diseases. 114 Splicing is directed by weakly conserved 5′ and 3′ splice sites and a branch site, as well as exonic and intronic enhancers and silencers. Sites far from splice junctions can affect splicing, and a few mutations in these distant sites have been shown to cause human disease. 115-120 It appears, however, that most control of splicing lies in the 20 base pairs (bp) flanking each side of exon-intron boundaries. 120 These regions contain a high density of splicing enhancers (SEs), 120 have fewer SNPs than sequences further from splice junctions 120 and contain most of the known splicing mutations. 114 We find that these sequences are significantly conserved and have a relative dearth of SNPs (Table 2).
Rather than testing all SNPs within the vicinity of a splice junction, one could increase efficiency by limiting the analysis to SNPs specifically predicted by computational models to affect splicing. 121,122 Conversely, one can increase comprehensiveness by assessing SEs beyond 20 bp of splice junctions. SEs are most prevalent in exons. 123,124 Some synonymous SNPs have also been shown to alter splicing. 122 Several programs are now available to predict SEs. 125,126 In addition to SNPs within 20 bp of the junction, the interrogation of synonymous SNPs predicted to disrupt SE activity 126 increases study comprehensiveness.
Table 2. Conservation and relative single nucleotide polymorphism (SNP) density in different types of functional regions. For each functional region, we report the odds ratio that a nucleotide in that region will be a variant by comparison with the rest of the genome (essentially, the relative SNP density) and standard error. The expected number is obtained using the validated SNPs in the genome (4.9 M) and the total number of base pairs of the genome within a particular class of functional elements. A number less than 1 indicates a deficiency in SNP number. We also report the fold conservation (as defined previously 112 ) compared with the genome average.
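Selecting SNPs that lie within 20 bp of an exon-intron boundary is a simple interval check, sketched below under stated assumptions: exons are given as 1-based inclusive (start, end) coordinates on one strand, the outer ends of the first and last exon are not treated as splice junctions, and all coordinates and function names are hypothetical.

```python
def splice_windows(exons, window=20):
    """Return (low, high) intervals covering `window` bp on each side of every exon-intron boundary."""
    ordered = sorted(exons)
    windows = []
    for index, (start, end) in enumerate(ordered):
        if index > 0:
            # Boundary between the upstream intron and this exon.
            windows.append((start - window, start + window - 1))
        if index < len(ordered) - 1:
            # Boundary between this exon and the downstream intron.
            windows.append((end - window + 1, end + window))
    return windows

def near_splice_junction(position, exons, window=20):
    """True if a SNP position falls inside any splice-junction window."""
    return any(low <= position <= high for low, high in splice_windows(exons, window))

# Hypothetical three-exon transcript and SNP positions.
exons = [(100, 200), (500, 650), (900, 1000)]
for position in (150, 195, 220, 221, 485, 575):
    print(position, near_splice_junction(position, exons))
```

Widening `window`, or adding exonic positions predicted to disrupt splicing enhancers, would trade efficiency for comprehensiveness in the way the text describes.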

Odds ratio ± standard error
Fold conservation

Promoters
Promoters are cis-elements that lie upstream of transcription start sites and are responsible for transcription initiation. 127 The existence of regulatory variants affecting transcription has long been established, 128,129 and such variants have been shown to play a role in human disease. 130,131 Even though the exact promoter sequence may not be easily discerned, recent work has shown that the 500 bp upstream from the transcription start site is almost always able to function as a promoter. 132 Defining the promoter, however, requires determining the 5′ end of transcripts, which is typically done experimentally and hence is laborious. 133-135 As shown in Table 2, conservation in the promoter sequences is threefold higher than expected.
In addition to promoters, numerous other cis-acting elements (for example enhancers) contribute to gene regulation. These elements have been more difficult to identify because they can lie within coding sequences, introns or as far as 1 megabase away. 120,136,137 Defining these elements is a main goal of the ENCODE project. 138 Genomic work aimed at identifying transcription factor binding sites and other regulatory sequences experimentally and informatically is ongoing, 87,139,140 and study of conserved sequences holds promise for the identification of these regions.

Conserved sequences
Computational efforts have consistently found that approximately 5 per cent of the human genome shows conservation with other species. 112,141-148 Although some regions may be conserved due to low mutation rates, clearly many, and perhaps most, of these regions are functionally important. 149 Indeed, most coding exons and many untranslated regions show interspecies conservation, although these only account for a minority of conserved regions. Conserved elements have been shown to affect gene transcription levels, 150-156 RNA editing 112 and genome stability. 157 Additionally, conserved regions are enriched in intronic stretches surrounding alternatively spliced exons and have an excess of predicted secondary structure 112,143,158 and matrix-scaffold attachment regions. 159 Furthermore, they are enriched in stable gene deserts, which have been postulated to contain long range cis-regulatory regions. 112 Two lines of evidence suggest that many SNPs in conserved regions are subject to selection and, hence, are presumably functional: these regions contain a relative dearth of SNPs (Table 2), and the SNPs present there show a shift in allele frequency distribution towards rarer alleles. 160,161 The identification of conserved non-coding elements has generated a paradigm shift for the definition of functional elements. Without knowing the exact function of each element, sequences conserved across species define a map of likely functional regions in the genome and SNPs in the regions are candidates for functional SNP association studies.
The study of conserved regions is a vibrant field, with diverse methods of defining conservation and views on the correct number and types of species to compare. Some groups have focused on very large regions while others have examined conservation of regions as small as 4 bp. 112,143,144 Analyses can be performed using very closely related species (such as primates) or very distant species (such as a range of eukaryotes). 112,143,144 The study of species that are moderately distant (~75 million years) has yielded many of the conserved elements, 162 while study of primates has provided insight on primate-specific regulatory elements. 146 In addition to identifying conserved elements subject to purifying selection, comparative genomics has identified genes with evidence of positive selection. 163,164 Similar analyses may eventually be able to identify non-coding elements subjected to positive selection.
The proportion of functional elements that can be identified by comparative genomics is not yet clear. In a study using sequences from multiple yeast species, essentially all the known non-coding regulatory regions were identified as conserved. 157 Another study in yeast could identify conserved elements at the resolution of 6 bp transcription factor binding sites. 165 In mammals, using the currently available genomic sequences, most of the coding sequences and known regulatory sequences are conserved. 166 The analysis of more mammalian genome sequences will undoubtedly refine the current picture of conserved elements, although it is not clear that it will reach the same resolution achieved in yeast. 162 Nevertheless, it is likely that some functional sequences may not be identified through comparative genomics. If SNPs in these sequences do not fall into another obvious class of functional elements (like promoter regions), they may be missed by function-based association studies.

Generating a whole genome set of functional SNPs
The current feasibility of genome-wide functional association studies depends upon the total number of functional SNPs and the extent to which such SNPs are represented in the databases. In the following discussion, we define functional SNPs as SNPs that fall into any of the above classes (ie non-synonymous, splicing, promoter, conserved 112 ). Ongoing improvements in the definition of conserved regions may slightly change these estimates.
To estimate the total number of functional SNPs, we have utilised publicly available data from ENCODE regions. Ten regions (500 kilobases each) were re-sequenced in 48 unrelated individuals (16 Yoruba, 16 Centre d'Etude du Polymorphisme Humain [CEPH], eight Han Chinese and eight Japanese). The SNPs in these regions, including those already present in dbSNP and those newly discovered in sequencing, were then genotyped in the full 270 HapMap samples.
We first determined the total number of functional SNPs currently in dbSNP (using the above definitions). We then used the ENCODE regions to determine the allele frequency distribution (ie percentage rare and common) of conserved-region SNPs already in dbSNP (ignoring those newly discovered by the ENCODE re-sequencing effort). We subsequently used information on the newly discovered ENCODE SNPs and our internal SNP discovery efforts to infer the percentage of SNPs missing from dbSNP. This finally allowed us to estimate the total number of such SNPs. Implicit in this estimation is the assumption that the allele frequency distribution of functional SNPs is the same as that of the subset of these SNPs that are in conserved elements (which account for over 75 per cent of the functional SNPs).
There are approximately 380,000 functional SNPs in dbSNP build 124. We infer from the ENCODE data that approximately 190,000 of these are common and 85,000 are rare (the remaining SNPs are very rare or database errors). Results were similar using data from both the CEPH and Yoruban samples. These results differ markedly from the expectations under the standard neutral model that there should be similar numbers of rare and common SNPs, suggesting that rare SNPs are missing in the dbSNP database. 167 Of the conserved-region SNPs detected in the ENCODE Yoruban samples, the dbSNP database contained 23 per cent of the rare and 55 per cent of the common SNPs. Coverage was higher for conserved-region SNPs detected in the ENCODE CEPH samples, as the dbSNP database contained 35 per cent of the rare as well as 71 per cent of the common SNPs. Given that limited numbers of chromosomes typically are used for SNP discovery, both the dbSNP database and ENCODE are biased to miss rare SNPs (footnote a). The extent of this bias, estimated using our internal SNP discovery efforts, suggests that dbSNP coverage of rare SNPs is between approximately 25 per cent (in Caucasian) and approximately 15 per cent (in African).
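The step from database counts to genome-wide totals can be sketched with simple arithmetic. The sketch assumes, as a reading of the figures above and not as the authors' stated formula, that the total is the number of SNPs in dbSNP divided by the fraction of SNPs that dbSNP captures. It further assumes that the ENCODE coverage figures apply to common SNPs, that the lower internal estimates apply to rare SNPs, and that the African and Caucasian estimates map to the Yoruban and CEPH samples respectively.

```python
# Counts of functional SNPs already in dbSNP, as quoted in the text.
IN_DBSNP = {"common": 190_000, "rare": 85_000}

# Assumed fraction of SNPs of each class that dbSNP captures (see lead-in).
COVERAGE = {
    "Yoruban": {"common": 0.55, "rare": 0.15},
    "CEPH": {"common": 0.71, "rare": 0.25},
}

def estimate_total(count_in_db, fraction_covered):
    """Scale a database count up by the fraction of SNPs the database captures."""
    if not 0.0 < fraction_covered <= 1.0:
        raise ValueError("fraction_covered must be in (0, 1]")
    return count_in_db / fraction_covered

for population, coverage in COVERAGE.items():
    for allele_class, fraction in coverage.items():
        total = estimate_total(IN_DBSNP[allele_class], fraction)
        # Round to the nearest 10,000, matching the precision used in the text.
        print(f"{population:8s} {allele_class:6s} about {round(total, -4):,.0f}")
```

Under these assumptions the rounded results are 350,000 common and 570,000 rare SNPs for the Yoruban samples, and 270,000 common and 340,000 rare SNPs for the CEPH samples.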
From the above data, we estimate that there are approximately 350,000 common and 570,000 rare functional SNPs in the Yoruban samples and 270,000 common and 340,000 rare functional SNPs in the CEPH samples. Hence, a study that assayed only common functional SNPs would require a similar number of SNPs as an LD tagging study. 161,168 Even greater genotyping efficiency could be found by combining the approaches. Additionally, the number of rare functional SNPs is within the ability of new genotyping technologies. 98,99,169

Discussion
Association studies based on functional SNPs are highly efficient as they study the set of SNPs most likely to cause disease. In the past, these studies have been criticised as not being comprehensive due to our incomplete knowledge of the functional elements of the human genome. Research into conserved sequences and the continuing influx of genomic sequences into the public domain promise to delineate many of these elements and increase the comprehensiveness of functional SNP association studies. The use of functional-based association studies can, in principle, adequately assess rare alleles, poor coverage of which is a major drawback for LD-based association studies.
It may be possible to improve the balance between the comprehensiveness and efficiency (defined in terms of multiple testing costs) of a functional SNP-based study by incorporating the a priori probability that an SNP is functional into the statistical tests used for analysis. For instance, one might set a less stringent p-value threshold for a nonsense SNP than for one in a putative promoter. Additionally, one might set a less stringent p-value threshold for an SNP that was in two functional categories rather than in a single functional category. For example, Table 3 indicates that SNP density (which over the whole genome probably reflects selection and, hence, functionality) is particularly low in coding regions that are also conserved or flank splice junctions.
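One established way to let prior evidence set per-SNP thresholds is a weighted Bonferroni correction, in which each test receives a share of the overall significance level in proportion to a weight fixed before the data are examined. This is offered as an illustration of the idea, not as the scheme the authors propose; the SNP labels and weights below are hypothetical.

```python
def weighted_thresholds(prior_weights, alpha=0.05):
    """Split alpha across tests in proportion to prior weights (weighted Bonferroni)."""
    if any(weight <= 0 for weight in prior_weights.values()):
        raise ValueError("weights must be positive")
    total = sum(prior_weights.values())
    return {snp: alpha * weight / total for snp, weight in prior_weights.items()}

# Hypothetical weights reflecting the a priori chance that each SNP is functional.
prior_weights = {
    "nonsense_snp": 8.0,
    "conserved_coding_nssnp": 4.0,  # falls into two functional categories
    "nssnp": 2.0,
    "putative_promoter_snp": 1.0,
}
thresholds = weighted_thresholds(prior_weights)
for snp, threshold in thresholds.items():
    print(f"{snp}: declare significant if p < {threshold:.4f}")
```

Because the thresholds sum to the overall significance level, the family-wise error rate is still controlled, while SNPs with a higher a priori probability of being functional face a less stringent cut-off.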
For comprehensive functional-based association studies to become practical, several goals need to be accomplished. First, the definition of functional elements needs to be refined through the availability of more genomic sequences. Secondly, SNP discovery efforts must be continued and expanded. Targeted re-sequencing in the functional regions may be necessary in order to compensate for bias against rare alleles in the databases, especially those that are population-specific and hence more likely to be functional. 105 The availability of extra sequencing capacity and efficient SNP discovery technologies can help to achieve this goal. 170 Thirdly, SNPs must be genotyped in the major ethnic populations to determine allele frequencies. HapMap now includes millions of SNPs, although these are biased to common SNPs. 161 Given the high-throughput genotyping technologies available, testing additional candidate functional SNPs to identify the common and rare SNPs can be readily performed. Indeed, we have recently undertaken the task of genotyping approximately 30,000 nsSNPs from the public databases to identify a set of approximately 20,000 that are polymorphic in at least one population. 105 With the availability of the functional elements and the SNPs, only approximately 270,000-350,000 SNPs must be genotyped to assess common functional SNPs in the genome. Furthermore, the genotyping of 300,000-500,000 additional SNPs will allow assessment of rare functional SNPs, which have been implicated in many common diseases and are inadequately assessed by other approaches.
Table 3. SNP density per kilobase (kb) and counts in different types of functional regions. The diagonal provides single nucleotide polymorphism (SNP) density for each region type and the off-diagonal provides density for regions of two types, either because one type is a subtype (coding is a subtype of transcript) or because of overlapping transcript definitions (a region may be in the promoter of one transcript, yet coding in another).
Footnote a: SNP discovery efforts interrogate a limited number of individuals and hence are more likely to find a common minor allele than a rare minor allele. For example, a study using only one individual (two chromosomes) has a 50 per cent chance of including both alleles of a 50 per cent allele frequency SNP, but only a 2 per cent chance of finding both alleles of a 1 per cent frequency SNP. Hence, 1 per cent alleles are more likely to be missed in both dbSNP and the targeted re-sequencing than 10 per cent alleles. In addition, SNPs in dbSNP and those identified in this targeted re-sequencing effort are more likely to be more common in a different ethnic population, where they may have been discovered. Indeed, when studying alleles that are rare in the Caucasian population, we found the frequency in other populations to be higher for SNPs already in dbSNP than for SNPs identified through SNP discovery in the Caucasian population (MF unpublished results).
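The arithmetic in footnote a, on the chance that a discovery panel contains both alleles of a SNP, can be checked with a short sketch. It assumes that chromosomes are sampled independently from the population; the function name and the panel size of 96 chromosomes are illustrative choices.

```python
def prob_both_alleles_seen(minor_allele_frequency, n_chromosomes):
    """Probability that a sample of chromosomes contains both alleles of a biallelic SNP."""
    p = minor_allele_frequency
    q = 1.0 - p
    # Subtract the two ways of seeing only one allele: all minor or all major.
    return 1.0 - p ** n_chromosomes - q ** n_chromosomes

# One individual (two chromosomes), as in the footnote.
print(prob_both_alleles_seen(0.50, 2))  # 0.5
print(prob_both_alleles_seen(0.01, 2))  # about 0.0198, ie roughly 2 per cent

# A hypothetical larger discovery panel of 48 individuals (96 chromosomes).
for maf in (0.01, 0.10):
    print(maf, round(prob_both_alleles_seen(maf, 96), 3))
```

Even with a larger panel, a 1 per cent allele is far more likely to go undetected than a 10 per cent allele, which is the bias against rare SNPs that the footnote describes.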