From DNA to RNA to disease and back: The 'central dogma' of regulatory disease variation

Much of the focus of human disease genetics is directed towards identifying nucleotide variants that contribute to disease phenotypes. This is a complex problem, often involving contributions from multiple loci and their interactions, as well as effects due to environmental factors. Although some diseases with a genetic basis are caused by nucleotide changes that alter an amino acid sequence, in other cases, disease risk is associated with altered gene regulation. This paper focuses on how studies of gene expression variation might complement disease studies and provide crucial links between genotype and phenotype.


Introduction
Understanding the causes of human disease is one of the most fundamental goals of modern medicine. Individuals differ with respect to disease susceptibility, disease progression and effectiveness of treatment. Identifying the factors contributing to these differences, and elucidating their interactions as they contribute to aspects of disease phenotype, is a precursor to improved prevention, detection and treatment of disease.
Much of the understanding of human disease derives from the study of those diseases that segregate in families in a Mendelian fashion, where the causative variants and the genes in which they reside have been identified through classical family linkage approaches 1 and through studies in large pedigrees and in isolated populations based on founder effects. 2 The vast majority of common diseases exhibit a more complex mode of inheritance, however, aggregating in families but rarely exhibiting Mendelian inheritance. Examples of diseases of this type include diabetes, obesity, schizophrenia and asthma. Understanding of these 'complex' diseases is improving, although still limited, but it is clear that genetic variation plays an important role in susceptibility to disease, for example in autoimmune and infectious diseases. 3 Most complex disease is thought to be caused by the combined effect of genetic variants at a few loci or multiple loci, each with only modest functional effects on susceptibility. Additional roles are played by environmental factors and their interactions. Mapping the genomic regions contributing to disease creates new directions for disease research and is an important step towards improving human health.
Approaches to identifying the genes involved in complex disease can be generally grouped into two categories: candidate gene studies and linkage/association studies. Candidate gene studies use knowledge about the biology of a disease, and about genes in physiologically or biochemically relevant pathways, and attempt to correlate genetic variation at these 'candidate genes' with disease phenotype. Unfortunately, for most diseases, this type of information is not available or not complete enough to prove widely useful, and including it in some analyses is more likely to increase the noise than to reduce the search space for the disease. Genome-wide linkage studies and association analyses serve as alternative approaches, surveying the contribution to disease of genetic variants located anywhere in the genome. The genome-wide aspect means that these studies do not require any a priori hypothesis that a particular region is involved, although predictions about the potential effect of specific variants (eg non-synonymous single nucleotide polymorphisms [SNPs]) can be incorporated in the models. In this respect, these approaches are unbiased. Family-based linkage studies entail identifying genetic variants in families that co-segregate with disease more often than would be expected by chance. In general, linkage studies have achieved limited success in identifying genomic regions involved in complex disease, in part because they are underpowered to detect moderate genetic effects. Furthermore, because identifying a region or regions associated with the disease or trait requires identifying those alleles that segregate with the disease in families, which in turn depends on recombination within the families, it can be difficult to narrow a region exhibiting significant linkage. An alternative methodology is to perform association analysis, which looks for correlation of genetic variants with aspects of phenotype, but does not require a pedigree structure for the individuals.
Association analyses are more powerful for the detection of common disease alleles with small to modest effects, 4,5 and increasingly are being used successfully in studies to identify genes contributing to disease. 6,7 Although many phenotypic differences among individuals are attributable to variants in coding DNA, 8 variants in non-coding DNA can have profound effects on phenotypes, including disease phenotypes. For example, regulatory variants affecting transcription initiation, splicing, RNA stability and translational efficiency are known to play roles in conditions including autoimmune disease (CTLA4 9), malaria (DARC 10), various cancers (SMYD3 11) and other examples (shown in Table 1). In studies of such diseases, gene expression may serve as an intermediate phenotype between disease phenotype and genotype. 30,31 Gene expression, or mRNA levels, can be modulated by variants in coding or non-coding DNA (eg transcription factors or binding sites within promoters). Whole-genome association studies of gene expression (expression quantitative trait loci [eQTL] mapping) may generate hypotheses for disease susceptibility by identifying those regions of the genome with functional effects on gene expression. These might then serve as candidate regions to be evaluated for association with disease phenotypes, as is discussed below.

Resources for genome-wide analysis
An efficient approach to the study of human disease benefits from the use of shared resources. For example, in order to perform genome-wide linkage or association analyses, suitable DNA markers are required. The human genome is estimated to harbour more than 10 million SNPs present at >1 per cent frequency, 32 and these SNPs are located throughout the genome in regions of coding and non-coding DNA. Publicly available databases of SNP alleles, assays and genotypes are accessible online (eg dbSNP 33 and HapMap 34). High-throughput genotyping platforms and reductions in genotyping costs now make whole-genome genotyping feasible for large numbers of samples. Gene expression can also be quantified in a high-throughput manner using commercially available microarrays, permitting the detection of small differences in expression levels among samples.
The establishment of cell lines creates resources that can be used by multiple research groups from around the world to survey various cellular phenotypes. With respect to the study of gene expression, it is desirable to establish cell lines from different tissues because gene expression is highly dependent on developmental and cellular context and, indeed, some diseases manifest their phenotypes only in specific tissues. In addition, the cell perturbations that accompany the establishment of cell lines argue for also studying gene expression in primary tissues, although, clearly, the choice of sample depends on the purpose, stage and feasibility of a study, the sample size required and its availability. Although cell lines are imperfect proxies for the complete set of human tissues, data can be collected on a large scale with respect to sample size and reproducibility and can provide candidates for further study in other samples. Currently, there are relatively few data on gene expression across the diversity of healthy human tissues or across multiple individuals from different populations. These data from healthy individuals will provide important information on the range of naturally occurring gene expression variation and will serve as a baseline against which to compare disease-associated molecular phenotypes.

Statistical issues in genome-wide analysis
Although genome-wide association studies are thought to have more power than family-based linkage studies, they present strong challenges in the form of statistical interpretation. For example, a simple genome-wide association study may test hundreds of thousands of SNPs for association with a phenotype (or, more typically, multiple phenotypes), and more complicated models allowing for SNP-SNP interactions vastly increase the already large number of statistical tests. With such a large number of tests, the significance threshold must be adjusted to control the number of false-positive associations. Although procedures for multiple test correction exist (for example, Bonferroni correction, false discovery rate 35,36 and permutations of phenotypes relative to genotypes 37), it remains unclear which is the best method to apply in this context. It is also not trivial to infer the biological significance of an association from statistical significance, because allele frequencies, variance of the phenotype, density of markers and linkage disequilibrium (LD) can have a tremendous impact on the statistical significance inferred.
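The two simplest corrections named above can be sketched as follows. This is a minimal illustration rather than a recommendation of either method: the function names, the significance levels and the example p-values are hypothetical, and the false discovery rate procedure shown is the Benjamini-Hochberg step-up rule, which is assumed here as one common choice.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up rule controlling the false discovery rate at q."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject every test ranked at or below that k.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight SNP-phenotype tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(sum(bonferroni_reject(example_p)))          # tests passing Bonferroni
print(sum(benjamini_hochberg_reject(example_p)))  # tests passing the FDR rule
```

On these made-up values the false discovery rate rule retains more tests than Bonferroni, which reflects the trade-off between stringency and power discussed above.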
Human genetic variation is structured into haplotypes, such that alleles at nearby loci often show strong statistical association with one another. Because of this association, known as LD, a large region may contain multiple SNPs exhibiting a significant association with a given phenotype. Although this structure of human genetic variation facilitates association mapping, it can complicate subsequent fine-scale mapping to narrow the associated region and locate the causal variant, as discussed below. Another concern in association studies is the potential for false associations caused by population stratification, 38 so care must be taken to reduce these effects through appropriate experimental design and data analysis. 39,40

eQTL mapping
The interrogation of gene expression to facilitate the design and interpretation of disease association studies can be a powerful tool for the identification of biologically functional variants and the interpretation of biological effects. 41 By introducing a quantifiable and easily measurable biological outcome, it is possible to assess the relevance of statistical significance and eliminate some of the issues raised above. The use of gene expression variation in the context of disease mapping can be viewed in two ways (Figure 1). On the one hand, hypotheses can be generated by discovering functional variation using eQTL mapping and subsequently testing those functional variants in large case-control samples. On the other hand, following a disease association study that has identified several signals located within regions of non-coding DNA, eQTL mapping can be used to interpret and dissect the functional effect of the candidate disease variants. Below, the two directions will be explored in more detail.

Generating functional candidates using eQTL mapping
Regions with functional effects on gene expression can be localised through the use of association mapping. Gene expression, or mRNA level, is a quantitative phenotype that can be assayed in multiple individuals. When the same individuals are surveyed for genetic variation at marker loci, for example SNPs, association analysis tests whether variation at each SNP can explain the observed phenotypic variation. The rationale behind this analysis is that the markers themselves are either the causal variant or are highly correlated (in LD) with the causal variant.
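One simple way to express the test described above is a linear regression of expression level on genotype. The sketch below assumes an additive coding of each SNP as 0, 1 or 2 copies of one allele; that coding, the function name and the example values are illustrative assumptions and are not taken from the studies cited here.

```python
def eqtl_regression(genotypes, expression):
    """Least-squares fit of expression on genotype coded as 0, 1 or 2 allele copies.

    Returns (slope, intercept, r_squared), where r_squared is the fraction of
    expression variance explained by the SNP.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matched values for at least 3 individuals")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype and expression must both vary across individuals")
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data: six individuals, one SNP, one gene's expression level.
snp = [0, 0, 1, 1, 2, 2]
mrna = [5.1, 4.9, 6.0, 6.2, 7.1, 6.9]
slope, intercept, r_squared = eqtl_regression(snp, mrna)
print(slope, intercept, r_squared)
```

In a genome-wide setting this fit would be repeated for every SNP-gene pair, which is what creates the multiple-testing burden.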
Association mapping of gene expression variation has been successful in many species, including human, 42-45 yeast, 46-48 mouse, 49-52 rat, 53 fish 54,55 and maize. 52 Together, these studies provide several striking observations related to the nature of functional variation influencing gene expression. First, variation in gene expression levels among individuals is common, and much of that phenotypic variation has a genetic basis. Much of the association signal is located cis- to the gene of interest, 45,52,56 although trans-acting variants have also been observed. Hotspots of gene regulation (ie regions of the genome influencing expression of several genes) have been observed in some, 44 but not all, studies.
There are several ways in which the study of the regulation of gene expression can enhance disease studies as well as narrow the choice of candidate regions for disease association studies. Where information exists about the contribution of particular genes to a disease phenotype or susceptibility, understanding the regulatory control of those genes may assist in elucidating the complete set of effects. In addition, understanding the regulation of categories of genes, or genes of a particular pathway, may provide targets for further follow-up in disease studies. It may prove more time-efficient and cost-effective, however, to have a list of many potential functional variants located throughout the genome and to test them against a large number of diseases. Whole-genome eQTL studies can provide a list of regions of the genome with functional effects on the expression of known genes (Figure 1a). SNPs located within these regions can then serve as candidates for disease association studies, much in the same way that non-synonymous SNPs are often considered because of their potential functional effect. There are several advantages to this type of targeted approach over a whole-genome scan. First, because the number of SNPs to be genotyped in each individual is reduced, many more individuals can be surveyed in a disease study without vastly increasing costs. The reduction in the number of markers tested can also eliminate some of the problems of multiple test correction: more sensible thresholds can be used and smaller effect variants can be detected. Secondly, any significant associations detected between SNP and disease phenotype provide both a mechanism (gene regulation) and the identity of the affected gene. Finally, the fact that potential causal regulatory variants were initially discovered in healthy individuals and subsequently have been associated with disease means that such variants are common and are likely to contribute significantly to the disease risk of the population.
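As a rough illustration of the multiple-testing argument, assume a Bonferroni correction at a family-wise significance level of 0.05. The SNP counts used here are hypothetical: 500,000 for a whole-genome scan and 5,000 for an eQTL-derived candidate set.

```latex
\alpha_{\text{per test}} = \frac{\alpha}{m}, \qquad
\frac{0.05}{500\,000} = 1 \times 10^{-7}, \qquad
\frac{0.05}{5\,000} = 1 \times 10^{-5}
```

Under these assumed numbers the per-SNP threshold is relaxed 100-fold, which is why variants of smaller effect can become detectable at a given sample size.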
The methodology above carries the risk of focusing only on certain types of genomic variants, while much of genome function remains unknown. A way to circumvent this problem is to use commercially available whole-genome SNP genotyping chips in disease studies and to incorporate the data on functional regulatory regions by performing the association analysis with Bayesian methods that assign different prior probabilities to SNPs on the array. Under such a scenario, SNPs located in regions with known functional effects on expression of specific genes, as identified through eQTL studies, would be assigned a higher prior probability of being associated with a phenotype. In addition, one might assign a higher prior probability to SNPs in known promoters, enhancers or transcription factors. Thus, one could focus on the effects of candidate variants without missing other important signals. Another substantial advantage of knowing regulatory variants before performing a genome-wide association study is that one can correlate phenotypes and regulatory networks and utilise such information in the statistical modelling of the disease.
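The effect of SNP-specific priors can be sketched with the standard relation between prior odds, a Bayes factor and posterior odds. The text does not specify a particular Bayesian method, so this is only an illustration; the function name, the prior values and the Bayes factor are hypothetical.

```python
def posterior_probability(bayes_factor, prior):
    """Posterior probability of association from a Bayes factor and a prior probability.

    posterior odds = Bayes factor * prior odds
    """
    if not 0 < prior < 1:
        raise ValueError("prior must lie strictly between 0 and 1")
    if bayes_factor < 0:
        raise ValueError("Bayes factor cannot be negative")
    prior_odds = prior / (1 - prior)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Hypothetical priors: a generic SNP on the array versus a SNP in an eQTL region.
BASELINE_PRIOR = 1e-5
EQTL_REGION_PRIOR = 1e-3

# The same strength of evidence (a hypothetical Bayes factor) for both SNPs.
evidence = 1000.0
print(posterior_probability(evidence, BASELINE_PRIOR))
print(posterior_probability(evidence, EQTL_REGION_PRIOR))
```

With identical evidence, the SNP carrying the higher prior ends with the higher posterior probability, while SNPs outside annotated regions are still tested and can emerge if their evidence is strong enough.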

Supporting genome-wide disease association studies: Narrowing on disease-associated non-coding signals
Genome-wide association studies are now increasing in frequency, 6,57 and although it would have been preferable to have identified all functional regulatory variants in advance, investigators will be faced with the challenge of interpreting some of the strongest association signals. Many of the association studies have a multi-phase design, wherein a fraction of the SNPs with the top statistical significance in the first phase are genotyped in a subsequent phase in a new set of individuals. The statistical exercise must eventually give way to biological interpretation, however, and the identification of the causal variant will be necessary. Although most of the confirmed disease-causing variants are located in coding regions, this observation is due to an ascertainment bias in the ability to predict the potential functional consequences of nucleotide variation. As the human genome is composed of less than 3-5 per cent coding DNA, and studies increasingly attribute function to non-coding DNA, it might be expected that much of the disease-causing variation will be non-coding and that many of the significant peaks in an association analysis will fall in regions devoid of genes.
Disease association studies often identify non-coding regions of the genome exhibiting significant association with disease. The exploration of those non-coding regions will benefit from the survey of gene expression variation and how it relates to genetic variation (eQTL mapping). For any disease-associated non-coding region (eg from a case-control study), it is possible to test whether the disease-associated SNPs and haplotypes are also associated with gene expression variation of nearby genes (as identified from eQTL studies; Figure 2). This enables conclusions to be drawn about the nature of the function of the causal variant. For instance, if the same haplotype that appears to increase the disease risk also appears to be associated with high expression of a nearby gene, it is possible to start making some connections between the biology of the affected gene and the disease itself. Moreover, one can hypothesise (and hopefully test) how levels of expression of a gene might affect disease risk. This simple connection between the two types of study could provide not only the identity of the gene that is linked to the disease, but also the consequence of the genome variation that links the gene with the disease. It may also provide some clues about other candidates (upstream transcriptional regulators, interacting proteins etc).
Several studies illustrate the utility and validity of using gene expression variation for disease fine mapping. Two of these studies have focused on identifying functional nucleotide variation by focusing primarily on the regions surrounding each of a set of genes (cis-), but also considering other regions located trans- to the genes. 42,45 These studies showed that a large fraction of genes (10-20 per cent) have significant variation that affects their expression in cis, and in some cases in trans. The regulatory variants that affect gene expression variation can be mapped with the same resolution as disease variants in genome-wide association studies because, in both cases, the resolution depends on the LD structure of the human populations. These studies, which allow the identification of regulatory haplotypes, need to be verified before functional experiments can be performed. The most appropriate way to perform a first-pass verification is to test whether allelic imbalance in expression is correlated with heterozygosity in the same SNPs as those that showed genotypic association with gene expression.
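The first-pass verification described above can be sketched as a comparison of allelic imbalance between individuals who are heterozygous at the candidate regulatory SNP and those who are homozygous. The data layout is an assumption made for illustration: each individual contributes allele-specific read counts from a transcribed marker SNP at which that individual is heterozygous. The function name and the counts are hypothetical.

```python
def mean_allelic_imbalance(samples):
    """Mean allelic imbalance in regulatory-SNP heterozygotes versus homozygotes.

    Each sample is (is_het_at_regulatory_snp, reads_allele_1, reads_allele_2).
    Imbalance is |fraction of allele-1 reads - 0.5|, so 0 means balanced expression.
    Returns (mean_imbalance_heterozygotes, mean_imbalance_homozygotes).
    """
    groups = {True: [], False: []}
    for is_het, reads_1, reads_2 in samples:
        total = reads_1 + reads_2
        if total == 0:
            continue  # no informative reads for this individual
        groups[bool(is_het)].append(abs(reads_1 / total - 0.5))
    if not groups[True] or not groups[False]:
        raise ValueError("need informative individuals in both genotype groups")
    mean_het = sum(groups[True]) / len(groups[True])
    mean_hom = sum(groups[False]) / len(groups[False])
    return mean_het, mean_hom


# Hypothetical allele-specific read counts for six individuals.
cohort = [
    (True, 140, 60), (True, 55, 145), (True, 130, 70),
    (False, 101, 99), (False, 96, 104), (False, 100, 100),
]
print(mean_allelic_imbalance(cohort))
```

A clearly larger imbalance among heterozygotes would be consistent with a cis-acting regulatory variant. A formal analysis would add a significance test, which is omitted from this sketch.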
Even with this information, and the knowledge that a causal variant may act through gene regulation, it is still a long way from being able to identify the exact DNA variant that causes the regulatory effect and subsequent increased disease risk. This is a stage where things become complex for many reasons. For example, although the genome-wide distribution of LD is quite variable, average LD in the human genome extends over large regions, which makes it challenging to fine map a causal variant in many regions. In the best case scenario, associated SNPs would be identified in a region of very low LD, thus reducing the number of potential causal variants to test subsequently. More often, an associated region of approximately 10-20 kilobases will be identified. 45 Although fine mapping in a population with reduced LD (eg Africans) might assist in identifying shorter associated regions, it is at this stage where extensive amounts of information about genome function are crucial. The diversity of methodologies for large-scale interrogation of the human genome for function is increasing; the resulting information will be very important for prioritising which of those associated DNA segments to focus on first.

Interpreting regulatory variation
The identification of the causal variant can benefit from incorporating information about genome function. Many studies to determine functionality within the human genome sequence are now in progress using high-throughput, genome-wide methodologies. The ENCyclopedia Of DNA Elements (ENCODE) project is the best example 58 of this type of study. The aim of this project is to attribute a functional identity to each nucleotide of the human genome. In its pilot phase, 1 per cent of the human genome (44 genomic regions) has been studied extensively for function, interspecies conservation and population genetic variation. The comprehensive analysis of these 44 regions will provide important clues for the pattern and structure of genome function and will allow predictions for the nature of variations behind complex disease and phenotypic variation. This, and other ongoing studies, will offer a first-pass annotation of functional elements in the human genome and will provide the framework for detailed characterisation of functional variation.
If an established and confirmed association of a region with disease and gene expression variation exists, and there is light annotation of the associated region for coding and non-coding elements, it is possible to apply brute force approaches to identify the specific DNA changes that are causal. A recommended strategy is to perform extensive resequencing of potentially functional segments of the region in high and low expressing individuals. The number of individuals required to be assayed depends on the magnitude of the functional effect and the predicted within-population frequency of the causal variant. This can be assisted by initial power calculations that allow prediction of what is likely to be identified, given the study design. The optimal approach seems to be to sample sequences from individuals at each of the two ends of the phenotypic (expression) distribution, and then proceed inwards.
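The sampling order suggested above (start at both ends of the expression distribution and move inwards) can be sketched as follows. The function name, the individual identifiers and the expression values are hypothetical.

```python
def resequencing_order(expression_by_individual):
    """Yield (low_expressor, high_expressor) pairs, starting at the two ends of the
    expression distribution and moving inwards.

    With an odd number of individuals, the median individual is never yielded.
    """
    ranked = sorted(expression_by_individual, key=expression_by_individual.get)
    low, high = 0, len(ranked) - 1
    while low < high:
        yield ranked[low], ranked[high]
        low += 1
        high -= 1


# Hypothetical expression levels of one gene in five individuals.
levels = {"ind1": 4.2, "ind2": 9.8, "ind3": 6.5, "ind4": 5.1, "ind5": 8.7}
for low_id, high_id in resequencing_order(levels):
    print(low_id, high_id)
```

Resequencing can stop as soon as the number of pairs indicated by the power calculation has been processed, so the most extreme individuals are always used first.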
As soon as a set of genomic segments has been resequenced, one should look for variants that appear to have equal or better correlation with the phenotype than that observed in the initial association study. This can be determined by genotyping all of the potentially functional variants (identified in the resequencing approach in a subset of the individuals) in the original complete sample. Depending on the strength of these correlations, and where the correlated variants are located, the appropriate approach should be adopted for direct functional testing of causal haplotypes. Such approaches can include reporter constructs, binding assays, RNA stability assays and chromatin modification assays using all of the alternative haplotypes.

[Figure caption, beginning missing: ... highly suggested that genetic variation that increases disease risk is also associated with gene expression variation of gene A (assuming that the associated SNPs and haplotypes are the same). This probably indicates that the disease risk is a regulatory effect and that the amount of transcript (or protein) of gene A is critical for the development of the disease.]

Summary
In this paper, some issues have been discussed that arise from the incorporation of gene expression variation data in disease studies. The overall message is that gene expression can greatly assist the discovery of disease variants, as well as the interpretation of the biological effects of causal variants. Further exploration of gene expression variation in more samples and more cell types will greatly enhance our understanding of both phenotypic variation in humans and the nature of regulatory variation and its impact on complex disease.