The genetics of regulatory variation in the human genome

The regulation of gene expression plays an important role in complex phenotypes, including disease in humans. For some genes, the genetic mechanisms influencing gene expression are well elucidated; however, it is unclear how applicable these results are to gene expression on a genome-wide level. Studies in model organisms and humans have clearly documented gene expression variation among individuals and shown that a significant proportion of this variation has a genetic basis. Recent studies combine microarray surveys of gene expression for thousands of genes with dense marker maps, and are beginning to identify regions in the human genome that have functional effects on gene expression. This paper reviews recent developments and methodologies in this field, and discusses implications and future directions of this research in the context of understanding the influence of human genomic variation on the regulation of gene expression.


Introduction
Gene expression in eukaryotic organisms is a complex trait influenced by genetic, environmental and epigenetic factors. Although there are many mechanisms of gene expression regulation (Figure 1), including chromatin condensation, alternative splicing, DNA methylation, transcription initiation, mRNA stability, translational controls, post-translational controls and protein degradation, transcription initiation is the most common point of control, with elements located both cis (proximal to the gene) and trans to the coding locus interacting to control initiation of transcription. 1 The simplest model of transcriptional regulation involves the binding of transcription factors (TFs) in a sequence-specific manner to TF binding sites, short stretches of DNA usually near a gene, thereby altering rates of transcription. Both the identity of the TFs present and their binding affinities play an important role in transcription initiation. Genetic mutations that alter the nucleotide sequence of a TF binding site, the nucleotide sequence of the transcript (eg affecting stability) or the transcription factor amino acid sequence are just some examples of the types of mutations that can have substantial effects on mRNA transcript levels.
The control and maintenance of appropriate levels of transcription for each of the genes expressed in a given cell type are vital cellular processes. Many human disorders are caused by molecular changes that have an impact at the level of gene expression. 2 Often, alterations in gene expression are extreme and involve either many-fold overexpression (eg Burkitt's lymphoma 3) or partial/complete loss of expression (eg alpha thalassaemia 4). Typically, these are monogenic or rare disorders where the effect is strong and the mutations are present at low frequency. In other instances, however, such as Type 1 diabetes, 5 subtle changes in expression can have small phenotypic consequences that are conditional on the genetic background of the individuals. For example, Eaves et al. 5 and Karp et al. 6 used differential gene expression patterns to map susceptibility genes. It is hypothesised that complex disorders are likely to be associated with gene expression variation, since susceptibility alleles have mostly quantitative rather than qualitative differences between individuals. 7 The use of gene expression variation as an endophenotype 8 between nucleotide polymorphism and disease susceptibility can prove useful in identifying the underlying genetic basis of complex disorders and designing appropriate models to test it. 9 Gene expression levels can thus be used as genetic markers which can help in linking nucleotide variation with a disease phenotype. A comprehensive study of segregating gene expression variation will provide an important starting point for the utilisation of gene expression as an intermediate level of phenotypic attributes between a nucleotide polymorphism and a complex disorder.

Challenges and approaches for identifying regulatory regions
One of the major challenges ahead is to identify the DNA sequences that carry the cis signals for regulation of the spatial and temporal expression of all of the genes in the human genome. 10-12 This is a difficult task because little is known about the structure of regulatory regions. One of the main limitations is that little is known about the functional units that comprise a cis regulatory region, and current knowledge is restricted to transcription factor binding sites, which may be only one part of the organisation of a regulatory region. Transcription factor binding sites are found throughout the genome, but the organisation and abundance requirements of such sites within regulatory elements are not yet clear. 13-15 Given the redundancy and short length of binding sites, it is expected that specificity is achieved through a higher order code of organisation. This code therefore needs to be deciphered before regulatory regions can be identified reliably in the genome. Current knowledge of regulatory sequence organisation lacks uniform and easily interpretable sequence characteristics, such as the open reading frame (ORF) and the splicing signals in coding sequences. In other words, we are missing information about the code that the cell recognises in order to identify cis regulatory regions in the raw sequence data. Moreover, the dimensionality of this code is currently unknown.
Experimental identification of variation in regulatory sequences currently relies on extensive use of model systems such as yeast, mouse and cultured human cell lines. Studies to detect differences in gene expression caused by known promoter polymorphisms have generally used reporter gene assays with allele-specific promoter constructs; 16,17 however, these experiments have several limitations that make them impractical for whole-genome analyses. First, they require knowledge of the promoter region and candidate functional variants to test; at present, there are experimentally validated promoters for less than 10 per cent of human genes, 18 and the properties of long-distance regulators remain unknown and untested. Secondly, these methods are indirect and performed either in vitro or in cellular conditions (tissue, developmental stage, environmental stimuli) distant from the tissue context. 19,20 Therefore, any inference about the potential regulatory role of a sequence relies on the assumption that the experimental system is similar to the in vivo conditions. Thirdly, experiments that target candidate regulatory regions based on computational predictions are performed without previous knowledge of the target gene (for example, distant enhancers and suppressors) and therefore the significance of the identified sequence for genome function cannot be fully revealed. Finally, experiments that make use of the proper genomic and cellular context are intensive and slow when intended for large-scale analysis, 21,22 so for practical reasons they cannot be applied to all of the genes of the human genome.
Detailed functional experiments to elucidate mechanisms of gene regulation have typically been carried out at the level of individual genes or sets of genes. Although the mechanism of gene expression is documented in great detail for some genes (HOX genes and the genes encoding β-globin, α-globin etc), it is unknown how transferable these results are to the whole genome. With the development of microarray technologies, it is now feasible to quantify the transcript abundance of thousands of genes simultaneously and efficiently in a single experiment. These technologies have medical applications, for example in identifying genes that are differentially expressed in a disease state versus non-disease controls, 23 in classifying disease into subtypes 24 and in examining differences in transcript profile among different tissues or organs. 25 More recently, researchers have begun to use these technologies to quantify naturally-occurring variation in gene expression for many genes among multiple 'non-diseased' individuals of a species.

The genetic basis of gene expression variation
Large-scale surveys of gene expression variation in humans can provide important baseline information about 'normal' naturally-occurring variation among individuals. These data can be used to assess the significance of variation observed in experimental studies where groups of individuals (eg disease versus non-disease controls) are compared. In addition to being useful to the medical community, these studies will fundamentally increase our understanding of the causes of naturally-occurring gene expression variation. The total phenotypic variance among individuals ($V_P$) for any trait can be broken down into a component of variance due to genotype ($V_G$), a component due to environment ($V_E$) and a component due to different genotypes in different environments ($V_{GE}$), according to the following equation:

$$V_P = V_G + V_E + V_{GE}$$

The genetic component of phenotypic expression variation reflects interindividual genetic differences that result in interindividual expression differences.
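The variance components defined above lead directly to the standard quantitative-genetic definition of broad-sense heritability, which is the kind of quantity the heritability surveys discussed in this review aim to estimate. The expression below is a textbook definition given for orientation; the symbol $H^2$ is introduced here and is not used elsewhere in the text, and the individual studies cited may estimate heritability with different (for example, narrow-sense or pedigree-based) procedures.

```latex
% Broad-sense heritability: the fraction of total phenotypic variance
% attributable to genotype, using the components defined in the text.
\[
  H^2 \;=\; \frac{V_G}{V_P} \;=\; \frac{V_G}{V_G + V_E + V_{GE}}
\]
% Illustration only: a transcript with H^2 = 0.34 would have 34 per cent of its
% expression variance attributed to V_G, leaving 66 per cent to V_E + V_{GE}.
```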
Little is known about the genetic basis of natural variation in gene expression, and many questions of fundamental biological importance remain open. Recent work in model organisms and humans has begun to address these and other questions. Studies in model organism systems have documented significant, naturally-occurring variation in gene expression among individuals, including yeast, 26,27 Drosophila 28,29 and mouse, 30-33 and additional studies have made similar observations in fish, 34,35 maize, 33 primates 36 and humans. 37-42 As it has become accepted that naturally-occurring variation in gene expression among individuals is a common phenomenon, focus has shifted toward trying to quantify the contribution of genetic factors to that variation and to locate the responsible genomic regions.
Yan et al. 42 were among the first to demonstrate a genetic component of expression variation in humans. For six of 13 loci surveyed in 96 Centre d'Etude du Polymorphisme Humain (CEPH) 43 individuals, they observed significant differences in mRNA transcript abundance for the two alleles of heterozygous individuals (allelic imbalance). Furthermore, when families of individuals exhibiting allelic expression differences were examined, one-third of them showed expression patterns consistent with underlying Mendelian inheritance of functional variants. Other recent studies of allelic imbalance in humans and mice provided similar evidence for a functional genetic influence separate from that attributable to imprinting. 18,30,31 In a large-scale microarray study, Cheung et al. 38 provided further evidence of familial aggregation of expression profiles. The authors surveyed genome-wide patterns of gene expression in immortalised lymphoblastoid cells of humans and identified a set of genes whose transcript level varied greatly among 35 unrelated CEPH individuals. To determine whether the variation was influenced by genetic differences segregating among individuals, mRNA transcript levels of the most variable genes were quantified in several samples of individuals of different degrees of genetic inter-relatedness, including a sample of 49 unrelated CEPH individuals (the 35 individuals mentioned above plus an additional 14), offspring from five CEPH families and ten pairs of monozygotic twins. The authors observed that genes exhibited less variability in transcript abundance in more closely related individuals, suggesting a heritable component of gene expression variation among individuals.
Some studies have gone a step further and used large-scale studies to estimate the percentage of genes that exhibit significant heritability. In a study of gene expression in lymphoblastoid cell lines of CEPH pedigrees, Schadt et al. 33 reported extensive differences among 56 individuals of four CEPH families in mRNA transcript levels, and through heritability analyses were able to estimate that approximately 29 per cent of these genes had a genetic component influencing these levels. Monks et al. 39 followed up this study with a massive survey of expression of 23,499 genes in 167 individuals of 15 CEPH families. Of the detected genes, 31 per cent exhibited significant heritability (false discovery rate 0.05), with a median heritability of 0.34.
The above studies in human and other species demonstrate that gene expression is an abundantly variable phenotype with a genetic component; thus, gene expression (or mRNA transcript level among individuals) can be considered as a quantitative trait. In general, quantitative traits exhibit continuous phenotypic variation among individuals, and the genetic component of that variation is often due to contributions of more than one locus. By combining microarray quantification of gene expression among individuals with marker genotype data (eg single nucleotide polymorphisms; SNPs) for the same individuals, it has become possible to map the genomic regions containing factors responsible for natural variation in human gene expression by performing association analyses. In these analyses, first referred to as 'genetical genomics', 44 transcript abundance of each of thousands of genes is treated as a quantitative phenotype 9 that is under genetic control. Association analyses are used to map functional regulatory regions by associating genotype at an individual marker locus with the expression of each gene (Figure 2A and 2B). These methods differ from family-based linkage analysis, which traces genotypes and phenotypes of related individuals, looking for polymorphisms that co-segregate with the phenotype (Figure 2C).
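As a rough illustration of the association step described above, the following minimal Python sketch regresses the expression level of one gene on the genotype at one marker. It assumes an additive coding of genotypes (0, 1 or 2 copies of one allele), which is one common choice and not necessarily the model used in the studies cited; the function name `additive_association` and the toy data are hypothetical.

```python
from statistics import mean

def additive_association(genotypes, expression):
    """Least-squares fit of expression on allele count (0, 1 or 2).

    Returns (slope, intercept, r_squared). The slope estimates the change in
    transcript level per copy of the counted allele; r_squared is the share of
    expression variance explained by the marker in this sample.
    """
    if len(genotypes) != len(expression) or len(genotypes) < 3:
        raise ValueError("need matched genotype and expression values for >= 3 individuals")
    g_bar, e_bar = mean(genotypes), mean(expression)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("marker is monomorphic in this sample")
    sxy = sum((g - g_bar) * (e - e_bar) for g, e in zip(genotypes, expression))
    syy = sum((e - e_bar) ** 2 for e in expression)
    slope = sxy / sxx
    intercept = e_bar - slope * g_bar
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared

# Hypothetical toy data: six individuals, one SNP, one normalised transcript level.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, intercept, r_squared = additive_association(genotypes, expression)
print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r_squared:.3f}")
```

In a genome-wide setting this fit would be repeated for every gene and marker pair and followed by a significance test, so the slope and r-squared here are only the per-test summary, not a measure of significance.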

Progress through use of genome-wide association mapping
One of the advantages of a large-scale association approach is that it may be possible to identify functionally important regulatory variants without requiring any previous knowledge about specific cis or trans regulatory regions. Because these methods link expression variation of a particular gene to the genomic sequences directly or indirectly affecting it, there is a causal connection between phenotype and genotype. The identification of many significant associations between markers and individual gene expression phenotypes will allow researchers to address the issue of the relative proportion of cis or trans regulatory variation for each of many thousands of genes. These studies also have the potential to identify sets of genes exhibiting correlated expression patterns and may identify clusters of regulators of multiple genes, suggesting networks of co-regulated genes. Furthermore, because these methods look at the effects of naturally-occurring alleles, they may be able to identify regulatory regions that have subtle effects, as opposed to the large effects generally observed in knockout experiments. Some recent work applying linkage and association analyses to expression variation in humans has led to the identification of regions of the genome influencing observed variation. A recent study using microarrays to measure gene expression variation 40 employed genome-wide linkage analysis to map regions influencing gene expression in immortalised B cells of 14 CEPH families (all parents and a mean of eight offspring per sibship). The authors performed linkage analyses for the expression phenotype of 3,554 genes (observed to be highly variable in a sample of 94 CEPH grandparents) and the genotypes of 2,519 SNP markers in the same individuals. They identified nearly 1,000 genes exhibiting significant linkage.
Of the 142 genes with the strongest evidence for linkage, 110 (77.5 per cent) had only a trans-acting regulator, whereas 27 (19 per cent) had only a cis-acting transcriptional regulator (defined in this study as within 5 megabases of the target gene). Interestingly, they identified regions that were hotspots of transcriptional regulation, where there were clusters of SNPs with strong linkage to the expression phenotype of multiple genes. A quantitative transmission disequilibrium test performed on 17 of the 27 phenotypes displaying significant cis linkage identified 14 phenotypes exhibiting both significant linkage and association. A regression-based association analysis of the same 17 phenotypes in 94 CEPH grandparents confirmed significant association between the same 14 gene expression levels and an SNP located within or near the gene. Additional surveys in humans, mice and maize have confirmed that genetic variation located cis to the locus in question has functional effects on the transcript level of that gene. 18,30,33 It is essential to point out that the distinction between cis and trans effects becomes less clear, and sometimes problematic, when one looks at genome-wide expression data. If one is taking a gene-centric view of the genome and is interested in the proportion of genes that have cis or trans regulatory variation and the relative contribution to genetic variance per gene, then it is appropriate to use this distinction because the view of the data remains gene-centric. If, however, one is interested in the overall contribution of genetic variation to gene expression variation as a whole-genome property, then the terms cis and trans become irrelevant, since all genetic variation of any nature (amino acid, transcript or cis-regulatory region; Figure 1) is mapped to unique locations in the genome. There is no such thing as a genetic variant trans to a whole genome because all genetic variation is encoded in the DNA.
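The gene-centric cis/trans distinction discussed above can be made concrete with a small sketch. It assumes that the 5-megabase definition quoted in the text means a marker on the same chromosome and no more than 5 megabases from the target gene; the function name, coordinate convention and example positions are hypothetical.

```python
CIS_WINDOW_BP = 5_000_000  # 5 megabases, the cis definition quoted in the text

def classify_regulator(marker_chrom, marker_pos, gene_chrom, gene_pos,
                       window_bp=CIS_WINDOW_BP):
    """Label a linked marker as 'cis' or 'trans' relative to one target gene.

    Positions are base-pair coordinates on the named chromosome. Anything on a
    different chromosome, or farther away than window_bp, is called trans.
    """
    same_chrom = marker_chrom == gene_chrom
    if same_chrom and abs(marker_pos - gene_pos) <= window_bp:
        return "cis"
    return "trans"

# Hypothetical coordinates for illustration only.
near = classify_regulator("chr1", 1_000_000, "chr1", 4_000_000)
far = classify_regulator("chr1", 1_000_000, "chr1", 9_000_000)
other = classify_regulator("chr2", 1_000_000, "chr1", 1_000_000)
print(near, far, other)
```

The label depends entirely on the chosen window and on which gene is taken as the reference point, which is why the distinction only makes sense in the gene-centric view described in the paragraph above.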

Considerations
The massive amounts of data produced in microarray experiments require some significant statistical considerations. First, it is necessary to assess the quality of the measurements reliably and omit low-quality data from the primary analysis. Normalisation methods are then applied to the data to adjust for any sources of variability due to the experiment (different arrays, hybridisation efficiency differences, mRNA preparation differences etc) that may interfere with detecting those differences that reflect real biological variability. These methods are data transformations, and there are many procedures to choose from, some of which may be more relevant to certain microarray platforms and raw data distributions. The normalised microarray data and marker genotypes can then be subjected to association analyses, in which the genotype of the individuals is the primary classification variable and the response variable is the normalised transcript level of each gene (Figure 2B). Because these procedures test for association between each gene expression phenotype and many marker genotypes, the threshold for assessing significance must be adjusted to control for the massive amount of multiple testing inherent in testing each gene.
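To give a sense of what adjusting the significance threshold involves, the sketch below shows two widely used corrections: a Bonferroni threshold and the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. Neither is claimed to be the procedure used in any particular study cited here. The function names and the toy p-values are hypothetical, and the test count simply multiplies the 3,554 expression phenotypes by the 2,519 markers mentioned earlier, purely to illustrate scale.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure.

    Sort the p-values, find the largest rank k with p_(k) <= fdr * k / m,
    and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Scale illustration: every expression phenotype tested against every marker.
n_tests = 3554 * 2519
strict = bonferroni_threshold(0.05, n_tests)

# Hypothetical toy p-values from ten gene-marker tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, fdr=0.05)
print(n_tests, strict, significant)
```

Bonferroni becomes very conservative at this number of tests, which is one reason false discovery rate control is often preferred for expression data.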
Can these methods lead to the identification of the individual nucleotides responsible for naturally-occurring variation in gene expression in humans? Regulatory variant identification in humans is complicated by the non-random association of alleles at different loci (linkage disequilibrium [LD]) in the human genome. In one of the available human cell line populations, the CEPH pedigrees, for example, on average the LD is high. 45 If LD is high among markers for a region showing a very strong association between genotype and expression phenotype, it can be difficult to pinpoint the causal functional variant, as multiple variants covering a large region might all exhibit the same strong association. In cases like these, it may be possible to narrow down the length of the responsible genomic region by generating a map with a high local marker density, effectively identifying markers that are not perfectly correlated with each other. Alternatively, fine-scale mapping may be facilitated by examining several populations that exhibit different patterns of LD. Finally, sequencing the region around a strongly associated marker may permit identification of the responsible regulatory variant that is in LD with the associated marker. At this stage, individual nucleotide variants linked to the marker are tested for association with the phenotype. Candidate regulatory regions (and variants) can be tested with experimental procedures to determine their potential to modulate gene expression.
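The notion of markers being more or less correlated with one another can be quantified with the squared correlation coefficient r-squared, a standard pairwise LD measure. The sketch below computes it from phased haplotypes at two biallelic markers; it assumes phase is known and alleles are coded 0/1, and the function name and toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """Pairwise LD (r^2) between two biallelic markers.

    haplotypes: list of (allele_at_marker_a, allele_at_marker_b) tuples,
    each allele coded 0 or 1, one tuple per phased chromosome.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n        # frequency of allele 1 at marker A
    p_b = sum(b for _, b in haplotypes) / n        # frequency of allele 1 at marker B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                           # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one marker is monomorphic")
    return d * d / denom

# Two markers that always travel together: indistinguishable by association.
perfect = ld_r_squared([(1, 1), (1, 1), (0, 0), (0, 0)])
# Two markers whose alleles occur independently: separable by association.
independent = ld_r_squared([(1, 1), (1, 0), (0, 1), (0, 0)])
print(perfect, independent)
```

Markers with r-squared near 1 will show nearly identical association signals, so denser maps or populations with different LD patterns help only insofar as they supply markers with lower pairwise r-squared around the causal variant.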
In conclusion, the availability of genotyped (or nearly genotyped) human pedigrees (CEPH and other populations; Coriell's repositories; the International HapMap Consortium 45), as well as more sensitive and less expensive microarray technologies for gene expression and genotyping, means that the time is right for carrying out large-scale genome-wide association studies. This will contribute greatly to our understanding of the genetic basis of complex phenotypes in human populations, and may lead to novel diagnostics, preventative methods and therapeutics for human disease.