Immunogenomics: Molecular hide and seek

Similar to other classical science disciplines, immunology has been embracing novel technologies and approaches giving rise to specialised sub-disciplines such as immunogenetics and, more recently, immunogenomics, which, in many ways, is the genome-wide application of immunogenetic approaches. Here, recent progress in the understanding of the immune sub-genome will be reviewed, and the ways in which immunogenomic datasets consisting of genetic and epigenetic variation, linkage disequilibrium and recombination can be harnessed for disease association and evolutionary studies will be discussed. The discussion will focus on data available for the major histocompatibility complex and the leukocyte receptor complex, the two most polymorphic regions of the human immune sub-genome.


Introduction
The ongoing 'hide and seek' between pathogens and their host immune systems has led to a molecular arms race which makes immunology one of the most interesting, but also one of the most complex, areas of research. By contrast with the harmless children's game, the outcome of this molecular race determines health versus disease and, in many cases, survival versus death. A prerequisite for immunogenomic research is the availability of comprehensive and fully informative sequence variation and gene maps. A high-quality sequence map of the human genome and successive generations of annotation have been available since 2001. 1,2 Using this sequence as a reference, global and population-specific variation maps of increasing resolution were subsequently generated by the International SNP Consortium and HapMap Projects. 3,4 Because of the extreme polymorphism encountered in some immune regions, separate efforts were focused on regions such as the major histocompatibility complex (MHC; reviewed by Allcock et al. 5), the leukocyte receptor complex (LRC; reviewed by Yawata et al. 6) and others, to complement the global variation map. In addition, several specialised databases, including the Immuno Polymorphism Database 7 and the International Immunogenetics Information System, 8 catalogue many of the classical immune genes and their allelic variants. Using immune ontology definitions, 9 the first comprehensive gene map and database of the human immune sub-genome was recently reported to consist of 1,562 genes (about 7 per cent of human genes) that are distributed at varying densities across all chromosomes except the Y chromosome. 10 In the following review, some of the benefits and caveats of using a holistic approach to immunogenomic analysis will be discussed with respect to genetic and epigenetic variation and linkage disequilibrium (LD).

Genetic variation
Based on both experimental data 11-15 and inferences from population genetics data, 16-18 there is evidence that recombination is not uniformly distributed across the human genome. In addition to the heterogeneity of recombination rates, it was suggested that the human genome could be subdivided into relatively short fragments with little or no evidence of historical recombination. 19 Under this premise, a genome-wide LD map is being constructed by the International HapMap Project, which aims to generate a resource of DNA variation in human populations with different ancestry. 3 In its first phase, which is close to completion, genotyping data at a density of one single nucleotide polymorphism (SNP) per five kilobases (kb) will provide enough resolution to portray the LD landscape of the genome, including regions of the immune sub-genome.
LD maps of complete chromosomes are already available; 20-22 however, detailed LD patterns at high SNP density covering large chromosomal regions encoding immune-related gene clusters have long been awaited. LD data have so far been reported for two such clusters: the LRC and the MHC. Within the LRC, the haplotypic nature of the killer immunoglobulin-like receptor (KIR) gene cluster causes a DNA size variation of 120-200 kb as a consequence of the presence/absence of genes in some haplotypes. 23-26 This feature, in addition to intron similarity owing to recent duplications, restricts the availability of informative SNPs for LD studies 27 and poses particular restrictions on large-scale genotyping protocols. Consequently, a detailed LD map for the LRC is not available. For the MHC, a comprehensive LD map is available 16 as a direct consequence of existing SNP resources. 28 Historically, considerable attention has been paid to the MHC region, resulting in a wealth of valuable information in a variety of areas: the origin and maintenance of nucleotide diversity in classical human leukocyte antigen (HLA) genes; gene conversion, average recombination rates, and the presence and distribution of recombination hot spots; and extended LD in long-range haplotypes. 11,14,15,29-31 The high-resolution LD map available for the MHC permits the detailed study of allelic correlation, which can be particularly useful for the selection of tagSNP sets for disease association studies. Since the history of recombination between two SNPs can be estimated using D′, DNA segments showing no evidence of historical recombination have been defined as LD-blocks. 19 Consequently, haplotypes within such LD-blocks are likely to share a common ancestral haplotype, and the genealogy of an LD-block might be different from the genealogy of neighbouring LD-blocks.
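To make the role of D′ in defining LD-blocks concrete, the following is a minimal sketch of how D, D′ and r² can be computed for a pair of biallelic SNPs from phased haplotypes. The function name, the two-character haplotype encoding ("A"/"a" for the first SNP, "B"/"b" for the second) and the toy counts are assumptions made for illustration; they are not taken from the studies cited above.

```python
from collections import Counter


def ld_stats(haplotypes):
    """Return (D, D_prime, r_squared) for two biallelic SNPs.

    `haplotypes` is a list of two-character strings, one per phased
    chromosome, e.g. "AB", "Ab", "aB" or "ab" (hypothetical encoding).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_ab = counts["AB"] / n                      # frequency of the A-B haplotype
    p_a = (counts["AB"] + counts["Ab"]) / n      # frequency of allele A
    p_b = (counts["AB"] + counts["aB"]) / n      # frequency of allele B

    d = p_ab - p_a * p_b                         # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    variance_term = p_a * (1 - p_a) * p_b * (1 - p_b)
    if d_max == 0 or variance_term == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")

    d_prime = abs(d) / d_max                     # normalised to the range 0..1
    r_squared = d * d / variance_term            # squared allelic correlation
    return d, d_prime, r_squared


# Toy sample (invented counts): only three of the four possible haplotypes occur.
sample = ["AB"] * 50 + ["aB"] * 10 + ["ab"] * 40
d, d_prime, r_squared = ld_stats(sample)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r_squared:.3f}")
```

In this toy sample one of the four possible haplotypes is absent, so D′ reaches its maximum even though r² does not. This is why D′ is read as a signal of the absence of historical recombination, whereas r² is the more relevant quantity when one SNP is used as a proxy for another.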
Phylogenetic analysis of haplotypes observed in an approximately 110 kb LD-block (rs3115569-rs2022533) at the MHC class II-III boundary, the region of lowest recombination rate in the MHC (0.0093 centimorgan [cM]/megabase [Mb]), surrounded by two recombination hot spots, 16 showed that the pattern of nucleotide differences between haplotypes within this LD-block is difficult to account for by mutation only, but could readily be explained by recombination events (Miretti et al., unpublished). Following detailed inspection of the haplotype nucleotide sequence alignment at variable positions (Figure 1B), putative recombination sites can be identified. The subsequent splitting of the LD-block at a recombination site (Figure 1C) resulted in haplotypic networks presenting distinct clustering in both fragments (Figure 1D), which were also dissimilar from that of the complete LD-block. While it needs to be examined in a larger sample of LD-blocks from diverse populations, the presence of recombination within LD-blocks suggests that the occurrence of LD between two SNPs might not be sufficient to unambiguously detect historical recombination beyond the history of the sample population where LD is being measured. Recombination hot spots within LD-blocks (that is, hot spots that have not left an imprint on LD) have recently been postulated to be either old hot spots en route to extinction or too young to leave a mark on haplotype diversity in Europeans. 18 Additional experimental data based on sperm cell recombination analysis would help to confirm the signature of recombination identified here and to determine if it is an example of a potentially evolutionarily very young hot spot. 18
Overall recombination rates within the MHC are known to be lower than across chromosome 6 and the genome average, 34,35 and regions of relatively high recombination rates (hot spots) and low recombination rates (cold spots) have been described in males based on microsatellite typing. 15 The fine-scale experimental recombination analysis in a 200 kb segment within the MHC class II region revealed that a few hot spots shape the distribution of LD throughout the segment, that hot spots tend to be delimited to short DNA segments (usually < 5 kb) and that a portion of the recombination is contributed by gene conversion. 11,31,36,37 The question therefore arises as to whether the recombination pattern observed in this segment can be extended to the complete MHC and, ultimately, if it represents a general characteristic of the genome. The heterogeneous model of recombination seems to be a general feature of the genome, 17 where the landscape is moulded by the local distribution and intensity of recombination. According to the recombination rates inferred from population genetic data, 16 the experimentally verified recombination hot spots in the 200 kb segment of the MHC class II region 11 might not represent a paradigm for the entire MHC, as their intensities are approximately 2-100 times higher than hot spots in the rest of the MHC. Figure 2 shows the uneven distribution of hot spots across the complete MHC, integrated with gene annotation, LD-blocks and tagSNP distribution, as represented by the GLOVAR genome browser (www.glovar.org), and also that the hot spot intensity is far from being uniform. Additional large-scale experimental evidence of hot spot intensity and distribution is required to more accurately assess the concordance with predictions based on population genetic data. Importantly, hot spot features will significantly influence the selection of tagSNPs for disease association studies involving any particular region of the immune sub-genome.
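The earlier point that haplotype patterns inside an LD-block can betray historical recombination can be illustrated with the classical four-gamete test. This is a minimal sketch of a different and much simpler technique than the median-joining network analysis described above: the function name and the 0/1 haplotype encoding are assumptions for illustration, and the test presumes biallelic sites with no recurrent mutation.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """List the site pairs at which all four two-site haplotypes occur.

    `haplotypes` is a list of equal-length strings, one character per
    biallelic SNP (hypothetical encoding: "0" and "1"). Under the
    infinite-sites assumption, observing all four combinations at a pair
    of sites implies at least one recombination event between them.
    """
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


# Toy alignment (invented): sites 0-1 and sites 2-3 are each fully
# correlated, but the two halves occur in every combination.
toy_haplotypes = ["0000", "0011", "1100", "1111"]
print(four_gamete_violations(toy_haplotypes))
```

Every violating pair in the toy alignment straddles the boundary between the second and third sites, which is the kind of pattern that would suggest splitting the block there before building separate networks for the two fragments.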
Figure 1. [...] This region also represents a recombination cold spot (with recombination rates = 0.00931 centimorgans/megabase, ten times lower relative to that on the neighbouring area) surrounded by two recombination hot spots identified in NOTCH4 and C6ORF10 genes. 16 (B) Ten haplotypes (rows) observed in this sample, which were recognised based on alleles at 68 variable positions (SNPs, in columns) within the selected LD-block. Haplotype names and frequencies (per cent) are given on the left. (C) A break in the haplotype alignment suggesting the presence of ancestral recombination within the LD-block. Haplotypes were then split at a potential recombination site into two sub-regions (with 41 and 27 SNPs, respectively) aiming to construct a phylogenetic network in order to check for consistency. (D) The distribution of variation among haplotypes as represented by median-joining trees obtained for both sub-regions employing the NETWORK program. 33 Haplotypes F and D are grouped closely together within each network but belong to different clusters when comparing networks from the two sub-regions. The number of substitutions involved in these branching differences is better explained by recombination events than by mutations occurring between haplotypes. The distribution pattern of these differences adds consistency to this view and strongly suggests the presence of recombination. The circle size is proportional to the haplotype frequency, and the number of substitutions between nodes is represented by the number of lines between them. The site where the haplotypes were split can be moved a few positions to either side of the current site without significantly modifying the topology of the network.
From the detailed distribution of recombination and LD across the MHC, it would be possible to exploit the common variation observed in approximately 80 per cent of the MHC sequence, which is in high LD, to conduct association studies. Even with the current SNP density (1 SNP/1.9 kb), however, between 10 and 27 per cent of the DNA sequence in the MHC sub-regions is contained in regions of low LD, where tagSNP efficiency (the number of genotyped markers divided by the number of tagging SNPs 38) would not entail any benefit, meaning that almost every SNP would need to be typed. Figure 2 illustrates the dependence of the tagSNP distribution on the SNP density and LD pattern, which is particularly variable across the MHC. Long regions of strong LD are common in the extended class I sub-region, where 75 tagSNPs are sufficient to recover haplotypic information provided by 408 SNPs. Conversely, 204 tagSNPs are necessary to recover the equivalent information from 466 SNPs in the class II sub-region, where the LD pattern is more interrupted and the decay of LD is more pronounced. 16 Also, the tagSNP sets are derived from studies which mostly dismiss loci of low minor allele frequency (MAF < 5 per cent), which seem to be present at increased frequency in low LD regions, 39 thus excluding rare variants from the analyses. This raises the question of whether failure to detect disease variants simply results from the exclusion of SNPs with rare alleles from the analysis (discussed below). Furthermore, tagging effectiveness (the proportion of 'hidden' SNPs being detected in LD with SNPs from the tagging set) can be substantially increased by incorporating SNPs with rare alleles (MAF < 5 per cent); this effect was even more prominent in low LD regions. 39
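As a worked illustration of tagSNP efficiency as defined above (genotyped markers divided by tagging SNPs), the short sketch below applies the definition to the two sub-region counts quoted in the text. The function and variable names are hypothetical; only the four SNP counts come from the text.

```python
def tagging_efficiency(n_genotyped, n_tag_snps):
    """Genotyped markers captured per tagSNP (higher means more saving)."""
    if n_tag_snps <= 0:
        raise ValueError("at least one tagSNP is required")
    return n_genotyped / n_tag_snps


# (genotyped SNPs, tagSNPs) as quoted for the two MHC sub-regions.
sub_regions = {
    "extended class I": (408, 75),
    "class II": (466, 204),
}

for name, (n_genotyped, n_tag_snps) in sub_regions.items():
    efficiency = tagging_efficiency(n_genotyped, n_tag_snps)
    print(f"{name}: {efficiency:.2f} genotyped SNPs per tagSNP")
```

On these figures each tagSNP in the extended class I sub-region stands in for roughly five genotyped SNPs, against roughly two in the class II sub-region. An efficiency close to 1 corresponds to the low-LD situation in which almost every SNP has to be typed.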

Epigenetic variation
Much less is known about epigenetic variation, both in terms of causality and function. The most frequent and stable form of epigenetic variation is differential DNA methylation. DNA methylation, which was first discovered in 1948, occurs naturally at the carbon-5 position of cytosine (5-methylcytosine) at CpG dinucleotides. 40 In subsequent years, it was proposed that DNA methylation played an important role in the regulation of gene expression 41,42 and disease aetiology, particularly cancer. 43 The discovery of CpG islands (sequences enriched in cytosine-guanosine dinucleotides) suggested candidate regions in the genome for epigenetic modulation. 44 DNA methylation occurs predominantly at CpG sites 45 and, in addition to gene regulation, is also involved in phenomena such as X-chromosome inactivation in female mammals, parent-of-origin-specific, mono-allelic gene expression (imprinting) and epigenetic reprogramming in mammalian development. 46 The key technology for detecting 5-methylcytosine by DNA sequencing was developed in 1992, and is known as bisulphite sequencing. 47 The need to consider epigenetic variation alongside genetic variation in the context of disease has been highlighted by the finding of high discordance rates in monozygotic twin studies and numerous other studies, confirming that epigenetic factors play a decisive role in the aetiology of virtually all human pathologies (reviewed by Robertson and Wolffe). 48 Therefore, when the SNP Consortium announced in 1999 that it would generate a first-generation genetic variation map of the human genome, 49 the time and opportunity seemed right to generate a first-generation epigenetic (methylation) variation map alongside, which is what the Human Epigenome Consortium announced that it would do. 50
One aspect of the Human Epigenome Project (HEP) that is of particular interest in the context of this review is the aim and ability to identify methylation variable positions (MVPs), which, together with SNPs, promise to significantly advance our ability to understand and diagnose human disease. MVPs are defined as differentially methylated CpG sites that have the statistical power to discriminate, for example, between biological states such as active versus inactive or healthy versus diseased.
As a result of the HEP and other studies, a detailed genomic methylation map is already available for the MHC 51 and models for the epigenetic control of the KIR expression repertoire have been proposed (reviewed by Uhrberg 52). For the MHC, methylation profiles have been generated for essentially all expressed genes, demonstrating tissue-specificity (eg for the C2 locus) and inter-individual heterogeneity (eg for the TNF locus). Previously, it has been shown that transcription profiles can be associated with specific haplotypes, 53 with epigenetic states 54 or even with specific epi-alleles in the absence of DNA variation. 55 Overall, the methylation profile of the MHC appears to be strongly bimodal, with good correlation between hypo- and hypermethylation within the upstream regions of genes and their transcriptional activity (Figure 3). For the LRC, fewer data are available and are mostly restricted to the KIR genes. KIR genes exhibit a high degree of polymorphism and are expressed in a clonally restricted fashion. 56 Each cell expresses a mono-allelic repertoire which is highly variable with respect to number and combination of receptors. KIR genes are regulated by three types of promoters, corresponding to the three different modes of expression: one for KIR2DL4, which is constitutively expressed; one for KIR3DL3, which is expressed at a very low level; and one which is common to all clonally distributed KIR genes. 57 Given the absence of promoter variability among the clonally distributed KIR genes, it is, in fact, likely that the variegated expression of these genes is mainly regulated epigenetically, and a model has been proposed to this effect. 52 The model proposes four stages of sequential DNA and histone modifications leading to the observed mosaicism of clonally restricted KIR expression but requires some still unproven processes, including locus- and allele-specific DNA demethylation.
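The correlation between upstream methylation and transcriptional activity reported for the MHC can be tabulated along the lines of the grouping used for Figure 3. The sketch below is only an illustration of that kind of cross-tabulation: the expression cut-off of 200 is the one quoted in the Figure 3 caption, whereas the 0.5 methylation cut-off, the record layout and the toy values are assumptions and do not come from the Human Epigenome Project data.

```python
from collections import Counter

EXPRESSION_THRESHOLD = 200  # cut-off quoted in the Figure 3 caption


def cross_tabulate(genes, methylation_cutoff=0.5):
    """Count genes by 5'-UTR methylation state and expression state.

    `genes` is a list of dicts with hypothetical keys "utr_methylation"
    (fraction methylated, 0..1) and "expression" (arbitrary units).
    The 0.5 methylation cut-off is an assumption for illustration, and a
    value of exactly 200 is treated here as not expressed.
    """
    table = Counter()
    for gene in genes:
        utr_state = ("methylated"
                     if gene["utr_methylation"] >= methylation_cutoff
                     else "unmethylated")
        expression_state = ("expressed"
                            if gene["expression"] > EXPRESSION_THRESHOLD
                            else "silent")
        table[(utr_state, expression_state)] += 1
    return table


# Invented toy records, not real measurements.
toy_genes = [
    {"name": "gene1", "utr_methylation": 0.05, "expression": 850},
    {"name": "gene2", "utr_methylation": 0.10, "expression": 430},
    {"name": "gene3", "utr_methylation": 0.92, "expression": 40},
    {"name": "gene4", "utr_methylation": 0.88, "expression": 120},
    {"name": "gene5", "utr_methylation": 0.15, "expression": 90},
]

for (utr_state, expression_state), count in sorted(cross_tabulate(toy_genes).items()):
    print(f"5'-UTR {utr_state:12s} {expression_state:9s} {count}")
```

A strongly bimodal profile would concentrate the counts in the unmethylated/expressed and methylated/silent cells; the off-diagonal cells are the ones worth inspecting for tissue-specific or inter-individual effects.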
Compared with genetic data, the ability to analyse and interpret epigenetic data is still fairly simplistic, although enormous progress has been made, particularly in the past few years. What is becoming increasingly clear, however, is that the two are inextricably intertwined, and efforts towards an integrated (epi)genetic approach to common disease are well underway. 58,59
Figure 3. Correlation between DNA methylation and gene expression. DNA methylation data derived from the Human Epigenome Pilot Project were grouped according to the 5′-UTR being methylated or unmethylated and the body of the gene being methylated or unmethylated. Each group was then plotted against the corresponding Genomics Institute of the Novartis Research Foundation (GNF) (http://symatlas.gnf.org/SymAtlas) expression level, with anything below 200 treated as not being expressed and anything above 200 as being expressed. This cluster analysis shows that genes with methylated 5′-UTRs are silent and genes with unmethylated 5′-UTRs are expressed. In this study, no correlation was found between expression and the methylation level within the body of the gene. Reproduced from Rakyan et al. 51

Application of immunogenomic data for disease association studies
Major progress in genotyping technology has improved our ability to harvest genotypic information from a comprehensive number of loci and has laid down the foundation for the genome-wide common variation survey, 3 enhancing our understanding of human genetic variation. Current genotyping costs, however, make genome-wide association studies still challenging for the large sample sizes that will be required for adequately powered studies. The interdependencies of marker numbers, allele frequency, LD, cohort size and power in disease association studies have recently been reviewed 60-62 and will therefore not be discussed in depth here. Essentially, genome-wide studies are still restricted by the allelic spectra underlying complex diseases.
The frequency distribution of the disease variants and the proportion of the trait variance for which they are responsible determine the potential power of genetic association studies and, therefore, the feasibility of the study. Disease susceptibility variants associated with Mendelian disorders tend to be modelled by purifying selection and present low population frequencies (< 1 per cent). 63 Thus, rare alleles (also including low-frequency, mildly deleterious variants together with those alleles in high LD) might be under-represented in genomic LD surveys and tagSNP sets excluding loci with MAF < 5 per cent. It has been proposed, however, that genetic risk for common diseases and most of the clinically important traits are determined by the joint contribution of disease-predisposing high-frequency alleles that are shared between unrelated affected individuals in the population: the common-disease/common-variant (CD/CV) model. 60,64,65 A less extreme model based on the divergence of the allelic spectra of disease susceptibility variants relative to that of all variants has been proposed for evaluating the impact of allele architecture on common diseases. 61 Ultimately, the applicability of the CD/CV model in disease association studies will rely on the selected markers and on the relationship between the four parameters that affect the apparent size at the marker locus, namely the odds ratio of the disease allele, the disease allele frequency, the marker allele frequency and the LD between the marker and the disease locus (reviewed by Zondervan and Cardon 62).
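Part of the interplay between disease allele frequency, marker allele frequency and LD described above can be made concrete with two standard population-genetic relationships, sketched here under stated assumptions. LD is measured as r² (the text does not specify a measure); the largest r² two biallelic loci can reach is limited by how well their allele frequencies match; and, as a widely used rule of thumb, the sample size needed to detect an effect through a marker scales by roughly 1/r² relative to typing the disease variant directly. Function names and example values are hypothetical, and the odds ratio is deliberately left out of this sketch.

```python
import math


def max_r_squared(p_disease, p_marker):
    """Largest r^2 attainable between two biallelic loci (i.e. when D' = 1
    with the two minor alleles in coupling), given the allele frequencies."""
    for p in (p_disease, p_marker):
        if not 0 < p < 1:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d_max = min(p_disease * (1 - p_marker), (1 - p_disease) * p_marker)
    variance_term = p_disease * (1 - p_disease) * p_marker * (1 - p_marker)
    return d_max ** 2 / variance_term


def indirect_sample_size(n_direct, r_squared):
    """Approximate sample size when testing a marker instead of the causal
    variant, using the 1/r^2 rule of thumb (an approximation, not exact)."""
    if not 0 < r_squared <= 1:
        raise ValueError("r_squared must be in (0, 1]")
    return math.ceil(n_direct / r_squared)


# Hypothetical example: a rare disease allele (5 per cent) tagged by a
# common marker allele (30 per cent), even in complete LD (D' = 1).
r2_cap = max_r_squared(0.05, 0.30)
print(f"maximum r^2 = {r2_cap:.3f}")
print("samples needed via marker:", indirect_sample_size(1000, r2_cap))
```

The frequency mismatch alone caps r² well below 1 in this hypothetical case, which is one way of seeing why tagSNP sets restricted to common alleles may have little power to detect rare disease variants.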
Diseases related to defective immune response, such as autoimmune disorders, might comply with the CD/CV model of allelic frequency distribution for susceptibility loci, showing increased allele frequency after being under positive selection for infectious disease resistance or heterozygote advantage. 61,66 Direct extrapolations to the immune sub-genome and autoimmune diseases could be elusive, however; that is, it would be difficult to prove that alleles having an effect on the function of an immune gene (and also contributing to the risk of autoimmune diseases) are beneficial on a population level because they increase the diversity in the immune response. 67 In fact, the overall level of polymorphism in immune genes is, in general, similar to that of non-immune genes, 10 and only a few loci among the genes constituting the immune sub-genome largely contribute to the diversity of the immune response, disregarding loci with somatic recombination. In addition, modes of selection maintaining extreme allelic diversity levels (heterozygote advantage, balancing selection) have been proposed for these few loci, including the classical HLA class I and II and KIR genes. 29,30,66,68 Consequently, it is probably premature to generalise any disease model for immune genes. A recent study shows, for instance, that multiple disease loci with predominant disease variants are not necessarily required to develop a complex immune disorder. 69 It has been shown that sarcoidosis, a multi-systemic immune disorder initially associated with several classical HLA markers, is associated with a single disease locus independently responsible for most of the predisposing influence. 69 The authors of this study demonstrated that a single transition in exon 5 of the BTNL2 gene, which alters a splice site and causes a premature stop codon, has a profound effect on the mature protein structure and function.
A number of autoimmune disease-susceptibility loci have been mapped to the MHC and LRC regions. 70-72 Autoimmunity and cancer are associated with the progressive decline in immune functions, such as decline in T cell function, dysregulation of T cell apoptosis and immune senescence during ageing. 73,74 Tolerance induction of T cells is mediated by cytotoxic T lymphocyte antigen 4. Polymorphism in this inhibitory receptor has also been associated with autoimmune diseases. 75,76

Outlook
Since the term haplotype was coined in 1967, 77 a vast amount of literature has been published employing haplotypes to construct genetic and LD maps, which have enabled the emerging field of immunogenetics to make major contributions to immunology, population genetics and medicine. 78 For the MHC, a haplotype hierarchy has been defined based on pedigree and population genetic data, where physically linked classical HLA genes constitute 'blocks' of relatively short DNA segments containing alleles in LD at different loci, namely HLA-B/C, TNF, complotype and HLA-DQB/DRB. Combinations of these four basic blocks are frequently found in LD, constituting 'conserved extended haplotypes' (CEH) 79,80 or 'ancestral haplotypes'. 81 More recently, local fine-scale LD blocks based on high-density SNP typing have been described and meiotic recombination hot spots inferred from population genetic data.
The integration of long-range LD in CEH and local fine-scale LD data will provide new and exciting opportunities to explore the evolution and disease associations of the immune sub-genome. While the jury is still out on the validity of the CD/CV model, the recently announced Wellcome Trust Case Control Consortium (WTCCC) is likely to provide the answer to this ongoing debate. The WTCCC proposes to analyse 19,000 samples for common genetic variants in eight common diseases using data generated by the HapMap Consortium. Integration of the genetic and epigenetic datasets discussed here is a necessary next step towards a more holistic approach to immunological research, which is one of the aims of the recently formed International Immunomics Society, IMMIS (http://research.i2r.a-star.edu.sg/IIMMS/). The IMMIS was formed with the main objective of promoting the science of immunomics, which is the interdisciplinary field spanning immunology, immunoinformatics, genomics, proteomics, bioinformatics and related scientific fields.