The impact of low-cost, genome-wide resequencing on association studies

With the completion of phase 1 of the HapMap project (www.hapmap.org), we are now close to the point where genome-wide association studies form a routine tool for trying to identify genes involved in human disease and drug response. In any of three large human populations, the HapMap provides more than 500,000 single nucleotide polymorphisms (SNPs), chosen as far as possible to be evenly spaced and highly polymorphic. Genotyping these SNPs in large samples of cases and controls should permit any common variant (for example, minor allele frequency [MAF] > 5 per cent) implicated in disease causation to be identified, even if the size of its effect is rather small. Phase 2 of HapMap, expected to be delivered later in 2005, will generate an even more dense SNP resource.
Even before the HapMap results have been applied to benefit disease gene studies, the project is already yielding a wealth of information on human population genetics. The enormous fine-scale variability in recombination rates is being documented for the first time; much is being learned about the effects of selection on human genetic variation, and more generally about the nature of genetic polymorphisms and their distribution in the genome and across the globe.
As researchers investigating particular genetic diseases assess the implications of HapMap for their work, however, it is already possible to look beyond to an era in which much of the HapMap data becomes largely redundant. The major motivation for HapMap was that, by genotyping a small proportion of the genome (500,000 SNPs represents less than 0.02 per cent of it) and exploiting the power of linkage disequilibrium (LD) to illuminate the unobserved variants in the vicinity of a typed polymorphism, much of human genetic variation can be captured cost-effectively. If whole genome resequencing becomes fast and cheap, however, why bother with only part of the information?
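The '0.02 per cent' figure can be checked with a rough calculation. The sketch below assumes a haploid human genome length of about 3 x 10^9 base pairs, a value not stated in the text; the function and constant names are illustrative only.

```python
# Assumed approximate haploid human genome length in base pairs (not from the text).
HUMAN_GENOME_BP = 3_000_000_000


def percent_of_genome_typed(n_snps, genome_bp=HUMAN_GENOME_BP):
    """Percentage of genome positions covered by n_snps genotyped sites."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return 100.0 * n_snps / genome_bp


if __name__ == "__main__":
    # 500,000 SNPs is the HapMap figure quoted in the text.
    print(f"{percent_of_genome_typed(500_000):.4f} per cent")
```

Under that assumed genome length, the typed fraction comes out below the 0.02 per cent bound given in the text.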
Several companies are promising to make that prospect a reality within a short time-frame. In March 2005, 454 Life Sciences of Connecticut, USA, announced that it had sold and installed its first genome sequencing system, claimed to perform sequencing 100 times faster than conventional machines, at the Broad Institute of MIT and Harvard. The system, based on light-emitting sequencing chemistries and microfluidics, implements massively parallel genomic sequencing. According to the company, a single instrument can sequence over 20 megabases per four-hour run. In December 2004, the company announced that its technology had been used to sequence four entire bacterial genomes, one from Mycobacterium tuberculosis and three from related strains, facilitating the discovery of a potential antimicrobial treatment for tuberculosis targeting a newly identified pathway. Sequencing larger genomes - eventually that of Homo sapiens itself - seems not to be far away.
Solexa, based in California, USA and Cambridgeshire, UK, is in the same race towards fast and cheap resequencing of human genomes. In March 2005, it announced the resequencing of the virus Phi-X 174, whose small genome has often been used as a 'proof of principle' of new technologies. Solexa's current technology involves 'clusters' - dense collections of DNA molecules on a surface - and novel chemistry that allows a single base extension per cycle, thanks to reversible termination and fluorescence. A resequencing system based on these technologies - and capable of sequencing much larger genomes - is due for commercial release late in 2005. For the future, Solexa is investigating single-DNA-molecule technology, allowing massively parallel processing with around 10⁸ sites per cm². Working at the single-molecule level avoids the need for amplification, allowing 'one-tube' sample preparation for a whole-genome analysis.
How much extra should researchers be willing to pay to get whole-genome sequence data, rather than genotypes at a dense SNP map? Clearly, the former includes the latter, and so should be preferred if cost is irrelevant. But, of course, cost is crucial. For genetic association studies, at least 500 case and 500 control genomes are typically required, so the cost must be less than $5,000 per genome for an overall cost of $5m. This is still a long way off, even with the new technologies, but their small reagent costs make the $5,000 genome a feasible proposition for the forthcoming years.
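The budget arithmetic behind the $5,000 figure can be written out as a minimal sketch. It assumes that sequencing is the only cost and that every case and control genome costs the same; the function name is hypothetical.

```python
def max_cost_per_genome(total_budget, n_cases, n_controls):
    """Largest per-genome cost that keeps a study within total_budget.

    Assumes sequencing is the only cost and is identical for every
    sample (a simplification made for illustration).
    """
    n_genomes = n_cases + n_controls
    if n_genomes <= 0:
        raise ValueError("at least one genome is required")
    return total_budget / n_genomes


if __name__ == "__main__":
    # 500 cases + 500 controls on a $5m budget, the figures quoted in the text.
    print(max_cost_per_genome(5_000_000, 500, 500))
```

Doubling the sample size under the same budget halves the affordable per-genome cost, which is why the per-genome price dominates study design.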
Data quality is as important as cost. SNP genotyping platforms have recently been improved, such that miscall and noncall rates can reach very low levels. The resequencers are also claiming very low error rates, but achieving this, and high coverage rates, will have cost implications. The issue of error rates is complicated by translocations, inversions, duplications, tandem repeats and indels of varying sizes, some of which will be virtually impossible to capture fully with any one technology. Since many of these are candidates for being involved in disease causation, however, capturing most of them would provide a major advantage for sequence over SNP data: SNPs can sometimes 'tag' non-SNP variants such as indels or inversions but, broadly speaking, these need to be common and known a priori. More generally, the main potential advantage of sequence over genome-wide common SNPs is the capturing of rare variants, whether they be SNPs or more complex variants, and whether or not their existence is recognised a priori. Even cases attributable to multiple spontaneous mutations - resistant to study by classical genetics - can potentially be successfully unravelled using resequence data. The distribution of allele frequencies in natural populations is U-shaped, with most alleles having a frequency close to zero or one: common variants are rare and rare variants common.
The belief that many of the variants causing common diseases are common - known as the 'common disease common variant' (CDCV) hypothesis - is backed by some theory and data, but sceptics may nevertheless suspect that wishful thinking has also played an important role in motivating adherence to CDCV. An alternative is that common diseases are caused by a number of different polymorphisms, many or even all of them rare: there may be a number of pathways potentially contributing to disease progression, each modified by a specific set of polymorphisms, or a common pathway may be interrupted by mutations at a number of widely-dispersed genomic locations.
Genome-wide resequencing would greatly simplify genomic variation analyses: the problem of highly-variable LD between a disease-predisposing polymorphism and even a close marker locus vanishes because the causal polymorphism itself will be typed in the study. This potentially leads to more power for a given sample size, because effect sizes are not attenuated by variable LD. Conversely, sequencing errors and coverage gaps can mar the analyses, whereas SNP genotyping is now highly robust. It may be optimal to accept a moderate level of resequencing errors and gaps, in order to increase the number of individuals sequenced within the available budget. Another possible route to cost reduction is multiple uses of a control sample for comparison with different case samples. This cannot be done indiscriminately, but some pooling of controls across studies may be feasible and help to make resequencing cost-effective.
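One common way to quantify the attenuation by LD, which the text itself does not spell out, is the standard approximation that a study typing a marker in LD with the causal variant needs roughly 1/r^2 times as many samples as a study typing the causal variant directly, where r^2 is the squared correlation between the two loci. The sketch below assumes that approximation; the function name and the example r^2 values are hypothetical.

```python
import math


def equivalent_sample_size(n_direct, r_squared):
    """Samples needed at a marker to match the power of n_direct samples
    typed at the causal variant, under the 1/r^2 approximation."""
    if not 0.0 < r_squared <= 1.0:
        raise ValueError("r_squared must be in (0, 1]")
    return math.ceil(n_direct / r_squared)


if __name__ == "__main__":
    # Hypothetical LD values; r^2 = 1.0 corresponds to typing the causal
    # variant itself, as resequencing would.
    for r2 in (1.0, 0.8, 0.5, 0.2):
        print(r2, equivalent_sample_size(1000, r2))
```

Under this approximation, resequencing (r^2 = 1) gives the smallest required sample, and the requirement grows quickly as LD between marker and causal variant weakens.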
Naturally, the large genome centres are taking an interest in these developments, which for the moment remain out of reach of most researchers. The time-scale for widespread implementation of resequencing technologies is hard to predict, but it seems prudent to bear in mind the possibility that fast, cheap genome resequencing may soon be within reach.

David Balding
Editorial Board, Human Genomics
The editor of Human Genomics is pleased to announce a call for papers for a special issue devoted to SNP association studies, guest edited by Professor Pui-Yan Kwok. The submission deadline for this issue is 27th September, 2005. If you are interested in submitting a paper to the journal for this special issue, please contact Liz Caldwell for further information, by telephone on +44 (0)207 323 2916 or by e-mail: liz@hspublication.co.uk