Detecting multiple associations in genome-wide studies

Recent developments in the statistical analysis of genome-wide studies are reviewed. Genome-wide analyses are becoming increasingly common in areas such as scans for disease-associated markers and gene expression profiling. The data generated by these studies present new problems for statistical analysis, owing to the large number of hypothesis tests, comparatively small sample size and modest number of true gene effects. In this review, strategies are described for optimising the genotyping cost by discarding unpromising genes at an early stage, saving resources for the genes that show a trend of association. In addition, there is a review of new methods of analysis that combine evidence across genes to increase sensitivity to multiple true associations in the presence of many non-associated genes. Some methods achieve this by including only the most significant results, whereas others model the overall distribution of results as a mixture of distributions from true and null effects. Because genes are correlated even when having no effect, permutation testing is often necessary to estimate the overall significance, but this can be very time consuming. Efficiency can be improved by fitting a parametric distribution to permutation replicates, which can be re-used in subsequent analyses. Methods are also available to generate random draws from the permutation distribution. The review also includes discussion of new error measures that give a more reasonable interpretation of genome-wide studies, together with improved sensitivity. The false discovery rate allows a controlled proportion of positive results to be false, while detecting more true positives; and the local false discovery rate and false-positive report probability give clarity on whether or not a statistically significant test represents a real discovery.


Introduction
Recent technological advances allow the rapid generation of vast quantities of molecular biological data. 1,2 At the same time, the sequencing of the human genome and subsequent efforts to catalogue the variation within it 3 have created opportunities for testing thousands of sequence variations for association with disease, behavioural traits and physiological markers. Such applications are appealing because of the relative lack of success, to date, of positional cloning strategies that start with family-based linkage mapping, 4 most likely due to insufficient sample sizes to detect genes of modest effect. 5 The whole-genome association scan is an increasingly feasible study design in which the genotyped markers are sufficiently closely spaced to detect linkage disequilibrium (LD) with all aetiological variants, and well-powered sample sizes are more attainable. 6 Some initial studies have been performed in special populations 7,8 and in small samples of outbred populations; 9,10 genome-wide admixture scans are imminent 11,12 and, ultimately, routine scans will be performed for common diseases in large cohorts of outbred populations. 13 Array experiments measuring large numbers of transcription or expression levels are another form of genome-wide analysis that has become widespread. 14 Although the effect sizes expected in these studies are large by comparison with disease association studies, the sample sizes are constrained by cost to be relatively small, so that both types of study encounter problems of statistical power (Table 1). Expression levels can be regarded as quantitative traits under genetic control, so that both kinds of large-scale exploration can occur in genome scans for loci influencing expression levels, 15 or phenome scans demarking the influence of genetic pathways. 16,17 The analysis of large exploratory studies creates new problems for methodology and interpretation.
Primarily, there is the multiple testing problem, whereby the chance of an exceptional result increases with the number of tests performed, even when there is no true association. To alleviate this problem, two broad strategies have emerged: first, to devise more sensitive tests, so that the penalty for multiple testing is less severe; and, secondly, to propose different measures of experimental error for which the consequences of multiple testing are less serious. Furthermore, genome-wide analysis creates problems of computational and cost efficiency on account of the large volume of data to be generated and analysed.
Here, some recent work addressing these problems is reviewed. For the study design, work is summarised that minimises the cost of a study while maintaining its power. For the analysis, methods are reviewed for improving sensitivity in the presence of multiple gene effects by combining evidence across tests, and some methods for reducing the computational burden of permutation tests are discussed. The review concludes with a discussion of alternative error measures, including false discovery rates.
This review is mainly concerned with a whole-genome association scan, using single nucleotide polymorphisms (SNPs), for a dichotomous disease status. It will be clear, however, that many of the methods apply in other situations, in particular to array expression studies. Although there are important differences between these two applications, including the number of expected true associations, sample size and effect size (Table 1), their common exploratory character suggests that further advances may arise from cross-application of ideas between these areas. For this reason, some methods developed for expression studies are reviewed; there is also a discussion of whether they may be suitable for genetic association scans. The objects of inference used will be 'genes', 18 with the understanding that, in this context, this can mean SNPs, whole genes, haplotype blocks, transcript levels or other features.

Study design
Large samples of unrelated individuals have become the design of choice for genome-wide association scans, because earlier concerns about population stratification have been largely allayed by empirical methods. 19 Estimates of the total sample size required are in the order of thousands. 20 Because the majority of genes are not associated with disease, it is uneconomical to genotype the whole sample for all genes. Sequential study designs, in particular a two-stage block design, have been proposed for reducing the total cost of a genome-wide experiment, which remains the main limiting factor preventing large-scale application. In a two-stage design, all of the genes are typed in a subset of the sample, with only the genes showing a trend of association being taken forward for genotyping in the remainder. This directs resources towards true associations at an earlier stage, so that the available sample size is larger for genes with true effects.
The design parameters for a two-stage study include the total cost, total sample size, sizes of the first and second sub-samples and the rejection criterion at the end of the first stage. Studies with only two stages are considered, although more could be performed. Some of these parameters are constrained in advance, with the others then chosen to optimise some objective. One approach is to consider the genotyping cost as fixed and then find parameters that give the most power. 21,22 A general rule of thumb, considering a number of disease models and correlation structures between markers, is to allocate 75 per cent of resources to the first stage and then carry the most promising 10 per cent of markers to the second. 22 Here, the sample size is a function of the genotype unit cost and the number of markers, within the overall cost constraint.
It is more likely that the sample size is fixed (say, to provide sufficient power to detect a single association) and the goal is to minimise costs while achieving power close to that of the one-stage design. 23 In many situations, the cost can be halved while keeping power within 1 per cent of the one-stage design; thus, the total sample size can be calculated to achieve a certain power (say 81 per cent) in the one-stage design and parameters then optimised for a two-stage design. Considering a range of genetic models, a general guideline is to set the sample size of the first stage to have 97 per cent power for individual tests and carry forward all markers with nominal p-values less than 0.15. The sample size for the first stage cannot be calculated without knowledge of the true effects, however, so a more practical approach is to consider the ranks of the test statistics of the true effects. 24 Here, it is shown that similar information to the one-stage design is obtained by genotyping all markers on 50 per cent of the sample and then genotyping the 10 per cent most promising on the remainder, resulting in a decrease of about 45 per cent in the number of genotypes. Again, the total sample size can be calculated for a one-stage design; this last guideline is currently the most practical available and applies over a wide range of genetic models and correlation structures between markers. An application of this strategy has been reported in which the primary constraint is the quantity of DNA available for study subjects. 25 About 44 per cent of the sample had sufficient DNA to be typed for all markers, with the remaining 56 per cent used for the second stage. An important feature of this study is that the test statistics are calculated over the full sample, with adjustment made for the interim test.
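The 50 per cent / top 10 per cent guideline can be explored with a small Monte Carlo sketch. The normal test statistics, effect size and marker counts below are illustrative assumptions, not values taken from the cited studies.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def two_stage_recovery(n_markers=1000, n_true=10, effect=4.0,
                       stage1_frac=0.5, carry_frac=0.10, n_sims=50):
    """Fraction of true effects surviving a two-stage screen.

    Test statistics are idealised as independent normal z-scores: null
    markers are N(0, 1), and a true effect has mean `effect` in the
    full sample, scaled by sqrt(fraction) in a sub-sample.
    """
    n_carry = int(carry_frac * n_markers)
    thresh = norm.isf(0.05 / n_markers)   # Bonferroni over all markers
    mean = np.zeros(n_markers)
    mean[:n_true] = effect
    hits = 0
    for _ in range(n_sims):
        # Stage 1: every marker typed on stage1_frac of the sample.
        z1 = rng.normal(mean * np.sqrt(stage1_frac), 1.0)
        keep = np.argsort(z1)[-n_carry:]          # most promising 10%
        # Stage 2: carried markers typed on the remaining subjects;
        # evidence from the two stages combined over the full sample.
        z2 = rng.normal(mean[keep] * np.sqrt(1 - stage1_frac), 1.0)
        z_full = (np.sqrt(stage1_frac) * z1[keep]
                  + np.sqrt(1 - stage1_frac) * z2)
        hits += np.sum((z_full > thresh) & (keep < n_true))
    return hits / (n_sims * n_true)
```

With these settings, stage one uses half the genotypes and stage two a further 10 per cent of half, about 55 per cent of the one-stage total, which is the roughly 45 per cent saving quoted above.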
This is in contrast to the simpler approach used in the simulation studies, 21,23 in which test statistics were calculated separately for the two stages and their p-values combined into an overall significance. Analysing both stages at once 25 makes more efficient use of information and will be the more powerful method for computing significance in the whole sample.
Formal sequential designs have also been proposed for genetic association studies. 26 These can result in substantial cost savings, on average, but have yet to become widely adopted, owing mainly to logistical difficulties. For example, the stopping criteria must be applied to each gene separately, but genotypes are often obtained in bulk in array format, which makes it difficult to apply sequential designs efficiently across many genes. The two-stage designs are a compromise solution using frequentist inference, which also avoids the uncertainty in actual sample size that occurs with sequential inference. Future studies may introduce further design variables. For example, different genotyping technologies may be used in the two stages, with different unit costs, perhaps using DNA pooling. 27 Optimal study designs can be derived for these conditions following current principles.

Analysis methods for multiple associations
Many analysis methods are available for genetic data, but a first pass through a genome-wide scan may normally consist of single-locus tests for trend, perhaps additionally with two-locus interaction tests. 28 Several methods are now available that exploit the important feature that the majority of tested genes are not associated, but there are a small number of true, but weak, associations to be found. These methods are useful both for establishing statistical significance more strongly than single-locus tests, and for informally suggesting sets of genes for follow-up study.
In the traditional hypothesis-testing framework, each gene is tested individually and then a stepwise adjustment procedure is applied both to control the family-wise type-I error rate (FWER) and to declare individual genes associated. 29 This approach, related to the Bonferroni correction, achieves strong control of the FWER: the probability of at least one false positive is kept within the desired rate whatever the number of true positives. This is generally considered to be too conservative for genome-wide studies, however, because we can tolerate a small number of false positives if most true positives are detected. Preferable is weak control of FWER, which ensures that the probability of at least one false positive is within the desired rate only when there are no true positives. This is desirable, because we must defend against the possibility of there being no true associations in the sample, but it allows us to tolerate some false positives if some true positives are present.
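Holm's step-down method is a standard example of such a stepwise adjustment with strong FWER control (given as a sketch; it is not necessarily the specific procedure of reference 29):

```python
import numpy as np

def holm_rejections(pvals, alpha=0.05):
    """Holm's step-down procedure: strong control of the FWER.

    Compare the i-th smallest p-value (0-based i, m tests) with
    alpha / (m - i), stopping at the first failure; everything
    rejected before the stop is declared significant.
    """
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):
        if p[idx] <= alpha / (m - i):
            reject[idx] = True
        else:
            break
    return reject
```

The first step is exactly the Bonferroni threshold alpha/m; later steps are progressively less severe, which is why step-down procedures dominate a plain Bonferroni correction.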
A joint test of multiple genes can maintain weak control of FWER and should reveal greater evidence for association from a set of genes, although perhaps with less specificity for individual genes. This argument motivates the partial sum statistics, 30 which are formed by obtaining test statistics (typically χ² tests from a contingency table) for each individual gene and then forming the sum of the K largest statistics, where K is a fixed number called the length. The significance of the sum can be assessed by a permutation test and an overall significance estimated over a range of lengths.
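A concrete sketch of the partial-sum statistic, using assumed data (a 0/1/2 genotype matrix and a binary trait) and an n·r² trend statistic standing in for the contingency-table χ² test:

```python
import numpy as np

rng = np.random.default_rng(1)

def trend_stats(G, y):
    """Per-gene trend statistics n * r^2, approximately chi-square(1).

    G is subjects x genes (0/1/2 allele counts); y is a binary trait.
    """
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    r = (Gc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Gc ** 2).sum(axis=0) * (yc ** 2).sum())
    return len(y) * r ** 2

def sum_statistic_pvalue(G, y, K=5, n_perm=500):
    """Permutation p-value for the sum of the K largest statistics.

    Permuting the trait labels while keeping genotypes fixed preserves
    the LD (correlation) structure among genes, as required for a
    valid test.
    """
    def sum_k(yy):
        return np.sort(trend_stats(G, yy))[-K:].sum()
    observed = sum_k(y)
    null = np.array([sum_k(rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (n_perm + 1)
```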
A more flexible alternative to the sum statistic is the truncated product of p-values. Here, the product is formed of all the p-values lower than a preset threshold, 31 or of the K smallest p-values. 32 When the individual tests have the same distribution, the rank-truncated product has equivalent power to the sum statistic, but is more balanced when the tests have different distributions. This will occur, for example, when conducting haplotype-based tests on regions of different sizes, leading to tests with different degrees of freedom. Analytic distributions are known for independent tests, which have been used in simulation studies to show improved power for combined evidence methods compared with traditional corrections. 31,32 The present authors prefer the truncated product to the sum statistic on account of its balanced combination of different tests, and also prefer to truncate on rank rather than threshold because the number of true gene effects is fixed across studies, whereas their p-values are random. 32 The length K should be close to the actual number of true associations, but this is generally unknown. A range of lengths could be tested, with the most significant length used to select genes for follow-up analysis; but there is no formal basis for this strategy, and simulation studies show that it is capable of grossly over- or under-estimating the number of true associations. 33 A judicious choice of a fixed length, say K < 20 for a genome-wide association scan, is generally advisable provided that the tests are reasonably independent. When there is strong dependency between tests, such as in single-marker analysis of a dense genome-wide scan, then the variable-length approach can be used to establish statistical significance, but not to estimate the number of follow-up genes. Informally, genes would be followed up in rank order of significance; and if the prior power is high, this will tend to identify the true associations. 32
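The rank-truncated product can be sketched as follows, on the monotone-equivalent scale -2 Σ log p. The uniform-draw reference distribution below corresponds to the independence case only; with LD between markers, the null should instead be generated by permuting the raw data, as described above.

```python
import numpy as np

def rank_truncated_product(pvals, K):
    """Combine the K smallest p-values: -2 * sum(log p) over them.

    Larger values indicate stronger combined evidence; this is a
    monotone transform of the product of the K smallest p-values.
    """
    p = np.sort(np.asarray(pvals))[:K]
    return -2.0 * np.log(p).sum()

def rtp_pvalue_independent(pvals, K, n_perm=2000, seed=0):
    """Reference distribution from independent uniform p-values.

    An illustrative simplification valid only for independent tests.
    """
    rng = np.random.default_rng(seed)
    obs = rank_truncated_product(pvals, K)
    m = len(pvals)
    null = np.array([rank_truncated_product(rng.uniform(size=m), K)
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= obs)) / (n_perm + 1)
```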
In fact, formal adjustments based on the closure principle are available for individual tests, which allow strong control of FWER, 34 but the primary use of truncated products is to show that the strongest associations indeed arise from true effects.

In working with the summary p-value rather than the complete data, some information is lost, and a single analysis of the data may be more efficient. A natural approach is to estimate all gene effects together in a regression model. On the genome-wide scale, a fixed-effects regression is impractical, requiring estimation of many more parameters than there are observations. Therefore, several methods proposed for microarrays regard a gene as having a random effect, and model the distribution of gene effects by parametric forms that can be estimated. A simple model is to assume a normally distributed effect around zero, 35 although this may lack power when most genes have no effect. The model can be extended by assuming that the effect variability comes from small and stronger effects, with inference based only on the stronger effects. 36 Another alternative is a mixture of a zero-centred normal and a point mass at zero 37 or, more generally, a mixture of three normals with respectively positive, zero and negative means. 38 Here, the zero-centred distribution is regarded as the null distribution, which allows small non-zero effects to be regarded as uninteresting if there is sufficient evidence for stronger effects.
These approaches reduce the dimensionality of the inference while modelling the complete data, rather than summarising each gene before combining evidence. These methods offer promise for genome-wide association scans, an important open question being the precision in estimating the random effects distribution when the number and size of true associations are small. For example, a method for testing whether the overall distribution of p-values is uniform 39 has very little power compared with the Bonferroni correction when the number of true effects is small (authors' unpublished data). Another important issue is the choice of random effects distribution: current methods assume hierarchical or mixture normal distributions, but experimental geneticists have favoured gamma distributions. 40,41 A useful feature of the mixture distribution models is that they generate maximum likelihood probabilities of membership of each of the mixture components, for each gene, which can be interpreted informally as posterior probabilities of association, allowing individual genes to be selected for follow-up study.
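A two-component simplification of these mixture models can be fitted by EM on per-gene z-statistics. The normal alternative and starting values below are illustrative assumptions; the point is the membership weights, which play the role of the informal posterior probabilities of association mentioned above.

```python
import numpy as np
from scipy.stats import norm

def fit_null_alt_mixture(z, n_iter=300):
    """EM fit of z ~ p0 * N(0,1) + (1 - p0) * N(mu, sigma^2).

    The N(0,1) component is treated as the null distribution; the
    returned weights w give each gene's posterior probability of
    belonging to the non-null component.
    """
    z = np.asarray(z, dtype=float)
    p0 = 0.9
    mu = np.percentile(z, 95)   # crude start near the upper tail
    sigma = 1.0
    for _ in range(n_iter):
        f0 = p0 * norm.pdf(z, 0.0, 1.0)
        f1 = (1.0 - p0) * norm.pdf(z, mu, sigma)
        w = f1 / (f0 + f1)                  # E-step: P(non-null | z)
        p0 = 1.0 - w.mean()                 # M-step updates
        mu = (w * z).sum() / w.sum()
        sigma = max(np.sqrt((w * (z - mu) ** 2).sum() / w.sum()), 1e-3)
    return p0, mu, sigma, w
```

Genes can then be ranked for follow-up by their weight w, rather than by p-value alone.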

Permutation testing
When the assumptions underlying analytical distributions are not met, permutation tests are a popular method for computing significance levels. In a genome-wide association study, the problem is that genotypes are correlated due to LD; indeed, the correlations are necessary for the design to be successful. The standard procedure is to reassign trait values among study subjects, while keeping their genotypes fixed, thereby preserving the correlation structure across the multiple genes and realising the exchangeability conditions for a valid test. 42 When performing thousands of tests on thousands of subjects, however, a permutation procedure using thousands of replicates becomes extremely time-consuming, with possible running times of days or weeks. Therefore, more efficient approaches to permutation testing have recently been proposed.
The accuracy of the permutation test can be improved by noting that the minimum p-value, sum statistic and truncated product can all be regarded as the extreme value of a large number of observations. 33 Therefore, they should follow the extreme value distribution, 43 and by fitting the parameters of the distribution to the values observed in permutation replicates, more accurate significance levels are obtained. Equivalently, fewer replicates are needed to reach a given accuracy. The efficiency gain depends upon a number of factors, including the true significance level and the number of tests, and it is difficult to compute standard errors for the empirical p-values. Nevertheless, this approach has the advantage of being generally applicable and, importantly, the fitted distribution can be re-used in subsequent tests of the same genes in the same population. This will be useful for studies based on a standard genome-wide marker panel, 3 leading to substantial time savings over the long term.
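A sketch of the distribution-fitting idea, with the genome-wide maximum of independent χ² statistics standing in for real permutation replicates (which would preserve LD):

```python
import numpy as np
from scipy.stats import gumbel_r, chi2

rng = np.random.default_rng(3)

# Null replicates of the genome-wide maximum statistic: here idealised
# as the max of m independent chi-square(1) tests per replicate.
m, n_rep = 5000, 300
max_stats = chi2.rvs(df=1, size=(n_rep, m), random_state=rng).max(axis=1)

# Fit an extreme-value (Gumbel) distribution to the replicate maxima...
loc, scale = gumbel_r.fit(max_stats)

# ...and read off an adjusted p-value for an observed maximum. The
# fitted (loc, scale) pair can be stored and re-used for later scans
# of the same marker panel, avoiding fresh permutation runs.
observed_max = 25.0
p_adjusted = gumbel_r.sf(observed_max, loc, scale)
```

The pay-off is in the tail: an empirical p-value of this size would need far more than 300 replicates, whereas the fitted distribution extrapolates smoothly beyond the observed maxima.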
A complementary approach is to reduce the computation within each replicate. Lin 44 considered score statistics from regression models, showing that it is sufficient to multiply the score contributions of each subject by a normal random deviate to generate a realisation from the null distribution. Alternatively, Seaman and Müller-Mysock 45 suggest sampling directly from the multivariate distribution for all the genes. The distribution can be estimated by considering the score test from a regression model that includes all the genes as predictors. This estimation may be difficult when the number of genes exceeds the number of subjects, in which case the procedure may need to be applied piecewise to subsets of genes. The approach of Lin also requires the sample size to exceed the number of genes, but preliminary results suggest that it would be more robust than that of Seaman and Müller-Mysock when applied across the whole genome. 44 Both of these approaches require the analysis to be expressed as a score statistic from a regression model, which can be done in most situations but may require additional work by the user. Currently, Lin's method seems better suited to genome-wide analysis, whereas that of Seaman and Müller-Mysock is more applicable and efficient in smaller-scale candidate gene studies. 46 A further approach is to assume that the sampled markers are representative of an 'effective number' of independent tests. 47-50 After estimating this number (for example, from the singular-value decomposition of the genotype correlation matrix 50 ), asymptotic formulae can be applied. There is no formal basis for this approach, however, and studies based on real data indicate that the results are not always accurate; 51 indeed, there may be no such effective number after all. 33 This approach is not recommended; however, if it is used, all significant results should be confirmed by a permutation test.
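A minimal sketch of the perturbation idea behind Lin's method. The full method works with standardised score statistics (dividing by the estimated information), which is omitted here; only the core trick of weighting per-subject score contributions by normal deviates is shown.

```python
import numpy as np

rng = np.random.default_rng(4)

def score_perturbation_draws(U, n_draws=1000):
    """Monte Carlo draws from the joint null of many score statistics.

    U is an (n_subjects, n_genes) matrix of per-subject score
    contributions under the null. Multiplying each subject's row by an
    independent N(0,1) deviate and summing over subjects yields one
    realisation of all the (unstandardised) score statistics at once,
    preserving their correlation, without re-permuting the data or
    re-fitting any models.
    """
    n = U.shape[0]
    draws = np.empty((n_draws, U.shape[1]))
    for b in range(n_draws):
        g = rng.normal(size=n)
        draws[b] = (U * g[:, None]).sum(axis=0)
    return draws
```

Because the same deviates multiply every gene's contributions, two genes in perfect LD (identical score columns) receive identical draws, exactly as a label permutation would give.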

False discovery rates
Another perspective on the multiple testing problem is that the family-wise error rate is not the most appropriate measure, and that other measures should be used that have better sensitivity and specificity in genome-wide studies. Although weak control of FWER for the overall significance has been advocated, some error control for the single tests is also desirable. Here, two prominent alternatives are discussed: false discovery rates (FDRs) 52,53 and posterior error rates. 54,55 The original FDR by Benjamini and Hochberg 52 is the expected proportion of false positives among all positive results, with the proportion defined as zero if there are no positives. That is, if R is the number of positive results in a study and V is the number of these that are false (that is, do not arise from true gene effects), then:

FDR = E(V/R), with V/R defined as 0 when R = 0.

Subsequently, Storey and colleagues 53,56 have argued that the choice of the appropriate rate depends on how many positive results there are and, furthermore, that the rate is only meaningful when there is at least one positive. This motivates the positive FDR (pFDR), defined as the expected proportion of false positives among all positive results, conditional on at least one positive at a given significance level: 56

pFDR = E(V/R | R > 0).

Rather than setting a fixed pFDR rate to control, Storey and colleagues suggest giving a value to each test that indicates what pFDR would result from declaring that test significant. The follow-up tests can then be chosen based on joint consideration of the number of tests selected and the pFDR associated with them. Formally, the q-value associated with an individual test is defined as the minimum pFDR achieved when declaring all tests significant at the level of the test's p-value. A q-value can be estimated for each test in a genome-wide experiment and follow-up tests selected from those with the lowest q-values. This last stage is somewhat informal and may be driven by logistic and financial constraints.
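Storey's q-values additionally estimate the proportion of true nulls; the sketch below computes the simpler Benjamini-Hochberg style q-values with that proportion taken as one, which is conservative.

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values (null proportion taken as 1).

    q[i] is the smallest FDR level at which test i would be declared
    significant: the minimum over j >= rank(i) of m * p_(j) / j.
    """
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    q_sorted = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q
```

Tests with the lowest q-values are then the natural candidates for follow-up, subject to the informal constraints noted above.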
A difficulty with FDR methods is that they control an expected proportion, whereas an investigator will be more concerned with the actual proportion of false positives within a study. Some insight is gained by considering the variation in the within-study false discovery proportion, or false discovery variance. Let i be an integer, with p(i) the i-th smallest p-value from a set of m tests. If the i most significant tests are declared positive, then mp(i) estimates the maximum number of false positives. The associated variance is mp(i)(1 − p(i)) (because the truth of a positive test is a binomial outcome) and the coefficient of variation of the within-study false discovery proportion is √((1 − p(i))/(mp(i))). This is greatest when p(i) is small, so, for a fixed set of p-values, this coefficient of variation is greatest when the fewest tests are declared significant. This will occur when a low error rate is set, or when there are few true associations, or when the power is low. In genome-wide association scans, the number of true associations is expected to be small by comparison with the number of tests, so that the false discovery variance is relatively high in relation to the target rate, and the FDR approach may not be reliable for controlling the error rate within studies. In gene expression experiments, however, the number of true associations is somewhat higher and FDR methods are more appropriate for those studies. Korn et al. study the within-study proportion of false discoveries and give procedures that keep the number (or proportion) of false discoveries within an upper bound with given probability. 57 The attraction of this approach is that one can limit the number of false positives with reasonable confidence, with the main disadvantage being increased computation.
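The binomial calculation above can be written directly:

```python
import numpy as np

def false_discovery_cv(pvals, i):
    """Estimated false positives and their coefficient of variation
    when the i most significant of m tests are declared positive.

    Expected false positives are about m * p_(i); the binomial
    variance m * p_(i) * (1 - p_(i)) gives
    CV = sqrt((1 - p_(i)) / (m * p_(i))), which grows as p_(i) shrinks.
    """
    p_sorted = np.sort(np.asarray(pvals))
    m = len(p_sorted)
    p_i = p_sorted[i - 1]            # i-th smallest p-value
    expected_fp = m * p_i
    cv = np.sqrt((1.0 - p_i) / (m * p_i))
    return expected_fp, cv
```

Declaring fewer tests significant (smaller i, hence smaller p(i)) therefore inflates the relative variability of the false discovery proportion, which is the instability discussed above.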
It is uncertain how the false discovery proportion behaves when it falls outside the upper bound and, although this approach is attractive, further operating characteristics may be needed before it becomes more widely used.
A further difficulty with FDR is that it says little about the individual tests. The most significant tests are most likely to be the true positives, but FDR and q-values ignore this in favour of averaging the error rate across all significant tests. Efron and colleagues 58,59 propose the local FDR as the posterior probability that a null hypothesis is true, given an observed statistic. The local FDR is calculated as

local FDR = p0 f0(T) / (p0 f0(T) + (1 − p0) f1(T)),

where p0 is the prior probability that the null hypothesis is true, T is a test statistic and f0 and f1 are the probability densities of T under the null and alternative hypotheses, respectively. p0 and f1 may be unknown but could be estimated from the data. 58,60,61 Note, however, that when the true value of p0 is near one, as is likely in disease association scans, empirical estimates of p0 may be greater than one, which leads to a downward bias if these estimates are truncated at one. Thus, it is better to fix a prior estimate of p0 from genomic considerations such as the number of expected disease genes (O(10^1)) and the number of genes in the genome (O(10^4)). 62 Both the local FDR and the q-value are calculated for individual tests. The q-value should be preferred if all positive tests will be followed up with roughly equal priority, which may be the case for a moderately powered study in which true and false positives are not well separated. The local FDR is preferable if decisions to follow up positive tests are taken on a case-by-case basis, because it is a property of single tests rather than the whole set of positive tests. This applies if there are a few very strong associations, together with some moderate ones, or if additional sources of evidence, such as biological plausibility, are taken into account together with the statistical association.
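The local FDR formula is straightforward to compute once p0, f0 and f1 are fixed. Below, f0 is standard normal and the alternative density f1 is taken as N(mu1, sd1²) purely for illustration; in practice, p0 would be fixed from genomic considerations and f1 estimated or assumed as above.

```python
from scipy.stats import norm

def local_fdr(t, p0, mu1=3.0, sd1=1.0):
    """Local FDR: P(null | T = t) = p0 f0(t) / (p0 f0(t) + (1-p0) f1(t)).

    f0 is N(0,1); the N(mu1, sd1^2) alternative is an illustrative
    assumption, not a recommended model.
    """
    f0 = norm.pdf(t, 0.0, 1.0)
    f1 = norm.pdf(t, mu1, sd1)
    return p0 * f0 / (p0 * f0 + (1.0 - p0) * f1)
```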
A related quantity is the false-positive report probability (FPRP). 55,63 This is the posterior probability that a null hypothesis is true, given a statistic at least as extreme as that observed. It is calculated as

FPRP = p0 (1 − F0(T)) / (p0 (1 − F0(T)) + (1 − p0)(1 − F1(T))),

where now F0 and F1 are the cumulative distributions. For known p0 and F1 and a large number of multiple tests, the FPRP is the same as the q-value, 56 the main difference being one of context. FPRP is intended to be applied across multiple studies and calculated from prior models, whereas q-values are motivated by the within-study FDR and are usually estimated from data. FPRP is also mathematically complementary to the positive predictive value of a discriminant, 64 again differing in context. Because FPRP is a property of a range of test statistics, it is appropriate for setting guidelines for the reporting of significant results, based on assumed models for p0 and F1. This means that results can continue to be reported according to their p-values, but with modified thresholds of significance. A known proportion of reported results will then be false; however, for assessment of specific tests for follow-up, the local FDR is more relevant to investigators.
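The FPRP differs from the local FDR only in using tail areas rather than densities. As before, the N(mu1, sd1²) alternative is an illustrative assumption:

```python
from scipy.stats import norm

def fprp(t, p0, mu1=3.0, sd1=1.0):
    """False-positive report probability: P(null | T >= t).

    Uses survival functions (upper tail areas):
    FPRP = p0 (1 - F0(t)) / (p0 (1 - F0(t)) + (1 - p0)(1 - F1(t))).
    The N(mu1, sd1^2) alternative is for illustration only.
    """
    s0 = norm.sf(t, 0.0, 1.0)
    s1 = norm.sf(t, mu1, sd1)
    return p0 * s0 / (p0 * s0 + (1.0 - p0) * s1)
```

Fixing p0 and the alternative in advance, one can invert this to choose the p-value threshold at which reported results carry an acceptable FPRP, as suggested above.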
Posterior error rates such as the local FDR and FPRP are gaining support because informed proposals can now be made for the prior probability of the null being true, based on genomic considerations. 55,62 Which of the various measures to use depends on the context. Some of the determining factors are summarised in Table 2.

Concluding remarks
Several aspects of the analysis of genome-wide studies have been discussed, including study design, analysis method and error control, all of which bear on the likelihood of successfully identifying gene effects. There are some key aspects that have not been considered here, including the selection and grouping of markers to be tested, population choice and data quality control. To some extent, these issues are specific to the type of study; this review has focused on the more general statistical issues that apply to most studies.
The field will continue to develop rapidly as more studies are completed, and there is much scope for new methodology. In particular, combinations of the current methods may prove to be fruitful; for example, including combined evidence tests within a two-stage design. There is no best method for all studies, because of their differing properties and aims, but this review has identified some of the questions that should guide the choice of analysis method. Another important area for development, which has not been discussed here, will be the incorporation of evidence from several sources, including association studies, gene ontology annotation, information from model organisms and structural bioinformatics, to give a holistic appraisal of the effects of genetic variation.

Table 2.
Comparison of different error rates and analysis methods. 'Error control' indicates whether a method provides some measure of error: (1) type-I error; (2) posterior probability of association; (3) expected proportion of false discoveries in a series of tests. 'Appropriate for' indicates whether, in the view of the authors, a method is suitable for genome-wide association or expression studies, based on the factors in Table 1.