Testing groups of genomic locations for enrichment in disease loci using linkage scan data: A method for hypothesis testing

Genes for complex disorders have proven hard to find using linkage analysis. The results rarely reach the desired level of significance and researchers have often failed to replicate positive findings. There is, however, a wealth of information from other scientific approaches which enables the formation of hypotheses on groups of genes or genomic regions likely to be enriched in disease loci. Examples include genes belonging to specific pathways or producing proteins interacting with known risk factors, genes that show altered expression levels in patients or even the group of top scoring locations in a linkage study. We show here that this hypothesis of enrichment for disease loci can be tested using genome-wide linkage data, provided that these data are independent from the data used to generate the hypothesis. Our method is based on the fact that non-parametric linkage analyses are expected to show increased scores at each one of the disease loci, although this increase might not rise above the noise of stochastic variation. By using a summary statistic and calculating its empirical significance, we show that enrichment hypotheses can be tested with power higher than the power of the linkage scan data to identify individual loci. Via simulated linkage scans for a number of different models, we gain insight into the interpretation of genome scan results and test the power of our proposed method. We present an application of the method to real data from a late-onset Alzheimer's disease linkage scan as a proof of principle.


Introduction
In complex disorders where variations in more than one gene are expected to contribute to disease risk, researchers often hypothesise that particular groups of genes or genomic locations are enriched with true disease-susceptibility genes, based on various lines of evidence. For example, groups of genes found to be differentially expressed in a case-controlled microarray experiment or the human loci syntenic to those identified by linkage in a mouse disease model are likely to be enriched in susceptibility genes. One can also hypothesise disease gene enrichment based on functional data. It can be suggested, for example, that the genes involved in glutamate neurotransmission are enriched with schizophrenia susceptibility genes, or (combining more than one line of evidence) that differentially expressed glutamatergic genes in particular are likely to be enriched. Researchers may wish to corroborate such hypotheses by testing whether the members of an identified group of genes are located in areas showing evidence of genetic linkage to the disease of interest. Linkage results for complex disorders are often noisy and hard to interpret, however. We propose a method for using genome-wide linkage data and the widely used nonparametric linkage (NPL) score [1] to test for the enrichment of groups of genes or genetic locations in disease-susceptibility genes. The NPL score is designed to have, at any locus, a standard normal distribution with a mean of 0 and a standard deviation of 1 under the null hypothesis of no linkage. This means that, although for any given unlinked locus the expectation is an NPL score of 0, stochastic variation creates scores that can take positive or negative values. Under the alternative hypothesis of linkage, stochastic variation at the true disease loci is still present, but the expectation is at a value higher than 0. The magnitude of the expected value depends on the sample size, the available genetic information and the effect size of the locus.
For most complex diseases, it is assumed that individual risk loci will have small effects. As a result, the superimposed stochastic variation can mask some truly linked loci or create signals where no true linkage is present, both situations leading to errors and/or failed replication studies for true disease loci. If one could have a priori knowledge of the true disease loci, one could achieve greater significance by studying the group of loci in concert because of the consistent trend for increased scores, even in the absence of significant scores at each one of the individual loci. As the number of true loci examined together rises, the noise from the underlying stochastic variation will asymptotically approach 0 and their average NPL score will asymptotically approach the average of their individual expectations, which is greater than 0. By contrast, for unlinked loci, the individual expectation is 0; as the number of unlinked loci examined in concert increases, their average NPL will asymptotically approach 0.
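The averaging argument can be made explicit with a short derivation. This is only a sketch under simplifying assumptions that the text does not state: the k loci are unlinked to one another, so that their NPL scores Z_i are independent, and each score keeps unit variance under linkage as well as under the null. The symbols k, Z_i and \mu_i are introduced here for illustration.

```latex
\begin{align*}
\bar{Z}_k &= \frac{1}{k}\sum_{i=1}^{k} Z_i, \\
\mathrm{E}\!\left[\bar{Z}_k\right] &= \frac{1}{k}\sum_{i=1}^{k}\mu_i,
  \qquad \mu_i = 0 \ \text{(unlinked locus)}, \quad \mu_i > 0 \ \text{(linked locus)}, \\
\mathrm{Var}\!\left[\bar{Z}_k\right] &= \frac{1}{k^{2}}\sum_{i=1}^{k}\mathrm{Var}\!\left[Z_i\right]
  = \frac{1}{k} \;\longrightarrow\; 0 \quad (k \to \infty).
\end{align*}
```

Under these assumptions the standard deviation of the group average shrinks as 1/\sqrt{k}, while its expectation stays at 0 for a group of unlinked loci and at the mean of the \mu_i for a group of truly linked loci.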
Based on these properties of the NPL score, we can use linkage analysis data to test whether a pre-defined group of loci is enriched for true disease-linked loci. This can be done by calculating the average NPL score of the group of loci and comparing it against a null distribution of average scores derived by randomly drawing groups of loci of equal size. The null hypothesis is that the proportion of true linked loci among the group of loci tested is not different from the proportion expected when choosing random loci, while the alternative hypothesis is that the proportion is greater, and hence the group is enriched. (Note that, as defined here, the proportion of true loci in the group tested corresponds to 1 minus the false discovery rate of the group.) We assessed the power of our proposed method through simulations using a variety of disease models and varying the number of errors in location predictions. We showed that this method can be powerful, even when fewer than half of the examined locations are real disease loci. We must note here that we used the NPL score because it is commonly available and because its statistical properties make it easier to present our hypothesis. Since significance is determined via permutations and no distribution is assumed, however, the method is applicable to any statistic. Also, we are aware that other summary statistics, such as the product of p-values (more often used to show the presence of at least one true locus in a group), can serve the same purpose. Assessment of other statistics will be the subject of future work.
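The test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the representation of a scan as a flat list of NPL scores, and the add-one correction to the empirical p-value are all assumptions made here.

```python
import random
from statistics import mean


def enrichment_p_value(npl_scores, group, n_random=1000, seed=0):
    """Empirical one-sided p-value for the average NPL score of `group`.

    npl_scores: NPL scores, one per genomic location in the scan.
    group: indices (into npl_scores) of the pre-defined loci to test.
    """
    rng = random.Random(seed)
    observed = mean(npl_scores[i] for i in group)
    hits = 0
    for _ in range(n_random):
        # Null distribution: average NPL of a random group of equal size.
        random_group = rng.sample(range(len(npl_scores)), len(group))
        if mean(npl_scores[i] for i in random_group) >= observed:
            hits += 1
    # Assumption: add-one correction, so the estimate is never exactly 0.
    return (hits + 1) / (n_random + 1)
```

Because the null distribution is built from the same scan, no distributional form is assumed for the statistic, which is why any other per-locus score could be substituted for the NPL score.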

Generation of simulated linkage scans
We assumed a baseline risk for the disease of 0.9 per cent, representing non-genetic factors. We used the '--simulate' function in the Merlin analysis package [2] to generate genome-wide marker data for nuclear families (two parents and four offspring), including five, ten or 20 biallelic (disease) loci carrying risk alleles of frequency p that independently increase the risk of disease two- or threefold (this corresponds to their relative risk). The disease allele frequency p was equal for all loci and was empirically adjusted to provide a population prevalence of 3 per cent. This corresponds to 70 per cent heritability (genetic/total variance), consistent with reports for many complex disorders. After generating data for a large number of families (up to 200,000, or 1.2 million individuals), the genotypes at the disease loci for each individual were examined, the risk was determined based on those genotypes and disease status was assigned with a probability corresponding to the risk. For example, a person carrying four risk alleles with a relative risk of 2 had a probability of 2^4 × 0.009 = 0.144 of being affected. Sufficient families were generated every time to ascertain 1,000 sibling pairs and 60 sibling triads (1,180 sibling pairs in total) or 500 sibling pairs and 30 sibling triads (590 sibling pairs). This ratio of pairs to triads corresponds to the most efficient choice, given the observed simulated families, but it is not far from common sibling size distributions in the complex disease literature. The genotypes of all markers, excluding the biallelic disease loci, were used for genome scans for linkage using the Merlin software [2] to calculate NPL scores across the genome. Markers other than the disease loci had six equifrequent alleles and were spaced 10 centimorgans (cM) apart, and there were no missing data.
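The risk model and the sibling-pair arithmetic in this paragraph can be illustrated as follows. The sketch assumes that each risk allele multiplies the baseline risk by the relative risk, as in the worked example above; the function names and the cap at a probability of 1 are hypothetical additions for illustration and do not come from Merlin.

```python
import random

BASELINE_RISK = 0.009  # baseline risk from non-genetic factors (0.9 per cent)


def disease_probability(n_risk_alleles, relative_risk):
    """Risk for a person carrying `n_risk_alleles` risk alleles in total."""
    # Assumption: the probability is capped at 1 for extreme genotypes.
    return min(1.0, BASELINE_RISK * relative_risk ** n_risk_alleles)


def assign_status(n_risk_alleles, relative_risk, rng):
    """Draw affected (True) or unaffected (False) with the computed risk."""
    return rng.random() < disease_probability(n_risk_alleles, relative_risk)


def total_sib_pairs(n_pairs, n_triads):
    """Each affected sibling triad contributes three sibling pairs."""
    return n_pairs + 3 * n_triads


rng = random.Random(0)
risk = disease_probability(4, 2)      # four risk alleles, relative risk 2
affected = assign_status(4, 2, rng)   # one random draw with that risk
large_sample = total_sib_pairs(1000, 60)
small_sample = total_sib_pairs(500, 30)
```

With these definitions, `risk` reproduces the 0.144 of the worked example, and the two sample sizes come out as 1,180 and 590 sibling pairs.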
A total of 359 microsatellite markers were simulated, starting on each chromosome at genetic position 0 and placing one marker every 10 cM until the end of the chromosome; therefore, a genetic length of 0 to less than 10 cM was left between the last marker and the telomere of each chromosome. All families had a size of six, with two parents and four offspring. Although this might be a slightly large family size compared with today's average in the Western world, it is less so for families that are ascertained today through their adult affected offspring. This family size was also necessary to make the generation of enough pedigrees for ascertainment computationally feasible. For each simulated scan, the location of the disease loci varied and these were placed randomly in the genome, allowing for co-localisation of more than one disease locus and for zero distance with scan markers if it so happened by chance.

Models examined
In order to assess the power of our method under many possible scenarios, we studied multiple genome scans under multiple disease parameters including: 1) The number of disease genes was set to five, ten or 20; 2) The increase in risk from each risk allele was set to 2 or 3 relative to the baseline risk; 3) The number of ascertained families was 500 pairs + 30 triplets or 1,000 pairs + 60 triplets. For each of the 12 possible sets of parameters, 25 genome scans were generated. For each genome scan, we examined a number of different scenarios regarding the number of true and non-true disease loci in the group to be tested. When some non-true locations, or not all the real locations, were included, the groups were chosen 100 times at random to account for the stochastic variation inherent to the selection. In order to determine significance for each of the 100 selected groups, the average NPL of each was compared with the null distribution formed by the average scores of 1,000 random groups of equal size, that is, groups chosen without taking into account whether or not the included loci correspond to disease gene locations. The 2,500 empirical significance values obtained from 25 scans × 100 group permutations were used to determine the power for each model (each cell in Table 1). Since in these simulated data the alternative hypothesis (as stated above) is always true, the number of times that the empirical significance is less than the desired significance level, α, corresponds to the power.

Avramopoulos et al.

Review PRIMARY RESEARCH
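The power estimate for one model can be sketched as below. This is an illustrative outline under stated assumptions: the helper names are hypothetical, loci are represented by integer indices, and the empirical p-values are taken as given (for instance from a permutation test on each selected group).

```python
import random


def draw_test_group(true_loci, other_loci, n_true, n_false, rng):
    """Random group containing n_true real and n_false non-real locations."""
    return rng.sample(true_loci, n_true) + rng.sample(other_loci, n_false)


def empirical_power(p_values, alpha):
    """Fraction of empirical p-values below the significance level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


rng = random.Random(0)
# Hypothetical scan with 5 real loci (indices 0-4) among 359 locations.
group = draw_test_group(list(range(5)), list(range(5, 359)), 3, 7, rng)
# Hypothetical p-values; in the study there are 2,500 per model.
power_05 = empirical_power([0.001, 0.02, 0.04, 0.2], alpha=0.05)
```

Counting rejections gives the power directly only because every simulated group is enriched by construction, so each rejection is a true positive.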

Results
Table 2 summarises our general observations from the simulated genome scans with 1,180 affected sibling pairs. A linkage peak was considered to contain a real disease locus if the NPL scores between the original location of the binary disease marker and the observed peak did not drop by more than one unit below the score at the peak. Even with five loci of relative risk 3, the top NPL score did not reach genome-wide significance on most scans (25-scan average = 4.22), according to the criteria proposed by Lander and Kruglyak [3] (for the genome-wide significant p = 2.2 × 10^-5, an NPL score of 4.4 is required), in accordance with what has been observed in real data analyses [4] and predicted by Risch and Merikangas [5]. It is notable and encouraging, however, that, across the models we tested, 40-92 per cent of scans showed strongest linkage at a real locus. We also observed that even when there are only five true loci, on average one of these loci is not among the top ten peaks of a scan and would therefore not be detected. Again, this observation is very much in agreement with the experience from linkage studies of complex disorders, as many linkage findings that have been considered to carry strong evidence have often not been replicated in subsequent studies of different pedigrees. As expected, the NPL scores and the fraction of true positives among the top linkage peaks decrease as the number of disease loci increases and as their relative risk decreases. An increase in the fraction of true findings is counter-intuitively observed as the number of true disease loci and the number of top linkage peaks examined increase to 20; however, this does not correspond to an increase in the fraction of real genes identified. When looking at Table 2, the reader should keep in mind that when there are only five real genes and a set of 20 loci is tested, the maximum possible fraction of true loci in the set is 5/20 = 25 per cent.
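The rule used above to decide whether a linkage peak contains a real disease locus can be written as a small check. This is one possible reading of the criterion, assuming scores on a single chromosome at ordered positions, with the disease locus mapped to the nearest scored position and both end points included; the function name and the `max_drop` parameter are hypothetical.

```python
def peak_contains_locus(npl_scores, peak_index, locus_index, max_drop=1.0):
    """True if no score between the locus and the peak falls more than
    `max_drop` below the score at the peak (both end points included)."""
    lo, hi = sorted((peak_index, locus_index))
    threshold = npl_scores[peak_index] - max_drop
    return all(score >= threshold for score in npl_scores[lo:hi + 1])


# Hypothetical scores along one chromosome, with a peak at index 2.
chromosome = [0.5, 1.2, 3.0, 2.6, 2.1, 0.4]
near = peak_contains_locus(chromosome, peak_index=2, locus_index=4)
far = peak_contains_locus(chromosome, peak_index=2, locus_index=5)
```

Here `near` is true because the scores 3.0, 2.6 and 2.1 all stay within one unit of the peak, whereas `far` is false because the score of 0.4 at the locus lies well below that threshold.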
Overall, our observations confirm that our confidence in the linkage peaks of a single scan should be somewhat reserved until we observe replication, but also that non-replication of a linkage finding does not necessarily discredit a positive finding. In other words, it will take more than a few linkage scans to develop strong confidence in the location of true susceptibility loci for a complex disorder.
Table 1 presents an evaluation of the power of the approach we propose here for examining multiple genomic locations for linkage using a summary statistic, namely the average NPL score. In particular, it shows the power to detect enrichment by examining the average NPL score at levels of α = 0.05 and 0.01 and for different disease models calculated through our computer simulations. For a 1,180 sibling pair scan, and at the nominal level of α = 0.05, it is of interest that for a relative risk of 3 and for as many as ten disease loci, we can observe significance with power > 80 per cent even if only one-third of the locations in the group are true. For a relative risk of 2, we have 80 per cent power if half of the locations in the group are true. For 20 segregating loci and a relative risk of 3, we can only tolerate ten non-real locations in a group that includes all 20 correct locations if we wish to have 80 per cent power. As expected, the power is reduced with smaller sample sizes. Figure 1 provides a three-dimensional graph showing how power increases when there are fewer real loci contributing to the risk and when more of these are included in the group. As the group gets larger and the fraction of true loci that are included is reduced, however, the power decreases.
Table 1. The power of our method for different simulated models (five, ten or 20 disease loci, 1,180 or 590 sibling pairs, relative risk (RR) of 2 or 3), different levels of enrichment for true loci and different levels of significance. Column headings: # Real loci, Group.

Based on our observations on the true positive enrichment of the top linkage peaks (Table 2) and on the power of averaging NPL scores (Table 1), we decided to test our approach on real data. Our simulations indicate that, as expected, the group of top findings of a linkage scan is enriched in true disease locations. Therefore, provided that there are not too many true loci with too small effects, when this group of top linkage peaks is tested against an independent linkage scan, it should show a significantly elevated average NPL score. We used our genome scan data on late-onset Alzheimer's disease (LOAD) to test this hypothesis. This scan was performed on a previously described collection of pedigrees from the National Institute of Mental Health genetics initiative [6], for which we have previously reported genome scan results [7]. The one gene known to be involved in LOAD is APOE [8], and its behaviour in terms of risk is very similar to our simulated models [9,10]. It has been suggested that another 4-5 loci with effects similar to APOE may be involved in LOAD [11]. Adopting a study design that enabled us to keep most variables equal and yet have two independent scans, we sorted the pedigrees by their assigned identification numbers (signifying collection site and collection sequence) and split them at the point that gives two sample sets (sets A and B) of equal numbers of sibling pairs (296 pairs each). We then ran a genome-wide linkage analysis on both sets, ranked the top scoring 30 locations from scan A, selected groups of five, ten, 15 ... 30 locations starting from the top and tested their average NPL in scan B. We note that splitting the data has no benefit for gene discovery, but we did this here as an exercise to show proof of principle because it provided us with a hypothesis that we could readily test using our method, namely that the peaks of a genome scan are enriched in true loci.

Given the small sample sizes (296 sibling pairs per scan), our power might have been low, since the underlying model is unknown; however, we viewed this analysis as exploratory. Table 3 shows the empirical significance obtained by selecting the best five and up to 30 locations from the top linkage peaks based on scan A, and testing their average NPL against the data from scan B. Although the mean NPL of the five top locations was not significantly high, once the number was raised to ten and 15, the scores were significant, suggesting enrichment in disease gene locations. We consider that this not only validates our method but that it is also very encouraging regarding the validity of the findings of our genome scan, suggesting that the top linkage peaks are indeed enriched in real disease loci to a significant degree. Although some might consider this notion to be obvious, it is contingent on the underlying disease model and might not necessarily be true. Based on the observations from the simulated linkage scans, and the expected low power of this test on 296 sibling pairs, this also suggests that the number of substantial disease loci is not too great and that their relative risks are not too small. As we performed comparisons in six groups, we next wanted to see if our findings were significant at the study-wide level.
The strong correlation between the tested groups makes Bonferroni correction too conservative, so we tested this empirically. We chose 10,000 random groups of 30 loci and tested inclusive subgroups of five, ten, 15 ... 30 members against the scan B results, as we did with the real data. A p-value of 0.016 or smaller in any sub-group was obtained 606 times, providing a study-wide significance of 0.06.
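The empirical study-wide correction can be sketched as follows. This is a minimal outline rather than the authors' code: the function names, the use of one pre-computed null distribution per subgroup size, and the treatment of the random draw order as the ranking that defines the nested subgroups are assumptions made for illustration.

```python
import random
from bisect import bisect_left
from statistics import mean

SIZES = (5, 10, 15, 20, 25, 30)  # nested subgroup sizes tested


def null_averages(scores, size, n_random, rng):
    """Sorted average scores of random groups of `size` loci."""
    return sorted(mean(rng.sample(scores, size)) for _ in range(n_random))


def p_from_null(sorted_null, observed):
    """Fraction of null averages that are >= observed."""
    n = len(sorted_null)
    return (n - bisect_left(sorted_null, observed)) / n


def study_wide_p(scores, best_p, n_groups=10000, n_random=1000, seed=0):
    """Share of random 30-locus groups whose best nested-subgroup p-value
    is as small as `best_p`, the smallest p-value seen in the real data."""
    rng = random.Random(seed)
    nulls = {k: null_averages(scores, k, n_random, rng) for k in SIZES}
    count = 0
    for _ in range(n_groups):
        group = rng.sample(scores, max(SIZES))  # draw order defines nesting
        min_p = min(p_from_null(nulls[k], mean(group[:k])) for k in SIZES)
        if min_p <= best_p:
            count += 1
    return count / n_groups
```

In the analysis reported above, the corresponding count was 606 out of 10,000 random groups, that is 606/10,000 ≈ 0.06.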

Discussion
We have shown how one can test for the enrichment of a group of genomic locations for disease loci using linkage genome scan data. Candidate groups of genomic locations can arise from multiple types of data. For example, one can compare the results of two independent genome scans, as described here for Alzheimer's disease, in the same or different organisms. Alternatively, one can test prior results of expression studies or genome-wide association analyses, or genes belonging to specific pathways or interacting with a suspected disease gene. The method could be extended to applying weights to individual locations based on the strength of prior evidence. This is highly intuitive for testing locations that carry a score or a significance value (such as linkage, expression or association results) but less so for other types of groups (interacting proteins, members of a functional group, etc). One could also extend the approach by testing groups on data other than linkage results, yet such approaches require further method development because there can be a number of issues that need to be addressed. The use of sum statistics for SNP association data has been described previously by Wille et al. [12], Hoh et al. [13] and Kim et al. [14]. The goals of these investigators, however, were different to ours. These authors sought methods to test for associations in multilocus disorders, with the notion of increasing power to detect associations with any one of the loci by examining groups of SNPs or other DNA markers in concert. By contrast, we sought to develop a method specifically for testing the hypothesis that a group of pre-selected genomic locations is enriched for disease loci. Our method is suitable for testing any group of genes or locations on pre-existing data. In fact, our method could complement and add to the validity of the findings from other SNP set association studies.
Table 3. Application of our method to real data. Two sets of pedigrees were used for scans A and B. Groups of top linkage peaks from scan A (their size is shown in column 1) were then tested for enrichment on the results of scan B. Column 2 shows the empirical p-values for these groups.

There is one important pitfall about which investigators need to be cautious. It is necessary to make sure that the linkage data used for testing the enrichment hypothesis were not in any way used to generate the hypothesis. For example, if one tests genes that have been reported to be associated with a disease, it is necessary to use linkage data generated and/or published after the associations, as there is a strong bias towards association testing in linked regions. If the linkage data were known before the association studies, the genes might have been examined because of the positive linkage scores and testing their scores on the same linkage scan is certain to give a false positive result. For example, there are numerous association studies on Alzheimer's disease and we could have used our linkage data to test whether the group of positive findings is enriched for true genes. The pedigrees used in our study, however, have been publicly available and used for genome scans since 1999 [15]. Many association studies that followed were biased towards examining linked regions and thus the positive findings would have a similar bias. We would need results from an unbiased genome screen for association to perform a valid test for enrichment. One also needs to consider that although the power of the method is substantial, it will quickly diminish if multiple hypotheses of little merit are examined, as this will require substantial correction for multiple comparisons. Additionally, as the true underlying disease model is not known, negative results cannot be taken as evidence against a hypothesis and must be interpreted with caution.
Although failing to reject the null hypothesis might suggest that the alternative is wrong, it might also be due to decreased power resulting from the small effect of individual genes, the large number of genes involved, insufficient enrichment of the selected locations in true disease genes or the small number of pedigrees in the linkage study.
Regarding the last point, the approach could be extended to simultaneous examination of two or more linkage scans to increase power without the need to combine the genotype data with all the inherent difficulties of doing so. One can simply perform permutations of the same group of random loci on both scans and examine the distribution of the combined average NPL score against the observed average of the two scans for the tested group.
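A possible form of this two-scan extension is sketched below. It assumes that both scans are scored at the same ordered genomic locations and that the two scans are weighted equally in the combined average; neither detail is specified in the text, and the function name and the add-one correction are hypothetical.

```python
import random
from statistics import mean


def combined_enrichment_p(scan_a, scan_b, group, n_random=1000, seed=0):
    """Empirical p-value for the group's average NPL pooled over two scans.

    scan_a, scan_b: NPL scores at the same ordered genomic locations.
    group: indices of the pre-defined loci to test.
    """
    if len(scan_a) != len(scan_b):
        raise ValueError("both scans must be scored at the same locations")
    rng = random.Random(seed)

    def pooled_average(indices):
        # Assumption: equal weight for every score from either scan.
        return mean([scan_a[i] for i in indices] + [scan_b[i] for i in indices])

    observed = pooled_average(group)
    hits = 0
    for _ in range(n_random):
        # The same random loci are evaluated on both scans.
        random_group = rng.sample(range(len(scan_a)), len(group))
        if pooled_average(random_group) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)
```

Weighting the scans by sample size would be a natural variation, although it is not discussed in the text.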
Our simulation data can provide some guidance on the optimal selection of group size. As Figure 1 shows, when less than 50 per cent of the loci in the group are real, the power starts to diminish. Significant loss of power is also observed when fewer than half of all true loci are included in the group (Table 1); thus, we suggest using the maximum group size that does not exceed twice the predicted number of true loci. Our data on LOAD support this, as the predicted number of loci conferring a relative risk of 2-3 is five [11], and we obtained our strongest finding with a group size of ten. When information on an expected number of disease genes is available, we suggest avoiding multiple comparisons by defining a priori the size of the tested group to roughly twice that number. If one wishes to test multiple group sizes, correction for the multiple correlated comparisons is required using empirical methods. As we observed in our example in Alzheimer's disease, the predicted group of ten loci would have provided the highest significance, while testing six groups resulted in a study-wide p-value of 0.06. Variations in group size can be useful in determining the most enriched group, yet it might be best to perform this analysis after significance has been established.
Our example using Alzheimer's disease linkage data showed how positive findings can not only confirm a hypothesis (in this case, confirm that a significant proportion of disease loci are amidst the top linkage findings) but also lead to insight regarding the possible underlying model. It has been previously proposed that about five loci, each conferring a relative risk of 2-3 for LOAD, segregate in the population [11]. According to Table 2, for 1,180 sibling pairs and five loci with a relative risk of 3, we would expect that 3.2 of the top five and four of the top ten linkage peaks would be real. These numbers would be 2.5 and 3.4, respectively, for 590 sibling pairs. If we compared these against a linkage scan of 590 sibling pairs, extrapolating from Table 1, we would expect to have somewhat more than 80 per cent power to detect this degree of enrichment. Although our sample for both genome scans was about half the size of this sample and presumably had significantly less power, we detected the enrichment in our data. Having a positive finding in this analysis that is consistent with the proposed number of loci and relative risks is very encouraging, as it suggests that linkage analysis has pointed to some truly linked regions in our Alzheimer's genome scan.
The analytical approach we propose here is simple and, since it calculates the significance of findings based on permutations, robust to type I errors, provided that the prediction of the genomic locations to be grouped and analysed is in no way biased by the linkage data on which the test will be performed. We showed that the approach has substantial power under disease models with a moderate number of risk genes and moderate relative risks. We believe that as more and more diverse data accumulate through the various high-throughput technologies, it is increasingly important to devise more methods of combining and cross-validating the resulting information that will help us succeed in our effort to understand complex disorders.