A survey of data mining methods for linkage disequilibrium mapping

Data mining methods are gaining more interest as potential tools in mapping and identification of complex disease loci. The methods are well suited to large numbers of genetic marker loci produced by high-throughput laboratory analyses, but also might be useful for clarifying the phenotype definitions prior to more traditional mapping analyses. Here, the current data mining-based methods for linkage disequilibrium mapping and phenotype analyses are reviewed.


Introduction
During recent years, there has been growing interest in using data mining methods in gene mapping, motivated by the lack of success of the more traditional approaches for complex diseases, and also by the intriguing possibility of simultaneous detection of multiple loci. 1,2 Although a wide spectrum of computational approaches is used for data mining, they tend to share certain attractive characteristics for genetic association analysis.
First, the methods are usually computationally efficient and scale to high numbers of markers and individuals, such as those expected in the near future in genome-wide association scans. Obviously, this efficiency comes with a price: the models considered tend to be simpler than those usually used in statistical genetics.
Secondly, data mining methods are often aimed at exploration or discovery - for example, by generating plausible models (or hypotheses) for further analysis rather than considering one given model in great detail. This aim coincides with a general trend in data analysis to move from hypothesis-driven to hypothesis-generating research. The results of such exploration must often be complemented with more traditional statistical analysis.
Thirdly, data mining methods typically handle discrete data and use symbolic structures, giving results and explanations that may be easier to understand and utilise for users but are less suitable for statistical analysis.
'Data mining' is often loosely defined as 'non-trivial extraction of implicit, previously unknown and potentially useful information from data'. For this review of data mining methods for linkage disequilibrium (LD) mapping, the authors have chosen, at their own discretion, methods which reflect the three above-mentioned characteristics.
The data mining approaches of this review can be roughly categorised into three groups: (1) classification methods that directly aim to find markers and other features that help to predict the disease status; (2) clustering techniques for finding subgroups of subjects, based on their genotypic and phenotypic similarity, and analysis of their disease association; and (3) methods based on the discovery of typical haplotypes (or haplotype patterns) and analysis of their associations with the disease (Table 1).
In addition to gene mapping, data mining approaches have been applied to related areas, such as disease-susceptibility gene identification using literature databases. 3,4 This review focuses on LD mapping approaches. For more complete coverage, work from workshop proceedings and unpublished articles is also included. Some of the methods are available as software; for these, a web page or e-mail address is provided.

Classification methods
Classification methods aim at finding rules or regularities that predict the value of a target variable from the independent variables. When applied to gene mapping, the goal is to find markers or haplotypes (and potentially other variables) that together are good predictors of the phenotype and then, more as a side-effect, predict a disease-susceptibility gene to be close to these markers. Regression analysis is a well-known prediction method for quantitative traits; this review focuses on classification methods for categorical traits.
Recursive partitioning (RP) methods (also known as decision/classification/regression trees) have been used for this purpose - for example, by Young and Ge 5 and Cook et al. 6 RP produces a tree which can be described as a series of carefully crafted questions about the attributes of the test record, where each question splits the data into two parts and the next question is always conditional on the previous one(s). The gene finding method is, consequently, conditional: once a split is made based upon a single gene (or marker or haplotype), then the subsequent analysis is conditional on the results of that split - which is a very natural assumption for genetic effects. Young and Ge 5 present a successful application of RP, carried out with HelixTree (www.goldenhelix.com) on simulated clinical trial data, where the aim is to find (out of 80 genetic polymorphisms) those markers that have the highest impact on the efficacy and safety of a blood pressure medication.
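The conditional nature of recursive partitioning can be illustrated with a minimal sketch. It is not the HelixTree implementation: the Gini impurity criterion, the function names (gini, best_split, grow), the marker names and the toy genotype data are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, markers):
    """Best binary question 'is the genotype at marker m equal to g?'.

    Returns (marker, genotype, weighted impurity) or None if no split exists.
    """
    best = None
    for m in markers:
        for g in sorted({r[m] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[m] == g]
            right = [y for r, y in zip(rows, labels) if r[m] != g]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (m, g, score)
    return best

def grow(rows, labels, markers, depth=2):
    """Recursively partition the data; each question is conditional on earlier splits."""
    split = best_split(rows, labels, markers) if depth > 0 else None
    if split is None or gini(labels) == 0.0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    m, g, _ = split
    yes = [(r, y) for r, y in zip(rows, labels) if r[m] == g]
    no = [(r, y) for r, y in zip(rows, labels) if r[m] != g]
    return {
        "question": (m, g),
        "yes": grow([r for r, _ in yes], [y for _, y in yes], markers, depth - 1),
        "no": grow([r for r, _ in no], [y for _, y in no], markers, depth - 1),
    }

# Hypothetical toy data: response depends on M1 first, then on M2 within M1 == "AA".
markers = ["M1", "M2"]
rows = [
    {"M1": "AA", "M2": "CC"}, {"M1": "AA", "M2": "CT"}, {"M1": "AA", "M2": "TT"},
    {"M1": "AG", "M2": "CC"}, {"M1": "AG", "M2": "TT"}, {"M1": "GG", "M2": "CT"},
]
labels = ["responder", "responder", "non-responder",
          "non-responder", "non-responder", "non-responder"]
tree = grow(rows, labels, markers)
```

In this toy example the root question concerns M1, and M2 is examined only within the M1 = AA branch, which mirrors the conditional analysis described above.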
Symbolic discriminant analysis (SDA) was utilised in integrated analysis of multiple data types (genetic markers, genomic and proteomic data) in a review by Reif et al. in 2004. 7 SDA is a supervised pattern-mining approach that carries out variable selection and model selection simultaneously and automatically. SDA builds discriminant functions from a list of mathematical operators (eg +, -, ×, /) and explanatory variables that can distinguish between disease classes in the data. In an integrative analysis of simulated multiple-type data, the authors showed that, in particular, when the aetiology of the disease is complex, the integrated analysis can be highly advantageous. The SDA approach implemented by Reif et al. can be obtained from jason.h.moore@dartmouth.edu.
Association rules have been applied to genetic problems - for example, by Rova et al. 8 in a candidate gene analysis for bronchopulmonary dysplasia in newborns, where a number of non-genetic risk factors had also been measured and best combinations of covariates and genetic markers were sought. Association rules describe co-occurrences of sets of features and can be computed very efficiently. In this case, the presence of two different, but sometimes co-occurring, syndromes was set as the target, and the significance of association of conjunctions of several genetic and non-genetic risk factors to either syndrome was measured from the association rules. Two separate polymorphisms were proposed to have a phenotypic effect via separate molecular mechanisms. 8 Although the implementation used by this group is not available, a general-purpose Apriori algorithm for finding association rules is given by Agrawal et al.; 9 freely available implementations are numerous (eg http://www.kdnuggets.com/software/).
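As an illustration of how association rules with a disease target can be computed, the following is a minimal level-wise (Apriori-style) sketch. It is not the implementation used by Rova et al.; the function names, the support threshold and the item names (snp_a, snp_b, risk_x, disease) are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions) / n

    singletons = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in singletons
             if support(frozenset([i])) >= min_support]
    frequent = {}
    while level:
        for s in level:
            frequent[s] = support(s)
        # Join frequent k-sets into (k+1)-sets and keep only candidates whose
        # every k-subset is frequent (the Apriori pruning property).
        k = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and support(c) >= min_support]
    return frequent

def rules_for_target(frequent, target):
    """Rules 'antecedent -> target' as (antecedent, support, confidence) triples."""
    rules = []
    for itemset, supp in frequent.items():
        if target in itemset and len(itemset) > 1:
            antecedent = itemset - {target}
            rules.append((antecedent, supp, supp / frequent[antecedent]))
    return sorted(rules, key=lambda r: -r[2])

# Hypothetical subjects, each described by the set of features that are present.
subjects = [
    {"snp_a", "risk_x", "disease"},
    {"snp_a", "risk_x", "disease"},
    {"snp_a", "disease"},
    {"risk_x"},
    {"snp_a"},
    {"snp_b"},
]
rules = rules_for_target(frequent_itemsets(subjects, min_support=0.3), "disease")
```

Each rule carries its support and confidence; a significance measure for the association, as used in the study described above, would be computed on top of these counts.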
The DICE algorithm 10 identifies a subset of genetic and non-genetic covariates that are, either individually or in combination, associated with a phenotype. The relationship between the phenotype and the covariates is modelled using a logistic, linear or Cox regression model. The algorithm explores, by means of a forward procedure, a set of competing models and selects the most parsimonious and informative approximating model(s) that minimise(s) the information criterion. Thus, the method combines the advantages of regressive approaches in terms of modelling and interpretation of effects with those of data exploration tools. It should be well suited to detecting interactions between genetic and non-genetic factors within the framework of association studies. DICE has been successfully applied to a handful of datasets 10 (http://ecgene.net/genecanvas/modules/news/article.php?storyid=7). DICE is available upon request from laurence.tiret@chups.jussieu.fr.
Multifactor dimensionality reduction (MDR) 11 is a non-parametric approach to detecting and characterising non-linear interactions among discrete genetic and environmental attributes. Multilocus genotypes are pooled into high-risk and low-risk groups, reducing the number of genotype predictors. The reduced-dimension variable is used to classify and predict disease status through cross-validation and permutation testing. MDR has been shown to be capable of revealing significant high-order interactions in real datasets 12 (http://www.epistasis.org/mdr.html).

Table 1. Main classes of data mining approaches to gene mapping, characterised by three criteria: (1) Descriptive methods primarily aim to recognise the ancestral, shared chromosomal segments identical by descent, whereas predictive methods directly associate with the disease status. (2) Some approaches try to partition the set of subjects into homogeneous groups, some emphasise local similarities in haplotypes, and some are compromises between these extremes. (3) The suitability for describing and computing interactions varies between approaches.
Approach          Methods                      Characteristics
Classification    RP, 5,6 SDA, 7 DICE, 10
A survey of data mining for LD mapping   Review   SOFTWARE REVIEW

Support vector machine (SVM) is an algorithm that attempts to find a linear separator (hyperplane) between the data points of two classes in multidimensional space. SVMs are well suited to dealing with interactions among features and redundant features. Waddell et al. 13 used SVM to predict the age at diagnosis of multiple myeloma, based on 3,000 single nucleotide polymorphisms (SNPs) genotyped in 40 young age-at-onset and 40 old age-at-onset patients. Although the authors do not refer to their method as being LD mapping, their search for best predictor SNPs for the trait is based on the hypothesis that if there is a genetic factor to the trait, then a SNP in the haplotype block (ie in strong LD) containing that gene will be discovered. In fact, the trained SVM produced a model with a reasonable accuracy (71 per cent by cross-validation), but the model was not easily interpretable: it consisted of 150 SNPs. A general-purpose SVM algorithm, SVMlight, is publicly available at http://svmlight.joachims.org.
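The pooling step of MDR described earlier in this section can be sketched as follows. This is a simplified illustration rather than the published MDR software: the case:control threshold of 1.0, the function names and the toy data are assumptions, and the cross-validation and permutation testing steps are omitted.

```python
from collections import defaultdict

def mdr_pool(genotypes, status, loci, threshold=1.0):
    """Label each multilocus genotype cell 'high' or 'low' risk by its case:control ratio."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for g, s in zip(genotypes, status):
        cell = tuple(g[locus] for locus in loci)
        if s == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    risk = {}
    for cell in set(cases) | set(controls):
        # High risk when cases exceed threshold times the number of controls.
        risk[cell] = "high" if cases[cell] > threshold * controls[cell] else "low"
    return risk

def mdr_accuracy(genotypes, status, loci, risk):
    """Fraction of subjects whose status matches the pooled one-dimensional predictor."""
    correct = 0
    for g, s in zip(genotypes, status):
        cell = tuple(g[locus] for locus in loci)
        predicted = 1 if risk.get(cell, "low") == "high" else 0
        correct += predicted == s
    return correct / len(status)

# Hypothetical data with a purely interactive effect: a subject is a case
# when exactly one of the two loci carries genotype 1.
genotypes = [
    {"A": 0, "B": 1}, {"A": 0, "B": 1}, {"A": 1, "B": 0}, {"A": 1, "B": 0},
    {"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1},
]
status = [1, 1, 1, 1, 0, 0, 0, 0]

single = mdr_accuracy(genotypes, status, ["A"], mdr_pool(genotypes, status, ["A"]))
pair = mdr_accuracy(genotypes, status, ["A", "B"],
                    mdr_pool(genotypes, status, ["A", "B"]))
```

In this constructed example neither locus is informative alone, while the pooled two-locus variable separates cases from controls, which is the kind of non-linear interaction MDR is designed to detect. The accuracy here is measured on the training data only; a held-out estimate would require the cross-validation step.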

Clustering
Clustering aims to locate relatively homogeneous subgroups in the given data. In the context of LD mapping, clustering of study subjects has been suggested as an approach for finding subgroups of individuals who potentially share genetic factors. Such clustering can be based on haplotypes of the individuals, or on their phenotypes. After successful clustering, it should be easier to locate the genetic factors within the clusters, improving statistical power; however, power may be reduced if the effective sample size decreases.
A crucial factor here seems to be that genetically motivated similarity measures are used, based on haplotype sharing between individuals. 'Length measure' (the length spanned by the longest continuous interval of matching alleles) is one typical option, and 'count measure' (the number of alleles in common in a window) is another. 14,15 With such measures, the clustering method is directed to search for clusters with shared genetic aetiology. The association of clusters to the phenotype can then be measured - for example, using the χ² statistic - and the disease gene can be predicted to be where the best cluster shows similarity of haplotypes. An instance of this approach is implemented in the HapMiner 16 software (http://vorlon.cwru.edu/~jxl175/HapMiner.html). Durrant et al. 17 use hierarchical clustering to produce approximations of genealogical trees and map genes based on these trees. The method has been coded in the CLADHC algorithm, available as a Linux executable, with accompanying documentation, on request from amorris@well.ox.ac.uk. Molitor et al. 18 perform fine mapping by spatial clustering of haplotypes based on a similarity metric that measures the length of the shared region and by estimating the risk that each haplotype 'cluster' has for the trait. The implementation of the method is available from jmolitor@usc.edu.
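The two similarity measures mentioned above can be sketched in a few lines. The exact definitions differ between the cited methods; here the window is assumed to be centred on a marker index, the length is assumed to be measured on a map of marker positions, and the function names and example values are hypothetical.

```python
def count_measure(h1, h2, center, half_width):
    """Number of matching alleles in the window of markers around index `center`."""
    lo = max(0, center - half_width)
    hi = min(len(h1), center + half_width + 1)
    return sum(a == b for a, b in zip(h1[lo:hi], h2[lo:hi]))

def length_measure(h1, h2, positions):
    """Map length spanned by the longest unbroken run of matching alleles.

    A run consisting of a single matching marker spans a length of zero.
    """
    best = 0.0
    start = None
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a == b:
            if start is None:
                start = i
            best = max(best, positions[i] - positions[start])
        else:
            start = None
    return best

# Two hypothetical haplotypes over six markers; they differ only at marker index 2.
h1 = [1, 2, 1, 1, 2, 1]
h2 = [1, 2, 2, 1, 2, 1]
positions = [0.0, 0.5, 1.0, 1.8, 2.1, 3.0]  # assumed map positions

shared_count = count_measure(h1, h2, center=2, half_width=1)
shared_length = length_measure(h1, h2, positions)
```

A pairwise matrix of either measure could then serve as input to a standard clustering procedure.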
A good example of a slightly different approach to LD mapping, also based on measuring haplotype similarities but not on clustering, is given by Tzeng et al. 14 They investigated the hypothesis that the average similarity between case haplotypes tends to be higher than between control haplotypes. Under this assumption, disease-susceptibility genes can be localised directly by measuring the statistical significance of haplotype similarity in the cases without explicit clustering or goodness of fit tests, such as χ². The authors concluded that similarity measures are actually more powerful than goodness of fit tests when the mutation occurs on a common haplotype, but that goodness of fit tests are superior for rare haplotypes. Haplotype similarity and clustering were proposed as exploratory haplotype analysis methods by Toivonen et al. 15
The other major variant of the clustering theme is to cluster subjects by their phenotypes, rather than haplotypes. Again, the aim is to find (phenotypic) subgroups that potentially have more homogeneous genetic aetiologies, but now utilising rich phenotypic datasets, where they exist. Wilcox et al. 19 clustered subjects from the Framingham Heart Study for this purpose. Different phenotypic measurements can have very different ranges and distributions, and these have to be handled to avoid unintended bias. Wilcox and others used multiple correspondence analysis (MCA), a non-parametric analogue of principal component analysis, to produce a reduced number of dimensions in which clustering was then performed. They subsequently used linkage analysis for mapping; there do not appear to be any publications on phenotype clustering for LD mapping, even though the approach should be equally feasible there.
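The haplotype-similarity hypothesis investigated by Tzeng et al., described earlier in this section, can be illustrated with a small sketch. It does not reproduce their test statistic: the similarity used here is a simple count of matching alleles, significance is assessed by a label permutation test substituted for illustration, and all names and data are hypothetical.

```python
import random
from itertools import combinations

def mean_pairwise_similarity(haps):
    """Average number of matching alleles over all pairs of haplotypes."""
    pairs = list(combinations(haps, 2))
    return sum(sum(a == b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)

def similarity_excess(haps, is_case):
    """Mean similarity among case haplotypes minus mean similarity among controls."""
    cases = [h for h, c in zip(haps, is_case) if c]
    controls = [h for h, c in zip(haps, is_case) if not c]
    return mean_pairwise_similarity(cases) - mean_pairwise_similarity(controls)

def permutation_p(haps, is_case, n_perm=1000, seed=0):
    """One-sided permutation p-value for the observed similarity excess."""
    rng = random.Random(seed)
    observed = similarity_excess(haps, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between haplotype and status
        if similarity_excess(haps, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical region: all case haplotypes share one ancestral haplotype.
haps = ["11111", "11111", "11111", "11111",
        "00000", "00011", "11000", "00100"]
is_case = [True, True, True, True, False, False, False, False]
observed, p_value = permutation_p(haps, is_case)
```

Applied marker by marker or window by window, such a statistic would localise the region where case haplotypes are unusually similar; at least two case and two control haplotypes are required for the pairwise averages to be defined.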

Discovery of frequent patterns
The most popular data mining method applied to gene mapping has been the discovery of typical haplotypes (or haplotype patterns) and analysis of their associations with the disease. In simple terms, the goal is firstly to discover sites and haplotypes potentially identical by descent, and then to test their disease associations.
Haplotype pattern mining (HPM) was the first such method (http://www.cs.helsinki.fi/group/genetics/). 20 The algorithm finds all haplotype fragments (patterns) of arbitrary length - possibly up to some limit and possibly with gaps - that show statistical association with the disease. The set of associated fragments is used as a whole to evaluate association across the chromosomal area studied. The area that shows the most significantly elevated number of patterns is the most likely location for a disease-susceptibility locus.
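The idea of counting associated patterns per marker can be sketched as follows. This is a strong simplification and not the HPM software: only contiguous fragments without gaps are enumerated, association is judged by a Pearson χ² threshold on a 2 × 2 table (3.84 is the 5 per cent critical value for one degree of freedom), and the function names, parameters and haplotype data are hypothetical.

```python
from collections import Counter

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def candidate_patterns(haplotypes, max_len):
    """All contiguous fragments (start, alleles) of up to max_len markers seen in the data."""
    patterns = set()
    for h in haplotypes:
        for start in range(len(h)):
            for end in range(start + 1, min(len(h), start + max_len) + 1):
                patterns.add((start, tuple(h[start:end])))
    return patterns

def marker_scores(haplotypes, is_case, max_len=3, chi2_threshold=3.84):
    """For each marker, count the disease-associated patterns that span it."""
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    scores = Counter()
    for start, alleles in candidate_patterns(haplotypes, max_len):
        end = start + len(alleles)
        matches = [tuple(h[start:end]) == alleles for h in haplotypes]
        case_hits = sum(m and c for m, c in zip(matches, is_case))
        ctrl_hits = sum(m and not c for m, c in zip(matches, is_case))
        # Keep only patterns that are over-represented in cases.
        enriched = case_hits * n_ctrl > ctrl_hits * n_case
        chi2 = chi2_2x2(case_hits, ctrl_hits, n_case - case_hits, n_ctrl - ctrl_hits)
        if enriched and chi2 >= chi2_threshold:
            for marker in range(start, end):
                scores[marker] += 1
    return [scores[m] for m in range(len(haplotypes[0]))]

# Hypothetical haplotypes over six markers; most cases share "22" at markers 2-3.
haplotypes = ["112211", "212212", "122221", "112212", "212211", "111111",
              "111121", "211112", "121211", "112111", "211121", "121112"]
is_case = [True] * 6 + [False] * 6
scores = marker_scores(haplotypes, is_case)
```

In this toy data the score peaks at the two markers carrying the shared fragment and falls off towards the ends of the region, which is the kind of profile used to suggest a disease-susceptibility locus.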

The significance of the finding is evaluated by permutation tests, where both marker-wise nominal significances and a corrected significance for the best finding are computed. 21,22 The HPM method is fast, especially with respect to the number of markers, and it is sensitive to small genetic effects. The results are rough, however, and more elaborate (and computationally more expensive) statistical models are expected to predict the disease mutation locale better than HPM. In conclusion, HPM seems to work excellently as the first-stage analysis tool of genome-wide association and has been successfully applied in various circumstances - for example, for asthma-related traits, 23 glucocorticoid sensitivity 24 and familial glioma. 25 Variants of HPM include a method for finding two (interacting) loci at the same time 26 and QHPM for analysis of quantitative traits. 27 F-HPM, developed by Zhang et al., 28 is a further development of HPM, in which the strength of association is tested in pedigrees using the quantitative pedigree disequilibrium test. 29
The tree disequilibrium test (TreeDT; http://www.cs.helsinki.fi/group/genetics/) 30 is a more elaborate attempt to model the unknown coalescence, rather than just haplotyping fragments potentially identical by descent. TreeDT constructs, at each locus, trees that approximate the genealogy of the haplotypes at that locus, much like the method of Durrant et al. 17 These trees can be obtained efficiently using known algorithms for strings, making the method computationally fast. After trees are built for all locations, a disequilibrium test is performed on each of them to test if there is a small set of subtrees with relatively high proportions of disease-associated chromosomes, suggesting shared genetic history for those and a likely disease-gene location. Again, permutation tests are used to measure significances. TreeDT is fast and has been shown to be relatively accurate, especially when allelic heterogeneity is present in a disease locus.
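The permutation scheme mentioned above for HPM and TreeDT, with marker-wise nominal significances and a corrected significance for the best finding, can be sketched generically. The per-marker score used here (a difference in allele frequencies) is only a stand-in for the pattern counts or tree statistics of the actual methods, and all names and data are assumptions.

```python
import random

def allele_frequency_gap(haplotypes, is_case):
    """Toy per-marker score: |frequency of allele '2' in cases - frequency in controls|."""
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    scores = []
    for m in range(len(haplotypes[0])):
        case_2 = sum(h[m] == "2" for h, c in zip(haplotypes, is_case) if c)
        ctrl_2 = sum(h[m] == "2" for h, c in zip(haplotypes, is_case) if not c)
        scores.append(abs(case_2 / n_case - ctrl_2 / n_ctrl))
    return scores

def permutation_significance(score_fn, data, labels, n_perm=1000, seed=0):
    """Nominal per-marker p-values and a corrected p-value for the best marker."""
    rng = random.Random(seed)
    observed = score_fn(data, labels)
    exceed = [0] * len(observed)  # per-marker exceedance counts
    best_exceed = 0               # how often the permuted maximum reaches the observed maximum
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted = score_fn(data, shuffled)
        for m, (p, o) in enumerate(zip(permuted, observed)):
            exceed[m] += p >= o
        best_exceed += max(permuted) >= max(observed)
    nominal = [(e + 1) / (n_perm + 1) for e in exceed]
    corrected = (best_exceed + 1) / (n_perm + 1)
    return nominal, corrected

# Hypothetical haplotypes over six markers: six cases followed by six controls.
haplotypes = ["112211", "212212", "122221", "112212", "212211", "111111",
              "111121", "211112", "121211", "112111", "211121", "121112"]
is_case = [True] * 6 + [False] * 6

observed = allele_frequency_gap(haplotypes, is_case)
nominal, corrected = permutation_significance(
    allele_frequency_gap, haplotypes, is_case, n_perm=500)
```

Comparing the observed maximum with the maximum over all markers in each permutation accounts for the multiple markers tested, so the corrected value is never smaller than the nominal value at the best marker.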

Conclusions
Notably, the methods presented here are mostly intended for exploratory analysis and not so much for final stages of identifying a causative variant in genotype data. The user's expertise and insight play a key role: they are needed in choosing the methods and parameter values and are crucial in interpreting the results. Also, there is no universally optimal method for all purposes; it can be useful to try several different approaches for the same problem.
As pointed out by Hoh and Ott, 2 what is most needed for future large-scale genetic and genomic data analysis are 'methods for discovering sets of susceptibility genes and environmental factors, as well as systematic verifications of the gene-environment-disease network'. According to the present review, there already exist a number of data mining approaches to gene mapping or identification purposes (Table 1); however, they are still rather scattered, consisting of somewhat solitary attempts to use different machine-learning or data mining approaches.
Classification methods are typically strong in modelling interactions, unlike most other approaches in this review. Several of the classification methods produce a set of interacting loci that best predict the phenotype. However, a straightforward application of classification methods to large numbers of markers runs the risk of picking up randomly associated markers.
Approaches based on haplotype sharing, such as most of the reviewed clustering and pattern discovery methods, explicitly aim to reduce this problem by considering loci that are more likely to be identical by descent. Of course, combinations are possible; for instance, all frequent haplotype patterns could first be found and a classifier used to choose a subset of those and to model the interactions of their loci.
In the more distant future, one might expect to gain most from integrated large-scale analyses: data mining of high-throughput SNP data for LD mapping combined with phenotype subgroup analysis; expression analysis results (information about co-regulated enzymes in normal and trait-carrying individuals) integrated with the information on known metabolic pathways; and linking of the new experimental information to existing public data by mining literature and biological databases.