Multifactor dimensionality reduction: An analysis strategy for modelling and detecting gene-gene interactions in human genetics and pharmacogenomics studies

The detection of gene-gene and gene-environment interactions associated with complex human disease or pharmacogenomic endpoints is a difficult challenge for human geneticists. Unlike rare, Mendelian diseases that are associated with a single gene, most common diseases are caused by the non-linear interaction of numerous genetic and environmental variables. The dimensionality involved in the evaluation of combinations of many such variables quickly diminishes the usefulness of traditional, parametric statistical methods. Multifactor dimensionality reduction (MDR) is a novel and powerful statistical tool for detecting and modelling epistasis. MDR is a non-parametric and model-free approach that has been shown to have reasonable power to detect epistasis in both theoretical and empirical studies. MDR has detected interactions in diseases such as sporadic breast cancer, multiple sclerosis and essential hypertension. As this method is applied more frequently and gains acceptance in the study of human disease and pharmacogenomics, it is becoming increasingly important that the implementation of the MDR approach is properly understood. As with all statistical methods, MDR is only powerful and useful when implemented correctly. Concerns regarding dataset structure, configuration parameters and the proper execution of permutation testing in reference to a particular dataset and configuration are essential to the method's effectiveness. The detection, characterisation and interpretation of gene-gene and gene-environment interactions are expected to improve the diagnosis, prevention and treatment of common human diseases. MDR can be a powerful tool in reaching these goals when used appropriately.


Introduction
One of the biggest challenges in human genetics is identifying polymorphisms, or sequence variations, that present an increased risk of disease. In the case of rare, Mendelian single-gene disorders, such as sickle-cell anaemia or cystic fibrosis, the genotype to phenotype relationship is easily apparent, because the mutant genotype is explicitly responsible for disease. In the case of common, complex diseases, such as hypertension, diabetes or multiple sclerosis, this relationship is extremely difficult to characterise because disease is likely to be the result of many genetic and environmental factors. In fact, epistasis, or gene-gene interaction, is increasingly assumed to play a crucial role in the genetic architecture of common diseases. [1][2][3] This challenge is equally present in studies of pharmacogenomics. 4 The dimensionality involved in the evaluation of combinations of many such variables quickly diminishes the usefulness of traditional, parametric statistical methods. Referred to as the curse of dimensionality, 5 as the number of genetic or environmental factors increases and the number of possible interactions increases exponentially, many contingency table cells will be left with very few, if any, data points. In logistic regression analysis, this can result in increased type I errors and parameter estimates with very large standard errors. 6 Traditional approaches using logistic regression modelling are limited in their ability to deal with many factors, and simultaneously fail to characterise epistasis models in the absence of main effects, due to the hierarchical model-building process. 7 This leads to an increase in type II errors and decreased power. 8 This is a particular problem with relatively small sample sizes. Because sample collection is time-consuming and expensive, the decreased power can make the cost of effective studies prohibitive with traditional analytical methods.
© Henry Stewart Publications 1473-9542. Human Genomics. Vol. 2, No. 5, 318-328. March 2006.
In order to address these concerns, a novel statistical method, multifactor dimensionality reduction (MDR), was developed. MDR reduces the dimensionality of multilocus data to improve the ability to detect genetic combinations that confer disease risk. MDR pools genotypes into 'high-risk' and 'low-risk' or 'response' and 'non-response' groups in order to reduce multidimensional data into only one dimension. Because it is a non-parametric method, no hypothesis concerning the value of any statistical parameter is made. It is also a model-free method, so no genetic inheritance model is assumed. 9 MDR was designed to detect gene-gene or gene-environment interactions in datasets with categorical independent variables, such as single nucleotide polymorphisms (SNPs) and other sequence variations (insertions, deletions etc), as well as environmental data that can be represented as categorical variables. The endpoint, or dependent variable, must be dichotomous, such as case/control for studies of human disease. Pharmacogenomics data can also be analysed with MDR, in terms of 'response/non-response' or 'toxicity/no toxicity'. MDR is appropriate for any data type with two distinct clinical endpoints.
MDR has been used to identify interactions in the absence of any significant main effects in simulated data. In addition, MDR has identified interactions in a variety of different real datasets, including sporadic breast cancer, 9 essential hypertension, 7 type 2 diabetes, 10 atrial fibrillation, 11 amyloid polyneuropathy 12 and coronary artery calcification. 13 Each of these studies was the first of its kind to explore complex interactions and thus needs to be replicated in additional datasets. Studies with simulated data (of multiple models of different allele frequencies and heritability) have also shown that MDR has high power to identify interactions in the presence of many types of noise commonly found in real datasets (including missing data and genotyping error), whereas errors such as heterogeneity (genetic or locus) and phenocopy diminish the power of MDR. 14 Additionally, a mathematical proof has shown that, due to the relationship between MDR and a naïve Bayes classifier, MDR is optimally efficient in discriminating between clinical endpoints using multilocus genotype data. 15 As with any type of statistical method, the effectiveness of MDR is dependent on its proper implementation. As this method is used more frequently and gains acceptance in the study of human disease, it is becoming increasingly important that the implementation of the MDR approach is properly understood. Although the details of the software package have been published, 9,16 there are few resources available to guide a user through the details of the method itself. Concerns regarding dataset structure (including sample size, balance of cases and controls and structure of family data) must be considered before using MDR. Subsequently, issues involving configuration parameters (such as threshold values and cross-validation parameters) can affect the results of analysis and must be carefully considered. Performing hypothesis testing on an MDR model requires permutation testing. The proper execution of permutation testing in reference to a particular dataset and configuration is essential to the method's effectiveness.

Method overview
The details of the MDR method have been published previously. 9,14,16 Briefly, MDR is described here and is shown in Figure 1. In step one, the dataset is divided into multiple partitions for cross-validation. MDR can be performed without cross-validation; however, this is rarely done, due to the potential for over-fitting. 17 Cross-validation 18 is an important part of the MDR method, because it tries to find a model that not only fits the given data but can also predict on future, unseen data. Since attainment of a second dataset for testing is time-consuming and often cost-prohibitive, cross-validation produces a testing set from the given data to evaluate the predictive ability of the model produced. In the case of ten-fold cross-validation, the training set comprises 90 per cent of the data, whereas the testing set comprises the remaining 10 per cent of the data.
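As an illustration, the ten-fold split described above can be sketched as follows. This is a minimal, hypothetical helper (not the MDR software itself), assuming the dataset is simply a list of individuals:

```python
import random

def ten_fold_partitions(samples, seed=42):
    """Shuffle and split samples into ten roughly equal partitions.

    Each partition in turn serves as the testing set (about 10 per cent
    of the data), while the other nine form the training set (about
    90 per cent).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    return [shuffled[i::10] for i in range(10)]

folds = ten_fold_partitions(range(100))
testing_set = folds[0]
training_set = [s for fold in folds[1:] for s in fold]
```

With 100 individuals, each testing set holds ten individuals and the corresponding training set holds the remaining 90; with a dataset not divisible by ten, the slices simply differ in size by at most one, so no individuals are discarded.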
In step two, a set of n genetic and/or environmental factors is selected. The n factors and their possible multifactor classes are represented in n-dimensional space; for example, for two loci with three genotypes each, there are nine possible two-locus-genotype combinations. Then, the ratio of the number of cases to the number of controls is calculated within each multifactor class. Each multifactor class in n-dimensional space is then labelled as 'high risk' if the cases to controls ratio meets or exceeds some threshold (eg ≥ 1), or as 'low risk' if that threshold is not exceeded, thus reducing the n-dimensional space to one dimension with two levels ('low risk' and 'high risk'). Among all of the two-factor combinations, a single model that has the fewest misclassified individuals is selected. This two-locus model will have the minimum classification error among the two-locus models. In order to evaluate the predictive ability of the model, prediction error is estimated using the testing set. Mathematically, the calculation of prediction error and classification error is the same, but the portion of the dataset used to calculate the metric is different. Classification error is calculated on the training set, whereas prediction error is calculated on the testing set. Both metrics measure the number of individuals whose clinical endpoint has been incorrectly specified by the MDR model.

Figure 1. Summary of the MDR method (adapted from ref. 9). In step one, the data are divided into a training set and an independent testing set for cross-validation. In step two, a set of n factors is then selected from the pool of all factors. In step three, the n factors and their possible multifactor cells are represented in n-dimensional space. In step four, each multifactor cell in the n-dimensional space is labelled as high risk if the ratio of affected individuals to unaffected individuals exceeds a threshold of one, and low risk if the threshold is not exceeded. In steps five and six, the model with the best misclassification error is selected and the prediction error of the model is estimated using the independent test data. Steps one through to six are repeated for each possible cross-validation interval. Bars represent hypothetical distributions of cases (left) and controls (right) with each multifactor combination. Dark-shaded cells represent high-risk genotype combinations, whereas light-shaded cells represent low-risk genotype combinations. White cells represent genotype combinations for which no data were observed.

For studies with more than two factors, the steps of the MDR method are repeated for each possible model size (two-factor, three-factor etc), if computationally feasible. The result is a set of models, one for each model size considered. From this set, the model with the combination of loci and/or discrete environmental factors that maximises the cross-validation consistency and minimises the prediction error is selected. Cross-validation consistency is a measure of the number of times an MDR model is identified in each possible 90 per cent of the subjects. 9 When cross-validation consistency is maximal for one model and prediction error is minimal for another model, statistical parsimony is used to choose the best model. In model selection, it is crucial that prediction error, and not classification error, be used. This is due to over-fitting observed with classification error. As the number of loci evaluated increases, the classification error will always decrease. This phenomenon is shown in Figures 2A and 2B.
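The core reduction step, labelling each multifactor cell by its case:control ratio and then counting misclassified individuals, can be sketched as below. The function names and data layout are illustrative assumptions, not the published implementation:

```python
from collections import Counter

def label_cells(genotypes, status, threshold=1.0):
    """Label each multilocus genotype combination 'high' or 'low' risk.

    genotypes: one tuple per individual, eg ('AA', 'Bb')
    status:    1 for a case, 0 for a control
    A cell is high risk when its cases-to-controls ratio meets or
    exceeds the threshold; a cell with cases but no controls is
    treated as high risk.
    """
    cases, controls = Counter(), Counter()
    for g, s in zip(genotypes, status):
        (cases if s else controls)[g] += 1
    return {g: 'high' if controls[g] == 0 or cases[g] / controls[g] >= threshold
               else 'low'
            for g in set(cases) | set(controls)}

def misclassification_error(genotypes, status, labels):
    """Fraction of individuals whose endpoint the model gets wrong.

    Computed on the training set this is the classification error;
    computed on the testing set it is the prediction error.
    """
    wrong = sum((labels.get(g) == 'high') != bool(s)
                for g, s in zip(genotypes, status))
    return wrong / len(status)

# Toy two-locus example: six individuals in two genotype cells
genos = [('AA', 'BB')] * 3 + [('Aa', 'Bb')] * 3
stat = [1, 1, 0, 0, 0, 1]
labels = label_cells(genos, stat)
```

In this toy example the ('AA', 'BB') cell holds two cases and one control (ratio 2, high risk) and the ('Aa', 'Bb') cell holds one case and two controls (ratio 0.5, low risk), so two of the six individuals are misclassified.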
Hypothesis testing of this final best model can then be performed by evaluating the magnitude of the prediction error through permutation testing. Permutation testing is described in more detail below (Figures 3 and 4).
More recently, less emphasis has been put on choosing a single final model. Significance levels are assigned to each model in the final set using the permutation-testing procedure, and then all significant models are reported. This new approach attempts to use all information within the final set of models. Because the end goal of the MDR method is hypothesis generation, this approach may be preferred to reduce the risk of false negatives.

Implementation
The original distributed version of MDR was available as Linux, Solaris or Mac OS command-line software. Presently, MDR software is being distributed as a Java software package with a graphical user interface. The most current open-source version is available at www.epistasis.org/mdr.html. MDR has also been incorporated into the Weka-CG software, which is available from the same website. In addition, a 'C' library is under development for users to plug MDR into their own software packages. Figure 3 displays the steps considered in the MDR procedure, which will be covered in detail in the next four sections. When designing complex genetic and pharmacogenomic studies, the structure and size of a dataset is very important.

Dataset issues
MDR can be easily applied to case-control and discordant sibling pair study designs with little or no dataset modification. Appropriate datasets for MDR will include any number of genetic and environmental independent variables, along with two distinct clinical endpoints (dependent variables). MDR was originally designed to find interactions in studies of disease risk, but it is applicable to any type of dataset with two outcome levels. Efforts are underway to expand MDR to include more than two endpoints, because this can be done in other contingency table methods, but this is currently a restriction in the MDR software.
For case-control data with unrelated individuals, the order of individuals within the dataset is irrelevant because the data will be randomly shuffled during cross-validation. If the dataset consists of family/sibling data, or population-based matched case-control data, the order of individuals is very important. In such cases, the pairs must be kept together within the dataset during cross-validation splitting. These data should not be randomly shuffled during MDR analysis.
Pedigree data can be more complicated. Currently, pedigrees must be converted to sibling pair data for analysis, and there are several options to handle such datasets. The first option is to use all possible affected-unaffected pairs from each family. This would allow individuals to be represented multiple times in a dataset. The other option is to consider only one randomly chosen affected-unaffected pair from each pedigree. Currently, such datasets are handled on a case by case basis, and further work is being done to determine the appropriateness of each approach.
One particular type of pedigree data is triad data, where the genotypes of the parents and an affected child are known. In this case, 'pseudo-controls' must be created, because this approach enables evaluation of the genotypes that were transmitted to the affected child in comparison with genotypes that were not transmitted. This is done using allele data from the two parents to create a new 'child' with the alleles that were not transferred to the real affected child. For example, if the mother had genotype 'Aa' for a particular gene, and the affected child had received the 'A' allele, the pseudo-control would receive the 'a' allele from the mother. This would be done for every gene or SNP from both parents. Sibling pairs would be created from the pseudo-control and affected child for analysis.
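For a single locus, the construction of a pseudo-control genotype can be sketched as follows. This is a simplified, hypothetical routine: genotypes are two-character strings, and parent-of-origin ambiguities or genotyping errors are not handled:

```python
def pseudo_control(mother, father, child):
    """Return the pseudo-control genotype for one locus of a triad.

    From each parent, the pseudo-control receives the allele that was
    NOT transmitted to the affected child.
    """
    received = list(child)              # alleles the affected child carries
    non_transmitted = []
    for parent in (mother, father):
        alleles = list(parent)
        for a in received:
            if a in alleles:            # match one transmitted allele
                alleles.remove(a)
                received.remove(a)
                break
        non_transmitted.append(alleles[0])
    return ''.join(sorted(non_transmitted))
```

For the example in the text, a mother 'Aa' who transmitted 'A' to an 'AA' child (whose father is 'AA') contributes 'a' to the pseudo-control, giving a pseudo-control genotype of 'Aa' at that locus.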
Sample size requirements for MDR are not yet known. A total sample size of 400 individuals has been shown to have excellent power to detect two-locus interactions for a specific set of epistasis models simulated in datasets of ten total SNPs. 9 Larger sample sizes are needed for higher-order interactions. There is no theoretical formula for power calculations for MDR, so more thorough empirical estimates for sample size and power are needed. Preliminary simulation studies have demonstrated that datasets smaller than 50 cases and 50 controls show a decrease in power and, in addition, begin to show an upward bias and inflated variance in the prediction error estimates (unpublished data). Currently, more simulation studies are underway to understand the influence of different effect sizes and sample sizes on the power of the MDR method.
If the dataset is not balanced in the number of cases and controls, variations on the MDR configuration parameters may be utilised. When analysing such a dataset, there are several options. First, over-sampling or under-sampling might be considered. 9,19 Over-sampling involves randomly re-sampling the under-represented class of individuals within the dataset until the numbers of cases and controls are equal. Secondly, under-sampling involves randomly removing members of the over-represented class of individuals from the dataset until it is balanced. There is no particular rule for whether over-sampling or under-sampling is generally preferable. Currently, research is being done with simulated data to understand the implications of over- or under-sampling. Initial observations indicate that either over- or under-sampling is preferred over analysing an unbalanced dataset with a greater than 2:1 ratio of cases to controls, or vice versa (manuscript in preparation). In many datasets, convergence of results following over- and under-sampling demonstrates a strong signal. If results vary widely among the sampling datasets, it may indicate a weak signal within the dataset (manuscript in preparation). There are risks associated with using over- or under-sampling techniques. Over-sampling can introduce false associations due to the particular samples that were over-sampled. In addition, this can provide a false sense of higher power. Under-sampling is a mechanism by which data are thrown away. Again, this can lead to the introduction of a false association, depending on which samples are thrown out, or it can reduce power due to a smaller sample size. Thus, although these techniques are used in the literature, 19,20 they can be dangerous.
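A minimal sketch of the two sampling strategies follows; this is a hypothetical helper, not part of the MDR software:

```python
import random

def balance_by_sampling(cases, controls, mode='under', seed=0):
    """Balance a case-control dataset by random sampling.

    'under' randomly discards members of the larger class;
    'over' randomly re-samples the smaller class, with replacement,
    until both classes are the same size.
    """
    rng = random.Random(seed)
    n_small = min(len(cases), len(controls))
    n_large = max(len(cases), len(controls))

    def resize(group):
        if mode == 'under' and len(group) > n_small:
            return rng.sample(group, n_small)   # drop individuals at random
        if mode == 'over' and len(group) < n_large:
            # re-sample with replacement until the class sizes match
            return group + [rng.choice(group) for _ in range(n_large - len(group))]
        return group

    return resize(cases), resize(controls)
```

For 50 cases and 200 controls, under-sampling yields 50 of each, whereas over-sampling yields 200 of each; as the text notes, either choice carries risks that should be weighed before analysis.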
A potentially more conservative alternative that has been proposed for analysing unbalanced data is adjusting the MDR threshold value. The threshold value defines the ratio that determines the disease risk status assigned to a particular multilocus genotype combination. Typically, this value is set to 'one'. The idea behind modifying this parameter was to correct for the chance that a multifactor combination could be assigned a classification of 'high risk' or 'low risk' simply because of the numerical dominance of one disease class in the dataset. When the threshold is adjusted, the calculation of classification error will also need to be modified to accommodate the unbalanced data, such as by using a balanced accuracy metric. Further research is being conducted to understand fully the implications of adjusting the threshold, as well as to address other potential solutions for unbalanced data, such as new fitness functions.
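Under this scheme the threshold is simply the dataset's overall case:control ratio rather than the default of one. A small sketch (assumed helper name):

```python
def adjusted_threshold(n_cases, n_controls):
    """Threshold reflecting the overall case:control ratio.

    A multilocus cell is then called 'high risk' when its own
    case:control ratio meets or exceeds this dataset-wide ratio.
    """
    return n_cases / n_controls

# Balanced data keeps the default threshold of 1.0; a dataset of
# 50 cases and 200 controls yields a threshold of 0.25.
balanced = adjusted_threshold(100, 100)
unbalanced = adjusted_threshold(50, 200)
```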

MDR configuration parameters
After dataset formatting, the next step is to establish configuration parameters for data analysis. 16 There are several parameters that must be individually established for each new dataset. A random seed (which can be any random number) must be selected for the random shuffle function used for cross-validation. Random shuffling reduces the risk of biasing cross-validation due to non-random ordering of data. This same random seed should be used in permutation testing, which will be discussed below. The next parameter is the number of loci considered. This describes the number of factors considered in each interaction model. For example, if 'loci considered' is set from '2-5', MDR will test for all two-factor, three-factor etc, up to five-factor interactions.
Currently, when dealing with missing data, MDR includes the missing data as an additional genotype level. This is not problematic when there is a small amount of missing data. If there is a large percentage of data missing, however, it can overwhelm the solutions, and MDR can model the missing data more so than the genotype data. Thus, caution should be used when a dataset has a large amount of missing data. Instead, one can use data imputation techniques in the data manipulation module of the Java MDR software. Alternative missing data solutions are currently being investigated.
The final configuration issue to consider is the cross-validation parameter. There are multiple types of cross-validation, each with its own advantages and disadvantages, from 'leave one out cross-validation' (LOOCV) to 'N'-fold cross-validation to no cross-validation. 18 LOOCV is where only one individual is left out of the training group for model validation. 'N'-fold cross-validation involves partitioning the data into 'N' groups, where one group is used for testing and the remaining groups are used in training. For MDR, ten-fold cross-validation has traditionally been used. Even though this technique is computationally intensive and the estimate of the prediction error may be biased, its smaller variance makes it well suited to the end goal of MDR, which is hypothesis generation. In the original version of MDR (and the original MDR paper), 9 the dataset had to be perfectly divisible by the cross-validation interval, typically ten. This often meant that a few individuals had to be thrown out of a dataset. Current versions of MDR do not have this restriction. Now, the dataset is divided into partitions as evenly as possible, without losing any data. Current simulations are underway to explore different types of cross-validation for evaluating power, type I error, bias and variance. Regardless of the type of cross-validation selected, it is recommended that cross-validation be used, because it has been shown to be so important in preventing over-fitting. 17

Performing MDR analysis
Using the MDR software is very straightforward after all decisions regarding configuration parameters have been made. There are a few issues that influence computation time: the number of factors considered for a model (the dimension of interaction), the number of individuals in a dataset, the number of factors/variables considered for each individual and the number of cross-validation intervals. These variables increase computation time exponentially, due to the combinatorial aspect of the algorithm.
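The combinatorial growth is easy to quantify: an exhaustive search over k-factor models must evaluate 'n choose k' combinations, repeated for each cross-validation interval. A quick illustration, with arbitrary example numbers:

```python
from math import comb

def n_models(n_factors, k):
    """Number of k-factor models in an exhaustive MDR search."""
    return comb(n_factors, k)

two_way = n_models(500, 2)    # pairwise models among 500 factors
four_way = n_models(500, 4)   # four-way models among the same factors
```

Moving from two-way to four-way interactions in a hypothetical 500-factor dataset multiplies the number of candidate models by more than 20,000, which is why higher-order exhaustive searches quickly become prohibitive.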
Current versions of MDR are constrained by the parameters discussed in the previous section, but work is in progress to expand MDR to more diverse datasets. One current development is to expand MDR to analyse data with more than two clinical endpoints, such as 'unaffected', 'mildly affected' and 'strongly affected'. The immediate relevance of such an extension can easily be seen in studies of many common medical conditions with multiple phenotypes, such as diabetes, blood pressure etc. As mentioned earlier, this modification should not be too difficult, because MDR is a contingency table method, which is a type of method often used for ordinal data.
Additionally, work is being done to expand the capability of MDR to capitalise further on pedigree data. MDR-PDT has been developed to merge the MDR algorithm with the pedigree disequilibrium test (PDT). 21 PDT was developed as a test for linkage disequilibrium. This merger will allow the application of MDR to complex pedigree data in the presence of family structure.
For large datasets with many individuals and/or loci, the combinatorial explosion involved in an exhaustive search of all multifactorial combinations exponentially extends computation time. Typically, datasets are analysed out to four- or five-way interactions. Power studies with moderately sized datasets indicate that MDR has excellent power to detect interactions up to this level, but power to detect higher-order interactions decreases. Also, the computation time required for analysis beyond this point becomes prohibitive. To try to resolve these issues and enable analysis of much higher-order interactions and much larger datasets, a parallel programming implementation of MDR is in development. Utilising parallel programming and parallel supercomputing technologies will allow analysis of larger datasets and higher-order interactions in reasonable time frames.

Permutation testing
Once a final MDR model or set of models has been chosen, permutation testing can be used to perform a hypothesis test and evaluate its statistical significance. The theory behind permutation testing is to create an empirical distribution of prediction errors that could be expected simply by chance. This distribution must be created for each individual dataset, mimicking the configuration parameters and dataset characteristics of the original MDR analysis. 22 Permutation testing has similar considerations to a typical MDR analysis. If the dataset has a balanced ratio of cases and controls, the ratio of cases and controls in the randomised datasets should also be balanced. When analysing unbalanced data, the randomised datasets must reflect the same proportions of cases and controls. In addition, all configuration parameters used in the original analysis should be identical in permutation testing. This is to ensure that the permutation test mimics the original analysis, except for the random disease status label.
Once the randomised datasets are created and configuration issues are considered, an MDR analysis is performed on all randomised datasets. After the analysis of each dataset, the best model is extracted using the same criterion that was used for the original analysis. The prediction errors of the single best model from each analysis comprise the empirical distribution. The prediction errors within the distribution are sorted in ascending order, because the lower the error, the better the model. Once the distribution is created, the final model from the original run can be evaluated. The location in the empirical distribution where the original error would fall directly translates into the p-value of the analysis. This omnibus permutation test may be a conservative method, but it is more likely to control for type I error while not limiting power. As mentioned earlier, the primary goal of MDR is hypothesis generation for future studies; however, one often wants some measure of how likely it is that the model or set of models detected by MDR would arise by chance. Permutation testing allows for the evaluation of the statistical significance of one or a few MDR models.
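The final step, reading a p-value off the empirical distribution, can be sketched as follows. The add-one correction shown is a common convention for empirical permutation tests and an assumption here, not a detail from the original papers:

```python
def permutation_p_value(observed_error, null_errors):
    """Empirical p-value for an MDR model's prediction error.

    null_errors holds the prediction error of the best model from each
    permuted (label-shuffled) dataset. Lower error is better, so the
    p-value is the fraction of permutation errors at least as extreme
    (ie as low) as the observed error.
    """
    as_extreme = sum(e <= observed_error for e in null_errors)
    return (as_extreme + 1) / (len(null_errors) + 1)

# Hypothetical distribution from four permuted datasets
p = permutation_p_value(0.35, [0.40, 0.45, 0.50, 0.55])
```

In practice many hundreds or thousands of permuted datasets are used so that small p-values can be resolved; four are shown here only to keep the example readable.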

Case series
The importance of correct MDR implementation can be illustrated using a simulated dataset. SNP data were simulated, containing a three-locus gene-gene interaction model with no main effects, as described by Moore et al. 23 The epistasis model can be shown as a multilocus penetrance function, where the table values indicate the probability of disease, given a specific multilocus genotype combination, eg p(D|AABBCC) = 0.07. This particular dataset included allele frequencies of 0.2 and 0.8 and a heritability of 1.5 per cent. The effect in the dataset was simulated using a three-locus interaction model between loci 3, 5 and 10. The model is shown in Table 1.
The dataset is unbalanced, with 200 controls and 50 cases. First, the dataset was run without any considerations for its unbalanced nature: without any manipulation of the data (ie no over- or under-sampling), without changing the threshold (leaving it at 1.0) and following all previously mentioned configuration recommendations. Single-locus through to five-locus interactions were considered in the analysis. The resulting best models for each level of interaction are listed in Table 2A. Based on the lowest prediction error and highest cross-validation consistency, the single-locus model would be chosen as the final model. The correct three-locus model was identified, but not chosen as the final best model, due to the over-representation of controls within the dataset skewing the assignment of disease risk status for each multilocus combination.
To perform permutation testing properly, the randomised datasets must reflect the proportion of cases and controls in the real dataset. Permutation testing was done correctly, reflecting the unbalanced nature of the dataset as well as all configuration parameters used in the original analysis. The permutation distribution showed that the final model revealed by MDR analysis was not statistically significant.
To demonstrate the importance of proper permutation testing, randomised datasets were created for permutation testing without consideration for the unbalanced nature of the original dataset. When permutation testing was done in this manner, the final single-locus model was found to be significant. In fact, all five candidate models (single-locus through to five-locus models) were significant. This demonstrates the challenge presented by unbalanced datasets: disease risk status in each cell can be influenced by the numerical dominance of one affection class, making detection of a true signal difficult.
As mentioned previously, altering the threshold value has been suggested to deal with this challenge. To demonstrate the effect of altering the threshold value, the data were run again, but this time adjusting the threshold to reflect the proportions seen within the data. Because there were 50 cases and 200 controls, the threshold was set to 0.25, instead of 1.0. As discussed earlier, this produces unpredictable results (also shown in Table 2B). The final model chosen from this run of MDR would be the two-locus model, as it has the minimum prediction error and parsimony. However, it does not include even one of the three actual disease loci. Adjusting the threshold gave rise to an even worse performance than was seen in the original MDR run: the correct model was not identified even as the best three-locus model. Proper permutation testing, using the adjusted threshold value, revealed that the final model was not statistically significant. Using an alternative fitness metric to accommodate the unbalanced nature of the data, however, can improve this procedure. Balanced accuracy (or 1 - balanced classification error) takes into account the ratio of cases to controls in the dataset. This metric is calculated as (sensitivity + specificity)/2, so the balanced classification error is 1 - ((sensitivity + specificity)/2). The results of the MDR analysis using a threshold of 0.25 and balanced accuracy as the fitness metric are shown in Table 2C. Here, the best model is the three-locus model. As the number of loci in the candidate model increases, the misclassification error always decreases, demonstrating over-fitting of the data, as shown in this analysis. This sample dataset demonstrates the importance of proper implementation of the MDR method. A simulated dataset was used so that the correct model was known and the deleterious effect of improper implementation could be readily apparent. These phenomena are also observed during the analysis of real data.
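Balanced accuracy, as used above, weights sensitivity (accuracy on the cases) and specificity (accuracy on the controls) equally, regardless of the case:control ratio. A minimal sketch with hypothetical counts:

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Return (sensitivity + specificity) / 2.

    tp/fn are counts among the cases, tn/fp among the controls, so
    each class contributes equally however unbalanced the dataset.
    The balanced classification error is one minus this value.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Hypothetical counts for a dataset of 50 cases and 200 controls:
# 40 of 50 cases and 150 of 200 controls correctly classified
acc = balanced_accuracy(tp=40, tn=150, fp=50, fn=10)
```

With these counts, sensitivity is 0.80 and specificity is 0.75, giving a balanced accuracy of 0.775; plain accuracy (190/250 = 0.76) would instead be dominated by the 200 controls.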

Conclusions
MDR is a novel and powerful statistical tool for detecting and modelling epistasis in the study of human disease and pharmacogenomics. In making this method more available and acceptable in the scientific community, it is important that the guidelines for use are well understood.
These guidelines must also be understood when comparing MDR with other, more traditional methods, such as logistic regression or classification and regression trees. To evaluate multiple methods accurately, the parameters defined for each method must be comparable. The range of loci interactions considered must be identical, along with cross-validation splits and permutation parameters.
Building on the success that MDR has already had, many of the performance features of the method are currently being studied. More extensive power studies are being performed to estimate the power of MDR in datasets with different sample sizes, effect sizes, numbers of factors and levels of noise attached to the true model. Additionally, other levels of N-fold cross-validation are being explored for their influence on power and computation time. In light of the problems that can arise from over- and under-sampling, new fitness metrics are being explored to handle the problem of unbalanced data. The dissection of all performance features of MDR is a priority of future research.
The detection, characterisation and interpretation of gene-gene and gene-environment interactions are expected to improve the diagnosis, prevention and treatment of common human diseases. MDR can be a powerful tool in reaching these goals when used appropriately.