Bioinformatics methods for identifying candidate disease genes

With the explosion in genomic and functional genomics information, methods for disease gene identification are rapidly evolving. Databases are now essential to the process of selecting candidate disease genes. Combining positional information with disease characteristics and functional information is the usual strategy by which candidate disease genes are selected. Enrichment for candidate disease genes, however, depends on the skills of the operating researcher. Over the past few years, a number of bioinformatics methods that enrich for the most likely candidate disease genes have been developed. Such in silico prioritisation methods may further improve by completion of datasets, by development of standardised ontologies across databases and species and, ultimately, by the integration of different strategies.


Introduction
Currently, with the increase in accessible data and the development of novel molecular biology techniques, new methods for the identification of disease genes are evolving. Linkage studies and mutation screening are becoming easier and the number of identified (disease) genes is increasing rapidly. The year 2003 saw the completion of the human genome sequence, and the number of genes is now estimated at 20,000-25,000. 1,2 With all the genetics technology in place, identification of disease-related mutations in Mendelian single-gene disorders mainly depends on having the right patients and families. The genetic analysis of complex diseases still remains a difficult task, however, and most genes for multifactorial disease remain to be discovered.
Genetic mapping by linkage is a mainstay of human genetics research. While positional information reduces the number of genes that are candidates for causing the disease, this reduction is often not sufficient for rapid disease gene identification.
The aim of candidate gene prioritisation methods is to choose those genes for detailed mutation analysis that are most likely to be the cause of the disease. This is especially relevant since positional methods may leave up to 100 different genes as candidates. Hence, additional information to be used for prioritisation is essential.
Databases have become a core source for today's gene hunters. Retrieval systems such as the National Center for Biotechnology Information's Entrez, 3 the Sequence Retrieval System 4 and Maarten's Retrieval System 5 provide easy and fast access to a collection of frequently used databases. The main focus of these retrieval systems is to fetch a set of database entries that meet the user query. Identification of a disease gene is most likely to be successful when positional and functional routes are integrated. Integration of data based on genomic context, such as in the University of California, Santa Cruz genome browser and Ensembl, 6,7 resulted in step-by-step interfaces (eg EnsMart 8 ) which extract data based on chromosomal position, gene expression 9 and gene ontology (GO). 10 Enrichment for disease candidate genes using these database interfaces, however, depends heavily on the operating skills of the researcher. Alternative methods have been developed to explore datasets systematically for the most likely candidate disease genes. This paper presents an overview of such methods and their accessibility.

Candidate disease gene identification methods
The methods developed for candidate disease gene identification use different data sources and strategies.
Perez-Iratxeta et al. developed the Genes2Diseases (G2D) method. 11 This searches Medline abstracts for MeSH-C (phenotypic features) and MeSH-D (chemical objects) terms. 12,13 Co-occurrence of MeSH-C/D terms in the Medline abstracts was considered to be related to the association between the chemical and the phenotypic terms. Sequences in the RefSeq database 14 are used to associate MeSH-D terms from the sequence Medline links with the GO functional annotation of the same sequence. 10 This creates the possibility of associating phenotype terms with functional terms via the chemical terms. Literature on a disease can be screened for MeSH-C terms, which are then used to determine the association between the disease and the genes with the GO annotation. The system was tested on 450 diseases that had previously been mapped to a specific locus but without a particular gene assigned. The resulting scores were compared with 100 diseases for which the gene was known. 11 On average, Perez-Iratxeta et al. tested 30-megabase candidate regions. Assuming 20,000-25,000 human genes, 1,2 and an average gene density of one gene per 120 kilobases, an 8-31-fold enrichment was calculated for this method. The G2D method was recently extended with expressed sequence tag data. The system for phenotype input was also improved, which reduces the prior clinical knowledge required to be entered. This new version of G2D performs better, mostly because more databases are used with larger datasets. Both versions of the method are available online (http://www.bork.embl-heidelberg.de/g2d/; http://www.ogic.ca/projects/g2d_2/).
Whereas G2D uses information extracted from the medical literature, POCUS, developed by Turner et al., uses functional annotation, domain annotation and gene expression data to prioritise candidate disease genes. 15 The method assumes that genes involved in the same disease will share GO annotation, protein domains and a similar expression pattern. A scoring system that includes these sources allows one to rank genes in the candidate disease regions. POCUS seeks over-representation of functional annotation between loci for the same disease. Larger candidate regions are a priori more likely to share similarities and are thus less likely to generate gene connections that are statistically significant. The method was tested with 29 diseases and achieved an enrichment of 12-42-fold. The method cannot be used online, but the POCUS scripts can be downloaded.
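The fold-enrichment figures quoted for these methods can be illustrated with a small calculation. The sketch below assumes that fold enrichment means the number of genes in the candidate region divided by the number of genes left on the prioritised shortlist, with the true disease gene retained. That definition, and the function names, are illustrative assumptions and are not taken from the cited papers. Only the 30-megabase region size and the density of one gene per 120 kilobases come from the text.

```python
def genes_in_region(region_bp: int, bp_per_gene: int = 120_000) -> int:
    """Estimate the gene count of a candidate region from an average gene density."""
    return round(region_bp / bp_per_gene)


def fold_enrichment(n_region_genes: int, n_shortlisted: int) -> float:
    """Assumed definition: region genes divided by shortlisted genes.

    Only meaningful when the true disease gene is still on the shortlist.
    """
    if n_shortlisted <= 0:
        raise ValueError("the shortlist must contain at least one gene")
    return n_region_genes / n_shortlisted


# A 30-megabase region at one gene per 120 kilobases holds about 250 genes.
region_genes = genes_in_region(30_000_000)

# Hypothetical shortlist size, chosen only to illustrate the arithmetic.
print(region_genes, fold_enrichment(region_genes, 10))
```

Under this assumed definition, a larger candidate region raises the attainable enrichment simply because there are more genes to exclude, so enrichment values from tests on different region sizes are not directly comparable.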
Specific gene characteristics have been used in candidate disease gene identification. Sequence analysis of human/eukaryotic genes showed that human proteins with multiple long amino acid runs are more often linked with genetic disease than are shorter proteins. 16 Lopez-Bigas et al. found that proteins involved in genetic diseases tend to be long, conserved and without close paralogues. 17 Disease genes are more frequently found to be conserved in other species, but this might be due to the preferential sequencing of known (disease) genes. The disease gene prediction system using these sequence characteristics can be accessed online (http://cgg.ebi.ac.uk/services/dgp/).
Similarly, Adie et al. tested sequence property analysis using alternating decision trees. 18 They found differences between random genes and disease genes based on a number of features, including: gene/cDNA/protein/3′ untranslated region length, the number of exons, distance to the adjacent gene, higher level of conservation in the mouse, signal peptide encoding and 5′ CpG islands. Tests for candidate gene identification showed 2-25-fold enrichment. Data can be accessed using the PROSPECTR web server (http://www.genetics.med.ed.ac.uk/prospectr/). The user can rank genes for their likelihood to be involved in a disease, either from a list or a genomic region. The method was recently extended with GO terms, InterPro domains and gene expression data. The SUSPECTS web server uses PROSPECTR, and allows one to rank genes for their likelihood of involvement in the disease of interest (http://www.genetics.med.ed.ac.uk/suspects/). Smith et al. used a comparable analysis, which found similar differences between disease and non-disease genes. Using discriminant analysis, they showed that these differences may help to predict human disease genes; 19 however, their method is not accessible online.
It is possible to interrogate systematically the multitude of gene and protein expression data that are produced by methods such as RNA expression microarray analysis and SAGE. For example, Tiffin et al. developed a method which uses an anatomical ontology (eVOC) 9 to integrate biomedical literature and human gene expression data. 20 The method uses eVOC as a controlled vocabulary to find anatomy terms specific for the disease in PubMed abstracts. The anatomy terms are ranked and the candidate genes are selected using the highest ranked terms. The selected candidate genes have a gene expression pattern that matches the disease-associated/affected tissues. The enrichment reached is 1.5-3.0-fold and the correct gene was found in more than 85 per cent of the cases. Data and scripts are available, but there is no web interface (http://www.sanbi.ac.za/tiffin_et_al/).
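The two steps described above (rank anatomy terms found in disease abstracts, then keep the genes expressed in the top-ranked tissues) can be sketched as follows. This is a deliberately simplified illustration and not the published implementation. Ranking terms by the number of abstracts that mention them is an assumption, and all function names, gene names and data are hypothetical.

```python
from collections import Counter


def rank_anatomy_terms(abstracts, vocabulary):
    """Rank controlled-vocabulary terms by the number of abstracts mentioning them."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        for term in vocabulary:
            if term in lowered:
                counts[term] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    return [term for term, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def select_candidates(gene_expression, top_terms):
    """Keep genes whose expression annotation includes any top-ranked anatomy term."""
    wanted = set(top_terms)
    return sorted(gene for gene, tissues in gene_expression.items() if wanted & set(tissues))


# Toy data, invented for illustration only.
abstracts = [
    "Retina degeneration observed in patients",
    "Lesions of the retina and the kidney",
    "Retina imaging in affected families",
]
vocabulary = ["retina", "kidney", "liver"]
expression = {
    "GENE_A": ["retina"],
    "GENE_B": ["liver"],
    "GENE_C": ["kidney", "retina"],
}

ranked = rank_anatomy_terms(abstracts, vocabulary)
print(ranked, select_candidates(expression, ranked[:1]))
```

Simple substring matching is used here for brevity. A real controlled vocabulary would also need synonym handling and term boundaries.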
The link between the tissues and organs that are affected by a genetic disease and candidate gene expression profiles has been exploited. 21,22 GeneSeeker uses human as well as mouse expression and phenotypic data stored in various databases (http://www.cmbi.ru.nl/GeneSeeker/). This information is combined with positional data for the genes from both species. The system uses different online databases rather than local data and thus mines in real time. The GeneSeeker approach differs from the other candidate prioritisation approaches by utilising cross-species data. Knowledge of model organisms makes comparative candidate gene selection possible. This situation applies when a gene is known to cause a similar phenotype in another species. Nonetheless, a direct comparison between the phenotypes in humans and model organisms can be complicated because of differences in anatomy. Transfer of knowledge by phenotype is most straightforward in other mammalian species, such as the mouse, that are evolutionarily close to humans. An example is the disease gene identification in ectrodactyly-ectodermal dysplasia-clefting syndrome. This human disease gene was identified by a comparable phenotype in homozygous null mice. 23 Enrichment of candidate disease genes was achieved using GeneSeeker on a test set of ten diseases. 21 Recently, Bradford et al. presented a cross-species search system. 24 The human-mouse gene searcher enables the user to search with phenotype data from human and mouse and links this to the Mouse Gene Expression Database. 25 This tool can assist in the human-mouse phenotype mapping process. It has its own merits, and can also be implemented in GeneSeeker.
A number of groups have started to use clinical overlap between genetic diseases to cluster phenotypes, thereby allowing correlations with the functional classification of their underlying disease genes. 26 Such phenotype relationships might be a powerful method for function prediction. 27-29 The human phenotype collection and the underlying gene-phenotype relations can therefore be used as a tool for functional genomics. 30 Freudenberg and Propping developed a method for clustering genetic diseases based on their phenotype similarity. 31 They manually attributed the disease phenotypic manifestations using the Online Mendelian Inheritance in Man (OMIM) database. 32 A similarity measure was developed to compare the phenotypes and to perform a complete linkage clustering. The approach was tested with a leave-one-out cross-validation of 878 diseases from the OMIM database, using 10,672 candidate genes from the human genome. They achieved an enrichment of 7-33-fold. Unfortunately, the method is not available for other users.
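Complete linkage clustering of diseases by phenotype similarity can be sketched as below. The Jaccard distance between sets of phenotypic features is used as a stand-in similarity measure, which is an assumption made for illustration. The measure actually developed by Freudenberg and Propping is not described here. The disease names, features and distance threshold are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard index of two feature sets (0 means identical, 1 disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def complete_linkage(features, max_distance):
    """Agglomerative clustering; cluster distance is the LARGEST pairwise distance."""
    clusters = [[name] for name in sorted(features)]

    def dist(c1, c2):
        return max(jaccard_distance(features[x], features[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the complete-linkage criterion.
        (i, j), best = min(
            (((i, j), dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > max_distance:
            break  # no remaining pair is similar enough to merge
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


# Toy phenotype annotations, invented for illustration only.
phenotypes = {
    "D1": {"deafness", "retinitis"},
    "D2": {"deafness", "retinitis", "ataxia"},
    "D3": {"polydactyly", "obesity"},
    "D4": {"polydactyly", "obesity", "retinitis"},
}
print(complete_linkage(phenotypes, max_distance=0.5))
```

Complete linkage merges two clusters only if every pair of members is within the threshold, which tends to give compact groups of mutually similar diseases. Genes known for one member of a cluster can then be treated as functional leads for the others.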
Similarly, Cantor et al. clustered OMIM 32 records based on the clinical synopsis section. 33 They reduced the disease characteristics to 50 categories. In a test of two diseases, they found relationships at the genotype level. Since the authors only intended to establish proof of principle on using OMIM for phenotype clustering, they did not systematically analyse phenotype-genotype relationships, and their system cannot be accessed directly.
Masseroli et al. developed the GFINDer system. This web tool allows one to mine the annotation information from several databases for a given set of sequence identifiers. 34 Filter parameters are set manually in the system to select disease genes, and statistical analysis can be performed. Recently, the clinical synopsis of OMIM was integrated into the GFINDer system (http://www.bioinformatics.polimi.it/GFINDer/). Phenotype data were normalised and structured in order to obtain two controlled vocabularies suitable for computational purposes. The absence of a predefined strategy makes the efficiency of the system heavily dependent on the operating researcher. The authors presented only a few selected examples of their method, which makes it difficult to estimate the enrichment. 34
Van Driel et al. devised a method for comparing phenotypes derived from the OMIM database that uses a textual similarity measure by an automated full-text analysis technique, rather than predefined term classes, and analysed the phenotype-genotype relationships. 35 They found that phenotype similarity scores, which are based on automatic quantification, correlate positively with a number of measures of gene function, including protein sequence, similarity of shared protein motifs, functional annotation and direct protein-protein interaction. The data support the idea that phenotypic relationships may be used as indicators of biological and functional interactions at the gene and protein levels. The phenotype-phenotype ranking scores can be searched online (http://www.cmbi.ru.nl/MimMiner/). The method can be used to study the phenotypic relationships at the genotype level, by which the phenotype becomes a tool for functional genomics. 30 The aim was not to enrich a specific region for candidate genes, but the data can be used for this purpose.
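A textual similarity measure between free-text phenotype descriptions can be illustrated with a bag-of-words cosine similarity. This is a generic stand-in chosen for the sketch, not the measure used by van Driel et al. The record identifiers and descriptions below are invented.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase alphabetic tokens in a phenotype description."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_similar(query_id, records):
    """Return the other record identifiers, most similar phenotype text first."""
    query = term_vector(records[query_id])
    scores = {
        rid: cosine_similarity(query, term_vector(text))
        for rid, text in records.items()
        if rid != query_id
    }
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


# Toy phenotype descriptions, invented for illustration only.
records = {
    "A": "hearing loss and retinal degeneration",
    "B": "retinal degeneration with hearing loss",
    "C": "short stature and polydactyly",
}
print(rank_similar("A", records))
```

In practice, common words such as "and" would be removed or down-weighted so that only clinically informative terms contribute to the score.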

Future: Integration and standardisation
The various methods for identifying candidate disease genes in humans cover different concepts. They use functional and literature data, gene-specific characteristics, anatomy-based gene/protein expression data or phenotype comparison analyses. In light of the comparable enrichment levels achieved with the different methods, it is likely that they can complement each other.
The results discussed here suggest that the phenotype is a powerful source for revealing biological function and that special attention is needed for the standardisation of the description of phenotypes. 30 Various approaches to a more systematic description of phenotype data have been proposed and await further development. 36,37 Essential to the improvement of the candidate disease gene identification methods will be the establishment of standard vocabularies that can be used across databases and species. A further challenge will be to develop, refine and integrate these methods into a system that aids in elucidation and understanding of the mechanisms of (complex) disease.