Bioinformatics methods for identifying candidate disease genes
© Henry Stewart Publications 2006
Received: 28 December 2005
Accepted: 28 December 2005
Published: 1 June 2006
With the explosion in genomic and functional genomics information, methods for disease gene identification are rapidly evolving. Databases are now essential to the process of selecting candidate disease genes. Combining positional information with disease characteristics and functional information is the usual strategy by which candidate disease genes are selected. Enrichment for candidate disease genes, however, depends on the skills of the operating researcher. Over the past few years, a number of bioinformatics methods that enrich for the most likely candidate disease genes have been developed. Such in silico prioritisation methods may further improve by completion of datasets, by development of standardised ontologies across databases and species and, ultimately, by the integration of different strategies.
Currently, with the increase in accessible data and the development of novel molecular biology techniques, new methods for the identification of disease genes are evolving. Linkage studies and mutation screening are becoming easier and the number of identified (disease) genes is increasing rapidly. The human genome sequence was completed in 2003, and the number of genes is now estimated at 20,000-25,000 [1, 2]. With all the genetics technology in place, identification of disease-related mutations in Mendelian single-gene disorders mainly depends on having the right patients and families. The genetic analysis of complex diseases remains a difficult task, however, and most genes for multifactorial disease remain to be discovered.
Genetic mapping by linkage is a mainstay of human genetics research. While positional information reduces the number of genes that are candidates for causing the disease, this reduction is often not sufficient for rapid disease gene identification.
The aim of candidate gene prioritisation methods is to choose, for detailed mutation analysis, those genes that are most likely to be the cause of the disease. This is especially relevant since positional methods may leave up to 100 different genes as candidates. Additional information for prioritisation is therefore essential.
Databases have become a core source for today's gene hunters. Retrieval systems such as the National Center for Biotechnology Information's Entrez, the Sequence Retrieval System and Maarten's Retrieval System provide easy and fast access to a collection of frequently used databases. The main focus of these retrieval systems is to fetch a set of database entries that meet the user query. Identification of a disease gene is most likely to be successful when positional and functional routes are integrated. Integration of data based on genomic context, such as in the University of California, Santa Cruz genome browser and Ensembl [6, 7], resulted in step-by-step interfaces (eg EnsMart) which extract data based on chromosomal position, gene expression and gene ontology (GO). Enrichment for disease candidate genes using these database interfaces, however, depends heavily on the operating skills of the researcher. Alternative methods have been developed to explore datasets systematically for the most likely candidate disease genes. This paper presents an overview of such methods and their accessibility.
Candidate disease gene identification methods
The methods developed for candidate disease gene identification use different data sources and strategies.
Perez-Iratxeta et al. developed the Genes2Diseases (G2D) method. This searches Medline abstracts for MeSH-C (phenotypic features) and MeSH-D (chemical objects) terms [12, 13]. Co-occurrence of MeSH-C and MeSH-D terms in the Medline abstracts is taken to indicate an association between the chemical and the phenotypic terms. Sequences in the RefSeq database are used to associate MeSH-D terms from the Medline links of a sequence with the GO functional annotation of the same sequence. This creates the possibility of associating phenotype terms with functional terms via the chemical terms. Literature on a disease can be screened for MeSH-C terms, which are then used to determine the association between the disease and the genes with the GO annotation. The system was tested on 450 diseases that had previously been mapped to a specific locus but without a particular gene assigned. The resulting scores were compared with 100 diseases for which the gene was known. On average, Perez-Iratxeta et al. tested 30-megabase candidate regions. Assuming 20,000-25,000 human genes [1, 2], and an average gene density of one gene per 120 kilobases, an 8-31-fold enrichment was calculated for this method. The G2D method was recently extended with expressed sequence tag data. The system for phenotype input was also improved, which reduces the prior clinical knowledge required to be entered. This new version of G2D performs better, mostly because more databases are used with larger datasets. Both versions of the method are available online (http://www.bork.embl-heidelberg.de/g2d/; http://www.ogic.ca/projects/g2d_2/).
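The arithmetic behind enrichment figures of this kind can be sketched as follows. This is a minimal illustration and not the authors' own calculation: the 30-megabase region and the density of one gene per 120 kilobases come from the text, whereas the definition used in `fold_enrichment` (fraction of true disease genes retained divided by fraction of candidates retained) and the shortlist of 25 genes are assumptions made for the example.

```python
def genes_in_region(region_bp: int, bp_per_gene: int = 120_000) -> int:
    """Expected number of genes in a region, assuming a uniform gene density."""
    return round(region_bp / bp_per_gene)


def fold_enrichment(n_candidates: int, n_shortlisted: int, recall: float = 1.0) -> float:
    """Assumed definition: fraction of true genes kept / fraction of candidates kept."""
    if n_candidates <= 0 or n_shortlisted <= 0:
        raise ValueError("gene counts must be positive")
    return recall / (n_shortlisted / n_candidates)


# A 30-megabase candidate region at one gene per 120 kilobases.
region_genes = genes_in_region(30_000_000)
print(region_genes)  # 250

# Hypothetical shortlist of 25 genes that still contains the disease gene.
print(fold_enrichment(region_genes, 25))
```

Under this assumed definition, a method that misses the true gene in some test cases has a recall below 1, which lowers the enrichment proportionally.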
Whereas G2D uses information extracted from the medical literature, POCUS, developed by Turner et al., uses functional, domain annotation and gene expression data to prioritise candidate disease genes. The method assumes that genes involved in the same disease will share GO annotation, protein domains and a similar expression pattern. A scoring system that includes these sources allows one to rank genes in the candidate disease regions. POCUS seeks over-representation of functional annotation between loci for the same disease. Larger candidate regions are a priori more likely to share similarities and are thus less likely to generate gene connections that are statistically significant. The method was tested with 29 diseases and achieved an enrichment of 12-42-fold. The method cannot be used online, but the POCUS scripts can be downloaded.
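The idea of looking for annotation shared between loci can be illustrated with a small sketch. It is not the POCUS scoring scheme, which also accounts for how common an annotation is and for region size; here a gene simply scores one point for each of its annotations that recurs in a different locus. All locus names, gene names and annotation labels are hypothetical.

```python
from collections import defaultdict


def shared_annotation_scores(loci):
    """Score each gene by how many of its annotations recur in another locus.

    loci: dict mapping locus name -> dict of gene name -> set of annotation labels.
    """
    # Record which loci each annotation label appears in.
    loci_per_term = defaultdict(set)
    for locus, genes in loci.items():
        for terms in genes.values():
            for term in terms:
                loci_per_term[term].add(locus)

    scores = {}
    for genes in loci.values():
        for gene, terms in genes.items():
            # A term seen in more than one locus is shared with another locus.
            scores[gene] = sum(1 for term in terms if len(loci_per_term[term]) > 1)
    return scores


# Hypothetical loci for one disease.
example_loci = {
    "locus_A": {"geneA1": {"kinase_activity", "sh3_domain"}, "geneA2": {"ion_transport"}},
    "locus_B": {"geneB1": {"kinase_activity"}, "geneB2": {"dna_binding"}},
}
print(shared_annotation_scores(example_loci))
```

In this toy input only `geneA1` and `geneB1` share an annotation across loci, so they are the only genes with a non-zero score.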
Specific gene characteristics have been used in candidate disease gene identification. Sequence analysis of human/eukaryotic genes showed that human proteins with multiple long amino acid runs are more often linked with genetic disease than are shorter proteins. Lopez-Bigas et al. found that proteins involved in genetic diseases tend to be long, conserved and without close paralogues. Disease genes are more frequently found to be conserved in other species, but this might be due to the preferential sequencing of known (disease) genes. The disease gene prediction system using these sequence characteristics can be accessed online (http://cgg.ebi.ac.uk/services/dgp/).
Similarly, Adie et al. tested sequence property analysis using alternating decision trees. They found differences between random genes and disease genes based on a number of features, including: gene/cDNA/protein/3' untranslated region length, the number of exons, distance to the adjacent gene, higher level of conservation in the mouse, signal peptide encoding and 5' CpG islands. Tests for candidate gene identification showed 2-25-fold enrichment. Data can be accessed using the PROSPECTR web server (http://www.genetics.med.ed.ac.uk/prospectr/). The user can rank genes for their likelihood to be involved in a disease, either from a list or a genomic region. The method was recently extended with GO terms, Interpro domains and gene expression data. The SUSPECTS web server uses PROSPECTR, and allows one to rank genes for their likelihood of involvement in the disease of interest (http://www.genetics.med.ed.ac.uk/suspects/). Smith et al. used a comparable analysis, which found similar differences between disease and non-disease genes. Using discriminant analysis, they showed that these differences may help to predict human disease genes; however, their method is not accessible online.
It is possible to interrogate systematically the multitude of gene and protein expression data that are produced by methods such as RNA expression microarray analysis and SAGE. For example, Tiffin et al. developed a method which uses an anatomical ontology (eVOC) to integrate biomedical literature and human gene expression data. The method uses eVOC as a controlled vocabulary to find anatomy terms specific for the disease in PubMed abstracts. The anatomy terms are ranked and the candidate genes are selected using the highest ranked terms. The selected candidate genes have a gene expression pattern that matches the disease-associated or affected tissues. The enrichment reached is 1.5-3.0-fold and the correct gene was found in more than 85 per cent of the cases. Data and scripts are available, but there is no web interface (http://www.sanbi.ac.za/tiffin_et_al/).
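The two steps described above, ranking anatomy terms from the literature and then selecting genes expressed in the top-ranked tissues, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: terms are matched by plain substring search and ranked by the number of abstracts mentioning them, and the abstracts, vocabulary and gene names are all hypothetical.

```python
from collections import Counter


def rank_anatomy_terms(abstracts, vocabulary):
    """Rank vocabulary terms by the number of abstracts that mention them."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        for term in vocabulary:
            if term in lowered:  # crude substring match, an assumption of this sketch
                counts[term] += 1
    return [term for term, _ in counts.most_common()]


def select_candidates(gene_expression, ranked_terms, top_n=1):
    """Keep genes expressed in at least one of the top-ranked anatomy terms.

    gene_expression: dict mapping gene name -> set of anatomy terms.
    """
    top_terms = set(ranked_terms[:top_n])
    return sorted(gene for gene, tissues in gene_expression.items() if tissues & top_terms)


# Hypothetical abstracts about one disease and a toy controlled vocabulary.
abstracts = [
    "Progressive degeneration of the retina in affected families",
    "Defects of the retina and the heart were reported",
    "Retina pathology precedes visual loss",
]
ranked = rank_anatomy_terms(abstracts, ["retina", "heart", "liver"])
expression = {"gene1": {"retina"}, "gene2": {"liver"}, "gene3": {"heart", "liver"}}
print(ranked)
print(select_candidates(expression, ranked, top_n=1))
```

Raising `top_n` widens the set of tissues considered and therefore lengthens the candidate list, which trades enrichment against the chance of missing the correct gene.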
The link between the tissues and organs that are affected by a genetic disease and candidate gene expression profiles has been exploited [21, 22]. GeneSeeker uses human as well as mouse expression and phenotypic data stored in various databases (http://www.cmbi.ru.nl/GeneSeeker/). This information is combined with positional data for the genes from both species. The system uses different online databases rather than local data and thus mines in real time. The GeneSeeker approach differs from the other candidate prioritisation approaches by utilising cross-species data. Knowledge of model organisms makes comparative candidate gene selection possible. This situation applies when a gene is known to cause a similar phenotype in another species. Nonetheless, a direct comparison between the phenotypes in humans and model organisms can be complicated because of differences in anatomy. Transfer of knowledge by phenotype is most straightforward in other mammalian species, such as the mouse, that are evolutionarily close to humans. An example is the disease gene identification in ectrodactyly-ectodermal dysplasia-clefting syndrome. This human disease gene was identified by a comparable phenotype in homozygous null mice. A 7-25-fold enrichment of candidate disease genes was achieved using GeneSeeker on a test set of ten diseases. Recently, Bradford et al. presented a cross-species search system. The human-mouse gene searcher enables the user to search with phenotype data from human and mouse and links this to the Mouse Gene Expression Database. This tool can assist in the human-mouse phenotype mapping process. It has its own merits, and can also be implemented in GeneSeeker.
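The combination of positional candidates with human and mouse evidence can be pictured as a set operation. The sketch below is an assumption-laden simplification and does not reproduce how GeneSeeker queries its remote databases: a positional candidate is kept if it has matching human expression or phenotype data, or if its mouse orthologue does. Gene identifiers and the orthologue table are hypothetical.

```python
def cross_species_filter(positional, human_hits, mouse_hits, orthologues):
    """Keep positional candidates supported by human data or by a mouse orthologue.

    positional:  set of human genes in the candidate region
    human_hits:  set of human genes matching the expression/phenotype query
    mouse_hits:  set of mouse genes matching the same query
    orthologues: dict mapping human gene -> mouse orthologue
    """
    # Human genes whose mouse orthologue matched the query.
    mouse_supported = {
        human for human, mouse in orthologues.items() if mouse in mouse_hits
    }
    return sorted(positional & (human_hits | mouse_supported))


# Hypothetical example: G2 has human evidence, G3 only mouse evidence.
candidates = cross_species_filter(
    positional={"G1", "G2", "G3"},
    human_hits={"G2", "G9"},
    mouse_hits={"m3"},
    orthologues={"G1": "m1", "G3": "m3"},
)
print(candidates)
```

The gene `G3` is retained purely on the strength of its mouse orthologue, which is the situation the cross-species approach is designed to capture.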
A number of groups have started to use clinical overlap between genetic diseases to cluster phenotypes, thereby allowing correlations with the functional classification of their underlying disease genes. Such phenotype relationships might be a powerful method for function prediction [27–29]. The human phenotype collection and the underlying gene-phenotype relations can therefore be used as a tool for functional genomics.
Freudenberg and Propping developed a method for clustering genetic diseases based on their phenotype similarity. They manually attributed the disease phenotypic manifestations using the Online Mendelian Inheritance in Man (OMIM) database. A similarity measure was developed to compare the phenotypes and to perform a complete linkage clustering. The approach was tested with a leave-one-out cross-validation of 878 diseases from the OMIM database, using 10,672 candidate genes from the human genome. They achieved an enrichment of 7-33-fold. Unfortunately, the method is not available for other users.
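What complete linkage means for phenotype clusters can be shown with a short sketch. The similarity measure used by Freudenberg and Propping is not given here, so the Jaccard index on sets of phenotypic features is swapped in as an assumption; the disease names and features are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of phenotypic features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def complete_linkage_similarity(cluster_a, cluster_b, features):
    """Under complete linkage, two clusters are as similar as their least similar pair."""
    return min(
        jaccard(features[x], features[y]) for x in cluster_a for y in cluster_b
    )


# Hypothetical diseases described by sets of phenotypic features.
features = {
    "disease1": {"deafness", "myopia"},
    "disease2": {"deafness", "myopia", "cleft_palate"},
    "disease3": {"cleft_palate", "short_stature"},
}
print(jaccard(features["disease1"], features["disease2"]))
print(complete_linkage_similarity(["disease1"], ["disease2", "disease3"], features))
```

Because the least similar pair decides the score, a single dissimilar member (here `disease3`) keeps two clusters apart, which tends to produce compact phenotype groups.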
Similarly, Cantor et al. clustered OMIM records based on the clinical synopsis section. They reduced the disease characteristics to 50 categories. In a test of two diseases, they found relationships at the genotype level. Since the authors only intended to establish proof of principle on using OMIM for phenotype clustering, they did not systematically analyse phenotype-genotype relationships, and their system cannot be accessed directly.
Masseroli et al. developed the GFINDer system. This web tool allows one to mine the annotation information from several databases for a given set of sequence identifiers. Filter parameters are set manually in the system to select disease genes, and statistical analysis can be performed. Recently, the clinical synopsis of OMIM was integrated into the GFINDer system (http://www.bioinformatics.polimi.it/GFINDer/). Phenotype data were normalised and structured in order to obtain two controlled vocabularies suitable for computational purposes. The absence of a predefined strategy makes the efficiency of the system heavily dependent on the operating researcher. The authors presented only a few selected examples of their method, which makes it difficult to estimate the enrichment.
van Driel et al. devised a method for comparing phenotypes derived from the OMIM database that uses a textual similarity measure by an automated full text-analysis technique, rather than predefined term classes, and analysed the phenotype-genotype relationships. They found that phenotype similarity scores, which are based on automatic quantification, correlate positively with a number of measures of gene function, including protein sequence similarity, shared protein motifs, functional annotation and direct protein-protein interaction. The data support the idea that phenotypic relationships may be used as indicators of biological and functional interactions at the gene and protein levels. The phenotype-phenotype ranking scores can be searched online (http://www.cmbi.ru.nl/MimMiner/). The method can be used to study the phenotypic relationships at the genotype level, by which the phenotype becomes a tool for functional genomics. The aim was not to enrich a specific region for candidate genes, but the data can be used for this purpose.
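A textual similarity measure of this general kind can be sketched with a bag-of-words cosine similarity. This is a generic illustration and may differ from the measure the authors actually used, for example in how terms are selected and weighted; the two phenotype descriptions are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn a free-text phenotype description into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical phenotype descriptions for two disease records.
record_1 = term_vector("cleft palate hearing loss")
record_2 = term_vector("hearing loss myopia")
print(cosine_similarity(record_1, record_2))
```

Scores computed in this way for every pair of records give a phenotype-phenotype ranking, which can then be compared with measures of relatedness between the underlying genes.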
Future: Integration and standardisation
The various methods for identifying candidate disease genes in humans cover different concepts. They use functional and literature data, gene-specific characteristics, anatomy-based gene/protein expression data or phenotype comparison analyses. In light of the comparable enrichment levels achieved with the different methods, it is likely that they can complement each other.
The results discussed here suggest that the phenotype is a powerful source for revealing biological function and that special attention is needed for the standardisation of the description of phenotypes. Various approaches to a more systematic description of phenotype data have been proposed and await further development [36, 37]. Essential to the improvement of the candidate disease gene identification methods will be the establishment of standard vocabularies that can be used across databases and species. A further challenge will be to develop, refine and integrate these methods into a system that aids in the elucidation and understanding of the mechanisms of (complex) disease.
- International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.
- Larsson TP, Murray CG, Hill T, et al: Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery. FEBS Lett. 2005, 579: 690-698. 10.1016/j.febslet.2004.12.046.
- Schuler GD, Epstein JA, Ohkawa H, et al: Entrez: Molecular biology database and retrieval system. Methods Enzymol. 1996, 266: 141-162.
- Etzold T, Argos P: SRS -- An indexing and retrieval tool for flat file data libraries. Comput Appl Biosci. 1993, 9: 49-57.
- Hekkelman ML, Vriend G: MRS: A fast and compact retrieval system for biological data. Nucleic Acids Res. 2005, 33: W766-W769. 10.1093/nar/gki422.
- Kent WJ, Sugnet CW, Furey TS, et al: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.
- Hubbard T, Andrews D, Caccamo M, et al: Ensembl 2005. Nucleic Acids Res. 2005, 33: D447-D453. 10.1093/nar/gki378.
- Kasprzyk A, Keefe D, Smedley D, et al: EnsMart: A generic system for fast and flexible access to biological data. Genome Res. 2004, 14: 160-169.
- Kelso J, Visagie J, Theiler G, et al: eVOC: A controlled vocabulary for unifying gene expression data. Genome Res. 2003, 13: 1222-1230. 10.1101/gr.985203.
- Ashburner M, Ball CA, Blake JA, et al: Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Perez-Iratxeta C, Bork P, Andrade MA: Association of genes to genetically inherited diseases using data mining. Nat Genet. 2002, 31: 316-319.
- Wheeler DL, Barrett T, Benson DA, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005, 33: D39-D45.
- Lipscomb CE: Medical Subject Headings (MeSH). Bull Med Libr Assoc. 2000, 88: 265-266.
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, 33: D501-D504. 10.1093/nar/gki476.
- Turner FS, Clutterbuck DR, Semple CA: POCUS: Mining genomic sequence annotation to predict disease genes. Genome Biol. 2003, 4: R75. 10.1186/gb-2003-4-11-r75.
- Karlin S, Brocchieri L, Bergman A: Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA. 2002, 99: 333-338. 10.1073/pnas.012608599.
- Lopez-Bigas N, Ouzounis CA: Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res. 2004, 32: 3108-3114. 10.1093/nar/gkh605.
- Adie EA, Adams RR, Evans KL: Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics. 2005, 6: 55. 10.1186/1471-2105-6-55.
- Smith NG, Eyre-Walker A: Human disease genes: Patterns and predictions. Gene. 2003, 318: 169-175.
- Tiffin N, Kelso JF, Powell AR, et al: Integration of text and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res. 2005, 33: 1544-1552. 10.1093/nar/gki296.
- van Driel MA, Cuelenaere K, Kemmeren PP, et al: A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet. 2003, 11: 57-63. 10.1038/sj.ejhg.5200918.
- van Driel MA, Cuelenaere K, Kemmeren PP, et al: GeneSeeker: Extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Res. 2005, 33: W758-W761. 10.1093/nar/gki435.
- Celli J, Duijf P, Hamel BC, et al: Heterozygous germline mutations in the p53 homolog p63 are the cause of EEC syndrome. Cell. 1999, 99: 143-153. 10.1016/S0092-8674(00)81646-3.
- Bradford I, Winter R, Evans C, et al: Human-mouse gene searcher: A tool to assist discovery of malformation-associated genes by using phenotype databases. Bioinformatics. 2005, 21: 408-409. 10.1093/bioinformatics/bti017.
- Ringwald M, Eppig JT, Begley DA, et al: The Mouse Gene Expression Database (GXD). Nucleic Acids Res. 2001, 29: 98-101. 10.1093/nar/29.1.98.
- Jimenez-Sanchez G, Childs B, Valle D: Human disease genes. Nature. 2001, 409: 853-855. 10.1038/35057050.
- Spranger J: Pattern recognition in bone dysplasias. Prog Clin Biol Res. 1985, 1200: 315-342.
- Annunen S, Korkko J, Czarny M, et al: Splicing mutations of 54-bp exons in the COL11A1 gene cause Marshall syndrome, but other mutations cause overlapping Marshall/Stickler phenotypes. Am J Hum Genet. 1999, 65: 974-983. 10.1086/302585.
- van Steensel MA, Buma P, de Waal Malefijt MC, et al: Oto-spondylo-megaepiphyseal dysplasia (OSMED): Clinical description of three patients homozygous for a missense mutation in the COL11A2 gene. Am J Med Genet. 1997, 70: 315-323. 10.1002/(SICI)1096-8628(19970613)70:3<315::AID-AJMG19>3.0.CO;2-O.
- Brunner HG, van Driel MA: From syndrome families to functional genomics. Nat Rev Genet. 2004, 5: 545-551.
- Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics. 2002, 18: S110-S115. 10.1093/bioinformatics/18.suppl_2.S110.
- Hamosh A, Scott AF, Amberger J, et al: Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2002, 30: 52-55. 10.1093/nar/30.1.52.
- Cantor MN, Lussier YA: Mining OMIM for insight into complex diseases. Medinfo. 2004, 11: 753-757.
- Masseroli M, Galati O, Pinciroli F: GFINDer: Genetic disease and phenotype location statistical analysis and mining of dynamically annotated gene lists. Nucleic Acids Res. 2005, 33: W717-W723. 10.1093/nar/gki454.
- van Driel MA, Bruggeman J, Vriend G, et al: A text-mining analysis of the human phenome. Eur J Hum Genet. 2006, 14: 535-542. 10.1038/sj.ejhg.5201585.
- Freimer N, Sabatti C: The human phenome project. Nat Genet. 2003, 34: 15-21. 10.1038/ng0503-15.
- Biesecker LG: Mapping phenotypes to language: A proposal to organize and standardize the clinical descriptions of malformations. Clin Genet. 2005, 68: 320-326. 10.1111/j.1399-0004.2005.00509.x.