
Web-based resources for comparative genomics


The available web-based genome data and related resources provide great opportunities for biomedical scientists to identify functional elements in a particular genome region or to explore the evolutionary pattern of genome dynamics. Comparative genomics is an indispensable tool for achieving these goals. Because of the broad scope of comparative genomics, it is difficult to address all of its aspects in a short survey. A few currently 'hot' topics have therefore been selected, and a brief review of the available web-based databases and software is given.

Genome databases for comparative genomics

Genome-wide databases (see Table 1) usually change rapidly, both in their internal implementation and in the datasets recorded. This paper briefly reviews two servers recently made public, which researchers should find to be valuable sources of information. The genome alignment and annotation database (GALA)[1] provides access to information on genes (known and predicted), gene ontology, expression patterns, genome alignments and conserved transcription factor binding sites. These sites are predicted using TRANSFAC weight matrices, which are estimated from known binding sites and summarise their sequence signature [2]. For example, given a set of genes expressed in a particular tissue, GALA is able to identify all of the predicted binding sites for one or more transcription factors of interest that are conserved in mammals. EnsMart is a branch of the Ensembl project,[3] which integrates data from Ensembl and several other resources, using a 'warehouse star-schema' with central biological objects (eg genes or single nucleotide polymorphisms) connected to a set of satellite tables, such as disease, transcript and protein family (PFAM) attributes. Thus, EnsMart provides users with fast and effective access to deep data in and around genes.
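
The idea of a weight matrix estimated from known binding sites can be illustrated with a minimal sketch. It assumes a generic log-odds position weight matrix with a uniform background and a pseudocount; the binding sites, the query sequence and the function names are hypothetical, and the actual TRANSFAC matrices and the scoring used by GALA may differ.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length known binding sites."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)


# Hypothetical binding sites and query sequence, for illustration only.
known_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(known_sites)
best_score, best_pos = best_hit(pwm, "GGCATGACTCATTCG")
print(best_pos, round(best_score, 2))
```

In practice a score threshold would be chosen for each matrix, and only hits that are also conserved in the aligned species would be retained.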

Table 1 Websites for tools and databases useful for comparative genomics

Multi-genome alignment and gene prediction

Genome-wide alignment servers for two closely related species are available on the web. BLAST,[4, 5] implemented at the National Center for Biotechnology Information (NCBI), is the most frequently used suite of tools. Several servers were specially designed to align two or more long genomic sequences at high sensitivity while detecting common rearrangements or duplications; for example, PipMaker,[6] MultiPipMaker,[7] zPicture,[8] VISTA [9] and MAVID [10]. These servers are suitable for comparing species such as those from different mammalian orders. Several pipelines have been designed for mammalian genome alignment [11-13]. For more distant species, or ancient paralogous genes, different alignment methods are recommended. One major application is to look for common motifs in the upstream regions of co-expressed genes. Two examples of these approaches are multiple expectation maximisation for motif elicitation (MEME) and Gibbs sampling [14-16].

One application of multi-genome alignment is to improve the efficiency of gene finding. ROSETTA reconstructs co-linear gene structures from global alignments and defines exons as sub-sequences bounded by splice sites [17]. Syntenic Gene Prediction version 1 (SGP1) reconstructs genes from a collection of local alignments between two sequences,[18] while SGP2 assesses the reliability of gene models predicted by GeneID,[19] a conventional gene predictor [20]. Similarly, TWINSCAN is a direct extension of the Genscan algorithm that integrates conservation information between two sequences [21-23]. DOUBLESCAN uses a pair hidden Markov model (Pair HMM) to reconstruct gene structures from a series of local alignments created with BLAST [4, 24].

Evolutionary approaches to protein function detection

Phylogenetic analysis by maximum likelihood (PAML) is a software package that includes a wealth of methods for statistically testing the evolutionary pattern of coding sequences, which can be used for detecting and predicting protein function [25]. For instance, PAML is able to estimate ω, the ratio of the non-synonymous rate to the synonymous rate at each amino acid residue along the lineages of a given phylogenetic tree. DIVERGE is a program for studying the functional divergence of a protein family by detecting site-specific changes in the evolutionary rate using a multiple alignment of amino acid sequences for a given phylogenetic tree [26, 27]. It first conducts a statistical test for site-specific rate shifts along the tree and then predicts candidate amino acid residues responsible for functional divergence based on posterior analysis. These results can then be mapped onto the three-dimensional protein structure, if available.
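
The distinction between synonymous and non-synonymous changes that underlies ω can be shown with a small sketch. It is only a crude tally of single-nucleotide codon differences between two aligned coding sequences under the standard genetic code; it is not the maximum likelihood estimate that PAML computes, which also accounts for the numbers of synonymous and non-synonymous sites and for the phylogeny. The sequences and function name are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_differences(seq_a, seq_b):
    """Count (synonymous, non-synonymous) codon differences.

    Only codon pairs that differ at exactly one nucleotide are classified;
    pairs with two or three differences are skipped in this simplification.
    """
    synonymous = nonsynonymous = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        mismatches = sum(1 for x, y in zip(codon_a, codon_b) if x != y)
        if mismatches != 1:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical aligned coding sequences: GCT->GCC keeps Ala, AAA->AGA changes Lys to Arg.
print(count_codon_differences("ATGGCTAAA", "ATGGCCAGA"))
```

An excess of non-synonymous over synonymous change, after proper normalisation, is what an ω greater than 1 is meant to capture.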

Multiple genome rearrangement by signed reversal

For comparative gene mapping, it is important to reconstruct the ancestral gene orders for given current genomes. Mathematically, this becomes a problem of signed reversals; that is, how the genomes evolved from a common ancestral genome through signed reversals of genes or gene sets. Since this problem is non-deterministic polynomial-time hard (NP-hard),[28] most work is focused on heuristic algorithms for reconstructing the gene order of ancestral genomes. Sankoff et al. [29] searched for the optimal ancestral genome for a median problem upon a grid. Bourque and Pevzner [30] designed the multiple genome rearrangement (MGR) algorithm to reconstruct ancestral genomes using a greedy-split strategy. Wu and Gu [31, 32] improved the search accuracy by using a nearest path search algorithm; they developed a neighbour-perturbing algorithm to reconstruct the optimal gene order of ancestral genomes.
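
A signed reversal can be made concrete with a short sketch. A genome is represented as a list of signed integers (gene identity and orientation); a reversal inverts the order of a segment and flips the sign of each gene in it. The breakpoint count relative to the identity order is included as a simple measure of disorder. The function names and the example genome are illustrative assumptions, and none of the cited heuristics is reproduced here.

```python
def reverse_segment(genome, start, end):
    """Apply a signed reversal to genome[start..end] (inclusive):
    the segment order is reversed and every gene in it changes orientation."""
    flipped = [-gene for gene in reversed(genome[start:end + 1])]
    return genome[:start] + flipped + genome[end + 1:]


def count_breakpoints(genome):
    """Count adjacencies that differ from the identity order 1..n.

    The genome is framed by 0 and n+1; a pair (a, b) is conserved when
    b - a == 1, which covers both (i, i+1) and (-(i+1), -i).
    """
    framed = [0] + list(genome) + [len(genome) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


# Hypothetical genome: genes 2 and 3 are inverted relative to the reference order.
genome = [1, -3, -2, 4]
print(count_breakpoints(genome))
restored = reverse_segment(genome, 1, 2)
print(restored, count_breakpoints(restored))
```

Here one reversal removes both breakpoints; the hard part of the problem described above is choosing such reversals jointly for several genomes and an unknown ancestor.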

Comparative microarray analysis

Because of the limited data available, there are only a few case studies of interspecies microarray analysis. One good example is the comparison of human and chimpanzee expression profiles in the brain and liver [33, 34]. Gu [35] developed a statistical framework for studying expression divergence between duplicate genes, which can also be used to infer the ancestral expression profiles when the phylogeny of duplicate genes is known. To facilitate application of these models to expression and genomic data, Gu et al. [36] defined an additive expression distance between duplicate genes, measured by the average of squared expression differences. They analysed yeast gene families using a multi-microarray dataset and found a more than ten-fold increase in the rate of expression evolution immediately following gene duplication.
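
The expression distance described above, the average of squared expression differences, can be sketched in a few lines. The sketch assumes each gene is represented by a profile of expression values measured under the same conditions in the same order; any normalisation or transformation applied in the cited work is not reproduced, and the values shown are hypothetical.

```python
def expression_distance(profile_a, profile_b):
    """Average of squared expression differences between two genes
    measured across the same set of microarray conditions."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same conditions")
    if not profile_a:
        raise ValueError("profiles must not be empty")
    squared = [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]
    return sum(squared) / len(squared)


# Hypothetical expression values for two duplicate genes under three conditions.
duplicate_1 = [1.0, 2.0, 3.0]
duplicate_2 = [1.0, 4.0, 1.0]
print(expression_distance(duplicate_1, duplicate_2))
```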

Identification of functional non-coding elements by comparative genomics

Although the majority of eukaryote genomes are non-coding regions and were previously regarded as 'junk DNA', recent studies have indicated that non-coding regions harbour important functional elements such as cis-regulatory modules [37, 38]. Computational detection of these functional non-coding elements has been extremely challenging. It has been recognised that comparative genomics may be a promising approach to solving this problem. 'Phylogenetic footprinting' focuses on the discovery of novel regulatory elements based on the sequence conservation among a set of orthologous non-coding regions [39]. Many successful motif discovery programs have been developed; for example, Gibbs sampler,[40] MEME,[41] Consensus,[42] AlignACE,[43] ANN-Spec,[44] FootPrinter [45] and PhyME [46]. For non-coding RNA elements, many tools have been developed to identify the evolutionary conservation of secondary structures, such as QRNA,[47] ddbRNA,[48] MSARI [49] and RNAz [50]. The development of these tools serves as compelling evidence for biologically relevant non-coding RNA function. In addition, some databases of functional non-coding elements are also available; for example, TRED,[51] RNAdb [52] and NONCODE [53].
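
The principle behind phylogenetic footprinting, that functional elements stand out as conserved blocks among orthologous non-coding sequences, can be illustrated with a deliberately simple sketch. It assumes the sequences are already aligned and requires strict identity in every column of a window; the programs listed above use more sophisticated, often phylogeny-aware, models. The sequences and function name are hypothetical.

```python
def conserved_windows(aligned, width):
    """Return the start columns of windows of the given width in which
    all aligned sequences are identical and contain no gaps."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    # A column is conserved when every sequence carries the same non-gap base.
    conserved = [
        len({seq[i] for seq in aligned}) == 1 and aligned[0][i] != "-"
        for i in range(length)
    ]
    return [i for i in range(length - width + 1) if all(conserved[i:i + width])]


# Hypothetical aligned upstream regions from three species.
alignment = [
    "ACGTTGACGTA",
    "TCGATGACGTC",
    "GCGCTGACGTT",
]
for start in conserved_windows(alignment, 6):
    print(start, alignment[0][start:start + 6])
```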


In summary, this paper has briefly reviewed the web-based resources for comparative genomics. Given that substantial resources are available, the real challenge is how to translate the explosion of genomic data into biological knowledge. The internet has substantially facilitated this transition, but progress depends on the development of new ideas and analysis pipelines that combine many approaches, including comparative genomics.


  1. Giardine BM, Elnitski L, Riemer C, et al: GALA, a database for genomic sequence alignments and annotations. Genome Res. 2003, 13: 732-741. 10.1101/gr.603103.
  2. Wingender E, Chen X, Fricke E, et al: The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 2001, 29: 281-283. 10.1093/nar/29.1.281.
  3. Kasprzyk A, Keefe D, Smedley D, et al: EnsMart--A generic system for fast and flexible access to biological data. Genome Res. 2004, 14: 160-169. 10.1101/gr.1645104.
  4. Altschul SF, Gish W, Miller W, et al: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
  5. Altschul SF, Madden TL, Schaffer AA, et al: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
  6. Schwartz S, Zhang Z, Frazer KA, et al: PipMaker--A web server for aligning two genomic DNA sequences. Genome Res. 2000, 10: 577-586. 10.1101/gr.10.4.577.
  7. Schwartz S, Elnitski L, Li M, et al: MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 2003, 31: 3518-3524. 10.1093/nar/gkg579.
  8. Ovcharenko I, Loots GG, Hardison RC, et al: zPicture: Dynamic alignment and visualization tool for analyzing conservation profiles. Genome Res. 2004, 14: 472-477. 10.1101/gr.2129504.
  9. Mayor C, Brudno M, Schwartz JR, et al: VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics. 2000, 16: 1046-1047. 10.1093/bioinformatics/16.11.1046.
  10. Bray N, Pachter L: MAVID multiple alignment server. Nucleic Acids Res. 2003, 31: 3525-3526. 10.1093/nar/gkg623.
  11. Brudno M, Do CB, Cooper GM, et al: LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003, 13: 721-731. 10.1101/gr.926603.
  12. Couronne O, Poliakov A, Bray N, et al: Strategies and tools for whole-genome alignments. Genome Res. 2003, 13: 73-80. 10.1101/gr.762503.
  13. Schwartz S, Kent WJ, Smit A, et al: Human-mouse alignments with Blastz. Genome Res. 2003, 13: 103-105. 10.1101/gr.809403.
  14. Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol. 1995, 3: 21-29.
  15. Schug J, Overton GC: Modeling transcription factor binding sites with Gibbs sampling and minimum description length encoding. Proc Int Conf Intell Syst Mol Biol. 1997, 5: 268-271.
  16. Thompson W, Rouchka EC, Lawrence CE: Gibbs Recursive Sampler: Finding transcription factor binding sites. Nucleic Acids Res. 2003, 31: 3580-3585. 10.1093/nar/gkg608.
  17. Batzoglou S, Pachter L, Mesirov JP, et al: Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Res. 2000, 10: 950-958. 10.1101/gr.10.7.950.
  18. Wheeler DL, Church DM, Lash AE: Database resources of the National Center for Biotechnology Information: Update. Nucleic Acids Res. 2002, 30: 13-16. 10.1093/nar/30.1.13.
  19. Parra G, Agarwal P, Abril JF, et al: Comparative gene prediction in human and mouse. Genome Res. 2003, 13: 108-117. 10.1101/gr.871403.
  20. Guigo R: Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol. 1998, 5: 681-702. 10.1089/cmb.1998.5.681.
  21. Korf I, Flicek P, Duan D, Brent MR: Integrating genomic homology into gene structure prediction. Bioinformatics. 2001, 17 (Suppl 1): S140-S148. 10.1093/bioinformatics/17.suppl_1.S140.
  22. Burge C: Identification of genes in human genomic DNA. PhD thesis. 1997, Stanford University, Stanford, CA.
  23. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268: 78-94. 10.1006/jmbi.1997.0951.
  24. Meyer IM, Durbin R: Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics. 2002, 18: 1309-1318. 10.1093/bioinformatics/18.10.1309.
  25. Yang Z: PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13: 555-556.
  26. Gu X, Vander Velden K: DIVERGE: Phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics. 2002, 18: 500-501. 10.1093/bioinformatics/18.3.500.
  27. Gu X: Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol. 1999, 16: 1664-1674. 10.1093/oxfordjournals.molbev.a026080.
  28. Caprara A: Formulations and hardness of multiple sorting by reversals. Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB'99). 1999, ACM Press, New York, NY.
  29. Sankoff D, Sundaram G, Kececioglu J: Steiner points in the space of genome rearrangements. Int J Found Comput Sci. 1996, 7: 1-9. 10.1142/S0129054196000026.
  30. Bourque G, Pevzner PA: Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 2002, 12: 26-36.
  31. Wu S, Gu X: Multiple genome rearrangement by reversals. Pac Symp Biocomput. 2002, 7: 259-270.
  32. Wu S, Gu X: Algorithms for multiple genome rearrangement by signed reversals. Pac Symp Biocomput. 2003, 8: 363-374.
  33. Enard W, Khaitovich P, Klose J, et al: Intra- and interspecific variation in primate gene expression patterns. Science. 2002, 296: 340-343. 10.1126/science.1068996.
  34. Gu J, Gu X: Induced gene expression in human brain after the split from chimpanzee. Trends Genet. 2003, 19: 63-65. 10.1016/S0168-9525(02)00040-9.
  35. Gu X: Statistical framework for phylogenomic analysis of gene family expression profiles. Genetics. 2004, 167: 531-542. 10.1534/genetics.167.1.531.
  36. Gu X, Zhang Z, Huang W: Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proc Natl Acad Sci USA. 2005, 102: 707-712. 10.1073/pnas.0409186102.
  37. Dermitzakis ET, Reymond A, Lyle R, et al: Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature. 2002, 420: 578-582. 10.1038/nature01251.
  38. Gibbs WW: The unseen genome: Gems among the junk. Sci Am. 2003, 289: 26-33.
  39. Tagle DA, Koop BF, Goodman M, et al: Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J Mol Biol. 1988, 203: 439-455. 10.1016/0022-2836(88)90011-3.
  40. Lawrence CE, Altschul SF, Boguski MS, et al: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.
  41. Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach Learn. 1995, 21: 51-80.
  42. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.
  43. Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol. 1998, 16: 939-945. 10.1038/nbt1098-939.
  44. Workman CT, Stormo GD: ANN-Spec: A method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000, 5: 467-478.
  45. Blanchette M, Tompa M: FootPrinter: A program designed for phylogenetic footprinting. Nucleic Acids Res. 2003, 31: 3840-3842. 10.1093/nar/gkg606.
  46. Sinha S, Blanchette M, Tompa M: PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004, 5: 170. 10.1186/1471-2105-5-170.
  47. Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics. 2001, 2: 8. 10.1186/1471-2105-2-8.
  48. di Bernardo D, Down T, Hubbard T: ddbRNA: Detection of conserved secondary structures in multiple alignments. Bioinformatics. 2003, 19: 1606-1611. 10.1093/bioinformatics/btg229.
  49. Coventry A, Kleitman DJ, Berger B: MSARI: Multiple sequence alignments for statistical detection of RNA secondary structure. Proc Natl Acad Sci USA. 2004, 101: 12102-12107. 10.1073/pnas.0404193101.
  50. Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA. 2005, 102: 2454-2459. 10.1073/pnas.0409169102.
  51. Zhao F, Xuan Z, Liu L, Zhang MQ: TRED: A Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res. 2005, 33: D103-D107. 10.1093/nar/gni105.
  52. Pang KC, Stephen S, Engstrom PG, et al: RNAdb -- A comprehensive mammalian noncoding RNA database. Nucleic Acids Res. 2005, 33: D125-D130. 10.1093/nar/gni117.
  53. Liu C, Bai B, Skogerbo G, et al: NONCODE: An integrated knowledge database of non-coding RNAs. Nucleic Acids Res. 2005, 33: D112-D115. 10.1093/nar/gni113.


This work was supported by an NIH grant to X.G. and the NSFC Overseas Outstanding Young Investigator Award (China) to X.G.

Correspondence to Xun Gu.

Gu, X., Su, Z. Web-based resources for comparative genomics. Hum Genomics 2, 187 (2005).


  • comparative genomics
  • software
  • web-based database