The truth about mouse, human, worms and yeast
© Henry Stewart Publications 2004
Received: 1 December 2003
Accepted: 1 December 2003
Published: 1 January 2004
Genome comparisons are behind the powerful new annotation methods being developed to find all human genes, as well as genes from other genomes. Genomes are now frequently being studied in pairs to provide cross-comparison datasets. This 'Noah's Ark' approach often reveals unsuspected genes and may support the deletion of false-positive predictions. Joining mouse and human as the cross-comparison dataset for the first two mammals are: two Drosophila species, D. melanogaster and D. pseudoobscura; two sea squirts, Ciona intestinalis and Ciona savignyi; four yeast (Saccharomyces) species; two nematodes, Caenorhabditis elegans and Caenorhabditis briggsae; and two pufferfish, Takifugu rubripes and Tetraodon nigroviridis. Even genomes like yeast and C. elegans, which have been known for more than five years, are now being significantly improved. Methods developed for yeast or nematodes will now be applied to mouse and human, and soon to additional mammals such as rat and dog, to identify all the mammalian protein-coding genes. Current large disparities between human Unigene predictions (127,835 genes) and gene-scanning methods (45,000 genes) still need to be resolved. This will be the challenge during the next few years.
Keywords: human genome; mouse genome; Caenorhabditis elegans genome; Caenorhabditis briggsae genome; Saccharomyces genomes; comparative genomics; gene discovery; gene-prediction algorithms
Introduction and background
The monumental sequence of a composite human genome conjures up images of Arthur C. Clarke's monolith in the film 2001: A Space Odyssey -- a beautiful, awe-inspiring structure with a hidden message. Researchers' ignorance is laid bare by the simple fact that they cannot, with any confidence, extract from this (human genome) structure the total number of its genes. What is needed is a Carl Sagan (SETI, Contact), an Alan Turing (WWII code breaker) or a Jean-François Champollion (Rosetta stone decoder) to break the codes. Or perhaps, what is really needed is a Rosetta stone for genomes: just two or three translations of the same message, laid side by side. Unfortunately, there is not even one full translation available. James Watson put it this way in a 1992 interview: 'The goal of the Human Genome Project is to understand the genetic instructions for human beings ... Getting the instructions is a big job; understanding those instructions can consume many hundreds of years ...'.
In December 1999, an analysis of the human chromosome (Chr) 22 sequence was published; 545 protein-coding genes and 134 pseudogenes were identified. In January 2003, a reanalysis of the Chr 22 sequence by the same group reported 546 protein-coding genes and 234 pseudogenes, with an increase of 74 per cent in the total length of exons in the annotation. A third, microarray-based, study doubled the number of Chr 22 base pairs in transcribed sequences. The National Center for Biotechnology Information (NCBI) human genome map-viewer build 34 version 1 (Nov 2003) has 673 genes on Chr 22 and an unspecified number of pseudogenes. Since the true number of genes and pseudogenes has not changed in the past four years -- it is merely researchers' ability to detect them that has improved -- how many more genes will be found and how will they be detected?
Finding protein-coding genes
The best method for documenting genes is with a full-length cDNA. Even shorter expressed-sequence tags (ESTs), if not from the same species then from a closely related species, are useful. The EST database dbEST (21st November, 2003) lists 5,427,257 Homo sapiens ESTs and 3,948,029 Mus musculus ESTs. The Unigene database clusters these ESTs into unique contigs representing 127,835 human (build 163) and 93,645 mouse transcripts. The human number is similar to the TIGR Gene Index prediction of 120,000 genes in humans.
According to the NCBI Handbook 2003, Unigene clusters may contain more than one alternative-splice form. Furthermore, Unigene clusters are required to have evidence of a 3' terminus, to avoid forming two or more clusters from a single long gene; this restriction prevents some ESTs in dbEST from being included in Unigene. The logical interpretation of these facts is that 'each Unigene cluster contains sequences that represent a unique gene'.
This leaves researchers with a problem. Conservative gene annotation of the human genome only identified 25,642 genes. More relaxed estimates predict about 40,000 to 45,000 genes; yet, these numbers are about threefold lower than the Unigene cluster count. At some point, these values should converge on the true number of genes -- defined as full-length, expressed messages from any cell type at any time, from germ cells to embryo to adult. Currently, this point is some way away.
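The size of this gap can be made concrete with simple arithmetic on the figures quoted above. The short Python sketch below is only an illustration; the variable and function names are our own and are not taken from any of the cited databases.

```python
# Counts quoted in the text; the labels are illustrative only.
unigene_clusters = 127_835  # human Unigene clusters (build 163)
estimates = {
    "conservative annotation": 25_642,
    "relaxed estimate (lower)": 40_000,
    "relaxed estimate (upper)": 45_000,
}


def fold_difference(clusters, genes):
    """Return how many times larger the cluster count is than a gene estimate."""
    return clusters / genes


ratios = {
    label: fold_difference(unigene_clusters, count)
    for label, count in estimates.items()
}

for label, ratio in ratios.items():
    print(f"{label}: Unigene count is {ratio:.1f}-fold higher")
```

On these numbers, the 'about threefold' gap applies to the relaxed estimates; against the conservative annotation the gap is closer to fivefold.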
By the comparative genomics approach, the mouse genome is supposed to save us from this weakness in finding genes in the human genome. By comparing mouse and human genomic sequences, all orthologous genes and many paralogous genes should be detectable, exon by exon. Preliminary efforts with small sets of known genes were highly successful. The ROSETTA program identified 94 per cent of internal coding exons from 117 mouse-human orthologous gene pairs perfectly at both exon ends, and another 4 per cent at one of the two ends. It did less well for initial, plus terminal, coding exons.
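Exon-level accuracy of this kind is scored by comparing the two boundaries of each predicted exon with those of the annotated exon. The following Python sketch shows one way such a tally could be made. It is not the ROSETTA algorithm itself, only a hypothetical scoring step; it assumes that each predicted exon has already been paired with its reference exon, and the coordinates are invented toy values.

```python
from collections import Counter


def classify_exon(predicted, reference):
    """Classify a predicted exon by how many of its ends match the reference.

    Both arguments are (start, end) tuples in the same coordinate system.
    """
    start_ok = predicted[0] == reference[0]
    end_ok = predicted[1] == reference[1]
    if start_ok and end_ok:
        return "both ends"
    if start_ok or end_ok:
        return "one end"
    return "neither end"


def summarise(pairs):
    """Return the fraction of exon pairs falling into each category."""
    if not pairs:
        raise ValueError("no exon pairs supplied")
    counts = Counter(classify_exon(pred, ref) for pred, ref in pairs)
    return {
        category: counts[category] / len(pairs)
        for category in ("both ends", "one end", "neither end")
    }


# Toy (predicted, reference) pairs; the coordinates are made up.
toy_pairs = [
    ((100, 200), (100, 200)),  # exact match
    ((300, 400), (300, 400)),  # exact match
    ((500, 590), (500, 600)),  # start matches, end does not
    ((710, 790), (700, 800)),  # neither end matches
]
summary = summarise(toy_pairs)
print(summary)
```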
Including conserved sequence elements
We now find the problem grows more complex, however, because there are thousands of non-expressed conserved sequence elements (CSEs) in the two mammals, sequences whose function we do not understand. Some are possibly promoter regions, some pseudogenes or RNA genes and some are new undocumented genes, but it is clear that this does not account for all of these sequences. Thus, the comparative genomics approach may over-predict, when viewing two mammals, since they may be phylogenetically too close. The distance between species for optimal gene identification has been studied, and mouse-human is generally good, but a mammal more distant than mouse from human might be even better.
An alternative approach has been to use fish as a more distant relative. The EXOFISH program compared human and Tetraodon nigroviridis (freshwater pufferfish) for conserved regions (presumably exons) and found 28,000-34,000 genes. Due to the greater evolutionary distance between human and fish, there is a cleaner background, but the many mammal-specific genes and human brain-specific genes may not be identified, so the gene number predicted by EXOFISH is almost certainly an underestimate.
Another approach is exemplified by the analysis of sequences from 12 species, all derived from a 1.8 Mb region orthologous to a human Chr 7 segment containing ten genes. In this instance, coding exons were already well documented, but substantial numbers of CSEs -- beyond those previously identified experimentally -- were discovered. This approach might be more fruitful at human gene discovery, if applied to areas of the human genome that are more poorly characterised than the Chr 7 segment chosen.
Whereas ~1.5 per cent of the human genome comprises protein-coding genes, another ~3.5 per cent of the genome contains CSEs that are more conserved than protein-coding-gene regions. Possible functions for these CSEs (termed CNGs by Dermitzakis et al. and CNSs by Inada et al.) include control regions that: (a) regulate gene expression; (b) govern developmental-, cell type- and organ-specific expression, in trans, of genes located far away; (c) lock in regulatory decisions; and (d) act as structural components of chromosomes when alignment and chromosome movement occur during meiosis or mitosis. There appear to be at least twice as many CSEs as protein-coding genes in the genome. A recent comparison of 43 species -- including vertebrates, insects, worms, plants, fungi, yeast, eubacteria and archaebacteria -- revealed noteworthy increases in genome size and complexity from prokaryote to mammals, again emphasising the innumerable highly-conserved CSEs that are likely to have essential functions and critical effects on an organism's phenotype.
Learning from the worms
Table 1: Several comparisons of Caenorhabditis briggsae, Caenorhabditis elegans WS77* and Caenorhabditis elegans hybrid gene sets (rows: number of genes; number of exons)
Different algorithms for predicting protein-coding genes give similar results in predicting exons but tend to disagree on the grouping of exons into genes. Four different gene-prediction programs can give four very different answers across the same region of a genome. Stein et al. used the concordance of prediction between C. elegans and C. briggsae to predict the most likely gene model -- using Genefinder (version 980506, P. Green, unpublished data, 2003; see also Ref. ), FGENESH, TWINSCAN and the Ensembl annotation pipeline. The output of the four gene-prediction programs (Figure 1) was largely concordant with respect to the position of C. briggsae exons (80 per cent of exons predicted identically by two or more programs; 26 per cent predicted identically by all four programs), but discordant with regard to gene predictions (38 per cent of genes called identically by two or more programs; just 4 per cent called identically by all four programs). A similar pattern was seen in C. elegans.
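The contrast between exon-level and gene-level agreement can be illustrated with a small Python sketch. This is not the procedure used by Stein et al.; it is a minimal illustration under our own assumptions: each program's output is a set of gene models, each gene model is a tuple of (start, end) exons, two predictions count as identical only if their coordinates match exactly, and percentages are taken over all distinct features predicted by any program. The program names and coordinates are invented.

```python
from collections import Counter


def concordance(predictions):
    """Fraction of distinct features called by >= 2 programs and by all programs.

    `predictions` maps a program name to a set of hashable features
    (exons, or gene models expressed as tuples of exons).
    """
    support = Counter()
    for features in predictions.values():
        support.update(set(features))
    total = len(support)
    if total == 0:
        return {"two_or_more": 0.0, "all": 0.0}
    n_programs = len(predictions)
    two_plus = sum(1 for count in support.values() if count >= 2)
    unanimous = sum(1 for count in support.values() if count == n_programs)
    return {"two_or_more": two_plus / total, "all": unanimous / total}


# Invented exons and gene models for four hypothetical programs.
e1, e2, e3, e4, e4_short = (100, 200), (300, 400), (500, 600), (700, 800), (700, 790)
gene_models = {
    "program_A": {(e1, e2), (e3, e4)},
    "program_B": {(e1, e2, e3), (e4,)},
    "program_C": {(e1,), (e2, e3, e4)},
    "program_D": {(e1, e2), (e3, e4_short)},
}

# Exon-level view: flatten each program's gene models into a set of exons.
exon_calls = {
    name: {exon for gene in genes for exon in gene}
    for name, genes in gene_models.items()
}

exon_result = concordance(exon_calls)
gene_result = concordance(gene_models)
print("exons:", exon_result)
print("genes:", gene_result)
```

In this toy case the programs agree on nearly every exon but on only one gene model, which mirrors the pattern described above: most of the disagreement lies in how exons are grouped, not in where they are.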
Stein et al. termed the gene sets produced by their analysis 'hybrid gene sets', because the final gene sets are a mixture of gene prediction from multiple programs; applying a transposon- and pseudogene-filtering step to the WormBase 77 set, they removed 619 genes to create a 'pruned' WS77 set, termed WS77*. The constitution of the final gene sets was: C. briggsae, 19,507 genes; the C. elegans WS77*, 18,808 genes; and the hybrid C. elegans, 20,621 genes (Table 1).
Stein et al. compared the C. elegans hybrid gene set (20,621 genes) to the WS77* set (18,808 genes) derived from WormBase and derived 1,275 well-supported suggestions for new C. elegans genes, 1,763 new exons in 1,100 existing genes, 2,093 exon deletions in 1,583 genes, 1,675 exon truncations in 1,502 existing genes and 1,115 exon extensions in 1,008 existing genes. These data underscore the value of comparative genomics between total-genome sequences from two species in establishing a more accurate count of protein-coding genes.
Comparing C. elegans/C. briggsae divergence and mouse/human divergence
The two worms diverged ~100 million years ago (MYA) and the two mammals diverged ~75 MYA. Similar levels of amino acid identity exist between C. briggsae and C. elegans orthologues (80 per cent) and between mouse and human orthologues (78.5 per cent). In the mouse/human comparison, 80 per cent of predicted proteins can be assigned to a 1:1 orthologue pair, whereas <65 per cent of C. briggsae genes could be assigned a C. elegans orthologue. The protein families are thus more dynamic in the two nematodes -- several hundred either being novel or having diverged so far that their common origin cannot be recognised, and another ~200 having expanded or contracted by more than twofold. The C. briggsae/C. elegans pair is also evolving more rapidly at the nucleotide level: 1.78 synonymous substitutions per synonymous site, compared with 0.6 in the mouse/human pair.
Many of these striking differences between the two worms and the two mammals can probably be explained by differences in generation time. The generation time in the nematodes is ~3 days, compared with ~3 months and ~20 years for the mouse and human, respectively.
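The figures above can be turned into rough per-year rates. The Python sketch below is a back-of-the-envelope calculation only; it assumes a simple molecular clock in which both lineages of a pair have evolved at the same rate since divergence, so that the time separating the two genomes is twice the divergence time. The function and variable names are our own.

```python
def per_year_rate(substitutions_per_site, divergence_years):
    """Synonymous substitutions per site per year along one lineage.

    Each of the two lineages has evolved for `divergence_years`, so the
    total time separating the genomes is twice the divergence time.
    """
    return substitutions_per_site / (2 * divergence_years)


worm_rate = per_year_rate(1.78, 100e6)   # C. briggsae / C. elegans, ~100 MYA
mammal_rate = per_year_rate(0.6, 75e6)   # mouse / human, ~75 MYA

# Approximate generations per year from the generation times in the text.
generations_per_year = {
    "nematode": 365 / 3,  # ~3-day generation
    "mouse": 12 / 3,      # ~3-month generation
    "human": 1 / 20,      # ~20-year generation
}

print(f"worm rate:   {worm_rate:.2e} substitutions/site/year")
print(f"mammal rate: {mammal_rate:.2e} substitutions/site/year")
print(f"worm/mammal ratio: {worm_rate / mammal_rate:.2f}")
for organism, gens in generations_per_year.items():
    print(f"{organism}: {gens:.2f} generations/year")
```

Under these assumptions the nematode pair accumulates synonymous substitutions a little over twice as fast per year as the mammalian pair, while passing through roughly 30 times as many generations per year as the mouse.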
Approaching a stable gene count in yeast: Hope for mammals
Improved annotation does not always increase gene number. Detailed comparison of four Saccharomyces species resulted in revision of 15 per cent of known yeast genes and a net decrease in the S. cerevisiae gene count of about 500; this is a case where 'less is more'. This illustrates the power of adding more closely related sequences to the analysis, especially since the yeast genome had been known for seven years prior to this analysis.
Tremendous progress has been made in the eight years since the baker's yeast genome sequence appeared. There is still a large gap, however, between gene predictions and Unigene clusters. This must be accounted for by improvement of comparative genomics methods such as: (a) using the ROSETTA program to include three or more species; (b) obtaining more comprehensive EST collections from mouse, rat, human and other species, possibly by purchase of these resources from private companies that have already amassed the information; and/or (c) utilising consensus prediction methods, as was done in the C. elegans/C. briggsae study. Special attention will need to be given to the first- and last-exon predictions, as well as allowance of non-canonical intron-exon boundaries (GC versus GT, etc) -- if supported by EST data. Verification of predictions by reverse transcriptase polymerase chain reaction, as was demonstrated in the study by Guigo et al., will confirm the expression of questionable genes and enhance genome annotation. One can only hope that Dr Watson's prediction of 12 years ago was a slight exaggeration.
The writing of this article was funded, in part, by NIH grant P30 ES06096 (D.W.N.). The authors very much appreciate assistance with the graphics from Dr Marian Miller.
- Dunham I, Shimizu N, Roe BA, et al: 'The DNA sequence of human chromosome 22'. Nature. 1999, 402: 489-495. 10.1038/990031.
- Collins JE, Goward ME, Cole CG, et al: 'Reevaluating human gene annotation: A second-generation analysis of chromosome 22'. Genome Res. 2003, 13: 27-36. 10.1101/gr.695703.
- Rinn JL, Euskirchen G, Bertone P, et al: 'The transcriptional activity of human chromosome 22'. Genes Dev. 2003, 17: 529-540. 10.1101/gad.1055203.
- Liang F, Holt I, Pertea G, et al: 'Gene index analysis of the human genome estimates approximately 120,000 genes'. Nat Genet. 2000, 25: 239-240. 10.1038/76126.
- Pontius JU, Wagner L, Schuler GD: 'UniGene: A unified view of the transcriptome'. The NCBI Handbook. 2003, National Center for Biotechnology Information, Bethesda, MD, 21-24.
- Wheeler DL, Church DM, Federhen S, et al: 'Database resources of the National Center for Biotechnology'. Nucleic Acids Res. 2003, 31: 28-33. 10.1093/nar/gkg033.
- Flicek P, Keibler E, Hu P, et al: 'Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map'. Genome Res. 2003, 13: 46-54. 10.1101/gr.830003.
- Xuan Z, Wang J, Zhang MQ: 'Computational comparison of two mouse draft genomes and the human golden path'. Genome Biol. 2003, 4: R1. 10.1186/gb-2003-4-2-p1.
- Das M, Burge CB, Park E, et al: 'Assessment of the total number of human transcription units'. Genomics. 2001, 77: 71-78. 10.1006/geno.2001.6620.
- Batzoglou S, Pachter L, Mesirov JP, et al: 'Human and mouse gene structure: Comparative analysis and application to exon prediction'. Genome Res. 2000, 10: 950-958. 10.1101/gr.10.7.950.
- Zhang L, Pavlovic V, Cantor CR, Kasif S: 'Human-mouse gene identification by comparative evidence integration and evolutionary analysis'. Genome Res. 2003, 13: 1190-1202. 10.1101/gr.703903.
- Roest-Crollius H, Jaillon O, Bernot A, et al: 'Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence'. Nat Genet. 2000, 25: 235-238.
- Thomas JW, Touchman JW, Blakesley RW, et al: 'Comparative analyses of multi-species sequences from targeted genomic regions'. Nature. 2003, 424: 788-793. 10.1038/nature01858.
- Dermitzakis ET, Reymond A, Scamuffa N, et al: 'Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs)'. Science. 2003, 302: 1033-1035. 10.1126/science.1087047.
- Inada DC, Bashir A, Lee C, et al: 'Conserved noncoding sequences in the grasses'. Genome Res. 2003, 13: 2030-2041. 10.1101/gr.1280703.
- Lynch M, Conery JS: 'The origins of genome complexity'. Science. 2003, 302: 1401-1404. 10.1126/science.1089370.
- Stein LD, Bao Z, Blasiar D, et al: 'The genome sequence of Caenorhabditis briggsae: A platform for comparative genomics'. PLoS Biol. 2003, 1: 2-
- Reese MG, Hartzell G, Harris NL, et al: 'Genome annotation assessment in Drosophila melanogaster'. Genome Res. 2000, 10: 483-501. 10.1101/gr.10.4.483.
- Salamov AA, Solovyev VV: 'Ab initio gene finding in Drosophila genomic DNA'. Genome Res. 2000, 10: 516-522. 10.1101/gr.10.4.516.
- Korf I, Flicek P, Duan D, et al: 'Integrating genomic homology into gene structure prediction'. Bioinformatics. 2001, 17 (Suppl 1): S140-S148. 10.1093/bioinformatics/17.suppl_1.S140.
- Clamp M, Andrews D, Barker D, et al: 'Ensembl 2002: Accommodating comparative genomics'. Nucleic Acids Res. 2003, 31: 38-42. 10.1093/nar/gkg083.
- Kellis M, Patterson N, Endrizzi M, et al: 'Sequencing and comparison of yeast species to identify genes and regulatory elements'. Nature. 2003, 423: 241-254. 10.1038/nature01644.
- Guigo R, Dermitzakis ET, Agarwal P, et al: 'Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes'. Proc Natl Acad Sci USA. 2003, 100: 1140-1145. 10.1073/pnas.0337561100.