Open Access

Gene nomenclature by default, or BLASTing to Babel

Human Genomics 2005, 2:196

DOI: 10.1186/1479-7364-2-3-196

Received: 11 May 2005

Accepted: 11 May 2005

Published: 1 September 2005

Abstract

The current proliferation of mammalian genomes is creating a nomenclature issue caused by naming genes based on their best BLAST hit to a gene in another annotated genome. The rat genome is relying heavily on the mouse genome for nomenclature, but not all rat genes have direct orthologues in the mouse; often, there are paralogous groups of genes -- due to expansions of that gene subfamily in one or the other genome. Many of these genes have already been assigned names in the rat, so that renaming them based on BLAST scores leads to duplicate sets of names. The supposed orthology created by name sharing across genomes is not always found. These inaccurate names are appearing in frequently used sites, such as the University of California Santa Cruz Genome Browser. The example of rat cytochrome P450 (Cyp) genes is presented here, but other gene families are also likely to be affected.

Keywords

gene nomenclature, cytochrome P450, rat genome, mouse genome, orthologue

Introduction

The rat genome has been sequenced and assembled [1], creating a need for rat gene nomenclature. The obvious source of gene nomenclature for the rat would seem to be the mouse genome. Ideally, orthologues should have the same name. This logic has led to the automated naming of rat genes, which causes problems of two kinds. First, the rat has long been an experimental animal. Genes from both rat and mouse were sequenced and named for nearly 20 years before the genomes themselves were sequenced. In the case of the cytochromes P450, the first mammalian sequences, Cyp2b1 and Cyp2b2, were determined in the rat [2]. The mouse sequences began to appear two years later, with Cyp1a1 and Cyp1a2 [3, 4]. The established nomenclature for CYP genes has been in place since 1987 [5-7], and these names have been used in publications for several years. Because the names were assigned independently, mostly in chronological order, orthologues do not always carry the same name.

The second nomenclature problem has to do with divergence over time between species' genomes. Here, mouse and rat will be discussed, but the same applies to other species pairs, such as human and rhesus monkey. When similar genes appear in gene clusters, the one-to-one relationship of the genes between mouse and rat is often broken, meaning that the orthology is broken. Compared with the 57 CYP genes of the human, the mouse has greatly expanded its set of Cyp genes to 102 full-length genes [8]; the rat has been a little more conservative, with 87 Cyp genes [9]. The solo genes in a mammalian CYP subfamily -- those that occur without related neighbours -- are strict orthologues, and so nomenclature by best reciprocal BLAST hit between mouse and rat is a viable strategy for them. This works for 31 mouse-rat gene pairs and one pseudogene. Eighty-seven rat genes cannot be matched up to 102 mouse genes as orthologue pairs, however, and this nomenclature method can be seen to fail in the gene clusters.
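The best-reciprocal-hit idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used by any genome browser: the gene labels (`mSolo`, `rClusterA`, and so on) and all similarity scores are invented, and the function names are hypothetical. It shows why a solo gene pairs cleanly, while in an expanded cluster the method still reports a "pair" even though several paralogues score almost equally well.

```python
from collections import defaultdict


def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def crowded_targets(a_vs_b):
    """Return subject genes that are the best hit of more than one query gene.

    Such many-to-one groups are candidates for paralogous sets and should be
    inspected by a curator rather than named automatically.
    """
    groups = defaultdict(list)
    for a, b in best_hits(a_vs_b).items():
        groups[b].append(a)
    return {b: sorted(qs) for b, qs in groups.items() if len(qs) > 1}


# Invented toy scores (NOT real BLAST output): one solo gene and a small
# cluster that has expanded on the mouse side.
mouse_vs_rat = {
    "mSolo": {"rSolo": 950, "rClusterA": 400},
    "mDup1": {"rClusterA": 885, "rClusterB": 860},
    "mDup2": {"rClusterA": 880, "rClusterB": 870},
    "mDup3": {"rClusterB": 875, "rClusterA": 865},
}
rat_vs_mouse = {
    "rSolo": {"mSolo": 950, "mDup1": 400},
    "rClusterA": {"mDup1": 885, "mDup2": 880},
    "rClusterB": {"mDup3": 875, "mDup2": 870},
}

pairs = reciprocal_best_hits(mouse_vs_rat, rat_vs_mouse)
flagged = crowded_targets(mouse_vs_rat)
print(pairs)
print(flagged)
```

In this toy input the solo pair is recovered, but the cluster also yields reciprocal pairs that win by only a few score units, and `crowded_targets` shows two mouse genes competing for the same rat gene. A naming scheme that trusted the pairs alone would hide exactly the paralogy discussed in the text.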

Results and Discussion

Not all Cyp gene clusters are disordered between mouse and rat. For example, the Cyp4f gene cluster has nine genes in both species and there is a clean 1:1 mapping between orthologous pairs (Table 1). In fact, there are 33 such pairs in the Cyp gene clusters (Table 1); two of these pairs involve matches to pseudogenes in the other species. After these 64 pairs are subtracted and a correction is made in the count for pseudogenes, there are still 40 mouse genes remaining to pair with 24 rat genes. These genes either have no orthologues (paired with an 'x' in Table 1) or they are in paralogous gene sets (shaded and boldened in Table 1).
Table 1

Orthology between mouse and rat Cyp genes

Mouse

Rat

Mouse

Rat

Mouse

Rat

1a1

1a1

2e1

2e1

4f13

4f6

1a2

1a2

2f2

2f4

4f14

4f1

1b1

1b1

2g1

2g1

4f15

4f4

   

4f16

4f5

 

2a4

x

2j5

2j5-ps

4f17

4f17

2a5

2a3

2j6

2j4

4f18

4f18

2a12

2a2

2j7

 

4f37

4f37

2a22

2a1

2j8

2j16

4f39

4f39

  

2j11

 

4f40

4f40

2b9

2b3

2j9

2j3

  

2b10

2b1

2j12

2j10

4v3

4v?

2b13

2b2

2j13

2j13

4x1

4x1

 

2b12

    

2b19

2b15

2r1

2r1

5a1

5a1

 

2b31

2s1

2s1

7a1

7a1

2b23

2b21

2t4

2t1

7b1

7b1

  

2u1

2u1

8a1

8a1

2c37

x

2w1

2w1

8b1

8b1

2c29

 

2ab1

2ab1

11a1

11a1

2c38

2c6

2ac1-ps

2ac1

11b1

11b1

2c39

2c7

  

11b2

11b2

2c44

2c23

3a11

 

x

11b3

2c50

x

3a16

3a1/3a23

17a1

17a1

2c52-ps

2c11

3a41

3a2

19a1

19a1

2c54

x

3a44

3a73

20a1

20a1

2c55

2c24

3a13

3a9

21a1

21a1

 

2c80

3a25

 

24a1

24a1

  

3a57

3a18

26a1

26a1

2c65

 

3a59

 

26b1

26b1

2c66

2c78

x

3a62

26c1

26c1

2c70

2c22

  

27a1

27a1

2c40

 

4a12a

 

27b1

27b1

2c67

2c12

4a12b

4a8

39a1

39a1

2c68

2c13

  

46a1

46a1

2c69

 

4a29

x

51a1

51a1

  

4a30b

x

  

2d9

     

2d10

 

4a14

4a2

  

2d11

2d1

 

4a3

  

2d12

2d5

    

2d34

 

4a10

   

2d22

2d4

4a31

4a1

  

2d26

2d2

4a32

   

2d13

     

2d40

2d3

4b1

4b1

  

Paralogues are boldened

The discrepancy is explained by extra duplications, mostly in the mouse, although in at least some cases there is duplication in the rat that is not seen in the mouse. Figures 1 and 2 illustrate a comparison of two such gene clusters between rat and mouse in detail. The Cyp2d cluster shows five genes and two pseudogenes in the rat. The mouse has nine genes and 17 pseudogenes. Cyp2d5 and Cyp2d1 in the rat are most similar by BLAST searches to the five mouse genes -- Cyp2d11, Cyp2d10, Cyp2d9, Cyp2d12 and Cyp2d34 -- that are boxed in the mouse cluster; these represent paralogous sets of genes. Mouse Cyp2d13 and Cyp2d40 are almost equally similar to Cyp2d3 in the rat. In between these genes there are six pseudogenes. This whole cluster of genes and pseudogenes may have been derived from a Cyp2d3-like ancestor that expanded in the mouse. Of course, more complicated scenarios are also possible.
Figure 1

The rat and mouse Cyp2d locus. Expansion in the mouse leads to non-orthologous relationships between these genes. These rat genes have had official CYP names for more than 15 years; in fact, they were the first five genes in the Cyp2d subfamily to be identified. For more details on rat P450 nomenclature, see http://drnelson.utmem.edu/cytochromeP450.html. Abbreviation: bp, base pairs.

Figure 2

The Cyp4abx locus in the rat and the mouse. There has been differential expansion in both species. Strict orthology does not exist, except on the outer edges of the cluster. This is often a feature of genes in gene clusters: the edges of the cluster are more likely to be conserved.

Figure 2 shows the Cyp4abx clusters. Notice how the rat Cyp4a1 gene has given rise to three Cyp4a genes in the mouse. By contrast, mouse Cyp4a14 has duplicated, making Cyp4a2 and Cyp4a3 in the rat, based on BLAST similarities. The mouse cluster is further complicated by an approximately 100 kilobase duplication involving the Cyp4a12 and Cyp4a30 genes. This did not happen in the rat, and there does not seem to be a Cyp4a30 equivalent in that animal -- unless it might be the rat Cyp4a33-ps pseudogene. There are seven Cyp gene clusters in the rat, some being even more complex than those described for the Cyp2d and Cyp4abx clusters.

The mouse and rat Cyp genes chosen as the example in this paper are by no means the only gene sets that will have this problem. In the 5th December, 2002 issue of Nature, in which the mouse genome was reported [10], Table 11 (p. 542) shows the top 50 InterPro domain families in mouse compared with those in human, fish, worm and fly. Cytochrome P450 is ranked 46th in the mouse and 52nd in human. The 45 other families that are more abundant than Cyp in the mouse will potentially have similar nomenclature issues. Fortunately, some of these groups (e.g. the homeobox genes) have a firmly established nomenclature and will not be renamed. It is not so clear what confusion will descend on the ATPase, kinase, zinc finger protein and many other gene families.

The point made by these figures and tables is that naming genes cannot be an automatic process, unless one wishes to create confusion. Best reciprocal BLAST hits can be used in assigning names, but they should not be used indiscriminately -- if they are, the result presented in Figure 3 might occur. In fact, it has occurred. Figure 3 is a screenshot of the University of California Santa Cruz (UCSC) browser showing the rat Cyp2d cluster, with its five genes. Note that these genes are named Cyp2d22, Cyp2d10, Cyp2d9, Cyp2d13 and Cyp2d26. From Figure 1, it can be seen that these rat genes had already been named Cyp2d4, Cyp2d5, Cyp2d1, Cyp2d3 and Cyp2d2. These names were assigned between 1987 and 1989 by the Committee on Standardized Cytochrome P450 Nomenclature and are official names used in many dozens or hundreds of publications. The two outer names, Cyp2d22 and Cyp2d26 (Figure 3), belong to mouse genes that are, in fact, orthologues of rat Cyp2d4 and Cyp2d2 (Table 1), but the three names in between -- Cyp2d10, Cyp2d9 and Cyp2d13 -- do not form orthologous pairs with the rat genes they label. Thus, rat Cyp genes that already have official names have been renamed to match seemingly orthologous mouse Cyp genes. On other views in the UCSC browser, rat Cyp genes that already have official names have been renamed for human CYP genes that are not their orthologues. These names are wrong, yet because they appear in the GenBank database they will probably be used by companies making microarrays and by genome browsers such as UCSC and Ensembl. This is a very unfortunate practice that may require considerable effort to correct.
Figure 3

The rat Cyp2d locus, as shown by the University of California Santa Cruz Genome Browser. Note that the gene nomenclature being used follows the existing mouse gene nomenclature, which is incorrect. Mouse Cyp2d22 is actually the orthologue of rat Cyp2d4. Mouse Cyp2d26 is actually the orthologue of rat Cyp2d2.
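One way to avoid the situation shown in Figure 3 is to let an existing official name always take precedence over a name proposed from a BLAST match, and to report every disagreement for curator review. The following Python sketch illustrates that rule only; it is not the procedure of any nomenclature committee or database. The function name and the locus identifiers (`locus_1` to `locus_5`) are hypothetical, and pairing the two name lists position by position follows the order in which the text lists them, which is an assumption for the three middle genes.

```python
def assign_names(loci, official_names, proposed_names):
    """Choose one name per locus without ever overriding an official name.

    Returns (names, conflicts). `names` maps locus -> chosen name.
    `conflicts` lists (locus, official, proposed) tuples for loci where the
    BLAST-derived proposal disagrees with the existing official name.
    """
    names, conflicts = {}, []
    for locus in loci:
        official = official_names.get(locus)
        proposed = proposed_names.get(locus)
        if official is not None:
            names[locus] = official
            if proposed is not None and proposed != official:
                conflicts.append((locus, official, proposed))
        elif proposed is not None:
            # No official name yet: accept the proposal provisionally.
            names[locus] = proposed
    return names, conflicts


# Hypothetical locus identifiers; gene names are those quoted in the text.
loci = ["locus_1", "locus_2", "locus_3", "locus_4", "locus_5"]
official = dict(zip(loci, ["Cyp2d4", "Cyp2d5", "Cyp2d1", "Cyp2d3", "Cyp2d2"]))
from_blast = dict(zip(loci, ["Cyp2d22", "Cyp2d10", "Cyp2d9", "Cyp2d13", "Cyp2d26"]))

names, conflicts = assign_names(loci, official, from_blast)
for locus, kept, rejected in conflicts:
    print(f"{locus}: keeping {kept}, rejecting proposed {rejected}")
```

With these inputs all five loci keep their official rat names and all five proposals are reported as conflicts. A locus with no official name would receive the proposed name provisionally, which is where best reciprocal BLAST hits remain useful.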

Conclusions

Gene nomenclature committees have been established to impose order on gene families and on whole genomes, to prevent duplication of names and multiple uses of the same root symbol, and to provide an authority that can be trusted. Ignoring existing naming systems in order to assign hundreds, or thousands, of names quickly to rat genes to match genes in other genomes will come at a price, and the price will be failed communication and widespread confusion. These problems are not so different from those that must occur when a carefully constructed language is corrupted.

Authors’ Affiliations

(1)
The University of Tennessee Health Sciences Center

References

  1. Rat Genome Sequencing Project Consortium: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature. 2004, 428: 493-521.
  2. Fujii-Kuriyama Y, Mizukami Y, Kawajiri K, et al: Primary structure of a cytochrome P-450: Coding nucleotide sequence of phenobarbital-inducible cytochrome P-450 cDNA from rat liver. Proc Natl Acad Sci USA. 1982, 79: 2793-2797. 10.1073/pnas.79.9.2793.
  3. Kimura S, Gonzalez FJ, Nebert DW: The murine Ah locus. Comparison of the complete cytochrome P1-450 and P3-450 cDNA nucleotide and amino acid sequences. J Biol Chem. 1984, 259: 10705-10713.
  4. Kimura S, Gonzalez FJ, Nebert DW: Mouse cytochrome P3-450: Complete cDNA and amino acid sequence. Nucleic Acids Res. 1984, 12: 2917-2928. 10.1093/nar/12.6.2917.
  5. Nebert DW, Adesnik M, Coon MJ, et al: The P450 gene superfamily: Recommended nomenclature. DNA. 1987, 6: 1-11. 10.1089/dna.1987.6.1.
  6. Nelson DR, Kamataki T, Waxman DJ, et al: The P450 superfamily: Update on new sequences, gene mapping, accession numbers, early trivial names of enzymes, and nomenclature. DNA Cell Biol. 1993, 12: 1-51. 10.1089/dna.1993.12.1.
  7. Nelson DR, Koymans L, Kamataki T, et al: P450 superfamily: Update on new sequences, gene mapping, accession numbers and nomenclature. Pharmacogenetics. 1996, 6: 1-42. 10.1097/00008571-199602000-00002.
  8. Nelson DR, Zeldin D, Hoffman S, et al: Comparison of cytochrome P450 (CYP) genes from the mouse and human genomes including nomenclature recommendations for genes, pseudogenes, and alternative-splice variants. Pharmacogenetics. 2004, 14: 1-18. 10.1097/00008571-200401000-00001.
  9. The Cytochrome P450 homepage, available at http://drnelson.utmem.edu/cytochromeP450.html (last accessed 30th March, 2005).
  10. Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.

Copyright

© Henry Stewart Publications 2005
