Gene nomenclature by default, or BLASTing to Babel

The current proliferation of mammalian genomes is creating a nomenclature issue, caused by naming genes according to their best BLAST hit to a gene in another annotated genome. The rat genome is relying heavily on the mouse genome for nomenclature, but not all rat genes have direct orthologues in the mouse; often, there are paralogous groups of genes, due to expansions of that gene subfamily in one or the other genome. Many of these genes have already been assigned names in the rat, so that renaming them based on BLAST scores leads to duplicate sets of names. The orthology implied by name sharing across genomes does not always exist. These inaccurate names are appearing in frequently used sites, such as the University of California Santa Cruz Genome Browser. The example of rat cytochrome P450 (Cyp) genes is presented here, but other gene families are also likely to be affected.


Introduction
The rat genome has been sequenced and assembled [1], creating a need for rat gene nomenclature. The obvious source of gene nomenclature for the rat would seem to be the mouse genome. Ideally, orthologues should have the same name. This logic has led to automated naming of rat genes, which causes problems of two kinds. First, the rat has long been an experimental animal. Genes from both rat and mouse were sequenced and named for nearly 20 years before the genomes themselves were sequenced. In the case of the cytochromes P450, the first mammalian sequences, Cyp2b1 and Cyp2b2, were determined in the rat [2]. The mouse sequences began to appear two years later, with Cyp1a1 and Cyp1a2 [3,4]. The established nomenclature for CYP genes has been in place since 1987 [5-7], and these names have been used in publications for several years. Because the names were assigned independently, mostly in chronological order, orthologues do not always carry the same name.
The second nomenclature problem arises from divergence over time between species' genomes. Here, mouse and rat will be discussed, but the same applies to other species pairs, such as human and rhesus monkey. When similar genes appear in gene clusters, the one-to-one relationship of the genes between mouse and rat is often broken, meaning that the orthology is broken. Compared with the 57 CYP genes of the human, the mouse has greatly expanded its set of Cyp genes to 102 full-length genes [8]; the rat has been a little more conservative, with 87 Cyp genes [9]. The solo genes in a mammalian CYP subfamily (those that occur without related neighbours) are strict orthologues, and so nomenclature by best reciprocal BLAST hit between mouse and rat is a viable strategy for them. This works for 31 mouse-rat gene pairs and one pseudogene. Eighty-seven rat genes cannot be matched up to 102 mouse genes as orthologue pairs, however, and this nomenclature method fails in the gene clusters.
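The best-reciprocal-hit criterion mentioned above can be sketched in a few lines. This is a minimal illustration, not the pipeline used by any genome browser: the gene names are placeholders and the best-hit tables are invented for the example, not real BLAST output.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


def unpaired(genes, pairs, side):
    """Return genes on one side of the pairs (0 or 1) that have no partner."""
    paired = {pair[side] for pair in pairs}
    return sorted(g for g in genes if g not in paired)


# Hypothetical best-hit tables (placeholder names, not real data).
# One solo gene per genome, plus a cluster with 2 rat and 3 mouse genes.
rat_to_mouse = {
    "rat_solo": "mouse_solo",
    "rat_c1": "mouse_d1",
    "rat_c2": "mouse_d1",
}
mouse_to_rat = {
    "mouse_solo": "rat_solo",
    "mouse_d1": "rat_c1",
    "mouse_d2": "rat_c1",
    "mouse_d3": "rat_c2",
}

pairs = reciprocal_best_hits(rat_to_mouse, mouse_to_rat)
rat_left = unpaired(rat_to_mouse, pairs, 0)
mouse_left = unpaired(mouse_to_rat, pairs, 1)

print("reciprocal pairs:", pairs)
print("rat genes without a partner:", rat_left)
print("mouse genes without a partner:", mouse_left)
```

In this toy input the solo genes pair cleanly. Inside the cluster, one pair still passes the reciprocal test even though it sits in a paralogous set, and the remaining genes are left without partners. Copying names across such a pair would therefore suggest an orthology that the data do not support.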

Results and Discussion
Not all Cyp gene clusters are disordered between mouse and rat. For example, the Cyp4f gene cluster has nine genes in both species and there is a clean 1:1 mapping between orthologous pairs (Table 1). In fact, there are 33 such pairs in the Cyp gene clusters (Table 1); two of these pairs involve matches to pseudogenes in the other species. After these 64 pairs (31 solo pairs plus 33 cluster pairs) are subtracted and a correction is made in the count for pseudogenes, there are still 40 mouse genes remaining to pair with 24 rat genes. These genes either have no orthologues (paired with an 'x' in Table 1) or they are in paralogous gene sets (shaded and in bold in Table 1).
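The bookkeeping in this paragraph can be checked with a short calculation. All gene and pair counts are taken from the text. The size of the pseudogene correction is not stated in the paper, so the values computed here are only what the reported totals imply.

```python
# Counts stated in the text.
MOUSE_GENES = 102        # full-length mouse Cyp genes
RAT_GENES = 87           # rat Cyp genes
SOLO_PAIRS = 31          # orthologue pairs among solo genes
CLUSTER_PAIRS = 33       # orthologue pairs inside gene clusters
MOUSE_REMAINING_REPORTED = 40
RAT_REMAINING_REPORTED = 24

paired = SOLO_PAIRS + CLUSTER_PAIRS

# Genes left over before any pseudogene correction.
mouse_uncorrected = MOUSE_GENES - paired
rat_uncorrected = RAT_GENES - paired

# Correction implied by the reported totals (an inference, not a stated figure).
mouse_correction = MOUSE_REMAINING_REPORTED - mouse_uncorrected
rat_correction = RAT_REMAINING_REPORTED - rat_uncorrected

print("pairs subtracted:", paired)
print("mouse: uncorrected", mouse_uncorrected, "implied correction", mouse_correction)
print("rat: uncorrected", rat_uncorrected, "implied correction", rat_correction)
```

The small positive corrections are consistent with some of the subtracted pairs matching a gene in one species to a pseudogene in the other, so that pair does not use up a full-length gene on the pseudogene side.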
The discrepancy is explained by extra duplications, mostly in the mouse, although in at least some cases there is a duplication in the rat that is not seen in the mouse.

Table 1. Orthology between mouse and rat Cyp genes (columns: Mouse, Rat). Paralogues are in bold and shaded yellow.

Figure 1. The rat and mouse Cyp2d locus. Expansion in the mouse leads to non-orthologous relationships between these genes. These rat genes have had official CYP names for more than 15 years; in fact, they were the first five genes in the Cyp2d subfamily to be identified. For more details on rat P450 nomenclature, see http://drnelson.utmem.edu/cytochromeP450.html. Abbreviation: bp, base pairs.

The Cyp2d cluster (Figure 1) has five genes and two pseudogenes in the rat. The mouse has nine genes and 17 pseudogenes. Cyp2d5 and Cyp2d1 in the rat are most similar by BLAST searches to the five mouse genes (Cyp2d11, Cyp2d10, Cyp2d9, Cyp2d12 and Cyp2d34) that are boxed in the mouse cluster; these represent paralogous sets of genes. Mouse Cyp2d13 and Cyp2d40 are almost equally similar to Cyp2d3 in the rat. In between these genes there are six pseudogenes. This whole cluster of genes and pseudogenes may have been derived from a Cyp2d3-like ancestor that expanded in the mouse. Of course, more complicated scenarios are also possible.
Figure 2 shows the Cyp4abx clusters. Notice how the rat Cyp4a1 gene has given rise to three Cyp4a genes in the mouse. By contrast, mouse Cyp4a14 has duplicated, making Cyp4a2 and Cyp4a3 in the rat, based on BLAST similarities. The mouse cluster is further complicated by an approximately 100 kilobase duplication involving the Cyp4a12 and Cyp4a30 genes. This did not happen in the rat, and there does not seem to be a Cyp4a30 equivalent in that animal, unless it is the rat Cyp4a33-ps pseudogene. There are seven Cyp gene clusters in the rat, some even more complex than the Cyp2d and Cyp4abx clusters described here.
The mouse and rat Cyp genes chosen as the example in this paper are by no means the only gene sets that will have this problem. In the 5th December, 2002 issue of Nature, in which the mouse genome was reported [10], Table 11 (p. 542) shows the top 50 InterPro domain families in the mouse compared with those in human, fish, worm and fly. Cytochrome P450 is ranked 46th in the mouse and 52nd in human. The 45 other families that are more abundant than Cyp in the mouse will potentially have similar nomenclature issues. Fortunately, some of these groups (eg the homeobox genes) have a firmly established nomenclature and will not be renamed. It is not so clear what confusion will descend on the ATPase, kinase, zinc finger protein and the many other gene families.
The point made by these figures and tables is that naming genes cannot be an automatic process, unless one wishes to create confusion. Best reciprocal BLAST hits can be used in assigning names, but they should not be used indiscriminately; if they are, the result presented in Figure 3 might occur. In fact, it has occurred. Figure 3 is a screenshot of the University of California Santa Cruz (UCSC) browser showing the rat Cyp2d cluster, with its five genes. Note that these genes are named Cyp2d22, Cyp2d10, Cyp2d9, Cyp2d13 and Cyp2d26. From Figure 1, it can be seen that these rat genes had already been named Cyp2d4, Cyp2d5, Cyp2d1, Cyp2d3 and Cyp2d2. These names were assigned between 1987 and 1989 by the Committee on Standardized Cytochrome P450 Nomenclature and are official names used in many dozens or hundreds of publications. The names given to the two outer genes, 'rat Cyp2d22' and 'rat Cyp2d26' (Figure 3), do belong to the mouse orthologues of rat Cyp2d4 and Cyp2d2 (Table 1), but the names given to the three rat genes in between (Cyp2d10, Cyp2d9 and Cyp2d13) do not correspond to orthologous pairs. Thus, rat Cyp genes that already have official names have been renamed to match seemingly orthologous mouse Cyp genes. On other views in the UCSC browser, rat Cyp genes that already have official names have been renamed for human CYP genes that are not their orthologues. These names are wrong, yet because they appear in the GenBank database they will probably be used by companies making microarrays and by genome browsers such as UCSC and Ensembl. This is a very unfortunate practice that may require considerable effort to correct.

Figure 2. The Cyp4abx locus in the rat and the mouse. There has been differential expansion in both species. Strict orthology does not exist, except on the outer edges of the cluster. This is often a feature of genes in gene clusters: the edges of the cluster are more likely to be conserved.
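The kind of check that would catch the Cyp2d renaming can be sketched as follows. This is a hypothetical illustration, not a description of how any browser or database assigns names. It assumes that the two name lists in the text are given in the same gene order, and it uses only the two orthologue relationships stated in the text.

```python
# Official rat names and the names shown in the browser view, assumed to be
# listed in the same gene order in the text.
official_rat_names = ["Cyp2d4", "Cyp2d5", "Cyp2d1", "Cyp2d3", "Cyp2d2"]
automated_names = ["Cyp2d22", "Cyp2d10", "Cyp2d9", "Cyp2d13", "Cyp2d26"]

# Mouse orthologues of rat genes, as stated in the text (outer genes only).
mouse_orthologue_of = {"Cyp2d4": "Cyp2d22", "Cyp2d2": "Cyp2d26"}


def review_names(official, automated, orthologue_of):
    """Compare each automated name with the official name of the same gene.

    The official name is always kept; the status records whether the
    automated name at least belongs to a known orthologue.
    """
    report = []
    for official_name, automated_name in zip(official, automated):
        if orthologue_of.get(official_name) == automated_name:
            status = "orthologue"
        else:
            status = "not an orthologue"
        report.append((official_name, automated_name, status))
    return report


report = review_names(official_rat_names, automated_names, mouse_orthologue_of)
for official_name, automated_name, status in report:
    print(f"{official_name}: automated name {automated_name} ({status})")

mismatches = [row for row in report if row[2] == "not an orthologue"]
```

Under these assumptions, the two outer genes are flagged as orthologues and the three inner genes are flagged as mismatches. In all five cases the existing official name takes precedence over the automated one.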

Conclusions
Gene nomenclature committees have been established to impose order on gene families and on whole genomes, to prevent duplication of names and multiple uses of the same root symbol, and to provide an authority that can be trusted. Ignoring the existence of naming systems in order to assign hundreds, or thousands, of names quickly to rat genes to match genes in other genomes will come at a price, and the price will be failed communication and widespread confusion. These problems are not so different from those that must occur when a carefully constructed language is corrupted.