'Frankenstein genes', or the Mad Magazine version of the human pseudogenome
© Henry Stewart Publications 2004
Received: 27 February 2004
Accepted: 27 February 2004
Published: 1 May 2004
Annotation of the human genome is inching forward. Seven human chromosomes have now been fully annotated, covering 17 per cent of the genome, and at least one chromosome has been re-annotated. The enormity of the task forces a dependence on automated tools for detecting and assembling the genes, followed by hand curation to correct errors and polish the gene models. The accuracy of gene prediction algorithms is very good for internal exons from intact genes, but these programs do peculiar and exasperating things to pseudogenes. These programs can actually resurrect pseudogenes from the dead, making them into viable gene models for intact proteins, albeit science-fictional proteins. This process is demonstrated for four human pseudogenes from the cytochrome P450 family and one putatively functional P450 gene, CYP2U1, having a non-consensus intron boundary. These examples are offered as a call-to-arms to improve pseudogene prediction as an art in itself, and not as a by-product of gene annotation. Failure to do so will flood the databases with thousands of false-positive predictions. Indeed, they are already there.
Keywords: pseudogenes, pseudogene prediction, genome annotation, cytochrome P450 (CYP) genes, GNOMON
The assembly of the human genome from whole-genome-shotgun sequence reads was once deemed impossible, and the probable outcome was dubbed the 'MAD Magazine version of the human genome'. This may have referred to MAD's fold-in back cover, where a drawing and the caption under it are folded twice to reveal a new picture and a new caption to go with it. This fold-in back cover is rather like alternative splicing; more likely, though, the intent was to imply a crazy-quilt genome structure produced by the incorrect joining of the millions of repeat sequences in the human genome. In spite of these concerns, this did not happen. Paired-end sequencing of three sizes of clones largely avoided the misassembly problem and resulted in a faithful rendition of the genome, now with only 498 contigs in Build 34 version 2. The spectre of the MAD Magazine version of the genome is coming back in a new guise, however: the corrupt annotation of pseudogenes.
Programs designed to scan genomic DNA for genes are trained to look for GT and AG intron boundaries and some consensus regions around these boundaries. They also evaluate statistical properties of exons to try to assemble a gene. When they encounter a pseudogene that is nearly intact, they try mightily to make it code for an intact protein. The outcome is a Frankenstein gene that never existed in reality, but was cobbled together from spare parts, beginning with the pseudogene carcass. Although it is thought that the number of pseudogenes in the human genome is 20,000 or more, these genome features are not annotated as completely as intact genes. LocusLink has 21,382 protein-coding genes, but only 2,592 pseudogenes are listed (as of 13th February, 2004). Complete annotation of the human genome must also include identification and naming of all the pseudogenes.
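The boundary check described above can be illustrated with a minimal sketch. It is not the logic of Genscan, Ensembl or GNOMON; the function name `classify_intron` and the toy sequences are hypothetical, and the only rule assumed is that an intron starts with GT (or, less commonly, GC) and ends with AG.

```python
# Illustrative sketch only; real gene finders score consensus regions and
# exon statistics, not just the two terminal dinucleotides.
CANONICAL_DONOR = "GT"
NONCONSENSUS_DONOR = "GC"  # the kind of boundary the text reports for CYP2U1
ACCEPTOR = "AG"


def classify_intron(intron: str) -> str:
    """Classify a candidate intron by its first and last two bases."""
    seq = intron.upper()
    if len(seq) < 4:
        return "too short"
    donor, acceptor = seq[:2], seq[-2:]
    if acceptor != ACCEPTOR:
        return "unrecognised"
    if donor == CANONICAL_DONOR:
        return "GT-AG"
    if donor == NONCONSENSUS_DONOR:
        return "GC-AG"
    return "unrecognised"


# Hypothetical toy introns, not real CYP sequence.
for name, seq in {"toy1": "GTAAGTCCAG", "toy2": "GCAAGTTTAG"}.items():
    print(name, classify_intron(seq))
```

A predictor that accepts only the first class will skip a GC-AG intron and search for the nearest GT instead, which is one way a real exon boundary can be displaced.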
Stumbling on a GC boundary
Taking a lesson from the re-annotation of Drosophila
A recent re-annotation effort on the Drosophila melanogaster genome predicted approximately 2,000 more genes than are present in the current Berkeley Drosophila Genome Project (BDGP) annotations [9, 10]. This represents an increase of almost 15 per cent. This result was obtained by relaxing the stringency of the FGENESH prediction algorithm; 7,464 gene models were predicted above and beyond the BDGP annotations. Many of these gene models will be pseudogenes or false-positive predictions. Validation by spotting the gene model exon sequences on a microarray and probing for mRNA sequences that bound to the array did give support for a large percentage of these predictions. Further analysis by reverse transcriptase polymerase chain reaction (RT-PCR) of a subset of the microarray-positive models confirmed that many of these genes are expressed and need to be added to the official gene count.
These findings are important to human genome annotators for two reasons. First, the number of weakly supported or unsupported gene models was large, but a significant fraction of these weak predictions were verified by microarray and RT-PCR experiments. Some of these genes had no BLAST hits in GenBank and so represent new protein families; this argues that the conservative estimates (approximately 25,000 genes) used by some human annotators are missing many real genes. Secondly, it is perilous not to identify and properly annotate the pseudogenes. There are an estimated 20,000 pseudogenes in the human genome and this number may be much higher [3, 5]. The collection of these defective genes may be called the human pseudogenome. As discussed above, what comes out of this set, after Genscan or Ensembl gets through with it, is pretty frightening. These are vivisected genes raised from the dead, as sure as Shelley's monster.
The human genome is large and it contains large genes. The recent annotation of chromosome 6 illustrates this point [12, 13]. The BPAG1 gene contains 101 exons. Another chromosome 6 gene, TCBA1, contains an intron that is 479 kilobases long. ZNF451, a zinc finger protein, contains a single exon of 9,114 bp. These are all potential challenges for gene prediction programs. Unfortunately, prediction algorithms are not written to cope with pseudogenes, sequence errors or GC boundaries, giving rise to the results shown in Figures 1-5. Even the detailed manual annotation of chromosome 6 found only 633 pseudogenes, as compared with 1,557 genes. Other evidence suggests that the number of human pseudogenes should be closer to the number of intact genes. Part of this problem lies in the definition of pseudogenes. Perhaps the working definition for the chromosome 6 annotation did not include some small solo exons or detritus exons scattered near documented genes. It is possible, however, that there might still be about 900 undocumented pseudogenes on chromosome 6.
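The figure of about 900 follows from the two chromosome 6 counts, if one assumes roughly one pseudogene per intact gene. The short calculation below only restates that arithmetic; the 1:1 ratio is the assumption suggested in the text, not a measured value.

```python
# Counts reported in the text for the manual annotation of chromosome 6.
genes = 1557
pseudogenes = 633

# Observed pseudogene-to-gene ratio on chromosome 6.
ratio = pseudogenes / genes

# Shortfall if pseudogenes were as numerous as intact genes (assumed 1:1).
shortfall = genes - pseudogenes

print(f"pseudogenes per gene: {ratio:.2f}")
print(f"implied undocumented pseudogenes: {shortfall}")
```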
The Genscan program is quite successful in predicting genes, especially the internal exons of real genes. It has been optimised for different G + C compositions and different species, but not for pseudogenes. In order to be able to annotate all of the human genome and other genomes, there is a need for a second program, which is optimised for detecting pseudogenes. These pseudogenes need to be named and documented on the sequence records and genome browsers in order to avoid the aforementioned problem of creating Frankenstein genes.
One program that may fulfil some of these criteria is GNOMON, the National Center for Biotechnology Information's Hidden Markov Model ab initio prediction program. BLAST searches against the GNOMON predictions can be performed from MapViewer. This program assumes that very close exon predictions (less than 50 bp apart) are separated by a frameshift, because very short introns are rare (at least in vertebrates). A frameshift is introduced to merge the two exons into one. The Hidden Markov Model used by GNOMON allows non-consensus splice sites such as the GC boundary in CYP2U1, and the CYP2U1 gene is correctly predicted by GNOMON. Stop codons in the middle of an otherwise strong alignment are disregarded and the stops are included in the model, which is treated as a pseudogene rather than as a gene. These features ought to allow the correct prediction of the pseudogenes in Figures 1-4. In practice, even GNOMON makes a few errors. For example, all amino acids of CYP2G2P are correctly predicted, but 20 extra amino acids are added between exons 2 and 3. The CYP2AB1P sequence predicted by GNOMON has 69 more correctly predicted amino acids than the Genscan prediction, and includes two of the in-frame stops, but it misses exon 7 and adds the same cryptic exon as Genscan did after exon 5. GNOMON finds most of the C-terminal part of CYP51P1, but misses the first two-thirds. CYP51P2 coverage is improved, but both N- and C-terminal regions are missed, as well as two internal segments. In short, GNOMON is better than Genscan for these five genes but still falls short of the gold standard of hand curation.
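Two of the rules described for GNOMON can be sketched in a few lines: merging exon predictions separated by less than 50 bp, and labelling a model with internal in-frame stops as a pseudogene. This is a hedged illustration, not GNOMON's implementation; the exon representation, the function names and the toy sequences are all hypothetical, and only the 50 bp threshold comes from the text.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based, end-exclusive; hypothetical layout
MIN_INTRON = 50         # from the text: gaps under 50 bp are treated as frameshifts
STOPS = {"TAA", "TAG", "TGA"}


def merge_close_exons(exons: List[Exon], min_intron: int = MIN_INTRON) -> List[Exon]:
    """Merge neighbouring exons whose gap is shorter than min_intron."""
    merged: List[Exon] = []
    for start, end in sorted(exons):
        if merged and start - merged[-1][1] < min_intron:
            # Gap too short to be a plausible intron: join with the previous exon.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def internal_stops(cds: str) -> List[int]:
    """Return codon indices of in-frame stops, ignoring the final codon."""
    seq = cds.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOPS]


def call_model(cds: str) -> str:
    """Label a model 'pseudogene' if it carries an internal in-frame stop."""
    return "pseudogene" if internal_stops(cds) else "gene"


# Toy data, not real CYP sequence or coordinates.
print(merge_close_exons([(0, 100), (130, 200), (400, 500)]))
print(call_model("ATGTAAGGGTGA"), call_model("ATGGGGTGA"))
```

Keeping the stop in the model and changing the label, rather than splicing around the stop, is what separates a pseudogene call from a Frankenstein gene in this sketch.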
Finally, once pseudogenes are correctly assembled, nomenclature groups must devise a suitable nomenclature for them. This will not be easy. Pseudogenes are a heterogeneous set. Our own efforts at CYP pseudogene nomenclature identify at least four types of pseudogenes. This nomenclature could be applied to other gene families, or other proposals could be made, but this problem needs to be addressed. In the future, progress in genome annotation should make it increasingly hard to find the misleading and erroneous peptide predictions that currently abound, as these are replaced by accurately annotated and named pseudogene models.
- Anon: 'The book of life: Twin efforts will attempt to write it'. USA Today. 1998, Arlington, 06D.
- National Center for Biotechnology Information: 'Statistics: Genome annotation'. 2003, Human genome Build 34 version 2, from data freeze of 19th November, (cited 8th March, 2004), [http://www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid%20=%209606%26build%20=%2034%26ve%20=%202]
- Harrison PM, Hegyi H, Balasubramanian S, et al: 'Molecular fossils in the human genome: Identification and analysis of the pseudogenes in chromosomes 21 and 22'. Genome Res. 2002, 12: 272-280. 10.1101/gr.207102.
- National Center for Biotechnology Information (NCBI): LocusLink/Statistics. 2004, (cited 18th February, 2004), [http://www.ncbi.nlm.nih.gov/LocusLink/statistics.html]
- Nelson DR, Zeldin DC, Hoffman SMG, et al: 'Comparison of cytochrome P450 (CYP) genes from the mouse and human genomes, including nomenclature recommendations for genes, pseudogenes, and alternative-splice variants'. Pharmacogenetics. 2004, 14: 1-18. 10.1097/00008571-200401000-00001.
- Nelson DR: 'Human P450s in FASTA format'. 2003, (cited 8th March, 2004), [http://drnelson.utmem.edu/human.blast.file.html]
- Kent WJ: UCSC Genome Browser, July, Human genome assembly. 2003, (cited 18th February, 2004), [http://genome.ucsc.edu/]
- Nelson DR: 'Drosophila pseudoobscura P450s'. 2004, (cited 18th February, 2004), [http://drnelson.utmem.edu/pseudoobscura.html]
- Hild M, Beckmann B, Haas SA, et al: 'An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome'. Genome Biol. 2003, 5: R3. 10.1186/gb-2003-5-1-r3. [http://genomebiology.com/2003/5/I/R3]
- Oliver B, Leblanc B: 'How many genes in a genome?'. Genome Biol. 2003, 5: 204. 10.1186/gb-2003-5-1-204. [http://genomebiology.com/mkt/5114/2003/5/1/204]
- Flicek P, Keibler E, Hu P: 'Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map'. Genome Res. 2003, 13: 46-54. 10.1101/gr.830003.
- Mungall AJ, Palmer SA, Sims SK, et al: 'The DNA sequence and analysis of human chromosome 6'. Nature. 2003, 425: 805-811. 10.1038/nature02055.
- Grimwood J, Schmutz J: 'Six is seventh'. Nature. 2003, 425: 775-776. 10.1038/425775a.
- Burge C: The New GENSCAN Web Server at MIT, modified. 2003, (cited 18th February, 2004), [http://genes.mit.edu/GENSCAN.html]
- National Center for Biotechnology Information: Gnomon description. 2003, (cited 8 March, 2004), [http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.html]