Update on genome completion and annotations: Protein Information Resource
© Henry Stewart Publications 2004
Received: 11 January 2004
Accepted: 11 January 2004
Published: 1 March 2004
The Protein Information Resource (PIR) recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt -- the Universal Protein Resource -- which now unifies the PIR, Swiss-Prot and TrEMBL databases. The PIRSF (SuperFamily) classification system is central to the PIR/UniProt functional annotation of proteins, providing classifications of whole proteins into a network structure to reflect their evolutionary relationships. Data integration and associative studies of protein family, function and structure are supported by the iProClass database, which offers value-added descriptions of all UniProt proteins with highly informative links to more than 50 other databases. The PIR system allows consistent, rich and accurate protein annotation for all investigators.
The high-throughput genome projects have resulted in a rapid accumulation of genome sequences for a large number of organisms. Meanwhile, researchers have begun to tackle gene function and other complex regulatory processes by studying organisms at the global scale for various levels of biological organisation. To fully exploit the value of the data, bioinformatics infrastructures are urgently needed to identify proteins encoded by these genomes and to understand how these proteins function in making up a living cell.
PIR, then and now
For more than three decades, PIR has made many protein databases and analysis tools freely accessible to the scientific community. These include the Protein Sequence Database (PSD) -- the first international protein database -- which grew out of the Atlas of Protein Sequence and Structure, edited by Margaret Dayhoff [1965-1978], a pioneer in molecular evolution research. As a core resource, the PIR environment is widely used by researchers to develop other bioinformatics infrastructures and algorithms and to enable basic and applied scientific research.
The current version of iProClass (January 2004) consists of more than 1,232,000 non-redundant proteins (from PIR-PSD, Swiss-Prot and TrEMBL) organised into more than 36,290 PIR superfamilies, 145,340 families, 5,720 Pfam and PIR homology domains, 1,300 PROSITE/ProClass motifs, 280 RESID post-translational modification sites, 550,000 FASTA similarity clusters and links to more than 50 molecular biology databases. iProClass cross-references include: databases for protein families (eg COG, InterPro); functions and pathways (eg KEGG, WIT); protein-protein interactions (eg DIP); structures and structural classifications (eg PDB, SCOP, CATH, PDBSum); genes and genomes (eg TIGR, OMIM); ontologies (eg Gene Ontology); literature (NCBI PubMed); and taxonomy (NCBI Taxonomy).
Protein family classification and integrative associative analysis for functional annotation*
A. Functional inference of uncharacterised hypothetical proteins
- TIM-barrel signal transduction protein
- ATPase with chaperone activity and inactive LON protease domain
- Lipid carrier protein
- [Ni, Fe]-hydrogenase-3-type complex, membrane protein EhaA
B. Correction, or improvement, of genome annotations
- Ligand-binding protein with an ACT domain
- Inactive homologue of metal-dependent protease
- Glycyl radical cofactor protein YfiD
- Chemotaxis response regulator methylesterase CheB
- Thioesterase, type II
- Bifunctional tetrapyrrole methylase and MazG NTPase
C. Enhanced understanding of structure, function and evolutionary relationships
- Chorismate mutase, AroH class
- Chorismate mutase, AroQ class, prokaryotic type
Organisational levels of protein groups
PIR has three organisational levels of protein groups -- namely, protein sequence entry, homeomorphic superfamily/family/subfamily and domain superfamily.
Protein sequence entries
A UniProt protein sequence entry represents the unprocessed normal product of a gene (or, sometimes, of very closely-related genes) from a single species. (A number of Swiss-Prot entries still contain identical sequences from different species, which will be unmerged in future releases.) The mature protein chain and its modifications are detailed in the feature table. To the extent that this is practical, UniProt aims to have one entry for each genetic locus that encodes protein. When the sequence variation is more extensive than can be conveniently represented within the entry, however, additional entries may be constructed for splice variants, allelic variants and strain variants. The source data from which entries are constructed include entries that each represent a single sequence report, either published or deposited in a databank. Often, such reports will need to be 'merged' with other reports representing the same protein sequence. The UniProt staff attempt to identify these cases and perform the required merges.
For the purposes of standardising annotation, database entries are organised into families of closely-related sequences. These generally represent proteins with the same function in various organisms. The taxonomic distribution within a family depends on how well the structure and function of the protein are conserved. As a general guideline, sequences having more than 50 per cent sequence identity are usually similar in structure and function, and their major sequence features are unambiguously aligned by commonly-used multiple sequence alignment programs. Therefore, 50 per cent sequence identity is used by the database staff for the provisional clustering of proteins into families. This threshold is appropriate in many cases; however, some families may be repartitioned into more convenient clusters after PIR review.
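The 50 per cent guideline can be illustrated with a minimal Python sketch. It is not PIR's actual pipeline: the identity measure (matches divided by aligned, gap-free columns), the single-linkage grouping, the function names and the toy sequences are all assumptions made for illustration.

```python
from __future__ import annotations

from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes the two sequences are already aligned to the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def provisional_families(aligned: dict[str, str], threshold: float = 50.0) -> list[set[str]]:
    """Single-linkage grouping (an assumption): two entries are joined
    whenever their pairwise identity exceeds the threshold."""
    parent = {name: name for name in aligned}

    def find(n: str) -> str:
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for p, q in combinations(aligned, 2):
        if percent_identity(aligned[p], aligned[q]) > threshold:
            parent[find(p)] = find(q)

    groups: dict[str, set[str]] = {}
    for name in aligned:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical, pre-aligned toy sequences (not real database entries).
toy = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYLAKQR",
    "P3": "MKSAYLGKHR",
    "P4": "GHWLV-PNDE",
}
families = provisional_families(toy)
```

With these toy sequences, P1, P2 and P3 fall into one provisional family and P4 remains on its own. A curator could then repartition such clusters, as the text describes.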
The PIR superfamily concept, the original classification based on sequence similarity, has been used as a guiding principle to provide comprehensive and non-overlapping clustering of PIR protein sequences into a hierarchical order to reflect their evolutionary relationships. To facilitate sensible propagation and standardisation of protein annotation and systematic detection of annotation errors as part of the UniProt project, PIR has extended its hierarchical superfamily concept and developed the PIRSF system, a network classification system based on the evolutionary relationships of whole proteins.
Proteins are considered 'homeomorphic' if they share full-length sequence similarity and a common domain architecture, as indicated by the same type, number and order of defined domains. Length deviation may occur for alternative-splice and alternate-initiator variants, sequence fragments and peptides derived from proteolytic processing. Variation of the domain architecture may exist for repeating domains and/or auxiliary domains, which are often mobile and may easily be lost, acquired or functionally replaced during evolution. Classification based on whole proteins, rather than on the component domains, allows annotation of both generic biochemical and specific biological functions.
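The domain-architecture part of this definition can be sketched in Python as follows. This is an illustrative reading only: collapsing tandem repeats and ignoring a caller-supplied set of auxiliary domains are assumptions about how the permitted variation might be handled, the domain labels are hypothetical, and the full-length sequence similarity criterion is not modelled here.

```python
from itertools import groupby


def core_architecture(domains, auxiliary=frozenset()):
    """Return the ordered domain types of a protein, with auxiliary
    domains dropped and consecutive repeats collapsed to one copy."""
    kept = [d for d in domains if d not in auxiliary]
    return tuple(name for name, _ in groupby(kept))


def same_architecture(a, b, auxiliary=frozenset()):
    """True when two proteins share the same type and order of core domains."""
    return core_architecture(a, auxiliary) == core_architecture(b, auxiliary)


# Hypothetical domain labels, listed from N- to C-terminus.
print(same_architecture(["D1", "D2"], ["D1", "D2"]))              # same type, number, order
print(same_architecture(["D1", "D2"], ["D2", "D1"]))              # order differs
print(same_architecture(["R", "R", "R", "D1"], ["R", "D1"]))      # repeat count differs
print(same_architecture(["D1", "AUX"], ["D1"], auxiliary={"AUX"}))  # auxiliary domain ignored
```

Comparing ordered tuples rather than sets keeps the 'order of defined domains' requirement explicit; a stricter variant could also compare repeat counts.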
The network structure accommodates a flexible number of levels that reflect varying degrees of sequence conservation (superfamily, family and subfamily). The threshold values of sequence similarity may vary at each level, depending on the evolutionary rate in each group of proteins (ie the taxonomic distribution within a protein group depends on how well the structure and function of the protein are conserved). The network structure allows improved protein annotation, more accurate extraction of conserved functional residues, and classification of distantly-related orphan proteins. Homeomorphic families and subfamilies -- generally representing proteins with the same function in various organisms -- are suitable for propagating standardised protein names, position-specific features (such as functional sites) and keywords. Distantly-related homeomorphic families and orphan proteins sharing a common domain architecture may form a homeomorphic superfamily.
It is assumed, although in most cases this has not been investigated in detail, that the molecules in a homeomorphic superfamily share a common evolutionary history. Thus, it should be valid to construct an evolutionary tree from the members of a homeomorphic superfamily. If two groups of proteins with the same architecture or function are shown to have arrived at that structure independently (convergent evolution), they are appropriately separated into two homeomorphic superfamilies. For example, the cytochrome P450 (CYP) and nitric oxide synthase (NOS) families of enzymes both carry out 'P450-like' oxygenation reactions and at first were believed to be evolutionarily related. Upon further in-depth analysis, however, no evidence for an evolutionary relationship between the two gene superfamilies was found, so this is most likely a case of convergent evolution.
Many types of domains have been found in diverse proteins. In common use, for example, the term 'protein kinase superfamily' refers to the collection of all proteins that contain a protein kinase-like domain. PIR calls such a group a 'domain superfamily'. Any given protein sequence will be assigned to only one homeomorphic superfamily, but it may contain sequence segments belonging to several domain superfamilies.
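The one-to-one versus one-to-many relationship described here can be made concrete with a small Python data model. The class names, field names, accessions and coordinates below are hypothetical and are not taken from the PIR or UniProt schemas.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """A sequence segment assigned to a domain superfamily."""
    domain_superfamily: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


@dataclass
class ProteinEntry:
    accession: str
    homeomorphic_superfamily: str  # exactly one per protein
    segments: list = field(default_factory=list)  # zero or more domain assignments


def domain_superfamily_index(entries):
    """Map each domain superfamily to the accessions that contain it."""
    index = defaultdict(set)
    for entry in entries:
        for seg in entry.segments:
            index[seg.domain_superfamily].add(entry.accession)
    return dict(index)


# Hypothetical entries: two proteins in different homeomorphic superfamilies
# that both contain a protein kinase-like domain.
entries = [
    ProteinEntry("X0001", "SF_A", [Segment("protein kinase-like", 20, 280)]),
    ProteinEntry("X0002", "SF_B", [Segment("other domain", 5, 90),
                                   Segment("protein kinase-like", 120, 390)]),
]
index = domain_superfamily_index(entries)
```

Here `index["protein kinase-like"]` contains both accessions even though each protein belongs to a single, different homeomorphic superfamily, which mirrors the 'protein kinase superfamily' usage in the text.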
Recent directions for additional protein analyses and databases
With the new surge of interest in subcellular and intracellular signal transduction circuitry and 'systems biology', confirmed protein-protein interactions are being registered at the Human Protein Reference Database (HPRD; http://www.hprd.org). Another bioinformatics resource under development is the Secreted Protein Discovery Initiative (SPDI), which has begun to identify novel human secreted and transmembrane proteins. A Bayesian networks approach for predicting protein-protein interactions, genome-wide, in yeast is available at: http://genecensus.org/intint. A protein interaction map for Drosophila melanogaster has very recently been developed, as a starting point for systems biology modelling of multicellular organisms, including humans.
OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm; it can be applied to two genomes and extended to cluster orthologues across multiple species (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating Plasmodium falciparum genes, for example, identifies numerous enzymes that were incompletely annotated in the first-pass annotation of that parasite genome. Finally, the evolutionary divergence of large enzyme protein families, based on the complexities of their substrates, can be compared by a profile Hidden Markov Model method; the method was recently used to classify 47 glycosyltransferase families in the CAZy database into four superfamilies.
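To give a feel for the Markov Cluster step, the following is a minimal pure-Python sketch of generic Markov clustering (alternating expansion and inflation on a column-stochastic matrix). It is not the OrthoMCL implementation: the function names, the inflation value, the self-loop weight and the toy similarity matrix are assumptions, and real tools use sparse matrices and pruning.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster nodes of a weighted similarity graph by Markov clustering."""
    n = len(weights)
    m = [[float(weights[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # add self-loops (assumed weight 1.0)
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: two-step random walks
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )  # inflation: strengthen strong flows, weaken weak ones
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [set(c) for c in clusters]


# Hypothetical similarity graph: proteins 0-2 are mutually similar,
# proteins 3-4 are similar to each other, with no links between the groups.
toy_weights = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
groups = mcl(toy_weights)
```

On this toy graph the two disconnected groups are recovered as separate clusters. In an orthologue-grouping setting the edge weights would come from pairwise sequence similarity scores between proteins of different genomes, and the inflation parameter would control how finely the groups are split.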
The PIR is supported by grant U01 HG02712 from the National Institutes of Health and grants DBI-0138188 and ITR-0205470 from the National Science Foundation (C.W.). The writing of this article was funded, in part, by NIH grant P30 ES06096 (D.W.N.). The author very much appreciates the graphics assistance of Dr Marian Miller.
- Apweiler R, Bairoch A, Wu CH, et al: 'UniProt: Universal protein knowledgebase'. Nucleic Acids Res. 2004, 32: D115-D119. 10.1093/nar/gkh131.
- Wu CH, Nikolskaya A, Huang H, et al: 'PIRSF family classification system at the Protein Information Resource'. Nucleic Acids Res. 2004, 32: D112-D114. 10.1093/nar/gkh097.
- Wu CH, Huang H, Nikolskaya A, et al: 'The iProClass integrated database for protein functional analysis'. Comput Biol Chem. 2004, 28: 87-96. 10.1016/j.compbiolchem.2003.10.003.
- Dayhoff MO: 'The origin and evolution of protein superfamilies'. Fed Proc. 1976, 35: 2132-2138.
- Barker WC, Pfeiffer F, George DG: 'Superfamily classification in the PIR--International Protein Sequence Database'. Meth Enzymol. 1996, 266: 59-71.
- Nelson DR, Koymans L, Kamataki T, et al: 'Cytochrome P450 superfamily: Update on new sequences, gene mapping, accession numbers, and nomenclature'. Pharmacogenetics. 1996, 6: 1-42. 10.1097/00008571-199602000-00002.
- Ghosh DK, Salerno JC: 'Nitric oxide synthases: Domain structure and alignment in enzyme function and control'. Front Biosci. 2003, 8: d193-d209. 10.2741/959.
- Ehrenberg M, Elf J, Aurell E, et al: 'Systems biology is taking off'. Genome Res. 2003, 13: 2377-2380. 10.1101/gr.1763203.
- Peri S, Navarro JD, Amanchy R, et al: 'Development of human protein reference database as an initial platform for approaching systems biology in humans'. Genome Res. 2003, 13: 2363-2371. 10.1101/gr.1680803.
- Clark HF, Gurney AL, Abaya E, et al: 'The secreted protein discovery initiative (SPDI), a large-scale effort to identify novel human secreted and transmembrane proteins: A bioinformatics assessment'. Genome Res. 2003, 13: 2265-2270. 10.1101/gr.1293003.
- Jansen R, Yu H, Greenbaum D, et al: 'A Bayesian networks approach for predicting protein-protein interactions from genomic data'. Science. 2003, 302: 449-453. 10.1126/science.1087361.
- Giot L, Bader JS, Brouwer C, et al: 'A protein interaction map of Drosophila melanogaster'. Science. 2003, 302: 1727-1736. 10.1126/science.1090289.
- Li L, Stoeckert CJ, Roos DS: 'OrthoMCL: Identification of ortholog groups for eukaryotic genomes'. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.
- Kikuchi N, Kwon Y-D, Gotoh M, et al: 'Comparison of glycosyltransferase families using the profile Hidden Markov model'. Biochem Biophys Res Commun. 2003, 310: 574-579. 10.1016/j.bbrc.2003.09.031.