- Genome update
Update on genome completion and annotations: Protein Information Resource
Human Genomics volume 1, Article number: 229 (2004)
Abstract
The Protein Information Resource (PIR) recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt -- the Universal Protein Resource -- which now unifies the PIR, Swiss-Prot and TrEMBL databases. The PIRSF (SuperFamily) classification system is central to the PIR/UniProt functional annotation of proteins, providing classifications of whole proteins into a network structure to reflect their evolutionary relationships. Data integration and associative studies of protein family, function and structure are supported by the iProClass database, which offers value-added descriptions of all UniProt proteins with highly informative links to more than 50 other databases. The PIR system allows consistent, rich and accurate protein annotation for all investigators.
Introduction
The high-throughput genome projects have resulted in a rapid accumulation of genome sequences for a large number of organisms. Meanwhile, researchers have begun to tackle gene function and other complex regulatory processes by studying organisms at the global scale for various levels of biological organisation. To fully exploit the value of the data, bioinformatics infrastructures are urgently needed to identify proteins encoded by these genomes and to understand how these proteins function in making up a living cell.
The Protein Information Resource (PIR) is a public bioinformatics database located at the Georgetown University Medical Center (Washington, DC). PIR (http://pir.georgetown.edu) provides an advanced framework for comparative analysis and functional annotation of proteins. PIR recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt [1] (http://www.uniprot.org), the world's most comprehensive catalogue of information on proteins. UniProt is a central repository of protein sequence and function, created by joining the information contained in PIR-PSD, Swiss-Prot and TrEMBL. To facilitate consistent and accurate protein annotation, PIR has extended its superfamily concept and developed the PIR SuperFamily (PIRSF) classification system [2]. This framework is supported by the iProClass integrated database of protein family, function and structure [3]. iProClass offers value-added descriptions of all UniProt proteins and has highly informative links to more than 50 other databases of protein family, function, pathway, interaction, modification, structure, genome, ontology, literature and taxonomy (Figure 1).
PIR, then and now
For more than three decades, PIR has made many protein databases and analysis tools freely accessible to the scientific community. These include the Protein Sequence Database (PSD) -- the first international protein database -- which grew out of the Atlas of Protein Sequence and Structure, edited by Margaret Dayhoff [1965-1978], a pioneer in molecular evolution research. As a core resource, the PIR environment is widely used by researchers to develop other bioinformatics infrastructures and algorithms and to enable basic and applied scientific research.
The current version of iProClass (January 2004) consists of more than 1,232,000 non-redundant proteins (from PIR-PSD, Swiss-Prot and TrEMBL) organised into more than 36,290 PIR superfamilies, 145,340 families, 5,720 Pfam and PIR homology domains, 1,300 PROSITE/ProClass motifs, 280 RESID post-translational modification sites, 550,000 FASTA similarity clusters and links to more than 50 molecular biology databases. iProClass cross-references include: databases for protein families (eg COG, InterPro); functions and pathways (eg KEGG, WIT); protein-protein interactions (eg DIP); structures and structural classifications (eg PDB, SCOP, CATH, PDBSum); genes and genomes (eg TIGR, OMIM); ontologies (eg Gene Ontology); literature (NCBI PubMed); and taxonomy (NCBI taxonomy).
Coupling protein classification and data integration allows associative studies of protein family, function and structure [3]. Domain-based or structural classification-based searches allow identification of protein families sharing domains or structural-fold classes. Functional convergence (unrelated proteins with the same activity) and functional divergence are revealed by the relationships between the enzyme classification and protein family classification. With the underlying taxonomic information in hand, protein families that occur in given lineages can be identified. Combining phylogenetic pattern and biochemical pathway information for protein families allows identification of alternative pathways to the same end product in different taxonomic groups, which may suggest potential drug targets. The systematic approach for protein family curation, using integrative data, leads to novel predictions and functional inference for uncharacterised 'hypothetical' proteins, and to detection and correction of genome annotation errors (a few examples are listed in Table 1). Such studies may serve as a basis for further analysis of protein functional evolution and its relationship to the co-evolution of metabolic pathways, cellular networks and organisms.
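As a minimal sketch of the kind of associative query described above, the Python fragment below cross-tabulates enzyme classification against family membership. The record layout, the superfamily labels (SF_A, SF_B, SF_C) and the activity labels (EC-X, EC-Y, EC-Z) are hypothetical placeholders, not iProClass fields or real EC numbers. An activity shared by several superfamilies only flags a candidate for functional convergence, and several activities within one superfamily only flag a candidate for divergence; either would still need curation.

```python
from collections import defaultdict

def convergence_and_divergence(annotations):
    """annotations: iterable of (protein, superfamily, activity) tuples.

    Returns two dicts:
      convergent: activity -> superfamilies, for activities seen in >1 superfamily
      divergent:  superfamily -> activities, for superfamilies with >1 activity
    """
    families_by_activity = defaultdict(set)
    activities_by_family = defaultdict(set)
    for _protein, family, activity in annotations:
        families_by_activity[activity].add(family)
        activities_by_family[family].add(activity)
    convergent = {a: f for a, f in families_by_activity.items() if len(f) > 1}
    divergent = {f: a for f, a in activities_by_family.items() if len(a) > 1}
    return convergent, divergent

# Hypothetical toy annotations (placeholders, not database content).
records = [
    ("prot1", "SF_A", "EC-X"),
    ("prot2", "SF_B", "EC-X"),
    ("prot3", "SF_B", "EC-Y"),
    ("prot4", "SF_C", "EC-Z"),
]
convergent, divergent = convergence_and_divergence(records)
print(convergent)  # activity EC-X occurs in both SF_A and SF_B
print(divergent)   # superfamily SF_B carries both EC-X and EC-Y
```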
Organisational levels of protein groups
PIR has three organisational levels of protein groups -- namely, protein sequence entry, homeomorphic superfamily/family/subfamily and domain superfamily.
Protein sequence entries
A UniProt protein sequence entry represents the unprocessed normal product of a gene (or, sometimes, of very closely-related genes) from a single species. (A number of Swiss-Prot entries still contain identical sequences from different species, which will be unmerged in future releases.) The mature protein chain and its modifications are detailed in the feature table. To the extent that this is practical, UniProt aims to have one entry for each genetic locus that encodes a protein. When the sequence variation is more extensive than can be conveniently represented within the entry, however, additional entries may be constructed for splice variants, allelic variants and strain variants. The source data from which entries are constructed include entries that represent a single sequence report, either published or deposited in a databank. Often, such reports will need to be 'merged' with other reports representing the same protein sequence. The UniProt staff attempt to identify these cases and perform the required merges.
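The merging step can be pictured with a small Python sketch. It assumes, purely for illustration, that each report is a dictionary with hypothetical keys 'species', 'sequence' and 'source', and that two reports describe the same protein only when species and sequence match exactly. The actual curation also has to deal with variants and sequencing conflicts, which this sketch ignores.

```python
from collections import defaultdict

def merge_reports(reports):
    """Group single-sequence reports into one entry per (species, sequence).

    Each report is a dict with the hypothetical keys 'species', 'sequence'
    and 'source'. Sequence comparison is case-insensitive and exact.
    """
    grouped = defaultdict(list)
    for report in reports:
        key = (report["species"], report["sequence"].upper())
        grouped[key].append(report["source"])
    return [
        {"species": species, "sequence": sequence, "sources": sorted(sources)}
        for (species, sequence), sources in grouped.items()
    ]

# Hypothetical reports: two describe the same human sequence.
reports = [
    {"species": "Homo sapiens", "sequence": "MKTAYIAKQR", "source": "report-1"},
    {"species": "Homo sapiens", "sequence": "mktayiakqr", "source": "report-2"},
    {"species": "Mus musculus", "sequence": "MKTAYIAKQR", "source": "report-3"},
]
entries = merge_reports(reports)
for entry in entries:
    print(entry["species"], entry["sources"])
```

Keeping species in the key reflects the rule that an entry comes from a single species, so identical sequences from different organisms stay in separate entries.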
Protein families
For the purposes of standardising annotation, database entries are organised into families of closely-related sequences. These generally represent proteins with the same function in various organisms. The taxonomic distribution within a family will depend on how well conserved are the structure and function of the protein. As a general guideline, sequences having more than 50 per cent sequence identity are usually similar in structure and function, and the major sequence features are unambiguously aligned by commonly-used multiple sequence alignment programs. Therefore, 50 per cent sequence identity is used by the database staff for the provisional clustering of proteins into families. This threshold is appropriate in many cases; however, some families may be repartitioned into more convenient clusters after PIR review.
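The provisional clustering rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PIR pipeline: the sequences are assumed to be already aligned to equal length with '-' as the gap character, identity is counted over columns where neither sequence has a gap, and entries are joined by single linkage. The sequence names and toy sequences are made up.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

def provisional_families(aligned, threshold=50.0):
    """Single-linkage clusters: two entries are linked if identity > threshold."""
    parent = {name: name for name in aligned}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(aligned, 2):
        if percent_identity(aligned[a], aligned[b]) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in aligned:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)

# Made-up aligned sequences: P1-P3 are similar, P4 is unrelated.
aligned = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYLAKQR",
    "P3": "MKSAFLGKQR",
    "P4": "GHWLPNDCEV",
}
families = provisional_families(aligned)
print(families)
```

Single linkage is only one possible choice; it can chain loosely related sequences together, which is one reason a provisional cluster may need repartitioning on review.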
Homeomorphic superfamilies/families/subfamilies
The PIR superfamily concept [4], the original classification based on sequence similarity, has been used as a guiding principle to provide comprehensive and non-overlapping clustering of PIR protein sequences into a hierarchical order to reflect their evolutionary relationships [5]. To facilitate sensible propagation and standardisation of protein annotation and systematic detection of annotation errors as part of the UniProt project, PIR has extended its hierarchical superfamily concept and developed the PIRSF system, a network classification system based on the evolutionary relationships of whole proteins.
Proteins are considered 'homeomorphic' if they share full-length sequence similarity and a common domain architecture, as indicated by the same type, number and order of defined domains. Length deviation may occur for alternative-splice and alternate-initiator variants, sequence fragments and peptides derived from proteolytic processing. Variation of the domain architecture may exist for repeating domains and/or auxiliary domains, which are often mobile and may easily be lost, acquired or functionally replaced during evolution. Classification based on whole proteins, rather than on the component domains, allows annotation of both generic biochemical and specific biological functions.
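The domain-architecture part of this definition can be illustrated with a short Python sketch. It covers only the 'same type, number and order of defined domains' test, with an optional relaxation for tandem repeats; full-length sequence similarity and auxiliary domains are not modelled. The domain names and the function names are hypothetical and chosen for illustration.

```python
from itertools import groupby

def collapse_repeats(domains):
    """Collapse consecutive repeats of one domain type, e.g. A, A, B -> A, B."""
    return [name for name, _run in groupby(domains)]

def same_architecture(a, b, allow_repeats=False):
    """True if two ordered domain lists match in type, number and order.

    With allow_repeats=True, tandem repeats of a domain are treated as one
    unit, so a variable repeat count does not break the match.
    """
    if allow_repeats:
        a, b = collapse_repeats(a), collapse_repeats(b)
    return list(a) == list(b)

# Hypothetical architectures, listed from N- to C-terminus.
p1 = ["SH3", "SH2", "kinase"]
p2 = ["SH3", "SH2", "kinase"]
p3 = ["SH2", "SH3", "kinase"]        # same domains, different order
p4 = ["Ig", "Ig", "Ig", "kinase"]
p5 = ["Ig", "kinase"]                # fewer copies of the repeating domain

print(same_architecture(p1, p2))                      # True
print(same_architecture(p1, p3))                      # False
print(same_architecture(p4, p5))                      # False
print(same_architecture(p4, p5, allow_repeats=True))  # True
```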
The network structure accommodates a flexible number of levels that reflect varying degrees of sequence conservation (superfamily, family and subfamily). The threshold values of sequence similarity may vary at each level, depending on the evolutionary rate in each group of proteins (ie the taxonomic distribution within a protein group will depend on how well conserved are the structure and function of the protein). The network structure allows improved protein annotation, more accurate extraction of conserved functional residues, and classification of distantly-related orphan proteins. Homeomorphic families and subfamilies -- generally representing proteins with the same function in various organisms -- are suitable for propagating standardised protein names, position-specific features (such as functional sites) and keywords. Distantly-related homeomorphic families and orphan proteins sharing a common domain architecture may form a homeomorphic superfamily. It is assumed, although in most cases this has not been investigated in detail, that the molecules in a homeomorphic superfamily share a common evolutionary history. Thus, it should be valid to construct an evolutionary tree from the members of a homeomorphic superfamily. If two groups of proteins with the same architecture or function are shown to have come to that structure independently (convergent evolution), they are appropriately separated into two homeomorphic superfamilies. For example, the cytochrome P450 (CYP) [6] and nitric oxide synthase (NOS) [7] families of enzymes both carry out 'P450-like' oxygenation reactions and at first were believed to be evolutionarily related. Upon further in-depth analysis, however, no evidence for an evolutionary relationship between the two gene superfamilies was found [5], so this is a likely case of convergent evolution.
Domain superfamilies
Many types of domains have been found in diverse proteins. In common use, for example, the term 'protein kinase superfamily' refers to the collection of all proteins that contain a protein kinase-like domain. PIR calls such a group a 'domain superfamily'. Any given protein sequence will be assigned to only one homeomorphic superfamily, but it may contain sequence segments belonging to several domain superfamilies [5].
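The one-to-one versus one-to-many relationship can be made concrete with a small data-model sketch in Python. The class names, field names, accessions (X0001, X0002) and superfamily labels are hypothetical and do not correspond to the PIR schema: each entry carries exactly one homeomorphic superfamily label, while its domain hits may fall in any number of domain superfamilies.

```python
from dataclasses import dataclass, field

@dataclass
class DomainHit:
    domain_superfamily: str  # e.g. a 'kinase-like' domain superfamily
    start: int               # first residue of the domain segment
    end: int                 # last residue of the domain segment

@dataclass
class ProteinEntry:
    accession: str
    homeomorphic_superfamily: str                 # exactly one per protein
    domains: list = field(default_factory=list)   # zero or more DomainHit

def domain_superfamily_index(entries):
    """Map each domain superfamily to the accessions that contain it."""
    index = {}
    for entry in entries:
        for hit in entry.domains:
            index.setdefault(hit.domain_superfamily, set()).add(entry.accession)
    return index

# Hypothetical entries: both contain a kinase-like domain but belong to
# different homeomorphic superfamilies because their architectures differ.
entries = [
    ProteinEntry("X0001", "HSF_1",
                 [DomainHit("kinase-like", 10, 260), DomainHit("SH2-like", 300, 390)]),
    ProteinEntry("X0002", "HSF_2",
                 [DomainHit("kinase-like", 5, 250)]),
]
index = domain_superfamily_index(entries)
print(sorted(index["kinase-like"]))
```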
Recent directions for additional protein analyses and databases
With the new surge of interest in subcellular and intracellular signal transduction circuitry and 'systems biology' [8], confirmed protein-protein interactions are being registered at the Human Protein Reference Database (HPRD; http://www.hprd.org) [9]. Another bioinformatics database under development is the Secreted Protein Discovery Initiative (SPDI), which has begun to identify novel human secreted and transmembrane proteins [10]. A Bayesian networks approach for predicting protein-protein interactions, genome-wide, in yeast [11] is available at: http://genecensus.org/intint. A protein interaction map for Drosophila melanogaster has very recently been developed [12], as a starting point for systems biology modelling of multicellular organisms, including humans.
OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa using a Markov Cluster algorithm; it can be applied to two genomes and extended to cluster orthologues across multiple species (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating Plasmodium falciparum genes, for example, identifies numerous enzymes that were incompletely annotated in the first-pass annotation of that parasite genome [13]. Finally, the evolutionary divergence of large enzyme protein families, based on the complexities of their substrates, can be compared by a profile Hidden Markov Model method; the method was recently used to classify 47 glycosyltransferase families in the CAZy database into four superfamilies [14].
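To give a feel for the Markov Cluster step, the following pure-Python sketch runs the two alternating operations of the algorithm as it is commonly described, expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation), on a toy similarity graph. It is not the OrthoMCL implementation: the similarity weights, the inflation value of 2.0, the added self-loops and the cut-off used to read out clusters are all assumptions made for illustration, and OrthoMCL's own construction of the similarity graph from sequence comparisons is not shown.

```python
def normalise_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, power):
    """Inflation: raise entries to a power, then renormalise the columns."""
    return normalise_columns([[x ** power for x in row] for row in m])

def markov_cluster(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric, non-negative similarity matrix; returns index lists."""
    n = len(weights)
    # Add self-loops, then convert to a column-stochastic matrix.
    m = normalise_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that keep non-negligible mass are attractors; the columns they
    # cover are the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Toy graph: two tightly connected triangles (0-1-2 and 3-4-5)
# joined by one weak edge between nodes 2 and 3.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
print(markov_cluster(W))  # [[0, 1, 2], [3, 4, 5]]
```

In this toy case the weak bridge carries too little flow to survive inflation, so the two triangles separate. A higher inflation value generally yields finer clusters.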
References
1. Apweiler R, Bairoch A, Wu CH, et al: 'UniProt: Universal protein knowledgebase'. Nucleic Acids Res. 2004, 32: D115-D119. 10.1093/nar/gkh131.
2. Wu CH, Nikolskaya A, Huang H, et al: 'PIRSF family classification system at the Protein Information Resource'. Nucleic Acids Res. 2004, 32: D112-D114. 10.1093/nar/gkh097.
3. Wu CH, Huang H, Nikolskaya A, et al: 'The iProClass integrated database for protein functional analysis'. Comput Biol Chem. 2004, 28: 87-96. 10.1016/j.compbiolchem.2003.10.003.
4. Dayhoff MO: 'The origin and evolution of protein superfamilies'. Fed Proc. 1976, 35: 2132-2138.
5. Barker WC, Pfeiffer F, George DG: 'Superfamily classification in the PIR--International Protein Sequence Database'. Meth Enzymol. 1996, 266: 59-71.
6. Nelson DR, Koymans L, Kamataki T, et al: 'Cytochrome P450 superfamily: Update on new sequences, gene mapping, accession numbers, and nomenclature'. Pharmacogenetics. 1996, 6: 1-42. 10.1097/00008571-199602000-00002.
7. Ghosh DK, Salerno JC: 'Nitric oxide synthases: Domain structure and alignment in enzyme function and control'. Front Biosci. 2003, 8: d193-d209. 10.2741/959.
8. Ehrenberg M, Elf J, Aurell E, et al: 'Systems biology is taking off'. Genome Res. 2003, 13: 2377-2380. 10.1101/gr.1763203.
9. Peri S, Navarro JD, Amanchy R, et al: 'Development of human protein reference database as an initial platform for approaching systems biology in humans'. Genome Res. 2003, 13: 2363-2371. 10.1101/gr.1680803.
10. Clark HF, Gurney AL, Abaya E, et al: 'The secreted protein discovery initiative (SPDI), a large-scale effort to identify novel human secreted and transmembrane proteins: A bioinformatics assessment'. Genome Res. 2003, 13: 2265-2270. 10.1101/gr.1293003.
11. Jansen R, Yu H, Greenbaum D, et al: 'A Bayesian networks approach for predicting protein-protein interactions from genomic data'. Science. 2003, 302: 449-453. 10.1126/science.1087361.
12. Giot L, Bader JS, Brouwer C, et al: 'A protein interaction map of Drosophila melanogaster'. Science. 2003, 302: 1727-1736. 10.1126/science.1090289.
13. Li L, Stoeckert CJ, Roos DS: 'OrthoMCL: Identification of ortholog groups for eukaryotic genomes'. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.
14. Kikuchi N, Kwon Y-D, Gotoh M, et al: 'Comparison of glycosyltransferase families using the profile Hidden Markov model'. Biochem Biophys Res Commun. 2003, 310: 574-579. 10.1016/j.bbrc.2003.09.031.
Acknowledgements
The PIR is supported by grant U01 HG02712 from the National Institutes of Health and grants DBI-0138188 and ITR-0205470 from the National Science Foundation (C.W.). The writing of this article was funded, in part, by NIH grant P30 ES06096 (D.W.N.). The author very much appreciates the graphics assistance of Dr Marian Miller.
Cite this article
Wu, C., Nebert, D.W. Update on genome completion and annotations: Protein Information Resource. Hum Genomics 1, 229 (2004). https://doi.org/10.1186/1479-7364-1-3-229
DOI: https://doi.org/10.1186/1479-7364-1-3-229