Update on genome completion and annotations: Protein Information Resource

The Protein Information Resource (PIR) recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt -- the Universal Protein Resource -- which now unifies the PIR, Swiss-Prot and TrEMBL databases. The PIRSF (SuperFamily) classification system is central to the PIR/UniProt functional annotation of proteins, providing classifications of whole proteins into a network structure to reflect their evolutionary relationships. Data integration and associative studies of protein family, function and structure are supported by the iProClass database, which offers value-added descriptions of all UniProt proteins with highly informative links to more than 50 other databases. The PIR system allows consistent, rich and accurate protein annotation for all investigators.


Introduction
The high-throughput genome projects have resulted in a rapid accumulation of genome sequences for a large number of organisms. Meanwhile, researchers have begun to tackle gene function and other complex regulatory processes by studying organisms at the global scale for various levels of biological organisation. To fully exploit the value of the data, bioinformatics infrastructures are urgently needed to identify proteins encoded by these genomes and to understand how these proteins function in making up a living cell.
The Protein Information Resource (PIR) is a public bioinformatics database located at the Georgetown University Medical Center (Washington, DC). PIR (http://pir.georgetown.edu) provides an advanced framework for comparative analysis and functional annotation of proteins. PIR recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt 1 (http://www.uniprot.org), the world's most comprehensive catalogue of information on proteins. UniProt is a central repository of protein sequence and function, created by joining the information contained in PIR-PSD, Swiss-Prot and TrEMBL. To facilitate consistent and accurate protein annotation, PIR has extended its superfamily concept and developed the PIR SuperFamily (PIRSF) classification system. 2 This framework is supported by the iProClass integrated database of protein family, function and structure. 3 iProClass offers value-added descriptions of all UniProt proteins and has highly informative links to more than 50 other databases of protein family, function, pathway, interaction, modification, structure, genome, ontology, literature and taxonomy (Figure 1).

PIR, then and now
For more than three decades, PIR has made many protein databases and analysis tools freely accessible to the scientific community. These include the Protein Sequence Database (PSD), the first international protein database, which grew out of the Atlas of Protein Sequence and Structure, edited by Margaret Dayhoff (1965-1978), a pioneer in molecular evolution research. As a core resource, the PIR environment is widely used by researchers to develop other bioinformatics infrastructures and algorithms and to enable basic and applied scientific research.
Coupling protein classification and data integration allows associative studies of protein family, function and structure. 3 Domain-based or structural classification-based searches allow identification of protein families sharing domains or structural fold classes. Functional convergence (unrelated proteins with the same activity) and functional divergence are revealed by the relationships between the enzyme classification and protein family classification. With the underlying taxonomic information in hand, protein families that occur in given lineages can be identified. Combining phylogenetic pattern and biochemical pathway information for protein families allows identification of alternative pathways to the same end product in different taxonomic groups, which may suggest potential drug targets. The systematic approach for protein family curation, using integrative data, leads to novel predictions and functional inference for uncharacterised 'hypothetical' proteins, and to detection and correction of genome annotation errors (a few examples are listed in Table 1). Such studies may serve as a basis for further analysis of protein functional evolution and its relationship to the co-evolution of metabolic pathways, cellular networks and organisms.
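As an illustration of how enzyme classification and family classification can be cross-tabulated to flag functional convergence and divergence, the following minimal Python sketch may help. It is not PIR's implementation: the protein identifiers, superfamily labels and the assignment of EC numbers to them are hypothetical, and real curation would weigh far more evidence than shared annotations.

```python
from collections import defaultdict

# Hypothetical (protein, superfamily, EC number) annotations for illustration.
annotations = [
    ("P1", "SF_alpha", "1.1.1.1"),
    ("P2", "SF_alpha", "1.1.1.1"),
    ("P3", "SF_beta", "1.1.1.1"),
    ("P4", "SF_beta", "2.7.1.1"),
    ("P5", "SF_gamma", "3.4.21.4"),
]


def convergence_candidates(rows):
    """Return EC numbers that occur in more than one superfamily."""
    by_ec = defaultdict(set)
    for _protein, superfamily, ec in rows:
        by_ec[ec].add(superfamily)
    return {ec: sorted(sfs) for ec, sfs in by_ec.items() if len(sfs) > 1}


def divergence_candidates(rows):
    """Return superfamilies whose members carry more than one EC number."""
    by_sf = defaultdict(set)
    for _protein, superfamily, ec in rows:
        by_sf[superfamily].add(ec)
    return {sf: sorted(ecs) for sf, ecs in by_sf.items() if len(ecs) > 1}


convergent = convergence_candidates(annotations)
divergent = divergence_candidates(annotations)
```

In this toy data set, one EC number is shared by two unrelated superfamilies (a convergence candidate), and one superfamily carries two EC numbers (a divergence candidate). Such hits would only be starting points for manual review.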

Organisational levels of protein groups
PIR has three organisational levels of protein groups: protein sequence entry, homeomorphic superfamily/family/subfamily and domain superfamily.

Protein sequence entries
A UniProt protein sequence entry represents the unprocessed normal product of a gene (or, sometimes, of very closely-related genes) from a single species. (A number of Swiss-Prot entries still contain identical sequences from different species, which will be unmerged in future releases.) The mature protein chain and its modifications are detailed in the feature table. To the extent that it is practical, UniProt aims to have one entry for each genetic locus that encodes protein. When the sequence variation is more extensive than can be conveniently represented within the entry, however, additional entries may be constructed for splice variants, allelic variants and strain variants. The source data from which entries are constructed include entries that represent a single sequence report, either published or deposited in a databank. Often, such reports will need to be 'merged' with other reports representing the same protein sequence. The UniProt staff attempt to identify these cases and perform the required merges.

Figure 1. Diagram of the interrelated links of the iProClass database. Comprehensive protein and superfamily views exist in two types of summary reports. The protein sequence report covers information on family, structure, function, gene, genetics, disease, ontology, taxonomy and literature, with cross-references to relevant molecular databases and executive summary lines, as well as graphical display of domain and motif regions. The superfamily report provides PIR superfamily membership information with length, taxonomy and keyword statistics, complete member listing separated by major kingdoms, family relationships at the whole protein and domain and motif levels with direct mapping to other classifications, structure and function cross-references, and domain and motif graphical display.

Protein families
For the purposes of standardising annotation, database entries are organised into families of closely-related sequences. These generally represent proteins with the same function in various organisms. The taxonomic distribution within a family will depend on how well the structure and function of the protein are conserved. As a general guideline, sequences having more than 50 per cent sequence identity are usually similar in structure and function, and the major sequence features are unambiguously aligned by commonly used multiple sequence alignment programs. Therefore, 50 per cent sequence identity is used by the database staff for the provisional clustering of proteins into families. This threshold is appropriate in many cases; however, some families may be repartitioned into more convenient clusters after PIR review.
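The 50 per cent identity guideline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences are taken to be already aligned to equal length, columns containing a gap are ignored when computing identity, and single-linkage grouping is used as a stand-in for whatever clustering procedure the database staff actually apply. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two pre-aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def provisional_families(aligned, threshold=50.0):
    """Single-linkage grouping: join any pair above the identity threshold."""
    parent = {name: name for name in aligned}

    def find(n):
        # Follow parent links to the group representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for p, q in combinations(aligned, 2):
        if percent_identity(aligned[p], aligned[q]) > threshold:
            parent[find(p)] = find(q)

    groups = {}
    for name in aligned:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical aligned toy sequences.
toy = {
    "protA": "MKTAYIAKQR",
    "protB": "MKTAYIGKQR",
    "protC": "MKSAYLGKHR",
    "protD": "WWPLNNCDEF",
}
families = provisional_families(toy)
```

Here protA, protB and protC end up in one provisional family, while protD remains on its own. Single linkage can chain distant sequences together through intermediates, which is one reason a review step, like the PIR review described above, would still be needed.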

Homeomorphic superfamilies/families/subfamilies
The PIR superfamily concept, 4 the original classification based on sequence similarity, has been used as a guiding principle to provide comprehensive and non-overlapping clustering of PIR protein sequences into a hierarchical order to reflect their evolutionary relationships. 5 To facilitate sensible propagation and standardisation of protein annotation and systematic detection of annotation errors as part of the UniProt project, PIR has extended its hierarchical superfamily concept and developed the PIRSF system, a network classification system based on the evolutionary relationships of whole proteins. Proteins are considered 'homeomorphic' if they share full-length sequence similarity and a common domain architecture, as indicated by the same type, number and order of defined domains. Length deviation may occur for alternative-splice and alternate-initiator variants, sequence fragments and peptides derived from proteolytic processing. Variation of the domain architecture may exist for repeating domains and/or auxiliary domains, which are often mobile and may easily be lost, acquired or functionally replaced during evolution. Classification based on whole proteins, rather than on the component domains, allows annotation of both generic biochemical and specific biological functions.

Wu and Nebert

The network structure accommodates a flexible number of levels that reflect varying degrees of sequence conservation (superfamily, family and subfamily). The threshold values of sequence similarity may vary at each level, depending on the evolutionary rate in each group of proteins (i.e. the taxonomic distribution within a protein group will depend on how well the structure and function of the protein are conserved). The network structure allows improved protein annotation, more accurate extraction of conserved functional residues, and classification of distantly-related orphan proteins. Homeomorphic families and subfamilies, which generally represent proteins with the same function in various organisms, are suitable for propagating standardised protein names, position-specific features (such as functional sites) and keywords. Distantly-related homeomorphic families and orphan proteins sharing a common domain architecture may form a homeomorphic superfamily. It is assumed, although in most cases this has not been investigated in detail, that the molecules in a homeomorphic superfamily share a common evolutionary history. Thus, it should be valid to construct an evolutionary tree from the members of a homeomorphic superfamily. If two groups of proteins with the same architecture or function are shown to have come to that structure independently (convergent evolution), they are appropriately separated into two homeomorphic superfamilies. For example, the cytochrome P450 (CYP) 6 and nitric oxide synthase (NOS) 7 families of enzymes both carry out 'P450-like' oxygenation reactions and at first were believed to be evolutionarily related. Upon further in-depth analysis, however, no evidence for an evolutionary relationship between the two gene superfamilies was found, 5 so this is most likely a case of convergent evolution.
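The 'homeomorphic' criterion (full-length similarity plus the same type, number and order of domains, with some tolerance for repeating domains) can be illustrated with a small Python sketch. It rests on loud assumptions: a simple length ratio stands in for full-length sequence similarity, the 0.8 cut-off is arbitrary, only consecutive repeats are tolerated, and the domain labels and protein records are hypothetical. It does not reproduce the PIRSF rules.

```python
from itertools import groupby


def collapse_repeats(domains):
    """Collapse runs of the same domain, e.g. (A, B, B) -> (A, B)."""
    return tuple(name for name, _run in groupby(domains))


def same_architecture(d1, d2, allow_repeats=True):
    """Compare the ordered domain lists, optionally ignoring tandem repeats."""
    if allow_repeats:
        d1, d2 = collapse_repeats(d1), collapse_repeats(d2)
    return tuple(d1) == tuple(d2)


def is_homeomorphic(p, q, min_length_ratio=0.8):
    """Crude test: comparable length (assumed proxy) and same architecture."""
    ratio = min(p["length"], q["length"]) / max(p["length"], q["length"])
    return ratio >= min_length_ratio and same_architecture(p["domains"], q["domains"])


# Hypothetical protein records with ordered domain labels.
x = {"length": 480, "domains": ("domA", "domB")}
y = {"length": 510, "domains": ("domA", "domB", "domB")}  # tandem repeat of domB
z = {"length": 495, "domains": ("domB", "domA")}          # same domains, other order

x_y = is_homeomorphic(x, y)
x_z = is_homeomorphic(x, z)
```

In this sketch x and y are treated as homeomorphic despite the repeated domain, whereas x and z are not because the domain order differs. Fragments and splice variants, which the text allows as length deviations, would be rejected by the length-ratio proxy and would need separate handling.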

Domain superfamilies
Many types of domains have been found in diverse proteins. In common use, for example, the term 'protein kinase superfamily' refers to the collection of all proteins that contain a protein kinase-like domain. PIR calls such a group a 'domain superfamily'. Any given protein sequence will be assigned to only one homeomorphic superfamily, but it may contain sequence segments belonging to several domain superfamilies. 5
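The relationship between the two kinds of superfamily can be pictured as a simple data structure: each protein carries exactly one homeomorphic superfamily label, while a domain index may list the same protein under several domain superfamilies. The Python sketch below uses hypothetical identifiers and labels throughout and is not the PIR schema.

```python
from collections import defaultdict

# Hypothetical entries: one homeomorphic superfamily per protein,
# plus the ordered list of domains found in its sequence.
proteins = {
    "protX": {"homeomorphic_sf": "HSF_1", "domains": ["kinase_like", "domB"]},
    "protY": {"homeomorphic_sf": "HSF_2", "domains": ["kinase_like"]},
    "protZ": {"homeomorphic_sf": "HSF_2", "domains": ["kinase_like", "kinase_like"]},
}


def domain_superfamilies(entries):
    """Invert the protein records into a domain -> set of proteins index."""
    index = defaultdict(set)
    for name, entry in entries.items():
        for domain in entry["domains"]:
            index[domain].add(name)
    return dict(index)


index = domain_superfamilies(proteins)
memberships_of_protX = sorted(d for d, members in index.items() if "protX" in members)
```

In this toy example protX belongs to a single homeomorphic superfamily but appears in two domain superfamilies, and the 'kinase_like' domain superfamily spans proteins from both homeomorphic superfamilies.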

Recent directions for additional protein analyses and databases
With the new surge in interest in the fields of subcellular and intracellular signal transduction circuitry and 'systems biology', 8 confirmed protein-protein interactions are being registered at the Human Protein Reference Database (HPRD; http://www.hprd.org). 9 Another bioinformatics database under development is the Secreted Protein Discovery Initiative (SPDI), which has begun to identify novel secreted and transmembrane proteins. 10 A Bayesian networks approach for predicting protein-protein interactions, genome-wide, in yeast 11 is available at: http://genecensus.org/intint. A protein interaction map for Drosophila melanogaster has very recently been developed, 12 as a starting point for systems biology modelling of multicellular organisms, including humans.
OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa using a Markov Cluster algorithm; it can be applied to two genomes and extended to cluster orthologues across multiple species (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating Plasmodium falciparum genes, for example, identifies numerous enzymes that were incompletely annotated in the first-pass annotation of that parasite genome. 13 Finally, the evolutionary divergence of large enzyme protein families, based on the complexities of their substrates, can be compared by a profile Hidden Markov Model method; the method was recently used to classify 47 glycosyltransferase families in the CAZy database into four superfamilies. 14
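For readers unfamiliar with Markov clustering, the core loop of the general technique (alternating 'expansion' and 'inflation' of a column-stochastic matrix built from a similarity graph) can be sketched in plain Python. This is a generic illustration, not the OrthoMCL code: the edge weights below are hypothetical, the inflation value and iteration count are arbitrary, and a real analysis would derive weights from pairwise sequence comparisons.

```python
def normalise_columns(m):
    """Scale each column to sum to 1 (columns summing to 0 stay 0)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalise columns."""
    return normalise_columns([[x ** power for x in row] for row in m])


def markov_cluster(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes of a weighted graph given as a symmetric matrix."""
    n = len(weights)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    clusters = set()
    for i in range(n):
        if m[i][i] > eps:  # node i is an attractor; its row lists a cluster
            clusters.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity graph: two groups of three mutually similar genes.
w = [[0.0] * 6 for _ in range(6)]
for group in ([0, 1, 2], [3, 4, 5]):
    for a in group:
        for b in group:
            if a != b:
                w[a][b] = 1.0

groups = markov_cluster(w)
```

With this deliberately simple input the two groups of genes are recovered as separate clusters. Higher inflation values generally yield finer-grained clusters, so the choice of that parameter affects how orthologous groups are delimited.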