Protein-protein interaction databases: keeping up with growing interactomes
© Henry Stewart Publications 2009
Received: 30 January 2009
Accepted: 30 January 2009
Published: 1 April 2009
Over the past few years, the number of known protein-protein interactions has increased substantially. To make this information more readily available, a number of publicly available databases have set out to collect and store protein-protein interaction data. In this review, protein-protein interactions were retrieved from six major databases and integrated, and the results were compared. The six databases (the Biological General Repository for Interaction Datasets [BioGRID], the Molecular INTeraction database [MINT], the Biomolecular Interaction Network Database [BIND], the Database of Interacting Proteins [DIP], the IntAct molecular interaction database [IntAct] and the Human Protein Reference Database [HPRD]) differ in scope and content; integration of all datasets is non-trivial owing to differences in data annotation. With respect to human protein-protein interaction data, HPRD seems to be the most comprehensive. To obtain a complete dataset, however, interactions from all six databases have to be combined. To overcome this limitation, meta-databases such as the Agile Protein Interaction Data Analyzer (APID) offer access to integrated protein-protein interaction datasets, although these also currently have certain restrictions.
Keywords: protein-protein interactions, PPI, database, bioinformatics, IMEx, PSI-MI
The nature of protein-protein interaction data
Proteins do not act independently but in a network of complex molecular interactions. Therefore, it is important to identify physical interactions between proteins. Different experimental techniques have been developed to measure physical interactions between proteins; these methods vary considerably, not least in terms of the data they produce.
Representing protein-protein interaction (PPI) data as a graph, with proteins as nodes and physical interactions as edges, allows the data to be analysed using a graph-theoretical framework. Many graph analysis algorithms have been applied to PPI datasets; these approaches have been reviewed in detail elsewhere [11–16].
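As a minimal sketch of the graph view, the following Python fragment stores binary interactions as an undirected graph and computes the degree (number of interaction partners) of each protein. The function names and the toy protein labels P1 to P4 are hypothetical and chosen only for illustration; they do not come from any of the databases discussed here.

```python
from collections import defaultdict


def build_graph(interactions):
    """Build an undirected graph (adjacency sets) from binary interactions."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)  # physical interactions have no direction
    return graph


def degrees(graph):
    """Number of distinct interaction partners per protein."""
    return {protein: len(partners) for protein, partners in graph.items()}


# Hypothetical toy network; P1..P4 are placeholder protein names.
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
toy_graph = build_graph(toy_interactions)
print(degrees(toy_graph))
```

Adjacency sets make duplicate reports of the same pair collapse automatically, which is convenient when the same interaction is listed more than once.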
The primary resources for PPI data are individual scientific publications. Several public databases collect published PPI data and provide researchers with access to their curated datasets. These usually reference the original publication and the experimental method that determined each individual interaction. Database designers choose to represent these data in different ways, and the wide spectrum of experimental methods makes it difficult to design a single data model that captures all necessary experimental detail. To overcome this problem, the International Molecular Exchange (IMEx; http://imex.sourceforge.net/) consortium was formed. IMEx aims to enable the exchange of data and to avoid the duplication of the curation effort. To that end, an XML-based standard format, termed the Proteomics Standards Initiative-Molecular Interaction (PSI-MI) format, has been developed. At the time of writing, however, no data had yet been exchanged, and it was therefore necessary to combine PPI data from all available databases using the authors' own scripts to obtain as comprehensive a network as possible.
DIP, IntAct and MINT are active members of the IMEx initiative; the curation accuracy of these three databases was assessed recently by Cusick et al. HPRD focuses entirely on human proteins, providing not only information on protein interactions, but also a variety of protein-specific information, such as post-translational modifications, disease associations and enzyme-substrate relationships. One of the first interaction databases, BIND, was initiated in 2001 by the University of Toronto and the University of British Columbia; it is part of the Biomolecular Object Network Databank (BOND) and was subsequently acquired by the company Thomson Reuters.
The following comparison is based on complete sets of binary interactions that were downloaded from the individual databases in May 2008. IntAct and MINT derive binary interactions from protein complexes using the spokes model. No other database provided any information on which model is applied. Only 'physical interactions' are considered here, although most databases also provide 'genetic interactions' (for example, two non-essential genes whose simultaneous knockout leads to a non-viable phenotype). Furthermore, interactions were only accepted if a publication identifier was provided along with the interacting proteins.
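The choice of expansion model changes the number of binary interactions derived from a complex. The following Python sketch contrasts the spokes model (the bait is paired with each co-purified protein) with the matrix model (every member is paired with every other member). The function names and the toy complex are hypothetical; the sketch does not reproduce the code of any database.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Spokes model: pair the bait with each co-purified prey."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_expansion(members):
    """Matrix model: pair every complex member with every other member."""
    return {frozenset(pair) for pair in combinations(sorted(set(members)), 2)}


# Hypothetical complex: bait A purified together with B, C and D.
bait, preys = "A", ["B", "C", "D"]
spoke_pairs = spoke_expansion(bait, preys)
matrix_pairs = matrix_expansion([bait] + preys)
print(len(spoke_pairs), len(matrix_pairs))
```

For a complex of n proteins, the spokes model yields n - 1 pairs, whereas the matrix model yields n(n - 1)/2, so two databases applying different models to the same experiment will report different numbers of interactions.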
Currently, the most comprehensive database in terms of individual interactions is IntAct, with almost 130,000 unique interactions from up to 131 different organisms. Despite these large numbers, it cites only about 3,000 different publications. Whereas IntAct seems to concentrate on high-throughput studies, HPRD also takes into account small-scale publications. Although restricted to human proteins, HPRD reports over 36,000 unique interactions from more than 18,000 publications. Only BioGRID cites a similar number of publications (16,369); it is also the second largest database in terms of the number of unique interactions. It should be noted that the databases examine publications in differing depth, and that a higher number of publications does not necessarily imply a greater curation effort.
The majority of known protein interactions involve proteins from Saccharomyces cerevisiae and Homo sapiens. Individual high-throughput interaction screens were carried out for some other organisms; these high-throughput studies usually account for the majority of all known interactions in the corresponding organism. By contrast, known protein interactions for S. cerevisiae and H. sapiens are dispersed over numerous publications. For this reason, the number of interactions for humans and yeast can vary considerably between different databases, depending on their coverage of the literature.
Differences between the PPI databases
Ideally, every database would extract the same interactions from a given publication. Unfortunately, this is not the case. Of the 14,899 publications shared by at least two databases, 5,782 (39 per cent) were reported with a different number of interactions in different databases. For example, for the publication reporting the most interactions, a minimum of 18,877 (BIND) and a maximum of 20,800 interactions (DIP) were reported. According to the abstract, the number of interactions is 20,405, which differs from the number reported by each of the five databases that cite this publication. In this case, the variation is presumably due to problems with identifier mapping. Many databases use different identifiers, which do not always map in a perfect one-to-one relationship to the originally published identifiers. BioGRID (20,220 interactions) uses the original gene identifiers, but still lacks 185 interactions.
As a second example, using a Y2H screen, Rual et al. detected 2,754 interactions between human proteins. The authors compared their experimental findings with a literature-curated PPI network of 4,076 interactions. This resulted in a combined network of 6,438 interactions. HPRD (2,371 interactions), IntAct (2,671 interactions) and MINT (2,463 interactions) report only experimentally detected interactions for this reference. BioGRID reports 6,295 interactions for this study, of which 2,594 quote Y2H as the detection method. These also overlap with the interactions reported by the other databases for this reference. The remaining 3,895 interactions quote affinity capture as the detection method and possibly refer to the literature-curated interactions.
For a number of other publications, differences can be explained by different confidence sets or thresholds [27, 28] or differences in the application of the matrix or spokes model. Often, no obvious reason for different numbers of interactions could be found.
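A comparison of this kind can be sketched in Python as follows: interactions are grouped by publication and database, and publications cited by at least two databases are flagged when the databases report different numbers of unique interactions. The record layout (database, publication identifier, protein A, protein B), the function name and the toy values are assumptions made for illustration only.

```python
from collections import defaultdict


def discrepant_publications(records):
    """Find publications whose interaction counts differ between databases.

    records: iterable of (database, publication_id, protein_a, protein_b).
    Returns (number of publications shared by >= 2 databases,
             {publication_id: {database: unique interaction count}}).
    """
    per_pub = defaultdict(lambda: defaultdict(set))
    for database, pub_id, a, b in records:
        # frozenset ignores the order in which the two partners are listed
        per_pub[pub_id][database].add(frozenset((a, b)))

    shared, discrepant = 0, {}
    for pub_id, by_db in per_pub.items():
        if len(by_db) < 2:
            continue  # cited by a single database only
        shared += 1
        counts = {db: len(pairs) for db, pairs in by_db.items()}
        if len(set(counts.values())) > 1:
            discrepant[pub_id] = counts
    return shared, discrepant


# Hypothetical records; database names and identifiers are placeholders.
toy_records = [
    ("db1", "pub1", "A", "B"), ("db1", "pub1", "A", "C"),
    ("db2", "pub1", "B", "A"),
    ("db1", "pub2", "C", "D"), ("db2", "pub2", "D", "C"),
    ("db1", "pub3", "E", "F"),
]
n_shared, differing = discrepant_publications(toy_records)
print(n_shared, differing)
```

Equal counts do not guarantee identical interactions, so a stricter variant would compare the sets of pairs themselves once identifiers have been mapped to a common namespace.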
Integration of PPI data
Integration of data from the different databases is not trivial. Although many databases provide their interactions in the Proteomics Standards Initiative-Molecular Interaction (PSI-MI) format, its controlled vocabulary is often not used or is used incorrectly. Furthermore, a variety of different gene or protein identifiers are used, even within some of the databases. Although a gene can give rise to several different proteins (due to alternative splicing), we mapped all identifiers to Ensembl gene identifiers to avoid any ambiguities. This procedure is based on mapping tables obtained from UniProt. Only interactions in which both proteins could be mapped to an Ensembl gene identifier were considered for further analysis.
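The mapping step can be illustrated with a short Python sketch, assuming that the mapping table has already been loaded into a dictionary from source identifier to gene identifier. The function name and all identifiers below are hypothetical placeholders, not real UniProt or Ensembl accessions, and the sketch is not the authors' own script.

```python
def map_interactions(interactions, id_to_gene):
    """Map both partners to gene identifiers; drop pairs that cannot be mapped.

    interactions: iterable of (id_a, id_b) in any source namespace.
    id_to_gene:   dict from source identifier to gene identifier
                  (assumed to be built from a mapping table beforehand).
    Returns (set of unordered gene pairs, number of dropped interactions).
    """
    mapped, dropped = set(), 0
    for id_a, id_b in interactions:
        gene_a, gene_b = id_to_gene.get(id_a), id_to_gene.get(id_b)
        if gene_a is None or gene_b is None:
            dropped += 1  # at least one partner has no gene identifier
            continue
        mapped.add(frozenset((gene_a, gene_b)))
    return mapped, dropped


# Placeholder identifiers: two isoforms (ISO_A1, ISO_A2) of the same gene.
toy_mapping = {"ISO_A1": "GENE_A", "ISO_A2": "GENE_A",
               "PROT_B": "GENE_B", "PROT_C": "GENE_C"}
toy_pairs = [("ISO_A1", "PROT_B"), ("ISO_A2", "PROT_B"),
             ("PROT_B", "ISO_A1"), ("PROT_C", "PROT_X")]
gene_pairs, n_dropped = map_interactions(toy_pairs, toy_mapping)
print(gene_pairs, n_dropped)
```

Mapping to the gene level merges interactions reported for different isoforms of the same gene, which removes ambiguity at the cost of isoform-specific detail.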
Redundancy of PPIs.
As mentioned above, databases focus their curation efforts on different publications. Consequently, only a subset of all protein interactions can be found in more than one database (Table 2). These range from 42 per cent of yeast interactions and 51 per cent of human interactions to 72 per cent of fly interactions and 86 per cent of worm interactions.
Overlap of human PPIs between databases (relative PPI overlap).
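Once all interactions share one identifier namespace, redundancy and pairwise overlap reduce to set operations. The Python sketch below computes the fraction of unique interactions found in more than one database, and a pairwise overlap relative to the smaller of two datasets. The latter definition is an assumption made for this illustration, since the exact measure is not stated in the text; the database names and interactions are toy placeholders.

```python
from collections import Counter


def pair(a, b):
    """Unordered interaction between two gene identifiers."""
    return frozenset((a, b))


def redundancy(datasets):
    """Fraction of all unique interactions found in more than one database."""
    counts = Counter(p for pairs in datasets.values() for p in pairs)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)


def relative_overlap(set_a, set_b):
    """Shared interactions relative to the smaller set (assumed definition)."""
    smaller = min(len(set_a), len(set_b))
    return len(set_a & set_b) / smaller if smaller else 0.0


# Hypothetical datasets keyed by placeholder database names.
toy_datasets = {
    "db1": {pair("A", "B"), pair("A", "C"), pair("B", "C")},
    "db2": {pair("A", "B"), pair("B", "C"), pair("C", "D")},
    "db3": {pair("B", "A")},
}
print(redundancy(toy_datasets))
print(relative_overlap(toy_datasets["db1"], toy_datasets["db2"]))
```

Normalising by the smaller dataset keeps a small database that is fully contained in a larger one at an overlap of 1.0; normalising by the union instead would penalise differences in database size.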
None of the existing PPI databases provides an exhaustive dataset. Therefore, some groups have set up meta-databases that provide protein interaction data extracted and integrated from other databases. Currently, one of the most comprehensive meta-databases appears to be the Agile Protein Interaction Data Analyzer (APID). APID extracts interactions from the six databases described above, mapping all proteins to UniProt identifiers. Via a web interface, the user can query for proteins of interest. APID references the database from which an interaction is derived and provides the related information available in the original database, such as the detection method and the publication identifier. In addition, APID incorporates biological information from various other databases, such as the Gene Ontology and Pfam databases. Unfortunately, a download of the complete dataset is currently not possible due to licensing issues. APID is generally in good agreement with the results of the authors' data integration. For the time being, APID seems a good source of interactome data.
Several other meta-databases exist, but these usually focus on a single organism or incorporate various other types of interactions, such as computationally predicted protein interactions and co-citation of proteins. For a comprehensive list of available databases, the reader is referred to the Pathguide.
PPI databases not only report their data in different ways, using different ontologies, but their curators also report different PPIs when examining the same publication. In addition, all databases include different publications. It is therefore not surprising that every database reports different PPIs. The pairwise overlap among databases analysed here reaches up to 75 per cent, but always falls short of a perfect 100 per cent. Similar results were obtained in related studies [12, 24]. Until a data exchange between databases is implemented, a comprehensive set of interactions can only be obtained through data integration of several databases. Meta-databases, such as APID, provide access to more comprehensive datasets, but do not always allow the download of their complete data. Furthermore, by their very nature, meta-databases will always be less up to date than the original databases.
PPI databases have improved greatly over the past couple of years, and important issues, such as data exchange, are currently being addressed by some of the databases described here. An important step towards increasing the number and quality of protein interaction data would be to introduce a submission requirement, as already exists for sequence and microarray data. Sequence and microarray data have to be submitted to public databases prior to publication in a scientific journal, which ensures data availability and consistent annotation, and enables researchers to utilise the data with greatest efficiency.
Acknowledgements
The authors would like to thank all developers and curators of the protein-protein interaction databases. Without their effort, our lives would be much harder. We thank Henning Hermjakob for helpful discussions. We are grateful for funding from the British Council/DAAD as part of the ARC programme (ARC1297).
References
1. Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature. 1989, 340: 245-246. 10.1038/340245a0.
2. Rigaut G, Shevchenko A, Rutz B, et al: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol. 1999, 17: 1030-1032. 10.1038/13732.
3. Gavin AC, Aloy P, Grandi P, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.
4. Bouwmeester T, Bauch A, Ruffner H, et al: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol. 2004, 6: 97-105. 10.1038/ncb1086.
5. Gavin AC, Bosche M, Krause R, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.
6. Berggård T, Linse S, James P: Methods for the detection and analysis of protein-protein interactions. Proteomics. 2007, 7: 2833-2842. 10.1002/pmic.200700131.
7. Phizicky EM, Fields S: Protein-protein interactions: Methods for detection and analysis. Microbiol Rev. 1995, 59: 94-123.
8. Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol. 2007, 3: e42. 10.1371/journal.pcbi.0030042.
9. Suderman M, Hallett M: Tools for visually exploring biological networks. Bioinformatics. 2007, 23: 2651-2659. 10.1093/bioinformatics/btm401.
10. Cline MS, Smoot M, Cerami E, et al: Integration of biological networks and gene expression data using Cytoscape. Nat Protocols. 2007, 2: 2366-2382. 10.1038/nprot.2007.324.
11. Albert R, Barabasi AL: Statistical mechanics of complex networks. Rev Mod Phys. 2002, 74: 47-97. 10.1103/RevModPhys.74.47.
12. Futschik ME, Chaurasia G, Herzel H: Comparison of human protein-protein interaction maps. Bioinformatics. 2007, 23: 605-611. 10.1093/bioinformatics/btl683.
13. Huber W, Carey V, Long L, et al: Graphs in molecular biology. BMC Bioinformatics. 2007, 8: S8.
14. Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88.
15. von Mering C, Krause R, Snel B, et al: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417: 399-403.
16. Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol. 2000, 18: 1257-1261. 10.1038/82360.
17. Kerrien S, Orchard S, Montecchi-Palazzi L, et al: Broadening the horizon - Level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 2007, 5: 44. 10.1186/1741-7007-5-44.
18. Stark C, Breitkreutz BJ, Reguly T, et al: BioGRID: A general repository for interaction datasets. Nucl Acids Res. 2006, 34: D535-D539. 10.1093/nar/gkj109.
19. Zanzoni A, Montecchi-Palazzi L, Quondam M, et al: MINT: A Molecular INTeraction database. FEBS Lett. 2002, 513: 135-140. 10.1016/S0014-5793(01)03293-8.
20. Bader GD, Donaldson I, Wolting C, et al: BIND - The Biomolecular Interaction Network Database. Nucl Acids Res. 2001, 29: 242-245. 10.1093/nar/29.1.242.
21. Xenarios I, Rice DW, Salwinski L, et al: DIP: The Database of Interacting Proteins. Nucl Acids Res. 2000, 28: 289-291. 10.1093/nar/28.1.289.
22. Hermjakob H, Montecchi-Palazzi L, Lewington C, et al: IntAct: An open source molecular interaction database. Nucl Acids Res. 2004, 32: D452-D455. 10.1093/nar/gkh052.
23. Peri S, Navarro JD, Amanchy R, et al: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, 13: 2363-2371. 10.1101/gr.1680803.
24. Cusick ME, Hu H, Smolyar A, et al: Literature-curated protein interaction datasets. Nat Meth. 2009, 6: 39-46. 10.1038/nmeth.1284.
25. Giot L, Bader JS, Brouwer C, et al: A protein interaction map of Drosophila melanogaster. Science. 2003, 302: 1727-1736.
26. Rual J-F, Venkatesan K, Hao T, et al: Towards a proteome-scale map of the human protein-protein interaction network. Nature. 2005, 437: 1173-1178. 10.1038/nature04209.
27. John PM, Russell SL, Asa BH, et al: Large-scale identification of yeast integral membrane protein interactions. Proc Natl Acad Sci USA. 2005, 102: 12123-12128. 10.1073/pnas.0505482102.
28. Formstecher E, Aresta S, Collura V, et al: Protein interaction mapping: A Drosophila case study. Genome Res. 2005, 15: 376-384. 10.1101/gr.2659105.
29. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucl Acids Res. 2008, 36: D190-D195. 10.1093/nar/gkn141.
30. Prieto C, De Las Rivas J: APID: Agile Protein Interaction Data Analyzer. Nucl Acids Res. 2006, 34: W298-W302. 10.1093/nar/gkl128.
31. Ashburner M, Ball CA, Blake JA, et al: Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
32. Finn RD, Tate J, Mistry J, et al: The Pfam protein families database. Nucl Acids Res. 2008, 36: D281-D288. 10.1093/nar/gkn226.
33. Chaurasia G, Iqbal Y, Hanig C, et al: UniHI: An entry gate to the human protein interactome. Nucl Acids Res. 2007, 35: D590-D594. 10.1093/nar/gkl817.
34. Jensen LJ, Kuhn M, Stark M, et al: STRING 8 - A global view on proteins and their functional interactions in 630 organisms. Nucl Acids Res. 2009, 37: D412-D416. 10.1093/nar/gkn760.
35. Bader GD, Cary MP, Sander C: Pathguide: A pathway resource list. Nucl Acids Res. 2006, 34: D504-D506. 10.1093/nar/gkj126.