Protein-protein interaction databases: keeping up with growing interactomes

Over the past few years, the number of known protein-protein interactions has increased substantially. To make this information more readily available, a number of publicly available databases have set out to collect and store protein-protein interaction data. Protein-protein interactions have been retrieved from six major databases, integrated and the results compared. The six databases (the Biological General Repository for Interaction Datasets [BioGRID], the Molecular INTeraction database [MINT], the Biomolecular Interaction Network Database [BIND], the Database of Interacting Proteins [DIP], the IntAct molecular interaction database [IntAct] and the Human Protein Reference Database [HPRD]) differ in scope and content; integration of all datasets is non-trivial owing to differences in data annotation. With respect to human protein-protein interaction data, HPRD seems to be the most comprehensive. To obtain a complete dataset, however, interactions from all six databases have to be combined. To overcome this limitation, meta-databases such as the Agile Protein Interaction Database (APID) offer access to integrated protein-protein interaction datasets, although these also currently have certain restrictions.

The nature of protein-protein interaction data
Proteins do not act independently but in a network of complex molecular interactions. Therefore, it is important to identify physical interactions between proteins. Different experimental techniques have been developed to measure physical interactions between proteins; these methods vary considerably, not least in terms of the data they produce.
To give some examples, two widely used methods adapted for high-throughput approaches are the yeast two-hybrid (Y2H) system 1 and affinity purification followed by mass spectrometry (AP-MS). 2 The Y2H system assays whether two proteins physically interact with each other (Figure 1). Genetically modified yeast strains are used to express a 'bait' and a 'prey' protein, which, if they interact, trigger the expression of a reporter gene. The method has been used for large-scale screening studies of a variety of model organisms, including yeast, fly and humans.
In an AP-MS experiment, a protein of interest is fused to a protein fragment (the 'tag'), which allows its purification (Figure 2). This tagged protein is expressed and then purified from the cell extract using the tag, for example by antibodies that bind specifically to the tag. Proteins binding the tagged protein are co-purified and subsequently identified by MS. The most widely used variation of the AP-MS method is tandem affinity purification followed by mass spectrometry (TAP-MS). In TAP-MS, the protein of interest is attached to a larger protein tag, which allows two consecutive affinity purification steps. 2 Large-scale TAP-MS experiments have been performed for yeast and human proteins. 3-5
Currently, several variations of these two methods, as well as a number of other methods, are used to identify protein-protein interactions (PPIs). 6-8
PPI datasets are often visualised as graphs. 9,10 Proteins are represented as nodes, and interactions as connections between nodes. For example, if the interaction between two proteins is detected by a Y2H experiment, we represent this physical interaction by an undirected connection between the two nodes. In a more detailed representation, we could distinguish between bait and prey proteins and represent the interaction by a directed connection, with an arrow pointing from bait to prey. Describing the results of AP-MS protein interaction screens as graphs is not always as straightforward as for Y2H data. Because an AP-MS experiment identifies a whole protein complex rather than pairwise interactions, its results can be represented as a graph using either the matrix or the spokes model (Figure 3). The matrix model assumes that all proteins of a purified complex interact; therefore, in the graph each protein is connected to every other protein.
The spokes model assumes no additional interactions between proteins in a complex other than between the tagged protein and each co-purified protein.
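The difference between the two models can be made concrete with a small sketch. The function names and the example purification (a bait F with three co-purified proteins) are illustrative assumptions, not taken from any of the databases discussed here.

```python
from itertools import combinations


def spokes_model(bait, preys):
    """Connect the tagged (bait) protein to each co-purified protein only."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_model(bait, preys):
    """Connect every pair of proteins found in the purified complex."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait F co-purifies with proteins A, B and C.
spokes = spokes_model("F", ["A", "B", "C"])
matrix = matrix_model("F", ["A", "B", "C"])
print(len(spokes), len(matrix))  # 3 6
```

For a complex of n proteins, the spokes model yields n - 1 binary interactions, whereas the matrix model yields n(n - 1)/2, so the choice of model alone can change the number of interactions derived from one publication.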
Graph representation allows the data to be analysed using a graph-theoretical framework. Many graph analysis algorithms have been applied to PPI datasets; these approaches have been reviewed in detail elsewhere. 11-16
Figure 2. An affinity purification experiment followed by mass spectrometry. The protein of interest, F (red circle), is fused to a protein fragment, the 'tag' (red rectangle). The tag allows this protein to be purified biochemically. Proteins binding to the tagged protein (blue) are co-purified, whereas proteins not binding to protein F (yellow) are discarded. The purified proteins can be released using enzymatic cleavage (scissors) or other methods, depending on the nature of the tag. These proteins are then identified by mass spectrometry.

PPI databases
The primary resources for PPI data are individual scientific publications. Several public databases collect published PPI data and provide researchers with access to their curated datasets. For each interaction, they usually reference the original publication and the experimental method used to detect it. Database designers choose to represent these data in different ways, and the wide spectrum of experimental methods makes it difficult to design a single data model that captures all necessary experimental detail. To overcome this problem, the International Molecular Exchange (IMEx; http://imex.sourceforge.net/) consortium was formed. IMEx aims to enable the exchange of data and to avoid duplication of the curation effort. To that end, an XML-based proteomics standard, termed the proteomics standards initiative-molecular interaction (PSI-MI) format, has been developed. 17 At the time of writing, however, no data had yet been exchanged, and it was therefore necessary to combine PPI data from all available databases using the authors' own scripts to obtain as comprehensive a network as possible.
Here, the focus is on six databases: the Biological General Repository for Interaction Datasets (BioGRID), 18 the Molecular INTeraction database (MINT), 19 the Biomolecular Interaction Network Database (BIND), 20 the Database of Interacting Proteins (DIP), 21 the IntAct molecular interaction database (IntAct) 22 and the Human Protein Reference Database (HPRD) 23 (see Table 1). These databases report only experimentally verified interactions.
DIP, IntAct and MINT are active members of the IMEx initiative; the curation accuracy of these three databases was assessed recently by Cusick et al. 24 HPRD focuses entirely on human proteins, providing not only information on protein interactions, but also a variety of protein-specific information, such as post-translational modifications, disease associations and enzyme-substrate relationships. One of the first interaction databases, BIND, initiated in 2001 by the University of Toronto and the University of British Columbia, is part of the Biomolecular Object Network Databank (BOND) and was subsequently acquired by the company Thomson Reuters.
The following comparison is based on complete sets of binary interactions that were downloaded from the individual databases in May 2008. IntAct and MINT derive binary interactions from protein complexes using the spokes model. No other database provided any information on which model is applied. Only 'physical interactions' are considered here, although most databases also provide 'genetic interactions', that is, pairs of non-essential genes that lead to a non-viable phenotype if they are knocked out simultaneously. Furthermore, interactions were only accepted if a publication identifier was provided along with the interacting proteins. Currently, the most comprehensive database in terms of individual interactions is IntAct, with almost 130,000 unique interactions from up to 131 different organisms. Despite these large numbers, it cites only about 3,000 different publications. Whereas IntAct seems to concentrate on high-throughput studies, HPRD also takes into account small-scale publications. Although restricted to human proteins, it reports over 36,000 unique interactions from more than 18,000 publications. Only BioGRID cites a similar number of publications (16,369); it is also the second largest database in terms of the number of unique interactions. It should be noted that the databases examine publications in different depth, and that a higher number of publications does not necessarily imply a higher curation effort.
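The two acceptance criteria described above (physical interactions only, and a publication identifier present) can be sketched as a simple filter. The record layout and field names (`a`, `b`, `type`, `pubmed_id`) are hypothetical; each database uses its own export format, so a real script would first need a parser per database.

```python
def keep_interaction(record):
    """Accept only physical interactions that cite a publication."""
    return record["type"] == "physical" and bool(record.get("pubmed_id"))


# Hypothetical records; protein and publication labels are placeholders.
records = [
    {"a": "P1", "b": "P2", "type": "physical", "pubmed_id": "pub1"},
    {"a": "P2", "b": "P1", "type": "physical", "pubmed_id": "pub2"},  # same pair
    {"a": "P1", "b": "P3", "type": "genetic", "pubmed_id": "pub1"},   # rejected
    {"a": "P3", "b": "P4", "type": "physical", "pubmed_id": None},    # rejected
]

# An unordered pair counts once, however often it is reported.
unique = {frozenset((r["a"], r["b"])) for r in records if keep_interaction(r)}
print(len(unique))  # 1
```

Storing each interaction as an unordered pair is one way to count 'unique interactions'; it deliberately ignores the bait and prey roles.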
The majority of known protein interactions involve proteins from Saccharomyces cerevisiae and Homo sapiens. Individual high-throughput interaction screens were carried out for some other organisms; these high-throughput studies usually account for the majority of all known interactions in the corresponding organism. By contrast, known protein interactions for S. cerevisiae and H. sapiens are dispersed over numerous publications. For this reason, the number of interactions for humans and yeast can vary considerably between different databases, depending on their coverage of the literature.

Differences between the PPI databases
Ideally, every database would extract the same interactions from a given publication. Unfortunately, this is not the case. Of the 14,899 publications shared by at least two databases, 5,782 (39 per cent) were reported with a different number of interactions in different databases. For example, for the publication reporting the most interactions, 25 a minimum of 18,877 (BIND) and a maximum of 20,800 interactions (DIP) were reported. According to the abstract, the number of interactions is 20,405, which, again, is different from the number reported by all five databases that cite this publication. In this case, the variation is presumably due to problems with identifier mapping. Many databases use different identifiers, which do not always map in a perfect one-to-one relationship to the originally published identifiers. BioGRID (20,220 interactions) uses the original gene identifiers, but still lacks 185 interactions.
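A comparison of this kind can be sketched as follows: group interactions by publication and database, keep publications shared by at least two databases, and flag those with differing interaction counts. The tuple layout and the database and publication labels are assumptions made for illustration; this is not the authors' script.

```python
from collections import defaultdict


def count_discrepancies(records):
    """records: iterable of (database, publication, protein_a, protein_b)."""
    pairs = defaultdict(lambda: defaultdict(set))
    for db, pub, a, b in records:
        pairs[pub][db].add(frozenset((a, b)))
    # Publications cited by at least two databases, with counts per database.
    shared = {
        pub: {db: len(found) for db, found in dbs.items()}
        for pub, dbs in pairs.items()
        if len(dbs) >= 2
    }
    # Shared publications for which the databases disagree on the count.
    differing = {
        pub: counts for pub, counts in shared.items()
        if len(set(counts.values())) > 1
    }
    return shared, differing


# Hypothetical records from two databases.
records = [
    ("DB1", "pub1", "A", "B"), ("DB1", "pub1", "A", "C"),
    ("DB2", "pub1", "A", "B"),
    ("DB1", "pub2", "C", "D"), ("DB2", "pub2", "D", "C"),
    ("DB1", "pub3", "E", "F"),  # cited by one database only
]
shared, differing = count_discrepancies(records)
print(sorted(shared), sorted(differing))  # ['pub1', 'pub2'] ['pub1']

# The proportion quoted in the text: 5,782 of 14,899 shared publications.
print(round(100 * 5782 / 14899))  # 39
```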
As a second example, using a Y2H screen, Rual et al. detected 2,754 interactions between human proteins. 26 The authors compared their experimental findings with a literature-curated PPI network of 4,076 interactions. This resulted in a combined network of 6,438 interactions. HPRD (2,371 interactions), IntAct (2,671 interactions) and MINT (2,463 interactions) report only experimentally detected interactions for this reference. BioGRID reports 6,295 interactions for this study, of which 2,594 quote Y2H as the detection method. These also overlap with the interactions reported by the other databases for this reference. The remaining 3,895 interactions quote affinity capture as the detection method and possibly refer to the literature-curated interactions.
For a number of other publications, differences can be explained by different confidence sets or thresholds 27,28 or differences in the application of the matrix or spokes model. Often, no obvious reason for different numbers of interactions could be found.

Integration of PPI data
Integration of data from the different databases is not trivial. Although many databases provide their interactions in the proteomics standards initiative-molecular interaction (PSI-MI) format, its controlled vocabulary is often not used or is used incorrectly. Furthermore, a variety of different gene or protein identifiers are used, even within some of the databases. Although a gene can give rise to several different proteins (due to alternative splicing), we mapped all identifiers to Ensembl gene identifiers to avoid any ambiguities. This procedure is based on mapping tables obtained from UniProt. 29 Only interactions in which both proteins could be mapped to an Ensembl gene identifier were considered for further analysis.
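The mapping step can be sketched as below. The mapping table is assumed to be a plain dictionary from a database identifier to a gene identifier; the identifiers shown are placeholders, not real UniProt or Ensembl accessions, and a real table may map one identifier to several genes, which this sketch does not handle.

```python
def map_interactions(interactions, to_gene):
    """Map both partners to gene identifiers; drop pairs that cannot be mapped."""
    mapped = set()
    for a, b in interactions:
        gene_a, gene_b = to_gene.get(a), to_gene.get(b)
        if gene_a is None or gene_b is None:
            continue  # both partners must map, as described in the text
        mapped.add(frozenset((gene_a, gene_b)))
    return mapped


# Placeholder mapping table: two isoforms of one gene, plus one other protein.
to_gene = {"protX-1": "geneX", "protX-2": "geneX", "protY": "geneY"}
interactions = [
    ("protX-1", "protY"),
    ("protX-2", "protY"),   # collapses onto the same gene pair
    ("protY", "unknown"),   # dropped: no mapping for one partner
]
gene_level = map_interactions(interactions, to_gene)
print(len(gene_level))  # 1
```

Collapsing isoforms onto one gene removes ambiguity between identifier schemes at the cost of isoform-specific detail, which is the trade-off described above.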
After unifying all identifiers for eukaryotic organisms, the four model organisms Caenorhabditis elegans, Drosophila melanogaster, S. cerevisiae and H. sapiens showed the highest number of interactions (Table 2). The focus here has been on PPIs in eukaryotes, but the reader should note that high-throughput datasets also exist for a variety of prokaryotes, including Escherichia coli, Campylobacter jejuni and Helicobacter pylori. Previous studies reported little overlap between individual PPI datasets. 15 Likewise, there is little redundancy in the combined set of interactions (Table 2). Between 1 per cent (D. melanogaster) and 18 per cent (H. sapiens) of all interactions are reported by more than one publication. Interestingly, the proportion of interactions that were reported by different methods reaches up to 25 per cent for yeast and 42 per cent for humans (Table 2). Although many small-scale publications apply more than one method to confirm an interaction, this number is most likely an overestimate, because databases use different nomenclature and spelling variations to describe experimental detection methods. Therefore, more interactions appear to be confirmed by several methods than really are.
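The redundancy figures are proportions of interactions supported by more than one publication, method or database. A minimal sketch follows; the evidence tuples and labels are hypothetical, and the method-name normalisation (lower-casing and treating hyphens as spaces) is only one possible way to reduce the spelling problem described above, not a procedure taken from the text.

```python
from collections import defaultdict


def normalise(method):
    """Crude normalisation of a detection-method label (an assumption)."""
    return " ".join(method.lower().replace("-", " ").split())


def redundancy(evidence):
    """evidence: iterable of (protein_a, protein_b, publication, method, database)."""
    pubs, methods, dbs = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b, pub, method, db in evidence:
        pair = frozenset((a, b))
        pubs[pair].add(pub)
        methods[pair].add(normalise(method))
        dbs[pair].add(db)
    total = len(pubs)

    def share(index):
        # Fraction of interactions with more than one distinct supporting value.
        return sum(len(values) > 1 for values in index.values()) / total

    return {"publications": share(pubs), "methods": share(methods),
            "databases": share(dbs)}


# Hypothetical evidence for four interactions.
evidence = [
    ("A", "B", "pub1", "two hybrid", "DB1"),
    ("B", "A", "pub2", "Two-Hybrid", "DB2"),        # same pair, same method
    ("A", "C", "pub1", "two hybrid", "DB1"),
    ("C", "D", "pub3", "affinity capture", "DB1"),
    ("C", "D", "pub3", "two hybrid", "DB2"),
    ("D", "E", "pub4", "two hybrid", "DB2"),
]
result = redundancy(evidence)
print(result)  # {'publications': 0.25, 'methods': 0.25, 'databases': 0.5}
```

Without the normalisation step, the pair A-B would count as confirmed by two methods, which illustrates how spelling variants inflate the multi-method proportion.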
As mentioned above, databases focus their curation efforts on different publications. Consequently, only a subset of all protein interactions can be found in more than one database (Table 2). These range from 42 per cent of yeast interactions and 51 per cent of human interactions to 72 per cent of fly interactions and 86 per cent of worm interactions.
To assess these differences in more detail, the relative pairwise overlap of human protein interactions between databases was calculated (Table 3). All databases have their highest relative overlap when compared with HPRD, which reports the most interactions. High overlaps were also found between DIP and BioGRID (55 per cent) and between MINT and IntAct (59 per cent). Even the largest database (HPRD), however, covers only two-thirds of all reported human protein interactions.
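The text does not define how the relative overlap is normalised. The sketch below assumes that the overlap of database A with database B is the fraction of A's interactions that also occur in B, which makes the measure asymmetric; the database names and interactions are placeholders.

```python
def relative_overlap(interactions_by_db):
    """Return {(A, B): fraction of A's interactions also found in B}."""
    table = {}
    for name_a, set_a in interactions_by_db.items():
        for name_b, set_b in interactions_by_db.items():
            if name_a != name_b and set_a:
                table[(name_a, name_b)] = len(set_a & set_b) / len(set_a)
    return table


def pair(a, b):
    """Unordered interaction between two proteins."""
    return frozenset((a, b))


# Placeholder databases: every interaction in 'small' is also in 'large'.
interactions_by_db = {
    "small": {pair("A", "B"), pair("A", "C")},
    "large": {pair("A", "B"), pair("A", "C"), pair("C", "D"), pair("D", "E")},
}
overlap = relative_overlap(interactions_by_db)
print(overlap[("small", "large")], overlap[("large", "small")])  # 1.0 0.5
```

Under this assumed normalisation, a small database can overlap almost completely with a large one while the reverse overlap stays low, which would be consistent with every database showing its highest overlap against the largest one.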

Meta-databases
None of the existing PPI databases provides an exhaustive dataset. Therefore, some groups have set up meta-databases that provide protein interaction data extracted and integrated from other databases. Currently, one of the most comprehensive meta-databases appears to be the Agile Protein Interaction Database (APID). 30 APID extracts interactions from the six databases described above, mapping all proteins to UniProt identifiers. 29 Via a web interface, the user can query for proteins of interest. APID references the database from which an interaction is derived and provides the related information available in the original database, such as the detection method and the publication identifier. In addition, APID incorporates biological information from various other databases, such as the Gene Ontology 31 and Pfam databases. 32 Unfortunately, a download of the complete dataset is currently not possible due to licensing issues. APID is generally in good agreement with the results of the authors' data integration. For the time being, APID seems a good source of interactome data. Several other meta-databases exist, but these usually focus on a single organism 33 or incorporate various other types of interactions, such as computationally predicted protein interactions and co-citation of proteins. 34 For a comprehensive list of available databases, the reader is referred to the Pathguide. 35
Table 2. Redundancy of PPIs. The total number of proteins and interactions (that could be mapped to Ensembl gene identifiers), as well as the number of interactions reported by more than one publication, more than one method or more than one database, is shown. Relative numbers were obtained through normalisation with the total number of interactions.

Conclusions
PPI databases not only report their data in different ways, using different ontologies, but their curators also report different PPIs when examining the same publication.
In addition, all databases include different publications. It is therefore not surprising that every database reports different PPIs. The pairwise overlap among databases analysed here reaches up to 75 per cent, but always falls short of a perfect 100 per cent. Similar results were obtained in related studies. 12,24 Until a data exchange between databases is implemented, a comprehensive set of interactions can only be obtained through data integration of several databases. Meta-databases, such as APID, provide access to more comprehensive datasets, but do not always allow the download of their complete data. Furthermore, by their very nature, meta-databases will always be less up to date than the original databases.
PPI databases have improved greatly over the past couple of years, and important issues, such as data exchange, are currently being addressed by some of the databases described here. An important step towards increasing the quantity and quality of protein interaction data would be to introduce a submission requirement, as already exists for sequence and microarray data. These data have to be submitted to public databases prior to publication in a scientific journal, which ensures data availability and consistent annotation, and enables researchers to use the data efficiently.