
What the papers say: Text mining for genomics and systems biology


Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.


The scientific literature provides an important source of knowledge generated by the research community; it does not become defunct five years after publication and it is not just something to promote the authors' careers. While large amounts of data relating to biological systems are stored in public repositories, an even larger amount can be found in a semi-structured form in the literature (see Figure 1). This knowledge is potentially very useful in a variety of genomics and systems biology contexts [1]. For example, manually curated and literature-derived protein-protein interaction data-sets are typically used as gold standards by the systems biology community and it is standard practice to extract parameters for mechanistic models from the literature.

Figure 1

Biology is becoming a data-driven science, with an exponential growth in the number of papers being published, increasing numbers of databases indexed in the Nucleic Acids Research (NAR) database collection and an exponential growth in the number of base pairs stored in GenBank.

Manual curation lacks the scalability to deal with the ever-increasing numbers of papers being published [2, 3] and suffers from inter-annotator disagreement: different curators may interpret a piece of text in different ways. This means that a single paper needs to be annotated at least twice if the reliability of the proposed annotations is to be quantified at all. The increase in the numbers of papers being published also means that it is becoming harder for researchers to stay up to date with the relevant literature in their field. This has an impact on their ability to generate meaningful and testable hypotheses, with some even suggesting that this is becoming a bottleneck in the scientific discovery process [4].

These issues have motivated a sustained interest in the application of text mining (TM) techniques by both the industrial [5] and academic [6] communities to address some of these problems. TM refers to the process of extracting information encoded in text by authors through the use of techniques from a variety of fields such as information retrieval (IR), machine learning (ML), natural language processing (NLP), statistics and computational linguistics (CL) [7]. The use of these techniques leads to a decrease in the time and effort required to extract information from a paper, speeding up curation [8] and also providing novel opportunities for hypothesis generation using the literature. We feel that, in the context of human genomics, this is particularly promising: an increasing number of studies report rare disease-causing variants and, in order to annotate such variants, assess their functional relevance or link them to existing clinical information, TM approaches will increase in importance as an enabling technology for biomedical research.

Text mining can be thought of as a method by which a systematic review can be performed. As with all methods for reviewing the existing literature, however, there are several biases. Due to copyright issues, only a relatively small number of papers are available for full-text mining and so most work is restricted to abstracts and titles, which are freely available from MEDLINE (only 30 per cent of curated protein-protein interactions (PPIs) can be found in the abstracts rather than the full text [9]). This does mean that extracted information is subject to a selection bias, although of a different form to that seen in manual curation (where only a subset of full text papers are curated). Neither manual curation nor TM techniques can deal with the inherent publication bias in the literature. Publication bias [10] refers to the fact that only positive findings (rather than negative or no findings) tend to get reported in the literature, and certain topics and/or genes tend to be reported more when they are in vogue. There is also evidence that PPI networks derived from the literature [11, 12] are subject to ascertainment bias. This occurs when sampling is non-random and conclusions about the population are made based solely on this distorted group of frequently studied proteins. Conclusions about networks that are generated in the presence of ascertainment bias can dramatically change once the necessary corrections have been made [13]. Despite these biases, the literature is still extremely important to researchers as a method for communicating results and ideas and for testing and generating hypotheses.

Below, we give an overview of the current status of TM methodology. Some technical detail is required in order to appreciate fully the potential of this methodology, as well as its (current and future) limitations. At its worst, TM will be an exercise in high-throughput 'stamp collecting'; at best, it opens up the possibility of distilling vast amounts of published information into concrete hypotheses and functional insights into genomics and systems biology.


In order to extract knowledge from text, named entities (NEs) must first be recognised; these NEs are then normalised to identifiers and any relationships between them are identified (see Figure 2). Biological NEs correspond to classes such as genes, proteins, cell lines, species, compounds, phenotypes, diseases, etc. Named entity recognition (NER) refers to the problem of labelling both the location (start, end) and the semantic class/type of a NE in text, and normalisation refers to the process of mapping a NE to a unique identifier (or set of identifiers). Following NER and normalisation it is useful to determine if a real relationship exists between two or more NEs, as well as the type of relationship. Simply identifying that NEs occur together in a contiguous block of text does hint at the existence of some form of relationship; however, this relationship may be completely speculative, or the text may state that a relationship between the NEs does not exist [14]. In biological research papers, two entities can co-occur for many reasons, including functional, physical, syntenic and evolutionary relationships. The performance of TM systems is often evaluated using precision and recall metrics against manually curated gold standard corpora. Precision can be interpreted as the probability that a randomly selected result is a true positive and is calculated as the number of true positives obtained over the sum of true positives and false positives. Recall can be intuitively interpreted as the probability that a randomly selected positive result is correctly identified; it can be calculated as the number of true positives divided by the number of items that should be found (the sum of true positives and false negatives).
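As a concrete illustration of these two metrics, a minimal sketch in Python; the entity spans and types are invented for the example:

```python
# Toy evaluation of a text-mining system against a gold-standard corpus.
# Annotations are (start, end, type) spans; the values below are illustrative.

def precision_recall(predicted, gold):
    """Compute precision and recall over sets of predicted/gold annotations."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # true positives: found and correct
    fp = len(predicted - gold)   # false positives: found but wrong
    fn = len(gold - predicted)   # false negatives: missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall

gold = {(0, 4, "gene"), (10, 14, "gene"), (20, 27, "species")}
pred = {(0, 4, "gene"), (10, 14, "protein"), (20, 27, "species")}
p, r = precision_recall(pred, gold)
print(p, r)  # one span has the wrong type, so precision = recall = 2/3
```

Note that a span tagged with the wrong semantic class counts as both a false positive and a false negative, which is why strict span-and-type matching gives lower scores than span-only matching.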

Figure 2

An example of a text-mining pipeline. Given a sentence from a paper (A), named entities (NEs) are extracted (green for species entities, red for protein/gene entities, blue for relationship cues) (B); these entities are then normalised to a corresponding identifier scheme (C); and relationships between entities extracted (D). The final result in this case is a network which explicitly encodes the semantic relationships between NEs found in the sentence. Text taken from PMID:14613582.


Biology is a dynamic and ever-expanding research area. This means that there are millions of entity names in use, with new ones constantly being created (eg through genome annotation and drug development). Neologisms are prevalent in the literature; it has been jokingly commented: 'Scientists would rather share each other's underwear than use each other's nomenclature' (Keith Yamamoto). Biological NER thus tends to be more difficult than NER tasks in other domains (eg newswires) due to the variability of biological nomenclatures [15, 16]. A single gene can have many synonyms (eg P53, TP53 and TRP53 all refer to the same gene). Gene names are subject to morphological (eg transcription factor, transcriptional factor), orthographic (eg nuclear factor [NF] kappa B, NF κB), combinatorial (eg homologue of actin, actin homologue) and inflectional variation (eg antibody, antibodies). The HUGO Gene Nomenclature Committee (HGNC) was created with the aim of assigning a unique gene symbol to every gene; however, not all genes have yet been assigned a name and there are still problems with gene names mentioned in the past literature. Gene names can overlap with other names relating to different entity types in the biological domain, as well as with words that are found in everyday language. It is often difficult to disambiguate similar entity classes, as they can have similar contexts and morphologies. For example, a simple heuristic for determining whether a term refers to a gene or protein is that proteins begin with an upper-case letter (PspA) and genes begin with a lower-case letter (pspA). This pattern is, however, not maintained consistently in scientific writing, and humans show substantial disagreement on this task,[17] with an average pair-wise agreement among three annotators of 77.58 per cent.
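The case heuristic just mentioned can be written down in a couple of lines. This is a deliberately naive sketch and, as noted above, real scientific writing violates the convention often enough that it should not be relied upon:

```python
# The (unreliable) case heuristic: proteins written with an initial capital
# (PspA), genes in lower case (pspA). A sketch only; humans themselves agree
# on this distinction barely three-quarters of the time.

def case_heuristic(mention: str) -> str:
    return "protein" if mention[0].isupper() else "gene"

for m in ["pspA", "PspA"]:
    print(m, "->", case_heuristic(m))
```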

The Drosophila melanogaster literature is probably the best example of the problems that exist regarding nomenclatures. Some Drosophila genes are named after their associated phenotype, such as eyeless or fruity, which leads to difficulties in disambiguating whether it is the phenotype or the gene that is being described. Gene names such as Not and That also exist, which are homonymous with common English words (see Table 1). Some gene names are multi-word names, such as Mind the gap and IL-2 receptor. In the last case, problems detecting the correct boundary may lead to the entity being tagged as IL-2, which completely alters the meaning of the entity [18].

Table 1 Table of linguistic terms

A variety of methods have been proposed for biological NER (see Table 2), with only a small portion freely available for download or publicly accessible via web servers/services. These tools fall into four main categories: dictionary-based, rule-based/pattern-based, machine-learning and hybrid systems (and combinations of these approaches). Most research in this area has concentrated on recognising gene and protein mentions; however, there has also been some work on identifying cell lines, chemicals and species. Competitions such as NLPBA [19] and BioCreative [20] are held in order to evaluate NER methods for gene mention recognition.

Table 2 Some freely available software for NLP tasks in the biological domain

Dictionary-based methods [21] work by matching text against a fixed dictionary of entity names. The performance of these methods is highly dependent on both the coverage of the dictionary and the performance of matching techniques used. Use of a simple text-matching algorithm will lead to a large number of false positives being found because of the overlap between dictionary words and common English, as well as some false negatives due to misspellings not present in the dictionary. Gene names which lead to false positives are typically filtered out of dictionaries. Most systems that are based on this method either use an approximate method of string matching [22] or expand the dictionary by generating spelling variants [23, 24]. These methods tend to lead to an increase in recall accompanied by a decrease in precision. In some cases, dictionary-based NER methods can perform normalisation at the same time [25].
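A minimal sketch of dictionary-based tagging with variant expansion, assuming a tiny hand-made lexicon; the names, identifiers and variant rules here are illustrative, not those of any published system:

```python
import re

# Dictionary-based NER sketch: expand each lexicon entry into simple
# orthographic variants (hyphenation, Greek letters), then match exactly.
# Lexicon and identifiers are invented for illustration.

LEXICON = {"NF kappa B": "GeneID:4790", "TP53": "GeneID:7157"}

def variants(name):
    """Generate simple spelling variants of a dictionary entry."""
    out = {name}
    out.add(name.replace(" ", "-"))            # NF kappa B -> NF-kappa-B
    out.add(name.replace("kappa", "\u03ba"))   # kappa -> κ
    out.add(name.replace(" kappa ", "\u03ba")) # NF kappa B -> NFκB
    return out

# Expanded dictionary: every variant maps to the canonical identifier.
EXPANDED = {v: ident for name, ident in LEXICON.items() for v in variants(name)}

def tag(text):
    hits = []
    for variant, ident in EXPANDED.items():
        for m in re.finditer(re.escape(variant), text):
            hits.append((m.start(), m.end(), ident))
    return hits

print(tag("Activation of NF-kappa-B was observed"))
```

Each added variant increases recall (more surface forms are matched) but risks precision, exactly the trade-off described above; real systems therefore also filter out entries that collide with common English words.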

Rule-based methods [26] use orthographic and morpho-syntactic features of NEs (capital letters, numbers, symbols and affixes) and their surrounding words to generate patterns and rules. Biochemical suffixes such as -ase and -in are very useful in indicating possible protein names and so a simple rule would be to tag words with these features as proteins. These systems incorporate expert knowledge easily and the rules generated are human readable and easily extendable. Rule-based techniques are able to reach high levels of precision but at the expense of recall, as they are not robust against unseen names. This is mainly because there are so many potential surface grammatical variations (active, passive voice) and it is not feasible to develop robust patterns for all of these.
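A caricature of such a suffix rule in Python; the pattern and exception list are invented for illustration, and real rule-based systems use far richer rule sets:

```python
import re

# Rule-based NER sketch: flag tokens with biochemical suffixes (-ase, -in)
# as candidate protein names. Common words such as "within" show why such
# rules need hand-maintained exception lists.

PROTEIN_SUFFIX = re.compile(r"^[A-Za-z]{3,}(ase|in)$")
STOPWORDS = {"within", "again", "certain", "increase", "protein"}

def rule_tag(tokens):
    return [t for t in tokens
            if PROTEIN_SUFFIX.match(t) and t.lower() not in STOPWORDS]

print(rule_tag("the kinase phosphorylates actin within the cell".split()))
# -> ['kinase', 'actin']
```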

Machine learning (ML) methods tend to achieve the highest performance for NER. All of the top ten performing methods in the BioCreative II gene mention task (BCII GM) used a machine-learning component. ML methods use training data in the form of a manually annotated gold standard corpus and learn features that are useful in identifying NEs in text. The performance of the methods used in NER can be very sensitive to feature selection, although this is not always the case [27]. NER can be viewed as either a classification or a sequence-labelling problem. Classification approaches normally consider NER as assigning a class to a bag of features. These features include surface clues and morpho-syntactic features of NEs and their adjacent words. These methods do not tend to take the order of features into account and support only binary classifications. Sequence labelling approaches deduce the most probable sequence of tags for a given sequence of words. Each token is assigned a tag by calculating the most likely label for the current token, given both the features of that token and the previous history of tag assignments. The performance of any ML tagger will be biased by the size, inter-annotator agreement and topic structure of the corpus (see Table 3).
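The sequence-labelling view can be sketched with BIO tags and Viterbi decoding over hand-set scores. In a real system the emission and transition scores would be learned from an annotated corpus; every number below is invented for illustration:

```python
# Toy sequence labeller: BIO tags (B-GENE, I-GENE, O) assigned by Viterbi
# decoding. "Emission" scores come from crude surface features (digits,
# capitals); "transition" scores encode tag-order constraints such as
# I-GENE never following O. All scores are hand-set, not learned.

TAGS = ["O", "B-GENE", "I-GENE"]

def emission(token, tag):
    looks_like_gene = any(c.isdigit() for c in token) or token.isupper()
    if tag == "O":
        return 0.1 if looks_like_gene else 0.9
    return 0.9 if looks_like_gene else 0.1

TRANSITION = {   # previous tag -> score for each next tag
    None:     {"O": 0.8, "B-GENE": 0.2, "I-GENE": 0.0},
    "O":      {"O": 0.7, "B-GENE": 0.3, "I-GENE": 0.0},
    "B-GENE": {"O": 0.4, "B-GENE": 0.1, "I-GENE": 0.5},
    "I-GENE": {"O": 0.5, "B-GENE": 0.1, "I-GENE": 0.4},
}

def viterbi(tokens):
    # best[t] = (score, path) over tag sequences ending in tag t
    best = {t: (TRANSITION[None][t] * emission(tokens[0], t), [t]) for t in TAGS}
    for tok in tokens[1:]:
        best = {
            t: max(
                ((s * TRANSITION[prev][t] * emission(tok, t), path + [t])
                 for prev, (s, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in TAGS
        }
    return max(best.values(), key=lambda x: x[0])[1]

tokens = "mutations in TP53 cause cancer".split()
print(list(zip(tokens, viterbi(tokens))))
```

The decoder picks the globally most probable tag sequence rather than tagging each token in isolation, which is how sequence labelling differs from the bag-of-features classification view described above.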

Table 3 Freely available corpora for training and evaluating text-mining tools in the biological domain

Determining the correct class of an NE is complicated by the ubiquitous use of abbreviations and acronyms in biomedical research. Liu et al.[42] found that 81.2 per cent of acronyms in MEDLINE are ambiguous (eg the acronym NF can refer to 61 different full forms [43]). ML methods have been proposed for abbreviation disambiguation,[44] with some work focusing on abbreviations found in the biological literature [43, 45].
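One common first step in abbreviation handling is pairing a parenthesised short form with the preceding words whose initials it matches. The sketch below is a much-reduced, illustrative version of this idea, not the algorithm used by any of the cited systems:

```python
import re

# Abbreviation-definition sketch: find "long form (SF)" and accept the pair
# if the short form's letters match the initials of the preceding words.
# Deliberately simplified: real methods handle inner letters, digits and
# nested parentheses, and then disambiguate short forms without definitions.

def find_definition(sentence):
    m = re.search(r"\(([A-Z][A-Za-z0-9]{1,9})\)", sentence)
    if not m:
        return None
    short = m.group(1)
    words = sentence[: m.start()].split()
    # take as many preceding words as there are characters in the short form
    candidate = words[-len(short):]
    if all(w[0].lower() == c.lower() for w, c in zip(candidate, short)):
        return short, " ".join(candidate)
    return None

print(find_definition("Mice lacking nuclear factor (NF) showed inflammation"))
```

The harder problem, as the figures above show, is the acronym that appears with no definition at all, which is where the ML disambiguation methods cited here come in.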

It is not just gene names that are difficult to identify; the identification of species mentions is also troublesome. Species names can be homonymous with common English words (eg 'honesty' for Lunaria annua and 'bears' for Ursidae) but also with important entities in the biological domain (eg cancer and hippocampus). The performance of a dictionary-based tagging system is again limited by the lack of coverage, widespread use of acronyms and frequent misspelling of species names. Standard dictionaries of species names, such as the National Center for Biotechnology Information (NCBI) Taxonomy, are incomplete, given the amazing diversity of life. They do, however, contain names for most well-studied organisms. Rule-based methods [46] have been developed which are capable of identifying species terms using rules designed for matching Linnaean binomial nomenclature. The recently published LINNAEUS [34] system uses a dictionary and a set of regular expressions to identify species mentions in text. This system allows both the identification and normalisation of species names, features an acronym disambiguation component and achieves high performance on its own corpus.
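A rule for binomial matching might look like the following sketch; the regular expression is illustrative and over-matches (any Capitalised-then-lowercase word pair), and is not the one used by any published system:

```python
import re

# Sketch of a Linnaean-binomial rule: a capitalised (possibly abbreviated)
# genus followed by a lower-case species epithet. Over-matches ordinary
# sentence-initial phrases, so real systems combine such rules with
# dictionaries and acronym handling.

BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

text = ("Caenorhabditis elegans and C. elegans studies "
        "of Drosophila melanogaster")
print(BINOMIAL.findall(text))
```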

Cell lines are widely used in biological and biomedical research as a platform for functional studies and to validate biomarkers. It is useful to identify cell line mentions as they can aid in identifying experimental techniques/conditions and in determining the species to which other entity types belong during normalisation. A recent analysis of the cell line nomenclature [47] revealed that it, too, is blighted by ambiguity and variability. Several NER taggers can identify cell line mentions in text, although none has yet been designed specifically for this task. Recently, integrating information from different sources has led to the creation of a cell line knowledge base (CLKB). This work represents the start of efforts to create a lexicon of cell line names; it is incomplete, however, so dictionary-based techniques may still miss cell line mentions. As with other subsets of biological nomenclature, there is vertical polysemy (see Table 1) with other NE classes (see Figure 3).

Figure 3

(A) HUman Natural Killer; (B) Large piece of something without definite shape; (C) A well-built, sexually attractive man; (D) Hormonally Upregulated Neu-associated Kinase. Demonstration of the possible problems due to the biological nomenclature, given the sentence HUNK is associated with expression of Frizzled-2: HUNK could refer to a cell type, a protein and two common English words. While, in biological text, it is highly probable that (B) and (C) will not be relevant, it is not so easy to disambiguate (A) and (D). This is an example of the problems posed by polysemy (a word or phrase having multiple meanings), homonymity with common English words and the use of abbreviations in the literature [18].

Entity normalisation

Normalisation of NEs allows the results of text mining to be used in tasks like manual curation,[50] knowledge summarisation [51] and model construction and validation [52, 53]. The standard method of normalisation is to compare an NE against a dictionary of synonyms and identifiers and assign the matching identifier. In some domains this approach can achieve extremely good performance; however, the variability and ambiguity of biological nomenclature make it largely ineffective for biological entities. Exact text matching against a dictionary is flawed because the term may be a variant not present in the list of synonyms; moreover, the genomic nomenclature is highly ambiguous, in that one gene name can map to multiple canonical identifiers. Rule-based approaches [54] try to normalise terms by applying a set of transformations to a tagged entity in order to make it match a term in a lexicon. String similarity metrics [55] have been used with some success [56] to match terms which are not present in the original lexicon.
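A sketch of similarity-based normalisation using Python's standard library, with an invented two-entry lexicon standing in for a real synonym resource:

```python
import difflib

# Normalisation sketch: exact dictionary lookup first, then fall back to the
# closest synonym by string similarity. Lexicon entries and identifiers are
# invented for illustration.

LEXICON = {
    "tumour protein p53": "HGNC:11998",
    "transcription factor NF-kappa-B": "GeneID:4790",
}

def normalise(mention, cutoff=0.8):
    if mention in LEXICON:                      # exact match first
        return LEXICON[mention]
    close = difflib.get_close_matches(mention, LEXICON, n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else None

print(normalise("tumor protein p53"))  # US spelling variant still resolves
```

The cutoff controls the recall/precision trade-off discussed above: lowering it recovers more spelling variants at the cost of matching unrelated names.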

Due to the ambiguity in biological nomenclatures (Figure 4), it is important to disambiguate between multiple identifiers. Several approaches have been proposed in order to deal with this problem: rule-based, ML or hybrid. Rule-based approaches [57] use various heuristics to try to assign scores to identifiers. The creation of bags of words associated with specific identifiers (known as semantic profiles) has been useful for disambiguation. These profiles are created by extracting information from various genomic knowledge sources such as UniProt, GO and Entrez. These can then be used to train a classifier to distinguish the correct identifier from incorrect ones [58]. Knowledge of paper co-authorship has been found to be useful in identifier disambiguation,[59] based on the idea that an author uses gene names consistently across all of their publications or may work on a specific set of genes consistently.
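The semantic-profile idea can be sketched as a cosine comparison between a mention's context and per-identifier bags of words. The profiles below are invented; in practice they are harvested from resources such as UniProt, GO and Entrez:

```python
from collections import Counter
import math

# Disambiguation with semantic profiles: each candidate identifier carries a
# bag of words, and the identifier whose profile is most similar (cosine) to
# the mention's context wins. Profiles here are invented for illustration.

PROFILES = {
    "GeneID:4790":  Counter("transcription factor inflammation immune signalling".split()),
    "MeSH:D009369": Counter("tumour cancer neoplasm growth".split()),
}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(context_words):
    ctx = Counter(context_words)
    return max(PROFILES, key=lambda ident: cosine(ctx, PROFILES[ident]))

print(disambiguate("NF role in immune signalling and inflammation".split()))
```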

Figure 4

The genomic nomenclature is highly ambiguous. The plot shows the rank of a gene name against the total number of times that the gene name is found in BioThesaurus. The inset shows the same analysis restricted to human genes. The plot is in log-log coordinates. Both graphs show Zipf-like (discrete power-law) distributions. BioThesaurus is a collection of gene names mapped to Entrez Gene/UniProt identifiers across approximately 7,000 species.

It is not just the proteomic and genomic nomenclatures that pose problems for normalisation. While the precise Linnaean binomial name for an organism is unambiguous, it may not be the case for its abbreviated form. Caenorhabditis elegans is commonly abbreviated to C. elegans; however, 49 other species have a name that can be abbreviated to this short form. Due to the widespread use of Caenorhabditis elegans as a model organism, the majority of mentions of C. elegans would probably normalise to NCBI Taxonomy identifier 6239 but this heuristic will have exceptions. Another problem with species normalisation is dealing with the abundance of different strains, particularly among microorganisms. It is important to disambiguate the strains if possible, as genes' functional properties can vary between strains.

Good results for normalising human gene names have been reported. The BCII GN task [60] evaluated performance against a manually annotated gold standard corpus. Overall results were promising, with a combined recall of 97.2 per cent (entries from over 20 teams). This evaluation assumed that the species was human, however. Normalisation for other species continues to be a challenge and has not been helped by the decision made at the 22nd International Society for Animal Genetics (in August 1990) that animal gene names should 'follow the rules for human gene nomenclature, including the use of identical symbols for homologous genes and the reservation of human symbols for as yet unidentified animal genes' [15]. This interspecies ambiguity of the genomic nomenclature means that identifying the correct species for a given mention is an important subtask of gene normalisation, although it has only recently begun to be considered [61].

Relation extraction

Identifying the existence and type of relationships between entities is difficult because of the numerous ways that a relationship can be proposed. A binding relationship between two proteins could, for example, be described in at least three ways:

(1) APPL binds Akt2

(2) Binding of Akt2 by APPL

(3) Binding between Akt2 and APPL

Relationships between two entities can be described over multiple sentences, which can lead to complications, as anaphors need to be identified and resolved (eg APPL is later referred to as this protein in a piece of text). This limits the recall of relation extraction approaches that work at the sentence level only. The relationship type that has attracted the most effort is extracting PPIs.
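The three phrasings above can be captured by hand-written patterns. This sketch works at the sentence level only, so the cross-sentence and anaphoric cases just described are out of reach by construction; entity names are matched with a deliberately crude pattern rather than a proper NER step:

```python
import re

# Pattern-based PPI extraction covering the three phrasings of "binding".
# ENTITY is a crude stand-in for real NER output.

ENTITY = r"([A-Za-z][A-Za-z0-9-]*)"
PATTERNS = [
    re.compile(ENTITY + r"\s+binds\s+" + ENTITY),
    re.compile(r"[Bb]inding\s+of\s+" + ENTITY + r"\s+by\s+" + ENTITY),
    re.compile(r"[Bb]inding\s+between\s+" + ENTITY + r"\s+and\s+" + ENTITY),
]

def extract_binding(sentence):
    for i, pat in enumerate(PATTERNS):
        m = pat.search(sentence)
        if m:
            a, b = m.groups()
            # "binding of X by Y" names the agent second, so reverse the pair
            return (b, a) if i == 1 else (a, b)
    return None

for s in ["APPL binds Akt2",
          "Binding of Akt2 by APPL",
          "Binding between Akt2 and APPL"]:
    print(s, "->", extract_binding(s))
```

The brittleness is the point: every further surface variation (passive voice, intervening clauses, coordination) needs another pattern, which is why pattern-based recall saturates quickly.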

A number of different approaches have been proposed to perform this task based on linguistic, rule-based and ML methods. Rule-based methods use a set of syntactic patterns which specify how an interaction is described. The patterns can be manually or automatically generated. RelEX [62] applies a simple set of rules to a representation of the dependencies between words in a sentence called a dependency graph. RLIMS-P [63] is a rule-based approach specifically designed to extract information about protein phosphorylation sites, and performs well compared with manually curated literature sets. Some ML methods treat a sentence as a sequence of words or tokens and completely ignore its syntactic structure. These approaches do not achieve good performance compared with methods which take sentence structure into account. It is clearly important to consider both contextual and linguistic features,[64, 65] such as interaction keywords and verbs,[66] to extract relationships with good precision.

To complicate matters further, authors frequently speculate about potential relationships (eg APPL may interact with Akt2). Such statements do not assert that a relationship exists, only that it is proposed to exist. It is important to identify these speculative statements [67] and prevent them from biasing any downstream analyses. For the same reason, it is equally important to detect the negation of relationships [68] (eg APPL does not interact with Akt2).
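The simplest possible stand-in for the speculation and negation classifiers cited above is cue-word matching, sketched below; the cue lists are illustrative and far from complete:

```python
import re

# Cue-based flagging of speculation and negation. Real systems also resolve
# the scope of each cue; this sketch flags whole sentences only.

SPECULATION = re.compile(r"\b(may|might|could|suggest(s|ed)?|possibl[ey])\b", re.I)
NEGATION = re.compile(r"\b(not|no|fail(s|ed)? to|neither)\b", re.I)

def assertion_status(sentence):
    if NEGATION.search(sentence):
        return "negated"
    if SPECULATION.search(sentence):
        return "speculative"
    return "asserted"

for s in ["APPL interacts with Akt2",
          "APPL may interact with Akt2",
          "APPL does not interact with Akt2"]:
    print(s, "->", assertion_status(s))
```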

Hypothesis generation

The scientific literature not only contains explicit knowledge, such as 'APPL interacts with Akt2', but also implicit knowledge,[69] such as hidden refutations or qualifications, inferences from transitive relations, hidden or unrecognised analogies and the accumulation of weak tests (which could be used in meta-analyses). Swanson's serendipitous discovery of the connection between Raynaud's disease and fish oil [70] is an example of performing an inference on a transitive relation to generate a novel and testable hypothesis. By reading two disjoint sets of literature (no articles are in common, and the articles in one set do not cite or mention articles in the other set), he observed that blood factors were a common theme in both the Raynaud's disease and the fish-oil literature. This led him to propose that fish oil could be used in the treatment of Raynaud's disease, and the relationship was clinically validated in 1989 [71]. The discovery led Swanson to propose that 'new hypotheses can emerge and scientific discovery can be anticipated or stimulated through the investigation of complementary but disjoint literatures'. This method of literature-based discovery is commonly referred to as Swanson's ABC model or Swanson Linking, with the hypotheses and new knowledge being described as undiscovered public knowledge. Although the model has mainly been used within the biomedical and biological fields it has also been applied to the humanities literature and the WWW (see Table 4).
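Swanson's ABC model can be sketched as a search over term co-occurrence sets drawn from two disjoint literatures; the toy 'literatures' below are invented pairs, not mined data:

```python
# Sketch of Swanson's ABC model: A (a disease) co-occurs with B-terms in one
# literature, B-terms co-occur with C-terms in another, and A-C pairs never
# mentioned together become candidate hypotheses. Co-occurrence sets are
# invented for illustration.

raynaud_lit = {("raynaud", "blood viscosity"), ("raynaud", "vasoconstriction")}
fish_oil_lit = {("blood viscosity", "fish oil"), ("platelet aggregation", "fish oil")}

def abc_hypotheses(lit_ab, lit_bc):
    a_to_b = {}
    for a, b in lit_ab:
        a_to_b.setdefault(a, set()).add(b)
    b_to_c = {}
    for b, c in lit_bc:
        b_to_c.setdefault(b, set()).add(c)
    known = lit_ab | lit_bc
    hypotheses = set()
    for a, bs in a_to_b.items():
        for b in bs & set(b_to_c):        # B-terms shared by both literatures
            for c in b_to_c[b]:
                if (a, c) not in known:   # A and C never co-mentioned
                    hypotheses.add((a, b, c))  # keep B as the linking concept
    return hypotheses

print(abc_hypotheses(raynaud_lit, fish_oil_lit))
```

Even this toy version makes the scaling problem visible: with realistic vocabularies the number of candidate A-B-C chains explodes, so ranking and filtering the intermediate B-terms is where most of the research effort lies.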

Table 4 Summary of hypotheses generated using Swanson's ABC model and its extensions

Mendeleev's discovery of the law of periodicity and the development of the periodic table can be considered an early example of literature-based discovery (LBD), as it was: 'a direct outcome of the stock of generalisations and established facts which has accumulated by the end of the decade 1860-1870.' The information required to build the table of elements had already been published, but it had never been analysed as a whole [72]. More recently, Hettne et al.[73] combined TM with network analysis in order to generate new mechanistic hypotheses relating to the complex regional pain syndrome (CRPS). NF-κB was identified as potentially being involved by first extracting genes relating to CRPS from the literature and then investigating potential links between these genes which were not mentioned in the CRPS literature. This hypothesis has led to several new ideas regarding the aetiology of the disease and the proposal of a novel drug target. By exploiting the context of protein mentions, van Haagen et al.[74] were able to predict a novel interaction between CAPN3 and PARVB. Integrating information extracted from the literature with microarray experiments has led to the proposition of a relationship between SIP and the invasiveness of glioblastoma cell lines [75]. All of this work shows the potential for TM to generate testable hypotheses for use in biology.

Hypothesis generation is challenging even to humans, however. Automating this process, or formulating it in such a way that a computer can quickly generate testable scientific propositions, is a non-trivial and daunting task. Only if the universe of potential hypotheses is sufficiently simple for search or enumeration approaches to cover all potential cases is this currently feasible. We feel that the most promising strategies in the short term include the search for suitable heuristics or iterative procedures involving infrequent human input.


TM tools offer a way to retrieve the pertinent information contained within the mass of scientific literature, make it easier to explore [88] and allow the generation of novel insights into existing data, all in an automated fashion. While TM is currently noisy and imperfect, it should be remembered that, due to inter-annotator disagreement, manual curation is too. TM is not just restricted to extracting functional information; it has also been used to identify best practices within the phylogenetics domain,[89] to generate priors for network reconstruction using Bayesian networks [90] and to aid in protein structure comparison and assignment of function [91]. Recently, TM has shown the greatest potential when used in data fusion style approaches. By using information extracted from the literature, Raychaudhuri et al.[92] were able to develop a method to better distinguish between genomic regions associated with disease and false-positive regions. Ten out of 13 single nucleotide polymorphisms (SNPs) identified by their method as being associated with Crohn's disease were later validated by follow-up genotyping. STRING [93] integrates many different types of evidence about PPIs, including literature co-occurrence, phylogenetic data and results from high-throughput experiments, and has been used to predict novel PPIs in other organisms by transferring annotations to orthologous protein pairs. While there is a significant body of work on applying TM to the biological domain, however, there still remain many challenges in areas like relation extraction, species disambiguation and hypothesis generation.

Systems biology and genomics deal with large data models of unprecedented complexity; TM allows us to draw on the published literature in a disciplined manner to inform the development of quantitative models. We expect TM to become an important addition to the systems biologist's toolkit, complementing existing techniques like comparative and primary data analysis. We hope to have demonstrated the use and limitations of TM in its current guise. Being aware of the limitations, however, should enable the community to develop and adopt protocols that allow for easier, more reliable analysis of published research outputs from these tools. This is important not only for researchers, but also for publishers, funding bodies and regulators. These three players have, of course, different but, crucially, not competing interests as far as accessibility of information is concerned. Regulators, in particular, irrespective of whether or not they are engaged in accrediting new drugs or nutritional supplements or the granting of patents, stand to benefit profoundly from information that is provided in an electronically accessible and unambiguous fashion.


  1. Ananiadou S, Kell D, Tsujii J: Text mining and its potential applications in systems biology. Trends Biotechnol. 2006, 24: 571-579. 10.1016/j.tibtech.2006.10.002.
  2. Baumgartner WA, Cohen KB, Fox LM, Acquaah-Mensah G, et al: Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007, 23: i41-i48. 10.1093/bioinformatics/btm229.
  3. Winnenburg R, Wächter T, Plake C, Doms A, et al: Facts from text: Can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform. 2008, 9: 466-478. 10.1093/bib/bbn043.
  4. Ng S, Wong M: Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Inform. 1999, 10: 104-112.
  5. Agarwal P, Searls DB: Literature mining in support of drug discovery. Brief Bioinform. 2008, 9: 479-492. 10.1093/bib/bbn035.
  6. Rzhetsky A, Seringhaus M, Gerstein M: Seeking a new biology through text mining. Cell. 2008, 134: 9-13. 10.1016/j.cell.2008.06.029.
  7. Hearst M: Untangling text data mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. 1999, 3-10.
  8. Deshpande N, Fink J, Bourne P, Cohen K: Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pac Symp Biocomput. 2008, 640-651.
  9. Blaschke C: Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comp Funct Genomics. 2001, 2: 196-206. 10.1002/cfg.91.
  10. Knight J: Negative results: Null and void. Nature. 2003, 422: 554-555. 10.1038/422554a.
  11. Pfeiffer T, Hoffmann R: Temporal patterns of genes in scientific publications. Proc Natl Acad Sci USA. 2007, 104: 12052-12056.
  12. Lehne B, Schlitt T: Protein-protein interaction databases: Keeping up with growing interactomes. Hum Genomics. 2009, 3: 291-297.
  13. Dickerson J, Pinney J, Robertson D: The biological context of HIV-1 host interactions reveals subtle insights into a system hijack. BMC Syst Biol. 2010, 4: 80. 10.1186/1752-0509-4-80.
  14. Jenssen T, Lægreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28: 21-28.
  15. Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics. 2005, 21: 248-256. 10.1093/bioinformatics/bth496.
  16. Mons B: Which gene did you mean? BMC Bioinform. 2005, 6: 142. 10.1186/1471-2105-6-142.
  17. Hatzivassiloglou V, Duboue PA, Rzhetsky A: Disambiguating proteins, genes, and RNA in text: A machine learning approach. Bioinformatics. 2001, 17: S97-S106. 10.1093/bioinformatics/17.suppl_1.S97.
  18. Barnes J: Conceptual biology: A semantic issue and more. Nature. 2002, 417: 587-588.
  19. Kim J, Ohta T, Tsuruoka Y, Tateisi Y, et al: Introduction to the bio-entity recognition task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004, 70-75.
  20. Smith L, Tanabe LK, Johnson R, Kuo CJ, et al: Overview of BioCreative II gene mention recognition. Genome Biol. 2008, 9: S2.
  21. Liu H, Hu ZZ, Torii M, Wu C, et al: Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc. 2006, 13: 497-507. 10.1197/jamia.M2085.
  22. Tsuruoka Y, McNaught J, Tsujii J, Ananiadou S: Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics. 2007, 23: 2768-2774. 10.1093/bioinformatics/btm393.
  23. Schuemie M, Mons B, Weeber M, Kors J: Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification. J Biomed Inform. 2007, 40: 316-324. 10.1016/j.jbi.2006.09.002.
  24. Tsuruoka Y: Probabilistic term variant generator for biomedical terms. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2003, 167-173.
  25. Fundel K, Güttler D, Zimmer R, Apostolakis J: A simple approach for protein name identification: Prospects and limits. BMC Bioinform. 2005, 6 (Suppl 1): S15. 10.1186/1471-2105-6-S1-S15.
  26. Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P: Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics. 2003, 19: 135-143. 10.1093/bioinformatics/19.1.135.
  27. Hakenberg J, Bickel S, Plake C, Brefeld U, et al: Systematic feature evaluation for gene name recognition. BMC Bioinform. 2005, 6: S9.
  28. Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text. Bioinformatics. 2002, 18: 1124-1132. 10.1093/bioinformatics/18.8.1124.
  29. Settles B: ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005, 21: 3191-3192. 10.1093/bioinformatics/bti475.
  30. Sætre R, Sagae K, Tsujii J: Syntactic features for protein-protein interaction extraction. Proceedings of the 2nd International Symposium on Languages in Biology and Medicine. 2007, 6.1-6.14.
  31. Leaman R, Gonzalez G: BANNER: An executable survey of advances in biomedical named entity recognition. Pac Symp Biocomput. 2008, 652-663.
  32. Tsuruoka Y, Tateishi Y, Kim J, Ohta T, et al: Developing a robust part-of-speech tagger for biomedical text. Proceedings of the Panhellenic Conference on Informatics. 2005, 3746: 382-392.
  33. Airola A, Pyysalo S, Björne J, Pahikkala T, et al: All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinform. 2008, 9: S2.
  34. Gerner M, Nenadic G, Bergman CM: LINNAEUS: A species name identification system for biomedical literature. BMC Bioinform. 2010, 11: 85. 10.1186/1471-2105-11-85.
  35. Hahn U, Buyko E, Landefeld R: An overview of JCoRe, the JULIE lab UIMA component repository. Proceedings of the LREC'08 Workshop Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP. 2008, 1-7.
  36. Smith L, Rindflesch T, Wilbur W: MedPost: A part-of-speech tagger for biomedical text. Bioinformatics. 2004, 20: 2320-2321. 10.1093/bioinformatics/bth227.
  37. Mika S, Rost B: NLProt: Extracting protein names and sequences from papers. Nucleic Acids Res. 2004, 32: W634-W637. 10.1093/nar/gkh427.
  38. Hunter L, Lu Z, Firby J, Baumgartner WA, et al: OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinform. 2008, 9: 78. 10.1186/1471-2105-9-78.
  39. Corbett P, Murray-Rust P: High-throughput identification of chemistry in life science texts. Proceedings of the 2nd International Symposium on Computational Life Science. 2006, 107-118.
  40. Song Y, Kim E, Lee GG, Yi BK: POSBIOTM-NER: A trainable biomedical named-entity recognition system. Bioinformatics. 2005, 21: 2794-2796. 10.1093/bioinformatics/bti414.
  41. Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, et al: Text processing through Web services: Calling Whatizit. Bioinformatics. 2008, 24: 296-298. 10.1093/bioinformatics/btm557.
  42. Liu H, Aronson AR, Friedman C: A study of abbreviations in MEDLINE abstracts. Proc AMIA Symp. 2002, 464-468.
  43. Okazaki N, Ananiadou S: Building an abbreviation dictionary using a term recognition approach. Bioinformatics. 2006, 22: 3089-3095. 10.1093/bioinformatics/btl534.
  44. Tsuruoka Y, Ananiadou S: A machine learning approach to acronym generation. Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. 2005, 25-31.
  45. Bracewell D, Russell S, Wu A: Identification, expansion, and disambiguation of acronyms in biomedical texts. Lect Notes Comput Sci. 2005, 3759: 186-195. 10.1007/11576259_21.
  46. Koning D, Sarkar I, Moritz T: TaxonGrab: Extracting taxonomic names from text. Biodiversity Inform. 2005, 2: 79-82.
  47. Sarntivijai S, Ade AS, Athey BD, States DJ: A bioinformatics analysis of the cell line nomenclature. Bioinformatics. 2008, 24: 2760-2766. 10.1093/bioinformatics/btn502.
  48. Pyysalo S, Ginter F, Heimonen J, Björne J, et al: BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinform. 2007, 8: 50. 10.1186/1471-2105-8-50.
  49. Wang X, Tsujii J, Ananiadou S: Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics. 2010, 26: 661-667. 10.1093/bioinformatics/btq002.
  50. Alex B, Grover C, Haddow B, Kabadjov M, et al: Assisted curation: Does text mining really help? Pac Symp Biocomput. 2008, 556-567.
  51. Craven M, Kumlien J: Constructing biological knowledge bases by extracting information from text sources. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. 1999, 77-86.
  52. Santos C, Eggle D, States D: Wnt pathway curation using automated natural language processing: Combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics. 2005, 21: 1653-1658. 10.1093/bioinformatics/bti165.
  53. Waagmeester A, Pezik P, Coort S, Tourniaire F, et al: Pathway enrichment based on text mining and its validation on carotenoid and vitamin A metabolism. OMICS. 2009, 13: 367-379. 10.1089/omi.2009.0029.
  54. Lau WW, Johnson CA, Becker KG: Rule-based human gene normalization in biomedical text with confidence estimation. Comput Syst Bioinformatics Conf. 2007, 6: 371-379.
  55. Wang X, Matthews M: Comparing usability of matching techniques for normalising biomedical named entities. Pac Symp Biocomput. 2008, 13: 628-639.
  56. Grover C, Haddow B, Klein E, Matthews M: Adapting a relation extraction pipeline for the BioCreAtIvE II task. Proceedings of the BioCreAtIvE II Workshop. 2007.
  57. Wang X: Rule-based protein term identification with help from automatic species tagging. Proceedings of CICLing. 2007, 288-298.
  58. Crim J, McDonald R, Pereira F: Automatically annotating documents with normalized gene lists. BMC Bioinform. 2005, 6: S13.
  59. Farkas R: The strength of co-authorship in gene name disambiguation. BMC Bioinform. 2008, 9: 69. 10.1186/1471-2105-9-69.
  60. Morgan AA, Lu Z, Wang X, Cohen AM, et al: Overview of BioCreative II gene normalization. Genome Biol. 2008, 9: S3.
  61. Kappeler T, Kaljurand K, Rinaldi F: TX task: Automatic detection of focus organisms in biomedical publications. Proceedings of the Workshop on BioNLP. 2009, 80-88.
  62. Fundel K, Küffner R, Zimmer R: RelEx - Relation extraction using dependency parse trees. Bioinformatics. 2007, 23: 365-371. 10.1093/bioinformatics/btl616.
  63. Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, et al: Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005, 21: 2759-2765. 10.1093/bioinformatics/bti390.
  64. Niu Y, Otasek D, Jurisica I: Evaluation of linguistic features useful in extraction of interactions from PubMed: Application to annotating known, high-throughput and predicted interactions in I2D. Bioinformatics. 2010, 26: 111-119.
  65. Fayruzov T, De Cock M, Cornelis C, Hoste V: Linguistic feature analysis for protein interaction extraction. BMC Bioinform. 2009, 10: 374. 10.1186/1471-2105-10-374.
  66. Hatzivassiloglou V, Weng W: Learning anchor verbs for biological interaction patterns from published text articles. Int J Med Inform. 2002, 67: 19-32. 10.1016/S1386-5056(02)00054-0.
  67. Kilicoglu H, Bergler S: Recognizing speculative language in biomedical research articles: A linguistically motivated perspective. BMC Bioinform. 2008, 9: S10.
  68. Sanchez-Graillet O, Poesio M: Negation of protein-protein interactions: Analysis and extraction. Bioinformatics. 2007, 23: i424-i432. 10.1093/bioinformatics/btm184.
  69. Davies R: The creation of new knowledge by information retrieval and classification. J Doc. 1989, 45: 273-301.
  70. Swanson D: Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986, 30: 7-18.
  71. DiGiacomo R, Kremer J, Shah D: Fish-oil dietary supplementation in patients with Raynaud's phenomenon: A double-blind, controlled, prospective study. Am J Med. 1989, 86: 158-164. 10.1016/0002-9343(89)90261-1.
  72. Murray-Rust P: Data Driven Science - A Scientist's View. NSF/JISC 2007 Digital Repositories Workshop. 2007.
  73. Hettne K, de Mos M, de Bruijn A, Weeber M: Applied information retrieval and multidisciplinary research: New mechanistic hypotheses in complex regional pain syndrome. J Biomed Discov Collab. 2007, 2: 2. 10.1186/1747-5333-2-2.
  74. van Haagen H, 't Hoen P, Bovo AB, de Morrée A, et al: Novel protein-protein interactions inferred from literature context. PLoS ONE. 2009, 4: e7894. 10.1371/journal.pone.0007894.
  75. Natarajan J, Berrar D, Dubitzky W, Hack C, et al: Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinform. 2006, 7: 373. 10.1186/1471-2105-7-373.
  76. Cory K: Discovering hidden analogies in an online humanities database. Comput Hum. 1997, 31: 1-12. 10.1023/A:1000422220677.
  77. Gordon M, Lindsay R, Fan W: Literature-based discovery on the World Wide Web. ACM Trans Internet Technol. 2002, 2: 261-275. 10.1145/604596.604597.
  78. Hristovski D, Peterlin B, Mitchell J, Humphrey S: Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005, 74: 289-298. 10.1016/j.ijmedinf.2004.04.024.
  79. Kostoff R, Briggs M, Lyons T: Literature-related discovery (LRD): Potential treatments for multiple sclerosis. Technol Forecast Soc Change. 2007, 75: 239-255.
  80. Kostoff R: Literature-related discovery (LRD): Potential treatments for cataracts. Technol Forecast Soc Change. 2007, 75: 215-225.
  81. Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics. 2004, 20: i290-i296. 10.1093/bioinformatics/bth914.
  82. Srinivasan P, Libbus B, Sehgal A: Mining MEDLINE: Postulating a beneficial role for curcumin longa in retinal diseases. HLT BioLink. 2004, 33-40.
  83. Swanson D, Smalheiser N, Bookstein A: Information discovery from complementary literatures: Categorizing viruses as potential weapons. J Am Soc Inf Sci Technol. 2001, 52: 797-812. 10.1002/asi.1135.
  84. Weeber M, Vos R, Klein H, de Jong-van den Berg LT, et al: Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. J Am Med Inform Assoc. 2003, 10: 252-259. 10.1197/jamia.M1158.
  85. Wren JD, Bekeredjian R, Stewart JA, Shohet RV, et al: Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics. 2004, 20: 389-398. 10.1093/bioinformatics/btg421.
  86. Wren J: Data-mining analysis suggests an epigenetic pathogenesis for type 2 diabetes. J Biomed Biotechnol. 2005, 2: 104-112.
  87. Zhou X, Liu B, Wu Z, Feng Y: Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks. Artif Intell Med. 2007, 41: 87-104. 10.1016/j.artmed.2007.07.007.
  88. Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics. 2005, 21: ii252-ii258. 10.1093/bioinformatics/bti1142.
  89. Eales J, Pinney J, Stevens R, Robertson D: Methodology capture: Discriminating between the "best" and the rest of community practice. BMC Bioinform. 2008, 9: 359. 10.1186/1471-2105-9-359.
  90. Steele E, Tucker A, 't Hoen P, Schuemie M: Literature-based priors for gene regulatory networks. Bioinformatics. 2009, 25: 1768-1774. 10.1093/bioinformatics/btp277.
  91. MacCallum R, Kelley L, Sternberg M: SAWTED: Structure assignment with text description - enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics. 2000, 16: 125-129. 10.1093/bioinformatics/16.2.125.
  92. Raychaudhuri S, Plenge RM, Rossin EJ, Ng ACY, et al: Identifying relationships among disease regions: Predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet. 2009, 5: e1000534. 10.1371/journal.pgen.1000534.
  93. von Mering C, Jensen L, Snel B, Hooper S: STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005, 33: D433-D437.

Author information
Corresponding author

Correspondence to Michael PH Stumpf.

Cite this article

Harmston, N., Filsell, W. & Stumpf, M.P. What the papers say: Text mining for genomics and systems biology. Hum Genomics 5, 17 (2010).


Keywords

  • data mining
  • systems medicine
  • literature processing
  • hypothesis generation