- Open Access
What the papers say: Text mining for genomics and systems biology
© Henry Stewart Publications 2010
- Received: 6 August 2010
- Accepted: 6 August 2010
- Published: 1 October 2010
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining, the automated extraction of information from (electronically) published sources, could potentially fulfil an important role, but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.
- data mining
- systems medicine
- literature processing
- hypothesis generation
Manual curation lacks the scalability to deal with the ever-increasing numbers of papers being published [2, 3] and suffers from inter-annotator disagreement: different curators may interpret a piece of text in different ways. This means that a single paper needs to be annotated at least twice if the reliability of the proposed annotations is in any way to be calculated. The increase in the numbers of papers being published also means that it is becoming harder for researchers to stay up to date with the relevant literature in their field. This has an impact on their ability to generate meaningful and testable hypotheses, with some even suggesting that this is becoming a bottleneck in the scientific discovery process.
These issues have motivated a sustained interest in the application of text mining (TM) techniques by both the industrial and academic communities to address some of these problems. TM refers to the process of extracting information encoded in text by authors through the use of techniques from a variety of fields such as information retrieval (IR), machine learning (ML), natural language processing (NLP), statistics and computational linguistics (CL). The use of these techniques leads to a decrease in the time and effort required to extract information from a paper, speeding up curation and also providing novel opportunities for hypothesis generation using the literature. We feel that, in the context of human genomics, this is particularly promising: an increasing number of studies report rare disease-causing variants and, in order to annotate such variants, assess their functional relevance or link them to existing clinical information, TM approaches will increase in importance as an enabling technology for biomedical research.
Text mining can be thought of as a method by which a systematic review can be performed. As with all methods for reviewing the existing literature, however, there are several biases. Due to copyright issues, only a relatively small number of papers are available for full-text mining and so most work is restricted to abstracts and titles, which are freely available from MEDLINE (only 30 per cent of curated protein-protein interactions (PPIs) can be found in the abstracts rather than the full text). This does mean that extracted information is subject to a selection bias, although of a different form to that seen in manual curation (where only a subset of full text papers are curated). Neither manual curation nor TM techniques can deal with the inherent publication bias in the literature. Publication bias refers to the fact that only positive findings (rather than negative or no findings) tend to get reported in the literature, and certain topics and/or genes tend to be reported more when they are in vogue. There is also evidence that PPI networks derived from the literature [11, 12] are subject to ascertainment bias. This occurs when sampling is non-random and conclusions about the population are made based solely on this distorted group of frequently studied proteins. Conclusions about networks that are generated in the presence of ascertainment bias can dramatically change once the necessary corrections have been made. Despite these biases, the literature is still extremely important to researchers as a method for communicating results and ideas and for testing and generating hypotheses.
Below, we give an overview of the current status of TM methodology. Some technical detail is required in order to appreciate fully the potential of this methodology, as well as its (current and future) limitations. At its worst, TM will be an exercise in high-throughput 'stamp collecting'; at best, it opens up the possibility of distilling vast amounts of published information into concrete hypotheses and functional insights into genomics and systems biology.
Biology is a dynamic and ever-expanding research area. This means that there are millions of entity names in use, with new ones constantly being created (eg through genome annotation and drug development). Neologisms are prevalent in the literature; it has been jokingly commented: 'Scientists would rather share each other's underwear than use each other's nomenclature' (Keith Yamamoto). Biological NER thus tends to be more difficult than NER tasks in other domains (eg newswires) due to the variability of biological nomenclatures [15, 16]. A single gene can have many synonyms (eg P53, TP53 and TRP53 all refer to the same gene). Gene names are subject to morphological (eg transcription factor, transcriptional factor), orthographic (eg nuclear factor [NF] kappa B, NF κB), combinatorial (eg homologue of actin, actin homologue) and inflectional variation (eg antibody, antibodies). The HUGO Gene Nomenclature Committee (HGNC) was created with the aim of assigning a unique gene symbol to every gene; however, currently, not all genes have been assigned a name and there are still problems with gene names mentioned in the past literature. Gene names can overlap with other names relating to different entity types in the biological domain, as well as with words that are found in everyday language. It is often difficult to disambiguate similar entity classes, as they can have similar contexts and morphologies. For example, a simple heuristic for determining whether a term refers to a gene or protein is that protein names begin with an upper case letter (PspA) and gene names begin with a lower case letter (pspA). This pattern is, however, not maintained consistently in scientific writing, and humans show substantial disagreement on this task, with an average pair-wise agreement among three annotators of 77.58 per cent.
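To make the agreement figure concrete, the following is a minimal sketch of how an average pair-wise agreement among three annotators could be computed. The labels are invented for illustration and are not the data behind the 77.58 per cent figure; the function names are likewise hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotators must label the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def average_pairwise_agreement(annotations):
    """Mean agreement over every pair of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene/protein labels for five mentions (illustrative only).
annotator_1 = ["gene", "protein", "gene", "protein", "gene"]
annotator_2 = ["gene", "protein", "protein", "protein", "gene"]
annotator_3 = ["gene", "gene", "protein", "protein", "gene"]

score = average_pairwise_agreement([annotator_1, annotator_2, annotator_3])
print(f"{score:.2f}")  # 0.73
```

Raw agreement of this kind does not correct for agreement expected by chance, so it should be read as an upper-level summary rather than a full reliability statistic.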
Table of linguistic terms
A word or phrase that refers back to an earlier word or phrase
The coexistence of many possible meanings for a word or phrase
Each of two or more words having the same spelling and pronunciation but different meanings and origins
Relating the meaning in language or logic
The arrangement of words and phrases to create well-formed sentences in a language
Part of speech
One of the traditional categories of words intended to reflect their functions in a
Some freely available software for NLP tasks in the biological domain
Graph Kernel 
Dictionary-based methods work by matching text against a fixed dictionary of entity names. The performance of these methods is highly dependent on both the coverage of the dictionary and the performance of matching techniques used. Use of a simple text-matching algorithm will lead to a large number of false positives being found because of the overlap between dictionary words and common English, as well as some false negatives due to misspellings not present in the dictionary. Gene names which lead to false positives are typically filtered out of dictionaries. Most systems that are based on this method either use an approximate method of string matching or expand the dictionary by generating spelling variants [23, 24]. These methods tend to lead to an increase in recall accompanied by a decrease in precision. In some cases, dictionary-based NER methods can perform normalisation at the same time.
Rule-based methods use orthographic and morpho-syntactic features of NEs (capital letters, numbers, symbols and affixes) and their surrounding words to generate patterns and rules. Biochemical suffixes such as -ase and -in are very useful in indicating possible protein names and so a simple rule would be to tag words with these features as proteins. These systems incorporate expert knowledge easily and the rules generated are human readable and easily extendable. Rule-based techniques are able to reach high levels of precision but at the expense of recall, as they are not robust against unseen names. This is mainly because there are so many potential surface grammatical variations (active, passive voice) and it is not feasible to develop robust patterns for all of these.
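The trade-off described above can be sketched with a toy dictionary tagger. Everything here is an assumption made for illustration: the identifiers (`GENE:1` and so on), the three-entry dictionary, the variant generator and the example sentence are hypothetical and are not taken from any of the cited systems.

```python
import re

# Hypothetical toy dictionary: made-up identifiers mapped to known names.
GENE_DICTIONARY = {
    "GENE:1": ["TP53", "P53", "TRP53"],
    "GENE:2": ["NF kappa B"],
    "GENE:3": ["WAS"],  # a gene symbol that collides with an English word
}


def spelling_variants(name):
    """Return the name plus simple spacing and hyphenation variants."""
    return {name, name.replace(" ", "-"), name.replace(" ", "")}


def build_lookup(dictionary, relaxed):
    """Map surface forms to identifiers; relaxed mode adds variants and folds case."""
    lookup = {}
    for identifier, names in dictionary.items():
        for name in names:
            forms = spelling_variants(name) if relaxed else {name}
            for form in forms:
                lookup[form.lower() if relaxed else form] = identifier
    return lookup


def tag(text, dictionary, relaxed=False):
    """Return the set of (matched text, identifier) pairs found in text."""
    lookup = build_lookup(dictionary, relaxed)
    haystack = text.lower() if relaxed else text
    hits = set()
    for form, identifier in lookup.items():
        # Lookarounds stop a name from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
        for match in re.finditer(pattern, haystack):
            hits.add((text[match.start():match.end()], identifier))
    return hits


def precision_recall(hits, gold):
    """Precision and recall of the hits against a gold-standard set."""
    correct = len(hits & gold)
    precision = correct / len(hits) if hits else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sentence and gold standard.
sentence = "TP53 was induced by NFkappaB."
gold = {("TP53", "GENE:1"), ("NFkappaB", "GENE:2")}

strict = tag(sentence, GENE_DICTIONARY)
relaxed = tag(sentence, GENE_DICTIONARY, relaxed=True)
print(precision_recall(strict, gold))   # (1.0, 0.5)
print(precision_recall(relaxed, gold))  # recall rises to 1.0, precision drops to 2/3
```

In this sketch, exact matching misses the unlisted spelling NFkappaB, while the relaxed matcher recovers it but also tags the ordinary word "was" as a gene, which is why such ambiguous names are usually filtered out of the dictionary.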
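As a rough illustration of such rules, the sketch below tags candidate protein terms using the -ase suffix and two orthographic cues (a mix of letters and digits, or an internal capital letter). The rule set, the exception list and the example sentence are assumptions made for this example rather than rules from any cited system; the -in suffix is left out because, taken alone, it would also match many ordinary English words.

```python
import re

# Hypothetical hand-written rules for candidate protein terms.
RULES = [
    re.compile(r"^[A-Za-z]{3,}ase$"),                       # suffix rule, e.g. kinase
    re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$"),   # letters and digits, e.g. Akt2
    re.compile(r"^[A-Z][a-z]+[A-Z][A-Za-z0-9]*$"),          # internal capital, e.g. PspA
]

# Expert knowledge added as exceptions: common words ending in -ase.
EXCEPTIONS = {"disease", "increase", "decrease", "database", "release"}


def tag_proteins(sentence):
    """Return the tokens that satisfy at least one rule and are not exceptions."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    tagged = []
    for token in tokens:
        if token.lower() in EXCEPTIONS:
            continue
        if any(rule.match(token) for rule in RULES):
            tagged.append(token)
    return tagged


print(tag_proteins("The kinase Akt2 binds calmodulin and PspA in this disease."))
# ['kinase', 'Akt2', 'PspA']
```

The rules are readable and easy to extend, but calmodulin is missed because no rule anticipates it, which is the recall cost described above.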
Freely available corpora for training and evaluating text-mining tools in the biological domain
BioCreative I GM
BioCreative II GM
BioCreative II GN
BioCreative II FT
Determining the correct class of an NE is complicated by the ubiquitous use of abbreviations and acronyms in biomedical research. Liu et al. found that 81.2 per cent of acronyms in MEDLINE are ambiguous (eg the acronym NF can refer to 61 different full forms). ML methods have been proposed for abbreviation disambiguation, with some work focusing on abbreviations found in the biological literature [43, 45].
It is not just gene names that are difficult to identify; the identification of species mentions is also troublesome. Species names can be homonymous with common English words (eg 'honesty' for Lunaria annua and 'bears' for Ursidae) but also with important entities in the biological domain (eg cancer and hippocampus). The performance of a dictionary-based tagging system is again limited by the lack of coverage, widespread use of acronyms and frequent misspelling of species names. Standard dictionaries of species names such as the National Center for Biotechnology Information (NCBI) Taxonomy are incomplete, given the amazing diversity of life. They do, however, contain names for most well-studied organisms. Rule-based methods have been developed which are capable of identifying species terms using rules designed for matching Linnaean binomial nomenclature. The recently published LINNAEUS system uses a dictionary and a set of regular expressions to identify species mentions in text. This system allows both the identification and normalisation of species names, features an acronym disambiguation component and achieves high performance on its own corpus.
Normalisation of NEs allows the results of text mining to be used in tasks like manual curation, knowledge summarisation and model construction and validation [52, 53]. The standard method of normalisation is to compare an NE against a dictionary of synonyms and identifiers, and assign the matching identifier. In some domains, this approach can achieve an extremely good performance; however, the variability and ambiguity of biological nomenclature means that this method is essentially ineffective for biological entities. The genomic nomenclature is also highly ambiguous, in that one gene name can map to multiple canonical identifiers. This means that exact text matching using a dictionary is flawed, as the term may be a variation not found in the list of synonyms. Rule-based approaches have been used which try to normalise terms by applying a set of transformations to a tagged entity in order to try to make it match a term in a lexicon. String similarity metrics have been used with some success to match terms which are not present in the original lexicon.
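The combination of a dictionary with regular expressions can be sketched as follows. This is only an illustration of the general idea and not the LINNAEUS implementation: the three-entry genus list, the pattern and the rule that an abbreviated genus is expanded to the most recent full mention sharing its initial are all assumptions.

```python
import re

# Hypothetical miniature genus list; a real system would load a full
# resource such as the NCBI Taxonomy.
KNOWN_GENERA = {"Caenorhabditis", "Escherichia", "Lunaria"}

# Capitalised genus (or an initial such as "C.") followed by a lower-case epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{2,})\b")


def find_species(text):
    """Return (mention, expanded name) pairs for binomials with a known genus."""
    mentions = []
    genus_by_initial = {}  # initial -> most recent genus written in full
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.groups()
        if genus.endswith("."):
            full_genus = genus_by_initial.get(genus[0])
            if full_genus is None:
                continue  # abbreviation with no earlier full mention
        elif genus in KNOWN_GENERA:
            full_genus = genus
            genus_by_initial[genus[0]] = genus
        else:
            continue  # capitalised English word, not a known genus
        mentions.append((match.group(0), f"{full_genus} {epithet}"))
    return mentions


text = ("Caenorhabditis elegans is a model organism. "
        "Mutations in C. elegans genes were studied. Honesty requires care.")
print(find_species(text))
# [('Caenorhabditis elegans', 'Caenorhabditis elegans'), ('C. elegans', 'Caenorhabditis elegans')]
```

The pattern alone would also accept "Mutations in" and "Honesty requires"; it is the genus dictionary that rejects them, which is why coverage of that dictionary limits recall.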
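A minimal sketch of these two strategies, rule-based transformation followed by a string-similarity fallback, is given below. The lexicon, its identifiers and the 0.8 cut-off are assumptions chosen for illustration; the similarity measure is the one provided by Python's standard `difflib` module, used here as a stand-in for the metrics in the cited work.

```python
import difflib
import re

# Hypothetical lexicon: normalised synonym -> made-up identifier.
LEXICON = {
    "tp53": "GENE:1",
    "p53": "GENE:1",
    "trp53": "GENE:1",
    "nf kappa b": "GENE:2",
}

GREEK = {"κ": "kappa", "α": "alpha", "β": "beta"}


def canonical_form(mention):
    """Apply simple transformations: lower-case, spell out Greek, unify separators."""
    text = mention.lower()
    for symbol, name in GREEK.items():
        text = text.replace(symbol, f" {name} ")
    text = re.sub(r"[-_/]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalise(mention, lexicon=LEXICON, cutoff=0.8):
    """Return (identifier, method); identifier is None when nothing matches."""
    form = canonical_form(mention)
    if form in lexicon:
        return lexicon[form], "rule"
    close = difflib.get_close_matches(form, list(lexicon), n=1, cutoff=cutoff)
    if close:
        return lexicon[close[0]], "similarity"
    return None, "unmatched"


print(normalise("NF-κB"))      # ('GENE:2', 'rule')
print(normalise("NF-kappaB"))  # ('GENE:2', 'similarity')
print(normalise("BRCA1"))      # (None, 'unmatched')
```

The cut-off matters: with this lexicon, lowering it to 0.7 would map the distinct gene symbol TP63 onto the TP53 entry, which illustrates how similarity matching can trade precision for recall.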
It is not just the proteomic and genomic nomenclatures that pose problems for normalisation. While the precise Linnaean binomial name for an organism is unambiguous, it may not be the case for its abbreviated form. Caenorhabditis elegans is commonly abbreviated to C. elegans; however, 49 other species have a name that can be abbreviated to this short form. Due to the widespread use of Caenorhabditis elegans as a model organism, the majority of mentions of C. elegans would probably normalise to NCBI Taxonomy identifier 6239 but this heuristic will have exceptions. Another problem with species normalisation is dealing with the abundance of different strains, particularly among microorganisms. It is important to disambiguate the strains if possible, as genes' functional properties can vary between strains.
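The ambiguity of abbreviated binomials, and the "most common reading" heuristic with its exceptions, can be sketched as below. The short species list and the rule that a full name spelled out in the same document overrides the default are assumptions made for this example.

```python
from collections import defaultdict

# Illustrative species list; two of the names share the short form "C. elegans".
SPECIES_NAMES = [
    "Caenorhabditis elegans",
    "Cunninghamella elegans",
    "Escherichia coli",
]

# Assumed default reading for an ambiguous short form (the model organism).
DEFAULT_READING = {"C. elegans": "Caenorhabditis elegans"}


def abbreviate(binomial):
    """Turn 'Genus epithet' into 'G. epithet'."""
    genus, epithet = binomial.split()
    return f"{genus[0]}. {epithet}"


def build_index(names):
    """Map each short form to every full name that abbreviates to it."""
    index = defaultdict(list)
    for name in names:
        index[abbreviate(name)].append(name)
    return dict(index)


def resolve(abbreviation, document_text, index):
    """Pick a full name for a short form, preferring evidence in the document."""
    candidates = index.get(abbreviation, [])
    if len(candidates) == 1:
        return candidates[0]
    spelled_out = [name for name in candidates if name in document_text]
    if len(spelled_out) == 1:
        return spelled_out[0]
    default = DEFAULT_READING.get(abbreviation)
    return default if default in candidates else None


index = build_index(SPECIES_NAMES)
print(resolve("C. elegans", "C. elegans larvae were scored.", index))
# Caenorhabditis elegans
print(resolve("C. elegans", "Cunninghamella elegans was cultured; C. elegans grew.", index))
# Cunninghamella elegans
```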
Good results for normalising human gene names have been reported. The BCII GN task evaluated performance against a manually annotated gold standard corpus. Overall results were promising, with a combined recall of 97.2 per cent (entries from over 20 teams). This evaluation assumed that the species was human, however. Normalisation for other species continues to be a challenge and has not been helped by the decision made at the 22nd International Society for Animal Genetics (in August 1990) that animal gene names should 'follow the rules for human gene nomenclature, including the use of identical symbols for homologous genes and the reservation of human symbols for as yet unidentified animal genes'. This interspecies ambiguity of the genomic nomenclature means that identifying the correct species for a given mention is an important subtask of gene normalisation, although it has only recently begun to be considered.
APPL binds Akt2
Binding of Akt2 by APPL
Binding between Akt2 and APPL
Relationships between two entities can be described over multiple sentences, which can lead to complications, as anaphors need to be identified and resolved (eg APPL is later referred to as this protein in a piece of text). This limits the recall of relation extraction approaches that work at the sentence level only. The relationship type that has attracted the most effort is extracting PPIs.
A number of different approaches have been proposed in order to perform this task based on linguistic, rule-based and ML methods. Rule-based methods use a set of syntactic patterns, which specify how an interaction is described. The patterns can be manually or automatically generated. RelEX applies a simple set of rules on a representation of the dependencies between words in a sentence called a dependency graph. RLIMS-P is a rule-based approach specifically designed to extract information about protein phosphorylation sites, and performs well compared with manually curated literature sets. Some ML methods treat a sentence as a sequence of words or tokens and completely ignore its syntactic structure. These approaches do not achieve good performance compared with methods which take sentence structure into account. It is clearly important to consider both contextual and linguistic features [64, 65], such as interaction keywords and verbs, to extract relationships with good precision.
To complicate matters further, authors frequently speculate about potential relationships (eg APPL may interact with Akt2). Such statements do not assert that a relationship exists, only that it is proposed to exist. It is important to identify these speculative statements and prevent them from biasing any downstream analyses. For the same reason, it is equally important to detect the negation of relationships (eg APPL does not interact with Akt2).
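A pattern-based extractor for the three ways of describing the APPL and Akt2 interaction can be sketched with surface-level regular expressions. This is a deliberate simplification: the patterns and the entity expression are assumptions, and systems such as those cited above work on syntactic structure rather than raw strings. Binding is treated here as symmetric, so the result is an unordered pair.

```python
import re

# Assumed shape of an entity name: a capital letter followed by letters or digits.
ENTITY = r"[A-Z][A-Za-z0-9]+"

# One hand-written pattern per surface form.
PATTERNS = [
    re.compile(rf"(?P<a>{ENTITY}) binds (?P<b>{ENTITY})"),
    re.compile(rf"[Bb]inding of (?P<b>{ENTITY}) by (?P<a>{ENTITY})"),
    re.compile(rf"[Bb]inding between (?P<a>{ENTITY}) and (?P<b>{ENTITY})"),
]


def extract_binding(sentence):
    """Return the unordered pair of binding partners, or None if no pattern fits."""
    for pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return frozenset((match.group("a"), match.group("b")))
    return None


for sentence in ("APPL binds Akt2",
                 "Binding of Akt2 by APPL",
                 "Binding between Akt2 and APPL"):
    print(sorted(extract_binding(sentence)))  # ['APPL', 'Akt2'] each time

print(extract_binding("Akt2 is bound by APPL"))    # None: phrasing not covered
print(extract_binding("This protein binds Akt2"))  # None: unresolved anaphor
```

The last two calls show the recall problem: each new phrasing needs its own pattern, and a sentence-level extractor cannot recover a partner that is only referred to as "this protein".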
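One simple way to flag such statements is to look for cue words, as in the sketch below. The cue lists are small hypothetical examples, and the approach ignores the scope of a cue, so it should be read as a first-pass filter rather than as the method of the cited work.

```python
import re

# Hypothetical cue lists; real systems use larger lexicons and model cue scope.
SPECULATION_CUES = {"may", "might", "could", "possibly", "putative", "suggest", "suggests"}
NEGATION_CUES = {"not", "no", "cannot", "never"}


def classify_statement(sentence):
    """Label a sentence as 'negated', 'speculative' or 'asserted' from its cue words."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    if tokens & NEGATION_CUES:
        return "negated"
    if tokens & SPECULATION_CUES:
        return "speculative"
    return "asserted"


def asserted_only(sentences):
    """Keep only sentences that state a relationship as fact."""
    return [s for s in sentences if classify_statement(s) == "asserted"]


sentences = [
    "APPL binds Akt2",
    "APPL may interact with Akt2",
    "APPL does not interact with Akt2",
]
print([classify_statement(s) for s in sentences])
# ['asserted', 'speculative', 'negated']
print(asserted_only(sentences))
# ['APPL binds Akt2']
```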
Summary of hypotheses generated using Swanson's ABC model and its extensions
Cory et al.: Proposed links between Frost (a 20th century poet) and Carneades (an ancient philosopher)
Gordon et al.: Finding new applications for genetic algorithms using the WWW
Hettne et al.: Proposed the role of NF-κB in the aetiology of complex regional pain syndrome
Hristovski et al.: Proposed novel candidate genes that may be involved in bilateral perisylvian polymicrogyria
Kostoff et al.: Proposed novel non-drug treatments (such as calorific restriction) for the treatment of multiple sclerosis
Kostoff et al.: Proposed 'lifestyle/dietary practices that could be interpreted as anti-cataract'
Novel uses for Curcuma longa/turmeric in the treatment of retinal diseases, Crohn's disease and spinal cord-related disorders
Swanson et al.: Classifying viruses as potential biological weapons
van Haagen et al.: Predicting and identifying novel interaction partners for proteins in Escherichia coli
Weeber et al.: Novel uses of thalidomide in the treatment of myasthenia gravis, chronic hepatitis C, Helicobacter pylori-induced gastritis and acute pancreatitis
Wren et al.: Chlorpromazine may reduce cardiac hypertrophy (ABC model in conjunction with experimental evidence)
Wren et al.: Pathogenesis of non-insulin-dependent diabetes is most likely epigenetic
Zhou et al.: Combined MEDLINE with traditional Chinese medicine to propose new functional knowledge about genes
Mendeleev's discovery of the law of periodicity and the development of the periodic table can be considered an early example of literature-based discovery (LBD), as it was: 'a direct outcome of the stock of generalisations and established facts which has accumulated by the end of the decade 1860-1870.' The information required to build the table of elements had already been published, but it had never been analysed as a whole. More recently, Hettne et al. combined TM with network analysis in order to generate new mechanistic hypotheses relating to the complex regional pain syndrome (CRPS). NF-κB was identified as potentially being involved by first extracting genes relating to CRPS from the literature and then investigating potential links between these genes which were not mentioned in the CRPS literature. This hypothesis has led to several new ideas regarding the aetiology of the disease and the proposal of a novel drug target. By exploiting the context of protein mentions, van Haagen et al. were able to predict a novel interaction between CAPN3 and PARVB. Integrating information extracted from the literature with microarray experiments has led to the proposition of a relationship between SIP and the invasiveness of glioblastoma cell lines. All of this work shows the potential for TM to generate testable hypotheses for use in biology.
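The idea behind Swanson's ABC model, as commonly described, is that if a starting concept A co-occurs with intermediate concepts B, and those B concepts co-occur with a concept C that never appears alongside A, then an A to C link is a candidate hypothesis. The sketch below is one minimal way to express this over per-document term sets; the documents and term names are placeholders and the functions are hypothetical, not those of any system listed above.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(documents):
    """Link every pair of terms that appear together in at least one document."""
    links = defaultdict(set)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return dict(links)


def abc_hypotheses(links, start):
    """Return {C: set of bridging B terms} for C terms never seen with the start term."""
    known = links.get(start, set())
    candidates = defaultdict(set)
    for b in known:
        for c in links.get(b, set()):
            if c != start and c not in known:
                candidates[c].add(b)
    return dict(candidates)


# Placeholder term sets, one per document (illustrative only).
documents = [
    {"disease X", "gene 1"},
    {"disease X", "gene 3"},
    {"gene 1", "gene 2"},
    {"gene 3", "gene 2"},
    {"gene 1", "gene 3"},
]

links = cooccurrence(documents)
print(abc_hypotheses(links, "disease X"))
# gene 2 is proposed, bridged by gene 1 and gene 3
```

Every indirect link is returned here, so on a real corpus the candidate list grows quickly; ranking or filtering the candidates is where the heuristics and human input come in.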
Hypothesis generation is challenging even to humans, however. Automating this process, or formulating it in such a way that a computer can quickly generate testable scientific propositions, is a non-trivial and daunting task. Only if the universe of potential hypotheses is sufficiently simple for search or enumeration approaches to cover all potential cases is this currently feasible. We feel that the most promising strategies in the short term include the search for suitable heuristics or iterative procedures involving infrequent human input.
TM tools offer a way to retrieve the pertinent information contained within the mass of scientific literature, make it easier to explore and allow the generation of novel insights into existing data, all in an automated fashion. While TM is currently noisy and imperfect, it should be remembered that, due to inter-annotator disagreement, manual curation is too. TM is not just restricted to extracting functional information; it has also been used to identify best practices within the phylogenetics domain, to generate priors for network reconstruction using Bayesian networks and to aid in protein structure comparison and assignment of function. Recently, TM has shown the greatest potential when used in data fusion style approaches. By using information extracted from the literature, Raychaudhuri et al. were able to develop a method to better distinguish between genomic regions associated with disease and false-positive regions. Ten out of 13 single nucleotide polymorphisms (SNPs) identified by their method as being associated with Crohn's disease were later validated by follow-up genotyping. STRING integrates many different types of evidence about PPIs, including literature co-occurrence, phylogenetic data and results from high-throughput experiments, and has been used to predict novel PPIs in other organisms by transferring annotations to orthologous protein pairs. While there is a significant body of work on applying TM to the biological domain, many challenges remain in areas like relation extraction, species disambiguation and hypothesis generation.
Systems biology and genomics deal with large data models of unprecedented complexity; TM allows us to draw on the published literature in a disciplined manner to inform the development of quantitative models. We expect TM to become an important addition to the systems biologist's toolkit, complementing existing techniques like comparative and primary data analysis. We hope to have demonstrated the use and limitations of TM in its current guise. Being aware of the limitations, however, should enable the community to develop and adopt protocols that allow for easier, more reliable analysis of published research outputs from these tools. This is important not only for researchers, but also for publishers, funding bodies and regulators. These three players have, of course, different but, crucially, not competing interests as far as accessibility of information is concerned. Regulators, in particular, irrespective of whether or not they are engaged in accrediting new drugs or nutritional supplements or the granting of patents, stand to benefit profoundly from information that is provided in an electronically accessible and unambiguous fashion.
- Ananiadou S, Kell D, Tsujii J: Text mining and its potential applications in systems biology. Trends Biotechnol. 2006, 24: 571-579. 10.1016/j.tibtech.2006.10.002.
- Baumgartner WA, Cohen KB, Fox LM, Acquaah-Mensah G, et al: Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007, 23: i41-i48. 10.1093/bioinformatics/btm229.
- Winnenburg R, Wächter T, Plake C, Doms A, et al: Facts from text: Can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform. 2008, 9: 466-478. 10.1093/bib/bbn043.
- Ng S, Wong M: Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Inform. 1999, 10: 104-112.
- Agarwal P, Searls DB: Literature mining in support of drug discovery. Brief Bioinform. 2008, 9: 479-492. 10.1093/bib/bbn035.
- Rzhetsky A, Seringhaus M, Gerstein M: Seeking a new biology through text mining. Cell. 2008, 134: 9-13. 10.1016/j.cell.2008.06.029.
- Hearst M: Untangling text data mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics. 1999, 3-10.
- Deshpande N, Fink J, Bourne P, Cohen K: Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pacific Symposium on Biocomputing. 2008, 640-651.
- Blaschke C: Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comp Funct Genomics. 2001, 2: 196-206. 10.1002/cfg.91.
- Knight J: Negative results: Null and void. Nature. 2003, 422: 554-555. 10.1038/422554a.
- Pfeiffer T, Hoffmann R: Temporal patterns of genes in scientific publications. Proc Natl Acad Sci USA. 2007, 104: 12052-12056.
- Lehne B, Schlitt T: Protein-protein interaction databases: Keeping up with growing interactomes. Hum Genomics. 2009, 3: 291-297.
- Dickerson J, Pinney J, Robertson D: The biological context of HIV-1 host interactions reveals subtle insights into a system hijack. BMC Syst Biol. 2010, 4: 80. 10.1186/1752-0509-4-80.
- Jenssen T, Lægreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28: 21-28.
- Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics. 2005, 21: 248-256. 10.1093/bioinformatics/bth496.
- Mons B: Which gene did you mean? BMC Bioinform. 2005, 6: 142. 10.1186/1471-2105-6-142.
- Hatzivassiloglou V, Duboue PA, Rzhetsky A: Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 2001, 17: 97-106. 10.1093/bioinformatics/17.suppl_1.S97.
- Barnes J: Conceptual biology: A semantic issue and more. Nature. 2002, 417: 587-588.
- Kim J, Ohta T, Tsuruoka Y, Tateisi YN, et al: Introduction to the bio-entity recognition task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004, 70-75.
- Smith L, Tanabe LK, Johnson R, Kuo CJ, et al: Overview of BioCreative II gene mention recognition. Genome Biol. 2008, 9: S2.
- Liu H, Hu ZZ, Torii M, Wu C, et al: Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc. 2006, 13: 497-507. 10.1197/jamia.M2085.
- Tsuruoka Y, McNaught J, Tsujii J, Ananiadou S: Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics. 2007, 23: 2768-2774. 10.1093/bioinformatics/btm393.
- Schuemie M, Mons B, Weeber M, Kors J: Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification. J Biomed Inform. 2007, 40: 316-324. 10.1016/j.jbi.2006.09.002.
- Tsuruoka Y: Probabilistic term variant generator for biomedical terms. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2003, 167-173.
- Fundel K, Güttler D, Zimmer R, Apostolakis J: A simple approach for protein name identification: Prospects and limits. BMC Bioinform. 2005, 6 (Suppl 1): S15-10.1186/1471-2105-6-S1-S15.View ArticleGoogle Scholar
- Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P: Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics. 2003, 19: 135-143. 10.1093/bioinformatics/19.1.135.View ArticlePubMedGoogle Scholar
- Hakenberg J, Bickel S, Plake C, Brefeld U, et al: Systematic feature evaluation for gene name recognition. BMC Bioinform. 2005, 6: S9-View ArticleGoogle Scholar
- Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text. Bioinformatics. 2002, 18: 1124-1132. 10.1093/bioinformatics/18.8.1124.View ArticlePubMedGoogle Scholar
- Settles B: ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005, 21: 3191-3192. 10.1093/bioinformatics/bti475.View ArticlePubMedGoogle Scholar
- Sætre R, Sagae K, Tsujii J: Syntactic features for protein-protein interaction extraction. Proceedings of the 2nd International Symposium on Languages in Biology and Medicine. 2007, 6.1-6.14.Google Scholar
- Leaman R, Gonzalez G: BANNER: An executable survey of advances in biomedical named entity recognition. Pacific Symposium on Biocomputing. 2008, 652-663.Google Scholar
- Tsuruoka Y, Tateishi Y, Kim J, Ohta T, et al: Developing a robust part-of-speech tagger for biomedical text. Proceedings of Panhellenic Conference on Informatics. 2005, 3746: 382-392.Google Scholar
- Airola A, Pyysalo S, Björne J, Pahikkala T, et al: All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinform. 2008, 9: S2-View ArticleGoogle Scholar
- Gerner M, Nenadic G, Bergman CM: LINNAEUS: A species name identification system for biomedical literature. BMC Bioinform. 2010, 11: 85-10.1186/1471-2105-11-85.View ArticleGoogle Scholar
- Hahn U, Buyko E, Landefeld R: An overview of JCoRe, the JULIE lab UIMA component repository. Proceedings of the LREC'08 Workshop Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP. 2008, 1-7.Google Scholar
- Smith L, Rindflesch T, Wilbur W: MedPost: A part-of-speech tagger for bioMedical text. Bioinformatics. 2004, 20: 2320-2321. 10.1093/bioinformatics/bth227.View ArticlePubMedGoogle Scholar
- Mika S, Rost B: NLProt: Extracting protein names and sequences from papers. Nucleic Acids Res. 2004, 32: W634-W637. 10.1093/nar/gkh427.PubMed CentralView ArticlePubMedGoogle Scholar
- Hunter L, Lu Z, Firby J, Baumgartner WA, et al: OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinform. 2008, 9: 78-10.1186/1471-2105-9-78.View ArticleGoogle Scholar
- Corbett P, Murray-Rust P: High-throughput identification of chemistry in life science texts. Proceedings of the 2nd International Symposium on Computational Life Science. 2006, 107-118.Google Scholar
- Song Y, Kim E, Lee GG, Yi BK: POSBIOTM-NER: A trainable biomedical named-entity recognition system. Bioinformatics. 2005, 21: 2794-2796. 10.1093/bioinformatics/bti414.View ArticlePubMedGoogle Scholar
- Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, et al: Text processing through Web services: calling Whatizit. Bioinformatics. 2008, 24: 296-298. 10.1093/bioinformatics/btm557.View ArticlePubMedGoogle Scholar
- Liu H, Aronson AR, Friedman C: A study of abbreviations in MEDLINE abstracts. Proceedings/AMIA Annual Symposium AMIA Symposium. 2002, 464-468.Google Scholar
- Okazaki N, Ananiadou S: Building an abbreviation dictionary using a term recognition approach. Bioinformatics. 2006, 22: 3089-3095. 10.1093/bioinformatics/btl534.View ArticlePubMedGoogle Scholar
- Tsuruoka Y, Ananiadou S: A machine learning approach to acronym generation. Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. 2005, 25-31.View ArticleGoogle Scholar
- Bracewell D, Russell S, Wu A: Identification, expansion, and disambiguation of acronyms in biomedical texts. Lect Notes Comput Sci. 2005, 3759: 186-195. 10.1007/11576259_21.View ArticleGoogle Scholar
- Koning D, Sarkar I, Moritz T: TaxonGrab: Extracting taxonomic names from text. Biodiversity Inform. 2005, 2: 79-82.
- Sarntivijai S, Ade AS, Athey BD, States DJ: A bioinformatics analysis of the cell line nomenclature. Bioinformatics. 2008, 24: 2760-2766. 10.1093/bioinformatics/btn502.
- Pyysalo S, Ginter F, Heimonen J, Björne J, et al: BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinform. 2007, 8: 50. 10.1186/1471-2105-8-50.
- Wang X, Tsujii J, Ananiadou S: Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics. 2010, 26: 661-667. 10.1093/bioinformatics/btq002.
- Alex B, Grover C, Haddow B, Kabadjov M, et al: Assisted curation: Does text mining really help? Pac Symp Biocomput. 2008, 556-567.
- Craven M, Kumlien J: Constructing biological knowledge bases by extracting information from text sources. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. 1999, 77-86.
- Santos C, Eggle D, States D: Wnt pathway curation using automated natural language processing: Combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics. 2005, 21: 1653-1658. 10.1093/bioinformatics/bti165.
- Waagmeester A, Pezik P, Coort S, Tourniaire F, et al: Pathway enrichment based on text mining and its validation on carotenoid and vitamin A metabolism. OMICS. 2009, 13: 367-379. 10.1089/omi.2009.0029.
- Lau WW, Johnson CA, Becker KG: Rule-based human gene normalization in biomedical text with confidence estimation. Comput Syst Bioinformatics Conf. 2007, 6: 371-379.
- Wang X, Matthews M: Comparing usability of matching techniques for normalising biomedical named entities. Pac Symp Biocomput. 2008, 13: 628-639.
- Grover C, Haddow B, Klein E, Matthews M: Adapting a relation extraction pipeline for the BioCreAtIvE II task. Proceedings of the BioCreAtIvE II Workshop. 2007.
- Wang X: Rule-based protein term identification with help from automatic species tagging. Proceedings of CICLING. 2007, 288-298.
- Crim J, McDonald R, Pereira F: Automatically annotating documents with normalized gene lists. BMC Bioinform. 2005, 6: S13.
- Farkas R: The strength of co-authorship in gene name disambiguation. BMC Bioinform. 2008, 9: 69. 10.1186/1471-2105-9-69.
- Morgan AA, Lu Z, Wang X, Cohen AM, et al: Overview of BioCreative II gene normalization. Genome Biol. 2008, 9: S3.
- Kappeler T, Kaljurand K, Rinaldi F: TX task: Automatic detection of focus organisms in biomedical publications. Proceedings of the Workshop on BioNLP. 2009, 80-88.
- Fundel K, Kuffner R, Zimmer R: RelEx-Relation extraction using dependency parse trees. Bioinformatics. 2007, 23: 365-371. 10.1093/bioinformatics/btl616.
- Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, et al: Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics. 2005, 21: 2759-2765. 10.1093/bioinformatics/bti390.
- Niu Y, Otasek D, Jurisica I: Evaluation of linguistic features useful in extraction of interactions from PubMed: Application to annotating known, high-throughput and predicted interactions in I2D. Bioinformatics. 2009, 26: 111-119.
- Fayruzov T, Cock MD, Cornelis C, Hoste V: Linguistic feature analysis for protein interaction extraction. BMC Bioinform. 2009, 10: 374. 10.1186/1471-2105-10-374.
- Hatzivassiloglou V, Weng W: Learning anchor verbs for biological interaction patterns from published text articles. Int J Med Inform. 2002, 67: 19-32. 10.1016/S1386-5056(02)00054-0.
- Kilicoglu H, Bergler S: Recognizing speculative language in biomedical research articles: A linguistically motivated perspective. BMC Bioinform. 2008, 9: S10.
- Sanchez-Graillet O, Poesio M: Negation of protein-protein interactions: Analysis and extraction. Bioinformatics. 2007, 23: i424-i432. 10.1093/bioinformatics/btm184.
- Davies R: The creation of new knowledge by information retrieval and classification. J Doc. 1989, 4: 273-301.
- Swanson D: Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986, 30: 7-18.
- DiGiacomo R, Kremer J, Shah D: Fish-oil dietary supplementation in patients with Raynaud's phenomenon: A double-blind, controlled, prospective study. Am J Med. 1989, 86: 158-164. 10.1016/0002-9343(89)90261-1.
- Murray-Rust P: Data Driven Science-A Scientist's View. NSF/JISC 2007 Digital Repositories Workshop. 2007, [http://www.sis.pitt.edu/~repwkshop/papers/murray.html]
- Hettne K, de Mos M, de Bruijn A, Weeber M: Applied information retrieval and multidisciplinary research: New mechanistic hypotheses in complex regional pain syndrome. J Biomed Discov Collab. 2007, 2: 2. 10.1186/1747-5333-2-2.
- van Haagen H, 't Hoen P, Bovo AB, de Morrée A, et al: Novel protein-protein interactions inferred from literature context. PLoS ONE. 2009, 4: e7894. 10.1371/journal.pone.0007894.
- Natarajan J, Berrar D, Dubitzky W, Hack C, et al: Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinform. 2006, 7: 373. 10.1186/1471-2105-7-373.
- Cory K: Discovering hidden analogies in an online humanities database. Comput Hum. 1997, 31: 1-12. 10.1023/A:1000422220677.
- Gordon M, Lindsay R, Fan W: Literature-based discovery on the World Wide Web. ACM Trans Inter Tech. 2002, 2: 261-275. 10.1145/604596.604597.
- Hristovski D, Peterlin B, Mitchell J, Humphrey S: Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005, 74: 289-298. 10.1016/j.ijmedinf.2004.04.024.
- Kostoff R, Briggs M, Lyons T: Literature-related discovery (LRD): Potential treatments for multiple sclerosis. Technol Forecast Soc Change. 2007, 75: 239-255.
- Kostoff R: Literature-related discovery (LRD): Potential treatments for cataracts. Technol Forecast Soc Change. 2007, 75: 215-225.
- Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics. 2004, 20: i290-i296. 10.1093/bioinformatics/bth914.
- Srinivasan P, Libbus B, Sehgal A: Mining medline: Postulating a beneficial role for curcumin longa in retinal diseases. HLT Biolink. 2004, 33-40.
- Swanson D, Smalheiser N, Bookstein A: Information discovery from complementary literatures: Categorizing viruses as potential weapons. J Am Soc Inf Sci Technol. 2001, 52: 797-812. 10.1002/asi.1135.
- Weeber M, Vos R, Klein H, de Jong-van den Berg LT, et al: Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. J Am Med Inform Assoc. 2003, 10: 252-259. 10.1197/jamia.M1158.
- Wren JD, Bekeredjian R, Stewart JA, Shohet RV, et al: Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics. 2004, 20: 389-398. 10.1093/bioinformatics/btg421.
- Wren J: Data-mining analysis suggests an epigenetic pathogenesis for type 2 diabetes. J Biomed Biotechnol. 2005, 2: 104-112.
- Zhou X, Liu B, Wu Z, Feng Y: Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks. Artif Intell Med. 2007, 41: 87-104. 10.1016/j.artmed.2007.07.007.
- Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics. 2005, 21: ii252-ii258. 10.1093/bioinformatics/bti1142.
- Eales J, Pinney J, Stevens R, Robertson D: Methodology capture: Discriminating between the "best" and the rest of community practice. BMC Bioinform. 2008, 9: 359. 10.1186/1471-2105-9-359.
- Steele E, Tucker A, 't Hoen P, Schuemie M: Literature-based priors for gene regulatory networks. Bioinformatics. 2009, 25: 1768-1774. 10.1093/bioinformatics/btp277.
- MacCallum R, Kelley L, Sternberg M: SAWTED: Structure assignment with text description-enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics. 2000, 16: 125-129. 10.1093/bioinformatics/16.2.125.
- Raychaudhuri S, Plenge RM, Rossin EJ, Ng ACY, et al: Identifying relationships among disease regions: Predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet. 2009, 5: e1000534. 10.1371/journal.pgen.1000534.
- von Mering C, Jensen L, Snel B, Hooper S: STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005, 33: D433-D437.