Prospects for the automated extraction of mutation data from the scientific literature

EDITORIAL

Recently, Kuipers et al. [1] reported an exciting development: the first disease-focused, locus-specific mutation database (LSDB) developed through the automated extraction of mutation data from full-text articles in the scientific literature. Although this report clearly constitutes a landmark in its field, neither the idea nor the methodology is entirely new: Horn et al. [2] successfully extracted point mutation data (reported mainly in an evolutionary context) for G protein-coupled receptors and nuclear hormone receptors from full-text literature some years ago. Since then, a range of different approaches has been attempted, with varying degrees of success [3-6]. Kuipers et al. [1], however, employed their own search tool, Mutator, specifically to extract disease-related missense and nonsense mutation data from the scientific literature, with a view to constructing an LSDB for Fabry disease (FMDB) containing mutations in the α-galactosidase (GLA) gene. Briefly, these authors employed a disease-oriented PubMed keyword search to identify relevant articles to be downloaded. The full text of relevant publications was then automatically screened for mutation data, and those mutations found to occur at amino acid residues that appeared to match the GLA protein sequence were selected for inclusion in the database. Although the results appear at first glance to be impressive, they warrant closer inspection.
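The screening and sequence-matching steps just described can be sketched in a few lines of Python. This is an illustration only and should not be read as the Mutator implementation, whose internals are not described here. The regular expression, the function names and the 11-residue `TOY_SEQ` are all assumptions introduced for the example; `TOY_SEQ` is a made-up sequence, not the GLA protein.

```python
import re

# Three-letter to one-letter amino acid codes; "Ter" denotes a stop codon.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
_THREE = "|".join(THREE_TO_ONE)

# Hypothetical pattern: matches mentions such as "Arg5Gln", "p.Arg5Gln",
# "R5Q" or "W9X" (X = stop). Real articles use many more notations.
MUTATION_RE = re.compile(
    rf"\b(?:p\.)?(?:(?P<w3>{_THREE})(?P<p3>\d+)(?P<m3>{_THREE})"
    rf"|(?P<w1>[ACDEFGHIKLMNPQRSTVWY])(?P<p1>\d+)(?P<m1>[ACDEFGHIKLMNPQRSTVWYX]))\b"
)


def extract_mutations(text):
    """Return (wild_type, position, mutant) tuples in one-letter code."""
    found = []
    for m in MUTATION_RE.finditer(text):
        if m.group("w3"):
            wild = THREE_TO_ONE[m.group("w3")]
            pos = int(m.group("p3"))
            mut = THREE_TO_ONE[m.group("m3")]
        else:
            wild, pos, mut = m.group("w1"), int(m.group("p1")), m.group("m1")
        found.append((wild, pos, "*" if mut == "X" else mut))
    return found


def matches_reference(mutation, protein_seq):
    """True if the wild-type residue sits at that (1-based) position."""
    wild, pos, mut = mutation
    return 1 <= pos <= len(protein_seq) and protein_seq[pos - 1] == wild and mut != wild


# Made-up reference sequence and text, for illustration only.
TOY_SEQ = "MKTARLLGWVE"
TEXT = ("Patients carried p.Arg5Gln or W9X; a G8R variant was also seen. "
        "Lys3Glu and Gly20Asp were reported in an unrelated protein.")

hits = [m for m in extract_mutations(TEXT) if matches_reference(m, TOY_SEQ)]
print(hits)
```

In this toy run, the first three mentions agree with the reference residue and are kept, whereas Lys3Glu (residue 3 is Thr) and Gly20Asp (beyond the end of the sequence) are rejected. A check of this kind cannot reject a mutation in another protein that happens to carry the same residue at the same position.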
Using Mutator, Kuipers et al. [1] identified 367 'unique GLA gene mutations' (listed in their Supplementary Table), 108 of which they claimed to be absent from the Human Gene Mutation Database (HGMD [7]). Although the authors did not evaluate Mutator with respect to 'standard' performance measures (eg precision and recall), their tool appears to be a significant improvement over previously published methods, particularly those that simply screened PubMed abstracts rather than full text [6,8-12]. In their comparison with HGMD data, however, Kuipers et al. [1] appear to have used the somewhat outdated (but nevertheless freely available) online version of the database (http://www.hgmd.org) rather than the up-to-date subscription version, HGMD Professional (http://www.biobase-international.com/pages/index.php?id=hgmddatabase).
We examined the 108 GLA mutations claimed to be absent from HGMD, and found that 48 were actually present in HGMD Professional. Of the remainder, seven were listed in HGMD under an alternative mutation type (eg small indels), four still remained unresolved (in that the precise nature of the nucleotide change was unclear or ambiguous in the original report), while 24 were non-Fabry disease false positives (see below). The remaining 25 mutations (7 per cent of the total number of bona fide GLA mutations), reported in 12 different papers, appear to have been inadvertently omitted by HGMD. This is most likely because (i) they were mentioned only briefly within the text of the article concerned and (ii) no hint of the articles' mutation content could have been gleaned from inspection of article titles, abstracts or keywords. False negatives appear to be less of a problem than false positives, with Mutator failing to recognise only nine of the GLA mutations logged by HGMD Professional.
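The breakdown above can be checked with simple arithmetic. The short Python sketch below uses only the counts quoted in the text. The denominator behind the '7 per cent' figure is not stated explicitly, so two candidate denominators are shown; both are assumptions, and both round to the same value.

```python
# Counts quoted in the text for the 108 mutations claimed to be absent from HGMD.
claimed_absent = 108
breakdown = {
    "present in HGMD Professional": 48,
    "listed under an alternative mutation type": 7,
    "unresolved": 4,
    "non-Fabry disease false positives": 24,
    "inadvertently omitted by HGMD": 25,
}

# The five categories should account for every one of the 108 mutations.
total = sum(breakdown.values())

# Share of the 108 that were false positives (about 22 per cent).
false_positive_share = 100 * breakdown["non-Fabry disease false positives"] / claimed_absent

# Share of omitted mutations, under two ASSUMED denominators:
# all 367 identified mutations, or 367 minus the 24 known false positives.
identified = 367
omitted = breakdown["inadvertently omitted by HGMD"]
omitted_share_all = 100 * omitted / identified
omitted_share_bona_fide = 100 * omitted / (identified - 24)

print(total, round(false_positive_share, 1),
      round(omitted_share_all, 1), round(omitted_share_bona_fide, 1))
```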
The success of Mutator in identifying these hitherto latent lesions certainly testifies to the future potential utility of automatic tools designed to search for and extract mutation data directly from full-text articles. Indeed, we are currently exploring ways of incorporating automated full-text searching into HGMD's mutation identification strategy. In terms of the currently available version of the Mutator program, however, there would appear to be a significant problem with false positives. As mentioned above, when we carefully examined the 108 identified GLA mutations that were claimed to be absent from HGMD, 24 entries (22 per cent) proved to be non-GLA/non-Fabry disease false positives. Hence, even the inclusion in the search program of a step designed to check that all identified mutations matched the GLA protein sequence had not prevented Mutator from identifying spurious mutation data (eg mutations in other genes, located within article titles cited in the reference list, that coincidentally matched the GLA protein sequence). Taken together, these findings suggest that considerable work remains to be done in order to optimise the performance of the program, even for the missense/nonsense mutation category that constitutes the specific target mutation type for Mutator. As it stands, a significant amount of manual validation and curation would still need to be performed in order for the GLA mutation dataset to be reliable in terms of its content.
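One simple safeguard against the reference-list false positives mentioned above would be to discard the reference list before screening the text. The Python sketch below is a hypothetical pre-processing step, not a feature of Mutator. It assumes that the reference list starts at a heading standing alone on its own line; the heading names, the function name and the sample text are all invented for illustration.

```python
import re

# ASSUMPTION: the reference list is introduced by one of these headings,
# alone on its own line. Real article layouts vary considerably.
REFERENCES_HEADING = re.compile(
    r"^[ \t]*(?:references|bibliography|literature cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_reference_list(full_text):
    """Return the text preceding the last reference-list heading, if any."""
    headings = list(REFERENCES_HEADING.finditer(full_text))
    if not headings:
        return full_text  # no recognisable heading: leave the text untouched
    return full_text[:headings[-1].start()]


# Invented article text: the cited title mentions a variant in another gene.
ARTICLE = (
    "Results\n"
    "Two patients carried the A10V substitution.\n"
    "References\n"
    "1. Hypothetical title describing the R20Q variant of an unrelated gene.\n"
)

body = strip_reference_list(ARTICLE)
print(body)
```

A filter of this kind would remove only one source of spurious matches. Mutations in other genes that are discussed in the main text would still need to be excluded by other means.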
Assuming that these initial difficulties can be overcome, what is the potential of this approach in terms of scaling up the extraction of mutation data for the >3,800 human genes [13] currently known to harbour mutations causing and/or associated with human inherited disease? On the basis of the report of Kuipers et al. [1], we would say that the long-term prospects are likely to be very good, assuming that the search criteria can be suitably refined, using Boolean parameters, so as to exclude the false positives. As the authors discovered, however, each gene/protein is likely to have its own particular false positives to contend with. Thus, the search for GLA mutations also pulled out mutations in the Gla (γ-carboxyglutamic acid) domains of various coagulation factors. It follows that significant effort must be devoted to avoiding such false positives on a gene-wise basis. In this context, it is pertinent to point out that there are likely to be a number of categories of mutation that will be very difficult to identify correctly (or alternatively to weed out) in an entirely automated fashion. These are likely to include:

• Somatic mutations, which, unlike germline mutations, are not heritable and hence are not causative of inherited disease. Some types of gene (eg tumour suppressors) are characterised by both germline and somatic mutations and both types of lesion may often be reported in the same paper. Careful reading of the manuscript is required in order to identify the germline mutations unequivocally.

• Mutations introduced by in vitro mutagenesis and/or molecular modelling and which have not actually been reported in nature ('experimentally generated mutations') will be difficult to distinguish automatically from genuine disease-associated lesions.

• Mutations that have occurred in an evolutionary context (ie in orthologous proteins in non-human organisms over evolutionary time) will be difficult to distinguish in an automated fashion from human disease-associated lesions.

• Mutations at residues that are abundant within specific proteins (eg Gly residues in the collagens) or which occur at identically numbered amino acid residues in many proteins (eg the initiator methionine codon) are likely to represent significant sources of error (especially of the false-positive kind).

• Mutations other than missense and nonsense mutations will require somewhat different search procedures. Thus, identifying other types of micro-lesion (ie micro-deletions, micro-insertions, indels, splicing-relevant and regulatory mutations) and different types of gross gene rearrangement (which together constitute >40 per cent of reported mutations causing human inherited disease; see HGMD [7]) will not only have to take account of the DNA sequence of the gene in question, but will also require the adoption of new text-mining techniques.

• Polymorphic missense variants that are neutral with respect to function/clinical phenotype and synonymous (silent) mutations that are of direct pathological significance (eg via an influence on splicing) represent two categories of mutation that will be difficult to respectively exclude from, and include in, a given dataset solely by automated methods.

• Mutations in genes whose proteins are (or have been) subject to different amino acid numbering systems will be difficult to identify unequivocally. Nowadays, protein numbering has been largely standardised, so that the initiator methionine is invariably attributed +1. In the literature, however, numbering systems for one and the same protein frequently differ, depending upon, for example, whether the initiator methionine is given as +1 or -1 or whether the amino acid numbering starts before or after the pre-pro-peptide. Since newly discovered exons can also alter amino acid numbering, numbering schemes tend to change over time and frequently display inconsistencies between different literature reports. Such difficulties are not likely to be insuperable, however, particularly if an up-to-date amino acid reference sequence is used as a standard and the DNA sequence context of the mutation can be captured and used in the validation process.

• Mutations reported only at the amino acid sequence level, and which cannot be unequivocally assigned a single valid nucleotide sequence level alteration, will require further manual curation.

• Mutations that appear at first sight to be genuine but upon closer inspection prove to have been mis-typed by the original authors of the article (a very common example is provided by Glu/Gln transpositions), will in all likelihood represent a continual problem for purely automated search procedures.
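The point about mutations reported only at the amino acid level can be illustrated with a small Python sketch. Given a reference codon and a reported mutant amino acid, it lists every single-base substitution that would produce that amino acid under the standard genetic code. The function name and the example codons are assumptions chosen for illustration. The sketch considers only single-base changes and presupposes that the reference codon is already known, which in practice requires an up-to-date reference sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_routes(ref_codon, mutant_aa):
    """List mutant codons reachable from ref_codon by one base substitution."""
    routes = []
    for i, ref_base in enumerate(ref_codon):
        for alt in BASES:
            if alt == ref_base:
                continue
            codon = ref_codon[:i] + alt + ref_codon[i + 1:]
            if CODON_TABLE[codon] == mutant_aa:
                routes.append(codon)
    return routes


# Glu (GAA) to Gln: a single substitution explains the report.
print(single_base_routes("GAA", "Q"))   # ['CAA']

# Met (ATG) to Ile: three different substitutions are equally consistent.
print(single_base_routes("ATG", "I"))   # ['ATT', 'ATC', 'ATA']
```

Where more than one codon is returned, the nucleotide change cannot be assigned from the amino acid description alone, and the original report (or the authors) would need to be consulted.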
For all the above-mentioned reasons, it is, at present, hard to see how the automated collation of mutation data can be wholly accurate and reliable without the subsequent deployment of labour-intensive manual validation and curation steps. Even as it stands, however, it is evident that the automated data extraction tool described by Kuipers et al. [1] is likely to represent a very valuable adjunct to the semi-automated data-collation procedures currently employed by both the LSDBs and HGMD. The authors are therefore to be congratulated for the remarkable degree of success that their Mutator tool has already achieved at the pilot stage of its development. While the prospect of next-generation tools for the extraction and validation of mutation data is awaited with keen interest, for the time being we concur with the view expressed by Winnenburg et al. [14] and Caporaso et al. [15] that the most cost-effective and reliable approach to mutation data collection and annotation is currently for automated text-mining methods to be integrated into the manual annotation process, and for manual and automated approaches to be used in concert to mine the biomedical literature for pathological gene lesions.