
Deliverables Report

QLRI-2002-02770 BioMinT

July 2004

BIOMINT DELIVERABLE REPORT D6.2

Evaluation of the first prototype and Recommendations for prototype revision

AUTHORS: Compiled on behalf of the consortium by

Anne-Lise Veuthey, Violaine Pillet, Marc Zehnder (Swiss Institute of Bioinformatics)

Alex Mitchell, Paul Bradley (University of Manchester)

STATUS: Final

CHECKERS:


PROJECT MANAGER

Name: Professor Terri Attwood
Address: Bioinformatics Group, School of Biological Sciences, Stopford Building, University of Manchester, Oxford Road, Manchester, M13 9PT
Phone Number: +44 161 275 5766
Fax Number: +44 161 275 5082
E-mail: [email protected]

TABLE OF CONTENTS

1. EXECUTIVE OVERVIEW
1.1 Goals
1.2 Achievements

2. INTRODUCTION

3. DISCUSSION
3.1 Gene/Protein Synonym database evaluation
3.2 Filtering/ranking/document organisation module evaluation
3.3 Fingerprint classification module evaluation
3.4 Corpora elaboration
3.4.1 Creation of corpora to assist Swiss-Prot annotation
3.4.2 Creation of corpora to assist PRINTS annotation
3.5 Evaluation of the information extraction (IE) module
3.6 Evaluation of the memory-based shallow parser (MBSP)

4. CONCLUSIONS

5. REFERENCES



1. EXECUTIVE OVERVIEW

1.1 Goals

The main objective of this, the second deliverable of WP6, is to assess the quality of the preliminary prototype delivered in D6.1. The main steps of the process are as follows:

• testing the conformity of the prototype with the user requirements concerning the information retrieval and extraction services

• detecting and correcting bugs

• measuring the performance in terms of precision, recall and derived evaluation measures

• estimating the efficiency and usability

This process requires the involvement of both users and developers, with domain experts and technology providers communicating in both directions to discuss and clarify desired performance, user interface, experimental setup and computational feasibility over a large range of application settings. We seek a consensus between conflicting views wherever possible.

1.2 Achievements

We evaluated the preliminary prototype described in D6.1. This prototype consists of a web interface that includes a synonym expansion and meta-query interface, a filtering and ranking module, and a document organisation module. We provided feedback in order to clean the gene/protein synonym database that is used to perform query expansion. We also advised the developers on how to organise the list of gene/protein names according to biological significance. We tested the performance of the preliminary ranking algorithms implemented for result filtering, as well as those for document organisation.

Additional components of the prototype which are not yet fully integrated, as described in D6.1, have also been evaluated. For the fingerprint classification module we performed an extensive manual classification of all fingerprints in the PRINTS database according to their type. We used these data to train the module and to perform a series of blind tests to evaluate performance. The results, which are available on the BioMinT reviewer website, showed a 14.1% error rate, representing a 26% accuracy gain over the hand-crafted classification heuristics used previously. A paper describing this work has been accepted for publication at PKDD-2004 (Hilario et al., 2004).

We have prepared extensive collections of sentences containing information on the various topics needed for database annotation. These were manually selected from the documents retrieved by real queries on specific genes or proteins. These data were then used to train the sentence classification algorithms within the information extraction module. Following training, we performed a manual evaluation of sentence classification on curated test data. The preliminary results, which are available on the BioMinT reviewer website, show precision and recall rates of 32% and 66% respectively for sentences which relate to function, and 67% and 40% for those which relate to structure. We have also provided comprehensive feedback to help improve performance. Other document sets were created based on information provided in the RP line of Swiss-Prot entries and will be used to train the classification and IE modules.

We have performed a brief evaluation of the Memory-Based Shallow Parser (MBSP), manually examining its performance on biological text. Feedback between partners during this task led to the evolution of corpus generation and tagging methods, ensuring that the work of the biological domain experts can be maximally exploited for data mining.



2. INTRODUCTION

The objectives of the sixth workpackage for the current period of the BioMinT project consist of (1) the integration of the modules designed in WP2 and implemented in WP3, WP4 and WP5 into a preliminary prototype, and (2) the evaluation of this prototype regarding conformity with user requirements and performance in terms of recall and precision, as well as estimated efficiency and usability. The issues concerning the integration of the different components into a functional prototype are described in report D6.1. Currently, the preliminary prototype is composed of:

• a query interface which gives access to a list of gene/protein name synonyms which can be selected by the user to expand the query with related terms;

• a module which sends the query to the PubMed server and retrieves the resulting Medline abstracts;

• a set of filtering/ranking/classification algorithms that can be selected by the user for post-processing of results.

A number of modules have only recently become available and are in the process of being integrated into the prototype. These include the fingerprint classification module and the latest Memory-Based Shallow Parser (MBSP), which carries out tokenisation, chunking, part-of-speech tagging and subject/object detection. An earlier release of the MBSP has already been integrated. After discussions between partners, a strategy was adopted to speed up the evaluation process by appraising these modules as soon as they became operational.

In this document, we report the first assessment of the prototype and the additional modules. The evaluation of the integrated prototype focused mainly on the query expansion module. Experts’ recommendations were used to clean up the gene/protein synonym database used for query expansion, and to reorganise the database so as to clearly differentiate synonyms from possible homonyms of a gene/protein name submitted by a user.

An extensive preliminary assessment of common ranking approaches for medical annotation within Swiss-Prot has shown very promising results, and will be investigated further (see Seewald, 2004). A preliminary ranking system to determine relevant papers for PRINTS annotation has also been assessed as useful. Filtering is being reimplemented, following discussion, to enable searches for model species and taxonomic searches in a transparent setting. Although this work remains to be done for the document ranking modules, the results so far are quite positive.

Evaluation and refinement of the PRINTS-specific fingerprint classification module was also positive. Discussions between the biological domain experts and the data-mining experts identified a number of predictive factors which could be used to train the appropriate machine learning algorithms. Comprehensive manual appraisal of the results aided the selection of the most predictive features, and helped to identify the best performing algorithm.
A large amount of effort was put into the creation of corpora for the various information topics that need to be extracted from retrieved documents. This work is an ongoing part of WP1’s task 5. These datasets of documents, containing sentences/phrases tagged according to information type, have been useful for program training and testing, and for evaluating the performance of the extraction process in real conditions. The datasets, and the protocols by which they were generated (which have evolved during feedback between partners), are described in detail below. The finalised methods for assessing the performance of the tools using these benchmark resources are still under discussion between the partners. However, preliminary experiments have been performed on sentence classification using algorithms trained to recognise sentences relating to protein structure and function. Expert feedback and the generation of further datasets will ensure that the performance of the classifiers improves and that a more diverse set of categories is available in future prototypes.

Evaluation of the memory-based shallow parser (MBSP) has predominantly involved investigating its tokenisation performance when confronted with biological text. This was followed by a brief appraisal of its chunking, part-of-speech tagging and subject/object detection modules. It is envisaged that the MBSP will be evaluated in greater depth as it becomes fully integrated into the prototype. Similarly, the appraisal tasks reported in this deliverable will continue as additional features are implemented in the preliminary prototype. The resulting experts’ recommendations will be used directly to develop the second BioMinT prototype.



3. DISCUSSION

3.1 Gene/Protein Synonym database evaluation

Although nomenclatures for gene and protein names exist, many authors describe these entities using their own terms. Furthermore, before such nomenclatures existed, authors were free to choose the names of the genes and proteins they were studying. As a consequence, there are numerous ways (full name, symbol, synonym) to describe a single entity. For example, the “Abdominal A” gene in Drosophila can be referred to with terms such as “abd-A”, “iab-2”, “iab-5”, or even “Hyperabdominal”. Approximately 30 such names are assigned to this gene. Moreover, one term may describe several separate entities (genes/proteins) within a single species, or in different species: this is the problem of homonymy. For instance, in human, “ACS3” designates both the “FACL3” and “twist” genes. In addition, it may specify a gene in another species. As a consequence, in order to recover as many as possible of the publications that mention a specific protein/gene, it is essential to have a list of the terms describing this entity at one’s disposal.

Database construction

A database of protein and gene names (GPSDB) was constructed for this purpose. The first step was to identify the main resources where this type of information is available: fourteen open databases, mainly of model organisms, were exploited (D3.1 and protop BioMinT prototype manual version V_0_9). For each database, specific fields were extracted (official name, symbol, synonyms, database cross-reference links, species name, entry ID, etc.). To construct the synonym database, all pairwise combinations of synonyms from a given entry, combined with the species and source information of the second synonym, were created as separate table entries.
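The pairwise construction described above can be sketched as follows. This is a minimal illustration, not the GPSDB implementation: the record layout, the function name `synonym_pairs`, the source label `ExampleDB`, and the choice to emit ordered pairs (so that a lookup on the first column finds every partner of a term) are all assumptions.

```python
from itertools import permutations

def synonym_pairs(entry):
    """Yield one table row per ordered pair of distinct synonyms of an entry.

    `entry` is a list of (name, species, source) tuples (hypothetical layout).
    Each row carries the species and source of the SECOND synonym of the pair.
    """
    for first, second in permutations(entry, 2):
        if first[0] != second[0]:
            yield (first[0], second[0], second[1], second[2])

# Names taken from the Drosophila example in the text; "ExampleDB" is a placeholder.
entry = [
    ("abd-A", "Drosophila melanogaster", "ExampleDB"),
    ("iab-2", "Drosophila melanogaster", "ExampleDB"),
    ("Hyperabdominal", "Drosophila melanogaster", "ExampleDB"),
]
rows = list(synonym_pairs(entry))
```

Under these assumptions, an entry with n distinct names produces n × (n − 1) rows, which is why a gene with about 30 names already contributes several hundred table entries.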

First evaluation

When querying with a single term, all pairwise combinations of synonyms containing this query term were listed. At this stage it was already possible to separate species by sorting the output according to that criterion. However, the synonym list made no distinction between entities within the same species: searching with “ACS3” (a term that describes two entities in human, “FACL3” and “twist”) yielded a single list of synonyms that did not distinguish between these entities. Moreover, the output list was incomplete. For example, the “FACL3” entity is referred to in five of the 14 open databases, whereas “ACSL3”, a synonym of the “FACL3” entity, is only present in three of the five. Thus, when querying with “ACSL3”, we only retrieved synonyms from these three databases, instead of all five. In order to retrieve all the synonyms, it is necessary to merge the five database entries into a “super entry”: this is feasible using the references that link them together (this procedure is detailed in deliverable D3.1 and in protop BioMinT prototype manual version V_0_9). Once this had been done, homonyms could be distinguished and the list of synonyms was complete.

Second evaluation

Searching the synonym database showed that some terms or entries were unusable in the framework of the project. Various terms present in the database are not necessarily mentioned in the literature, such as accession numbers resulting from various sequencing projects (e.g. KIAA cDNA clones, BEST sequence clusters, RIKEN cDNA). On the other hand, using terms such as “G” to search PubMed recovers a large number of irrelevant documents. Such synonyms, as well as those consisting of digits only or composed of more than 200 characters, were removed. Some entries, corresponding to pseudogenes or to non-coding sequences, were also removed, since our purpose is to focus on proteins and protein-encoding genes. Finally, the content of some entries was modified, e.g. additional information (such as “Note:”, comments, “-pending”, “-provisional”) or special characters (@, “, /, etc.) were removed. The Perl regular expressions used for this cleaning-up process are detailed in the protop BioMinT prototype manual version V_0_9. The resulting database currently contains 559 294 different synonyms referring to 292 472 proteins from 7396 species. GPSDB is updated every three months on average.

Third evaluation

The synonym database is usable in its current state: it can be searched using a gene/protein name or a string of characters. For example, if the query is “%hydrolase%”, synonyms from any “meta entry” containing this term will be retrieved (“hydrolase”, but also “(1->3)-beta-glucan endohydrolase” or “(Di)nucleoside polyphosphate hydrolase”). In addition, the search can be restricted to a given species (e.g. Mus musculus or Homo sapiens) and to one, several, or all referenced databases. Thus, the user can choose to refer to a single database, such as RGD (Rat Genome Database), if it meets their requirements.
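The “%hydrolase%” pattern behaves like an SQL `LIKE` wildcard. The sketch below illustrates such a search with optional species and source restrictions, assuming a single flat table; the table layout, the function names, and the species and source labels attached to the toy rows are invented for illustration and are not taken from GPSDB.

```python
import sqlite3

# Toy rows: synonym names come from the text, species/source labels are invented.
ROWS = [
    ("hydrolase", "Homo sapiens", "DB_A"),
    ("(1->3)-beta-glucan endohydrolase", "Hordeum vulgare", "DB_B"),
    ("(Di)nucleoside polyphosphate hydrolase", "Mus musculus", "DB_A"),
    ("twist", "Homo sapiens", "DB_A"),
]

def make_db(rows):
    """Create an in-memory synonym table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE synonym (name TEXT, species TEXT, source TEXT)")
    conn.executemany("INSERT INTO synonym VALUES (?, ?, ?)", rows)
    return conn

def search(conn, pattern, species=None, sources=None):
    """Return synonym names matching a LIKE pattern, optionally restricted."""
    sql = "SELECT name FROM synonym WHERE name LIKE ?"
    params = [pattern]
    if species is not None:
        sql += " AND species = ?"
        params.append(species)
    if sources:
        sql += " AND source IN (%s)" % ",".join("?" * len(sources))
        params.extend(sources)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]

conn = make_db(ROWS)
all_hits = search(conn, "%hydrolase%")
mouse_hits = search(conn, "%hydrolase%", species="Mus musculus")
```

The restrictions are combined with `AND`, so the species and source filters can be used together or separately, as in the query form.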
An example of a query against the synonym database is shown in Figure 1. The query can be further refined by combining these options. A drop-down list allowing the user to choose one or more "model organisms", or a taxonomic range (e.g. Eukaryota, Viridiplantae, Metazoa), in addition to the editable "species" field, is currently under development and will be available shortly.
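The merging of cross-referenced database entries into “super entries”, described under the first evaluation, amounts to grouping entries that are connected by reference links. The actual procedure is the one detailed in D3.1; the following is only a sketch of the idea using a union-find structure, with hypothetical entry identifiers and a hypothetical `merge_entries` function.

```python
def merge_entries(entries, cross_refs):
    """Pool the synonyms of entries that are linked by cross-references.

    entries:    dict mapping entry ID -> set of synonyms
    cross_refs: iterable of (entry ID, entry ID) links between databases
    Returns a list of synonym sets, one per "super entry".
    """
    parent = {entry_id: entry_id for entry_id in entries}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in cross_refs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    merged = {}
    for entry_id, synonyms in entries.items():
        merged.setdefault(find(entry_id), set()).update(synonyms)
    return list(merged.values())

# Entry IDs are placeholders; the names follow the ACS3 example in the text.
entries = {
    "db1:1": {"FACL3", "ACS3"},
    "db2:7": {"FACL3", "ACSL3"},
    "db3:9": {"twist", "ACS3"},
}
super_entries = merge_entries(entries, [("db1:1", "db2:7")])
homonym_groups = [s for s in super_entries if "ACS3" in s]
```

In this toy case the two linked entries collapse into one super entry containing “ACSL3”, while “ACS3” still appears in two separate super entries, which is how a homonym becomes visible after merging.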



Figure 1: Interface for querying the gene/protein synonym database.

Once the query has been submitted, it is possible to see whether there are homonyms. Figure 2 provides an example of query output in which “ant” clearly describes two genes/proteins from Drosophila melanogaster, one gene/protein from Drosophila obscura, and two genes/proteins from E. coli. This suggests an obvious way to improve synonym expansion: automatically removing homonyms. However, preliminary results (Seewald, 2004) are inconclusive as to whether this also improves performance in a real-world setting, and further research is needed. For each synonym listed, the user can determine the source from which it originates (Figure 3). Clicking on the “Source” column directly accesses the corresponding database entry. The query output allows the user to choose synonyms with which to formulate a PubMed query (see examples in Figures 4 and 5). In addition, further terms such as species names or other search words may be incorporated into the PubMed query.
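How selected synonyms and additional terms might be combined into a PubMed query string can be sketched as below. The prototype's own query generation is not described here, so the function `build_pubmed_query` and the exact quoting are assumptions; the sketch relies only on PubMed's Boolean operators (`OR`, `AND`) and quoted phrases.

```python
def build_pubmed_query(synonyms, extra_terms=()):
    """Join selected synonyms with OR, then narrow with AND-ed extra terms."""
    if not synonyms:
        raise ValueError("at least one synonym is required")
    query = "(" + " OR ".join('"%s"' % s for s in synonyms) + ")"
    for term in extra_terms:
        query += ' AND "%s"' % term
    return query

query = build_pubmed_query(["abd-A", "iab-2"], ["Drosophila melanogaster"])
```

Quoting each name asks PubMed to treat it as a phrase, although, as noted in Section 3.2, phrases missing from PubMed's index may still be split into separate words.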



Figure 2: Example of a query result output.



Figure 3: Direct links to reference databases.



Figure 4: Selection of terms for the PubMed query.



Figure 5: Generation of a PubMed query.



PRINTS-specific issues regarding synonym and query expansion

Unlike Swiss-Prot, PRINTS annotation pertains to groups of related proteins rather than to individual proteins. For this reason there are often differences in the type of information reported in PRINTS and Swiss-Prot, and this information is often drawn from different types of literature (e.g., in the case of PRINTS, from (super-)family review papers rather than articles describing the characteristics of specific proteins). In order to retrieve such documents, it is necessary to adopt a slightly different approach from the PubMed query construction described above. For example, in many cases we would like to identify any hierarchical relationship between disparate proteins and use this as the basis of the query term.

Work addressing this issue is ongoing, pursuing three approaches, two of which build upon the development of the synonym database. The first involves synonym expansion of individual protein names and examination of the results for common terms which may represent a family or super-family name. The second evaluates the literature returned by a PubMed query with the expanded synonym list, again looking for common terms and groupings. The third involves the development of a document ranking algorithm capable of identifying review-type publications.

It is anticipated that these complementary approaches, tightly coupled with the fingerprint classification module described in Section 3.3, will help construct search queries that return documents more relevant to PRINTS annotation than those returned by individual protein names or synonyms. We will continue to provide feedback on these approaches as they are developed and expect to have a PRINTS-specific query expansion module integrated into the prototype shortly.
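The first approach, looking for terms shared by the expanded synonym lists of several proteins, can be illustrated with a simple token count. This is only a sketch of the general idea; the tokenisation rule (runs of at least three letters), the `min_share` threshold and the protein keys are assumptions, not the method under development.

```python
import re
from collections import Counter

def common_terms(synonyms_by_protein, min_share=1.0):
    """Return tokens found in the synonym lists of at least `min_share` of the proteins."""
    counts = Counter()
    for synonyms in synonyms_by_protein.values():
        tokens = set()
        for name in synonyms:
            # Keep alphabetic runs of 3+ letters; drops digits and short fragments.
            tokens.update(re.findall(r"[a-z]{3,}", name.lower()))
        counts.update(tokens)  # each protein counts a token at most once
    needed = min_share * len(synonyms_by_protein)
    return sorted(token for token, n in counts.items() if n >= needed)

# Keys are placeholders; the names are ordinary receptor names used as an example.
expanded = {
    "protein_1": ["5-hydroxytryptamine 1A receptor", "serotonin receptor 1A"],
    "protein_2": ["5-hydroxytryptamine 2A receptor", "serotonin receptor 2A"],
}
shared = common_terms(expanded)
```

Tokens shared by every protein in the group are candidates for a family or super-family name; a lower `min_share` would tolerate proteins whose synonym lists are incomplete.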

3.2 Filtering/ranking/document organisation module evaluation

Since PubMed’s search engine is not “customisable” and the way in which Medline is indexed does not suit our requirements, using a species name in the query to limit the search may not be an ideal option: all publications that describe a protein from a given species without mentioning that species will be lost. Moreover, phrase searches are automatically “exploded” by PubMed when the phrase is not present in its index of searchable terms. We therefore favour a two-step strategy. In the first step, Medline is searched with gene/protein names only, in order to retrieve as many documents as possible (the user nevertheless still has the option of including additional search terms). In the second step, the user runs filtering, ranking and categorisation algorithms on the returned documents. The species name and additional search terms to filter on can already be given in the query form. Filtering based on taxonomic ranges, journal names and types, and terms to exclude is under development.

Once the Medline search is done, several document ranking/categorisation algorithms are offered to the user. In the current state of the prototype, most of these algorithms are not particularly effective, since they were not created for the task at hand. The exception is a preliminary ranker which has been trained specifically to recognise review papers likely to be relevant to PRINTS. Initial evaluations of this ranker have been favourable, and further evaluation and training are underway.



Although a local caching scheme has been implemented and is used in the current prototype, all documents are still stored in XML format. The current XML parser is almost solely responsible for the unsatisfactory runtime performance. We foresee two ways to alleviate this problem: finding a faster XML parser, or locally caching pre-parsed versions of the XML documents that retain only their important contents. This is currently under investigation.

The ranking/organising/categorisation output (Figure 6) can be viewed in another window, with the score, authors, title, year and PubMed ID for each abstract. It would be important to include the name of the journal, since this is one of the factors users take into account when selecting documents. There should also be a way for the user to manually select the documents they wish to submit to the IE module. The information extraction phase is not integrated into the current prototype. However, its performance has been evaluated in a standalone manner (see Section 3.5).
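The second option, caching pre-parsed documents so that each XML record is parsed only once, could look roughly like this. The sketch is not the prototype's caching scheme: the class `ParsedCache`, the `load_xml` callback, the in-memory dictionary and the simplified citation record are assumptions, and the sample content is a placeholder.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up citation record using a few Medline-style element names.
SAMPLE = """<MedlineCitation>
  <PMID>1</PMID>
  <Article>
    <Journal><Title>Example Journal</Title></Journal>
    <ArticleTitle>An example title</ArticleTitle>
    <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
  </Article>
</MedlineCitation>"""

def parse_citation(xml_text):
    """Extract only the fields needed for ranking and display."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext(".//PMID"),
        "journal": root.findtext(".//Journal/Title"),
        "title": root.findtext(".//ArticleTitle"),
        "abstract": " ".join(t.text or "" for t in root.iter("AbstractText")),
    }

class ParsedCache:
    """Parse each document at most once and reuse the extracted fields."""

    def __init__(self, load_xml):
        self._load_xml = load_xml  # callable: PubMed ID -> XML text
        self._fields = {}

    def get(self, pmid):
        if pmid not in self._fields:
            self._fields[pmid] = parse_citation(self._load_xml(pmid))
        return self._fields[pmid]

cache = ParsedCache(lambda pmid: SAMPLE)
record = cache.get("1")
```

Because the journal title is extracted together with the other fields, the same cached record could also supply the journal name requested for the result display. A persistent store could replace the dictionary if the cache needs to survive between sessions.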

Figure 6: Example of a ranking algorithm output.



A separate systematic evaluation of four ranking modules within the BioMinT prototype was recently completed (Seewald, 2004). As data we reused the medical annotation dataset provided by the Swiss Institute of Bioinformatics, which captures the relevance decisions of Swiss-Prot annotators. The results indicate that both classic ranking systems (such as LuceneRanker, which is based on a full-text search engine) and learning rankers perform satisfactorily and are useful for field tests. Our results are comparable to those reported earlier (Dobrokhotov et al., 2003) on the same dataset, i.e. Precision=Recall=63.7% vs. Precision=58.9% and Recall=69.2%, which validates our approach as state-of-the-art. We also investigated whether a local one-year snapshot of Medline improves the results of the ranking. It appears that this is not the case, which confirms that our local searching approach is competitive. However, other considerations, such as the possibility of overloading the PubMed servers, may still make a local Medline installation the best option.

A preliminary investigation into automated homonymy recognition and the removal of homonyms from each expanded query confirmed that the homonymy recognition itself works quite well. Surprisingly, removing known homonyms from each expanded query has very little influence on the ranking. We look forward to continuing our investigation of this important topic with new data, in order to reach a firmer conclusion on the usefulness of homonymy recognition in real-life ranking and information extraction tasks.
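One way to compare the two precision/recall pairs quoted above on a single scale is the balanced F-measure, the harmonic mean of precision and recall. Which derived measure was used in the cited evaluations is not stated here, so applying the F-measure is an assumption made for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The two precision/recall pairs quoted in the text, in percent.
first = f_measure(63.7, 63.7)
second = f_measure(58.9, 69.2)
```

Both pairs give an F-measure of roughly 63.6 to 63.7%, which is consistent with describing the two sets of results as comparable.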

3.3 Fingerprint classification module evaluation

When annotating PRINTS entries, determining the type of fingerprint under consideration is vital, since this influences what information should be gathered, how it should be processed and in what format the output should be presented. In a previous automated approach, embedded within the PRECIS annotation tool (Mitchell et al., 2003), we aimed to resolve this problem through hand-crafted heuristics. Whilst this approach has been relatively successful, some fingerprints are nevertheless misclassified. To resolve this problem, extensive discussions were undertaken between the relevant partners, focusing on the composition and meaning of the various fingerprint fields, the precise anatomy of fingerprint motifs, and so on. These discussions led to the identification of a number of potentially discriminating factors which might help predict whether a fingerprint is diagnostic for a family, super-family or domain. These factors included those relating to the properties of a fingerprint (number of motifs, number of true matches, etc.), those relating to motifs within the fingerprint (length, depth, conservation, etc.), as well as those drawn from the Swiss-Prot records of the constituent protein sequences, similar to those used by PRECIS. The potentially discriminating factors investigated are summarised in Table 1.

In order to train the appropriate machine learning algorithms on the selected factors, we manually classified the type of every fingerprint within the PRINTS database. 80% of these data were then used to train a number of different learning algorithms. These were then tested by 10-fold cross-validation, and on a 'hold-out' dataset made up of the remaining 20% of the fingerprints. The classification of the hold-out set was not disclosed to the data-mining experts, in order to ensure that the tests were performed fully blind.
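The evaluation protocol above (an 80/20 split into training and hold-out data, with 10-fold cross-validation on the training portion) can be sketched as follows. How the actual split was drawn, for example whether it was stratified by fingerprint type, is not stated, so the random shuffle, the fixed seed and the function names are assumptions.

```python
import random

def split_holdout(items, holdout_fraction=0.2, seed=0):
    """Shuffle the items and set aside a hold-out portion."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_holdout = round(len(items) * holdout_fraction)
    return items[n_holdout:], items[:n_holdout]

def k_folds(train, k=10):
    """Yield (training, validation) pairs; every item is validated exactly once."""
    for i in range(k):
        validation = train[i::k]
        training = [x for j, x in enumerate(train) if j % k != i]
        yield training, validation

# 100 stand-in identifiers in place of the manually classified fingerprints.
train, holdout = split_holdout(range(100))
folds = list(k_folds(train))
```

The hold-out items never enter any fold, mirroring the blind test in which the hold-out labels were withheld from the data-mining experts.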



Table 1: Potentially predictive factors for the determination of fingerprint type.

After training a number of algorithms, a feature selection step was performed. This determined which features were discriminating, which were redundant, and which were harmful to the fingerprint type predictions. After refining the factors used by the algorithms accordingly, the cross-validation and hold-out error rates for the various algorithms were recalculated (see Table 2), and the best performing algorithm was selected to serve as the classifier for the module. This algorithm displays an error rate of approximately 14% on both the cross-validation and hold-out datasets. This represents a 26% accuracy gain over our previous approach and is well within our acceptable boundaries for the performance of an automated system. Furthermore, performance should improve as more fingerprints are generated and can be used for training. A paper describing this work has been accepted for publication at PKDD-2004 (Hilario et al., 2004). The manual assessment of algorithm performance on the "hold-out" set is available from the BioMinT reviewer website.



Table 2: Cross-validation and hold-out error rates for the top scoring algorithms following feature selection.

3.4 Corpora elaboration

In order to set up, train and test NLP, document ranking or information extraction programs, it is first necessary to build corpora of documents manually annotated by experts for various information topics. One of the aims of the BioMinT project is to detect sentences containing information on topics corresponding to specific Swiss-Prot entry fields and PRINTS topics of interest. In order to develop tools to perform this process, a number of Swiss-Prot-specific and PRINTS-specific corpora were developed; their generation is described below.

3.4.1 Creation of corpora to assist Swiss-Prot annotation

Several corpora were created, based on information from the RP line of Swiss-Prot entries. The RP line provides general information on the content of a publication cited in a Swiss-Prot entry. For example, an RP line mentioning “Sequence from NA, Function, and Induction” implies that the publication in question contains information on these specific topics (whether the information resides in the abstract or in the full text). In a first step, all publications for a given topic in the RP line were extracted. From this set of abstracts, two domain experts manually labelled all sentences containing relevant information with respect to that topic. The labelled sentences were pooled and their content was discussed so as to homogenise the dataset. Here are examples of sentences extracted for:


• Subcellular location


1396601|Sec12p is an ER type II membrane protein that mediates the membrane attachment of the GTP-binding Sar1 protein.

1527387|Immunocytochemical staining localized MMCP-5 to the cytoplasmic granules of serosal mast cells and Kirsten sarcoma virus-immortalized mouse mast cells.

• Tissue specificity

1573677|In situ hybridization studies demonstrated prominent expression by neurons in brain.

1289053|In the yolk sac, Glut-3 appeared to be expressed only in the endoderm while Glut-1, although expressed in both layers, was expressed more strongly in the mesoderm layer.
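The example lines above follow a simple `number|sentence` layout, which makes them easy to load for training or evaluation. The sketch below assumes that the leading number is the PubMed identifier of the source abstract, which the examples suggest but the text does not state; the function name is hypothetical.

```python
def parse_labelled_line(line):
    """Split a 'number|sentence' record into (identifier, sentence)."""
    identifier, sentence = line.split("|", 1)  # only the first "|" separates fields
    if not identifier.isdigit():
        raise ValueError("expected a numeric identifier before '|': %r" % line)
    return int(identifier), sentence.strip()

record = parse_labelled_line(
    "1573677|In situ hybridization studies demonstrated prominent "
    "expression by neurons in brain."
)
```

Splitting on the first separator only keeps any later `|` characters inside the sentence text intact.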

Table 3 shows the number of sentences extracted so far for each of six topics.



Topic                                     # extracted sentences
Subcellular location                      2263
Tissue specificity                        3683
Developmental stage                       1253
Alternative products                      1723
  Alternative splicing                    994
  Alternative promoter                    48
  Alternative initiation                  64
  Undetermined                            642
PTM (Post Translational Modification)
  Glycosylation                           439
  Phosphorylation                         1283
Bond
  Disulfide bond                          600

Table 3: Counts of extracted sentences for different topics.

It is planned to extract sentences for other topics on the same principle. Sets of labelled sentences for at least ten topics will be available soon. All these labelled corpora are used as a basis for training document classification and information extraction tools, as well as for assessing the performance of such tools.

A second type of corpus was created in the form of XML-labelled abstracts, with the collaboration of 15 volunteer Swiss-Prot curators. For this purpose, we used a pre-existing XML annotation tool developed by Gilles Bisson’s team from the CNRS, Grenoble, in the framework of the Caderige project (http://caderige.imag.fr/) and improved according to our recommendations. The DTD it uses presently defines a total of 21 topics (tags) and 55 subtopics (attributes) corresponding to most fields of a Swiss-Prot entry (the classes of information presented in Appendix E of deliverables D1.1-D1.4 have been slightly reorganised). It enables annotators to label words, groups of words, or whole sentences in a Medline abstract that describe a type of information they consider useful for annotating a protein in Swiss-Prot (Figure 7). The resulting file is automatically saved in XML format and can easily be used for later processing (parsing). The XML labelling tool is integrated into the environment of the curators, and so far 714 XML documents have been labelled. Table 4 lists the total number of labelled phrases for each topic. This is an ongoing process and we hope to reach the threshold of 2000 abstracts soon.
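Because topics are encoded as tags and subtopics as attributes, per-topic counts can be obtained by walking the elements of each labelled abstract. The DTD itself is not reproduced in this report, so the element names, the `subtopic` attribute name and the sample content below are hypothetical; only the tag/attribute principle comes from the text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: element and attribute names only imitate the described DTD.
SAMPLE = (
    '<abstract>The protein <Function subtopic="Enzyme">hydrolyses ATP</Function>'
    ' and is <Subcellular_location>found in the nucleus</Subcellular_location>.'
    ' It <Function subtopic="Pathway">takes part in glycolysis</Function>.'
    '</abstract>'
)

def count_labels(xml_text):
    """Count labelled phrases per (topic, subtopic) in one labelled abstract."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element is root:
            continue  # the enclosing abstract element is not a label
        counts[(element.tag, element.get("subtopic"))] += 1
    return counts

def count_topics(label_counts):
    """Collapse (topic, subtopic) counts into per-topic totals."""
    totals = Counter()
    for (topic, _subtopic), n in label_counts.items():
        totals[topic] += n
    return totals

labels = count_labels(SAMPLE)
topics = count_topics(labels)
```

Summing such per-abstract counters over all labelled documents would give totals of the kind reported per topic.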



Topics and subtopics (tags and attributes)    # tagged phrases

Function (Function, Enzyme, Pathway, Cofactor)    999
Subunit    409
Tissue specificity    285
Region (Domain, Peptide, Transmem, Zn_fing, Repeat, Act_site, Site, Undetermined)    285
Sequence (From_NA, From_aa)    256
Subcellular location    209
PTM (Post Translational Modification) (Init_met, Signal, Transit, Propep, Carbohyd, Acetylation, Amidation, Blocked, Formylation, Gamma-carboxyglutamic_acid, Hydroxylation, Methylation, Phosphorylation, Pyrrolidone_carboxylic_acid, Sulfation, Myristate, Palmitate, Farnesyl, Geranyl-Geranyl, GPI-anchor, N-acyl_diglyceride, S-archaeol, N-octanoate, Undetermined)    147
Method_3D    102
Mutagen    92
Similarity    81
Binding (Metal, Binding, Ca_bind, DNA_bind, NP_bind, Undetermined)    74
Alternative product (Alternative_splicing, Alternative_promoter, Alternative_initiation, Undetermined)    60
Enzyme regulation    41
Developmental stage    41
Variant (Polymorphism, Undetermined)    31
Induction    31
Bond (Disulfid, Crosslnk, Undetermined)    26
Mass spectrometry (Method, MW)    4
RNA_editing    2
Disease    20
PCP (PhysicoChemical Properties)    0

Total number of labels    3195

Table 4: Total number of labelled phrases per topic.



Figure 7: Example of abstract labelled using the XML editor interface. The list of tags is displayed in the bottom right window.



3.4.2 Creation of corpora to assist PRINTS annotation

The generation of PRINTS corpora largely mirrored that for Swiss-Prot, although with some specialised differences. We produced several corpora automatically using a Perl-based tool which generated random PubMed identification numbers and retrieved the corresponding documents from the database. These were then parsed, and documents whose titles or abstracts contained the strings "gene", "genetic", "genes", "proteins" or "protein" were retained and the rest discarded. This step ensured that only documents broadly similar to those used by PRINTS annotators were contained within the corpora. For the first iteration of the prototype it was decided to concentrate on sentences relating to protein structure, function and disease associations, since these represent the core categories of interest for fingerprint annotation. Sentences relating to these categories were manually selected from the corpora and transcribed into plain text files. Although separate corpora were initially created for each category of interest, discussion between partners highlighted the benefits of a single corpus for resolving issues such as the identification of negative examples. The disparate corpora were therefore unified, and the task of identifying structure, function and disease sentences was revisited so that all sentences from each category were fully identified. The resulting single corpus is made up of 1221 abstracts, representing 10825 individual sentences. Of these, 603, 1018 and 650 are tagged as relating to function, structure and disease, respectively. The corpus and sentence collections have been used to train and evaluate the sentence classifiers of BioMinT's IE module (see below). The number of example sentences for each category should continue to grow as further test sets are developed for appraisal of this IE module since, once evaluated, the test data can then be used as further training material.
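The keyword filter applied to the randomly retrieved documents can be sketched as follows. This is a minimal Python illustration and not the Perl tool used in the project; the function names, the dictionary layout of a document and the choice of case-insensitive matching are assumptions, since the report only states that titles or abstracts had to contain one of the five strings.

```python
# Hypothetical sketch of the keyword filter; names and record layout are assumptions.
KEYWORDS = ("gene", "genetic", "genes", "proteins", "protein")


def is_relevant(title: str, abstract: str) -> bool:
    """Return True if the title or abstract contains any of the keyword strings."""
    text = f"{title} {abstract}".lower()
    return any(keyword in text for keyword in KEYWORDS)


def filter_documents(documents):
    """Keep only the documents that pass the keyword filter."""
    return [doc for doc in documents if is_relevant(doc["title"], doc["abstract"])]
```

Because the test is on strings rather than whole words, "gene" would also match words such as "general"; whether the original tool behaved this way is not stated in the report.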
In addition, the number of sentence categories will grow as the unified corpus is examined for further fields of interest (e.g., protein family relationships, spatial distribution, etc.). This task is currently ongoing. Discussions between partners have also addressed how to make maximum use of the data we have produced. Following a meeting between the biologist and the IE/natural language processing experts, it was decided that we should not only isolate individual sentences, but also tag the parts of those sentences which deal specifically with the subject, object and our category of interest. For example, given the sentence "Sequence analysis revealed that the regions that mediate the interaction between SNAP-25 and syntaxin contain heptad repeats characteristic of certain classes of alpha-helices.", the section dealing specifically with the protein subject and its structure could be identified and tagged as follows: "Sequence analysis revealed that <struc>the regions that mediate the interaction between SNAP-25 and syntaxin contain heptad repeats characteristic of certain classes of alpha-helices</struc>." We have performed this task for each of the structure, function and disease sentences derived from the corpus. We have incorporated this into an XML framework to facilitate training of the MBSP. An example showing an appropriately tagged abstract is given in Figure 8. This additional level of tagging increases the time taken to prepare additional categories of interest from the corpus, but should help improve the training of the sentence classifiers and the shallow understanding performance of the MBSP, thus enabling a more fine-grained approach to data mining in the long term.

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE article SYSTEM "article.dtd">
<article>
  <PubMed_ID>11959995</PubMed_ID>
  <text tagged-for-func="yes" tagged-for-struct="yes" tagged-for-dis="yes">
    <s>In mammals, <struc>Nck represented by two genes, is a 47-kDa SH2/SH3 domain-containing protein lacking intrinsic enzymatic function</struc>.</s>
    <s>Here, we reported that <struc>the first and the third SH3 domains of Nck-1 interact with the C-terminal region of the beta subunit of the eukaryotic initiation factor 2 (eIF2 beta)</struc>.</s>
    <s><struc>Binding of eIF2 beta was specific to the SH3 domains of Nck-1, and in vivo, the interaction Nck/eIF2 beta was demonstrated by reciprocal coimmunoprecipitations</struc>.</s>
    <s>In addition, Nck was detected in a molecular complex with eIF2 beta in an enriched ribosomal fraction, whereas no other SH2/SH3 domain-containing adapters were found.</s>
    <s>Cell fractionation studies demonstrated that the presence of Nck in purified ribosomal fractions was enhanced after insulin stimulation, suggesting that growth factors dynamically regulate translocation of Nck to ribosomes.</s>
    <s>In HEK293 cells, we observed that transient overexpression of Nck-1 significantly enhanced Cap-dependent and -independent protein translation.</s>
    <s><struc>This effect of Nck-1 required the integrity of its first and third SH3 domains originally found to interact with eIF2 beta</struc>.</s>
    <s>Finally, in vitro, <func>Nck-1 also increased protein translation, revealing a direct role for Nck-1 in this process.</func></s>
    <s>Our study demonstrates that in addition to mediate receptor tyrosine kinase signaling, Nck-1 modulates protein translation potentially through its direct interaction with an intrinsic component of the protein translation machinery.</s>
  </text>
</article>

Figure 8: Example of an XML tagged abstract. Individual sentences are designated by <s> tags, and those which relate to structure and function contain <struc> and <func> tags. Where appropriate, these tags mark the boundaries of the function and structural statements within the sentences.
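As an illustration of how such a tagged abstract might be turned into training examples, the sketch below reads the <s>, <struc> and <func> elements with Python's standard xml.etree.ElementTree module. It is not the project's own tooling: the function name and the tuple layout are assumptions, the sample sentences are abbreviated from Figure 8, and only the two category tags visible in that figure are handled.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the layout of Figure 8 (sentence text shortened).
SAMPLE = """<article>
<PubMed_ID>11959995</PubMed_ID>
<text tagged-for-func="yes" tagged-for-struct="yes" tagged-for-dis="yes">
<s>In mammals, <struc>Nck is a 47-kDa SH2/SH3 domain-containing protein</struc>.</s>
<s>In addition, Nck was detected in a molecular complex with eIF2 beta.</s>
<s>Finally, in vitro, <func>Nck-1 also increased protein translation.</func></s>
</text>
</article>"""


def extract_examples(xml_text):
    """Return (sentence, category, tagged span) tuples.

    Untagged sentences yield (sentence, None, None) and can serve
    as negative examples.
    """
    root = ET.fromstring(xml_text)
    examples = []
    for sentence in root.iter("s"):
        full_text = "".join(sentence.itertext())
        spans = [child for child in sentence if child.tag in ("struc", "func")]
        if not spans:
            examples.append((full_text, None, None))
        for span in spans:
            examples.append((full_text, span.tag, "".join(span.itertext())))
    return examples
```

Keeping the full sentence alongside the tagged span means the same output could feed both a sentence classifier (sentence and category) and a finer-grained extractor (span boundaries).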



3.5 Evaluation of the information extraction (IE) module

In order to perform a preliminary evaluation of the IE module, a number of sentence classification algorithms were trained on the structure and function datasets described above. Initial estimates of performance were calculated via 10-fold cross-validation, and the algorithms with the best F measures were selected. These displayed precision and recall rates of 55 and 79% respectively for sentences relating to function (F measure 65%), and 58 and 68% for those relating to structure (F measure 62%). To test classification performance on real world examples, we generated a corpus of 30 test abstracts, composed of 259 lines of text, over which the classifiers were run. Each sentence in the output was then manually checked and, where appropriate, tagged according to whether it represented a true-, false-, or missed-positive. This ensured that the data were not only useful for appraisal, but could also be used to train the system further. This iterative testing and training approach should ensure that classification performance improves as additional evaluation takes place. Example evaluated and tagged output is shown in Figure 9, and the complete result sets are available on the BioMinT reviewer's website.
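For reference, the relationship between precision, recall and the F measure can be sketched as below. The report does not state which variant of the F measure was used; this sketch assumes the balanced form (the harmonic mean of precision and recall), and the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F measure: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this assumption, the function figures (precision 0.55, recall 0.79) give roughly 0.65, in line with the reported 65%. For structure, the rounded inputs 0.58 and 0.68 give roughly 0.63 against the reported 62%; the small difference presumably reflects rounding of the published precision and recall.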


<fp><s><func>These results indicate that the characteristics of the residue at position 101 of the alpha1 subunit play a crucial role in determining the efficacy of benzodiazepine-site ligands.</func></s></fp>
<mp><s>The role of the GABA(A) receptor beta3 subunit in determining acute cocaine sensitivity and behavioral sensitization to repeated cocaine was measured in mice missing both ( - / - ) , one ( +/ - ) , or neither ( + / + ) allele of the beta3 gene .</s></mp>
<tp><s><func> GB1 binds GABA and GB2 plays a major role in G-protein activation as well as in the high agonist affinity state of GB1 .</func></s></tp>
<s>Locomotor stimulation induced by one cocaine injection ( 20 mg / kg , i.p. ) was found to be greater in - / - mice compared with + / + mice , whereas cocaine-induced behaviors were intermediate in + / - mice .</s>

Figure 9: Example of evaluation of sentence classifier output. Individual sentences are designated by <s> tags. True-, false- and missed-positives are indicated by <tp>, <fp> and <mp> tags respectively.
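Precision and recall can be derived from output tagged in this way by counting the three kinds of marker. The sketch below is an assumed illustration, not the evaluation script used in the project; it treats the tagged output as plain text (the fragment in Figure 9 has no single root element), and the function name and abbreviated sample sentences are hypothetical.

```python
import re


def score_tagged_output(tagged_text: str):
    """Count <tp>, <fp> and <mp> markers and derive precision and recall."""
    counts = {tag: len(re.findall(f"<{tag}>", tagged_text)) for tag in ("tp", "fp", "mp")}
    predicted = counts["tp"] + counts["fp"]  # sentences the classifier labelled
    relevant = counts["tp"] + counts["mp"]   # sentences it should have labelled
    precision = counts["tp"] / predicted if predicted else 0.0
    recall = counts["tp"] / relevant if relevant else 0.0
    return counts, precision, recall


# Abbreviated sample in the layout of Figure 9.
SAMPLE = (
    "<fp><s><func>These results indicate a crucial role of residue 101.</func></s></fp>\n"
    "<mp><s>The role of the beta3 subunit was measured in mice.</s></mp>\n"
    "<tp><s><func>GB1 binds GABA.</func></s></tp>\n"
    "<s>Locomotor stimulation was found to be greater in - / - mice.</s>\n"
)
```

Untagged sentences, such as the last one in the sample, are correctly rejected sentences and do not enter either ratio.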


Analysis of the results revealed precision and recall rates of 32 and 66% respectively for sentences which relate to function, and 67 and 40% for those which relate to structure. These represent significant differences compared with the values generated by 10-fold cross-validation and illustrate the importance of using real world data for testing. It is likely that the higher precision performance for structure over function is due to the more specialised language used to convey structural information; many words within the training sets are found almost exclusively in sentences relating to protein structure (e.g., transmembrane domains, hydrophobic regions, alpha-helices, beta-pleated sheets, etc.). The low precision rate for functional information was somewhat disappointing. However, this was a first attempt at sentence classification and should improve as additional testing and training is performed. In addition, we expect recall to improve across the board when the IE module is properly integrated into the prototype so that document ranking is performed to select the most relevant literature prior to the classification stage.

3.6 Evaluation of the memory-based shallow parser (MBSP)

The MBSP is composed of a tokeniser, tagger, chunker and subject/object detection module. Since the tokeniser performs the first pre-processing steps of the shallow parser (detecting sentence boundaries and identifying individual tokens), evaluation of this module represented a natural starting point in appraisal of the MBSP. In addition, unlike other components of the MBSP, the output of the tokeniser is immediately comprehensible by biologists and does not require significant specialised knowledge to decode. Following training of the MBSP on the GENIA corpus, precision and recall for sentences and tokens were evaluated. The results were impressive, with precision and recall values of 99.2 and 99.8% respectively for sentences and 96.5 and 99.9% for tokens. This level of performance should be sufficiently high for the BioMinT tool, without requiring further optimisation. Evaluation of the chunker, tagger and subject/object detector was also performed, albeit briefly. This appraisal was somewhat hampered by the steep learning curve to be climbed by non-natural language processing experts so that they might understand the output of these modules.
Nevertheless, we persevered, supplying the MBSP with sets of real world abstracts and manually assessing output such as that illustrated in Figure 10. Overall, the system appeared to work well, although slight problems were evident in certain cases, such as in the following example:

<NP relation="SBJ" of="2"> <W pos="DT">a</W> <W pos="NNP">G</W> </NP>
<VP id="2"> <W pos="VBD">protein-coupled</W> </VP>
<NP relation="OBJ" of="2"> <W pos="NN">receptor</W> </NP>

Here, a simplistic approach would be to group "G protein-coupled receptor" as one noun chunk. The MBSP, however, attempts a more advanced, fine-grained analysis. The results are slightly off target; "G protein" should be grouped as a single noun chunk, and "coupled" rather than "protein-coupled" should be labelled as the verb. This would also legitimise the indicated subject/object relationship. It is anticipated that cases such as these will be resolved through more training, manual appraisal of the results, and expert feedback. In addition, to further train the subject/object detection module, discussions between partners have addressed corpora construction methodologies. Initially it was proposed that biologists should tag the actors, subjects and actions within sentences, as follows:

Identification and purification of [subject]human Stat proteins[/subject] [action]activated[/action] in response to [actor]interleukin-2[/actor].

However, this approach ignores the syntactic structure of the annotation. Furthermore, such tagging is often difficult for non-NLP experts to perform. A compromise was therefore adopted whereby sentences pre-classified into categories of interest are clause tagged, so that the (partial) actor-action-object event within the sentence is indicated (as described in Section 3.4.2). Such clause tagged sentences can then be shallow parsed, and extraction of actors, actions and objects can be attempted automatically. It is hoped that such an approach will provide training material of high enough quality to improve the shallow understanding performance of the MBSP without a costly increase in corpora production time.



<S cnt="1">
  <NP relation="SBJ" of="1"> <W pos="JJ">Modular</W> <W pos="NNS">polyketide</W> <W pos="NNS">synthases</W> </NP>
  <W pos="openparen">(</W>
  <NP> <W pos="NNPS">PKSs</W> </NP>
  <W pos="closeparen">)</W>
  <VP id="1"> <W pos="VBP">are</W> </VP>
  <ADJP> <W pos="RB">very</W> <W pos="JJ">large</W> </ADJP>
  <NP> <W pos="JJ">multifunctional</W> <W pos="JJ">enzyme</W> <W pos="NNS">complexes</W> </NP>
  <NP relation="SBJ" of="2"> <W pos="WDT">that</W> </NP>
  <VP id="2"> <W pos="VBP">synthesize</W> </VP>
  <NP relation="OBJ" of="2"> <W pos="DT">a</W> <W pos="NN">number</W> </NP>
  <PNP>
    <PP> <W pos="IN">of</W> </PP>
    <NP> <W pos="RB">medicinally</W> <W pos="JJ">important</W> <W pos="JJ">natural</W> <W pos="NNS">products</W> </NP>
  </PNP>
  <W pos="period">.</W>
</S>

Figure 10: Example output of the MBSP. A single sentence has been split into tokens which are tagged with the Penn Treebank II Part of Speech Tagset. Non-overlapping, non-embedded chunks and noun chunk subject/object relationships are also tagged.
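To show how the subject/object relationships in this output might be read back programmatically, the sketch below links each noun chunk to its verb chunk through the relation, of and id attributes. It is an illustration only: the function names are hypothetical, the sample is the short "G protein-coupled receptor" fragment from Section 3.6 wrapped in an <S> element (an assumption, so that it parses as a single document), and the code assumes one subject and one object per verb.

```python
import xml.etree.ElementTree as ET

# Fragment from Section 3.6, wrapped in <S> so that it has a single root.
SAMPLE = """<S cnt="1">
<NP relation="SBJ" of="2"><W pos="DT">a</W><W pos="NNP">G</W></NP>
<VP id="2"><W pos="VBD">protein-coupled</W></VP>
<NP relation="OBJ" of="2"><W pos="NN">receptor</W></NP>
</S>"""


def chunk_text(chunk):
    """Join the words of a chunk with single spaces."""
    return " ".join(word.text for word in chunk.iter("W"))


def extract_relations(sentence_xml):
    """Map each verb chunk id to its verb, subject and object text."""
    root = ET.fromstring(sentence_xml)
    verbs = {
        vp.get("id"): {"verb": chunk_text(vp), "SBJ": None, "OBJ": None}
        for vp in root.iter("VP")
    }
    for np in root.iter("NP"):
        relation, verb_id = np.get("relation"), np.get("of")
        if relation in ("SBJ", "OBJ") and verb_id in verbs:
            verbs[verb_id][relation] = chunk_text(np)
    return verbs
```

Applied to the sample, this yields the subject "a G", the verb "protein-coupled" and the object "receptor", which makes the mis-chunking discussed in Section 3.6 easy to spot during manual appraisal.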



4 CONCLUSIONS

The evaluation work performed so far has allowed finalisation of the gene/protein synonym database such that it is now useful to any biologist. Indeed, a list of names/synonyms is necessary for querying Medline in order to recover a maximum of publications describing a particular gene/protein. We aim to submit a paper on this subject and to make the resource available to the biomedical community. We found the core IR module to perform satisfactorily. While the ranking modules already work for the field of medical annotation within Swiss-Prot, the filtering and classification modules need some extensions and further development. In the future we will work along three lines:

- Biological experts are developing sets of rules to improve the filtering and ranking processes. These rules will be thoroughly evaluated and compared to other ranking and filtering approaches, and may be integrated into the next prototype based on their merit.

- The preliminary user interface is being developed in a flexible fashion that will allow an easy choice of program parameters by the evaluators in order to rapidly test their effect on program performance. Currently, fixed parameter settings are set ad-hoc or optimised for training data, where available. The feedback of evaluators on the effect of parameters will speed up and ground the parameter optimisation process in a real-world setting.

- We will provide a set of benchmarked data resulting from expert manual selection of the most relevant documents of a query on a specific gene according to a specific information topic. These data will be used to evaluate and adjust the filtering/ranking and also the classification programs.

Although the PRINTS-specific modifications to the IR module are not yet fully implemented, significant progress has been made through the generation of a fingerprint classification module and also by promising preliminary results in ranking review papers. Thorough evaluation of the fingerprint classification module revealed it to be sufficiently well adapted for the classification task. We are currently devising and evaluating a number of methods by which this module might be coupled to the synonym database to perform query expansion in the most effective manner.

During training of the IE module we identified a number of issues with corpora generation. These have now been resolved, ensuring the current corpora, and any generated in future, are of maximum use to the IE and NLP training tasks. Preliminary evaluation of the IE module revealed that further training is required to achieve our desired level of performance at sentence classification. This task is ongoing.

Evaluation of the MBSP suggested that the module performs sufficiently well on biological text to be integrated into the prototype. We anticipate that the very fine level of detail which can be identified and extracted with this module should be very beneficial to the BioMinT tool. Evaluation of this module will continue in a more extensive form once it is fully integrated into the prototype.



According to this workplan, we hope to be ready to provide a second prototype that will conform to user requirements and that will be directly suitable for the further project tasks: the updating of database annotation (WP7) and the generalisation of the tool for other proteomic applications (WP8).




5 REFERENCES

Dobrokhotov, P.B., Goutte, C., Veuthey, A-L. and Gaussier, E. (2003) Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation. Bioinformatics, 19 Suppl 1: I91-I94.

Hilario, M., Mitchell, A., Kim, J-H., Bradley, P. and Attwood, T. (2004) Classifying Protein Fingerprints. 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2004), Pisa, Italy 2004. Accepted.

Mitchell, A.L., Reich, J.R. and Attwood, T.K. (2003) PRECIS - An automatic tool for generating Protein Reports Engineered from Concise Information in Swiss-Prot. Bioinformatics, 19, 1664-1671.

Seewald, A.K. (2004) Ranking for BioMinT: Investigating Performance, Local Search and Homonymy Recognition. Technical Report, Österreichisches Forschungsinstitut für Artificial Intelligence, Wien, TR-2004-14. Submitted to KELSI 2004.
