
A Research Perspective on Text Mining: Tasks, Technologies and Prototype Applications

Robert Gaizauskas

Natural Language Processing Group

Department of Computer Science,

University of Sheffield

September 4, 2002 Euromap Text Mining Seminar

Outline of Talk

Text Mining: Scenario, Definitions and Brief History

Text Mining Tasks + Methodologies

Text Mining Technologies

Text Mining Prototype Applications

Conclusions and Future Directions/Challenges


Text Mining: Scenario


Text Mining Scenario Components: Texts

Genres
• Newspapers
• Company reports
• Web pages
• Scientific papers
• Legal documents

E-Formats
• Word Documents (.doc, .rtf)
• PDF/Postscript
• HTML/SGML/XML

Languages
• English … French … Greek … Russian … Chinese … Hindi … Sanskrit … Linear B

Character encodings: ASCII, ISO 8859, Unicode


Text Mining Scenario Components: Users

User domain of interest
• Business – competitor intelligence, corporate intranet/memory
• Scientists – access to literature
• Military/police intelligence – open source intelligence, intranet
• Journalists – news archives

User level of expertise
• Novice/expert

User linguistic competence
• Adult/child
• Native/non-native language speaker
• Uni/multi-lingual


Text Mining Scenario Components: Information Access Needs

Ad hoc searching
• Specific questions: “What year did the Berlin Wall come down?”
• General background/context: “Tell me about Zakopane”

Stable intelligence gathering
• Scenario-related: “Build a database recording new projects in the energy sector: the players, location, energy type, start date, capitalisation”
• Entity-related: “Build a database of key scientists in the pharma industry: name, employer, position, start and end dates”

Current awareness
• Alerting: “Let me know when any papers are published on the crystallographic structure of any lipase”
• Document selection: “Assemble articles on drug approvals”

Summarisation
• Single/multi-document: “Summarise the Bulger trial”


Text Mining Scenario Components: Tools

Information retrieval


What is Information Extraction?

The Information Extraction (IE) task: from each text in a set of natural language texts extract information about predefined classes of entities and relationships and place this information into a template or database record.

E.g. from financial newswire stories identify those dealing with management succession events and from these extract details of organisations and persons, the post being assumed or vacated, the reason for vacancy, etc.

IE may also be described as the activity of populating a structured information repository (database) from an unstructured, or free text, information source.


What is Information Extraction? (cont)

The resulting structured database is then used for some other purpose:

searching or analysis using conventional database queries;

data-mining;

generating a summary (perhaps in another language);

constructing indices into/within/between the source texts.


Example: A Wall Street Journal Article

<DOC>

<DOCID> wsj94_008.0212 </DOCID>

<DOCNO> 940413-0062. </DOCNO>

<HL> Who's News:

@ Burns Fry Ltd. </HL>

<DD> 04/13/94 </DD>

<SO> WALL STREET JOURNAL (J), PAGE B10 </SO>

<CO> MER </CO>

<IN> SECURITIES (SCR) </IN>

<TXT>

<p>

BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.

</p>

</TXT>

</DOC>


Example: A Management Succession Event Template

<TEMPLATE> :=
    DOC_NR: "NUMBER" ^
    CONTENT: <SUCCESSION_EVENT> *

<SUCCESSION_EVENT> :=
    ORGANIZATION: <ORGANIZATION> ^
    POST: "POSITION TITLE" | "no title" ^
    IN_AND_OUT: <IN_AND_OUT> +
    VACANCY_REASON: {DEPART_WORKFORCE, REASSIGNMENT, NEW_POST_CREATED, OTH_UNK} ^

<IN_AND_OUT> :=
    PERSON: <PERSON> ^
    NEW_STATUS: {IN, IN_ACTING, OUT, OUT_ACTING} ^
    ON_THE_JOB: {YES, NO, UNCLEAR}
    OTHER_ORG: <ORGANIZATION> -
    REL_OTHER_ORG: {SAME_ORG, RELATED_ORG, OUTSIDE_ORG} -

<ORGANIZATION> :=
    ORG_NAME: "NAME" -
    ORG_ALIAS: "ALIAS" *
    ORG_DESCRIPTOR: "DESCRIPTOR" -
    ORG_TYPE: {GOVERNMENT, COMPANY, OTHER} ^
    ORG_LOCALE: LOCALE_STRING {{CITY, PROVINCE, COUNTRY, REGION, UNK}} *
    ORG_COUNTRY: NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING *

<PERSON> :=
    PER_NAME: "NAME" -
    PER_ALIAS: "ALIAS" *
    PER_TITLE: "TITLE" *


Example: A (Partially) Filled Management Succession Event Template

<TEMPLATE-9404130062> :=
    DOC_NR: "9404130062"
    CONTENT: <SUCCESSION_EVENT-1>

<SUCCESSION_EVENT-1> :=
    SUCCESSION_ORG: <ORGANIZATION-1>
    POST: "executive vice president"
    IN_AND_OUT: <IN_AND_OUT-1> <IN_AND_OUT-2>
    VACANCY_REASON: OTH_UNK

<IN_AND_OUT-1> :=
    IO_PERSON: <PERSON-1>
    NEW_STATUS: OUT
    ON_THE_JOB: NO

<IN_AND_OUT-2> :=
    IO_PERSON: <PERSON-2>
    NEW_STATUS: IN
    ON_THE_JOB: NO
    OTHER_ORG: <ORGANIZATION-2>
    REL_OTHER_ORG: OUTSIDE_ORG

<ORGANIZATION-1> :=
    ORG_NAME: "Burns Fry Ltd."
    ORG_ALIAS: "Burns Fry"
    ORG_DESCRIPTOR: "this brokerage firm"
    ORG_TYPE: COMPANY
    ORG_LOCALE: Toronto CITY
    ORG_COUNTRY: Canada

<ORGANIZATION-2> :=
    ORG_NAME: "Merrill Lynch Canada Inc."
    ORG_ALIAS: "Merrill Lynch"
    ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co."
    ORG_TYPE: COMPANY

<PERSON-1> :=
    PER_NAME: "Mark Kassirer"

<PERSON-2> :=
    PER_NAME: "Donald Wright"
    PER_ALIAS: "Wright"
    PER_TITLE: "Mr."


Example: Uses for Templates

From the completely filled version of the preceding template a natural language summary can be generated:

BURNS FRY Ltd. named Donald Wright as executive vice president.

Donald Wright resigned as president of Merrill Lynch Canada Inc.

Mark Kassirer left as president of BURNS FRY Ltd.

Or, a table can be constructed:

Company               Post          Person         Direction
Burns Fry             Executive VP  Donald Wright  In
Burns Fry             President     Mark Kassirer  Out
Merrill Lynch Canada  President     Donald Wright  Out
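
As a rough illustration of how a filled template could feed the table above, the following Python sketch flattens succession events into (company, post, person, direction) rows. The class and function names (`InAndOut`, `SuccessionEvent`, `to_rows`) are assumptions made for this example and are not taken from any MUC system; the slot names in the comments refer to the template definition shown earlier.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class InAndOut:
    person: str       # PER_NAME of the person involved
    new_status: str   # NEW_STATUS slot: IN, IN_ACTING, OUT or OUT_ACTING


@dataclass
class SuccessionEvent:
    organization: str  # ORG_NAME (or alias) of the organisation
    post: str          # POST slot
    in_and_out: List[InAndOut] = field(default_factory=list)


def to_rows(events: List[SuccessionEvent]) -> List[Tuple[str, str, str, str]]:
    """Flatten events into (company, post, person, direction) rows."""
    rows = []
    for event in events:
        for move in event.in_and_out:
            direction = "In" if move.new_status.startswith("IN") else "Out"
            rows.append((event.organization, event.post, move.person, direction))
    return rows


# Hand-entered values copied from the table on this slide (not system output).
events = [
    SuccessionEvent("Burns Fry", "Executive VP", [InAndOut("Donald Wright", "IN")]),
    SuccessionEvent("Burns Fry", "President", [InAndOut("Mark Kassirer", "OUT")]),
    SuccessionEvent("Merrill Lynch Canada", "President", [InAndOut("Donald Wright", "OUT")]),
]

for row in to_rows(events):
    print("{:<22}{:<14}{:<15}{}".format(*row))
```

The same rows could equally be loaded into a conventional database, which is the sense in which IE "populates a structured information repository".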


Key Features of Information Extraction

Texts are unrestricted NL, but typically short

Template is predefined and fixed

Information extracted is `literal' or `factual'

The precise definition of the task permits quantitative evaluation of IE systems' performance against human generated results


What IE is NOT: Information Retrieval

The Information Retrieval (IR) task: given a user query and a document collection retrieve that subset of documents from the collection which are relevant to the user's query.

E.g. given the query

exonuclease gamma-delta resolvase

return those abstracts in PubMed pertaining to these proteins

Once the IR system returns the documents, the user browses the selected documents in order to fulfil his or her information need.

Depending on the IR system, the user may be further assisted by

relevance ranking of retrieved documents

highlighting of search terms in the text to facilitate identifying passages of particular interest


Strengths and Weaknesses of IR

Strengths:
• Can search huge document collections very rapidly
• Insensitive to genre and domain of the texts
• Can rank documents with respect to likely relevance
• Searches can be iteratively refined

Weaknesses:
• Documents are returned, not information/answers, so user must further read texts to extract information
• Frequently not discriminating enough (“1563 documents match your request”)


Strengths and Weaknesses of IE

Strengths:

Extracts facts from texts, not just texts from text collections

Can feed other powerful applications (databases, indexing engines)

Weaknesses:

Porting to new genres and domains is time-consuming and requires expertise

Limited accuracy

Not fast enough to run over large text collections while user waits


A Brief History of IE

The first published work on information extraction (though it was not called this at the time) was in the late 1960s

A significant precursor was the psychologist Roger Schank’s work on scripts and story understanding in the 1970s

The 1980s saw the emergence of some commercial systems targeted at financial transactions and newswires

The big impetus to current research started in the late 1980s when DARPA initiated a series of competitive evaluations of “Message Understanding” systems (Message Understanding Conferences – MUC)

MUC ran for 10 years (1987-98) and significantly advanced the field

Currently there are a number of IE systems on the market and a large and on-going research effort in the field


Outline of Talk

Text Mining: A Definition and Brief History


Entity Extraction

Attribute Extraction

Relation Extraction

Event Extraction





IE Component Tasks

To fill templates IE researchers have discovered that systems must be able to perform a variety of simpler tasks

Studying and evaluating these component tasks in isolation has proved a useful way forward for IE

Component IE tasks which were specified as part of MUC:

Named Entity Recognition (persons, organisations, locations, dates)

Coreference (multiple references to same entity)

Template Elements (organisations, persons, artifacts, locations)

Template Relations (employee_of, product_of, location_of)

Scenario Template (management succession)


MUC Scoring and Scoring Metrics

Correct answers, called keys, are produced manually for all the MUC tasks.

Scoring of system results, called responses, against keys is done automatically.

At least some portion of the answer keys are multiply produced by different humans so that interannotator agreement figures can be computed.

Interannotator agreement figures of 95% are sought. Figures of less than 80% are interpreted as meaning the task is not sufficiently clearly defined.

Principal metrics are:
• Precision (how much of what your system returns is correct)
• Recall (how much of what is correct your system returns)
• F-measure (a weighted combination of precision and recall)
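
The three metrics can be sketched in a few lines of Python. This is a simplified illustration that counts only fully correct responses; the actual MUC scorer also deals with partial matches and slot alignment, which are not modelled here. The function names and the example counts are assumptions made for this sketch.

```python
def precision(correct: int, returned: int) -> float:
    """Fraction of the system's responses that are correct."""
    return correct / returned if returned else 0.0


def recall(correct: int, expected: int) -> float:
    """Fraction of the answer-key items that the system found."""
    return correct / expected if expected else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 weights them equally; beta > 1 favours recall,
    beta < 1 favours precision.
    """
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Invented example: the key holds 10 items, the system returned 8, and 6 are correct.
p = precision(correct=6, returned=8)   # 0.75
r = recall(correct=6, expected=10)     # 0.6
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```

With equal weighting the F-measure is never higher than the larger of the two inputs, so a system cannot score well by maximising only precision or only recall.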


State-of-the-art Evaluation Results (MUC-7)

Task               Recall  Precision  P & R
Named Entity       92      95         93.39
Coreference        56.1    68.8       61.8
Template Element   86      87         86.76
Template Relation  67      86         75.63
Scenario Template  42      65         50.79



Applying IE to Biological Science Journal Papers

IE is an appropriate technology when:
• large volumes of text make human analysis infeasible
• template-oriented information seeking is appropriate (stable information need, narrow domain)
• conventional IR is inadequate
• some error is tolerable

To date most IE applications are newswire-oriented, with the bulk being in the financial/competitor intelligence area

Bioinformatics applications provide an interesting challenge to IE:
• different text types -- journal papers (SGML/PDF), abstracts (BIDS, MEDLINE)
• different genre -- scientific writing
• different domain -- biochemistry/molecular biology


EMPathIE: Enzyme and Metabolic Pathways Information Extraction

Aim: Use IE techniques to create a database of enzyme and metabolic pathway data from academic journal papers to support drug discovery

Partners: Depts of Computer Science and Information Studies, U. of Sheffield; Glaxo-Wellcome Research; Elsevier Science

Sponsors: Glaxo-Wellcome Research; Elsevier Science

PostDoc: Dr. Kevin Humphreys

Status: Complete. Project ran 11/97 -- 11/99


EMPathIE: Scenario

metabolic processes involve biochemical reactions in which enzymes play key catalytic roles

each reaction involves an enzyme, some number of inputs and results in some number of products

sequences of such reactions form metabolic pathways

identifying pathways can suggest potential sites for the application of drugs to affect a particular end result

reactions are typically reported one per journal paper -- identifying pathways frequently requires combining information from several papers


EMPathIE: Text Sources

Project focused on 13 journal papers from FEMS Letters (Federation of European Microbiological Societies) and Biochimica et Biophysica Acta, from 1992-1995

Papers supplied by Elsevier Science and marked up according to their proprietary SGML DTD
• mark up reliable for bibliographical and text structure information
• typographical markup (e.g. italics for gene names) inconsistent and hence ignored


Sample EMPathIE Article

Federation of European Microbiological Societies

Isocitrate lyase activity in halophilic archaea

A. Oren and P. Gurevich, The Hebrew University of Jerusalem

Abstract:

Eight species of halophilic Archaea were tested for the presence of isocitrate lyase activity. High activities (up to 100 nmol –1 mg protein -1) were detected in Haloferax mediterranei and Haloferax volcanii when grown in medium containing acetate as the principal carbon source. Little activity was found in representatives of the genera Halobacterium and Haloarcula. Isocitrate lyase from Haloferax mediterranei required high potassium chloride concentrations, optimal activity being found at 1.5-3 M potassium chloride and pH 7.0. Replacement of potassium chloride by sodium chloride resulted in much lower activities. Sulfhydryl compounds (cysteine, glutathione) were not stimulatory. In other properties (stimulation by magnesium ions, sensitivity to different inhibitors) the enzyme resembled isocitrate lyases from representatives of the Bacteria and Eucarya.

Full Text:

…


EMPathIE Template Specification

<ENZYME> :=
    NAME: "NAME" +
    CODE: "EC_CODE" *
    WEIGHT: "WEIGHT" -
    SUBUNITS: "SUBUNITS" *

<ORGANISM> :=
    NAME: "NAME" +
    STRAIN: "STRAIN" *
    GENUS: "GENUS" -

<COMPOUND> :=
    NAME: "NAME" +
    SUPPLIER: "SUPPLIER" *

<SOURCE> :=
    ENZYME: <ENZYME> ^
    ORGANISM: <ORGANISM> ^

<PATHWAY> :=
    NAME: "NAME" +
    INTERACTION: <INTERACTION> +

<INTERACTION> :=
    ENZYME: <ENZYME> ^
    SOURCE: <SOURCE> -
    PARTICIPANT: <PARTICIPANT> *
    NON_PARTICIPANT: <NON_PARTICIPANT> *

<PARTICIPANT> :=
    COMPOUND: <COMPOUND> ^
    TYPE: {SUBSTRATE, PRODUCT, ACTIVATOR, COFACTOR, INHIBITOR, BUFFER} ^
    CONCENTRATION: "CONCENTRATION" -
    TEMPERATURE: "TEMPERATURE"
    ACIDITY: "ACIDITY" -


Filled EMPathIE Template

ENZYME-1
    Name: isocitrate lyase
    E.C. Code: 4.1.3.1

ORGANISM-1
    Name: Haloferax volcanii
    Strain: ATCC 29605
    Genus: halophilic Archaea

COMPOUND-1
    Name: glyoxylate phenylhydrazone

COMPOUND-2
    Name: KCl

SOURCE-1
    Enzyme: ENZYME-1
    Organism: ORGANISM-1

PATHWAY-1
    Name: glyoxylate cycle
    Interaction: INTERACTION-1

INTERACTION-1
    Enzyme: ENZYME-1
    Participants: PARTICIPANT-1, PARTICIPANT-2

PARTICIPANT-1
    Compound: COMPOUND-1
    Type: Product
    Temperature: 35C

PARTICIPANT-2
    Compound: COMPOUND-2
    Type: Activator
    Concentration: 1.75 M


PASTA: Protein Active Site Template Acquisition

Aim: Use IE techniques to create a database of protein active site data from academic journal papers and abstracts to support protein structure analysis

Partners: Depts of Computer Science, Information Studies, Molecular Biology and Biotechnology, U. of Sheffield

Sponsors: BBSRC-EPSRC BioInformatics Programme

PostDoc: Dr. George Demetriou

Status: Complete. Project ran 03/98 -- 03/01


PASTA: Scenario

Extract information concerning the roles of amino acids in protein molecules and create a database of protein active sites from both scientific journal abstracts and full articles

New protein structures are being reported at very high rates in the literature


PASTA: Scenario (cont)

Full evaluation of the results of protein structure comparisons often requires the investigation of extensive literature references
• E.g. to determine whether an amino acid has been reported as present in a particular region of a protein

Computational methods that can extract information directly from these articles would be very useful to biologists in comparison classification work and to those engaged in modelling studies


Sample PASTA Article (BIDS Abstract)

TI: The crystal structure of a triacylglycerol lipase from Pseudomonas cepacia reveals a highly open conformation in the absence of a bound inhibitor

AU: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW

NA: SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL 151742,SOUTH KOREA SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL 151742,SOUTH KOREA

JN: STRUCTURE, 1997, Vol.5, No.2, pp.173-185

IS: 0969-2126

AB: Background: …

Results: We have determined the crystal structure of a triacylglycerol lipase from Pseudomonas cepacia (Pet) in the absence of a bound inhibitor using X-ray crystallography. The structure shows the lipase to contain an alpha/beta-hydrolase fold and a catalytic triad comprising of residues Ser87, His286 and Asp264. The enzyme shares several structural features with homologous lipases from Pseudomonas glumae (PgL) and Chromobacterium viscosum (CvL), including a calcium-binding site. The present structure of Pet reveals a highly open conformation with a solvent-accessible active site. This is in contrast to the structures of PgL and Pet in which the active site is buried under a closed or partially opened 'lid', respectively.

Conclusions: …


(Partially) Filled PASTA Template

<TEMPLATE-str-1997-5-2-1> :=
    DOC_JR: "STRUCTURE, 1997, Vol.5, No.2, pp.173-185"
    DOC_AUTH: "Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW"
    DOC_IS: "0969-2126"

<RESIDUE-str-1997-5-2-1> :=
    RES_TYPE: SERINE
    RES_NO: "87"
    SITE/FUNCTION: "catalytic", "hydrolytic activity", "interfacial activation", "stereoselectivity", "calcium-binding site", "active-site"
    SEC_STRUCT: A-HELIX
    QUATERN_STRUCT: -
    REGION: 'lid'
    INTERACTION: -

<PROTEIN-str-1997-5-2-1> :=
    PRO_NAME: "Triacylglycerol lipase"
    PRO_SCOP_FAM: "Lipase"
    PDB_CODE: 1LGY

<SPECIES-str-1997-5-2-1> :=
    SPE_NAME: "Pseudomonas cepacia"
    SPE_NAME_TYPE: SCIENTIFIC

<IN_PROTEIN-str-1997-5-2-1> :=
    RESIDUE: <RESIDUE-str-1997-5-2-1>
    PROTEIN: <PROTEIN-str-1997-5-2-1>

<IN-SPECIES-str-1997-5-2-1> :=
    PROTEIN: <PROTEIN-str-1997-5-2-1>
    SPECIES: <SPECIES-str-1997-5-2-1>


Outcomes I: The PASTA System

System processes texts in four principal stages:
• text preprocessing performs text structure analysis and tokenisation
• lexical and terminological processing performs morphological analysis, multi-token matching against terminology lexicons, and small-scale parsing using terminology grammars
• parsing and semantic interpretation splits text into sentences, tags tokens with parts-of-speech, performs partial phrasal parsing and compositional semantic interpretation into a predicate-argument “logical form”
• discourse interpretation integrates each sentence's predicate-argument representation into a hierarchically structured semantic net encoding the system's domain model

A final stage generates template output as required.


PASTA System: Text Preprocessing

Text structure analysis
• Scientific articles typically have a rigid structure, including abstract, introduction, method and materials, results, and discussion sections.
• Certain sections can be targeted for detailed analysis while others can be skipped completely.
• Where articles are available in SGML with a DTD, an initial module is used to identify particular markup, specified in a configuration file, for use by subsequent modules.
• Where articles are in plain text, an initial `sectioniser' module is used to identify and classify significant sections using sets of regular expressions.

Tokenisation
• In addition to the normal white-space/punctuation delimited tokenisation required for newswires, scientific papers require further sophistication: NaCl, Tyr152
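
A regular-expression `sectioniser' of the kind described for plain-text articles might look roughly like the Python sketch below. The heading patterns, the four-word limit on heading lines, and the `front_matter` label are all assumptions made for illustration; the slides do not give PASTA's actual expressions, which would presumably be tuned to each journal.

```python
import re

# Hypothetical heading patterns, allowing an optional section number such as "1.".
NUM = r"^\s*(\d+\.?\s*)?"
SECTION_PATTERNS = [
    ("abstract", re.compile(NUM + r"abstract\b", re.IGNORECASE)),
    ("introduction", re.compile(NUM + r"introduction\b", re.IGNORECASE)),
    ("methods", re.compile(
        NUM + r"(materials?\s+and\s+methods?|methods?(\s+and\s+materials?)?)\b",
        re.IGNORECASE)),
    ("results", re.compile(NUM + r"results\b", re.IGNORECASE)),
    ("discussion", re.compile(NUM + r"discussion\b", re.IGNORECASE)),
]


def sectionise(text: str) -> dict:
    """Split plain text into labelled sections keyed by section type."""
    sections = {}
    current = "front_matter"  # everything before the first recognised heading
    for line in text.splitlines():
        label = next(
            (name for name, pattern in SECTION_PATTERNS if pattern.match(line)),
            None,
        )
        # Treat a line as a heading only if it is short; longer lines are body text.
        if label is not None and len(line.split()) <= 4:
            current = label
            sections.setdefault(current, [])
            continue
        sections.setdefault(current, []).append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


sample = (
    "Isocitrate lyase activity\n"
    "Abstract\n"
    "Eight species were tested.\n"
    "1. Introduction\n"
    "Some background.\n"
    "Materials and Methods\n"
    "Cells were grown.\n"
    "Results\n"
    "High activity."
)
for name, body in sectionise(sample).items():
    print(name, "->", body)
```

Downstream modules could then process only the sections of interest (for example `abstract` and `results`) and skip the rest.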


PASTA System: Lexical and Terminological processing

The main information sources used for terminology identification in the biochemical domain are:
• case-insensitive terminology lexicons (at present approximately 25,000 component terms in 52 categories -- see next slide)
• morphological cues, mainly standard biochemical suffixes
• hand-constructed grammar rules for each terminology class

For example, the enzyme name mannitol-1-phosphate 5-dehydrogenase would be recognised

1. by the classification of mannitol as a potential compound modifier and phosphate as a compound -- both matched in the terminology lexicon

2. by morphological analysis suggesting dehydrogenase as a potential enzyme head, due to its suffix –ase

3. by domain-specific grammar rules combining the enzyme head with a known compound and modifier which can play the role of enzyme modifier
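
The three steps above can be sketched as a toy recogniser in Python. The two-entry lexicon, the category labels (`compound_modifier`, `compound`, `enzyme_head`, `locant`) and the single grammar rule are simplifications assumed for this example; PASTA's real lexicons and terminology grammars are far larger and are not reproduced here.

```python
import re

# Toy case-insensitive lexicon (hypothetical entries for illustration only).
LEXICON = {
    "mannitol": "compound_modifier",
    "phosphate": "compound",
}

ENZYME_SUFFIX = re.compile(r"ase$", re.IGNORECASE)


def classify(token: str) -> str:
    """Tag one token using lexicon lookup first, then morphological cues."""
    key = token.lower()
    if key in LEXICON:
        return LEXICON[key]           # step 1: lexicon match
    if ENZYME_SUFFIX.search(key):
        return "enzyme_head"          # step 2: suffix -ase suggests an enzyme head
    if key.isdigit():
        return "locant"               # position numbers such as 1 or 5
    return "unknown"


def recognise_enzyme(name: str):
    """Step 3: a simplified grammar rule combining the tagged tokens.

    Accepts a name whose final token is an enzyme head, preceded by at
    least one known compound, with no unknown tokens.
    """
    tokens = [t for t in re.split(r"[-\s]+", name) if t]
    tags = [classify(t) for t in tokens]
    is_enzyme = (
        bool(tags)
        and tags[-1] == "enzyme_head"
        and "compound" in tags[:-1]
        and "unknown" not in tags
    )
    return is_enzyme, list(zip(tokens, tags))


print(recognise_enzyme("mannitol-1-phosphate 5-dehydrogenase"))
print(recognise_enzyme("database"))
```

The second call shows why the suffix cue alone is not enough: "database" ends in -ase and is tagged as a potential enzyme head, but the grammar rule rejects it because no known compound precedes it.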


Biochemical Terminological Lists

Principal Term Classes
• protein names (trypsin, lipase, etc.)
• amino acids (Glycine, Phe, etc.)
• gene names
• species (human, E.coli, etc.)
• secondary structure (alpha helix, beta sheet, etc.)
• supersecondary structure (coiled-coil alpha helix, etc.)
• quaternary structure (dimer, hexamer, etc.)
• regions (carboxy-terminal) and sites (glycosylation site, etc.)
• chains (butyl chain, catalytic chain, etc.)
• interactions (hydrogen bonds, contacts)
• bases (DNA, RNA)
• elements (N, Ca, NZ, etc.)
• non-protein entities (cofactors, substrates, etc.)
• measure terms (kcal, millimeter, joule, etc.)

Principal Terminology Resources
• Protein Data Bank
• Enzyme classification
• SCOP classification
• CATH classification
• IUPAC / IUBMB Nomenclature Recommendations


PASTA System: Parsing and Semantic Interpretation

The syntactic processing modules treat any terms recognised in the previous stage as non-decomposable units, with a syntactic role of proper noun.

As a consequence:
• The sentence splitting module cannot propose sentence boundaries within a preclassified term.
• The part-of-speech tagger only attempts to assign tags to tokens which are not part of proposed terms.
• The phrasal parser treats terms as preparsed noun phrases.

Parsing is carried out with a general phrasal (feature-based unification) grammar of English.

The phrasal grammar includes compositional semantic rules, which are used to construct a semantic representation of the “best”, possibly partial, parse of each sentence.

This predicate logic-like representation is passed on as input to the discourse interpretation stage.


PASTA System: Parsing and Semantic Interpretation (cont)

“This cleft contains the putative catalytic residue Glu132 above the core of the beta-barrel.”

Syntactic Analysis

[Parse tree figure. Node labels: S, VP, V NP PP, P NP PP, Det N. Phrases: “This cleft”, “contains”, “the putative catalytic residue”, “above”, “the core”, “of the beta-barrel”]

Semantic Analysis

contain(e1), cleft(e2), lsubj(e1,e2), det(e2,this),
residue(e3), lobj(e1,e3), name(e3,"Glu132"), adj(e3,putative), adj(e3,catalytic),
core(e4), above(e1,e4),
secondary_structure(e5), name(e5,"beta-barrel"), of(e4,e5)


PASTA System: Discourse Interpretation

The semantic representation of each sentence is added to a predefined domain model made up of an ontology, or concept hierarchy, and inheritable attributes and inference rules associated with concept nodes in the hierarchy.

The domain model is gradually populated with instances of concepts from the text to become a discourse model.

A powerful coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints.

Inference rules of particular instance types may then fire to hypothesise the existence of instances required to fill a template (e.g. an organism with a source_of relation to an enzyme).

The coreference mechanism will then attempt to resolve the hypothesised instances with actual instances from the text – making up for deficiencies in parsing.


PASTA System: Discourse Interpretation (cont)

1. The three-dimensional structure of Endo H has been determined …

2. A shallow curved cleft runs across the surface of the molecule from …

3. This cleft contains the putative catalytic residues Asp130 and Glu132 …

From 1, Endo H is identified as a protein – protein(e1), name(e1,"Endo H") – and added to the discourse model

From 2, the cleft is identified – cleft(e23) – and the molecule – molecule(e25)
• Ontology records that proteins are molecules and coreference resolves e25 and e1
• Domain model/ontology records that clefts are regions and that regions are located in proteins – a protein, say e42, is hypothesized and the relation located_in(e23,e42)
• In the absence of full semantic analysis of “runs across the surface of”, coreference picks the closest protein and resolves e42 with e1/e25 – i.e. the cleft is assumed to be in Endo H.

From 3, the analysis is as before – the cleft is identified as, say, e52, and the residue as e61
• Coreference resolves the cleft e52 with the preceding e23
• The domain model allows reasoning from “contains” to establish the relation located_in(e61,e23) – the residue is located in the cleft
• Transitivity of located_in permits the conclusion: located_in(e61,e1) – Glu132 is in Endo H
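
The chain of reasoning in this example (merge coreferent instances, then follow located_in transitively) can be sketched in Python as below. This is only an illustration of the inference pattern: the `DiscourseModel` class, its method names and the union-find representation are assumptions made for the sketch, not a description of how the PASTA discourse interpreter is implemented.

```python
class DiscourseModel:
    """Toy discourse model: coreference classes plus a located_in relation."""

    def __init__(self):
        self.parent = {}          # union-find forest over instance identifiers
        self.located_in = set()   # asserted (inner, outer) pairs

    def find(self, instance):
        """Return the representative of the instance's coreference class."""
        self.parent.setdefault(instance, instance)
        while self.parent[instance] != instance:
            # Path halving keeps the forest shallow.
            self.parent[instance] = self.parent[self.parent[instance]]
            instance = self.parent[instance]
        return instance

    def corefer(self, a, b):
        """Record that two instances denote the same entity."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def assert_located_in(self, inner, outer):
        self.located_in.add((inner, outer))

    def is_located_in(self, inner, outer):
        """Check located_in, modulo coreference and transitivity."""
        target = self.find(outer)
        frontier, seen = [self.find(inner)], set()
        while frontier:
            node = frontier.pop()
            if node in seen:
                continue
            seen.add(node)
            for a, b in self.located_in:
                if self.find(a) == node:
                    container = self.find(b)
                    if container == target:
                        return True
                    frontier.append(container)
        return False


model = DiscourseModel()
model.corefer("e1", "e25")             # the molecule (e25) is Endo H (e1)
model.assert_located_in("e23", "e42")  # cleft e23 lies in a hypothesised protein e42
model.corefer("e42", "e1")             # the hypothesised protein is resolved to Endo H
model.corefer("e52", "e23")            # "This cleft" is the earlier cleft
model.assert_located_in("e61", "e52")  # "contains": residue e61 lies in the cleft
print(model.is_located_in("e61", "e1"))
```

Because the query is answered over coreference classes, the residue is found to be in Endo H even though no single assertion links e61 and e1 directly.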


Outcomes II: Text Corpora

1500 BIDS abstracts from 24 molecular biology journals from 1994-98
• ASCII text
• ~250 words each
• structured keyword fields in header

300 full journal papers from Molecular Biology and Structure from 1994-1998
• from publishers' websites (HTML/ASCII)


Annotated Corpora

Annotated corpora are needed for system development and evaluation

For development, PASTA researchers at Sheffield manually prepared

52 terminology-tagged journal article abstracts for the term classes: protein, species, residue, site, region, secondary structure, super secondary structure, quaternary structure, chain, base, atom, non-protein, interactions (1376 term occurrences)

filled templates derived from 25 abstracts used for training

For final blind evaluation, independent domain experts prepared

62 terminology-tagged abstracts for the term classes

• 20 texts annotated by both annotators

• interannotator agreement is low (as assessed by MUC scorer)

filled templates from 30 abstracts

• 10 annotated by both annotators


Evaluation

System performance is measured against manually annotated corpora using the automatic scorer developed in the DARPA MUC evaluations

On development texts, terminology evaluation results:

Recall: 88% Precision: 94% P & R: 91%

In final blind evaluation, terminology evaluation results:

Recall: 82% Precision: 84% P & R: 83%

Template filling evaluation results:

Recall: 68% Precision: 71% P & R: 69%


Outcomes III: Browser-based Interface

Raw templates or texts annotated with identifiers for protein and residue names are not of much use to the working biologist

Most effective delivery platform is a Web browser

Therefore we have designed and implemented a browser-based interface to allow a user to browse the results
• This has the added benefit that links to source texts can easily be added, which can help to overcome the IE system’s errors




Outcomes IV: Active PASTA – The PASTA Daemon

PASTA is being integrated into a web-linked system that automatically, on a daily basis:
• retrieves texts related to protein structure from Medline
• runs the text through PASTA to extract protein/residue/active site information
• integrates the extracted information into previously extracted tables/indices
• publishes the results via the PASTA browser-interface on a web-server

Result will be a web site accessible by molecular biologists with PASTA-extracted information plus links back to Medline for confirmation/refutation


E-Science: MyGrid

MyGrid is an EPSRC-funded E-Science project involving:
• University of Manchester (Computer Science)
• EBI – Hinxton
• University of Southampton (Computer Science)
• University of Newcastle (Computer Science)
• University of Nottingham (Computer Science)
• University of Sheffield (Computer Science)

Aim: To build a virtual workbench to support the E-Biologist in performing in silico experiments involving transparent access to distributed
• Structured data resources (e.g. Swissprot, PDB)
• Textual data resources (e.g. Medline, On-line journals)
• Algorithms (e.g. Blast)
• Processing resources


E-Science: MyGrid (cont)

Sheffield will provide text-mining technology (EMPathIE, PASTA)

Current activities:
• Integrating UMLS into terminology processing components
• Integrating the Gene Ontology into PASTA discourse model (DAML+OIL)
• Acquiring Medline locally for terminology mining and indexing experiments
• Making aspects of PASTA available as a Web Service (via SOAP)


Conclusions + Future Work

EMPathIE and PASTA demonstrate the challenges encountered and the benefits gained by applying IE techniques to new areas

terminology is particularly critical/difficult in this area

Evaluation scores are not as high as for MUC-7, but

tasks are harder

training resources are much more limited

Future work includes:

Improved techniques for handling terminological variants

improved techniques to produce IE system resources automatically or semi-automatically: terminology lists, grammars, domain models/ontologies

richer domain modelling


THE END
