A Research Perspective on Text Mining: Tasks, Technologies and Prototype Applications
Robert Gaizauskas
Natural Language Processing Group
Departments of Computer Science,
University of Sheffield
September 4, 2002 Euromap Text Mining Seminar
Outline of Talk
Text Mining: Scenario, Definitions and Brief History
Text Mining Tasks + Methodologies
Text Mining Technologies
Text Mining Prototype Applications
Conclusions and Future Directions/Challenges
Text Mining: Scenario
Text Mining Scenario Components: Texts
Genres: newspapers, company reports, Web pages, scientific papers, legal documents
E-formats: Word documents (.doc, .rtf), PDF/PostScript, HTML/SGML/XML
Languages: English … French … Greek … Russian … Chinese … Hindi … Sanskrit … Linear B
Character encodings: ASCII, ISO 8859, Unicode
Text Mining Scenario Components: Users
User domain of interest:
Business: competitor intelligence, corporate intranet/memory
Scientists: access to literature
Military/police intelligence: open source intelligence, intranet
Journalists: news archives
User level of expertise: novice/expert
User linguistic competence: adult/child, native/non-native language speaker, uni/multi-lingual
Text Mining Scenario Components: Information Access Needs
Ad hoc searching:
Specific questions: “What year did the Berlin Wall come down?”
General background/context: “Tell me about Zakopane”
Stable intelligence gathering:
Scenario-related: “Build a database recording new projects in the energy sector: the players, location, energy type, start date, capitalisation”
Entity-related: “Build a database of key scientists in the pharma industry: name, employer, position, start and end dates”
Current awareness:
Alerting: “Let me know when any papers are published on the crystallographic structure of any lipase”
Document selection: “Assemble articles on drug approvals”
Summarisation:
Single/multi-document: “Summarise the Bulger trial”
Text Mining Scenario Components: Tools
Information retrieval
What is Information Extraction?
The Information Extraction (IE) task: from each text in a set of natural language texts extract information about predefined classes of entities and relationships and place this information into a template or database record.
E.g. from financial newswire stories identify those dealing with management succession events and from these extract details of organisations and persons, the post being assumed or vacated, the reason for vacancy, etc.
IE may also be described as the activity of populating a structured information repository (database) from an unstructured, or free text, information source.
What is Information Extraction? (cont)
The resulting structured database is then used for some other purpose:
searching or analysis using conventional database queries;
data-mining;
generating a summary (perhaps in another language);
constructing indices into/within/between the source texts.
Example: A Wall Street Journal Article
<DOC>
<DOCID> wsj94_008.0212 </DOCID>
<DOCNO> 940413-0062. </DOCNO>
<HL> Who's News:
@ Burns Fry Ltd. </HL>
<DD> 04/13/94 </DD>
<SO> WALL STREET JOURNAL (J), PAGE B10 </SO>
<CO> MER </CO>
<IN> SECURITIES (SCR) </IN>
<TXT>
<p>
BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.
</p>
</TXT>
</DOC>
Example: A Management Succession Event Template
<TEMPLATE> :=
    DOC_NR: "NUMBER" ^
    CONTENT: <SUCCESSION_EVENT> *
<SUCCESSION_EVENT> :=
    ORGANIZATION: <ORGANIZATION> ^
    POST: "POSITION TITLE" | "no title" ^
    IN_AND_OUT: <IN_AND_OUT> +
    VACANCY_REASON: {DEPART_WORKFORCE, REASSIGNMENT, NEW_POST_CREATED, OTH_UNK} ^
<IN_AND_OUT> :=
    PERSON: <PERSON> ^
    NEW_STATUS: {IN, IN_ACTING, OUT, OUT_ACTING} ^
    ON_THE_JOB: {YES, NO, UNCLEAR}
    OTHER_ORG: <ORGANIZATION> -
    REL_OTHER_ORG: {SAME_ORG, RELATED_ORG, OUTSIDE_ORG} -
<ORGANIZATION> :=
    ORG_NAME: "NAME" -
    ORG_ALIAS: "ALIAS" *
    ORG_DESCRIPTOR: "DESCRIPTOR" -
    ORG_TYPE: {GOVERNMENT, COMPANY, OTHER} ^
    ORG_LOCALE: LOCALE_STRING {{CITY, PROVINCE, COUNTRY, REGION, UNK} *
    ORG_COUNTRY: NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING *
<PERSON> :=
    PER_NAME: "NAME" -
    PER_ALIAS: "ALIAS" *
    PER_TITLE: "TITLE" *
Example: A (Partially) Filled Management Succession Event Template
<TEMPLATE-9404130062> :=
    DOC_NR: "9404130062"
    CONTENT: <SUCCESSION_EVENT-1>
<SUCCESSION_EVENT-1> :=
    SUCCESSION_ORG: <ORGANIZATION-1>
    POST: "executive vice president"
    IN_AND_OUT: <IN_AND_OUT-1> <IN_AND_OUT-2>
    VACANCY_REASON: OTH_UNK
<IN_AND_OUT-1> :=
    IO_PERSON: <PERSON-1>
    NEW_STATUS: OUT
    ON_THE_JOB: NO
<IN_AND_OUT-2> :=
    IO_PERSON: <PERSON-2>
    NEW_STATUS: IN
    ON_THE_JOB: NO
    OTHER_ORG: <ORGANIZATION-2>
    REL_OTHER_ORG: OUTSIDE_ORG
<ORGANIZATION-1> :=
    ORG_NAME: "Burns Fry Ltd."
    ORG_ALIAS: "Burns Fry"
    ORG_DESCRIPTOR: "this brokerage firm"
    ORG_TYPE: COMPANY
    ORG_LOCALE: Toronto CITY
    ORG_COUNTRY: Canada
<ORGANIZATION-2> :=
    ORG_NAME: "Merrill Lynch Canada Inc."
    ORG_ALIAS: "Merrill Lynch"
    ORG_DESCRIPTOR: "a unit of Merrill Lynch & Co."
    ORG_TYPE: COMPANY
<PERSON-1> :=
    PER_NAME: "Mark Kassirer"
<PERSON-2> :=
    PER_NAME: "Donald Wright"
    PER_ALIAS: "Wright"
    PER_TITLE: "Mr."
Example: Uses for Templates
From the completely filled version of the preceding template a natural language summary can be generated:
BURNS FRY Ltd. named Donald Wright as executive vice president.
Donald Wright resigned as president of Merrill Lynch Canada Inc.
Mark Kassirer left as president of BURNS FRY Ltd.
Or, a table can be constructed:
Company               Post          Person         Direction
Burns Fry             Executive VP  Donald Wright  In
Burns Fry             President     Mark Kassirer  Out
Merrill Lynch Canada  President     Donald Wright  Out
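As a rough illustration of these two uses, the Python sketch below (3.9+) turns rows like those in the table into summary sentences and a plain-text table. The `Row` record and the `summarise` and `as_table` helpers are hypothetical names introduced here for illustration; they are not taken from any MUC system.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One flattened succession fact (hypothetical record for illustration)."""
    company: str
    post: str
    person: str
    direction: str  # "In" or "Out"


ROWS = [
    Row("Burns Fry", "Executive VP", "Donald Wright", "In"),
    Row("Burns Fry", "President", "Mark Kassirer", "Out"),
    Row("Merrill Lynch Canada", "President", "Donald Wright", "Out"),
]


def summarise(row: Row) -> str:
    """Render one row as a one-sentence natural language summary."""
    if row.direction == "In":
        return f"{row.company} named {row.person} as {row.post}."
    return f"{row.person} left as {row.post} of {row.company}."


def as_table(rows: list[Row]) -> str:
    """Render the rows as a column-aligned plain-text table."""
    header = ("Company", "Post", "Person", "Direction")
    cells = [header] + [(r.company, r.post, r.person, r.direction) for r in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip()
        for line in cells
    )


if __name__ == "__main__":
    for row in ROWS:
        print(summarise(row))
    print(as_table(ROWS))
```

The point of the sketch is only that, once facts sit in a fixed record structure, both outputs are simple renderings of the same data.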
Key Features of Information Extraction
Texts are unrestricted NL, but typically short
Template is predefined and fixed
Information extracted is 'literal' or 'factual'
The precise definition of the task permits quantitative evaluation of IE systems' performance against human generated results
What IE is NOT: Information Retrieval
The Information Retrieval (IR) task: given a user query and a document collection retrieve that subset of documents from the collection which are relevant to the user's query.
E.g. given the query
exonuclease gamma-delta resolvase
return those abstracts in PubMed pertaining to these proteins
Once the IR system returns the documents, the user browses the selected documents in order to fulfil his or her information need.
Depending on the IR system, the user may be further assisted by
relevance ranking of retrieved documents
highlighting of search terms in the text to facilitate identifying passages of particular interest
Strengths and Weaknesses of IR
Strengths:
Can search huge document collections very rapidly
Insensitive to genre and domain of the texts
Can rank documents with respect to likely relevance
Searches can be iteratively refined
Weaknesses:
Documents are returned, not information/answers, so the user must further read texts to extract information
Frequently not discriminating enough (“1563 documents match your request”)
Strengths and Weaknesses of IE
Strengths:
Extracts facts from texts, not just texts from text collections
Can feed other powerful applications (databases, indexing engines)
Weaknesses:
Porting to new genres and domains is time-consuming and requires expertise
Limited accuracy
Not fast enough to run over large text collections while user waits
A Brief History of IE
The first published work on information extraction (though it was not called this at the time) appeared in the late 1960s
A significant precursor was the psychologist Roger Schank’s work on scripts and story understanding in the 1970s
The 1980s saw the emergence of some commercial systems targeted at financial transactions and newswires
The big impetus to current research came in the late 1980s, when DARPA initiated a series of competitive evaluations of “Message Understanding” systems (the Message Understanding Conferences, MUC)
MUC ran for 10 years (1987-98) and significantly advanced the field
Currently there are a number of IE systems on the market and a large, ongoing research effort in the field
Outline of Talk
Text Mining: A Definition and Brief History
Text Mining Tasks + Methodologies
Entity Extraction
Attribute Extraction
Relation Extraction
Event Extraction
Text Mining Technologies
Text Mining Prototype Applications
Conclusions and Future Directions/Challenges
IE Component Tasks
IE researchers have found that, to fill templates, systems must be able to perform a variety of simpler tasks
Studying and evaluating these component tasks in isolation has proved a useful way forward for IE
Component IE tasks which were specified as part of MUC:
Named Entity Recognition (persons, organisations, locations, dates)
Coreference (multiple references to same entity)
Template Elements (organisations, persons, artifacts, locations)
Template Relations (employee_of, product_of, location_of)
Scenario Template (management succession)
MUC Scoring and Scoring Metrics
Correct answers, called keys, are produced manually for all the MUC tasks.
Scoring of system results, called responses, against keys is done automatically.
At least some portion of the answer keys are multiply produced by different humans so that interannotator agreement figures can be computed.
Interannotator agreement figures of 95% are sought. Figures of less than 80% are interpreted as meaning the task is not sufficiently clearly defined.
Principal metrics are:
Precision (how much of what your system returns is correct)
Recall (how much of what is correct your system returns)
F-measure (a weighted combination of precision and recall)
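The three metrics can be sketched in a few lines of Python. This is a simplified illustration, not the MUC scorer itself: it assumes responses are simply right or wrong (no partial credit), and the function names and the `beta` weighting parameter are choices made here. With `beta = 1` the F-measure is the balanced harmonic mean of precision and recall.

```python
def precision(correct: int, returned: int) -> float:
    """Fraction of the system's responses that match the key."""
    return correct / returned if returned else 0.0


def recall(correct: int, expected: int) -> float:
    """Fraction of the key's answers that the system found."""
    return correct / expected if expected else 0.0


def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted combination of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 favours precision.
    """
    denominator = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 8 correct answers among 10 responses, 16 in the key.
    p = precision(correct=8, returned=10)
    r = recall(correct=8, expected=16)
    print(p, r, round(f_measure(p, r), 3))
```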
State-of-the-art Evaluation Results (MUC-7)
Task               Recall  Precision  P & R
Named Entity       92      95         93.39
Coreference        56.1    68.8       61.8
Template Element   86      87         86.76
Template Relation  67      86         75.63
Scenario Template  42      65         50.79
Applying IE to Biological Science Journal Papers
IE is an appropriate technology when:
large volumes of text make human analysis infeasible
template-oriented information seeking is appropriate (stable information need, narrow domain)
conventional IR is inadequate
some error is tolerable
To date most IE applications are newswire-oriented, with the bulk being in the financial/competitor intelligence area
Bioinformatics applications provide an interesting challenge to IE:
different text types: journal papers (SGML/PDF), abstracts (BIDS, MEDLINE)
different genre: scientific writing
different domain: biochemistry/molecular biology
EMPathIE: Enzyme and Metabolic Pathways Information Extraction
Aim: Use IE techniques to create a database of enzyme and metabolic pathway data from academic journal papers to support drug discovery
Partners: Depts of Computer Science and Information Studies, U. of Sheffield; Glaxo-Wellcome Research; Elsevier Science
Sponsors: Glaxo-Wellcome Research; Elsevier Science
PostDoc: Dr. Kevin Humphreys
Status: Complete. Project ran 11/97 -- 11/99
EMPathIE: Scenario
metabolic processes involve biochemical reactions in which enzymes play key catalytic roles
each reaction involves an enzyme, some number of inputs and results in some number of products
sequences of such reactions form metabolic pathways
identifying pathways can suggest potential sites for the application of drugs to affect a particular end result
reactions are typically reported one per journal paper, so identifying pathways frequently requires combining information from several papers
EMPathIE: Text Sources
Project focused on 13 journal papers from FEMS Letters (Federation of European Microbiological Societies) and Biochimica et Biophysica Acta, from 1992-1995
Papers supplied by Elsevier Science and marked up according to their proprietary SGML DTD
mark up reliable for bibliographical and text structure information
typographical markup (e.g. italics for gene names) inconsistent and hence ignored
Sample EMPathIE Article
Federation of European Microbiological Societies
Isocitrate lyase activity in halophilic archaea
A. Oren and P. Gurevich, The Hebrew University of Jerusalem
Abstract:
Eight species of halophilic Archaea were tested for the presence of isocitrate lyase activity. High activities (up to 100 nmol min^-1 mg protein^-1) were detected in Haloferax mediterranei and Haloferax volcanii when grown in medium containing acetate as the principal carbon source. Little activity was found in representatives of the genera Halobacterium and Haloarcula. Isocitrate lyase from Haloferax mediterranei required high potassium chloride concentrations, optimal activity being found at 1.5-3 M potassium chloride and pH 7.0. Replacement of potassium chloride by sodium chloride resulted in much lower activities. Sulfhydryl compounds (cysteine, glutathione) were not stimulatory. In other properties (stimulation by magnesium ions, sensitivity to different inhibitors) the enzyme resembled isocitrate lyases from representatives of the Bacteria and Eucarya.
Full Text:
…
EMPathIE Template Specification
<ENZYME> :=
    NAME: "NAME" +
    CODE: "EC_CODE" *
    WEIGHT: "WEIGHT" -
    SUBUNITS: "SUBUNITS" *
<ORGANISM> :=
    NAME: "NAME" +
    STRAIN: "STRAIN" *
    GENUS: "GENUS" -
<COMPOUND> :=
    NAME: "NAME" +
    SUPPLIER: "SUPPLIER" *
<SOURCE> :=
    ENZYME: <ENZYME> ^
    ORGANISM: <ORGANISM> ^
<PATHWAY> :=
    NAME: "NAME" +
    INTERACTION: <INTERACTION> +
<INTERACTION> :=
    ENZYME: <ENZYME> ^
    SOURCE: <SOURCE> -
    PARTICIPANT: <PARTICIPANT> *
    NON_PARTICIPANT: <NON_PARTICIPANT> *
<PARTICIPANT> :=
    COMPOUND: <COMPOUND> ^
    TYPE: {SUBSTRATE, PRODUCT, ACTIVATOR, COFACTOR, INHIBITOR, BUFFER} ^
    CONCENTRATION: "CONCENTRATION" -
    TEMPERATURE: "TEMPERATURE"
    ACIDITY: "ACIDITY" -
Filled EMPathIE Template
ENZYME-1
    Name: isocitrate lyase
    E.C. Code: 4.1.3.1
ORGANISM-1
    Name: Haloferax volcanii
    Strain: ATCC 29605
    Genus: halophilic Archaea
COMPOUND-1
    Name: glyoxylate phenylhydrazone
COMPOUND-2
    Name: KCl
SOURCE-1
    Enzyme: ENZYME-1
    Organism: ORGANISM-1
PATHWAY-1
    Name: glyoxylate cycle
    Interaction: INTERACTION-1
INTERACTION-1
    Enzyme: ENZYME-1
    Participants: PARTICIPANT-1 PARTICIPANT-2
PARTICIPANT-1
    Compound: COMPOUND-1
    Type: Product
    Temperature: 35C
PARTICIPANT-2
    Compound: COMPOUND-2
    Type: Activator
    Concentration: 1.75 M
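One way to picture such a filled template as a data structure is the Python sketch below (3.9+). It covers only a subset of the slots, and the class names, field names and the `compounds_by_type` helper are assumptions made for illustration; the slides do not describe how EMPathIE stores its templates internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Enzyme:
    name: str
    code: Optional[str] = None  # E.C. code, if reported


@dataclass
class Compound:
    name: str


@dataclass
class Participant:
    compound: Compound
    type: str  # e.g. "Substrate", "Product", "Activator"
    concentration: Optional[str] = None
    temperature: Optional[str] = None


@dataclass
class Interaction:
    enzyme: Enzyme
    participants: list[Participant] = field(default_factory=list)


@dataclass
class Pathway:
    name: str
    interactions: list[Interaction] = field(default_factory=list)


# The filled example, re-encoded by hand.
enzyme = Enzyme("isocitrate lyase", "4.1.3.1")
interaction = Interaction(enzyme, [
    Participant(Compound("glyoxylate phenylhydrazone"), "Product", temperature="35C"),
    Participant(Compound("KCl"), "Activator", concentration="1.75 M"),
])
pathway = Pathway("glyoxylate cycle", [interaction])


def compounds_by_type(p: Pathway, wanted: str) -> list[str]:
    """List the compounds playing a given role anywhere in the pathway."""
    return [
        part.compound.name
        for inter in p.interactions
        for part in inter.participants
        if part.type == wanted
    ]


if __name__ == "__main__":
    print(compounds_by_type(pathway, "Activator"))
```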
PASTA: Protein Active Site Template Acquisition
Aim: Use IE techniques to create a database of protein active site data from academic journal papers and abstracts to support protein structure analysis
Partners: Depts of Computer Science, Information Studies, Molecular Biology and Biotechnology, U. of Sheffield
Sponsors: BBSRC-EPSRC BioInformatics Programme
PostDoc: Dr. George Demetriou
Status: Complete. Project ran 03/98 -- 03/01
PASTA: Scenario
Extract information concerning the roles of amino acids in protein molecules and create a database of protein active sites from both scientific journal abstracts and full articles
New protein structures are being reported at very high rates in the literature
PASTA: Scenario (cont)
Full evaluation of the results of protein structure comparisons often requires the investigation of extensive literature references
E.g. to determine whether an amino acid has been reported as present in a particular region of a protein
Computational methods that can extract information directly from these articles would be very useful to biologists in comparison classification work and to those engaged in modelling studies
Sample PASTA Article (BIDS Abstract)
TI: The crystal structure of a triacylglycerol lipase from Pseudomonas cepacia reveals a highly open conformation in the absence of a bound inhibitor
AU: Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW
NA: SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL 151742,SOUTH KOREA SEOUL NATL UNIV,COLL NAT SCI,DEPT CHEM,SEOUL 151742,SOUTH KOREA
JN: STRUCTURE, 1997, Vol.5, No.2, pp.173-185
IS: 0969-2126
AB: Background: …
Results: We have determined the crystal structure of a triacylglycerol lipase from Pseudomonas cepacia (PcL) in the absence of a bound inhibitor using X-ray crystallography. The structure shows the lipase to contain an alpha/beta-hydrolase fold and a catalytic triad comprising of residues Ser87, His286 and Asp264. The enzyme shares several structural features with homologous lipases from Pseudomonas glumae (PgL) and Chromobacterium viscosum (CvL), including a calcium-binding site. The present structure of PcL reveals a highly open conformation with a solvent-accessible active site. This is in contrast to the structures of PgL and CvL in which the active site is buried under a closed or partially opened 'lid', respectively.
Conclusions: …
(Partially) Filled PASTA Template
<TEMPLATE-str-1997-5-2-1>:=
    DOC_JR: "STRUCTURE, 1997, Vol.5, No.2, pp.173-185"
    DOC_AUTH: "Kim_KK, Song_HK, Shin_DH, Hwang_KY, Suh_SW"
    DOC_IS: "0969-2126"
<RESIDUE-str-1997-5-2-1>:=
    RES_TYPE: SERINE
    RES_NO: "87"
    SITE/FUNCTION: "catalytic", "hydrolytic activity", "interfacial activation", "stereoselectivity", "calcium-binding site", "active-site"
    SEC_STRUCT: A-HELIX
    QUATERN_STRUCT: -
    REGION: 'lid'
    INTERACTION: -
<PROTEIN>:=
    PRO_NAME: "Triacylglycerol lipase"
    PRO_SCOP_FAM: "Lipase"
    PDB_CODE: 1LGY
<SPECIES-str-1997-5-2-1>:=
    SPE_NAME: "Pseudomonas cepacia"
    SPE_NAME_TYPE: SCIENTIFIC
<IN_PROTEIN-str-1997-5-2-1>:=
    RESIDUE: <RESIDUE-str-1997-5-2-1>
    PROTEIN: <PROTEIN-str-1997-5-2-1>
<IN-SPECIES-str-1997-5-2-1>:=
    PROTEIN: <PROTEIN-str-1997-5-2-1>
    SPECIES: <SPECIES-str-1997-5-2-1>
Outcomes I: The PASTA System
System processes texts in four principal stages:
text preprocessing performs text structure analysis and tokenisation
lexical and terminological processing performs morphological analysis, multi-token matching against terminology lexicons, and small-scale parsing using terminology grammars
parsing and semantic interpretation splits text into sentences, tags tokens with parts-of-speech, performs partial phrasal parsing and compositional semantic interpretation into a predicate-argument “logical form”
discourse interpretation integrates each sentence's predicate-argument representation into a hierarchically structured semantic net encoding the system's domain model
A final stage generates template output as required.
PASTA System: Text Preprocessing
Text structure analysis:
Scientific articles typically have a rigid structure, including abstract, introduction, method and materials, results, and discussion sections.
Certain sections can be targeted for detailed analysis while others can be skipped completely.
Where articles are available in SGML with a DTD, an initial module is used to identify particular markup, specified in a configuration file, for use by subsequent modules.
Where articles are in plain text, an initial 'sectioniser' module is used to identify and classify significant sections using sets of regular expressions.
Tokenisation:
In addition to the normal white-space/punctuation delimited tokenisation required for newswires, scientific papers require further sophistication, e.g. for NaCl, Tyr152
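A regular-expression sectioniser of the kind described can be sketched as follows in Python (3.9+). The heading patterns, the section labels and the `sectionise` function are assumptions made for illustration, not PASTA's actual rule sets. The small `split_residue` helper shows one possible reading of the extra tokenisation needed for a token such as Tyr152, namely separating the residue type from its sequence number; the slides do not say how PASTA handles this.

```python
import re
from typing import Optional


def heading(names: str) -> re.Pattern:
    """Build a pattern for a line holding only an (optionally numbered) heading."""
    return re.compile(rf"^\s*(?:\d+\.?\s*)?(?:{names})\s*:?\s*$", re.IGNORECASE)


# Hypothetical heading patterns; a real system would need many more variants.
SECTION_PATTERNS = [
    ("abstract", heading(r"abstract|summary")),
    ("introduction", heading(r"introduction")),
    ("methods", heading(r"materials?\s+and\s+methods?|methods?\s+and\s+materials?|methods?")),
    ("results", heading(r"results(?:\s+and\s+discussion)?")),
    ("discussion", heading(r"discussion")),
]


def sectionise(text: str) -> dict[str, list[str]]:
    """Assign each non-heading line to the most recently seen section."""
    sections: dict[str, list[str]] = {}
    current = "front_matter"
    for line in text.splitlines():
        for label, pattern in SECTION_PATTERNS:
            if pattern.match(line):
                current = label
                break
        else:  # no heading matched: the line is body text
            sections.setdefault(current, []).append(line)
    return sections


RESIDUE = re.compile(r"([A-Z][a-z]{2})(\d+)")


def split_residue(token: str) -> Optional[tuple[str, int]]:
    """Split a token like 'Tyr152' into ('Tyr', 152); return None otherwise."""
    match = RESIDUE.fullmatch(token)
    return (match.group(1), int(match.group(2))) if match else None
```

Because headings must occupy a whole line, a body sentence that merely begins with "Results" is not mistaken for a section boundary. Sections that should be skipped can then simply be dropped from the returned dictionary.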
PASTA System: Lexical and Terminological processing
The main information sources used for terminology identification in the biochemical domain are:
case-insensitive terminology lexicons (at present approximately 25,000 component terms in 52 categories -- see next slide)
morphological cues, mainly standard biochemical suffixes
hand-constructed grammar rules for each terminology class
For example, the enzyme name mannitol-1-phosphate 5-dehydrogenase would be recognised
1. by the classification of mannitol as a potential compound modifier and phosphate as a compound -- both matched in the terminology lexicon
2. by morphological analysis suggesting dehydrogenase as a potential enzyme head, due to its suffix -ase
3. by domain-specific grammar rules combining the enzyme head with a known compound and modifier which can play the role of enzyme modifier
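The three steps can be mimicked in a toy Python sketch. Everything here is a simplification made for illustration: the two-entry `LEXICON`, the single `-ase` suffix test, the "locant" label for position numbers and the one grammar rule in `is_enzyme_name` stand in for the much larger lexicons and hand-built grammars the slides describe.

```python
import re

# Step 1: a tiny stand-in for the terminology lexicon (matched case-insensitively).
LEXICON = {
    "mannitol": "compound_modifier",
    "phosphate": "compound",
}


def classify(token: str) -> str:
    """Label a component token using the lexicon, then morphological cues."""
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    if lowered.isdigit():
        return "locant"  # position numbers such as the 1 and 5 in the example
    # Step 2: the suffix -ase suggests a potential enzyme head.
    if lowered.endswith("ase"):
        return "enzyme_head"
    return "unknown"


def is_enzyme_name(term: str) -> bool:
    """Step 3: one grammar rule, modifiers containing a compound + enzyme head."""
    tokens = [t for t in re.split(r"[-\s]+", term) if t]
    if not tokens:
        return False
    *modifiers, head = [classify(t) for t in tokens]
    allowed = {"compound", "compound_modifier", "locant"}
    return (
        head == "enzyme_head"
        and "compound" in modifiers
        and all(label in allowed for label in modifiers)
    )


if __name__ == "__main__":
    print(is_enzyme_name("mannitol-1-phosphate 5-dehydrogenase"))
```

A bare suffix test like this over-generates (the ordinary word "base" also ends in -ase), which is one reason the suffix only proposes a potential head and the grammar rule has to confirm it.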
Biochemical Terminological Lists
Principal Term Classes:
protein names (trypsin, lipase, etc.)
amino acids (Glycine, Phe, etc.)
gene names
species (human, E. coli, etc.)
secondary structure (alpha helix, beta sheet, etc.)
supersecondary structure (coiled-coil alpha helix, etc.)
quaternary structure (dimer, hexamer, etc.)
regions (carboxy-terminal) and sites (glycosylation site, etc.)
chains (butyl chain, catalytic chain, etc.)
interactions (hydrogen bonds, contacts)
bases (DNA, RNA)
elements (N, Ca, NZ, etc.)
non-protein entities (cofactors, substrates, etc.)
measure terms (kcal, millimeter, joule, etc.)
Principal Terminology Resources:
Protein Data Bank
Enzyme classification
SCOP classification
CATH classification
IUPAC / IUBMB Nomenclature Recommendations
PASTA System: Parsing and Semantic Interpretation
The syntactic processing modules treat any terms recognised in the previous stage as non-decomposable units, with a syntactic role of proper noun.
As a consequence:
The sentence splitting module cannot propose sentence boundaries within a preclassified term.
The part-of-speech tagger only attempts to assign tags to tokens which are not part of proposed terms.
The phrasal parser treats terms as preparsed noun phrases.
Parsing is carried out with a general phrasal (feature-based unification) grammar of English.
The phrasal grammar includes compositional semantic rules, which are used to construct a semantic representation of the “best”, possibly partial, parse of each sentence.
This predicate logic-like representation is passed on as input to the discourse interpretation stage.
PASTA System: Parsing and Semantic Interpretation (cont)
“This cleft contains the putative catalytic residue Glu132 above the core of the beta-barrel.”
Syntactic Analysis
[Figure: parse tree of the sentence. S branches into the subject "This cleft" (Det N) and a VP made up of the verb "contains", the NP "the putative catalytic residue" and the PP "above the core of the beta-barrel".]
Semantic Analysis
contain(e1), cleft(e2), lsubj(e1,e2), det(e2,this),
residue(e3), lobj(e1,e3), name(e3,"Glu132"), adj(e3,putative), adj(e3,catalytic),
core(e4), above(e1,e4),
secondary_structure(e5), name(e5,"beta-barrel"), of(e4,e5)
PASTA System: Discourse Interpretation
The semantic representation of each sentence is added to a predefined domain model made up of an ontology, or concept hierarchy, and inheritable attributes and inference rules associated with concept nodes in the hierarchy
The domain model is gradually populated with instances of concepts from the text to become a discourse model
A powerful coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints.
Inference rules of particular instance types may then fire to hypothesise the existence of instances required to fill a template (e.g. an organism with a source_of relation to an enzyme).
The coreference mechanism will then attempt to resolve the hypothesised instances with actual instances from the text – making up for deficiencies in parsing.
PASTA System: Discourse Interpretation (cont)
1. The three-dimensional structure of Endo H has been determined …
2. A shallow curved cleft runs across the surface of the molecule from …
3. This cleft contains the putative catalytic residues Asp130 and Glu132 …
From 1, Endo H is identified as a protein – protein(e1), name(e1,"Endo H") – and added to the discourse model
From 2, the cleft is identified – cleft(e23) – and the molecule – molecule(e25)
Ontology records that proteins are molecules and coreference resolves e25 and e1
Domain model/ontology records that clefts are regions and that regions are located in proteins – a protein, say e42, is hypothesized along with the relation located_in(e23,e42)
In the absence of full semantic analysis of “runs across the surface of”, coreference picks the closest protein and resolves e42 with e1/e25 – i.e. the cleft is assumed to be in Endo H.
From 3, the analysis is as before – the cleft is identified as, say, e52, and the residue as e61
coreference resolves the cleft e52 with the preceding e23
The domain model allows reasoning from “contains” to establish the relation located_in(e61,e23) – the residue is located in the cleft
Transitivity of located_in permits the conclusion: located_in(e61,e1) – Glu132 is in Endo H
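The bookkeeping in this walk-through can be reproduced with a very small Python sketch. It is only an illustration under simplifying assumptions: coreference is modelled as merging instance identifiers (a bare union-find without any syntactic or semantic constraints), and located_in is stored as a set of pairs that is searched transitively. The `DiscourseModel` class and its method names are hypothetical, not PASTA's implementation.

```python
class DiscourseModel:
    """Toy discourse model: coreference merging plus a transitive located_in."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}           # coreference links
        self.located_in: set[tuple[str, str]] = set()

    def find(self, instance: str) -> str:
        """Return the representative of an instance's coreference class."""
        self.parent.setdefault(instance, instance)
        while self.parent[instance] != instance:
            instance = self.parent[instance]
        return instance

    def merge(self, newer: str, older: str) -> None:
        """Resolve a newly introduced instance with an existing one."""
        self.parent[self.find(newer)] = self.find(older)

    def assert_located_in(self, inner: str, outer: str) -> None:
        self.located_in.add((inner, outer))

    def is_located_in(self, inner: str, outer: str) -> bool:
        """Follow located_in links transitively, modulo coreference."""
        target = self.find(outer)
        frontier = [self.find(inner)]
        seen: set[str] = set()
        while frontier:
            node = frontier.pop()
            if node in seen:
                continue
            seen.add(node)
            for a, b in self.located_in:
                if self.find(a) == node:
                    container = self.find(b)
                    if container == target:
                        return True
                    frontier.append(container)
        return False


model = DiscourseModel()
model.merge("e25", "e1")                # the molecule is Endo H
model.assert_located_in("e23", "e42")   # cleft is in a hypothesised protein
model.merge("e42", "e1")                # hypothesised protein resolved to Endo H
model.merge("e52", "e23")               # "This cleft" is the earlier cleft
model.assert_located_in("e61", "e23")   # residue is in the cleft

if __name__ == "__main__":
    print(model.is_located_in("e61", "e1"))
```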
Outcomes II: Text Corpora
1500 BIDS abstracts:
from 24 molecular biology journals from 1994-98
ASCII text
~250 words each
structured keyword fields in header
300 full journal papers:
from Molecular Biology and Structure from 1994-1998
from publishers' websites (HTML/ASCII)
Annotated Corpora
Annotated corpora are needed for system development and evaluation
For development, PASTA researchers at Sheffield manually prepared
52 terminology-tagged journal article abstracts for the term classes: protein, species, residue, site, region, secondary structure, super secondary structure, quaternary structure, chain, base, atom, non-protein, interactions (1376 term occurrences)
filled templates derived from 25 abstracts used for training
For final blind evaluation, independent domain experts prepared
62 terminology-tagged abstracts for the term classes
• 20 texts annotated by both annotators
• interannotator agreement is low (as assessed by MUC scorer)
filled templates from 30 abstracts
• 10 annotated by both annotators
Evaluation
System performance is measured against manually annotated corpora using the automatic scorer developed in the DARPA MUC evaluations
On development texts terminology evaluation results:
Recall: 88% Precision: 94% P & R: 91%
In final blind evaluation terminology evaluation results:
Recall: 82% Precision: 84% P & R: 83%
Template filling evaluation results:
Recall: 68% Precision: 71% P & R: 69%
Outcomes III: Browser-based Interface
Raw templates or texts annotated with identifiers for protein and residue names are not of much use to the working biologist
Most effective delivery platform is a Web browser
Therefore we have designed and implemented a browser-based interface to allow a user to browse the results
This has the added benefit that links to source texts can easily be added, which can help to overcome the IE system’s errors
Outcomes IV: Active PASTA – The PASTA Daemon
PASTA is being integrated into a web-linked system that automatically, on a daily basis:
retrieves texts related to protein structure from Medline
runs the texts through PASTA to extract protein/residue/active site information
integrates the extracted information into previously extracted tables/indices
publishes the results via the PASTA browser interface on a web server
Result will be a web site accessible by molecular biologists with PASTA-extracted information plus links back to Medline for confirmation/refutation
E-Science: MyGrid
MyGrid is an EPSRC-funded E-Science project involving:
University of Manchester (Computer Science)
EBI – Hinxton
University of Southampton (Computer Science)
University of Newcastle (Computer Science)
University of Nottingham (Computer Science)
University of Sheffield (Computer Science)
Aim: To build a virtual workbench to support the E-Biologist in performing in silico experiments involving transparent access to distributed:
Structured data resources (e.g. Swissprot, PDB)
Textual data resources (e.g. Medline, On-line journals)
Algorithms (e.g. Blast)
Processing resources
E-Science: MyGrid (cont)
Sheffield will provide text-mining technology (EMPathIE, PASTA)
Current activities:
Integrating UMLS into terminology processing components
Integrating the Gene Ontology into PASTA discourse model (DAML+OIL)
Acquiring Medline locally for terminology mining and indexing experiments
Making aspects of PASTA available as a Web Service (via SOAP)
Conclusions + Future Work
EMPathIE and PASTA demonstrate the challenges encountered and the benefits gained by applying IE techniques to new areas
terminology is particularly critical/difficult in this area
Evaluation scores are not as high as for MUC-7, but
tasks are harder
training resources are much more limited
Future work includes:
Improved techniques for handling terminological variants
improved techniques to produce IE system resources automatically or semi-automatically: terminology lists, grammars, domain models/ontologies
richer domain modelling
THE END