Using ontologies for text processing
Lawrence Hunter & K. Bretonnel Cohen
Center for Computational Pharmacology
UCHSC School of Medicine
http://[email protected]
Overview
Thesis: Ontologies (or even more elaborated knowledge-bases) are required to solve the lexical ambiguity problem
Describe the lexical ambiguity problem and its central importance in natural language processing
Demonstrate how GO, combined with Direct Memory Access Parsing, provides a simple solution to some instances of this problem
Argue no alternative is likely to work as well
Lexical Ambiguity
A word (character string) means different things in different contexts
– How can a program disambiguate (tell which is meant)?
Widespread problem even in “simple” bioNLP
– DNA vs. mRNA vs. protein [Hatzivassiloglou et al. 2001]
– Gene symbol vs. non-gene acronym [Pustejovsky et al. 2001], [Chang et al. 2002], [Liu and Friedman 2003], [Schwartz and Hearst 2003]
– Gene/product vs. any other noun [Tanabe and Wilbur, 2002]
A particular example
“Hunk” can be a
– Cell type: human natural killer
– Gene: hormonally upregulated Neu-associated kinase
– Medical abbreviation: radiographic/orthopedic joint classification system
– Non-technical English: a large lump, piece, or portion
All occur in Medline documents… (e.g. “hunk of metal” in an article on ambulance design)
How do ontologies help?
The idea that knowledge is relevant to understanding words in context is controversial only among linguists, but…
The Direct Memory Access Parsing (DMAP) technique [Martin, 1991] [Fitzgerald, 2000] demonstrates the power of knowledge-based methods for disambiguation
GO & similar efforts make DMAP (or other knowledge-based methods) practical today
What is DMAP?
Conceptual parser
– Maps from text to conceptual representations organized in packaging and abstraction hierarchies (like GO)
– In contrast to: pure syntactic parsers, pattern matching and machine learning systems
Conceptual representations include lexical patterns that specify how to recognize the concept in text
– Patterns consist of text literals and/or references to other concepts
– Organized around concepts, not words; no independent lexicon.
Recognition creates expectations for related concepts
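One way to picture "concepts that carry their own lexical patterns" is the small Python sketch below. It is not the DMAP implementation; the class names `Concept` and `Ref`, the helper `is_a`, and the tiny knowledge base are all hypothetical, chosen only to show patterns built from text literals and references to other concepts, stored on the concept rather than in a separate lexicon.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(frozen=True)
class Ref:
    """Pattern element that refers to another concept instead of literal text."""
    concept_id: str


# A lexical pattern is a sequence of text literals and/or concept references.
Pattern = tuple[Union[str, Ref], ...]


@dataclass
class Concept:
    concept_id: str
    is_a: Optional[str] = None                      # parent in the abstraction hierarchy
    patterns: list[Pattern] = field(default_factory=list)


def is_a(kb: dict[str, Concept], concept_id: str, ancestor_id: str) -> bool:
    """Walk up the abstraction hierarchy from concept_id looking for ancestor_id."""
    current: Optional[str] = concept_id
    while current is not None:
        if current == ancestor_id:
            return True
        current = kb[current].is_a if current in kb else None
    return False


# Hypothetical miniature knowledge base (identifiers are illustrative).
kb = {c.concept_id: c for c in [
    Concept("gene"),
    Concept("gene-26559", is_a="gene", patterns=[("hunk",)]),
    Concept("expression", patterns=[("expression",), ("transcription",)]),
    # A pattern made only of references: any gene followed by any expression.
    Concept("gene-expression", patterns=[(Ref("gene"), Ref("expression"))]),
]}
```

Because the `gene-expression` pattern refers to the class `gene` rather than to a word, any concept below `gene` in the hierarchy can fill it, which is where the abstraction hierarchy pays off.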
A real example
ID: cell-type-HUNK
IS-A: cell-type
lex: human natural killer
     HUNK
RESULTS
ID: gene-26559
IS-A: gene
lex: hormonally upregulated Neu-associated kinase
     HUNK
     hormonally upregulated neu tumor-associated kinase
ID: GO-0006350
lex: transcription
     expression
ID: gene-expression
slots: expressed-item: gene
       mechanism: expression
lex: (gene) (expression)
“…Hunk expression is restricted to subsets of cells…” [Gardner et al. 2000]
DMAP output with and without context
Hunk alone: ambiguous
(parse '(Hunk))
e-gene-26559 begin: 1 end: 1
e-cell-type-HUNK begin: 1 end: 1
Hunk expression: not ambiguous
(parse '(Hunk expression))
c-gene-expression-1 begin: 1 end: 2
  expressed-item: e-gene-26559 begin: 1 end: 1
  mechanism: GO:0006350 begin: 2 end: 2
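The effect of context can be imitated with a very small bottom-up matcher. The sketch below is a deliberate simplification and not DMAP's expectation-driven algorithm: it only handles two-slot composite patterns over adjacent tokens, and the tables `IS_A`, `LEXICON`, and `COMPOSITES` are hypothetical stand-ins for the frames in the example (with an assumed class named `expression` above GO:0006350).

```python
# Hypothetical miniature knowledge base, loosely following the Hunk frames.
IS_A = {
    "cell-type-HUNK": "cell-type",
    "gene-26559": "gene",
    "GO:0006350": "expression",   # assumed parent class, for illustration only
}

# Literal lexical patterns: token -> concepts it can denote.
LEXICON = {
    "hunk": ["gene-26559", "cell-type-HUNK"],
    "expression": ["GO:0006350"],
    "transcription": ["GO:0006350"],
}

# Composite patterns: concept -> ordered (slot name, required concept class).
COMPOSITES = {
    "gene-expression": [("expressed-item", "gene"), ("mechanism", "expression")],
}


def is_a(concept, ancestor):
    """Walk up the abstraction hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def parse(tokens):
    """Return interpretations as (concept, begin, end, slots), positions 1-based."""
    # Step 1: recognise literals.
    words = [(c, i, i, {}) for i, t in enumerate(tokens, start=1)
             for c in LEXICON.get(t.lower(), [])]
    # Step 2: fill two-slot composite patterns from adjacent recognised concepts.
    composites = []
    for concept, slots in COMPOSITES.items():
        (slot1, class1), (slot2, class2) = slots
        for c1, b1, e1, _ in words:
            for c2, b2, e2, _ in words:
                if b2 == e1 + 1 and is_a(c1, class1) and is_a(c2, class2):
                    composites.append((concept, b1, e2, {slot1: c1, slot2: c2}))
    # Step 3: a composite that covers a word supersedes that word's other readings.
    covered = {i for _, b, e, _ in composites for i in range(b, e + 1)}
    return composites + [w for w in words if w[1] not in covered]


alone = parse(["Hunk"])
in_context = parse(["Hunk", "expression"])
```

Here `alone` keeps both readings (gene and cell type), while `in_context` contains a single `gene-expression` interpretation whose `expressed-item` slot is `gene-26559`: the cell-type reading drops out because it cannot fill a slot that requires a gene.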
DMAP can handle much more complex constructions
“Hunk is expressed in mouse epithelial cells during cell proliferation.”
c-localized-gene-expression
  expressed-item: e-gene-26559
  mechanism: GO:0006350
  where: c-epithelial-cell
  taxon: ncbi_10090
  when: GO:0008283
But uses our enriched knowledge-base, not just GO
Even just DMAP/GO is a big win
Recall 7,042 ambiguous symbols for 9,723 genes
Straightforward to disambiguate symbols that map to 2 or more genes when:
– Each ambiguous gene referent has GO annotations, and
– There is no overlap between the annotations for the genes
3,333 of the symbols (for 4,715 of the genes) have this feature – nearly half the problem is solved!
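The two conditions can be read as a simple set test, sketched below in Python. Interpreting "no overlap" as pairwise-disjoint annotation sets is an assumption, and the function name `straightforward`, the gene names, and the annotation table are made up for illustration. The last two lines just redo the arithmetic behind "nearly half" using the counts given above.

```python
from itertools import combinations


def straightforward(genes, annotations):
    """True if every candidate gene has GO annotations and no two genes share one."""
    sets = [annotations.get(g, set()) for g in genes]
    if len(sets) < 2 or not all(sets):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


# Made-up annotation table (gene -> set of GO term ids), for illustration only.
annotations = {
    "geneA": {"GO:0006350"},
    "geneB": {"GO:0016477", "GO:0008283"},
    "geneC": {"GO:0008283"},
    "geneD": set(),
}

disjoint_case = straightforward(["geneA", "geneB"], annotations)     # annotated, no overlap
overlap_case = straightforward(["geneB", "geneC"], annotations)      # share GO:0008283
unannotated_case = straightforward(["geneA", "geneD"], annotations)  # geneD has none

# Shares implied by the counts on the slide.
symbol_share = 3333 / 7042
gene_share = 4715 / 9723
print(f"{symbol_share:.1%} of symbols, {gene_share:.1%} of genes")
```

Only the first case passes; the other two fail on overlap and on missing annotations respectively, which are exactly the situations where GO alone gives no handle on the ambiguity.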
Compare the alternatives
Statistical or machine learning approaches
– Must avoid being fooled by word “cells” in example
– Scalability: need statistics for many covariates of every ambiguous word; doesn’t exploit the abstraction hierarchy
Full syntactic parse doesn’t disambiguate at all!
Cascaded FST’s, pattern-matching, etc.
– Where is source of knowledge for these?
– Much DMAP lexical information can be taken directly from GO (and LocusLink, etc.)
Acknowledgments
Philip V. Ogren
Daniel J. McGoldrick
Christoffer S. Crosby
Jens Eberlein
George K. Acquaah-Mensah
I/NET’s (http://inetmi.com) CM / CMP software
Support from Wyeth Genetics Institute, NIAAA
http://compbio.uchsc.edu
Biognosticopoea representation of the hunk gene
Attachment ambiguity
– These findings suggest that FAK functions in the regulation of cell migration and cell proliferation. (Gilmore and Romer 1996:1209)
– What does FAK do?
• ALMOST RIGHT:
  • FAK functions in the regulation of cell migration
  • FAK functions in cell proliferation
• RIGHT:
  • FAK functions in the regulation of cell migration
  • FAK functions in the regulation of cell proliferation
Attachment ambiguity
GO-0016477 isA go-process
  lex: cell migration
GO-0008283 isA go-process
  lex: cell proliferation
GO-0042127 isA go-process
  lex: regulation of cell proliferation
       regulation of ((go-process) and)* cell proliferation
GO-0030334
  lex: regulation of cell migration
       regulation of ((go-process) and)* cell migration
Attachment ambiguity
(parse '(These findings suggest that FAK functions in the regulation of cell migration and cell proliferation))
GO:0030334
  begin: 9 end: 12
GO:0042127
  begin: 9 end: 15
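The pattern `regulation of ((go-process) and)* cell proliferation` can be tried out with a few lines of Python. This is only a sketch of that one pattern shape, not DMAP itself: the function names `starts_with`, `match_regulation`, and `parse` are hypothetical, and the two tables hold just the lex entries shown on the previous slide.

```python
# go-process concepts and their literal lex entries (tokenised).
GO_PROCESS = {
    "GO:0016477": ["cell", "migration"],
    "GO:0008283": ["cell", "proliferation"],
}

# Concepts whose pattern is: regulation of ((go-process) and)* <target phrase>
REGULATION = {
    "GO:0030334": ["cell", "migration"],
    "GO:0042127": ["cell", "proliferation"],
}


def starts_with(tokens, i, phrase):
    """True if phrase occurs in tokens starting at 0-based index i."""
    return tokens[i:i + len(phrase)] == phrase


def match_regulation(tokens, start, target):
    """Match the pattern at 0-based index start; return inclusive 0-based end or None."""
    if not starts_with(tokens, start, ["regulation", "of"]):
        return None
    i = start + 2
    while True:
        if starts_with(tokens, i, target):
            return i + len(target) - 1
        # Otherwise try to skip one "(go-process) and" group and look again.
        for phrase in GO_PROCESS.values():
            if starts_with(tokens, i, phrase) and starts_with(tokens, i + len(phrase), ["and"]):
                i += len(phrase) + 1
                break
        else:
            return None


def parse(tokens):
    """Return (concept, begin, end) with 1-based token positions, as on the slides."""
    results = []
    for start in range(len(tokens)):
        for concept, target in REGULATION.items():
            end = match_regulation(tokens, start, target)
            if end is not None:
                results.append((concept, start + 1, end + 1))
    return results


sentence = ("These findings suggest that FAK functions in the regulation "
            "of cell migration and cell proliferation")
tokens = sentence.lower().split()
result = parse(tokens)
```

For this sentence `result` holds GO:0030334 over tokens 9 to 12 and GO:0042127 over tokens 9 to 15, so both "regulation of" readings are recovered and the bare "cell proliferation" attachment is not the only one on offer.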
What do we have so far?
Gene Ontology
UMLS
MeSH
…
What more do we need?
Family
Location
– Macroanatomical
– Subcellular localization
Structure
Function
– Disease associations
– Protein/protein interactions
– …
Where can we get it?
GO definitions
UMLS definitions
MeSH notes
Biomedical literature
If you don’t like DMAP….
full syntactic parse first
cascaded FST’s
“a little syntax, a little semantics”
machine learning
pattern-matching
All can benefit from ontology/KB