1
OBIE – Ontology-based Information Extraction
An Approach to Extract and Deal with
Imprecise Temporal Data and Spelling Errors
PhD Proposal
HEGLER TISSOT
Advisor: Marcos Didonet Del Fabro
Universidade Federal do Paraná
Curitiba – Brazil
Feb / 2014
2
• Introduction
– Context
– Motivating example
– Problem
– Objectives
• State of the art
– Information Extraction
– Ontologies
– OBIE Systems
– Temporal Information
• Proposed work
– Spelling errors
– Temporal Information
Outline
3
• Information Management
– Large volumes of data are available
• 80% = text (on the Internet or within companies) (Aranha, 2007)
– Unstructured data formats
• New system modeling and building techniques
– What is the challenge?
• ???
Context
4
• Information Management
– Text Mining
• From “Data Mining”
• A technology whose purpose is to extract non-trivial and interesting knowledge from large collections of unstructured documents
• Classification / Clustering
• Indexing for Search
• Information Extraction
Context
6
• Text Mining – Classification / clustering
• Machine learning algorithms
Based on the textual content of medical records, how can a possible group of diseases be identified?
The most discriminant words do not necessarily represent the most suitable concepts.
(“not” >> E10 x E11)
Context
7
• Text Mining – Information Extraction (IE)
• In IE, relevant information from natural language (NL) texts is identified, collected and normalized.
– NLP
» Natural Language Processing
» Exhaustive deep NL analysis of all aspects of a text
– OBIE (Ontology-based Information Extraction)
Context
8
Context
Unstructured Textual Content
Medical Record Sample: Blood pressure is lower. No vision complaints. Sub optimal sugar, control with retinopathy and neuropathy, high glucometer readings. Will work harder on diet. Will increase insulin by 2 units.
Information Extraction (IE)
Ontology-based Information Extraction (OBIE)
[Figure: terms extracted from the sample: vision OK, high, lower, blood pressure, glucometer, sugar, retinopathy, neuropathy]
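The extraction step on this slide can be sketched roughly as follows. This is only an illustration: the tiny ontology (`ONTOLOGY`, its concept names and its term lists) is hypothetical and hand-made for the sample record, and a real OBIE system would rely on a proper ontology and NLP pipeline rather than plain term matching.

```python
import re

# Hypothetical mini-ontology: concept name -> surface terms (illustrative only).
ONTOLOGY = {
    "BloodPressure": ["blood pressure"],
    "Vision": ["vision"],
    "BloodSugar": ["sugar", "glucometer"],
    "Complication": ["retinopathy", "neuropathy"],
}

RECORD = (
    "Blood pressure is lower. No vision complaints. Sub optimal sugar, "
    "control with retinopathy and neuropathy, high glucometer readings. "
    "Will work harder on diet. Will increase insulin by 2 units."
)

def extract(text):
    """Return (concept, term) pairs for every ontology term found, sentence by sentence."""
    found = []
    for sentence in re.split(r"(?<=\.)\s+", text):
        lowered = sentence.lower()
        for concept, terms in ONTOLOGY.items():
            for term in terms:
                if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                    found.append((concept, term))
    return found

for concept, term in extract(RECORD):
    print(concept, "<-", term)
```

Terms outside the ontology (for example "diet" or "insulin" here) are simply ignored, which is the sense in which the ontology guides what gets extracted.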
9
Where to apply?
1. Information Extraction
• Medical records
– Statistical view from unstructured data
• Internet data/news/...
– Learning more about competitors
• Social Networks
– How to sort and identify specific profiles?
» (e.g. drug dealers)
• Documents
– How many (Word, PDF, ...) documents do you have on your computer/server? What do they say?
Context
10
2. Answering natural language queries
Context
Which patients presented symptoms X, Y or Z in cases of diseases A, B or C in the last two years?
Result: Patient 1, Patient 2, Patient 3, ...
“match”
11
3. Semantic + Analytical Data
Context
Who were the best customers in the last six months?
12
Medical Record Example (in Portuguese)
Motivating Example
14
Temporal Information (Precise + Imprecise)
Motivating Example
15
Spelling Errors
Motivating Example
16
– Extract temporal information from text
– Organize events in a timeline
– Uncertainty → imprecise temporal data
• “a few weeks ago”
• “the coming months”
• “around 10:00 am”
• “in the beginning of next month”
– Spelling errors
• “in the last tree days”
Problem
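One possible way to make such imprecise expressions computable is to map each one to a pair of bounds (earliest, latest). The sketch below is only an illustration: the function names, the 30-minute tolerance for "around" and the 2 to 5 week range for "a few weeks" are assumptions, not values taken from the proposal.

```python
from datetime import datetime, timedelta

def around(moment, tolerance=timedelta(minutes=30)):
    """'around 10:00 am' -> (earliest, latest); the tolerance is an assumed value."""
    return (moment - tolerance, moment + tolerance)

def few_weeks_ago(reference, min_weeks=2, max_weeks=5):
    """'a few weeks ago' -> (earliest, latest) relative to a reference date.

    The 2..5 week range is an assumption chosen for illustration.
    """
    return (reference - timedelta(weeks=max_weeks),
            reference - timedelta(weeks=min_weeks))

reference = datetime(2014, 2, 20, 12, 0)
print(around(datetime(2014, 2, 20, 10, 0)))
print(few_weeks_ago(reference))
```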
17
OBIE approach to extract and deal with:
– Uncertain Temporal Information
– Spelling Errors
Objective
18
(Bird et al., 2009)
Information Extraction
19
(Nedellec and Nazarenko, 2006)
Ontology-based Information Extraction (OBIE)
Ontologies
20
– Formal specification of concepts
– Knowledge Domain
Classes + Instances + Properties + Relations + Axioms = Formal Conceptualization
(Gruber, 1993)
Web Ontology Language (OWL)
Ontology
21
– process unstructured text (natural language)
– guided by ontologies
– present output using ontologies
– specific knowledge domain extraction
(Wimalasuriya and Dou, 2010)
– Desired features
• String similarity
• Inexact matching
• Large repositories
• Multiple ontologies
• Temporal information
OBIE Systems
22
OBIE General Framework
26
2. State of the Art
27
– Organize events in a timeline
– Establish chronological order
– Answer temporal questions (Wong et al., 2005)
• “Which were the most prescribed drugs in the last weeks?”
• “Who used aspirin before having <symptom>?”
• “When did <event-description> happen?”
– Challenges (Temporal Information Extraction)
• Linguistics: different expressions
• Reference resolution: “tomorrow”
• Negation: “not before”, “it didn’t happen last year”
Temporal Information Extraction
28
Tokens that represent a temporal entity (point in time, duration, frequency) (Sanampudi and Kumari, 2010)
• Explicit
– January 2013
• Implicit
– Christmas 2012
• Relative (indexed)
– Yesterday, next month, three days ago (Alonso et al., 2007)
• Vague
– Several weeks, in the next days (Schilder and Habel, 2003)
Temporal Expressions
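The four categories above could be told apart with a handful of rules. The sketch below is a rough illustration only: the regular expressions, the rule order and the `classify` function are assumptions made for the slide's own examples, and they cover nowhere near the variety of real temporal expressions.

```python
import re

MONTHS = (r"(january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")
UNIT = r"(day|week|month|year)"
NUMBER = r"(\d+|one|two|three|four|five|six|seven|eight|nine|ten)"

# Rules are tried in order; vague patterns come first so that
# "in the next days" is not mistaken for a relative expression.
RULES = [
    ("vague", re.compile(
        rf"\b(several|few|some)\s+{UNIT}s\b|\b(next|last|coming)\s+{UNIT}s\b")),
    ("relative", re.compile(
        rf"\b(yesterday|today|tomorrow)\b|\b(next|last)\s+{UNIT}\b"
        rf"|\b{NUMBER}\s+{UNIT}s?\s+ago\b")),
    ("implicit", re.compile(r"\b(christmas|easter|new year)\b")),
    ("explicit", re.compile(
        rf"\b{MONTHS}\s+\d{{4}}\b|\b\d{{1,2}}/\d{{1,2}}/\d{{4}}\b")),
]

def classify(expression):
    """Return the first matching category, or 'unknown' if no rule applies."""
    text = expression.lower()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"

for example in ["January 2013", "Christmas 2012", "three days ago", "Several weeks"]:
    print(example, "->", classify(example))
```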
29
– Formal representation of temporal concepts
• OWL-Time (Hobbs and Pan, 2004)
• TL-OWL (Kim et al., 2008)
• Other temporal approaches and OWL extensions
– Challenges (Imprecise Temporal Information)
• Extraction
• Representation
• Logics and algebra
Temporal Ontologies
34
Dealing with Imprecise Temporal Information
Logics and algebra
[Figure: intervals A and B on a timeline, each with an imprecise begin and end]
before( A.begin , B.begin )? → probably true/false
before( B.begin , A.end )? → probably true/false
before( A.end , B.end )? → probably true/false
before( A.begin , B.end )? → TRUE
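The four answers on this slide can be reproduced with a small sketch in which every imprecise point is a range of possible values. The class name `ImprecisePoint`, the three-valued `before` function and the numeric ranges are all assumptions chosen to mimic the figure; the proposal itself does not specify a representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImprecisePoint:
    earliest: float  # lowest value the point may take
    latest: float    # highest value the point may take

def before(p, q):
    """Three-valued comparison of two imprecise points."""
    if p.latest < q.earliest:
        return "TRUE"                 # p precedes q for every possible value
    if p.earliest >= q.latest:
        return "FALSE"                # p can never precede q
    return "probably true/false"      # the ranges overlap

# Assumed ranges for the begin/end points of intervals A and B.
a_begin, a_end = ImprecisePoint(0, 4), ImprecisePoint(6, 10)
b_begin, b_end = ImprecisePoint(2, 7), ImprecisePoint(9, 13)

print(before(a_begin, b_begin))  # probably true/false
print(before(b_begin, a_end))    # probably true/false
print(before(a_end, b_end))      # probably true/false
print(before(a_begin, b_end))    # TRUE
```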
36
Dealing with Imprecise Temporal Information
Logics and algebra
[Figure: intervals A, B and C on a timeline, each with an imprecise begin and end]
before( A.begin , B.begin )? → probably true/false
<statement>: after( B.begin , C.end )
before( A.begin , B.begin )? → TRUE
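The effect of the extra statement can be sketched as a narrowing of B.begin: once B.begin is known to come after C.end, its earliest possible value rises, and the earlier uncertain answer becomes TRUE. The names `ImprecisePoint`, `before` and `narrow_after`, as well as the numeric ranges, are hypothetical and serve only to illustrate the reasoning.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImprecisePoint:
    earliest: float
    latest: float

def before(p, q):
    """Three-valued comparison of two imprecise points."""
    if p.latest < q.earliest:
        return "TRUE"
    if p.earliest >= q.latest:
        return "FALSE"
    return "probably true/false"

def narrow_after(p, q):
    """Apply the statement after(p, q): p cannot lie before q's earliest value."""
    return ImprecisePoint(max(p.earliest, q.earliest), p.latest)

# Assumed ranges for the points shown in the figure.
a_begin = ImprecisePoint(0, 4)
b_begin = ImprecisePoint(2, 7)
c_end = ImprecisePoint(5, 6)

print(before(a_begin, b_begin))          # probably true/false
b_begin = narrow_after(b_begin, c_end)   # B.begin is now within 5..7
print(before(a_begin, b_begin))          # TRUE
```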
37
– Inaccurate temporal expression (+)
– Define temporal concepts (+)
– Perform arithmetic or logic operations (-)
Temporal Information Approaches
38
– Spelling Errors
• WordNet extensions to deal with phonetic similarity
– Imprecise Temporal Information Extraction
Proposed Work
39
* ED (Levenshtein, 1966), TS (Oliver, 1993), HD (Hamming, 1950), LCS (Allison and Dix, 1986), SWD (Smith and Waterman, 1981), MED (Monge and Elkan, 1996), JWD (Winkler and Thibaudeau, 1991), Soundex (Knuth, 1968), FastSS (Bocek et al., 2007)
Spelling Errors x Similarity
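As an illustration of the first measure in the list, edit distance (ED), here is a minimal dynamic-programming sketch. It is the textbook Levenshtein algorithm, not the Stringsim function of the proposal; the function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("tree", "three"))      # 1
print(edit_distance("kitten", "sitting"))  # 3
```

With this measure, the misspelling "tree" from the problem statement is one edit away from "three".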
40
– Multi language / Multi dictionary
– Derivative Words
• Verb conjugation in Portuguese
– 13 tenses; 67 variations
» Unlike English (7 variations for ‘to be’): {am, is, are, was, were, being, been}
– Fast Phonetic Similarity Search
WordNet Extensions
41
Stringsim function
Fast Phonetic Similarity Search
43
PhoneticMapPT
Fast Phonetic Similarity Search
44
PhoneticMapSimPT function
Fast Phonetic Similarity Search
45
– Similarity Search Methods
• Full
• Fast
– PhoneticSearchPT function (fast search method)
Fast Phonetic Similarity Search
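The difference between a full and a fast search can be sketched with a precomputed index of phonetic keys. This is not PhoneticSearchPT or PhoneticMapPT, whose definitions are not given in this transcript: the `phonetic_key` function below is a deliberately simplified, hypothetical key for English words, used only to show why an index lookup avoids scanning the whole dictionary.

```python
import re
from collections import defaultdict

def phonetic_key(word):
    """Simplified, hypothetical phonetic key (an assumption for illustration)."""
    w = word.lower()
    for src, dst in (("ph", "f"), ("th", "t"), ("y", "i")):
        w = w.replace(src, dst)
    w = re.sub(r"(.)\1+", r"\1", w)               # collapse repeated letters
    return w[:1] + re.sub(r"[aeiou]", "", w[1:])  # keep first letter, drop later vowels

def full_search(words, query):
    """Full method: compare the query against every dictionary word."""
    key = phonetic_key(query)
    return [w for w in words if phonetic_key(w) == key]

def build_index(words):
    """Precompute key -> words once, so later lookups need no scan."""
    index = defaultdict(list)
    for word in words:
        index[phonetic_key(word)].append(word)
    return index

def fast_search(index, query):
    """Fast method: a single dictionary lookup by phonetic key."""
    return index.get(phonetic_key(query), [])

WORDS = ["three", "tree", "earlier", "month", "week"]
INDEX = build_index(WORDS)
print(fast_search(INDEX, "earlyer"))  # ['earlier']
print(fast_search(INDEX, "tree"))     # ['three', 'tree']
```

Both methods return the same candidates; the fast one trades memory for the index against lookup time.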
46
– Precise x Imprecise Temporal Information
• “08:15 am” x “earlier in the morning”
» Which one happened before?
– Experiment
• 4,748 medical records (MR)
• 3,583 imprecise expressions (in 2,018 MR – 42.5%)
Uncertain Temporal Information Extraction
47
• Temporal Information Mapping
• Extracting Process
• Answering User Queries
• Case Study
Proposed Activities
48
• Temporal Information Mapping
• Extracting Process
• Answering User Queries
• Case Study
Proposed Activities
A.1. Temporal Ontology
A.2. Temporal Expressions
A.3. Numeral Ontology
49
• Temporal Information Mapping
• Extracting Process
• Answering User Queries
• Case Study
Proposed Activities
B.1. Temporal Representation
B.2. Annotation Schemes
B.3. Phonetic Similarity
B.4. OWL Extension
B.5. Extraction Rules
50
• Temporal Information Mapping
• Extracting Process
• Answering User Queries
• Case Study
Proposed Activities
C.1. Temporal Algebra
C.2. Analytical Queries
51
• Temporal Information Mapping
• Extracting Process
• Answering User Queries
• Case Study
Proposed Activities
D.1. Ontology Generator
D.2. Domain Ontology
D.3. Information Extraction
D.4. Accuracy Evaluation
52
Proposed Activities
A.1 Temporal Ontology: Create an ontology to define imprecise temporal concepts
A.2 Temporal Expressions: List possible temporal expressions in Portuguese and English
A.3 Numeral Ontology: Search for a numeral ontology that maps numeric values in the form of words
B.1 Temporal Representation: Define a representation for temporal expressions
B.2 Annotation Schemes: Review and adapt temporal annotation schemes to support uncertain temporal data
B.3 Phonetic Similarity: Apply the Phonetic Fast Search method in the annotation and extraction processes
B.4 Temporal OWL Extension: Propose an extension to the OWL metamodel to support temporal-dependent elements
B.5 Extraction Rules: Define a set of extraction rules needed to handle uncertain temporal data
C.1 Temporal Algebra: Review the literature concerning Temporal Algebra
C.2 Analytical Queries: Convert natural language queries to analytical queries
D.1 Ontology Generator: Review the literature to describe methods to convert data models to ontologies
D.2 Domain Ontology: Create an ontology to handle medical domain knowledge available in InfoSaude
D.3 Information Extraction: Design and develop part of the proposed framework (case study: medical records)
D.4 Accuracy Evaluation: Search for a benchmark to measure accuracy of proposed work
P.x Articles
T Thesis
53
• Schedule
Uncertain Temporal Information Extraction
54
– Uncertain x Imprecise x Inaccurate
• How can the differences between these word senses help organize temporal expressions into different groups?
– Fuzzy Logic
• Imprecise time ≡ fuzzy time? How to apply fuzzy logic to inaccurate temporal data? Are there alternatives?
– OBIE Accuracy
• How to evaluate IE accuracy?
Pending questions...