OBIE – Ontology-based Information Extraction

An Approach to Extract and Deal with Imprecise Temporal Data and Spelling Errors

PhD Proposal

Hegler Tissot
Advisor: Marcos Didonet Del Fabro
Universidade Federal do Paraná
Curitiba – Brazil
Feb / 2014



Outline

• Introduction
  – Context
  – Motivating example
  – Problem
  – Objectives
• State of the art
  – Information Extraction
  – Ontologies
  – OBIE Systems
  – Temporal Information
• Proposed work
  – Spelling errors
  – Temporal Information

Context

• Information Management
  – Large volumes of data are available
    • 80% = text (on the Internet or within companies) (Aranha, 2007)
  – Unstructured data formats
    • New system modeling and building techniques
  – What is the challenge?
    • ???

Context

• Information Management
  – Text Mining
    • From “Data Mining”
    • A technology whose purpose is to extract non-trivial and interesting knowledge from large collections of unstructured documents
    • Classification / Clustering
    • Indexing for Search
    • Information Extraction


Context

• Text Mining – Classification / clustering
  • Machine learning algorithms

Based on the textual content of medical records, how can a possible group of diseases be identified?

The most discriminant words do not necessarily represent the most suitable concepts (“not” >> E10 x E11).

Context

• Text Mining – Information Extraction (IE)
  • In IE, relevant information from natural language (NL) texts is identified, collected and normalized.
    – NLP
      » Natural Language Processing
      » Exhaustive deep NL analysis of all aspects of a text
    – OBIE (Ontology-based Information Extraction)

Context

Unstructured Textual Content

Medical Record Sample: Blood pressure is lower. No vision complaints. Sub optimal sugar, control with retinopathy and neuropathy, high glucometer readings. Will work harder on diet. Will increase insulin by 2 units.

[Figure: Information Extraction (IE) and Ontology-based Information Extraction (OBIE) applied to the sample. Labels shown: vision OK, high, lower, blood pressure, glucometer, sugar, retinopathy, neuropathy.]
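To make the idea of pulling terms out of the sample record concrete, here is a minimal Python sketch of dictionary-based extraction. The `CONCEPTS` vocabulary, its category names and the `extract` function are assumptions invented for this illustration; they are not part of the proposal and stand in for what a real domain ontology would provide.

```python
import re

# Hypothetical mini-vocabulary (concept -> surface terms). A real OBIE system
# would take these from a domain ontology instead of a hard-coded dict.
CONCEPTS = {
    "Finding": ["blood pressure", "vision", "sugar", "glucometer"],
    "Disease": ["retinopathy", "neuropathy"],
    "Qualifier": ["lower", "high"],
}

SAMPLE = (
    "Blood pressure is lower. No vision complaints. Sub optimal sugar, "
    "control with retinopathy and neuropathy, high glucometer readings."
)

def extract(text):
    """Return (term, concept) pairs in the order they appear in the text."""
    lowered = text.lower()
    found = []
    for concept, terms in CONCEPTS.items():
        for term in terms:
            # Whole-word match so that "high" does not match inside "higher".
            for match in re.finditer(rf"\b{re.escape(term)}\b", lowered):
                found.append((match.start(), term, concept))
    return [(term, concept) for _, term, concept in sorted(found)]

if __name__ == "__main__":
    for term, concept in extract(SAMPLE):
        print(f"{concept}: {term}")
```

Exact dictionary matching like this fails on misspelled terms, which is one reason the proposal looks at similarity-based matching.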

Context

Where to apply?

1. Information Extraction
  • Medical records
    – Statistical view from unstructured data
  • Internet data/news/...
    – Knowing more about competitors
  • Social Networks
    – How to sort and identify specific profiles?
      » (e.g. drug dealers)
  • Documents
    – How many (Word, PDF, ...) documents do you have on your computer/server? What do they say?

Context

2. Answering natural language queries

“Which patients presented symptoms X, Y or Z in cases of diseases A, B or C in the last two years?”

Result: Patient 1, Patient 2, Patient 3, ...

“match”

Context

3. Semantic + Analytical Data

“Which were the best customers in the last semester?”

Motivating Example

Medical Record Example (in Portuguese)

• Temporal Information (Precise + Imprecise)
• Spelling Errors

Problem

– Extract temporal information from text
– Organize events in a timeline
– Uncertainty → imprecise temporal data
  • “a few weeks ago”
  • “the coming months”
  • “around 10:00 am”
  • “in the beginning of next month”
– Spelling errors
  • “in the last tree days”

Objective

OBIE approach to extract and deal with:
– Uncertain Temporal Information
– Spelling Errors

Information Extraction

[Figure: (Bird et al., 2009)]

Ontology-based Information Extraction (OBIE)

[Figure: (Nedellec and Nazarenko, 2006); label shown: Ontologies]

Ontology

– Formal specification of concepts
– Knowledge Domain

Classes + Instances + Properties + Relations + Axioms = Formal Conceptualization (Gruber, 1993)

Web Ontology Language (OWL)

OBIE Systems

– Process unstructured text (natural language)
– Guided by ontologies
– Present output using ontologies
– Specific knowledge domain extraction
(Wimalasuriya and Dou, 2010)

– Desired features
  • String similarity
  • Inexact matching
  • Large repositories
  • Multiple ontologies
  • Temporal information

OBIE General Framework

[Figure: OBIE general framework, shown over four slides]

2. State of the Art

Temporal Information Extraction

– Organize events in a timeline
– Establish chronological order
– Answer temporal questions (Wong et al., 2005)
  • “Which were the most prescribed drugs in the last weeks?”
  • “Who used aspirin before having <symptom>?”
  • “When did <event-description> happen?”
– Challenges (Temporal Information Extraction)
  • Linguistics: different expressions
  • Reference resolution: “tomorrow”
  • Negation: “not before”, “it didn't happen last year”

Temporal Expressions

Tokens that represent a temporal entity (point in time, duration, frequency) (Sanampudi and Kumari, 2010)

• Explicit
  – January 2013
• Implicit
  – Christmas 2012
• Relative (indexed)
  – Yesterday, next month, three days ago (Alonso et al., 2007)
• Vague
  – Several weeks, in the next days (Schilder and Habel, 2003)
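As a rough illustration of the four categories, the following Python sketch tags an expression using a handful of hand-written patterns. The pattern lists and the function name `classify_temporal_expression` are assumptions made for this example; they are not taken from the proposal and cover only a few English expressions.

```python
import re

# Hypothetical, deliberately tiny pattern sets. A real extractor would need
# far richer resources, and Portuguese patterns as well.
MONTHS = (
    r"(january|february|march|april|may|june|july|august|"
    r"september|october|november|december)"
)
PATTERNS = [
    ("explicit", re.compile(rf"^{MONTHS}\s+\d{{4}}$")),
    ("implicit", re.compile(r"^(christmas|easter|new year)\b.*\d{4}$")),
    ("relative", re.compile(
        r"^(yesterday|today|tomorrow|(next|last)\s+\w+|\w+\s+\w+\s+ago)$")),
    ("vague", re.compile(
        r"^(several|a few|some)\s+\w+$|^in the (next|coming|last) \w+$")),
]

def classify_temporal_expression(text: str) -> str:
    """Return the first matching category label, or "unknown"."""
    normalized = " ".join(text.lower().split())
    for label, pattern in PATTERNS:
        if pattern.search(normalized):
            return label
    return "unknown"

if __name__ == "__main__":
    for expr in ["January 2013", "Christmas 2012", "three days ago",
                 "Several weeks", "in the next days"]:
        print(expr, "->", classify_temporal_expression(expr))
```

Expressions outside these patterns fall through to "unknown", which shows how quickly a purely rule-based list runs out of coverage.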

Temporal Ontologies

– Formal representation of temporal concepts
  • OWL-Time (Hobbs and Pan, 2004)
  • TL-OWL (Kim et al., 2008)
  • Other temporal approaches and OWL extensions
– Challenges (Imprecise Temporal Information)
  • Extraction
  • Representation
  • Logics and algebra

Dealing with Imprecise Temporal Information

Logics and algebra

[Figure: intervals A and B with their begin and end points]

before( A.begin , B.begin )? → probably true/false
before( B.begin , A.end )? → probably true/false
before( A.end , B.end )? → probably true/false
before( A.begin , B.end )? → TRUE
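One way to read the four comparisons is to treat each begin or end point as known only within a range of possible values. The Python sketch below follows that reading under stated assumptions: the `ImprecisePoint` class, the three-valued `before` function and the numeric bounds are all hypothetical and chosen only to reproduce the pattern of answers on the slide; the proposal does not define this representation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ImprecisePoint:
    """A time point known only to lie within [earliest, latest]."""
    earliest: float
    latest: float

def before(p: ImprecisePoint, q: ImprecisePoint) -> Optional[bool]:
    """Three-valued comparison of two imprecise points.

    True  -> every possible value of p precedes every possible value of q
    False -> no possible value of p precedes any possible value of q
    None  -> the ranges overlap ("probably true/false" on the slide)
    """
    if p.latest < q.earliest:
        return True
    if p.earliest >= q.latest:
        return False
    return None

# Hypothetical bounds (arbitrary time units) for the two intervals A and B.
A_BEGIN, A_END = ImprecisePoint(1, 4), ImprecisePoint(5, 8)
B_BEGIN, B_END = ImprecisePoint(3, 6), ImprecisePoint(7, 10)

if __name__ == "__main__":
    print(before(A_BEGIN, B_BEGIN))  # ranges overlap
    print(before(B_BEGIN, A_END))    # ranges overlap
    print(before(A_END, B_END))      # ranges overlap
    print(before(A_BEGIN, B_END))    # ranges are disjoint
```

Returning `None` keeps the undecided case explicit; a probabilistic or fuzzy model could replace it with a degree of belief.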

Dealing with Imprecise Temporal Information

Logics and algebra

[Figure: intervals A, B and C with their begin and end points]

before( A.begin , B.begin )? → probably true/false
<statement>: after( B.begin , C.end )
before( A.begin , B.begin )? → TRUE
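The effect of the extra statement can be sketched as a narrowing of bounds. In the Python sketch below, every name (`ImprecisePoint`, `before`, `apply_after`) and every number is an assumption made for illustration. In particular, it assumes that C.end lies after every possible value of A.begin, since the figure that positions C is not available in this transcript.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class ImprecisePoint:
    """A time point known only to lie within [earliest, latest]."""
    earliest: float
    latest: float

def before(p: ImprecisePoint, q: ImprecisePoint) -> Optional[bool]:
    """True/False when the bounds decide the order, None when they overlap."""
    if p.latest < q.earliest:
        return True
    if p.earliest >= q.latest:
        return False
    return None

def apply_after(p: ImprecisePoint, q: ImprecisePoint) -> ImprecisePoint:
    """Narrow p using the statement after(p, q).

    If p is after q, p cannot be earlier than the earliest possible q.
    """
    narrowed = max(p.earliest, q.earliest)
    if narrowed > p.latest:
        raise ValueError("statement contradicts the known bounds")
    return replace(p, earliest=narrowed)

# Hypothetical bounds (arbitrary time units).
A_BEGIN = ImprecisePoint(1, 4)
B_BEGIN = ImprecisePoint(3, 6)
C_END = ImprecisePoint(4.5, 5)  # assumed to lie after all of A.begin

if __name__ == "__main__":
    print(before(A_BEGIN, B_BEGIN))          # undecided
    b_begin = apply_after(B_BEGIN, C_END)    # <statement>: after(B.begin, C.end)
    print(before(A_BEGIN, b_begin))          # now decided
```

Only the lower bound of B.begin is tightened here; a fuller treatment would also tighten C.end and propagate the change to related points.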

Temporal Information Approaches

– Inaccurate temporal expression (+)
– Define temporal concepts (+)
– Perform arithmetic or logic operations (-)

Proposed Work

– Spelling Errors
  • WordNet Extensions to deal with phonetic similarity
– Imprecise Temporal Information Extraction

Spelling Errors x Similarity

* ED (Levenshtein, 1966), TS (Oliver, 1993), HD (Hamming, 1950), LCS (Allison and Dix, 1986), SWD (Smith and Waterman, 1981), MED (Monge and Elkan, 1996), JWD (Winkler and Thibaudeau, 1991), Soundex (Knuth, 1968), FastSS (Bocek et al., 2007)
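Of the measures listed, edit distance (ED) is the simplest to show. The Python sketch below is a standard dynamic-programming Levenshtein distance applied to the “tree” / “three” spelling error from the Problem slide. The helper names, the normalisation by the longer length and the small vocabulary are assumptions for this illustration, not the proposal's similarity function.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Distance scaled into [0, 1] by the longer string (an assumed choice)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def closest(word: str, vocabulary: list) -> str:
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

if __name__ == "__main__":
    numerals = ["two", "three", "four"]  # hypothetical vocabulary
    print(edit_distance("tree", "three"), closest("tree", numerals))
```

Note that “tree” is itself a valid English word, so a dictionary lookup alone would not flag it; the surrounding context (“in the last ... days”) is what suggests a numeral was intended.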

WordNet Extensions

– Multi language / Multi dictionary
– Derivative Words
  • Verb conjugation in Portuguese
    – 13 tenses; 67 variations
      » Unlike English (7 variations for ‘to be’): {am, is, are, was, were, being, been}
– Fast Phonetic Similarity Search

Fast Phonetic Similarity Search

– Stringsim function
– PhoneticMapPT
– PhoneticMapSimPT function
– Similarity Search Methods
  • Full
  • Fast
– PhoneticSearchPT function (fast search method)
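The transcript does not include the definitions of Stringsim, PhoneticMapPT or PhoneticSearchPT, so the Python sketch below only illustrates the general idea behind a "fast" search: index the dictionary by a phonetic key once, then look up a query by its key instead of comparing it with every word. The replacement rules in `RULES` and all function names are invented for this example and should not be read as the proposal's actual mapping.

```python
from collections import defaultdict

# Toy replacement rules, applied in order. These are hypothetical and are
# NOT the proposal's PhoneticMapPT.
RULES = [("ph", "f"), ("ch", "x"), ("ss", "s"), ("ç", "s"), ("z", "s"), ("h", "")]

def phonetic_key(word: str) -> str:
    """Reduce a word to a rough phonetic key."""
    key = word.lower()
    for old, new in RULES:
        key = key.replace(old, new)
    # Collapse runs of the same letter, e.g. "ee" -> "e".
    collapsed = []
    for char in key:
        if not collapsed or collapsed[-1] != char:
            collapsed.append(char)
    return "".join(collapsed)

def build_index(words):
    """Group dictionary words by phonetic key (done once, up front)."""
    index = defaultdict(list)
    for word in words:
        index[phonetic_key(word)].append(word)
    return index

def fast_search(index, query):
    """Fast method: one hash lookup on the query's key."""
    return index.get(phonetic_key(query), [])

def full_search(words, query):
    """Full method: compare the query's key against every word."""
    target = phonetic_key(query)
    return [word for word in words if phonetic_key(word) == target]

if __name__ == "__main__":
    dictionary = ["three", "cabeça", "dor"]  # hypothetical dictionary
    index = build_index(dictionary)
    print(fast_search(index, "tree"), fast_search(index, "cabessa"))
```

Both methods return the same candidates here; the difference is that the full method scans the whole dictionary per query, while the indexed lookup does not.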

Uncertain Temporal Information Extraction

– Precise x Imprecise Temporal Information
  • “08:15 am” x “earlier in the morning”
    » Which one happened before?
– Experiment
  • 4,748 medical records (MR)
  • 3,583 imprecise expressions (in 2,018 MR – 42.5%)

Proposed Activities

• Temporal Information Mapping
  A.1. Temporal Ontology
  A.2. Temporal Expressions
  A.3. Numeral Ontology
• Extracting Process
  B.1. Temporal Representation
  B.2. Annotation Schemes
  B.3. Phonetic Similarity
  B.4. OWL Extension
  B.5. Extraction Rules
• Answering User Queries
  C.1. Temporal Algebra
  C.2. Analytical Queries
• Case Study
  D.1. Ontology Generator
  D.2. Domain Ontology
  D.3. Information Extraction
  D.4. Accuracy Evaluation

Proposed Activities

A.1  Temporal Ontology: Create an ontology to define imprecise temporal concepts
A.2  Temporal Expressions: List possible temporal expressions in Portuguese and English
A.3  Numeral Ontology: Search for a numeral ontology that maps numeric values in the form of words
B.1  Temporal Representation: Define a representation for temporal expressions
B.2  Annotation Schemes: Review and adapt temporal annotation schemes to support uncertain temporal data
B.3  Phonetic Similarity: Apply the Phonetic Fast Search method in the annotation and extraction processes
B.4  Temporal OWL Extension: Propose an extension to the OWL metamodel to support temporal-dependent elements
B.5  Extraction Rules: Define a set of extraction rules needed to handle uncertain temporal data
C.1  Temporal Algebra: Review the literature concerning Temporal Algebra
C.2  Analytical Queries: Convert natural language queries to analytical queries
D.1  Ontology Generator: Review the literature to describe methods to convert data models to ontologies
D.2  Domain Ontology: Create an ontology to handle medical domain knowledge available in InfoSaude
D.3  Information Extraction: Design and develop part of the proposed framework - case study - medical records
D.4  Accuracy Evaluation: Search for a benchmark to measure accuracy of proposed work
P.x  Articles
T    Thesis

Uncertain Temporal Information Extraction

• Schedule

Pending questions...

– Uncertain x Imprecise x Inaccurate
  • How can the differences among these word senses help organize temporal expressions into different groups?
– Fuzzy Logic
  • Imprecise time ≡ fuzzy time? How to apply fuzzy logic to inaccurate temporal data? Are there alternatives?
– OBIE Accuracy
  • How to evaluate IE accuracy?
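The fuzzy logic question can be made concrete with a small sketch. The Python code below models “around 10:00 am” as a triangular fuzzy set over minutes since midnight. The triangular shape, the 30-minute spread and the function names are assumptions for illustration only; the proposal leaves open whether fuzzy logic is the right tool.

```python
def triangular_membership(t: float, center: float, spread: float) -> float:
    """Degree (0..1) to which time t belongs to "around center".

    Membership is 1 at the center and falls linearly to 0 at a distance of
    `spread` on either side.
    """
    return max(0.0, 1.0 - abs(t - center) / spread)

def minutes(hour: int, minute: int = 0) -> int:
    """Convert a clock time to minutes since midnight."""
    return hour * 60 + minute

# "around 10:00 am" with an assumed spread of 30 minutes.
AROUND_TEN = {"center": minutes(10), "spread": 30}

if __name__ == "__main__":
    for clock in [(9, 30), (9, 45), (10, 0), (10, 15), (10, 40)]:
        degree = triangular_membership(minutes(*clock), **AROUND_TEN)
        print(f"{clock[0]:02d}:{clock[1]:02d} -> {degree:.2f}")
```

Choosing the spread is the hard part: it depends on the expression and the domain, which is part of what the pending question asks.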
