
Ontology-Based Information Extraction and Structuring

Stephen W. Liddle†

School of Accountancy and Information Systems

Brigham Young University

Douglas M. Campbell, David W. Embley,‡ and Randy D. Smith

Research funded in part by †Faneuil Research and ‡Novell, Inc.

Copyright 1998


Motivation

Database-style queries are effective
  – Find red cars, 1993 or newer, < $5,000
    • Select * From Car Where Color=“red” And Year >= 1993 And Price < 5000
Web is not a database
  – Uses keyword search
  – Retrieves documents, not records
  – Assuming we have a range operator:
    • “red” and (1993 to 1998) and (1 to 5000)
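The contrast between the two query styles can be sketched in a few lines of Python. This is only an illustration: the table layout, the sample rows, and the two sample documents are hypothetical and do not come from the talk.

```python
import sqlite3

# Hypothetical sample rows: (Id, Color, Year, Price).
ROWS = [
    (1, "red", 1995, 4500),
    (2, "red", 1991, 3000),
    (3, "blue", 1996, 4000),
    (4, "red", 1997, 11995),
]


def find_cars(rows):
    """Run the slide's structured query against an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table Car (Id integer, Color text, Year integer, Price integer)"
    )
    conn.executemany("insert into Car values (?, ?, ?, ?)", rows)
    result = conn.execute(
        "select Id from Car where Color = ? and Year >= ? and Price < ? order by Id",
        ("red", 1993, 5000),
    ).fetchall()
    conn.close()
    return [r[0] for r in result]


def keyword_hits(documents, keyword):
    """Keyword search: return the indexes of documents containing the word."""
    return [i for i, doc in enumerate(documents) if keyword in doc.lower()]


# Hypothetical documents: one car ad and one unrelated page.
DOCS = [
    "'95 Honda, red, $4,500",
    "Red Cross blood drive this Saturday",
]

print(find_cars(ROWS))            # record-level answer
print(keyword_hits(DOCS, "red"))  # document-level answer, includes the unrelated page
```

The structured query returns records that satisfy all three conditions, while the keyword search returns whole documents and cannot tell a color from part of an organization name.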


Solutions

Web query languages
Wait for XML to emerge
  – Interoperation/Standards?
  – XML query language?
Wrappers
  – Hand-written or semi-automatically generated parsers
  – Specific to source site, subject to change


Our Approach

Automatic wrapper generation
Based on application ontology
  – Augmented conceptual model
  – Defines constants, keywords, their relationships
Best for:
  – Narrow ontological breadth
  – Data-rich documents


Car-Ad Ontology
Object-Relationship Model + Data Frames

Graphical

[Figure: object-relationship diagram. Object sets: Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr, Extension. Relationship sets are labeled "has" and "is for", with participation constraints 0..1, 0..*, and 1..*.]

Textual

Car [0:1] has Year [1:*];
Year {regexp[2]: “\d{2} : \b’\d{2}\b”, … };
Car [0:1] has Make [1:*];
Make {regexp[10]: “\bchev\b”, “\bchevy\b”, … };
Car [0:1] has Model [1:*];
Model {…};
Car [0:1] has Mileage [1:*];
Mileage {regexp[8]: “\b[1-9]\d{1,2}k”, “[1-9]\d?,\d{3} : [^\$\d][1-9]\d?,\d{3}[^\d]” }
        {context: “\bmiles\b”, “\bmi\.”, “\bmi\b”};
Car [0:*] has Feature [1:*];
Feature {regexp[20]:
  -- Colors
  “\baqua\s+metallic\b”, “\bbeige\b”, …
  -- Transmission
  “(5|6)\s*spd\b”, “auto : \bauto(\.|,)”,
  -- Accessories
  “\broof\s+rack\b”, “\bspoiler\b”, …
...

(See Figures 2 & 3 of Paper)
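As a rough sketch of how a data frame could be held in code, the snippet below pairs each object set with its constant patterns and context keywords. The class name `ObjectSetFrame` and its methods are hypothetical, and the comma-style mileage pattern is a simplified stand-in for the slide's "pattern : context" rule. The Make patterns and the Mileage keywords are taken from the textual ontology above.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectSetFrame:
    """Hypothetical container for the recognizers of one object set."""
    name: str
    value_patterns: List[str]                                   # regexes for constants
    context_keywords: List[str] = field(default_factory=list)   # regexes for keywords

    def constants(self, text: str) -> List[str]:
        """Return every string matched by a value pattern, in text order."""
        hits = []
        for pattern in self.value_patterns:
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.start(), m.group(0)))
        return [string for _, string in sorted(hits)]

    def has_keyword(self, text: str) -> bool:
        """Report whether any context keyword occurs in the text."""
        return any(
            re.search(k, text, flags=re.IGNORECASE) for k in self.context_keywords
        )


MAKE = ObjectSetFrame("Make", [r"\bchev\b", r"\bchevy\b"])
MILEAGE = ObjectSetFrame(
    "Mileage",
    # The second pattern is a simplified assumption, not the slide's full rule.
    [r"\b[1-9]\d{1,2}k", r"\b[1-9]\d?,\d{3}\b"],
    [r"\bmiles\b", r"\bmi\.", r"\bmi\b"],
)

print(MAKE.constants("'97 CHEV Cavalier"))
print(MILEAGE.constants("only 7,000 miles"))
print(MILEAGE.has_keyword("only 7,000 miles"))
```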


Fixed Processes

[Figure: system architecture. Processes: Ontology Parser, Constant/Keyword Recognizer, Database-Instance Generator. Inputs and intermediate data: Application Ontology, Unstructured Document, Constant/Keyword Matching Rules, Data-Record Table, List of Objects, Relationships, and Constraints, Database Scheme. Output: Populated Database.]

(See Figure 1 of Paper)


Ontology Parser

[Figure: system architecture with the Ontology Parser highlighted. Input: Application Ontology. Outputs: Constant/Keyword Matching Rules, Database Scheme, and List of Objects, Relationships, and Constraints.]

Constant/Keyword Matching Rules:
  Make : \bchev\b
  …
  KEYWORD(Mileage) : \bmiles\b
  KEYWORD(Mileage) : \bmi\.
  ...

Database Scheme:
  create table Car ( Car integer, Year varchar(2), … );
  create table CarFeature ( Car integer, Feature varchar(10) );
  ...

List of Objects, Relationships, and Constraints:
  Object: Car;
  ...
  Car: Year [0:1];
  Car: Make [0:1];
  …
  CarFeature: Car [0:*] has Feature [1:*];
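The scheme fragment on this slide suggests that single-valued object sets become columns of the Car table, while a many-valued one such as Feature gets its own table. A minimal sketch of that mapping follows. The function name, the attribute subset, and the uniform `varchar(20)` column type are assumptions made for illustration (the slide's own types differ and are partly elided).

```python
# Each entry: (object set, maximum number of values per Car).
# None stands for an unbounded maximum (*). Hypothetical subset of the ontology.
RELATIONSHIPS = [
    ("Year", 1),
    ("Make", 1),
    ("Model", 1),
    ("Feature", None),
]


def generate_scheme(entity, relationships, col_type="varchar(20)"):
    """Emit create-table statements: one main table plus one per many-valued set."""
    columns = [f"{entity} integer"]
    extra_tables = []
    for name, max_card in relationships:
        if max_card == 1:
            # Functional relationship: store the value as a column.
            columns.append(f"{name} {col_type}")
        else:
            # Nonfunctional relationship: store values in a separate table.
            extra_tables.append(
                f"create table {entity}{name} ({entity} integer, {name} {col_type});"
            )
    main = f"create table {entity} ({', '.join(columns)});"
    return [main] + extra_tables


for statement in generate_scheme("Car", RELATIONSHIPS):
    print(statement)
```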


Constant/Keyword Recognizer

[Figure: system architecture with the Constant/Keyword Recognizer highlighted. Inputs: Unstructured Document and Constant/Keyword Matching Rules. Output: Data-Record Table.]

Unstructured Document:
  '97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800

Data-Record Table:
  Descriptor/String/Position(start/end)
  Year|97|1|3
  Make|CHEV|5|8
  Model|Cavalier|10|17
  Feature|Red|20|22
  Feature|5 spd|25|29
  Mileage|7,000|37|41
  KEYWORD(Mileage)|miles|43|47
  Price|11,995|108|114
  PhoneNr|566-3800|146|153
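A recognizer of this kind can be sketched as a loop over regular-expression rules that records each match with its position. The rule list below is a hypothetical subset written for this example, and positions are assumed to be 1-based and inclusive. Only the first sentence of the ad is used, so the printed positions can be checked by counting characters.

```python
import re

# Hypothetical matching rules: (descriptor, regex with one capturing group).
RULES = [
    ("Year", r"'(\d{2})\b"),
    ("Make", r"\b(chev)\b"),
    ("Model", r"\b(cavalier)\b"),
    ("Mileage", r"\b([1-9]\d?,\d{3})\b"),
    ("KEYWORD(Mileage)", r"\b(miles)\b"),
]


def recognize(text, rules=RULES):
    """Return data-record rows 'Descriptor|String|start|end', sorted by position."""
    rows = []
    for descriptor, pattern in rules:
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = m.span(1)  # 0-based start, exclusive end
            # Convert to 1-based inclusive positions.
            rows.append((start + 1, end, descriptor, m.group(1)))
    rows.sort()
    return [f"{d}|{s}|{a}|{b}" for a, b, d, s in rows]


AD = "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her."
for row in recognize(AD):
    print(row)
```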


Database-Instance Generator

[Figure: system architecture with the Database-Instance Generator highlighted. Inputs: Data-Record Table, List of Objects, Relationships, and Constraints, and Database Scheme. Output: Populated Database.]

insert into Car values(1001, “97”, “CHEV”, “Cavalier”, “7,000”, “11,995”, “566-3800”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)
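The insert statements on this slide can be reproduced with a small sketch that loads one extracted record into an in-memory SQLite database. The column names and `text` types are assumptions inferred from the order of values in the insert statement. The function name `populate` is hypothetical, and the phone number follows the ad text.

```python
import sqlite3

# Assumed column order, inferred from the slide's insert statement.
CAR_COLUMNS = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]


def populate(record, features, car_id=1001):
    """Create the two tables and insert one car with its features."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table Car (Car integer, Year text, Make text, Model text, "
        "Mileage text, Price text, PhoneNr text)"
    )
    conn.execute("create table CarFeature (Car integer, Feature text)")
    # Parameterized inserts avoid quoting problems in extracted strings.
    conn.execute(
        "insert into Car values (?, ?, ?, ?, ?, ?, ?)",
        [car_id] + [record[c] for c in CAR_COLUMNS],
    )
    conn.executemany(
        "insert into CarFeature values (?, ?)",
        [(car_id, f) for f in features],
    )
    conn.commit()
    return conn


RECORD = {
    "Year": "97", "Make": "CHEV", "Model": "Cavalier",
    "Mileage": "7,000", "Price": "11,995", "PhoneNr": "566-3800",
}
conn = populate(RECORD, ["Red", "5 spd"])
print(conn.execute("select * from Car").fetchall())
print(conn.execute("select Feature from CarFeature order by rowid").fetchall())
```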


Heuristics

Keyword proximity
Subsumed and overlapping constants
Functional relationships
Nonfunctional relationships
First occurrence without constraint violation


Keyword Proximity

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
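In the table above, both 7,000 and 11,995 are Mileage candidates, and the keyword "miles" sits right next to 7,000. One plausible reading of the keyword-proximity heuristic, sketched below, is to keep the candidate separated from a keyword by the fewest characters. The distance measure and the function names are assumptions, since the slide does not spell out the exact rule.

```python
def gap(candidate, keyword):
    """Characters between two (string, start, end) spans; 0 if they touch or overlap."""
    _, c_start, c_end = candidate
    _, k_start, k_end = keyword
    return max(k_start - c_end - 1, c_start - k_end - 1, 0)


def closest_to_keyword(candidates, keywords):
    """Pick the candidate nearest to any keyword occurrence."""
    return min(candidates, key=lambda c: min(gap(c, k) for k in keywords))


# Spans copied from the data-record table (1-based, inclusive).
MILEAGE_CANDIDATES = [("7,000", 37, 41), ("11,995", 101, 106)]
MILEAGE_KEYWORDS = [("miles", 43, 47)]

print(closest_to_keyword(MILEAGE_CANDIDATES, MILEAGE_KEYWORDS))
```

Under this reading, 7,000 is one character away from "miles" and 11,995 is 53 characters away, so 7,000 is kept as the Mileage and 11,995 remains a Price candidate only.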


Subsumed/Overlapping Constants

Make|CHEV|5|8
Make|CHEVROLET|5|13
Model|Cavalier|15|22
Feature|Red|25|27
Feature|5 spd|30|34
Mileage|7,000|42|46
KEYWORD(Mileage)|miles|48|52
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEVROLET Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
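Here CHEV (5 to 8) lies entirely inside CHEVROLET (5 to 13). A minimal sketch of handling subsumed constants, assuming the rule is simply "drop a match that is contained in a strictly longer match", could look like this. The function name and the tuple layout are hypothetical, and partially overlapping spans are left untouched.

```python
def drop_subsumed(rows):
    """Remove rows whose span lies inside a strictly longer row's span.

    Each row is (descriptor, string, start, end) with inclusive positions.
    """
    kept = []
    for row in rows:
        _, _, start, end = row
        subsumed = any(
            other is not row
            and other[2] <= start
            and end <= other[3]
            and (other[3] - other[2]) > (end - start)
            for other in rows
        )
        if not subsumed:
            kept.append(row)
    return kept


ROWS_WITH_OVERLAP = [
    ("Make", "CHEV", 5, 8),
    ("Make", "CHEVROLET", 5, 13),
    ("Model", "Cavalier", 15, 22),
    ("Price", "11,995", 101, 106),
    ("Mileage", "11,995", 101, 106),
]

for row in drop_subsumed(ROWS_WITH_OVERLAP):
    print(row)
```

Because the comparison requires a strictly longer span, the Price and Mileage rows for 11,995 both survive. Choosing between them is left to other heuristics such as keyword proximity.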


Functional Relationships

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800


Nonfunctional Relationships

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800


First Occurrence without Constraint Violation

Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
PhoneNr|566-3802|149|156

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800, 566-3802
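The last three heuristics can be illustrated together. The sketch below scans rows in text order and accepts a value only while the object set's maximum per Car is not exceeded, so a many-valued set such as Feature keeps every match and a single-valued set keeps its first. The limits in `MAX_PER_CAR` are assumptions for this example (in particular, treating PhoneNr as single-valued), and the function name is hypothetical.

```python
# Assumed maximum number of values per Car; None means unbounded.
MAX_PER_CAR = {
    "Year": 1,
    "Make": 1,
    "Model": 1,
    "PhoneNr": 1,     # assumption made for this illustration
    "Feature": None,
}


def build_record(rows, max_per_car):
    """Accept values in order of appearance unless a maximum would be violated.

    Rows are (descriptor, string, start, end). Descriptors missing from
    max_per_car are treated as unbounded.
    """
    record = {}
    for descriptor, string, _start, _end in sorted(rows, key=lambda r: r[2]):
        limit = max_per_car.get(descriptor)
        values = record.setdefault(descriptor, [])
        if limit is None or len(values) < limit:
            values.append(string)
    return record


CANDIDATE_ROWS = [
    ("Year", "97", 2, 3),
    ("Feature", "Red", 20, 22),
    ("Feature", "5 spd", 25, 29),
    ("PhoneNr", "566-3800", 140, 147),
    ("PhoneNr", "566-3802", 149, 156),
]

print(build_record(CANDIDATE_ROWS, MAX_PER_CAR))
```

With these assumed limits, both features are kept, while only the first phone number is accepted because adding the second would exceed the limit.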


Recall & Precision

$$\mathrm{Recall} = \frac{C}{N}$$ (of facts available, how many did we find?)

$$\mathrm{Precision} = \frac{C}{C + I}$$ (of facts retrieved, how many were relevant?)

N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
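The two ratios translate directly into code. The counts in the usage lines are hypothetical and are not taken from the experiments.

```python
def recall(correct, total_in_source):
    """C / N: share of the facts in the source that were declared correctly."""
    return correct / total_in_source


def precision(correct, incorrect):
    """C / (C + I): share of the declared facts that were correct."""
    return correct / (correct + incorrect)


# Hypothetical counts: N = 50 facts in the source, C = 45 correct, I = 5 incorrect.
N, C, I = 50, 45, 5
print(f"recall    = {recall(C, N):.2f}")
print(f"precision = {precision(C, I):.2f}")
```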


Experimental Results
Salt Lake Tribune

Tuning set: 100
Test set: 116

            Recall %   Precision %
Year        100        100
Make        97         100
Model       82         100
Mileage     90         100
Price       100        100
PhoneNr     94         100
Extension   50         100
Feature     91         99

(See Table 1 of Paper)


Trouble Spots

Unbounded sets
  – missed: MERC, Town Car, 98 Royale
  – could use lexicon of makes and models
Unspecified variation in lexical patterns
  – missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – could adjust lexical patterns
Misidentification of attributes
  – classified AUTO in AUTO SALES as automatic transmission
  – could adjust exceptions in lexical patterns
Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – could look for spelling variations and common typos


Contributions

Fully automatic technique for wrapper generation

Uses syntactic, not semantic constant-recognition techniques

Adapts readily to different unstructured document formats

Good precision & recall ratios

Implemented (Perl, C++, Lex/Yacc, Java)


Limitations

Works best for data-rich documents, narrow ontological domains

Ontology creation is still manual
  – Domain expert
  – Trained in our conceptual model & tools


Future Work

Graphical ontology editor
Improve automatic record-boundary recognition
  – Make suitable for broader domains (obituaries, university catalog, etc.)
Improve heuristics
  – Use a declarative language
  – Employ more of OSM’s rich constraints


Future Work (cont.)

Add operations to data frames
  – General constraints
  – Canonical representations
  – Inferred information
Develop ontology libraries
Finish porting to 100% Java
Incorporate learning/feedback
Ontology-enabled agents


Our Web Site

I have a demo on my laptop
Can download from our Web site
BYU Data Extraction Group

http://osm7.cs.byu.edu/deg

(See Reference 13 of Paper)


Other Domains

Job Listings
Obituaries
University Course Catalogs


Job Listings Results
Los Angeles Times

Tuning set: 50
Test set: 50

          Recall %   Precision %
Degree    100        100
Skill     74         100
Contact   100        100
Email     91         83
Fax       91         100
Voice     79         92

(See Table 2 of Paper)


Obituaries Results
Salt Lake Tribune

Tuning set: ~40
Test set: 38

                    Recall %   Precision %
Deceased Name       100        100
Age                 91         95
Birth Date          100        97
Death Date          94         100
Funeral Date        92         100
Funeral Address     96         96
Funeral Time        97         100
Interment Address   100        100
Viewing             93         96
Viewing Date        70         100
Viewing Address     76         100
Beginning Time      88         100
Ending Time         90         100
Relationship        81         93
Relative Name       88         71

(See our forthcoming ER’98 paper for details.)


Obituaries Results
Arizona Daily Star

Tuning set: ~40
Test set: 90

                    Recall %   Precision %
Deceased Name       100        100
Age                 86         98
Birth Date          96         96
Death Date          84         99
Funeral Date        96         93
Funeral Address     82         82
Funeral Time        92         87
Interment Address   100        100
Viewing             97         100
Viewing Date        100        100
Viewing Address     95         100
Beginning Time      93         96
Ending Time         95         100
Relationship        92         97
Relative Name       95         74

(See our forthcoming ER’98 paper for details.)