Ontology-Based Information Extraction and Structuring
Stephen W. Liddle†
School of Accountancy and Information Systems
Brigham Young University
Douglas M. Campbell, David W. Embley,‡ and Randy D. Smith
Research funded in part by †Faneuil Research and ‡Novell, Inc.
Copyright 1998
Motivation
Database-style queries are effective
– Find red cars, 1993 or newer, < $5,000
• Select * From Car Where Color=“red” And Year >= 1993 And Price < 5000
Web is not a database
– Uses keyword search
– Retrieves documents, not records
– Assuming we have a range operator:
• “red” and (1993 to 1998) and (1 to 5000)
Solutions
Web query languages
Wait for XML to emerge
– Interoperation/Standards?
– XML query language?
Wrappers
– Hand-written or semi-automatically generated parsers
– Specific to source site, subject to change
Our Approach
Automatic wrapper generation
Based on application ontology
– Augmented conceptual model
– Defines constants, keywords, their relationships
Best for:
– Narrow ontological breadth
– Data-rich documents
Car-Ad Ontology
Object-Relationship Model + Data Frames
Graphical
[Figure: object-relationship model with object sets Car, Year, Make, Model, Mileage, Price, Feature, PhoneNr, and Extension, linked by relationship sets labeled “has” and “is for”, each annotated with participation constraints (0..1, 0..*, 1..*).]
Textual
Car [0:1] has Year [1:*];
Year {regexp[2]: “\d{2} : \b’\d{2}\b”, … };
Car [0:1] has Make [1:*];
Make {regexp[10]: “\bchev\b”, “\bchevy\b”, … };
Car [0:1] has Model [1:*];
Model {…};
Car [0:1] has Mileage [1:*];
Mileage {regexp[8]: “\b[1-9]\d{1,2}k”, “[1-9]\d?,\d{3} : [^\$\d][1-9]\d?,\d{3}[^\d]” }
  {context: “\bmiles\b”, “\bmi\.”, “\bmi\b”};
Car [0:*] has Feature [1:*];
Feature {regexp[20]:
  -- Colors
  “\baqua\s+metallic\b”, “\bbeige\b”, …
  -- Transmission
  “(5|6)\s*spd\b”, “auto : \bauto(\.|,)”,
  -- Accessories
  “\broof\s+rack\b”, “\bspoiler\b”, …
...
(See Figures 2 & 3 of Paper)
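The “value : context” entries in the data frames can be read as “extract the left-hand pattern wherever the right-hand pattern matches”. That reading is an assumption drawn from the slide, not a statement of the authors' implementation. A minimal Python sketch under that assumption, using the Mileage pattern from the slide (the function name `match_in_context` is hypothetical):

```python
import re

def match_in_context(value_pattern, context_pattern, text):
    """Return (string, start, end) for each value found inside a context match.

    Positions are 1-based and inclusive, the convention used in the
    Data-Record Table on the slides.
    """
    hits = []
    for ctx in re.finditer(context_pattern, text, re.IGNORECASE):
        inner = re.search(value_pattern, ctx.group(0), re.IGNORECASE)
        if inner:
            start = ctx.start() + inner.start() + 1  # convert to 1-based
            end = ctx.start() + inner.end()          # inclusive end
            hits.append((inner.group(0), start, end))
    return hits

# Mileage data-frame entry from the slide: value pattern, then context pattern.
MILEAGE_VALUE = r"[1-9]\d?,\d{3}"
MILEAGE_CONTEXT = r"[^\$\d][1-9]\d?,\d{3}[^\d]"

ad = "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her."
print(match_in_context(MILEAGE_VALUE, MILEAGE_CONTEXT, ad))
```

The context pattern rejects a number that directly follows a dollar sign, so this particular entry does not fire on a string such as “$11,995.”.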
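Fixed Processes
[Figure: system architecture. Application Ontology → Ontology Parser → Constant/Keyword Matching Rules; List of Objects, Relationships, and Constraints; Database Scheme. Unstructured Document + Constant/Keyword Matching Rules → Constant/Keyword Recognizer → Data-Record Table. Data-Record Table + List of Objects, Relationships, and Constraints + Database Scheme → Database-Instance Generator → Populated Database.]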
(See Figure 1 of Paper)
Ontology Parser
[Diagram: Application Ontology → Ontology Parser → Constant/Keyword Matching Rules; List of Objects, Relationships, and Constraints; Database Scheme]
Constant/Keyword Matching Rules:
Make : \bchev\b
…
KEYWORD(Mileage) : \bmiles\b
KEYWORD(Mileage) : \bmi\.
...
Database Scheme:
create table Car ( Car integer, Year varchar(2), … );
create table CarFeature ( Car integer, Feature varchar(10));
...
List of Objects, Relationships, and Constraints:
Object: Car;
...
Car: Year [0:1];
Car: Make [0:1];
…
CarFeature: Car [0:*] has Feature [1:*];
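The scheme shown on this slide suggests a simple rule: an object set in which Car participates at most once ([0:1]) becomes a column of Car, while a many-valued one ([0:*]) gets its own table. The Python sketch below illustrates that rule only. The function name `generate_scheme` and its input format are hypothetical, and the column types are taken from the two that the slide shows.

```python
def generate_scheme(entity, relationships, sql_types):
    """Build create-table statements from participation constraints.

    relationships: list of (object_set, max_participation), where
    max_participation is 1 for [0:1] and None for [0:*].
    """
    columns = [f"{entity} integer"]
    extra_tables = []
    for object_set, max_participation in relationships:
        column = f"{object_set} {sql_types[object_set]}"
        if max_participation == 1:
            # Single-valued: store directly in the entity's table.
            columns.append(column)
        else:
            # Many-valued: one row per value in a separate table.
            extra_tables.append(
                f"create table {entity}{object_set} ( {entity} integer, {column} );"
            )
    main_table = f"create table {entity} ( {', '.join(columns)} );"
    return [main_table] + extra_tables

statements = generate_scheme(
    "Car",
    [("Year", 1), ("Feature", None)],
    {"Year": "varchar(2)", "Feature": "varchar(10)"},
)
for statement in statements:
    print(statement)
```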
Constant/Keyword Recognizer
[Diagram: Unstructured Document + Constant/Keyword Matching Rules → Constant/Keyword Recognizer → Data-Record Table]
Descriptor/String/Position(start/end)
Year|97|1|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|108|114
PhoneNr|566-3800|146|153
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
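As a rough illustration of how matching rules could yield a Data-Record Table, the Python sketch below applies a few of the rules shown earlier to the start of the sample ad and prints Descriptor|String|start|end lines. It assumes case-insensitive matching and 1-based inclusive positions; the function name `recognize` is hypothetical and this is not the authors' recognizer.

```python
import re

def recognize(rules, text):
    """Apply (descriptor, pattern) rules; return records sorted by start position."""
    records = []
    for descriptor, pattern in rules:
        for m in re.finditer(pattern, text, re.IGNORECASE):
            # m.start() is 0-based, m.end() is exclusive; convert to 1-based inclusive.
            records.append((descriptor, m.group(0), m.start() + 1, m.end()))
    records.sort(key=lambda record: record[2])
    return records

rules = [
    ("Make", r"\bchev\b"),
    ("KEYWORD(Mileage)", r"\bmiles\b"),
    ("KEYWORD(Mileage)", r"\bmi\."),
]
ad = "'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her."

lines = ["|".join(str(field) for field in record) for record in recognize(rules, ad)]
for line in lines:
    print(line)
```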
Database-Instance Generator
[Diagram: Data-Record Table + List of Objects, Relationships, and Constraints + Database Scheme → Database-Instance Generator → Populated Database]
insert into Car values(1001, “97”, “CHEV”, “Cavalier”, “7,000”, “11,995”, “566-3800”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)
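The insert statements above can be reproduced with a small Python sketch that uses the standard `sqlite3` module. The column order follows the slide's insert statement. The `text` column types and the function name `populate` are assumptions made for illustration.

```python
import sqlite3

# Column order follows the insert statement on the slide; the "text" column
# types are an assumption (the slide only shows Year varchar(2)).
CAR_COLUMNS = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]

def populate(conn, car_id, single_valued, features):
    """Store one extracted record: one Car row plus one CarFeature row per feature."""
    column_defs = ", ".join(f"{c} text" for c in CAR_COLUMNS)
    conn.execute(f"create table if not exists Car (Car integer, {column_defs})")
    conn.execute("create table if not exists CarFeature (Car integer, Feature text)")
    row = [car_id] + [single_valued.get(c) for c in CAR_COLUMNS]
    placeholders = ", ".join("?" for _ in row)
    conn.execute(f"insert into Car values ({placeholders})", row)
    conn.executemany(
        "insert into CarFeature values (?, ?)",
        [(car_id, feature) for feature in features],
    )

conn = sqlite3.connect(":memory:")
populate(
    conn,
    1001,
    {"Year": "97", "Make": "CHEV", "Model": "Cavalier",
     "Mileage": "7,000", "Price": "11,995", "PhoneNr": "566-3800"},
    ["Red", "5 spd"],
)
print(conn.execute("select * from Car").fetchall())
```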
Heuristics
Keyword proximity
Subsumed and overlapping constants
Functional relationships
Nonfunctional relationships
First occurrence without constraint violation
Keyword Proximity
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
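In the table above, both 7,000 and 11,995 are candidate Mileage values, and the keyword “miles” sits next to 7,000. One simplified reading of the keyword-proximity heuristic is to keep the candidate closest to a keyword for the same descriptor. The Python sketch below implements that reading; the distance measure and the name `nearest_to_keyword` are assumptions.

```python
def nearest_to_keyword(records, descriptor):
    """Pick the candidate for `descriptor` whose span is closest to a KEYWORD span.

    Records are (descriptor, string, start, end) with inclusive positions.
    """
    keywords = [r for r in records if r[0] == f"KEYWORD({descriptor})"]
    candidates = [r for r in records if r[0] == descriptor]
    if not keywords or not candidates:
        return None

    def distance(candidate):
        # Gap in characters between the candidate span and the nearest keyword span.
        return min(
            max(k[2] - candidate[3], candidate[2] - k[3], 0) for k in keywords
        )

    return min(candidates, key=distance)

records = [
    ("Mileage", "7,000", 37, 41),
    ("KEYWORD(Mileage)", "miles", 43, 47),
    ("Price", "11,995", 101, 106),
    ("Mileage", "11,995", 101, 106),
]
best = nearest_to_keyword(records, "Mileage")
print(best)
```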
Subsumed/Overlapping Constants
Make|CHEV|5|8
Make|CHEVROLET|5|13
Model|Cavalier|15|22
Feature|Red|25|27
Feature|5 spd|30|34
Mileage|7,000|42|46
KEYWORD(Mileage)|miles|48|52
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
'97 CHEVROLET Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
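Here CHEV (5 to 8) lies entirely inside CHEVROLET (5 to 13), and both are Make candidates. A plausible way to handle such subsumed constants is to discard any record whose span is contained in a longer record with the same descriptor. The Python sketch below assumes that rule; the name `drop_subsumed` is hypothetical.

```python
def drop_subsumed(records):
    """Remove records contained in a longer record with the same descriptor.

    Records are (descriptor, string, start, end) with inclusive positions.
    """
    kept = []
    for record in records:
        subsumed = any(
            other is not record
            and other[0] == record[0]
            and other[2] <= record[2]
            and record[3] <= other[3]
            and (other[3] - other[2]) > (record[3] - record[2])
            for other in records
        )
        if not subsumed:
            kept.append(record)
    return kept

records = [
    ("Make", "CHEV", 5, 8),
    ("Make", "CHEVROLET", 5, 13),
    ("Model", "Cavalier", 15, 22),
]
filtered = drop_subsumed(records)
print(filtered)
```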
Functional Relationships
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
Nonfunctional Relationships
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
First Occurrence without Constraint Violation
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147
PhoneNr|566-3802|149|156
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800, 566-3802
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800
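The functional, nonfunctional, and first-occurrence heuristics can be pictured together. For a descriptor limited to one value per car (PhoneNr is stored as a single column of Car in the scheme), the first occurrence is kept and later ones would violate the constraint. For an unlimited descriptor such as Feature, every occurrence is kept. The Python sketch below is an illustrative simplification with a hypothetical `select_values` function, not the authors' algorithm.

```python
def select_values(records, max_participation):
    """Keep values in order of appearance, up to each descriptor's limit.

    max_participation maps a descriptor to its maximum count (1 for [0:1]);
    descriptors that are missing or mapped to None are unlimited ([0:*]).
    """
    chosen = {}
    for descriptor, string, start, end in sorted(records, key=lambda r: r[2]):
        limit = max_participation.get(descriptor)
        values = chosen.setdefault(descriptor, [])
        if limit is None or len(values) < limit:
            values.append(string)  # accepting this value violates no constraint
    return chosen

records = [
    ("Feature", "Red", 20, 22),
    ("Feature", "5 spd", 25, 29),
    ("PhoneNr", "566-3800", 140, 147),
    ("PhoneNr", "566-3802", 149, 156),
]
chosen = select_values(records, {"PhoneNr": 1, "Feature": None})
print(chosen)
```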
Recall & Precision
\[ \text{Recall} = \frac{C}{N} \]
(of facts available, how many did we find?)
\[ \text{Precision} = \frac{C}{C + I} \]
(of facts retrieved, how many were relevant?)
N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
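The two ratios are straightforward to compute from the counts N, C, and I defined on this slide. In the short Python sketch below, the counts in the usage lines are made-up illustrative values, not results from the paper.

```python
def recall(correct, in_source):
    """Recall = C / N: share of the facts in the source that were found."""
    return correct / in_source

def precision(correct, incorrect):
    """Precision = C / (C + I): share of the declared facts that were correct."""
    return correct / (correct + incorrect)

# Hypothetical counts for illustration only.
N, C, I = 200, 180, 20
print(f"Recall: {recall(C, N):.0%}, Precision: {precision(C, I):.0%}")
```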
Experimental Results
Salt Lake Tribune
Tuning set: 100
Test set: 116
Attribute    Recall %    Precision %
Year         100         100
Make         97          100
Model        82          100
Mileage      90          100
Price        100         100
PhoneNr      94          100
Extension    50          100
Feature      91          99
(See Table 1 of Paper)
Trouble Spots
Unbounded sets
– missed: MERC, Town Car, 98 Royale
– could use lexicon of makes and models
Unspecified variation in lexical patterns
– missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
– could adjust lexical patterns
Misidentification of attributes
– classified AUTO in AUTO SALES as automatic transmission
– could adjust exceptions in lexical patterns
Typographical errors
– “Chrystler”, “DODG ENeon”, “I-15566-2441”
– could look for spelling variations and common typos
Contributions
Fully automatic technique for wrapper generation
Uses syntactic, not semantic constant-recognition techniques
Adapts readily to different unstructured document formats
Good precision & recall ratios
Implemented (Perl, C++, Lex/Yacc, Java)
Limitations
Works best for data-rich documents, narrow ontological domains
Ontology creation is still manual
– Domain expert
– Trained in our conceptual model & tools
Future Work
Graphical ontology editor
Improve automatic record-boundary recognition
– Make suitable for broader domains (obituaries, university catalog, etc.)
Improve heuristics
– Use a declarative language
– Employ more of OSM’s rich constraints
Future Work (cont.)
Add operations to data frames
– General constraints
– Canonical representations
– Inferred information
Develop ontology libraries
Finish porting to 100% Java
Incorporate learning/feedback
Ontology-enabled agents
Our Web Site
I have a demo on my laptop
Can download from our Web site
BYU Data Extraction Group
http://osm7.cs.byu.edu/deg
(See Reference 13 of Paper)
Other Domains
Job Listings
Obituaries
University Course Catalogs
Job Listings Results
Los Angeles Times
Tuning set: 50
Test set: 50
Attribute    Recall %    Precision %
Degree       100         100
Skill        74          100
Contact      100         100
Email        91          83
Fax          91          100
Voice        79          92
(See Table 2 of Paper)
Obituaries Results
Salt Lake Tribune
Tuning set: ~40
Test set: 38
Attribute            Recall %    Precision %
Deceased Name        100         100
Age                  91          95
Birth Date           100         97
Death Date           94          100
Funeral Date         92          100
Funeral Address      96          96
Funeral Time         97          100
Interment Address    100         100
Viewing              93          96
Viewing Date         70          100
Viewing Address      76          100
Beginning Time       88          100
Ending Time          90          100
Relationship         81          93
Relative Name        88          71
(See our forthcoming ER’98 paper for details.)
Obituaries Results
Arizona Daily Star
Tuning set: ~40
Test set: 90
Attribute            Recall %    Precision %
Deceased Name        100         100
Age                  86          98
Birth Date           96          96
Death Date           84          99
Funeral Date         96          93
Funeral Address      82          82
Funeral Time         92          87
Interment Address    100         100
Viewing              97          100
Viewing Date         100         100
Viewing Address      95          100
Beginning Time       93          96
Ending Time          95          100
Relationship         92          97
Relative Name        95          74
(See our forthcoming ER’98 paper for details.)