Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group, Brigham Young University
Sponsored by NSF
2
Wrapper-Driven Data Extraction
Web data extraction
– Obtain user-specified information from Web documents
Wrapper
– Convert implicit HTML data into explicit formatted data
– Data-source-specified, high performance
Examples
– SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …
3
Common Problem of Wrappers
<LI> <A HREF="…"> Mani Chandy </A>,
<I>Professor of Computer Science</I>
and <I>Executive Officer for Computer
Science</I>
[Figure: SoftMealy finite-state transducer for the HTML fragment above. States: b, U, U_U, N, N_N. Transition labels: ? / ε; ? / next_token; s<U,U> / ε; s<N,N> / ε; s<b,U> / "U=" + next_token; s<b,N> / "N=" + next_token; s<U,N> / "N=" + next_token]
Resiliency: fixed domain, changeable layout
Scalability: unchanged existing wrapper, extendable domain and functions
4
Data-Extraction Ontology
Structure
– Object sets
– Relationship sets
– Participation constraints
– Data frames
Pros: resilient and scalable
Cons: hard to create
– Knowledge requirements
– Tedious and error-prone work
Car [-> object];
Car [0:1] has Make [1:*];
Make matches [10] constant { extract "\baudi\b"; };
end;
Car [0:1] has Model [1:*];
Model matches [25] constant { extract "80"; context "\baudi\S*\s*80\b"; };
end;
Car [0:1] has Mileage [1:*];
Mileage matches [8] constant { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
end;
Car [0:1] has Price [1:*];
Price matches [8] constant { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
end;
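The data frames above can be read as small regular-expression recognizers. The following is a minimal Python sketch of that reading, not the group's actual extraction engine. It assumes that `context` restricts where `extract` may match, that `substitute` rewrites the extracted value, and that matching is case-insensitive. The names `DATA_FRAMES`, `apply_data_frame`, and `extract_record` are hypothetical.

```python
import re

# One entry per data frame on the slide: extract pattern, optional context
# pattern, optional (pattern, replacement) substitution.
DATA_FRAMES = {
    "Make": {"extract": r"\baudi\b"},
    "Model": {"extract": r"80", "context": r"\baudi\S*\s*80\b"},
    "Mileage": {"extract": r"\b[1-9]\d{0,2}k", "substitute": (r"[kK]", "000")},
    "Price": {"extract": r"[1-9]\d{3,6}", "context": r"\$[1-9]\d{3,6}"},
}


def apply_data_frame(frame, text):
    """Return all values one data frame recognizes in text."""
    flags = re.IGNORECASE  # assumption: matching ignores case
    if "context" in frame:
        # Assumption: extract is applied only inside each context match.
        regions = [m.group(0) for m in re.finditer(frame["context"], text, flags)]
    else:
        regions = [text]
    values = []
    for region in regions:
        for match in re.finditer(frame["extract"], region, flags):
            value = match.group(0)
            if "substitute" in frame:
                pattern, replacement = frame["substitute"]
                value = re.sub(pattern, replacement, value)
            values.append(value)
    return values


def extract_record(text):
    """Apply every data frame to one advertisement."""
    return {name: apply_data_frame(frame, text) for name, frame in DATA_FRAMES.items()}
```

For a made-up advertisement such as "Audi 80, 95k miles, $4500", the context pattern keeps the Price frame from picking up other digit runs, and the substitution turns "95k" into "95000".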
5
Motif of Ontology Generation
[Figure: diagram relating Human Brain, Knowledge Base, Sample Documents, Concepts of Interest, Concepts with Relations, and Data-Extraction Ontology]
6
Thesis Statement
Given: knowledge base
Input: sample Web pages of interest
Output: a data-extraction ontology for the domain of interest
Between input and output: this is the work of this thesis
7
Ontology-Generation Procedure
[Figure: procedure diagram. Generation stages: Concept Selection, Relation Retrieval, Constraint Discovery (interact if necessary), yielding the Data-Extraction Ontology. Supporting elements: Knowledge Sources pre-processed into the Integrated Knowledge Base; training documents pre-processed into clean records; test documents; Extraction Processing; Result Evaluation; Results Storage]
8
Primary Knowledge Source
Requirements
– Available
– General in coverage
– Rich in meaningful relationships
– Encoded in or easily converted to XML
Mikrokosmos (μK) Ontology
– Developed by NMSU jointly with U.S. DoD
– Contains over 5000 concepts
– Averages 14 links per concept
– Represented in XML format
9
Integrated Knowledge Base
[Figure: the knowledge base combines the μK Ontology, a data-frame library, a synonym dictionary (WordNet), and lexicons]
11
Domain Specification
Training documents
– Data-rich
– Narrow in topic breadth
Preprocessing
12
Example – Car Advertisement
Record 1:
00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2:
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250
Record 3:
02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4:
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
14
Concept Selection
Selection strategies
– Compare a string with the name of a concept
– Compare a string with the values belonging to a concept
– Apply data-frame recognizers to recognize a string
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
KB
<PHONE-NR>
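The three selection strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny tables `CONCEPT_NAMES`, `CONCEPT_VALUES`, and `RECOGNIZERS` are hypothetical stand-ins for the knowledge base, and the patterns are simplified guesses rather than the library's real data frames.

```python
import re

# Hypothetical miniature knowledge base (not the real KB contents).
CONCEPT_NAMES = {"century": "CENTURY", "price": "PRICE"}   # strategy 1: concept names
CONCEPT_VALUES = {"AUTOMOBILE": {"buick", "grandam"}}      # strategy 2: known values
RECOGNIZERS = {                                            # strategy 3: data frames
    "PHONE-NR": re.compile(r"\b\d{3}-\d{4}\b"),
    "PRICE": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
}


def select_concepts(record):
    """Return the set of concepts that any strategy finds in the record."""
    selected = set()
    for token in re.findall(r"[a-z]+", record.lower()):
        if token in CONCEPT_NAMES:                 # string equals a concept name
            selected.add(CONCEPT_NAMES[token])
        for concept, values in CONCEPT_VALUES.items():
            if token in values:                    # string is a value of a concept
                selected.add(concept)
    for concept, pattern in RECOGNIZERS.items():
        if pattern.search(record):                 # recognizer matches a string
            selected.add(concept)
    return selected
```

In this sketch the model name "Century" also selects a CENTURY concept through plain name matching, which shows how a purely lexical strategy can pick up concepts that later stages have to filter out.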
15
Concept Selection
Causes of conflict
– Synonymy
– Polysemy
Conflict resolution
– The same string has only one meaning
– Favor longer matches over shorter ones
– Context decides meaning
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.
KB candidates: <PRICE>, <MILEAGE>
Resolved to price by keyword identification
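Two of the resolution rules, favoring longer matches and letting context decide, can be sketched as follows. This is a simplified illustration, not the thesis implementation: the recognizer patterns, the keyword lists, and the 25-character context window are all assumptions, and the rule that one string keeps one meaning is not modeled.

```python
import re

# Hypothetical recognizers; PRICE and MILEAGE deliberately share a pattern.
RECOGNIZERS = {
    "PRICE": r"\d{1,3}(?:,\d{3})+",
    "MILEAGE": r"\d{1,3}(?:,\d{3})+",
    "PHONE-NR": r"\d{3}-\d{4}",
    "YEAR": r"\b\d{2}\b",
}
# Hypothetical context keywords per concept.
KEYWORDS = {"PRICE": {"retail", "price"}, "MILEAGE": {"miles", "mileage"}}


def find_candidates(text):
    """Collect (start, end, concept) for every recognizer match."""
    return [(m.start(), m.end(), concept)
            for concept, pattern in RECOGNIZERS.items()
            for m in re.finditer(pattern, text)]


def favor_longer(candidates):
    """Drop a match that lies inside a strictly longer match."""
    return [(s, e, c) for s, e, c in candidates
            if not any(o_s <= s and e <= o_e and (o_e - o_s) > (e - s)
                       for o_s, o_e, _ in candidates)]


def context_decides(text, candidates, window=25):
    """When one span has several concepts, prefer the one with nearby keywords."""
    by_span = {}
    for s, e, concept in candidates:
        by_span.setdefault((s, e), []).append(concept)
    resolved = {}
    for (s, e), concepts in by_span.items():
        if len(concepts) > 1:
            nearby = set(re.findall(r"[a-z]+", text[max(0, s - window):e + window].lower()))
            score, best = max((len(nearby & KEYWORDS.get(c, set())), c) for c in concepts)
            if score > 0:          # leave the conflict open if no keyword helps
                concepts = [best]
        resolved[text[s:e]] = concepts
    return resolved


def resolve(text):
    return context_decides(text, favor_longer(find_candidates(text)))
```

On the record shown on this slide, the two-digit YEAR match on "13" is discarded because it sits inside the longer "13,695", and the nearby word "Retail" tips "13,695" toward PRICE.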
17
Relationship Retrieval
<AUTOMOBILE>
<PRICE>
<PHONE-NR>
<YEAR>
<CENTURY>
KB
<MILEAGE>
<AUDIO-MEDIA-ARTIFACT>
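One way to picture relationship retrieval is as a path search between selected concepts in the knowledge-base graph. The slides do not state which search the system uses, so the breadth-first search below is a swapped-in assumption. The only edges taken from the talk are the two behind the relationship name IsA.ARTIFACT.CostOfProduction; the function names and the hop limit are hypothetical.

```python
from collections import deque

# Tiny labeled graph: concept -> [(link label, target concept)].
KB = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
    "PRICE": [],
}


def find_path(graph, source, target, max_hops=4):
    """Breadth-first search; returns [(label, node), ...] or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) >= max_hops:      # assumption: long paths are not useful
            continue
        for label, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(label, neighbor)]))
    return None


def path_name(path):
    """Join labels and intermediate concepts, omitting the final target."""
    parts = []
    for label, node in path:
        parts.extend([label, node])
    return ".".join(parts[:-1])
```

Naming a relationship set after the path it came from keeps the provenance visible, which helps when a retrieved relationship later turns out to be semantically wrong.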
19
Constraint Discovery
<AUTOMOBILE>
<PRICE>
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
AUTOMOBILE [0:1] IsA.ARTIFACT.CostofProduction PRICE [1:1]
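A plausible reading of constraint discovery is that the system counts how often a concept occurs in each training record and turns the observed minimum and maximum into a participation range. The sketch below follows that reading only; the slides do not spell out the actual rule, and the patterns and the `participation` helper are hypothetical. Which side of a relationship set the range attaches to is not modeled.

```python
import re

# Simplified, hypothetical recognizers.
PRICE = re.compile(r"\$\d{1,3}(?:,\d{3})*")
PHONE = re.compile(r"\b\d{3}-\d{4}\b")

RECORDS = [
    "02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, "
    "1-877-228-9486, OREM Utah",
]


def participation(records, recognizer):
    """Summarize per-record occurrence counts as a [min:max] range."""
    if not records:
        raise ValueError("at least one training record is required")
    counts = [len(recognizer.findall(record)) for record in records]
    low = 0 if min(counts) == 0 else 1        # optional if any record lacks it
    high = "1" if max(counts) <= 1 else "*"   # many if any record repeats it
    return f"[{low}:{high}]"
```

With only a handful of records the inferred ranges are fragile: a single advertisement without a price changes the lower bound from 1 to 0.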
21
Ontology Generation
concept nodes → object sets
paths → relationship sets
discovered constraints → participation constraints
concept recognizers → data frames
22
Automatically Generated Ontology -- Car Advertisement
(01) {Automobile [-> object];}
(02) {Automobile [0:1] has Mileage [1:1];}
(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}
(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}
(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}
24
Updating Strategies
Remove all bad relationship sets
Modify remaining incorrect relationship sets
– Substitute incorrect object sets
– Reduce long n-ary relationship sets
– Fix participation constraints
Adjust names or re-arrange sequences
Add new relationship sets
25
Final Ontology
Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]
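Binary relationship-set declarations like those in the final ontology follow a regular shape, so they are easy to load into a program for checking or comparison. The parser below is a minimal sketch that covers only this binary form; the class and function names are hypothetical, and declarations such as the object marker or n-ary relationship sets are simply not recognized.

```python
import re
from dataclasses import dataclass
from typing import Optional

# <ObjectSet> [min:max] <relationship name> <ObjectSet> [min:max]
DECLARATION = re.compile(
    r"(\w+) \[(\d+):(\d+|\*)\] (.+?) (\w+) \[(\d+):(\d+|\*)\]"
)


@dataclass(frozen=True)
class RelationshipSet:
    left: str
    left_range: tuple
    name: str
    right: str
    right_range: tuple


def parse_declaration(line: str) -> Optional[RelationshipSet]:
    """Parse one binary declaration; return None for any other line."""
    match = DECLARATION.fullmatch(line.strip())
    if match is None:
        return None
    left, lmin, lmax, name, right, rmin, rmax = match.groups()
    return RelationshipSet(left, (lmin, lmax), name, right, (rmin, rmax))
```

Multi-word relationship names such as "is for" are handled because the name is taken to be everything between the first range and the last object set.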
26
Evaluation Criteria
Basic measures
– POG (Precision of Ontology Generation)
– ROG (Recall of Ontology Generation)
Human constraints
– PROG (Pseudo-ROG)
– Comparing with an expert-created ontology
Knowledge base constraints
– EPROG (Effective-PROG)
Correctness dependency
– DEPROG (Dependent-EPROG)
– For example: relationship sets depend on object sets
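The two basic measures can be illustrated with the usual set-based definitions of precision and recall, assuming POG and ROG follow them. The slides do not give formulas for PROG, EPROG, or DEPROG, so those are left out. The function name and the sample sets are made up for illustration and are not results from the thesis.

```python
def pog_rog(generated, expected):
    """Precision and recall of generated ontology components.

    Assumes POG = correct / generated and ROG = correct / expected.
    """
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    pog = len(correct) / len(generated) if generated else 0.0
    rog = len(correct) / len(expected) if expected else 0.0
    return pog, rog


# Illustrative object sets only.
generated = {"Car", "Price", "Mileage", "PhoneNr", "Truck"}
expected = {"Car", "Price", "Mileage", "PhoneNr", "Year", "Make", "Model", "Feature"}
```

Here four of the five generated object sets are expected ones, so precision is high, while half of the expected object sets are missing, which is the kind of gap the knowledge-base-aware measures are meant to account for.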
27
Evaluation Results
28
Discussion of Results
Bottleneck: the system cannot generate what is not in the knowledge base
Object sets
– Concept-selection procedure works well
– Desired concept not shown in training records
• A rarely occurring concept is not a severe problem even if the error is left unfixed
• Example: extension
– Aggregation and union
• USAddressCity, USAddressState, USAddressZipCode → Location
• CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
– Close-meaning concepts: FurniturePart → Furnished
29
Discussion of Results
Relationship sets
– Binary relationship sets over 95%
– Most errors due to incorrectly generated object sets
– Semantically incorrect relationship sets
• Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
– n-ary relationship sets (usually huge)
Participation constraints
– Errors due to lack of training examples
– How much is enough?
30
Knowledge Base Extensibility
Add SALT, a new knowledge source
Successfully integrated into existing KB
Sample new relationship set (DOE abstract domain)
– CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation
31
Conclusion
Experimented with knowledge-base construction and extension
Standardized application domain specification
Generated data-extraction ontologies from a specified domain and an integrated knowledge base
Showed DEPROG results of more than 70% on average and over 90% for well-defined domains
32
Future Work
Build a general-purpose knowledge source for data-extraction usage
Study more about data frames
– Can a system correctly identify concepts with data frames?
– Can a system update a data frame to fit a special situation?
– Can a system generate a data frame from a collection of information of interest?