Record-Boundary Discovery in Web Documents
D.W. Embley, Y. Jiang, Y.-K. Ng
Data-Extraction Group*
Department of Computer Science
Brigham Young University
Provo, UT, USA
*Funded in part by Novell, Inc., Ancestry.com, Inc., and Faneuil Research.
Record-Boundary Discovery
Larger Goal: Information Extraction
<html><head><title>The Salt Lake Tribune … </title></head>
<body bgcolor="#FFFFFF">
<h1 align="left">Domestic Cars</h1>
…
<hr>
<h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! <b>Asking only $11,995.</b> #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4>
<hr>
<h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4>
<hr>
…
</body></html>
[Diagram: records delimited by ##### are mapped to rows of a table with columns Year, Make, Model, PhoneNr.]
Desired Objective
Query the Web Like a Database
Example: Get the year, make, model, and price for 1987 or later cars that are red or white.
Year  Make   Model     Price
------------------------------
97    CHEVY  Cavalier  11,995
94    DODGE            4,995
94    DODGE  Intrepid  10,000
91    FORD   Taurus    3,500
90    FORD   Probe
88    FORD   Escort    1,000
Approach and Limitations
Automatic Ontology-Based Wrapper Generation for a page of unstructured records, rich in data and narrow in ontological breadth
[Architecture diagram. Processes: Ontology Parser, Record Extractor, Constant/Keyword Recognizer, Database-Instance Generator. Inputs and outputs: Application Ontology, Web Page, Unstructured Records, Constant/Keyword Matching Rules, Data-Record Table, Record-Level Objects, Relationships, and Constraints, Database Scheme, Populated Database.]
Application Ontology: Object-Relationship Model Instance
Car [-> object];
Car [0..1] has Model [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Year [1..*];
Car [0..1] has Price [1..*];
Car [0..1] has Mileage [1..*];
PhoneNr [1..*] is for Car [0..1];
PhoneNr [0..1] has Extension [1..*];
Car [0..*] has Feature [1..*];
[Diagram: the object-relationship model drawn as a graph. Car is connected by "has" to Year, Make, Model, Price, Mileage, and Feature; PhoneNr "is for" Car and "has" Extension. Edges carry the participation constraints 0..1, 0..*, and 1..* given in the textual form.]
Application Ontology: Data Frames
Make matches [10] case insensitive
  constant
    { extract "chev"; },
    { extract "chevy"; },
    { extract "dodge"; },
    …
end;
Model matches [16] case insensitive
  constant
    { extract "88"; context "\bolds\S*\s*88\b"; },
    …
end;
Mileage matches [7] case insensitive
  constant
    { extract "[1-9]\d{0,2}k"; substitute "k" -> ",000"; },
    …
  keyword "\bmiles\b", "\bmi\b", "\bmi.\b";
end;
...
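A minimal Python sketch of how two of the data-frame rules above might be applied. The regular expressions are the ones on the slide; the function names and the sample strings are hypothetical, and the actual data-frame interpreter is not shown in this talk.

```python
import re

def extract_mileage(text):
    """Apply the slide's Mileage rule: extract "[1-9]\\d{0,2}k", then
    substitute "k" -> ",000"."""
    values = []
    for match in re.finditer(r"[1-9]\d{0,2}k", text, re.IGNORECASE):
        values.append(re.sub(r"k$", ",000", match.group(), flags=re.IGNORECASE))
    return values

def extract_model_88(text):
    """Apply the slide's Model rule: extract "88" only when it appears
    inside the context "\\bolds\\S*\\s*88\\b"."""
    values = []
    for context in re.finditer(r"\bolds\S*\s*88\b", text, re.IGNORECASE):
        values.append(re.search(r"88", context.group()).group())
    return values

# Hypothetical ad snippets, not taken from the test data.
print(extract_mileage("runs great, 85k, call"))
print(extract_model_88("'88 OLDS 88 Royale"))
print(extract_model_88("'88 FORD Escort"))
```

The context expression is what keeps the year in "'88 FORD Escort" from being reported as a Model.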
Ontology Parser
Make : chevy
…
KEYWORD(Mileage) : \bmiles\b
...

create table Car ( Car integer, Year varchar(2), … );
create table CarFeature ( Car integer, Feature varchar(10) );
...

Object: Car;
...
Car: Year [0..1];
Car: Make [0..1];
…
CarFeature: Car [0..*] has Feature [1..*];
[Diagram: the Ontology Parser reads the Application Ontology and produces the Constant/Keyword Matching Rules, the Record-Level Objects, Relationships, and Constraints, and the Database Scheme.]
Record Extractor
<html>
…
<h4> ‘97 CHEVY Cavalier, Red, 5 spd, … </h4>
<hr>
<h4> ‘85 DODGE Daytona, needs paint, … </h4>
<hr>
….
</html>

…
#####
‘97 CHEVY Cavalier, Red, 5 spd, …
#####
‘85 DODGE Daytona, needs paint, …
#####
...
[Diagram: Web Page -> Record Extractor -> Unstructured Records.]
Record Extractor: High Fan-Out Heuristic
<html><head><title>The Salt Lake Tribune … </title></head>
<body bgcolor="#FFFFFF">
<h1 align="left">Domestic Cars</h1>
…
<hr>
<h4> ‘97 CHEVY Cavalier, Red, … </h4>
<hr>
<h4> ‘85 DODGE Daytona, needs … </h4>
<hr>
…
</body></html>
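[Tag tree: html has the children head (containing title) and body. The subtree rooted at body has the highest fan-out and contains the tags h1 … hr h4 b hr h4 ...]
Candidate Separator Tags: the tags in the highest fan-out subtree.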
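The high fan-out idea can be sketched in Python with the standard-library `html.parser` module. This is an illustration under assumptions, not the authors' implementation: the `Node` and `TagTreeBuilder` classes, the `VOID_TAGS` set, and the shortened sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as having no closing tag (assumed, partial list).
VOID_TAGS = {"hr", "br", "img", "meta", "link", "input"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

class TagTreeBuilder(HTMLParser):
    """Builds a tree of tags, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the matching open tag; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

def highest_fan_out(html):
    """Return the node with the most direct children."""
    builder = TagTreeBuilder()
    builder.feed(html)
    best, stack = builder.root, [builder.root]
    while stack:
        node = stack.pop()
        if len(node.children) > len(best.children):
            best = node
        stack.extend(node.children)
    return best

def subtree_tags(node):
    """All tags below a node, in document order."""
    tags = []
    for child in node.children:
        tags.append(child.tag)
        tags.extend(subtree_tags(child))
    return tags

PAGE = ("<html><head><title>Ads</title></head><body>"
        "<h1>Domestic Cars</h1><hr><h4>'97 CHEVY Cavalier <b>$11,995</b></h4>"
        "<hr><h4>'85 DODGE Daytona</h4><hr><h4>'96 FORD Taurus</h4><hr>"
        "</body></html>")

node = highest_fan_out(PAGE)
print(node.tag, subtree_tags(node))
```

On this sample the highest fan-out node is `body`, and the tags in its subtree (`h1`, `hr`, `h4`, `b`) are the candidates that the separator heuristics then rank.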
Record Extractor: Record-Separator Heuristics
IT: Identifiable “html separator” Tags
HT: Highest-count Tags
SD: Standard Deviation
OM: Ontological Match
RP: Repeating-tag Patterns
IT: Identifiable “html separator” Tags
<h1 align="left">Domestic Cars</h1>
<hr>
<h4> ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! <b>Asking only $11,995.</b> #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4>
<hr>
<h4> ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4>
<hr>
<h4> ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4>
<hr>
hr tr td a table p br h4 h1 strong b i
HT: Highest-count Tags
Tag   Count
-----------
hr    4
h4    3
b     1
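Counting candidate tags as in the table above takes only a few lines of Python. The function name `rank_by_count` and the shortened `FRAGMENT` are hypothetical; the regular expression looks only at opening tags.

```python
import re
from collections import Counter

def rank_by_count(html_fragment, candidates):
    """Count opening tags that are candidate separators, highest count first."""
    tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html_fragment)
    counts = Counter(t.lower() for t in tags if t.lower() in candidates)
    return counts.most_common()

# Shortened version of the sample page on the slide.
FRAGMENT = ('<h1 align="left">Domestic Cars</h1><hr>'
            "<h4> '97 CHEVY Cavalier <b>Asking only $11,995.</b> 566-3800 </h4><hr>"
            "<h4> '85 DODGE Daytona 262-7557 </h4><hr>"
            "<h4> '96 FORD Taurus GL 474-3335 </h4><hr>")

ranking = rank_by_count(FRAGMENT, {"hr", "h4", "b"})
print(ranking)
```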
SD: Standard Deviation
hr (σ = 45.5)
--------------
159 characters
63 characters
62 characters

h4 (σ = 48.0)
--------------
159 characters
63 characters
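The two values on this slide are consistent with a population standard deviation of the interval lengths, as this short Python check suggests. The interval lengths are copied from the slide. How the intervals are measured on a page is not shown here, and the function name is hypothetical.

```python
import statistics

def rank_by_interval_spread(intervals):
    """Rank tags by the population standard deviation of the text lengths
    between consecutive occurrences; the most regular tag comes first."""
    spread = {tag: statistics.pstdev(lengths) for tag, lengths in intervals.items()}
    return sorted(spread.items(), key=lambda item: item[1])

# Interval lengths (in characters) as listed on the slide.
INTERVALS = {"hr": [159, 63, 62], "h4": [159, 63]}

sd_ranking = rank_by_interval_spread(INTERVALS)
for tag, sd in sd_ranking:
    print(f"{tag}: {sd:.1f}")
```

Under this reading hr, with the smaller spread, ranks ahead of h4.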
OM: Ontological Match
Record Estimator: average of count of Year, Make, and Model = 3.
Closest candidate separator count: h4 = 3, hr = 4, b = 1.
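As a rough Python sketch, the ontological-match ranking compares each candidate tag's count with the record estimate. The per-field counts of 3 are an assumption consistent with the slide's average of 3, and the function name is hypothetical.

```python
def rank_by_ontological_match(field_counts, tag_counts):
    """Estimate the number of records as the average field count, then rank
    candidate tags by how close their counts are to that estimate."""
    estimate = sum(field_counts.values()) / len(field_counts)
    ranked = sorted(tag_counts, key=lambda tag: abs(tag_counts[tag] - estimate))
    return estimate, ranked

# Assumed counts of recognized Year, Make, and Model values in the sample.
FIELD_COUNTS = {"Year": 3, "Make": 3, "Model": 3}
TAG_COUNTS = {"hr": 4, "h4": 3, "b": 1}

estimate, om_ranking = rank_by_ontological_match(FIELD_COUNTS, TAG_COUNTS)
print(estimate, om_ranking)
```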
RP: Repeating-tag Patterns
3 pairs: <hr><h4>
Of the tags in the repeating pattern, h4 is closest with 3, then hr with 4.
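One way to count repeating tag pairs is sketched below in Python. It assumes that a "pair" means two consecutive opening tags with no text between them and that closing tags are ignored; the slide does not state either detail. The function name and the shortened fragment are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")

def repeating_pairs(html_fragment):
    """Count pairs of consecutive opening tags with no text between them."""
    pairs = Counter()
    previous = None
    for match in TOKEN.finditer(html_fragment):
        closing, tag, text = match.groups()
        if text is not None:
            if text.strip():
                previous = None          # text breaks adjacency
        elif closing == "":              # opening tag; closing tags are skipped
            if previous is not None:
                pairs[(previous, tag.lower())] += 1
            previous = tag.lower()
    return pairs

# Shortened version of the sample page on the slide.
FRAGMENT = ('<h1 align="left">Domestic Cars</h1><hr>'
            "<h4> '97 CHEVY Cavalier <b>Asking only $11,995.</b> 566-3800 </h4><hr>"
            "<h4> '85 DODGE Daytona 262-7557 </h4><hr>"
            "<h4> '96 FORD Taurus GL 474-3335 </h4><hr>")

pairs = repeating_pairs(FRAGMENT)
print(pairs)
```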
Record Extractor: Consensus Heuristic
Certainty is a generalization of: C(E1) + C(E2) - C(E1)C(E2).
C denotes certainty and Ei is the evidence for an observation.
Our certainties are based on observations from 10 different sites for 2 different applications (car ads and obituaries).
            Correct Tag Rank
Heuristic   1       2       3       4
IT          96.0%   4.0%    0%      0%
HT          49.0%   32.5%   16.5%   2.0%
SD          65.5%   22.5%   12.0%   0%
OM          84.5%   12.5%   2.0%    1.0%
RP          77.5%   12.5%   9.0%    1.0%
Record Extractor: Example Consensus Heuristic
      Rank                  Computed
Tag   IT  HT  SD  OM  RP    Certainty Factor
hr    1   1   1   2   2     .994
h4    2   2   2   1   1     .983
b     3   3   -   3   -     .182
e.g., b: 0 + .165 + .02 - (0)(.165) - (0)(.02) - (.165)(.02) + (0)(.165)(.02) = .1817
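The combination can be sketched in Python by folding C(E1) + C(E2) - C(E1)C(E2) over the certainties that each heuristic assigns to a tag's rank. The table values are the rounded percentages from the slide; the function names are hypothetical. With these rounded inputs the sketch reproduces .1817 for b and comes within about .001 of the slide's values for hr and h4.

```python
# Certainty that the correct separator is ranked 1st..4th by each heuristic.
CERTAINTY = {
    "IT": [0.960, 0.040, 0.000, 0.000],
    "HT": [0.490, 0.325, 0.165, 0.020],
    "SD": [0.655, 0.225, 0.120, 0.000],
    "OM": [0.845, 0.125, 0.020, 0.010],
    "RP": [0.775, 0.125, 0.090, 0.010],
}

def combine(certainties):
    """Fold c1 + c2 - c1*c2 over any number of certainties."""
    total = 0.0
    for c in certainties:
        total = total + c - total * c
    return total

def consensus(ranks):
    """ranks maps heuristic name -> rank (1-based), or None if unranked."""
    return combine(CERTAINTY[h][r - 1] for h, r in ranks.items() if r is not None)

RANKS = {
    "hr": {"IT": 1, "HT": 1, "SD": 1, "OM": 2, "RP": 2},
    "h4": {"IT": 2, "HT": 2, "SD": 2, "OM": 1, "RP": 1},
    "b":  {"IT": 3, "HT": 3, "SD": None, "OM": 3, "RP": None},
}

scores = {tag: consensus(ranks) for tag, ranks in RANKS.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The tag with the highest combined certainty, hr in this example, is chosen as the record separator.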
Record Extractor: Results
4 different applications (car ads, job ads, obituaries, university courses) with 5 new/different sites for each application

Heuristic   Success Rate
IT          95%
HT          45%
SD          65%
OM          80%
RP          75%
Consensus   100%
Constant/Keyword Recognizer
‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888

Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
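A small Python sketch of a recognizer that emits Descriptor|String|start|end rows with 1-based, inclusive positions, matching the convention in the table above. Only the Make constants, the "k" mileage pattern, and the \bmiles\b keyword come from the slides. The Year pattern and the comma-separated mileage pattern are assumptions added for illustration, and only the first sentence of the ad is used.

```python
import re

# (descriptor, pattern) pairs; patterns marked "assumed" are not from the slides.
RULES = [
    ("Year", r"(?<=[‘'])\d{2}\b"),            # assumed
    ("Make", r"chev"),
    ("Make", r"chevy"),
    ("Make", r"dodge"),
    ("Mileage", r"\b[1-9]\d{0,2}k\b"),
    ("Mileage", r"\b[1-9]\d{0,2},\d{3}\b"),   # assumed
    ("KEYWORD(Mileage)", r"\bmiles\b"),
]

def recognize(record):
    """Return (descriptor, string, start, end) rows, 1-based and inclusive."""
    rows = []
    for descriptor, pattern in RULES:
        for match in re.finditer(pattern, record, re.IGNORECASE):
            rows.append((descriptor, match.group(), match.start() + 1, match.end()))
    rows.sort(key=lambda row: (row[2], row[3]))
    return rows

RECORD = "‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her."

rows = recognize(RECORD)
for row in rows:
    print("|".join(str(field) for field in row))
```

Each rule is run independently, so overlapping matches such as CHEV and CHEVY both appear, and resolving them is left to a later stage.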
[Diagram: Unstructured Records and Constant/Keyword Matching Rules -> Constant/Keyword Recognizer -> Data-Record Table.]
Database Instance Generator
Heuristics
• Keyword proximity
• Subsumed and overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation
[Diagram: the Data-Record Table and the Record-Level Objects, Relationships, and Constraints feed the Database-Instance Generator.]
Keyword proximity example (positions from the data-record table):
Mileage|7,000|38|42 is 2 characters from KEYWORD(Mileage)|miles|44|48
Mileage|11,995|100|105 is 52 characters from KEYWORD(Mileage)|miles|44|48
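The keyword-proximity heuristic can be sketched in Python as choosing, among competing constants for a descriptor, the one nearest a keyword for that descriptor. The distance definition (end of one span to start of the other) is inferred from the distances of 2 and 52 that follow from the table positions; the function names are hypothetical.

```python
def gap(a, b):
    """Distance between two (start, end) spans; 0 if they overlap."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return b_start - a_end
    if b_end < a_start:
        return a_start - b_end
    return 0

def closest_to_keyword(rows, descriptor):
    """Pick the candidate for `descriptor` nearest a KEYWORD(descriptor) row."""
    keyword_spans = [(s, e) for d, _, s, e in rows if d == f"KEYWORD({descriptor})"]
    candidates = [row for row in rows if row[0] == descriptor]
    if not keyword_spans or not candidates:
        return None
    return min(candidates,
               key=lambda row: min(gap((row[2], row[3]), k) for k in keyword_spans))

# Rows copied from the data-record table on the slide.
ROWS = [
    ("Mileage", "7,000", 38, 42),
    ("KEYWORD(Mileage)", "miles", 44, 48),
    ("Price", "11,995", 100, 105),
    ("Mileage", "11,995", 100, 105),
]

chosen = closest_to_keyword(ROWS, "Mileage")
print(chosen)
```

Here 7,000 wins as the Mileage, which leaves 11,995 available as the Price.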
Database-Instance Generator
insert into Car values(1001, '97', 'CHEVY', 'Cavalier', '7,000', '11,995', '566-3800');
insert into CarFeature values(1001, 'Red');
insert into CarFeature values(1001, '5 spd');
[Diagram: Data-Record Table, Record-Level Objects, Relationships, and Constraints, and Database Scheme -> Database-Instance Generator -> Populated Database.]
Recall & Precision
Recall = C / N (of facts available, how many did we find?)
Precision = C / (C + I) (of facts retrieved, how many were relevant?)
N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
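Expressed in Python, the two measures are one line each. The counts in the usage example are hypothetical and are not taken from the experiments.

```python
def recall(correct, in_source):
    """C / N: share of the facts in the source that were found correctly."""
    return correct / in_source

def precision(correct, incorrect):
    """C / (C + I): share of the declared facts that are correct."""
    return correct / (correct + incorrect)

# Hypothetical counts: 50 facts in the source, 41 declared correctly, 0 incorrectly.
print(recall(41, 50), precision(41, 0))
```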
Results: Car Ads
Training set for tuning ontology: 100
Test set: 116
Source: Salt Lake Tribune

            Recall %   Precision %
Year        100        100
Make        97         100
Model       82         100
Mileage     90         100
Price       100        100
PhoneNr     94         100
Extension   50         100
Feature     91         99
Car Ads: Comments
• Unbounded sets
  – missed: MERC, Town Car, 98 Royale
  – could use lexicon of makes and models
• Unspecified variation in lexical patterns
  – missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – could adjust lexical patterns
• Misidentification of attributes
  – classified AUTO in AUTO SALES as automatic transmission
  – could adjust exceptions in lexical patterns
• Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – could look for spelling variations and common typos
Results: Computer Job Ads
Training set for tuning ontology: 50
Test set: 50
Source: Los Angeles Times

          Recall %   Precision %
Degree    100        100
Skill     74         100
Email     91         83
Fax       100        100
Voice     79         92
Results: Obituaries
Training set for tuning ontology: ~ 24
Test set: 90
Source: Arizona Daily Star

                 Recall %   Precision %
DeceasedName*    100        100
Age              86         98
BirthDate        96         96
DeathDate        84         99
FuneralDate      96         93
FuneralAddress   82         82
FuneralTime      92         87
…
Relationship     92         97
RelativeName*    95         74
*partial or full name
Cautions
• Ontology Creation and Tuning
  – Regular expressions (tool for experimentation)
  – Category specialization and cultural localization
• Record Separation
  – Web page has multiple records satisfying an ontology
  – (HTML) record separator exists
• Attribute-Value Pair Generation
  – Context-sensitive recognizable/categorizable constants
  – Topic switches within records
Conclusions
• Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically.
• Record Separation Results: 100%
• Recall and Precision Results
  – Car Ads: ~ 94% recall and ~ 99% precision
  – Job Ads: ~ 84% recall and ~ 98% precision
  – Obituaries: ~ 90% recall and ~ 95% precision (except names: ~ 73% precision)
• Future Work
  – Find and categorize pages of interest.
  – Relax restrictions for record separation.
  – Strengthen heuristics for extraction.
  – Add richer conversions and additional constraints to data frames.
http://www.deg.byu.edu/