Post on 19-Dec-2015
Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
supported by NSF
Introduction
Many tables on the Web
Ontology-based extraction:
  Works well for unstructured or semi-structured data
  What about structured data, i.e., tables?
How to integrate data stored in different tables?
  Detect the table of interest
  Form attribute-value pairs (adjust if necessary)
  Do extraction
  Infer mappings from extraction patterns
Problem
Different source table schemas
  {Run #, Yr, Make, Model, Tran, Color, Dr}
  {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  {Vehicle, Distance, Price, Mileage}
  {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Different schemas
Solution
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
Solution: Detect the Table of Interest
Top-level tables
  Table size: at least 3 rows and columns
  Grid layout: same # of values
  Attributes
  Value density: (# of ontology-extracted values) / (total # of values in the table)
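The top-level table checks above can be sketched in Python. This is a minimal sketch under stated assumptions, not the group's implementation: the names `is_table_of_interest`, `value_density`, and `toy_recognizer` are hypothetical, the header row is assumed to count toward the row total, the density threshold of 0.5 is an assumed placeholder (the slides give no number), and the "Attributes" criterion is not modeled. The sample rows are made-up toy data.

```python
from typing import Callable, List

Table = List[List[str]]  # assumption: the first row holds the attribute names


def value_density(table: Table, is_ontology_value: Callable[[str], bool]) -> float:
    """(# of ontology-extracted values) / (total # of values in the table)."""
    values = [cell for row in table[1:] for cell in row]
    if not values:
        return 0.0
    return sum(1 for v in values if is_ontology_value(v)) / len(values)


def is_table_of_interest(
    table: Table,
    is_ontology_value: Callable[[str], bool],
    min_density: float = 0.5,  # assumed threshold, not taken from the slides
) -> bool:
    # Table size: at least 3 rows and 3 columns (header row counted, by assumption).
    if len(table) < 3 or len(table[0]) < 3:
        return False
    # Grid layout: every row carries the same number of values.
    if any(len(row) != len(table[0]) for row in table):
        return False
    # Value density: enough cells are recognized by the extraction ontology.
    return value_density(table, is_ontology_value) >= min_density


def toy_recognizer(value: str) -> bool:
    """Stand-in for the ontology recognizers: two known strings or a 4-digit number."""
    return value in {"Honda", "Civic EX"} or (value.isdigit() and len(value) == 4)


cars = [
    ["Make", "Model", "Year"],
    ["Honda", "Civic EX", "1995"],
    ["Honda", "Accord", "1998"],  # made-up row for illustration
]
print(is_table_of_interest(cars, toy_recognizer))
```

Passing the recognizer in as a function keeps the heuristic independent of any particular ontology.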
Solution: Detect the Table of Interest
Linked-page tables
  Table size: at least 2 rows and columns
  Attributes
  Attribute-value-pair pattern
Page-spanning tables
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
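Forming the pairs amounts to aligning each row with the attribute row. The sketch below assumes a simple grid table with one header row; `form_pairs` is a hypothetical helper name, and only the first five columns of the example are used.

```python
from typing import List, Tuple


def form_pairs(attributes: List[str], row: List[str]) -> List[Tuple[str, str]]:
    """Pair each cell of a data row with the attribute heading its column."""
    if len(attributes) != len(row):
        raise ValueError("row does not line up with the attribute row")
    return list(zip(attributes, row))


attributes = ["Make", "Model", "Year", "Colour", "Price"]
row = ["Honda", "Civic EX", "1995", "White", "$6300"]
pairs = form_pairs(attributes, row)
print(", ".join(f"<{a}, {v}>" for a, v in pairs))
```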
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
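One way to arrive at pairs such as <Auto, Auto> is sketched below. It rests on an assumption the slides do not state: that feature columns hold an affirmative marker (for example "Yes") in the raw table, which is replaced by the attribute name so that the feature itself becomes the extractable value. Dropping negative markers is a further assumption. `adjust_pairs` and the marker sets are hypothetical.

```python
from typing import List, Tuple

AFFIRMATIVE = {"yes", "y", "x"}  # assumed markers for "feature present"
NEGATIVE = {"no", "n", ""}       # assumed markers for "feature absent"


def adjust_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Replace affirmative markers with the attribute name; drop negative ones."""
    adjusted = []
    for attribute, value in pairs:
        marker = value.strip().lower()
        if marker in AFFIRMATIVE:
            adjusted.append((attribute, attribute))
        elif marker in NEGATIVE:
            continue  # assumption: an absent feature yields no pair
        else:
            adjusted.append((attribute, value))
    return adjusted


# Hypothetical raw pairs; the "Yes"/"No" cell values are assumed, not from the slides.
raw_pairs = [("Make", "Honda"), ("Auto", "Yes"), ("Air Cond.", "Yes"),
             ("AM/FM", "Yes"), ("CD", "No")]
adjusted_pairs = adjust_pairs(raw_pairs)
print(adjusted_pairs)
```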
Solution: Add Information Hidden Behind Links
Unstructured and semi-structured: concatenate
Single attribute-value pairs: pair them together
  <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
List: mark the beginning and the end
  < >
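The three cases for linked-page content can be illustrated with a small dispatcher. This is a sketch only: the function name `fold_linked_content`, the `kind` labels, and the exact output format (including the use of "<" and ">" as list markers and ";" as the item separator) are assumptions made for illustration. The free-text fragments and list items in the example are made up.

```python
from typing import List, Sequence, Tuple, Union

Pair = Tuple[str, str]


def fold_linked_content(kind: str, content: Union[Sequence[str], Sequence[Pair]]) -> str:
    """Turn content found behind a link into text that can join the row's record."""
    if kind == "text":
        # Unstructured or semi-structured: concatenate the fragments.
        return " ".join(fragment.strip() for fragment in content)
    if kind == "pairs":
        # Single attribute-value pairs: pair each attribute with its value.
        return ", ".join(f"<{attribute}, {value}>" for attribute, value in content)
    if kind == "list":
        # List: mark the beginning and the end (marker choice is an assumption).
        return "< " + "; ".join(content) + " >"
    raise ValueError(f"unknown kind of linked content: {kind}")


linked_pairs: List[Pair] = [("Price", "$7,988"), ("Mileage", "63,168 miles")]
print(fold_linked_content("pairs", linked_pairs))
print(fold_linked_content("list", ["Automatic", "4 DR Sedan"]))
print(fold_linked_content("text", ["  One owner ", "clean interior"]))
```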
Solution: Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car.
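Inferring mappings from extraction patterns can be pictured as counting, per source attribute, which target attribute the recognizers matched most often. The sketch below is an illustration under assumptions, not the method from the slides: the regular expressions stand in for the ontology's recognizers, `infer_mappings` and `min_share` are hypothetical, and the rows are toy data (the second row is made up).

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical recognizers for three attributes of the target schema.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Mileage": re.compile(r"^\d[\d,]*\s*miles$", re.IGNORECASE),
}


def infer_mappings(rows: List[List[Tuple[str, str]]], min_share: float = 0.5) -> Dict[str, str]:
    """Map a source attribute to the target attribute its values match most often."""
    hits: Dict[str, Counter] = defaultdict(Counter)
    totals: Counter = Counter()
    for row in rows:  # each row is one car
        for source_attr, value in row:
            totals[source_attr] += 1
            for target_attr, pattern in RECOGNIZERS.items():
                if pattern.match(value.strip()):
                    hits[source_attr][target_attr] += 1
    mappings = {}
    for source_attr, counter in hits.items():
        target_attr, count = counter.most_common(1)[0]
        # Keep the mapping only if enough of the column's values support it.
        if count / totals[source_attr] >= min_share:
            mappings[source_attr] = target_attr
    return mappings


rows = [
    [("Yr", "1995"), ("Price", "$6300"), ("Make", "Honda")],
    [("Yr", "1998"), ("Price", "$7,988"), ("Make", "Honda")],  # made-up row
]
print(infer_mappings(rows))
```

Columns whose values match no recognizer, such as Make in this toy setup, simply receive no mapping.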
Experimental Results − Table Location
Car advertisement application domain
Training set: 7
  Top table location: 100% (7)
Testing set: 53
  Top table location: 87% (46); precision: 100%, recall: 87%
Linked pages: 28 (figure counts: 13, 15)
Structured linked page location: precision: 86%, recall: 92% (figure counts: 12, 2)
Experimental Results − Mapping
Car advertisement application domain
46 recognized tables in the testing set
Total 319 mappings
Precision: 95.8%
Recall: 92.8%
Of the 296 correct mappings:
  Top-level tables: 77%
  Linked tables: 19.6%
  Both: 3.4%
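The car-domain figures can be cross-checked with a few lines of arithmetic. The check assumes that 319 is the number of mappings that should have been found, which is consistent with the reported recall. The number of declared mappings is not given on the slide, so the value derived from the reported precision is an inference, not a reported result.

```python
correct = 296          # correct mappings (from the slide)
total_expected = 319   # total mappings (from the slide; assumed to be the expected count)

recall = correct / total_expected
print(f"recall = {recall:.1%}")  # matches the reported 92.8%

# Inferred, not reported: how many mappings must have been declared
# for 296 correct ones to give the reported 95.8% precision.
reported_precision = 0.958
implied_declared = round(correct / reported_precision)
precision_check = correct / implied_declared
print(f"implied declared mappings = {implied_declared}, precision = {precision_check:.1%}")

# Split of the correct mappings by where they were found.
shares = {"top-level tables": 0.77, "linked tables": 0.196, "both": 0.034}
counts = {source: round(correct * share) for source, share in shares.items()}
print(counts, sum(counts.values()))
```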
Experimental Results − Table Location
Cell-phone sales application domain
Training set: 5
  Top table location: 100% (5)
Testing set: 12
  Top table location: 92% (11); precision: 100%, recall: 92%
Linked pages: 3
Experimental Results − Mapping
Cell-phone sales application domain
11 recognized tables in the testing set
Total 97 mappings
Precision: 90.1%
Recall: 85.4%
Of the 88 correct mappings:
  Top-level tables: 85.4%
  Linked tables: 50.5%
  Both: 35.9%