![Page 1: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/1.jpg)
Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
supported by NSF
![Page 2: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/2.jpg)
Introduction
- Many tables on the Web
- How to integrate data stored in different tables?
  - Detect the table of interest
  - Form attribute-value pairs (adjust if necessary)
  - Do extraction
  - Infer mappings from extraction patterns
![Page 3: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/3.jpg)
Problem: Detecting the Table of Interest
![Page 4: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/4.jpg)
Problem: Different Schemas
- Different source table schemas
  - {Run #, Yr, Make, Model, Tran, Color, Dr}
  - {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  - {Vehicle, Distance, Price, Mileage}
  - {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
- Target database schema
  - {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
![Page 5: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/5.jpg)
Problem: Attribute is Value
![Page 6: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/6.jpg)
Problem: Attribute-Value is Value
![Page 7: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/7.jpg)
Problem: Value is not Value
![Page 8: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/8.jpg)
Problem: Factored Values
![Page 9: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/9.jpg)
Problem: Split Values
![Page 10: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/10.jpg)
Problem: Merged Values
![Page 11: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/11.jpg)
Problem: Information Behind Links
- Single-column table (formatted as list)
- Table extending over several pages
![Page 12: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/12.jpg)
Solution
- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns
![Page 13: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/13.jpg)
Solution: Detect the Table of Interest
- ‘Real’ table test
  - Same number of values
  - Table size
- Attribute test
- Density measure test
  - (# of ontology-extracted values) / (total # of values in the table)
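The ‘real’ table test and the density measure can be sketched as follows. This is a minimal illustration, not the presented system: the recognizers (`is_year`, `is_price`, `is_make`), the tiny `MAKES` lexicon, and the `min_rows`/`min_cols` thresholds are all assumptions standing in for the extraction ontology, which the slides do not spell out.

```python
import re

# Hypothetical stand-ins for ontology recognizers (assumptions for illustration).
YEAR = re.compile(r"^(19|20)\d{2}$")
PRICE = re.compile(r"^\$\d[\d,]*$")
MAKES = {"honda", "ford"}

def is_year(value):
    return bool(YEAR.match(value))

def is_price(value):
    return bool(PRICE.match(value))

def is_make(value):
    return value.lower() in MAKES

def looks_like_real_table(rows, min_rows=2, min_cols=2):
    """'Real' table test: enough rows/columns and the same number of values per row."""
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    return width >= min_cols and all(len(row) == width for row in rows)

def density(rows, recognizers):
    """Density = (# of recognized values) / (total # of non-empty values in the table)."""
    values = [v for row in rows for v in row if v.strip()]
    if not values:
        return 0.0
    matched = sum(1 for v in values if any(rec(v) for rec in recognizers))
    return matched / len(values)
```

A candidate table would then be kept when it passes the structural test and its density exceeds some cut-off; the slides do not state the cut-off value.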
![Page 14: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/14.jpg)
Solution: Remove Factoring
(Figure: the factored year values 2001, 2000, and 1999 repeated on each row they cover.)
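Removing factoring amounts to copying a factored value down to every row it covers. A minimal sketch, assuming the factored column is known and that covered rows carry an empty cell in that column (both are assumptions; the slides do not say how factoring is detected):

```python
def remove_factoring(rows, factored_col=0):
    """Copy the most recent non-empty value in factored_col into the empty cells below it."""
    filled = []
    last_seen = None
    for row in rows:
        row = list(row)  # work on a copy so the input table is left unchanged
        if row[factored_col].strip():
            last_seen = row[factored_col]
        elif last_seen is not None:
            row[factored_col] = last_seen
        filled.append(row)
    return filled

# Hypothetical factored table: the year appears once per group of rows.
table = [
    ["2001", "Civic"],
    ["", "Accord"],
    ["2000", "Taurus"],
    ["", "Focus"],
]
unfactored = remove_factoring(table)
```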
![Page 15: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/15.jpg)
Solution: Replace Boolean Values
![Page 16: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/16.jpg)
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
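The pairs above suggest that a true Boolean cell is replaced by its attribute name (for example `<Auto, Auto>`), and the absence of a CD pair suggests that false cells are dropped. A minimal sketch of that reading; the `TRUE_MARKS` and `FALSE_MARKS` sets and the sample row are assumptions, since the transcript does not show the original table:

```python
# Hypothetical markers for Boolean cells (assumptions for illustration).
TRUE_MARKS = {"yes", "y", "x", "true"}
FALSE_MARKS = {"no", "n", "", "false"}

def form_pairs(header, row):
    """Pair each attribute with its value, replacing true Boolean cells by the attribute name."""
    pairs = []
    for attribute, value in zip(header, row):
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            pairs.append((attribute, attribute))      # e.g. <Auto, Auto>
        elif mark in FALSE_MARKS:
            continue                                  # false or empty: no pair is formed
        else:
            pairs.append((attribute, value.strip()))  # ordinary value
    return pairs

header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
```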
![Page 17: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/17.jpg)
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
![Page 18: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/18.jpg)
Solution: Add Information Hidden Behind Links
Unstructured and semi-structured: concatenate
<Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
Single attribute-value pairs: pair them together
List: mark the beginning and the end
<
>
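Two of the cases above, concatenating free text and pairing up single attribute-value entries, can be sketched as below. The `Label: value` line format is an assumption made for illustration; the slides do not describe how linked pages are laid out.

```python
def concatenate(fragments):
    """Unstructured or semi-structured text: join the non-empty fragments into one string."""
    return " ".join(f.strip() for f in fragments if f.strip())

def pairs_from_lines(lines):
    """Single attribute-value entries: split each 'Label: value' line into a pair.

    Lines without a label and a value on either side of the first colon are skipped.
    """
    pairs = []
    for line in lines:
        label, sep, value = line.partition(":")
        if sep and label.strip() and value.strip():
            pairs.append((label.strip(), value.strip()))
    return pairs

# Hypothetical text taken from a linked detail page.
linked_lines = ["Price: $7,988", "Mileage: 63,168 miles", "Great condition"]
linked_pairs = pairs_from_lines(linked_lines)
```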
![Page 19: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/19.jpg)
Solution: Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
![Page 20: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/20.jpg)
Solution: Inferred Mapping Creation
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car.
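One way to read "infer mappings from extraction patterns" is that a source column maps to the target attribute whose recognizer extracts most of that column's values. The sketch below illustrates that reading only; the recognizers, the `MAKES` lexicon, and the `min_share` threshold are assumptions, not the method as implemented by the author.

```python
import re
from collections import Counter

# Hypothetical recognizers for two target attributes (assumptions for illustration).
YEAR = re.compile(r"^(19|20)\d{2}$")
MAKES = {"honda", "ford"}

RECOGNIZERS = {
    "Year": lambda v: bool(YEAR.match(v)),
    "Make": lambda v: v.lower() in MAKES,
}

def infer_mapping(header, rows, recognizers, min_share=0.5):
    """Map each source attribute to the target attribute recognized most often in its column."""
    mapping = {}
    if not rows:
        return mapping
    for col, source_attr in enumerate(header):
        hits = Counter()
        for row in rows:
            for target_attr, recognize in recognizers.items():
                if recognize(row[col]):
                    hits[target_attr] += 1
        if hits:
            target_attr, count = hits.most_common(1)[0]
            if count / len(rows) >= min_share:
                mapping[source_attr] = target_attr
    return mapping

header = ["Yr", "Make", "Colour"]
rows = [["1995", "Honda", "White"], ["1998", "Ford", "Blue"]]
mapping = infer_mapping(header, rows, RECOGNIZERS)
```

Columns that no recognizer matches, such as Colour here, are simply left unmapped.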
![Page 26: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/26.jpg)
Experimental Results
- Car Advertisement application domain
- 10 “training” tables
  - 100% of the 57 mappings (no false mappings)
  - 94.6% precision of the values in linked pages (5.4% false declarations)
- 50 test tables
  - 94.7% of the 300 mappings (no false mappings)
  - On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
![Page 27: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/27.jpg)
Other Applications
- Cell Phone Plan application domain
- Soccer Player application domain
![Page 28: Schema Matching and Data Extraction over HTML Tables](https://reader036.vdocuments.site/reader036/viewer/2022062409/56814de3550346895dbb525b/html5/thumbnails/28.jpg)
Contribution
- Provides an approach to extract information automatically from HTML tables
- Suggests a different way to solve the problem of schema matching