Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported by NSF

Post on 19-Dec-2015

Schema Matching and Data Extraction over HTML Tables

Cui Tao

Data Extraction Research Group

Department of Computer Science

Brigham Young University

supported by NSF

2

Introduction

Many tables on the Web
Ontology-based extraction:
  Works well for unstructured or semi-structured data
  What about structured data (tables)?
How to integrate data stored in different tables?
  Detect the table of interest
  Form attribute-value pairs (adjust if necessary)
  Do extraction
  Infer mappings from extraction patterns

3

Problem: Detecting the Table of Interest

4

Problem: Different Schemas

Different source table schemas:
  {Run #, Yr, Make, Model, Tran, Color, Dr}
  {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  {Vehicle, Distance, Price, Mileage}
  {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema:
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

5

Problem: Attribute is Value

6

Problem: Attribute-Value is Value

7

Problem: Value is not Value

8

Problem: Factored Values

9

Problem: Split Values

10

Problem: Merged Values

11

Problem: Information Behind Links

List

Table extending over several pages

12

Solution

Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns

13

Solution: Detect the Table of Interest

Top-level tables
  Table size: at least 3 rows and columns
  Grid layout: same # of values
  Attributes
  Value density: (# of ontology-extracted values) / (total # of values in the table)
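The checks on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `recognizes` callback stands in for the ontology's value recognizers, `toy_recognizer` and the sample rows are made up for the example, the first row is assumed to hold the attributes, and the `min_density` cutoff of 0.5 is an assumption because the slide gives no threshold.

```python
import re
from typing import Callable, List

Table = List[List[str]]  # assumption: the first row holds the attribute names


def is_grid(table: Table) -> bool:
    """Grid layout: every row has the same number of cells."""
    return len({len(row) for row in table}) == 1


def value_density(table: Table, recognizes: Callable[[str], bool]) -> float:
    """Recognized data cells divided by all data cells (header row excluded)."""
    values = [cell for row in table[1:] for cell in row]
    if not values:
        return 0.0
    return sum(1 for value in values if recognizes(value)) / len(values)


def is_table_of_interest(table: Table, recognizes: Callable[[str], bool],
                         min_rows: int = 3, min_cols: int = 3,
                         min_density: float = 0.5) -> bool:
    """Apply the size, grid, and density checks in turn."""
    if len(table) < min_rows or not is_grid(table):
        return False
    if len(table[0]) < min_cols:
        return False
    return value_density(table, recognizes) >= min_density


def toy_recognizer(value: str) -> bool:
    """Hypothetical stand-in for the ontology: matches years and dollar prices."""
    return bool(re.fullmatch(r"(19|20)\d{2}|\$[\d,]+", value.strip()))


cars = [
    ["Year", "Make", "Price"],
    ["1995", "Honda", "$6300"],
    ["2000", "Ford", "$7,988"],
]
print(is_table_of_interest(cars, toy_recognizer))
```

With the toy recognizer, four of the six data cells are recognized, so the sample table passes the assumed 0.5 density cutoff.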

14

Solution: Detect the Table of Interest

Linked-page tables
  Table size: at least 2 rows and columns
  Attributes
  Attribute-value-pair pattern
Page-spanning tables

15

Solution: Remove Factoring

[Figure: example table in which every row carries its own year value: 2001 (3 rows), 2000 (6 rows), 1999 (2 rows)]
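One way to picture removing factoring is a fill-down over the factored column. The sketch below assumes the factored value appears once, in the first row of its group, with empty cells beneath it. Real pages may instead place the factored value in a spanning header row, which this sketch does not handle. The function name and sample rows are illustrative only.

```python
from typing import List, Optional

Row = List[Optional[str]]


def remove_factoring(rows: List[Row], column: int) -> List[Row]:
    """Copy the most recent non-empty value in `column` into the empty cells below it."""
    filled: List[Row] = []
    current: Optional[str] = None
    for row in rows:
        row = list(row)              # copy so the input rows stay untouched
        if row[column]:
            current = row[column]    # a new factored value starts a new group
        else:
            row[column] = current    # inherit the value from the group's first row
        filled.append(row)
    return filled


factored = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    [None, "Nissan"],
]
unfactored = remove_factoring(factored, 0)
print(unfactored)
```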

16

Solution: Replace Boolean Values
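A possible reading of this step, consistent with pairs such as <Auto, Auto> elsewhere in the deck, is that a "yes"-style cell is replaced by its attribute name. The sketch below follows that reading. The marker sets and the choice to turn "no"-style cells into `None` are assumptions, not something the slides specify.

```python
from typing import List, Optional

TRUE_MARKERS = {"yes", "y", "x", "true"}   # assumed spellings of "feature present"
FALSE_MARKERS = {"no", "n", "false"}       # assumed spellings of "feature absent"


def replace_boolean_values(header: List[str], row: List[str]) -> List[Optional[str]]:
    """Replace true-like cells by the attribute name and false-like cells by None."""
    result: List[Optional[str]] = []
    for attribute, value in zip(header, row):
        marker = value.strip().lower()
        if marker in TRUE_MARKERS:
            result.append(attribute)   # the attribute itself becomes the value
        elif marker in FALSE_MARKERS:
            result.append(None)        # feature absent: nothing to extract
        else:
            result.append(value)       # ordinary value: keep as-is
    return result


header = ["Make", "Auto", "Air Cond.", "CD"]
row = ["Honda", "Yes", "Yes", "No"]
print(replace_boolean_values(header, row))
```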

17

Solution: Form Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
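Forming the pairs amounts to zipping the attribute row with each data row. The sketch below reproduces the pairs listed on this slide. The helper names are hypothetical, and skipping empty cells is an assumption.

```python
from typing import List, Optional, Tuple


def form_pairs(header: List[str], row: List[Optional[str]]) -> List[Tuple[str, str]]:
    """Pair each attribute with the value in the same column, skipping empty cells."""
    return [(attribute, value) for attribute, value in zip(header, row)
            if value not in (None, "")]


def render(pairs: List[Tuple[str, str]]) -> str:
    """Format pairs in the <Attribute, Value> notation used on the slides."""
    return ", ".join(f"<{attribute}, {value}>" for attribute, value in pairs)


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Auto", "Air Cond.", "AM/FM"]
pairs = form_pairs(header, row)
print(render(pairs))
```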

18

Solution: Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

19

Solution: Add Information Hidden Behind Links

Unstructured and semi-structured: concatenate
Single attribute-value pairs: pair them together
  <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>
List: mark the beginning and the end
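The three cases on this slide could be handled by a small dispatch like the one below. It is only a sketch under stated assumptions: a record is modelled as a list of text fragments, the `kind` labels and the `LIST_BEGIN`/`LIST_END` marker strings are hypothetical, and the list items in the example are placeholders rather than data from the slides.

```python
from typing import List

LIST_BEGIN, LIST_END = "[list]", "[/list]"   # hypothetical begin/end markers


def append_linked_information(record: List[str], kind: str, content) -> List[str]:
    """Return a copy of `record` extended with information from a linked page."""
    extended = list(record)
    if kind == "text":        # unstructured or semi-structured: concatenate
        extended.append(content)
    elif kind == "pairs":     # single attribute-value pairs: pair them together
        extended.extend(f"<{attribute}, {value}>" for attribute, value in content)
    elif kind == "list":      # list: mark the beginning and the end
        extended.append(LIST_BEGIN)
        extended.extend(content)
        extended.append(LIST_END)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return extended


record = ["<Price, $7,988>"]
record = append_linked_information(record, "pairs", [("Mileage", "63,168 miles"), ("Doors", "4")])
record = append_linked_information(record, "list", ["item A", "item B"])
print(record)
```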


20

Solution: Inferred Mapping Creation

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Each row is a car.
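The idea of inferring mappings from extraction patterns can be illustrated as follows: for each source column, count which target attribute's recognizer matches its values, and map the column to the attribute that matches most often. This is a simplified sketch, not the authors' algorithm. The regular expressions stand in for ontology recognizers, the `min_share` cutoff is an assumption, and the second data row is made up for the example.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical recognizers for three attributes of the target schema.
RECOGNIZERS = {
    "Year": re.compile(r"(19|20)\d{2}"),
    "Price": re.compile(r"\$\s?[\d,]+"),
    "Mileage": re.compile(r"[\d,]+\s*(miles|mi\.?|k)", re.IGNORECASE),
}


def infer_mappings(header: List[str], rows: List[List[str]],
                   min_share: float = 0.5) -> Dict[str, str]:
    """Map each source attribute to the target attribute its values match most often."""
    mappings: Dict[str, str] = {}
    for index, source_attribute in enumerate(header):
        hits: Counter = Counter()
        for row in rows:
            for target_attribute, pattern in RECOGNIZERS.items():
                if pattern.fullmatch(row[index].strip()):
                    hits[target_attribute] += 1
        if hits:
            target, count = hits.most_common(1)[0]
            if count / len(rows) >= min_share:   # enough rows agree on the target
                mappings[source_attribute] = target
    return mappings


header = ["Yr", "Make", "Price", "Miles"]
rows = [
    ["1995", "Honda", "$6300", "63,168 miles"],
    ["1998", "Ford", "$7,988", "42,000 miles"],
]
mappings = infer_mappings(header, rows)
print(mappings)
```

The `Make` column stays unmapped here only because the sketch has no recognizer for it. A lexicon of car makes would be the natural addition.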

28

Experimental Results − Table Location

Car advertisement application domain

Top table location (precision: 100%, recall: 87%)
  Training set: 7, of which 100% (7) located
  Testing set: 53, of which 87% (46) located
Linked pages (counts as shown in the figure): 46, 100% (7), 28, 13, 15
Structured linked page location (precision: 86%, recall: 92%); figure counts: 12, 2

29

Experimental Results − Mapping

Car advertisement application domain
46 recognized tables in the testing set
Total 319 mappings
Precision: 95.8%
Recall: 92.8%
Of the 296 correct mappings:
  Top-level tables: 77%
  Linked tables: 19.6%
  Both: 3.4%
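The reported figures can be cross-checked with a few lines of arithmetic. The sketch reads "Total 319 mappings" as the number of mappings that should have been found, which is an interpretation that happens to reproduce the stated recall. The slide does not give the number of mappings the system proposed, so precision is not recomputed.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


mapping_recall = percent(296, 319)            # correct mappings / total mappings
table_recall = percent(46, 53)                # recognized tables / testing-set size
origin_total = round(77 + 19.6 + 3.4, 1)      # top-level + linked + both

print(mapping_recall, table_recall, origin_total)
```

The first value matches the 92.8% recall on this slide, the second rounds to the 87% reported for table location, and the three origin shares add up to 100%.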

30

Experimental Results − Table Location

Cell-phone sales application domain

Top table location (precision: 100%, recall: 92%)
  Training set: 5, of which 100% (5) located
  Testing set: 12, of which 92% (11) located
Linked pages (counts as shown in the figure): 11, 100% (5), 3

31

Experimental Results − Mapping

Cell-phone sales application domain
11 recognized tables in the testing set
Total 97 mappings
Precision: 90.1%
Recall: 85.4%
Of the 88 correct mappings:
  Top-level tables: 85.4%
  Linked tables: 50.5%
  Both: 35.9%

32

Contribution

Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching