Post on 19-Dec-2015

Schema Matching and Data Extraction over HTML Tables

Cui Tao

Data Extraction Research Group

Department of Computer Science

Brigham Young University

supported by NSF

Introduction

- Many tables on the Web
- Ontology-based extraction:
  - Works well for unstructured or semi-structured data
  - What about structured data (tables)?
- How to integrate data stored in different tables?
  - Detect the table of interest
  - Form attribute-value pairs (adjust if necessary)
  - Do extraction
  - Infer mappings from extraction patterns

Problem − Detecting the Table of Interest

Problem − Different Schemas

Different source table schemas:
- {Run #, Yr, Make, Model, Tran, Color, Dr}
- {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
- {Vehicle, Distance, Price, Mileage}
- {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Target database schema:
- {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Problem − Attribute is Value

Problem − Attribute-Value is Value

Problem − Value is not Value

Problem − Factored Values

Problem − Split Values

Problem − Merged Values

Problem − Information Behind Links

- List
- Table extending over several pages

Solution

- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns

Solution − Detect the Table of Interest

Top-level tables
- Table size: at least 3 rows and columns
- Grid layout: same # of values
- Attributes
- Value density: (# of ontology-extracted values) / (total # of values in the table)
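The value-density test can be sketched in a few lines. This is only an illustration: the `RECOGNIZERS` patterns are hypothetical stand-ins for the extraction ontology's value recognizers, and the sample table is made up from values that appear elsewhere in these slides.

```python
import re
from typing import List

# Hypothetical stand-ins for the ontology's value recognizers (assumption):
# a cell counts as "extracted" if any pattern matches it.
RECOGNIZERS = [
    re.compile(r"^(19|20)\d{2}$"),         # year
    re.compile(r"^\$\d[\d,]*$"),           # price
    re.compile(r"^(Honda|Ford|Toyota)$"),  # make (toy lexicon)
]


def is_recognized(cell: str) -> bool:
    """True if at least one recognizer matches the cell text."""
    return any(p.match(cell.strip()) for p in RECOGNIZERS)


def value_density(rows: List[List[str]]) -> float:
    """# of ontology-extracted values / total # of values in the table."""
    values = [cell for row in rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(is_recognized(v) for v in values) / len(values)


table = [
    ["1995", "Honda", "Civic EX", "$6300"],
    ["1998", "Ford", "Taurus", "$7,988"],
]
print(value_density(table))  # 0.75 (6 of the 8 values are recognized)
```

A table whose density falls below some chosen threshold would be rejected as not being the table of interest; the slides do not state the threshold.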

Solution − Detect the Table of Interest

Linked-page tables
- Table size: at least 2 rows and columns
- Attributes
- Attribute-value-pair pattern

Page-spanning tables

Solution − Remove Factoring

Example figure: a Year column in which the factored values 2001, 2000, and 1999 are repeated in every row they cover.
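A minimal sketch of removing factoring, assuming a factored value is written once and the cells below it are left empty until the next value appears. The function name and the input column are illustrative and not taken from the slides.

```python
from typing import List, Optional


def remove_factoring(column: List[Optional[str]]) -> List[Optional[str]]:
    """Fill each empty cell with the nearest non-empty value above it."""
    filled: List[Optional[str]] = []
    current: Optional[str] = None
    for cell in column:
        if cell is not None and cell.strip():
            current = cell.strip()
        filled.append(current)
    return filled


# A factored Year column: each year is written once for a group of rows.
years = ["2001", "", "", "2000", "", "", "", "", "", "1999", ""]
print(remove_factoring(years))
# ['2001', '2001', '2001', '2000', '2000', '2000', '2000', '2000', '2000', '1999', '1999']
```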

Solution − Replace Boolean Values
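One way to picture this step: a positive Boolean cell is replaced by its attribute name, so that "Yes" under Auto becomes the value "Auto". The marker sets below are assumptions, and so is the choice to drop negative cells. The slides do not state what happens to them, although dropping them would be consistent with CD being absent from the example pairs.

```python
from typing import Dict

# Assumed spellings of Boolean cells; real pages may use others.
TRUE_MARKERS = {"yes", "y", "x", "true"}
FALSE_MARKERS = {"no", "n", "false"}


def replace_booleans(row: Dict[str, str]) -> Dict[str, str]:
    """Replace positive Boolean cells by the attribute name.

    Negative cells are dropped (an assumption, see above);
    all other cells are kept unchanged.
    """
    result: Dict[str, str] = {}
    for attribute, value in row.items():
        marker = value.strip().lower()
        if marker in TRUE_MARKERS:
            result[attribute] = attribute
        elif marker in FALSE_MARKERS:
            continue
        else:
            result[attribute] = value
    return result


row = {"Make": "Honda", "Price": "$6300", "Auto": "Yes",
       "Air Cond.": "Yes", "AM/FM": "Yes", "CD": "No"}
print(replace_booleans(row))
```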

Solution − Form Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
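Forming the pairs amounts to zipping the attribute row with each data row. The sketch below assumes the table has already been normalized (factoring removed, Boolean values replaced); `form_pairs` and `render` are illustrative names, not part of the presented system.

```python
from typing import List, Tuple


def form_pairs(attributes: List[str], values: List[str]) -> List[Tuple[str, str]]:
    """Pair each attribute with the value in the same column; skip empty cells."""
    if len(attributes) != len(values):
        raise ValueError("row length does not match the attribute row")
    return [(a.strip(), v.strip()) for a, v in zip(attributes, values) if v.strip()]


def render(pairs: List[Tuple[str, str]]) -> str:
    """Write the pairs in the <Attribute, Value> notation used on the slides."""
    return ", ".join(f"<{a}, {v}>" for a, v in pairs)


header = ["Make", "Model", "Year", "Colour", "Price"]
row = ["Honda", "Civic EX", "1995", "White", "$6300"]
print(render(form_pairs(header, row)))
# <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>
```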

Solution − Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

Solution − Add Information Hidden Behind Links

- Unstructured and semi-structured: concatenate
- Single attribute-value pairs: pair them together

<Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>

- List: mark the beginning and the end (the slide shows the markers < and >)

Solution − Inferred Mapping Creation

Target schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Each row is a car.
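The slides show mapping inference only through figures, so the following is a rough sketch of the general idea rather than the presented algorithm: a source column is mapped to the target attribute whose recognizer extracts most of its values. The recognizers, the threshold, and the sample columns are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical recognizers standing in for the extraction ontology (assumption).
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Mileage": re.compile(r"^\d[\d,]*\s*miles$", re.IGNORECASE),
}


def infer_mapping(table: Dict[str, List[str]],
                  threshold: float = 0.5) -> Dict[str, Optional[str]]:
    """Map each source attribute to the target attribute that recognizes
    more than `threshold` of the column's values, or to None."""
    mapping: Dict[str, Optional[str]] = {}
    for source_attr, values in table.items():
        hits: Counter = Counter()
        for value in values:
            for target_attr, pattern in RECOGNIZERS.items():
                if pattern.match(value.strip()):
                    hits[target_attr] += 1
        best = hits.most_common(1)
        if values and best and best[0][1] / len(values) > threshold:
            mapping[source_attr] = best[0][0]
        else:
            mapping[source_attr] = None
    return mapping


# Columns of a source table, keyed by the source attribute names.
source = {
    "Yr": ["1995", "1998", "2000"],
    "Price": ["$6300", "$7,988", "call"],
    "Distance": ["63,168 miles", "41,000 miles", "n/a"],
}
print(infer_mapping(source))
# {'Yr': 'Year', 'Price': 'Price', 'Distance': 'Mileage'}
```

Because the mapping is derived from what the recognizers extract rather than from the attribute names, a column labeled Distance can still be matched to Mileage.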

Experimental Results − Table Location

Car advertisement application domain

- Training set: 7
  - Top table location: 100% (7)
- Testing set: 53
  - Top table location: 87% (46); precision: 100%, recall: 87%
- Linked pages: counts shown in the diagram are 28, 13, and 15
- Structured linked page location: precision: 86%, recall: 92% (counts shown in the diagram: 12 and 2)

Experimental Results − Mapping

Car advertisement application domain

- 46 recognized tables in the testing set
- Total: 319 mappings
- Precision: 95.8%
- Recall: 92.8%
- Of the 296 correct mappings:
  - Top-level tables: 77%
  - Linked tables: 19.6%
  - Both: 3.4%
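The reported car-domain percentages can be checked against the counts on the slides. The sketch assumes that 319 is the number of mappings that should have been found and 296 the number found correctly, a reading that matches the reported recall. The number of mappings the system declared is not given, so precision is not recomputed here.

```python
def recall_percent(correct: int, expected: int) -> float:
    """Recall as a percentage, rounded to one decimal place."""
    return round(100 * correct / expected, 1)


# Table location: 46 of the 53 testing-set tables were located.
print(recall_percent(46, 53))    # 86.8, reported as 87%

# Mapping: 296 correct mappings out of 319 (assumed to be the expected total).
print(recall_percent(296, 319))  # 92.8, reported as 92.8%
```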

Experimental Results − Table Location

Cell-phone sales application domain

- Training set: 5
  - Top table location: 100% (5)
- Testing set: 12
  - Top table location: 92% (11); precision: 100%, recall: 92%
- Linked pages: count shown in the diagram is 3

Experimental Results − Mapping

Cell-phone sales application domain

- 11 recognized tables in the testing set
- Total: 97 mappings
- Precision: 90.1%
- Recall: 85.4%
- Of the 88 correct mappings:
  - Top-level tables: 85.4%
  - Linked tables: 50.5%
  - Both: 35.9%

Contribution

- Provides an approach to extract information automatically from HTML tables
- Suggests a different way to solve the problem of schema matching
