Learning Source Descriptions for Data Integration
AnHai Doan, Pedro Domingos, Alon Levy
Department of Computer Science & Engineering, University of Washington


AnHai Doan

Pedro Domingos

Alon Levy

Department of Computer Science & Engineering
University of Washington

Learning Source Descriptions for Data Integration


Overview

Problem definition
  – schema matching
Solution
  – multi-strategy learning
Prototype system
  – LSD (Learning Source Descriptions)
Experiments
Related work
Summary & future work


Data Integration

Query: Find houses with four bathrooms and price under $500,000

[Figure: the query is posed against the mediated schema. Below it sit three sources (superhomes.com, realestate.com, homeseekers.com), each with its own source schema and wrapper.]


Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two schema trees, each rooted at house. Elements shown: location, contact-info, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone.]


Map of the Problem

source descriptions: schema matching, data translation, scope, completeness, reliability, query capability

leaf elements vs. higher-level elements

1-1 mappings vs. complex mappings


Current State of Affairs

Largely done by hand
  – labor intensive & error prone
  – key bottleneck in building applications
Will only be exacerbated
  – data sharing & XML become pervasive
  – proliferation of DTDs
  – translation of legacy data
Need automatic approaches to scale up!


Our Approach

Use machine learning to match schemas

Basic idea
1. create training data
  – manually map a set of sources to mediated schema
2. train system on training data
  – learns from
      – name of schema elements
      – format of values
      – frequency of words & symbols
      – characteristics of value distribution
      – proximity, position, structure, ...
3. system proposes mappings for subsequent sources


Example

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
comments: Fantastic house ...; Great ...; Hurry! ...; ...
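The per-element value columns in this example could be gathered from the listings roughly as follows. This is a minimal sketch, not LSD's own code: the `<listings>` wrapper element, the two-record sample, and the function name `collect_columns` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed sample in the style of the realestate.com listing.
SAMPLE = """
<listings>
  <house>
    <location>Seattle, WA</location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house ...</comments>
  </house>
  <house>
    <location>Dallas, TX</location>
    <agent-phone>(214) 722 4035</agent-phone>
    <listed-price>$180,000</listed-price>
    <comments>Hurry! ...</comments>
  </house>
</listings>
"""


def collect_columns(xml_text):
    """Map each leaf tag name to the list of values seen for it."""
    columns = defaultdict(list)
    root = ET.fromstring(xml_text)
    for house in root.iter("house"):
        for leaf in house:
            # Strip surrounding whitespace; treat an empty element as "".
            columns[leaf.tag].append((leaf.text or "").strip())
    return dict(columns)


columns = collect_columns(SAMPLE)
for tag, values in columns.items():
    print(tag, values)
```

Each column then serves as the data instances for one source element, which is the input the learners on the following slides work with.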


Multi-Strategy Learning

Use a set of base learners
  – each exploits certain types of information
Match schema elements of a new source
  – apply the learners
  – combine their predictions using a meta-learner
Meta-learner
  – measures base learner accuracy on training data
  – weighs each learner based on its accuracy
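One simple way to read "weighs each learner based on its accuracy" is to use each learner's training accuracy directly as its weight. The sketch below does that. It is an assumption for illustration: the toy learners `by_name` and `by_value`, the training triples, and the weighting rule are hypothetical, and the deck later names stacking as what LSD actually uses.

```python
def accuracy(learner, examples):
    """Fraction of (item, true_label) pairs whose top prediction is correct."""
    correct = 0
    for item, true_label in examples:
        scores = learner(item)
        if max(scores, key=scores.get) == true_label:
            correct += 1
    return correct / len(examples)


def combine(learners, weights, item):
    """Weighted sum of the learners' confidence scores, renormalized to 1."""
    total = {}
    for learner, weight in zip(learners, weights):
        for label, score in learner(item).items():
            total[label] = total.get(label, 0.0) + weight * score
    norm = sum(total.values())
    return {label: score / norm for label, score in total.items()}


# Toy base learners (hypothetical): one looks at the tag, one at the value.
def by_name(item):
    tag, _ = item
    if "phone" in tag:
        return {"phone": 0.7, "address": 0.3}
    return {"address": 0.6, "phone": 0.4}


def by_value(item):
    _, value = item
    if value.startswith("("):
        return {"phone": 0.9, "address": 0.1}
    return {"address": 0.8, "phone": 0.2}


training = [
    (("agent-phone", "(206) 729 0831"), "phone"),
    (("location", "Seattle, WA"), "address"),
    (("contact", "(214) 722 4035"), "phone"),
]
learners = [by_name, by_value]
weights = [accuracy(learner, training) for learner in learners]
prediction = combine(learners, weights, ("area", "Kent, WA"))
print(weights, prediction)
```

Here the tag-based learner misses the `contact` example, so it ends up with a lower weight than the value-based learner.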


Learners

Input
  – schema information: name, proximity, structure, ...
  – data information: value, format, ...
Output
  – prediction weighted by confidence score
Examples
  – Name matcher
      – agent-name => (name,0.7), (phone,0.3)
  – Frequency learner
      – “Seattle, WA” => (address,0.8), (name,0.2)
      – “Great location ...” => (description,0.9), (address,0.1)
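A learner with this input/output shape can be sketched in a few lines. The slides do not say how the name matcher computes its scores, so the version below is an assumption: it uses plain string similarity from Python's `difflib` and normalizes the scores to sum to 1. The function name `name_matcher` is hypothetical, and the numbers it produces will not match the 0.7/0.3 shown on the slide.

```python
from difflib import SequenceMatcher


def name_matcher(source_tag, mediated_tags):
    """Score each mediated-schema tag by string similarity to source_tag."""
    raw = {
        tag: SequenceMatcher(None, source_tag.lower(), tag.lower()).ratio()
        for tag in mediated_tags
    }
    total = sum(raw.values())
    if total == 0:
        # No character overlap at all: fall back to a uniform distribution.
        return {tag: 1.0 / len(mediated_tags) for tag in mediated_tags}
    return {tag: score / total for tag, score in raw.items()}


scores = name_matcher("agent-name", ["name", "phone"])
print(scores)
```

Returning a dictionary of label-to-confidence pairs keeps the output uniform across learners, which is what lets a meta-learner combine them.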


Training the Learners

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

source schema: location, listed-price, agent-phone, comments

Name Matcher
  – (location, address)
  – (agent-phone, phone)
  – (listed-price, price)
  – (comments, description)
  – ...

Frequency Learner
  – (“Seattle, WA”, address)
  – (“(206) 729 0831”, phone)
  – (“$ 250,000”, price)
  – (“Fantastic house ...”, description)
  – ...
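The (value, label) pairs listed for the Frequency Learner are enough to train a small word-frequency classifier. The sketch below uses multinomial Naive Bayes with add-one smoothing, which is an assumed stand-in rather than a description of the learner in the deck. The class name `FrequencyLearner`, the tokenizer, and the smoothing choice are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split into lowercase words, digit runs, and single symbols."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class FrequencyLearner:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        for value, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(value):
                self.token_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, value):
        tokens = tokenize(value)
        n_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            log_p = math.log(count / n_examples)
            for token in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                seen = self.token_counts[label][token]
                log_p += math.log((seen + 1) / (total + len(self.vocab)))
            log_scores[label] = log_p
        # Convert log scores into confidences that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: e / z for label, e in exp.items()}


learner = FrequencyLearner()
learner.train([
    ("Seattle, WA", "address"),
    ("(206) 729 0831", "phone"),
    ("$ 250,000", "price"),
    ("Fantastic house ...", "description"),
])
address_scores = learner.predict("Kent, WA")
phone_scores = learner.predict("(214) 722 4035")
print(address_scores)
print(phone_scores)
```

With only four training values the confidences are weak, but shared tokens such as `wa` or the parentheses around an area code are already enough to separate these two cases.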


Applying the Learners

homes.com

mediated schema: address, phone, price, description

area: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA

[Figure: each value of area is passed to the Name Matcher and the Frequency Learner, and the Meta-learner combines their predictions. The per-value results are address, address, description, address; the Combiner outputs address for area.]
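The final step, going from one prediction per value to a single label for the whole element, can be sketched as below. The slides do not state how the Combiner merges per-value predictions, so averaging the confidence scores is an assumption. The function name `combine_instances` and the numeric scores are made up; they are chosen only so that the per-value winners are address, address, description, address, as on the slide.

```python
from collections import defaultdict


def combine_instances(instance_predictions):
    """Average per-instance score dicts into one prediction for the element."""
    totals = defaultdict(float)
    for scores in instance_predictions:
        for label, score in scores.items():
            totals[label] += score
    n = len(instance_predictions)
    averaged = {label: total / n for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


# Illustrative meta-learner outputs for the four values of "area".
meta_outputs = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.7, "description": 0.3},  # Kent, WA
    {"address": 0.4, "description": 0.6},  # Austin, TX
    {"address": 0.9, "description": 0.1},  # Seattle, WA
]
label, averaged = combine_instances(meta_outputs)
print(label, averaged)
```

Averaging lets the three confident address predictions outweigh the single description prediction, so one misclassified value does not change the mapping proposed for the element.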


The LSD System

Base learners/modules
  – name matcher
  – Naive Bayesian learner
  – Whirl nearest-neighbor classifier [Cohen&Hirsh-KDD98]
  – county-name recognizer
Meta-learner
  – uses stacking [Ting&Witten99, Wolpert92]
  – uses training data to learn weights for base learners
  – combines predictions using confidence scores/weights


Experiments

Sources               Coverage    # of Matchable Leaf Elements    Best Single Learner    LSD
realestate.yahoo      national    31                              63%                    77%
homeseekers.com       national    31                              52%                    64%
nkymls.com            Kentucky    28                              64%                    75%
texasproperties.com   Texas       42                              59%                    62%
windermere.com        Northwest   35                              55%                    63%
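To make the comparison in the table above concrete, the short script below computes the gain of LSD over the best single learner for each source, plus the means. It uses only the percentages reported in the table; the variable names are illustrative.

```python
# (source, best single learner %, LSD %) as reported in the table above.
RESULTS = [
    ("realestate.yahoo", 63, 77),
    ("homeseekers.com", 52, 64),
    ("nkymls.com", 64, 75),
    ("texasproperties.com", 59, 62),
    ("windermere.com", 55, 63),
]

# Gain in percentage points for each source.
gains = {source: lsd - best for source, best, lsd in RESULTS}
mean_best = sum(best for _, best, _ in RESULTS) / len(RESULTS)
mean_lsd = sum(lsd for _, _, lsd in RESULTS) / len(RESULTS)
mean_gain = mean_lsd - mean_best

for source, gain in gains.items():
    print(f"{source}: +{gain} points")
print(f"mean: {mean_best:.1f}% -> {mean_lsd:.1f}% (+{mean_gain:.1f} points)")
```

LSD improves on the best single learner for all five sources, by between 3 and 14 percentage points, with a mean gain of 9.6 points (58.6% to 68.2%).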


Related Work

Rule-based approaches
  – TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
  – utilize only schema information
Learner-based approaches
  – SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
  – employ a single learner, limited applicability
Multi-strategy learning in other domains
  – series of workshops [91,93,96,98,00]
  – [Freitag98], Proverb [Keim et al. 99]


Summary

Schema matching
  – automated by learning
Multi-strategy learning is essential
  – handles different types of data
  – incorporates different types of domain knowledge
  – easy to incorporate new learners
  – alleviates effects of noise & dirty data
Implemented LSD
  – promising results with initial experiments


Future Work

source descriptions: schema matching, data translation, scope, completeness, reliability, query capability

leaf elements vs. higher-level elements

1-1 mappings vs. complex mappings