
AnHai Doan

Pedro Domingos

Alon Levy

Department of Computer Science & Engineering
University of Washington

Learning Source Descriptions for Data Integration

Overview

Problem definition
– schema matching

Solution
– multi-strategy learning

Prototype system
– LSD (Learning Source Descriptions)

Experiments

Related work

Summary & future work

Data Integration

Example query: find houses with four bathrooms and price under $500,000

[Figure: the query is posed against the mediated schema; superhomes.com, realestate.com, and homeseekers.com each have their own source schema and sit behind a wrapper.]

Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two DTD trees, each rooted at house. Element labels shown: location, contact-info, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone.]

Map of the Problem

source descriptions
– schema matching
– data translation
– scope
– completeness
– reliability
– query capability

leaf elements, higher-level elements

1-1 mappings, complex mappings

Current State of Affairs

Largely done by hand
– labor intensive & error prone
– key bottleneck in building applications

Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data

Need automatic approaches to scale up!

Our Approach

Use machine learning to match schemas

Basic idea

1. create training data
– manually map a set of sources to mediated schema

2. train system on training data
– learns from
  – name of schema elements
  – format of values
  – frequency of words & symbols
  – characteristics of value distribution
  – proximity, position, structure, ...

3. system proposes mappings for subsequent sources

Example

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

Data values by source element

location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
comments: Fantastic house ...; Great ...; Hurry! ...; ...

Multi-Strategy Learning

Use a set of base learners
– each exploits certain types of information

Match schema elements of a new source
– apply the learners
– combine their predictions using a meta-learner

Meta-learner
– measures base learner accuracy on training data
– weighs each learner based on its accuracy
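The accuracy-weighted combination described above can be sketched as follows. This is only an illustration under assumptions: the function names, the example scores, and the weight values are hypothetical, and the normalized weighted sum stands in for whatever combination rule the actual system uses.

```python
from collections import defaultdict


def accuracy(predict, labeled_examples):
    """Fraction of labeled examples whose top-scoring label is the true label."""
    correct = 0
    for example, true_label in labeled_examples:
        scores = predict(example)
        if scores and max(scores, key=scores.get) == true_label:
            correct += 1
    return correct / len(labeled_examples)


def combine(predictions, weights):
    """Weighted sum of per-learner confidence scores, normalized to sum to 1."""
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights[learner] * confidence
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}


# Hypothetical predictions of two base learners for one schema element.
predictions = {
    "name_matcher": {"name": 0.7, "phone": 0.3},
    "frequency_learner": {"phone": 0.9, "name": 0.1},
}
# Hypothetical weights, e.g. each learner's accuracy on the training data.
weights = {"name_matcher": 0.5, "frequency_learner": 0.75}

result = combine(predictions, weights)
best = max(result, key=result.get)
```

With these made-up numbers the more accurate learner dominates, so the combined prediction follows the frequency learner even though the name matcher disagrees.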

Learners

Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...

Output
– prediction weighted by confidence score

Examples
– Name matcher
  – agent-name => (name,0.7), (phone,0.3)
– Frequency learner
  – “Seattle, WA” => (address,0.8), (name,0.2)
  – “Great location ...” => (description,0.9), (address,0.1)
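A learner that maps an input to labels with confidence scores could look like the sketch below. The slides do not say how the name matcher computes its scores; string similarity from Python's difflib is used here purely as a stand-in, and the label list and function name are assumptions.

```python
from difflib import SequenceMatcher

# Assumed mediated-schema labels, taken from the running example.
MEDIATED_LABELS = ["address", "name", "phone", "price", "description"]


def name_matcher(element_name, labels=MEDIATED_LABELS):
    """Score each label by string similarity to the element name.

    Returns a dict label -> confidence, normalized to sum to 1.
    """
    raw = {
        label: SequenceMatcher(None, element_name.lower(), label).ratio()
        for label in labels
    }
    total = sum(raw.values())
    if total == 0:
        # No similarity at all: fall back to a uniform distribution.
        return {label: 1 / len(labels) for label in labels}
    return {label: score / total for label, score in raw.items()}


scores = name_matcher("agent-name")
best = max(scores, key=scores.get)
```

The scores this stand-in produces will differ from the (name,0.7), (phone,0.3) figures on the slide; only the shape of the output, a set of labels with confidences, is meant to match.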

Training the Learners

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

Source elements mapped to the mediated schema: location, listed-price, agent-phone, comments

Name Matcher training examples
(location, address)
(agent-phone, phone)
(listed-price, price)
(comments, description)
...

Frequency Learner training examples
(“Seattle, WA”, address)
(“(206) 729 0831”, phone)
(“$ 250,000”, price)
(“Fantastic house ...”, description)
...
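The training pairs above can be derived mechanically from a listing and its manual mapping. The sketch below assumes the mapping is given as a plain dictionary from source tag to mediated-schema label; the helper name and the data layout are hypothetical.

```python
import xml.etree.ElementTree as ET

LISTING = """
<house>
  <location>Seattle, WA</location>
  <agent-phone>(206) 729 0831</agent-phone>
  <listed-price>$ 250,000</listed-price>
  <comments>Fantastic house ...</comments>
</house>
"""

# Manual mapping from source elements to mediated-schema elements.
MANUAL_MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}


def training_examples(xml_text, mapping):
    """Return (name pairs, value pairs) for the mapped elements of one listing."""
    name_pairs, value_pairs = [], []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        label = mapping.get(element.tag)
        if label is None:
            continue  # skip unmapped elements such as the <house> root
        name_pairs.append((element.tag, label))
        value_pairs.append(((element.text or "").strip(), label))
    return name_pairs, value_pairs


name_pairs, value_pairs = training_examples(LISTING, MANUAL_MAPPING)
```

Here `name_pairs` would feed a learner that looks at element names, and `value_pairs` a learner that looks at data values.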

Applying the Learners

homes.com

Source element: area
Values: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA

[Figure: the Name Matcher and Frequency Learner make predictions, which the Meta-learner combines. Meta-learner outputs shown: address, address, description, address. The Combiner merges these into a single mediated-schema prediction for area: address.]
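The final combining step, which turns several per-value predictions into one label for the source element, might be sketched as a majority vote. The slides do not specify the combiner's rule, so majority voting and the function name are assumptions made for illustration.

```python
from collections import Counter


def combine_instance_predictions(labels):
    """Pick the most frequent label and report its share of the votes."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# Per-value predictions for the element "area", as in the example above.
per_value = ["address", "address", "description", "address"]
label, share = combine_instance_predictions(per_value)
```

For the four values of area, three of the four predictions agree, so the element is matched to address.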

The LSD System

Base learners/modules
– name matcher
– Naive Bayesian learner
– Whirl nearest-neighbor classifier [Cohen&Hirsh-KDD98]
– county-name recognizer

Meta-learner
– uses stacking [Ting&Witten99, Wolpert92]
– uses training data to learn weights for base learners
– combines predictions using confidence scores/weights

Experiments

Source                Coverage    # of matchable leaf elements   Best single learner   LSD
realestate.yahoo      national    31                             63%                   77%
homeseekers.com       national    31                             52%                   64%
nkymls.com            Kentucky    28                             64%                   75%
texasproperties.com   Texas       42                             59%                   62%
windermere.com        Northwest   35                             55%                   63%

Related Work

Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
– utilize only schema information

Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability

Multi-strategy learning in other domains
– series of workshops [91,93,96,98,00]
– [Freitag98], Proverb [Keim et al. 99]

Summary

Schema matching
– automated by learning

Multi-strategy learning is essential
– handles different types of data
– incorporates different types of domain knowledge
– easy to incorporate new learners
– alleviates effects of noise & dirty data

Implemented LSD
– promising results with initial experiments

Future Work

source descriptions
– schema matching
– data translation
– scope
– completeness
– reliability
– query capability

leaf elements, higher-level elements

1-1 mappings, complex mappings
