AnHai Doan
Pedro Domingos
Alon Levy
Department of Computer Science & Engineering
University of Washington
Learning Source Descriptions for Data Integration
2
Overview
Problem definition
– schema matching
Solution
– multi-strategy learning
Prototype system
– LSD (Learning Source Descriptions)
Experiments
Related work
Summary & future work
3
Data Integration
Find houses with four bathrooms and price under $500,000
[Figure: the query is posed against the mediated schema; each source (superhomes.com, realestate.com, homeseekers.com) has its own source schema and is reached through a wrapper]
4
Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure: two DTD trees, each rooted at house. Elements shown: location, contact-info, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone]
5
Map of the Problem
source descriptions
schema matching, data translation, scope, completeness, reliability, query capability
leaf elements, higher-level elements
1-1 mappings, complex mappings
6
Current State of Affairs
Largely done by hand
– labor intensive & error prone
– key bottleneck in building applications
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
Need automatic approaches to scale up!
7
Our Approach
Use machine learning to match schemas
Basic idea
1. create training data
– manually map a set of sources to mediated schema
2. train system on training data
– learns from
  – name of schema elements
  – format of values
  – frequency of words & symbols
  – characteristics of value distribution
  – proximity, position, structure, ...
3. system proposes mappings for subsequent sources
8
Example
realestate.com
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
mediated schema: address, phone, price, description
location: Seattle, WA; Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
comments: Fantastic house ...; Great ...; Hurry! ...; ...
9
Multi-Strategy Learning
Use a set of base learners
– each exploits certain types of information
Match schema elements of a new source
– apply the learners
– combine their predictions using a meta-learner
Meta-learner
– measures base learner accuracy on training data
– weighs each learner based on its accuracy
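The weighting idea on this slide can be illustrated with a minimal sketch. It assumes each base learner returns a dictionary of confidence scores per mediated-schema label, and that a learner's weight is simply its accuracy on labeled training examples. The function names, learner names, and numeric weights below are hypothetical; the slides do not specify the exact weighting formula.

```python
from collections import defaultdict

def learner_accuracy(predict, labeled_examples):
    """Fraction of labeled examples whose true label gets the top score."""
    correct = 0
    for example, true_label in labeled_examples:
        scores = predict(example)
        if max(scores, key=scores.get) == true_label:
            correct += 1
    return correct / len(labeled_examples)

def combine(predictions, weights):
    """Weighted sum of each learner's confidence scores, normalized to sum to 1."""
    combined = defaultdict(float)
    for learner_name, scores in predictions.items():
        for label, confidence in scores.items():
            combined[label] += weights[learner_name] * confidence
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}

# Hypothetical weights (e.g. accuracies measured on training data).
weights = {"name_matcher": 0.6, "frequency_learner": 0.9}

# Hypothetical predictions from two base learners for one source element.
predictions = {
    "name_matcher": {"name": 0.7, "phone": 0.3},
    "frequency_learner": {"phone": 0.8, "name": 0.2},
}

combined = combine(predictions, weights)
best_label = max(combined, key=combined.get)
```

In this toy case the more accurate learner outvotes the less accurate one, so the combined prediction favors phone even though the name matcher preferred name.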
10
Learners
Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...
Output
– prediction weighted by confidence score
Examples
– Name matcher
  – agent-name => (name,0.7), (phone,0.3)
– Frequency learner
  – “Seattle, WA” => (address,0.8), (name,0.2)
  – “Great location ...” => (description,0.9), (address,0.1)
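To make the input/output contract of a base learner concrete, here is a rough sketch of a name matcher. It is not LSD's actual matcher, whose internals the slides do not describe: it assumes plain string similarity (Python's `difflib.SequenceMatcher`) between the source element name and each mediated-schema label, normalized into confidence scores. The function name `name_matcher` is hypothetical.

```python
from difflib import SequenceMatcher

def name_matcher(element_name, mediated_labels):
    """Score each mediated-schema label by string similarity to the element name.

    Returns a dict of label -> confidence, with confidences summing to 1.
    """
    raw = {
        label: SequenceMatcher(None, element_name.lower(), label.lower()).ratio()
        for label in mediated_labels
    }
    total = sum(raw.values())
    if total == 0:
        # No similarity to any label: fall back to a uniform distribution.
        return {label: 1.0 / len(mediated_labels) for label in mediated_labels}
    return {label: score / total for label, score in raw.items()}

scores = name_matcher("agent-phone", ["address", "phone", "price", "description"])
best_label = max(scores, key=scores.get)
```

A real matcher would likely also use synonyms and the names seen in training sources; pure string similarity cannot relate, say, comments to description.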
11
Training the Learners
realestate.com
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
mediated schema: address, phone, price, description
source elements: location, listed-price, agent-phone, comments
Name Matcher
(location, address)
(agent-phone, phone)
(listed-price, price)
(comments, description)
...
Frequency Learner
(“Seattle, WA”, address)
(“(206) 729 0831”, phone)
(“$ 250,000”, price)
(“Fantastic house ...”, description)
...
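The (value, label) training pairs for the frequency learner can be turned into a classifier in several ways. The sketch below assumes a naive Bayes model over word and symbol frequencies with add-one smoothing; the tokenizer, the class name `FrequencyLearner`, and the choice to ignore unseen tokens are illustrative assumptions, not details taken from the slides. The training values are the ones shown on the example slides.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z]+|\d+|[^\sa-z\d]")

def tokenize(text):
    """Split into words, digit runs (mapped to '<num>'), and single symbols."""
    return ["<num>" if t.isdigit() else t for t in TOKEN.findall(text.lower())]

class FrequencyLearner:
    def fit(self, labeled_values):
        """Count token frequencies per mediated-schema label."""
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for value, label in labeled_values:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(value))
        self.vocabulary = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, value):
        """Return label -> confidence (naive Bayes posterior, add-one smoothing)."""
        # Assumption: tokens never seen in training carry no evidence.
        tokens = [t for t in tokenize(value) if t in self.vocabulary]
        n_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n / n_examples)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {l: math.exp(s - top) for l, s in log_scores.items()}
        total = sum(unnorm.values())
        return {l: v / total for l, v in unnorm.items()}

training_pairs = [
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house ...", "description"), ("Great ...", "description"),
]
learner = FrequencyLearner().fit(training_pairs)
prediction = learner.predict("Kent, WA")
best_label = max(prediction, key=prediction.get)
```

Even though "Kent" never appears in training, the comma and the state abbreviation are enough to push the value toward address, which is the kind of signal "frequency of words & symbols" refers to.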
12
Applying the Learners
homes.com
mediated schema: address, phone, price, description
source element: area
values: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
Name Matcher + Frequency Learner => Meta-learner (one prediction per value): address, address, description, address
Combiner => address
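The last step, turning per-value predictions into one prediction for the whole source element, might look like the following sketch. The slide does not say how the combiner works; averaging the confidence scores over all values of the element is an assumption made here, and the confidence numbers are invented for illustration.

```python
from collections import defaultdict

def combine_instances(instance_predictions):
    """Average label confidences over all data values of one source element."""
    if not instance_predictions:
        raise ValueError("need at least one instance prediction")
    totals = defaultdict(float)
    for scores in instance_predictions:
        for label, confidence in scores.items():
            totals[label] += confidence
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Hypothetical meta-learner outputs for the four values of the element "area".
per_value = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.7, "description": 0.3},  # Kent, WA
    {"address": 0.4, "description": 0.6},  # Austin, TX (misclassified)
    {"address": 0.9, "description": 0.1},  # Seattle, WA
]
element_scores = combine_instances(per_value)
element_label = max(element_scores, key=element_scores.get)
```

Aggregating over values lets a single misclassified value (the third one here) be outvoted by the rest of the column.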
13
The LSD System
Base learners/modules
– name matcher
– Naive Bayesian learner
– Whirl nearest-neighbor classifier [Cohen&Hirsh-KDD98]
– county-name recognizer
Meta-learner
– uses stacking [Ting&Witten99, Wolpert92]
– uses training data to learn weights for base learners
– combines predictions using confidence scores/weights
14
Experiments
Sources                Coverage     # of Matchable Leaf Elements    Best Single Learner    LSD
realestate.yahoo       national     31                              63%                    77%
homeseekers.com        national     31                              52%                    64%
nkymls.com             Kentucky     28                              64%                    75%
texasproperties.com    Texas        42                              59%                    62%
windermere.com         Northwest    35                              55%                    63%
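As a quick reading aid for the table, the per-source gain of LSD over the best single learner and the averages across the five sources can be computed directly from the reported percentages. The variable names are illustrative, and the unweighted average across sources is a choice made here, not a figure from the slides.

```python
# (source, best single learner %, LSD %) as reported in the table above.
results = [
    ("realestate.yahoo", 63, 77),
    ("homeseekers.com", 52, 64),
    ("nkymls.com", 64, 75),
    ("texasproperties.com", 59, 62),
    ("windermere.com", 55, 63),
]

# Percentage-point gain of LSD over the best single learner, per source.
gains = {source: lsd - best for source, best, lsd in results}

mean_best = sum(best for _, best, _ in results) / len(results)
mean_lsd = sum(lsd for _, _, lsd in results) / len(results)
mean_gain = sum(gains.values()) / len(gains)

for source, gain in gains.items():
    print(f"{source}: +{gain} points")
print(f"average: best single learner {mean_best:.1f}%, LSD {mean_lsd:.1f}%")
```

The gains range from 3 points (texasproperties.com) to 14 points (realestate.yahoo), with LSD ahead on every source.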
15
Related Work
Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
– utilize only schema information
Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability
Multi-strategy learning in other domains
– series of workshops [91,93,96,98,00]
– [Freitag98], Proverb [Keim et al. 99]
16
Summary
Schema matching
– automated by learning
Multi-strategy learning is essential
– handles different types of data
– incorporates different types of domain knowledge
– easy to incorporate new learners
– alleviates effects of noise & dirty data
Implemented LSD
– promising results with initial experiments
17
Future Work
source descriptions
schema matching, data translation, scope, completeness, reliability, query capability
leaf elements, higher-level elements
1-1 mappings, complex mappings