Putting Context into Schema Matching
Philip Bohannon*, Yahoo! Research
Eiman Elnahrawy*, Rutgers University
Wenfei Fan, Univ of Edinburgh and Bell Labs
Michael Flaster*, Google
*Work performed at Lucent Technologies -- Bell Laboratories.
Slide 2
Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusions
Slide 3
Schema Matching vs. Schema Mapping
RS.Person: First, Last, City
RT.Student: Name, Address, City, ...
Source Schema: RS
Target Schema: RT
Arrows inferred based on meta-data or sample instance data
Associated confidence score (.88, .93, .97 on the arrows shown)
Meaning (variant of): RS.Person.City RT.Student.City
Schema Matching means “computer-suggested arrows”
Slide 4
Schema Mapping: “From Arrows to Queries”
RS.Person: First, Last, City
RT.Student: Name, Address, City, ...
Given a set of arrows and user input, produce a query that maps instances of RS into instances of RT
Transformations, joins [Miller, Haas, Hernandez, VLDB 2000] added by, or with help from, the user
Most of this talk is about matching, some implications for mapping later
Q: RS -> RT
select concat(First, ' ', Last) as Name,
       City as City
from RS.Person, RS.Education, …
where …
Slide 5
Motivation: inventory mapping example
RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
RT.book(title: string, isbn: string, price: float, format: string)
RT.music(title: string, asin: string, price: float, label: string, sale: float)
Consider integrating two inventory schemas
Books, music in separate tables in RT
Run some nice schema match software
Slide 6
Inventory where clause
RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
RT.book(title: string, isbn: string, price: float, format: string)
RT.music(title: string, asin: string, price: float, label: string, sale: float)
The lines are helpful (schema matching is a best-effort affair), but…
lines are semantically correct only in the context of a selection condition
where type = 1 (arrows into RT.book)
where type = 2 (arrows into RT.music)
Slide 7
Definition and Goals
Contextual schema match: An arrow between source and target schema elements, annotated with a selection condition
– In a standard schema match, the condition “true” is always used
Goal: Adapt instance-driven schema matching techniques to infer semantically valid contextual schema matches, and create schema maps from those matches
RS.aa RT.bb true M
RS.aa RT.bb RS.c=3 M
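One way to picture the definition is as a small record type. The sketch below is illustrative only: the class name `ContextualMatch`, its fields, and the attribute names `RS.a`, `RT.b` and `c` are assumptions made for the example, not names from the talk.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Row = Dict[str, Any]


@dataclass(frozen=True)
class ContextualMatch:
    source: str                       # source attribute, e.g. "RS.a"
    target: str                       # target attribute, e.g. "RT.b"
    condition: str = "true"           # selection condition in readable form
    applies: Callable[[Row], bool] = lambda row: True  # same condition as a predicate


# A standard match is the special case whose condition is always true.
standard = ContextualMatch("RS.a", "RT.b")
contextual = ContextualMatch("RS.a", "RT.b", "RS.c = 3", lambda row: row["c"] == 3)

# Only source tuples satisfying the condition take part in the match.
sample = [{"a": "x", "c": 3}, {"a": "y", "c": 1}]
selected = [row["a"] for row in sample if contextual.applies(row)]
```

Keeping the condition both as text and as a predicate is a design choice of this sketch: the text is what a user would review, the predicate is what restricts the sample.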
Slide 8
Attribute promotion example
Consider integrating data about grade assignments [Fletcher, Wyss, SIGMOD 2005 demo]
Again context is needed, but semantics are slightly different: attribute promotion
Source table:
Name  Assgn  Grade
Joe   1      84
Joe   2      86
Joe   3      75
Mary  1      92
Mary  2      94
Mary  3      85
Target table columns: Name, Grade1, Grade2, Grade3, …
Grade -> Grade1 where Assgn=1
Grade -> Grade2 where Assgn=2
Grade -> Grade3 where Assgn=3, …
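Attribute promotion can be sketched as a pivot: each value of Assgn selects the tuples that feed one target column. The snippet below is a minimal illustration, assuming the grade values read off the slide pair up with names and assignments as listed; the function name `promote` is hypothetical.

```python
# (Name, Assgn, Grade) tuples; the row pairing is an assumption of this sketch.
rows = [("Joe", 1, 84), ("Joe", 2, 86), ("Joe", 3, 75),
        ("Mary", 1, 92), ("Mary", 2, 94), ("Mary", 3, 85)]


def promote(rows):
    """Turn each Assgn value into its own GradeN column, one output row per name."""
    out = {}
    for name, assgn, grade in rows:
        # "where Assgn = k" decides which target column receives the grade.
        out.setdefault(name, {})[f"Grade{assgn}"] = grade
    return out


wide = promote(rows)
```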
Slide 9
Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusion
Slide 10
Background: Instance-level matching
RS.ac RT1.bb true M
RS.ac RT1.ac true M
San Jose, Cupertino, Palo Alto, Gilroy, Pleasanton, Sunnyvale
Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego
Nice match!
(408) 123-4456, (212) 223-3455, (408) 123-2222, (408) 324-4444
Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego
Dubious, at best!
Slide 11
Background: Instance-level matching
RS.ac RT1.bb true M
RS.ac RT1.ac true M
Perfect match!
Dubious, at best!
Bayesian Tri-gram
Type Expert
String Edit Distance
Cosine Similarity
Whatever
More Whatever
Coming up with a good score is far from simple!
• Derive comparable scores across sample size, data types, etc.
Slide 12
StandardMatch(RS,RT,)
RS.ac RT1.ac true M
RS.ba RT1.cd true M
RS.ba RT1.sb true M
RS.db RT1.ar true M
RS.ac RT1.vw true M
RS.bd RT1.ad true M
1. Consider all |RS||RT| matches, score them, normalize the scores
RS.ac RT1.ac true M
RS.ba RT1.cd true M
RS.ba RT1.sb true M
RS.db RT1.ar true M
RS.ac RT1.vw true M
RS.bd RT1.ad true M
2. Rank by normalized score
3. Apply the confidence threshold as a cutoff, and return
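The three steps can be sketched in a few lines. This is a minimal illustration, not the matcher used in the talk: Jaccard overlap stands in for the real instance-level scorers, normalization divides by the top score (the slide does not say how scores are normalized), and the names `standard_match`, `jaccard` and `threshold` are assumptions.

```python
def jaccard(a, b):
    """Stand-in scorer: overlap between two sets of sample values."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def standard_match(source, target, score, threshold):
    """source/target map attribute name -> list of sample values."""
    # 1. Score every (source, target) attribute pair, then normalize.
    raw = {(s, t): score(sv, tv)
           for s, sv in source.items() for t, tv in target.items()}
    top = max(raw.values(), default=0.0) or 1.0    # avoid dividing by zero
    # 2. Rank by normalized score.
    ranked = sorted(((v / top, s, t) for (s, t), v in raw.items()), reverse=True)
    # 3. Apply the threshold as a cutoff.
    return [(s, t, v) for v, s, t in ranked if v >= threshold]


source = {
    "city": ["San Jose", "Cupertino", "Palo Alto", "Gilroy", "Pleasanton", "Sunnyvale"],
    "phone": ["(408) 123-4456", "(212) 223-3455", "(408) 123-2222", "(408) 324-4444"],
}
target = {"city": ["Sunnyvale", "Los Angeles", "Cupertino", "Gilroy", "San Diego"]}

matches = standard_match(source, target, jaccard, threshold=0.5)
```

With the sample values from the earlier slide, only the city-to-city pair survives the cutoff; the phone column shares no values with the target.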
Slide 13
Background: Categorical Attributes
What attributes are candidates for the where clause?
We focus on “categorical” attributes (leaving non-categorical attributes as future work)
If not identified by schema, infer from sample data, as any attribute with
– more than 1 value
– most values associated with more than one tuple
RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
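The two inference rules can be checked directly on a sample column. The sketch below reads "most values" as "more than half of the distinct values", which is an assumption; the function name `is_categorical` and the sample columns are illustrative.

```python
from collections import Counter


def is_categorical(values):
    """Apply the two sample-based rules to one column of values."""
    counts = Counter(values)
    if len(counts) <= 1:                  # rule 1: more than one distinct value
        return False
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated > len(counts) / 2     # rule 2: most values occur in several tuples


sample = {
    "id": [0, 1, 2, 3, 4],
    "name": ["leaves of grass", "the white album", "heart of darkness",
             "wasteland", "hotel california"],
    "type": [1, 2, 1, 1, 2],
    "instock": ["y", "y", "n", "y", "n"],
}
categorical = [attr for attr, vals in sample.items() if is_categorical(vals)]
```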
Slide 14
Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusion
Slide 15
Strawman Algorithm
1. Use instance-based matching algorithm to compute a set of matches, L = M1..Mn, along with associated scores
2. For each Mi in L, of the form (RS.s,RT.t,true)
   For each categorical attribute c in the source (or target)
     For each value v taken by c in the sample
       1. Restrict the sample of RS to tuples where c=v
       2. Re-compute the match score on the new sample
3. For c,v that most improves score, replace Mi with (RS.s,RT.t,c=v)
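The loop structure above can be sketched as follows. This is an illustration under stated assumptions: Jaccard overlap stands in for the instance-based scorer, step 1 is taken as given (the match list is passed in), and the names `strawman`, `rows` and `categorical` are hypothetical.

```python
def jaccard(a, b):
    """Stand-in scorer: overlap between two sets of sample values."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def strawman(rows, target, matches, categorical, score):
    """rows: source sample as dicts; matches: (source_attr, target_attr) pairs."""
    result = []
    for s, t in matches:
        # Start from the standard match, whose condition is "true".
        best = ("true", score([r[s] for r in rows], target[t]))
        for c in categorical:
            for v in sorted({r[c] for r in rows}):
                restricted = [r[s] for r in rows if r[c] == v]   # sample where c = v
                sc = score(restricted, target[t])                # re-computed score
                if sc > best[1]:
                    best = (f"{c}={v}", sc)
        result.append((s, t, best[0], best[1]))
    return result


rows = [{"code": "0195128", "type": 1}, {"code": "B002UAX", "type": 2},
        {"code": "0486611", "type": 1}, {"code": "B002GVO", "type": 2}]
target = {"isbn": ["0195128", "0486611", "039995"]}

out = strawman(rows, target, [("code", "isbn")], ["type"], jaccard)
```

Here restricting to `type=1` raises the overlap with the ISBN column from 0.4 to 2/3, so the condition replaces "true" for that match.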
Slide 16
ContextMatch(RS,RT,)
RS.ac RT1.ac true M
RS.ba RT1.cd true M
RS.ba RT2.sb true M
RS.db RT2.ar true M
RS.ac RT1.vw true M
RS.bd RT1.ad true M
StandardMatch…
2. Rank by normalized score
3. Apply the confidence threshold as a cutoff, and return
4. Try each context condition
RS.c = 2
RS.d = “open”
RS.c = 2 or RS.c = 3
RS.t = 0
RS.t = 1
5. Evaluate quality of match
RS.ba RT1.cd RS.t=1 M
6. Keep the best!
Slide 17
Problems with Strawman
False Positives – the increase in the score may not be meaningful, since some random subsets of corpus will match better than the whole (even with size-adjusted metrics)
False Negatives – original matching algorithm only returned matches with quality above some threshold to be in L, but a match that didn’t make the cut may improve greatly with contextual matching
Time – with disjuncts, too many expressions to test
Slide 18
Strawman 2.0
Like Strawman, but require an improvement threshold, w, to cut down on false positives
Not much better
Setting w is problematic, as matcher scores are not perfect
Slide 19
Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusion
Slide 20
Our approach:
RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
RT.book(title: string, isbn: string, price: float, format: string)
RT.music(title: string, asin: string, price: float, label: string, sale: float)
1. Pre-filter conditions based on classification
2. Find conditions that improve several matches from the same table
Slide 21
View-oriented contextual mapping (cont’d)
RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
RT.book(title: string, isbn: string, price: float, format: string)
RT.music(title: string, asin: string, price: float, label: string, sale: float)
RS.inv where type = 2 (id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
RS.inv where type = 1 (id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
Slide 22
Algorithm ContextMatch(RS,RT,)
L = ∅;
M = StandardMatch(RS,RT,);
C = InferCandidateViews(RS,M,EarlyDisjuncts);
for c ∈ C do
  Vc = select * from RS where c;
  for m ∈ M do
    m’ := m with RS replaced by Vc;
    s := ScoreMatch(m’);
    L = L ∪ {(m’,s)};
return SelectContextualMatches(M, L, EarlyDisjuncts)
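The pseudocode translates almost line for line into Python if the four sub-procedures are passed in as functions. The sketch below does that; the toy stand-ins at the bottom exist only to make it executable and are not the procedures from the talk, and the threshold argument is assumed to be folded into the `standard_match` stand-in.

```python
def jaccard(a, b):
    """Stand-in scorer: overlap between two sets of sample values."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def context_match(rows, target, standard_match, infer_candidate_views,
                  score_match, select_contextual_matches, early_disjuncts=True):
    scored = []                                      # L = empty set
    matches = standard_match(rows, target)           # M
    candidates = infer_candidate_views(rows, matches, early_disjuncts)  # C
    for label, predicate in candidates:
        view = [r for r in rows if predicate(r)]     # Vc = select * from RS where c
        for s, t in matches:                         # m' = m with RS replaced by Vc
            scored.append(((s, t, label), score_match(view, target, s, t)))
    return select_contextual_matches(matches, scored, early_disjuncts)


# Toy stand-ins, only to make the sketch executable.
rows = [{"code": "0195128", "type": 1}, {"code": "B002UAX", "type": 2},
        {"code": "0486611", "type": 1}, {"code": "B002GVO", "type": 2}]
target = {"isbn": ["0195128", "0486611", "039995"]}

result = context_match(
    rows, target,
    standard_match=lambda rows, target: [("code", "isbn")],
    infer_candidate_views=lambda rows, m, early: [
        ("type = 1", lambda r: r["type"] == 1),
        ("type = 2", lambda r: r["type"] == 2)],
    score_match=lambda view, target, s, t: jaccard([r[s] for r in view], target[t]),
    select_contextual_matches=lambda m, scored, early: max(scored, key=lambda p: p[1]),
)
```

Treating the matcher and scorer as parameters mirrors the framework's stated aim of using the instance-level match technique as a black box.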
Slide 23
ContextMatch(RS,RT,)
RS.ac RT1.ac true M
RS.ba RT1.cd true M
RS.ba RT2.sb true M
RS.db RT2.ar true M
RS.ac RT1.vw true M
RS.bd RT1.ad true M
StandardMatch…
2. Rank by normalized score
3. Apply the confidence threshold as a cutoff, and return
InferCandidateViews
RS.c = 2
RS.d = “open”
RS.c = 2 or RS.c = 3
RS.t = 0
RS.t = 1
For each candidate view V,
4. Re-compute summaries for V as: “select * from RS where RS.t = 1”
5. Evaluate quality of matches
RS.ba RT1.cd RS.t=1 M
Slide 24
How to Filter Candidate Views
Naïve
– Any Boolean condition involving a categorical attribute (strawman approach)
SourceClassifier, TargetClassifier
– Check for categorical attributes that do a “good job” categorizing other attributes
Disjunct Handling (early or late)
Conjunct Handling
Slide 25
RS.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date)
id  name               type  instock  code     descr
0   leaves of grass    1     y        0195128  hardcover
1   the white album    2     y        B002UAX  audio cd
2   heart of darkness  1     n        0486611  paperback
3   wasteland          1     y        039995   paperback
4   hotel california   2     n        B002GVO  electra
Source Classifier Intuition
how well do the categorical attributes serve as classifier labels for the other attributes?
Slide 26
Source Classifier Intuition: type
how about ‘type’?
Slide 27
Source Classifier Intuition: instock
how about ‘instock’?
Slide 28
What do we really mean by a “good job”?
Split the sample into a training set and a testing set (randomly)
For each categorical attribute C and non-categorical attribute A
– Train a classifier H by treating the value of A as the document and the value of C as the label
– Test H against test set, determine precision, p, and recall, r
– Score(C) w.r.t. A based on combination of precision and recall (F = 2pr/(p+r))
– Compare Score(C) to Score(NC), where NC is a Naïve Classifier:
• This classifier chooses most frequent label
– C does a good job with H if H’s improvement over Naïve is statistically significant with 95% confidence
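The procedure can be sketched end to end with a small classifier. Several pieces below are stand-ins chosen for brevity rather than the method from the talk: H is a character-trigram naive Bayes model, accuracy replaces the F score, and the significance check is a one-sided binomial test against the most-frequent-label baseline. All function names and the sample codes are illustrative.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    padded = f"  {text} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(examples):
    """examples: (value of A, label from C) pairs. Returns a naive Bayes model."""
    label_counts = Counter(label for _, label in examples)
    gram_counts = defaultdict(Counter)
    for value, label in examples:
        gram_counts[label].update(trigrams(value))
    vocab = {g for counts in gram_counts.values() for g in counts}
    return label_counts, gram_counts, len(vocab)


def predict(model, value):
    label_counts, gram_counts, vocab_size = model
    total = sum(label_counts.values())

    def log_prob(label):
        denom = sum(gram_counts[label].values()) + vocab_size + 1  # add-one smoothing
        lp = math.log(label_counts[label] / total)
        return lp + sum(math.log((gram_counts[label][g] + 1) / denom)
                        for g in trigrams(value))

    return max(sorted(label_counts), key=log_prob)


def binomial_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def does_good_job(train_set, test_set, alpha=0.05):
    model = train(train_set)
    correct = sum(predict(model, v) == label for v, label in test_set)
    majority = Counter(l for _, l in train_set).most_common(1)[0][0]  # naive classifier
    baseline = sum(label == majority for _, label in test_set) / len(test_set)
    return binomial_tail(correct, len(test_set), baseline) < alpha


# A = code, C = type: ISBN-like codes are type 1, ASIN-like codes are type 2.
train_set = [("0195128", 1), ("0486611", 1), ("0399950", 1), ("0140449", 1),
             ("B002UAX", 2), ("B002GVO", 2), ("B000XYZ", 2), ("B001ABC", 2)]
test_set = [("0192834", 1), ("0486275", 1), ("0141439", 1), ("0399501", 1),
            ("0195301", 1), ("B002QRS", 2), ("B000DEF", 2), ("B001GHI", 2),
            ("B002JKL", 2), ("B000MNO", 2)]

type_is_useful = does_good_job(train_set, test_set)
```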
Slide 29
id  name               type  instock  code     descr
0   leaves of grass    1     y        0195128  hardcover
1   the white album    2     y        B002UAX  audio cd
2   heart of darkness  1     n        0486611  paperback
3   wasteland          1     y        039995   paperback
4   hotel california   2     n        B002GVO  electra
Target Classifier Intuition
Train a new classifier, T, treating each target schema attribute as a class of documents
Check source values against this classifier
Label each value with best guess label
Use labels instead of values in the same framework
Example best-guess labels: Book.comment, Book.comment, Music.label
Slide 30
Handling Disjunctive Conditions
Why Disjuncts? What if type field had separate categories for hardback and paperback?
Two approaches to handling disjunctive conditions, “early” and “late”
Early Disjuncts
– InferCandidateViews is responsible for identifying “interesting” disjuncts
– Each interesting disjunct is evaluated separately, no overlapping conditions are output
Late Disjuncts
– InferCandidateViews returns no disjuncts
– All high-scoring conditions are unioned together (Clio semantics), effectively creating a disjunct
Slide 31
Early Disjuncts: A Heuristic Approach
When evaluating trained classifier on test set for some categorical attribute C, make note of misclassifications of the form “should be A, but guessed B”
Consider merging the (A,B) pair that would repair most errors
– by merge, we mean “replace” A and B values with (A,B)
Re-evaluate
Repeat
Keep all alternatives formed this way that score well
Only accept 1 view that mentions attribute C (don’t union)
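One round of the merge heuristic can be sketched as below. It assumes that errors in both directions ("should be A, guessed B" and the reverse) count toward the same pair, since merging repairs both; the function names and labels are illustrative.

```python
from collections import Counter


def best_merge(true_labels, guessed_labels):
    """Return the label pair whose merge would repair the most misclassifications."""
    errors = Counter(frozenset((t, g))
                     for t, g in zip(true_labels, guessed_labels) if t != g)
    if not errors:
        return None
    # Highest error count wins; ties are broken by label order to stay deterministic.
    pair, _ = max(errors.items(), key=lambda kv: (kv[1], sorted(kv[0])))
    return tuple(sorted(pair))


def merge_labels(labels, pair):
    """Replace both values of the pair by a single disjunctive value."""
    merged = " or ".join(pair)
    return [merged if label in pair else label for label in labels]


true = ["BOOK1", "BOOK2", "BOOK1", "CD1", "CD2", "BOOK2"]
guessed = ["BOOK2", "BOOK1", "BOOK1", "CD1", "CD1", "BOOK2"]

pair = best_merge(true, guessed)
relabeled = merge_labels(true, pair)
```

After relabeling, the classifier would be re-trained and re-evaluated, and the round repeated while the score keeps improving.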
Slide 32
Handling Conjuncts
Proposed Approach:
– Assumes that a good conjunctive view has a good disjunctive view as one of the terms in the conjunct.
Run Context Match Repeatedly
At stage i, consider views VC identified by the previous (i-1)th run as the input base tables
– where C was the select condition defining the view
When considering candidate attributes for a run, only consider categorical attributes not in C.
(Conjunct handling not in current experiments)
Slide 33
Selecting Contextual Matches
Each view V based on condition c is evaluated, rather than each match
Compute overall confidence of matches from V, and compare to overall confidence from base table
If overall confidence is better than w, use V instead of the base table
If more than one qualifies
– If EarlyDisjunct, choose the best
– Else, take all that are over w
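The selection rule can be sketched as a small function. It assumes "better than w" means the view's overall confidence exceeds the base table's by more than w, consistent with w being introduced as an improvement threshold; the function name and the confidence numbers are made up for the example.

```python
def select_views(base_confidence, view_confidence, w, early_disjunct):
    """view_confidence maps a condition to the overall confidence of its view."""
    qualifying = {cond: conf for cond, conf in view_confidence.items()
                  if conf - base_confidence > w}
    if not qualifying:
        return ["true"]                                  # keep the base table
    if early_disjunct:
        return [max(qualifying, key=qualifying.get)]     # choose the best view
    return sorted(qualifying)                            # take all views over w


views = {"type = 1": 0.80, "type = 2": 0.75, "instock = y": 0.52}

early = select_views(0.5, views, w=0.1, early_disjunct=True)
late = select_views(0.5, views, w=0.1, early_disjunct=False)
```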
Slide 34
Comments on Schema Mapping
Seek to apply the Clio ([Popa et al, VLDB 2002]) approach to mapping construction
Create ‘logical tables’ based on key-foreign key constraints
Two challenges
– Extend notion of foreign-key constraints in context of selection views, undecidability result
– Extend join rules of [Popa et al, VLDB 2002] to handle the selection views
See paper for details
Slide 35
Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusion
Slide 36
Experimental Study
Used schemas from the retail domain
– schemas created by students at UW
• Aaron, Ryan, Barrett
– Populated code, descr info by scraping web-sites, used some name data from Illinois Semantic Integration Archive
ItemType is split, so that instead of just CD, BOOK
– e.g. CD1, CD2, BOOK1, BOOK2, =4
Compare matched edges to correct edges
– Accuracy: how many of BOOKi edges go to book target table?
– Precision: of the BOOKi edges, how many go to book target?
– Fmeas: 2(Accuracy * Precision)/(Accuracy + Precision)
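The Fmeas formula is the harmonic mean of the two quantities. A tiny helper, with a guard for the zero case that the slide does not discuss (an assumption of this sketch), shows how it behaves:

```python
def f_measure(accuracy, precision):
    """Harmonic mean of accuracy and precision, as in the Fmeas formula."""
    if accuracy + precision == 0:
        return 0.0            # assumed convention when both inputs are zero
    return 2 * (accuracy * precision) / (accuracy + precision)


balanced = f_measure(0.8, 0.8)
lopsided = f_measure(0.5, 1.0)
```

The harmonic mean penalizes imbalance: a method with 0.5 on one measure and 1.0 on the other scores below the arithmetic mean of 0.75.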
Slide 37
View improvement threshold: w
[Plots: one panel each for the Aaron, Barrett and Ryan schemas]
How sensitive is technique to w?
Depends on disjunct strategy
Easier to pick w with EarlyDisjunct
Slide 38
Strawman
Strawman means
– Late disjunct (EarlyDisjunct=false)
– Pick best arrow from each source attribute on per-attribute basis (MultiTable)
Slide 39
Sensitivity to Decoy Categorical Attributes
[Plots: EarlyDisjunct, LateDisjunct]
Add 3 extra categorical attributes
Vary their correlation with ItemType (higher correlation makes it harder)
Naïve is not only slow, it is overly confusing to the quality metrics
EarlyDisjunct heuristic based on classification helps with quality
Slide 40
Varying schema size
Add n non-categorical attributes to every table, all taken from same domain
Add n/4 categorical attributes to tables with categorical attributes
Early dip is before non-categorical attributes match each other
Slide 41
Runtime as schema gets larger
Same experiment, compare runtimes
TgtClass is somewhat higher quality (not shown), but takes much longer for large schemas
Slide 42
Grades Example
Create an experiment based on grades example
Artificial data
– mean of assignment I is 40 + 10(I-1) (as grades improve)
– standard deviation is varied
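[Figure: the grades tables from the attribute promotion example (source columns Name, Assgn, Grade; target columns Name, Grade1, Grade2, Grade3, …; names shown: Joe, Mary, Bob, Sue), with arrows annotated where Assgn=1, where Assgn=2, =3, …]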
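The artificial grades data can be generated along these lines. The slide gives only the mean per assignment and says the standard deviation is varied; drawing from a normal distribution, the seed, and the function names are assumptions of this sketch.

```python
import random


def assignment_mean(i):
    """Mean grade of assignment i: 40 + 10(i - 1), so grades improve over time."""
    return 40 + 10 * (i - 1)


def make_grades(names, n_assignments, std_dev, seed=0):
    """Return (Name, Assgn, Grade) tuples; normal noise is an assumed choice."""
    rng = random.Random(seed)
    return [(name, i, rng.gauss(assignment_mean(i), std_dev))
            for name in names
            for i in range(1, n_assignments + 1)]


noisy = make_grades(["Joe", "Mary", "Bob", "Sue"], 3, std_dev=5.0)
exact = make_grades(["Joe", "Mary"], 3, std_dev=0.0)
```

As the standard deviation grows, the grade distributions of different assignments overlap more, which makes the condition on Assgn harder to infer from values alone.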
Slide 43
Grades accuracy as std. dev increases
Slide 44
Overview
1. Motivation
2. Background
3. Strawman
4. Framework
5. Experimental Evaluation
6. Related Work
7. Conclusion
Slide 45
Related Work
Instance level schema matching
– Survey [Rahm, Bernstein, VLDB Journal 2001], Coma [Do, Rahm, VLDB02], Coma++ [SIGMOD 05], iMAP [Doan et al, SIGMOD 01], Cupid [Madhavan, Bernstein, Rahm, VLDB 01], etc.
Schema mapping
– Clio [Popa, et al, VLDB 02], [Haas et al, SIGMOD 2005], etc
– Model Management (many papers)
Overcoming heterogeneity during match process
– Schema Mapping as Query Discovery [Miller, Haas, Hernandez, VLDB 2000] - present user with examples to derive join conditions
– MIQIS [Fletcher, Wyss, (demo) SIGMOD 2005] - search through a large space of schema transformations (beyond what is given here), but requires the same data to appear in both source and target
– We focus on inferring selection views only, but are very compatible with existing schema match work
Slide 46
Conclusions
Contributions
– Introduced contextual matching as an important extension to schema matching
– Defined a general framework in which instance-level match technique is treated as a black box
– Identified two techniques based on classification to find good conditions
– Identified filtering criteria for contextual matches
– Define contextual foreign key and new join rules to extend a Clio-style schema mapper to better handle contextual matches
– Experimental study illustrating time/quality tradeoffs
Future Work
– More complex view conditioning (horizontal partitioning + attribute promotion)
– Consider taking constraints on target into account in quality functions
The End
Thank you, any questions?
Slide 48
[Figure: sizes_fmeas.eps]
Slide 49
Standard Match Algorithm
StandardMatch(RS,RT, )
– Evaluate quality of match between all pairs of (source, target) attributes
• Ignore complex (multi-attribute) matches for simplicity
– return matches between source table RS and target schema RT whose confidence is at or above the threshold given as the third argument
Slide 50
RS.ac RT1.ac true M
RS.ba RT1.cd true M
RS.ba RT1.sb true M
RS.db RT1.ar true M
RS.ac RT1.vw true M
RS.bd RT1.ad true M
RS.af RT1.ca true M
Slide 51
Background: Instance-level matching
Instance-level schema matching requires sample data for source and target schema
Train a variety of matchers by treating each (source, target) column as a set of documents labeled by the column name
– e.g. text matchers based on string similarity, token similarity, format similarity, number of tokens, etc, or
– numeric matchers based on value distribution, etc.
Apply source matchers to sample target data, and vice versa
Combine resulting scores (with machine-learned weightings [Doan, Domingos, Halevy, SIGMOD 2001]) to score each arrow
RS.ac RT1.bb true M
score
RS.ac RT1.bb true M
“perfect match”