On Schema Matching with Opaque Column Names and Data Values
Jaewoo Kang, NC State (Aug 2003)
Jeffrey F. Naughton, Univ. of Wisconsin-Madison
June 10, 2003, SIGMOD 2003
What is Schema Matching?
Finding semantic correspondences of schema elements across heterogeneous sources.
An old problem that is attracting new interest.
What is Schema Matching? (Cont’d)
Important for enterprise applications: data warehouses, data migration.
Also important for Internet data: virtual databases, web information systems.
Fundamental element of data integration.
No Silver Bullet!
State of the art: A collection of techniques that propose matches.
We have added a new technique to this collection that works when previous techniques don’t even apply.
Some Previous Approaches
Schema-based approaches

Site 1:
Manager  Employ  Salary
J. K.    J. D.   $50K
T. J.    N. D.   $80K
P. K.    Z. I.   $75K

Site 2:
MNG      EMP     WAGE
U. P.    D. S.   $85K
A. H.    M. H.   $75K
J. N.    D. F.   $60K
Some Previous Approaches II
Instance-based approaches

Site 1:
Dept   Employ  Phone
HR     J. D.   267-7622
R&D    N. D.   354-8736
Sales  Z. I.   219-0457

Site 2:
DPT    EMP     CONT
R&D    D. S.   387-9802
Sales  M. H.   546-3856
Adm    D. F.   326-1284
So two previous approaches
Schema-based (interpret column names)
Instance-based (interpret data values)
But what about this problem?
[Figure: two tables, Site 1 and Site 2, each with columns labeled t1 through t7 and rows of purely numeric values (for example 7.3, 3.6, 5.7); neither the column names nor the values carry an obvious interpretation]
This is the “Un-interpreted Matching” problem, the focus of this talk.
Outline of the remainder of this talk:
  Formal definition
  Terminology
  Algorithm
  Experimental results
Un-interpreted Matching
M1 = match(R(r1, r2, ..., rn), S(s1, s2, ..., sm))
M2 = match(R(r1, r2, ..., rn), S’(f1(s1), f2(s2), ..., fm(sm)))
where match = a schema matching algorithm,
Mi = {(ri, sj)}: the set of matching column pairs,
fi = an arbitrary one-to-one function.
‘match’ is an un-interpreted matching iff M1 = M2 for all fi.
Main idea: the specific tokens used to represent column names and values are not important.
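As an illustration of the main idea (a minimal sketch, not taken from the talk), a statistic that depends only on how often values occur, such as Shannon entropy, is unchanged when an arbitrary one-to-one function is applied to a column. The names `entropy` and `recode` and the sample column are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (base 2) of the values in a column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def recode(column):
    """Apply one arbitrary one-to-one function: each distinct value gets an opaque token."""
    mapping = {}
    for v in column:
        mapping.setdefault(v, f"tok{len(mapping)}")
    return [mapping[v] for v in column]

color = ["white", "silver", "red", "white"]   # hypothetical sample column
opaque = recode(color)                        # ['tok0', 'tok1', 'tok2', 'tok0']

# The entropy is the same before and after re-encoding.
same = math.isclose(entropy(color), entropy(opaque))
```

A matcher built only from such statistics cannot be affected by the choice of tokens, which is the property the definition asks for.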
Motivating example
Two Car Part Tables

Table 1:
Model  Color   Tire
XLE    white   P2R6
XG2.5  silver  XR5
LE     red     GM6
:      :       :

Table 2:
Model  C1  C2
GL3.5  a1  b1
XGL    a2  b2
XE     a3  b3
:      :   :
Background
Before introducing our algorithm, we need:
  Information Entropy
  Mutual Information
  Modeling Dependency Relations
  Graph Matching
Information Entropy
Measures the uncertainty of values in an attribute.
Standard information theoretic measure:
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
[Figure: Entropy of Coin Flip Test; x-axis p(x=front), y-axis H(X), ticks from 0 to 1.2]
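The coin flip curve can be reproduced with a few lines. This is a minimal sketch assuming base-2 logarithms (entropy in bits); the function names are chosen for this example only.

```python
import math

def entropy(probs):
    """H(X) = -sum of p(x) * log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def coin_entropy(p_front):
    """Entropy of a coin that lands front with probability p_front."""
    return entropy([p_front, 1.0 - p_front])

fair = coin_entropy(0.5)      # 1.0 bit: maximum uncertainty
biased = coin_entropy(0.9)    # strictly between 0 and 1
certain = coin_entropy(1.0)   # 0.0: the outcome is known in advance
```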
Mutual Information
Another standard information theoretic measure.
Measures the amount of information captured in one attribute about the other.
$$MI(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
Note: self-information MI(X;X) = H(X).
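A minimal sketch of the mutual information formula, estimated from two equally long columns of values. Base-2 logarithms and the function name are assumptions for this example; probabilities are simple relative frequencies.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X;Y) = sum over observed (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

x = ["a", "a", "b", "b"]
y = ["u", "v", "u", "v"]           # carries no information about x in this sample

mi_xy = mutual_information(x, y)   # 0.0
mi_xx = mutual_information(x, x)   # 1.0, which equals H(X) for this column
```

The second value illustrates the note on the slide: the mutual information of an attribute with itself is its entropy.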
Modeling Dependency Relation
Table R:
A   B   C   D
a1  b2  c1  d1
a3  b4  c2  d2
a1  b1  c1  d2
a4  b3  c2  d3

G = Table2DepGraph(R)
[Figure: dependency graph G with nodes A, B, C, D; weights shown in the figure: 1.5, 2.0, 1.0, 1.5, 1.0, 1.5, 1.5, 0.5, 1.0, 1.0]
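One way to read Table2DepGraph is as a matrix of pairwise mutual information, with each attribute's entropy on the diagonal (since MI(X;X) = H(X)). The sketch below is an assumption about that construction, not the authors' code; `table_to_dep_graph` is a name made up for this example. Applied to the four rows of Table R with base-2 logarithms, it yields four weights of 1.5, four of 1.0, one 2.0, and one 0.5, the same multiset of weights that appears in the figure.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Base-2 mutual information of two equally long columns (relative frequencies)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def table_to_dep_graph(rows):
    """Return g with g[i][j] = MI(column i; column j); g[i][i] is the entropy of column i."""
    cols = list(zip(*rows))
    return [[mutual_information(a, b) for b in cols] for a in cols]

R = [("a1", "b2", "c1", "d1"),
     ("a3", "b4", "c2", "d2"),
     ("a1", "b1", "c1", "d2"),
     ("a4", "b3", "c2", "d3")]

G = table_to_dep_graph(R)
# Rows and columns are in the order A, B, C, D:
# A: 1.5 1.5 1.0 1.0
# B: 1.5 2.0 1.0 1.5
# C: 1.0 1.0 1.0 0.5
# D: 1.0 1.5 0.5 1.5
```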
Graph Matching
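[Figure: two dependency graphs, G1 with nodes A, B, C, D and G2 with nodes W, X, Y, Z. G1 weights as shown: 1.5, 2.0, 1.0, 1.5, 1.0, 1.5, 1.5, 0.5, 1.0, 1.0. G2 weights as shown: 2.0, 1.5, 1.0, 1.5, 1.0, 1.0, 1.5, 1.0, 0.5, 1.5]
Our algorithm will use graph matching.
{(G1(a), G2(b))} = GraphMatch(G1, G2)
Finds a mapping that minimizes the distance between the two graphs.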
Distance Between the Graphs
Euclidean distance metric (Frobenius norm):
$$D_M(A,B) = \sqrt{\sum_{i,j} \left(a_{ij} - b_{m(i)m(j)}\right)^2}$$
where a_ij and b_ij = mutual information between node i and j,
m(node in A) = matching node in B.
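The distance can be computed directly from two weight matrices and a candidate node mapping. This is a minimal sketch; representing the mapping m as a list of indices is an assumption for this example, and the two small matrices are made-up data.

```python
import math

def graph_distance(a, b, m):
    """Euclidean distance between graphs a and b under node mapping m.

    a, b: square matrices of mutual information values.
    m[i]: index of the node in b that is matched to node i in a.
    """
    n = len(a)
    return math.sqrt(sum((a[i][j] - b[m[i]][m[j]]) ** 2
                         for i in range(n) for j in range(n)))

a = [[1.0, 0.5],
     [0.5, 2.0]]
b = [[2.0, 0.5],
     [0.5, 1.0]]          # the same graph with its two nodes swapped

d_in_order = graph_distance(a, b, [0, 1])   # sqrt(2): nodes paired in listed order
d_swapped = graph_distance(a, b, [1, 0])    # 0.0: the swap recovers the correspondence
```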
Measuring the quality of match results
Precision = (# of correct matches produced) / (# of all matches produced)
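As a small worked example of this measure (a sketch with made-up matches; returning 0.0 for an empty result is an assumption, since the slide does not cover that case):

```python
def precision(produced, correct):
    """Fraction of the produced matches that are also in the set of correct matches."""
    produced = set(produced)
    if not produced:
        return 0.0   # assumption: precision is taken as 0 when nothing is produced
    return len(produced & set(correct)) / len(produced)

produced = [("A", "W"), ("B", "X"), ("C", "Z"), ("D", "Y")]   # hypothetical matcher output
correct = [("A", "W"), ("B", "X"), ("C", "Y"), ("D", "Z")]    # hypothetical ground truth

p = precision(produced, correct)   # 0.5: two of the four produced matches are correct
```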
Finally, Our Matching Algorithm
1. G1 = Table2DepGraph(S1); G2 = Table2DepGraph(S2);
2. {(G1(a), G2(b))} = GraphMatch(G1, G2);
where Si = an input table, Gi = a dependency graph, (G1(a), G2(b)) = a matching node pair.
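The two steps can be sketched end to end as follows. This is an illustrative reading, not the authors' implementation: it assumes equally wide tables, a one-to-one mapping, base-2 mutual information, and a plain exhaustive search over all permutations with no filtering. The second table is a made-up copy of Table R with its columns reordered and its values replaced by opaque tokens.

```python
import math
from collections import Counter
from itertools import permutations

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def table_to_dep_graph(rows):
    """Step 1: matrix of pairwise mutual information between columns."""
    cols = list(zip(*rows))
    return [[mutual_information(a, b) for b in cols] for a in cols]

def graph_match(g1, g2):
    """Step 2: try every one-to-one mapping and keep the one with minimum distance."""
    n = len(g1)
    def distance(m):
        return math.sqrt(sum((g1[i][j] - g2[m[i]][m[j]]) ** 2
                             for i in range(n) for j in range(n)))
    return min(permutations(range(n)), key=distance)

S1 = [("a1", "b2", "c1", "d1"),      # columns A, B, C, D
      ("a3", "b4", "c2", "d2"),
      ("a1", "b1", "c1", "d2"),
      ("a4", "b3", "c2", "d3")]
S2 = [("p", "x1", "u", "k1"),        # same data as columns C, A, D, B, re-encoded
      ("q", "x2", "v", "k2"),
      ("p", "x1", "v", "k3"),
      ("q", "x3", "w", "k4")]

mapping = graph_match(table_to_dep_graph(S1), table_to_dep_graph(S2))
# mapping[i] is the column of S2 matched to column i of S1: (1, 3, 0, 2)
```

The search visits n! mappings, so this form is only practical for small schemas.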
Validating the Framework
Graph matching algorithm:
  Used exhaustive search w/ simple filtering.
  Can be replaced w/ approximate algorithms in practice.
System:
  Java HotSpot VM 1.4
Goals of experiments…
Main goal: see if mutual information-based un-interpreted matching works.
Secondary goal: see if mutual information is necessary, or if a simpler approach, Entropy-only Matching, works just as well.
Entropy-only Matching only compares the entropies of attributes in isolation, without considering mutual information.
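One simple way to realize an entropy-only baseline is to pair attributes by the rank of their entropies. The talk does not say how its baseline pairs attributes, so the rank-based pairing below is an assumption, as are the function names and the second table (Table R with columns reordered as C, D, A, B and values re-encoded).

```python
import math
from collections import Counter

def entropy(column):
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def entropy_only_match(rows1, rows2):
    """Pair columns of two equally wide tables by the rank of their entropies."""
    h1 = [entropy(c) for c in zip(*rows1)]
    h2 = [entropy(c) for c in zip(*rows2)]
    order1 = sorted(range(len(h1)), key=lambda i: h1[i])
    order2 = sorted(range(len(h2)), key=lambda i: h2[i])
    return dict(zip(order1, order2))

S1 = [("a1", "b2", "c1", "d1"),      # columns A, B, C, D
      ("a3", "b4", "c2", "d2"),
      ("a1", "b1", "c1", "d2"),
      ("a4", "b3", "c2", "d3")]
S2 = [("p", "u", "x1", "k1"),        # same data as columns C, D, A, B, re-encoded
      ("q", "v", "x2", "k2"),
      ("p", "v", "x1", "k3"),
      ("q", "w", "x3", "k4")]

mapping = entropy_only_match(S1, S2)
# Columns A and D have the same entropy (1.5), so entropy alone cannot
# tell them apart; here the tie is broken by position and both are mispaired.
```

In this example the true correspondence is A to column 2 and D to column 1 of S2, while the entropy-only pairing returns the reverse for that pair.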
Data Set I: Census Data (U.S. Census Bureau)
State census data files: NY and CA.
Can the algorithm find the mapping between attributes in the NY and CA tables?
1 2 3 4 5 6 7 8 9 10
18091 1063 10 9 9 41 15 368 368 288
17511 3281 25 21 40 89 59 1211 1211 796
609 3424 29 13 15 148 26 1055 1055 861
3861 2884 18 7 4 114 11 670 670 568
18614 1478 12 10 15 40 16 630 630 459
Data Set II: Medical Data
Thrombosis lab exam data (12 years of patient records).
Range partitioned into two tables based on exam dates.
Can the algorithm find the mapping between attributes in the resulting two tables?
1 2 3 4 5 6 7 8 9 10
970709 23 530 104 6.4 4 14 0.5 232 100
971022 26 564 108 6.8 5.3 13 0.55 250 103
971224 25 483 90 6.5 5.1 15 0.62
980120 26 578 101 7 4.6 16 0.49 224 93
980217 34 521 98 5.3 10 0.62 234 111
Results
[Figure: two panels, "Thrombosis exam" and "Census data", each plotting Precision (50% to 100%) against Schema size (# of attributes, 2 to 20) for two series: Mutual Information and Entropy-only]
Match precision deteriorates as the size of the match increases.
However, the deterioration is small compared to the exponential increase in search space.
The MI-based approach dominates the entropy-only approach.
Why does mutual information-based approach dominate entropy-only approach?
[Figure: Entropy (0 to 14) plotted for Attributes 1 to 30, with two series: Census NY and Census CA]
Cardinality Constraints in Schema Matching
One-to-one mapping (bijective) [Figure: G1 with nodes A, B, C; G2 with nodes A, B, C]
Onto mapping (surjective) [Figure: G1 with nodes A, B, C; G2 with nodes A, B, C, D]
Partial mapping [Figure: G1 with nodes A, B, C, E; G2 with nodes A, B, C, D]
What about schemas that don’t match?
Examined how our matching algorithm reacts to the matching of unrelated schemas. (NY-CA vs. Lab1-CA)
Distinguishing Good and Bad Matches
Clearly detects case where there is no good matching.
[Figure: Metric value (0 to 70) against Schema size (# of attributes, 2 to 20) for two series: One-to-One NY-CA Euclidean and One-to-One Lab1-CA Euclidean]
Summary
Identified a new class of schema matching problems that have not been addressed by existing solutions.
First to introduce an un-interpreted matching technique that addresses the new class of problems.
Evaluation suggests it may be useful as an addition to existing matching techniques.
Future Work
Find an efficient, accurate graph matching approximation algorithm.
Extend the techniques to nested structures such as XML, OO schemas.
See if the technique is applicable to the problems of schema classification / clustering.
Questions?
For more information: jaewoo@cs.wisc.edu http://www.cs.wisc.edu/~jaewoo