On Schema Matching with Opaque Column Names and Data Values

Jaewoo Kang, NC State (Aug 2003)

Jeffrey F. Naughton, Univ. of Wisconsin-Madison

June 10, 2003, SIGMOD 2003

What is Schema Matching?

Finding semantic correspondences of schema elements across heterogeneous sources.

An old problem, yet one attracting new interest.

What is Schema Matching? (Cont’d)

Important for enterprise applications: data warehouses, data migration.

Also important for Internet data: virtual databases, web information systems.

Fundamental element of data integration.

No Silver Bullet!

State of the art: A collection of techniques that propose matches.

We have added a new technique to this collection that works when previous techniques don’t even apply.

Some Previous Approaches

Schema-based approaches

Site 1:

Manager   Employ   Salary
J. K.     J. D.    $50K
T. J.     N. D.    $80K
P. K.     Z. I.    $75K

Site 2:

MNG       EMP      WAGE
U. P.     D. S.    $85K
A. H.     M. H.    $75K
J. N.     D. F.    $60K

Some Previous Approaches II

Instance-based approaches

Site 1:

Dept    Employ   Phone
HR      J. D.    267-7622
R&D     N. D.    354-8736
Sales   Z. I.    219-0457

Site 2:

DPT     EMP      CONT
R&D     D. S.    387-9802
Sales   M. H.    546-3856
Adm     D. F.    326-1284

So two previous approaches

Schema-based (interpret column names)

Instance-based (interpret data values)

But what about this problem?

[Tables: Site 1 and Site 2, each with seven columns named t1 to t7 and seven rows of purely numeric values.]

This is the “Un-interpreted Matching” Problem.

Focus of this talk.

Outline of the remainder of this talk: formal definition, terminology, algorithm, experimental results.

Un-interpreted Matching

M1 = match(R(r1, r2, ..., rn), S(s1, s2, ..., sm))
M2 = match(R(r1, r2, ..., rn), S’(f1(s1), f2(s2), ..., fm(sm)))

where
match = a schema matching algorithm,
Mi = {(ri-sj)}: set of matching column pairs,
fi = arbitrary one-to-one function.

‘match’ is an un-interpreted matching iff M1 = M2 for all fi’s.

Main idea: the specific tokens representing column names and values are not important.
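A minimal sketch of why such a matcher can exist: any statistic that depends only on how often values repeat is unchanged by a one-to-one recoding. The helper names `entropy` and `recode` and the sample column are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (base 2) of the value distribution in a column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def recode(column):
    """Apply one arbitrary one-to-one function: map each distinct value to an opaque token."""
    mapping = {}
    for value in column:
        mapping.setdefault(value, f"tok{len(mapping)}")
    return [mapping[value] for value in column]

# Hypothetical sample column; the recoded copy hides the original tokens.
color = ["white", "silver", "red", "white"]
opaque = recode(color)

# The entropy is the same for both, because only value frequencies matter.
same_entropy = math.isclose(entropy(color), entropy(opaque))
```

A matcher built only from statistics of this kind gives the same result on S and S’, which is what the definition above requires.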

Motivating example

Two Car Part Tables

Model   Color    Tire
XLE     white    P2R6
XG2.5   silver   XR5
LE      red      GM6
:       :        :

Model   C1   C2
GL3.5   a1   b1
XGL     a2   b2
XE      a3   b3
:       :    :

Background

Before introducing our algorithm, we need: information entropy, mutual information, modeling dependency relations, and graph matching.

Information Entropy

Measures the uncertainty of values in an attribute.

Standard information-theoretic measure:

$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$

[Figure: Entropy of Coin Flip Test. x-axis: p(x=front); y-axis: H(X).]
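The coin-flip curve can be reproduced with a few lines. This is a sketch assuming base-2 logarithms (entropy in bits); the function name `binary_entropy` is made up for the example.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that shows front with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Sample the curve at p = 0.0, 0.1, ..., 1.0.
curve = [(k / 10, binary_entropy(k / 10)) for k in range(11)]
```

The entropy is zero for a coin that always lands the same way and peaks for a fair coin.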

Mutual Information

Another standard information-theoretic measure.

Measures the amount of information captured in one attribute about the other.

$$MI(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Note: self-information MI(X;X) = H(X).
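As an illustration, mutual information can be estimated from two aligned columns by counting value pairs. This sketch assumes base-2 logarithms and probabilities estimated as relative frequencies; the sample columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X;Y) in bits, estimated from paired samples xs[i], ys[i]."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical columns: a determines c, so MI(a;c) equals the entropy of c.
a = ["a1", "a3", "a1", "a4"]
c = ["c1", "c2", "c1", "c2"]

mi_ac = mutual_information(a, c)
self_info = mutual_information(a, a)  # equals H(a), per the note above
```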

Modeling Dependency Relation

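Table R:

A    B    C    D
a1   b2   c1   d1
a3   b4   c2   d2
a1   b1   c1   d2
a4   b3   c2   d3

G = Table2DepGraph(R): a graph with one node per attribute. Each node is weighted by the entropy of its attribute (the diagonal below) and each edge by the mutual information between its two attributes:

     A     B     C     D
A    1.5   1.5   1.0   1.0
B    1.5   2.0   1.0   1.5
C    1.0   1.0   1.0   0.5
D    1.0   1.5   0.5   1.5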
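A possible reading of Table2DepGraph, sketched under the assumption that the graph is stored as a matrix of pairwise mutual information (base 2) with entropies on the diagonal. The function name `table_to_dep_graph` and the dictionary layout of R are illustrative choices, not the authors' code.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X;Y) in bits from paired samples; MI(X;X) is the entropy of X."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def table_to_dep_graph(columns):
    """Return (names, M) with M[i][j] = MI(column i; column j)."""
    names = list(columns)
    matrix = [
        [mutual_information(columns[a], columns[b]) for b in names]
        for a in names
    ]
    return names, matrix

# Table R from the slide, stored column by column.
R = {
    "A": ["a1", "a3", "a1", "a4"],
    "B": ["b2", "b4", "b1", "b3"],
    "C": ["c1", "c2", "c1", "c2"],
    "D": ["d1", "d2", "d2", "d3"],
}
names, G = table_to_dep_graph(R)
```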

Graph Matching

[Figure: two dependency graphs, G1 with nodes A, B, C, D and G2 with nodes W, X, Y, Z, each annotated with entropy and mutual information weights.]

Our algorithm will use graph matching.

{(G1(a), G2(b))} = GraphMatch(G1, G2)

Finds a mapping that minimizes the distance between the two graphs.

Distance Between the Graphs

Euclidean distance metric (Frobenius norm):

$$D_M(A, B) = \sqrt{\sum_{i,j} \left(a_{ij} - b_{m(i)m(j)}\right)^2}$$

where a_ij and b_ij = mutual information between node i and j, and m(node in A) = matching node in B.
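The distance can be sketched for graphs stored as square matrices, with the mapping m given as a list where `m[i]` is the index of the node in B matched to node i in A. The representation and the tiny example matrices are assumptions made for illustration.

```python
import math

def graph_distance(A, B, m):
    """Euclidean (Frobenius) distance between A and B under the mapping i -> m[i]."""
    n = len(A)
    return math.sqrt(
        sum((A[i][j] - B[m[i]][m[j]]) ** 2 for i in range(n) for j in range(n))
    )

# Hypothetical 2-node graphs: B is A with its two nodes swapped.
A = [[1.0, 0.5],
     [0.5, 2.0]]
B = [[2.0, 0.5],
     [0.5, 1.0]]

d_swapped = graph_distance(A, B, [1, 0])   # the mapping that undoes the swap
d_identity = graph_distance(A, B, [0, 1])  # the mapping that ignores it
```

The swapped mapping lines up equal weights and gives the smaller distance, so it is the one a minimizing matcher would choose.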

Measuring the quality of match results

Precision = (# of correct matches produced) / (# of all matches produced)
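For concreteness, precision can be computed from two sets of column pairs. The pair representation, the column names, and the convention of returning 0.0 when no matches are produced are assumptions of this sketch.

```python
def precision(produced, correct):
    """Fraction of produced matches that are correct; both are sets of (left, right) pairs."""
    produced, correct = set(produced), set(correct)
    if not produced:
        return 0.0  # assumption: an empty result counts as zero precision
    return len(produced & correct) / len(produced)

# Hypothetical example: one of the three proposed pairs is right.
produced = {("t1", "s1"), ("t2", "s3"), ("t3", "s2")}
correct = {("t1", "s1"), ("t2", "s2"), ("t3", "s3")}
p = precision(produced, correct)
```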

Finally, Our Matching Algorithm

1. G1 = Table2DepGraph(S1); G2 = Table2DepGraph(S2);

2. {(G1(a), G2(b))} = GraphMatch(G1, G2);

where Si = an input table, Gi = a dependency graph, and (G1(a), G2(b)) = a matching node pair.
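The two steps can be sketched end to end. This version assumes both tables have the same number of columns (a one-to-one mapping), uses base-2 mutual information, and searches all permutations without any filtering; the function names and the opaque table S are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def mutual_information(xs, ys):
    """MI(X;Y) in bits from paired samples; MI(X;X) is the entropy of X."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def table_to_dep_graph(columns):
    """Step 1: dependency graph as a matrix of pairwise mutual information."""
    names = list(columns)
    return names, [[mutual_information(columns[a], columns[b]) for b in names]
                   for a in names]

def distance(A, B, m):
    """Euclidean distance between the graphs under the mapping i -> m[i]."""
    n = len(A)
    return math.sqrt(sum((A[i][j] - B[m[i]][m[j]]) ** 2
                         for i in range(n) for j in range(n)))

def graph_match(names1, G1, names2, G2):
    """Step 2: exhaustive search for the mapping with minimum distance."""
    n = len(names1)
    best = min(permutations(range(n)), key=lambda m: distance(G1, G2, m))
    return {names1[i]: names2[best[i]] for i in range(n)}

# Table R from the dependency-relation slide.
R = {"A": ["a1", "a3", "a1", "a4"], "B": ["b2", "b4", "b1", "b3"],
     "C": ["c1", "c2", "c1", "c2"], "D": ["d1", "d2", "d2", "d3"]}
# Hypothetical table S: the columns of R reordered, with opaque names and values.
S = {"W": ["w2", "w4", "w1", "w3"], "X": ["x1", "x2", "x2", "x3"],
     "Y": ["y1", "y3", "y1", "y4"], "Z": ["z1", "z2", "z1", "z2"]}

match = graph_match(*table_to_dep_graph(R), *table_to_dep_graph(S))
```

The search visits n! mappings, so this form is only practical for small schemas.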

Validating the Framework

Graph matching algorithm: used exhaustive search with simple filtering. Can be replaced with approximate algorithms in practice.

System: Java HotSpot VM 1.4

Goals of experiments

Main goal: see if mutual information-based un-interpreted matching works.

Secondary goal: see if mutual information is necessary, or if a simpler approach, Entropy-only Matching, works just as well. Entropy-only Matching only compares the entropies of attributes in isolation, without considering mutual information.
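One simple way to realize an entropy-only baseline is to rank the columns of each table by entropy and pair them rank by rank. This is an illustrative assumption, not necessarily how the baseline in the experiments was implemented; the table S is a hypothetical opaque copy of table R from the dependency-relation slide.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (base 2) of the value distribution in a column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def entropy_only_match(t1, t2):
    """Pair columns by entropy rank, looking at each attribute in isolation."""
    by_h1 = sorted(t1, key=lambda name: entropy(t1[name]))
    by_h2 = sorted(t2, key=lambda name: entropy(t2[name]))
    return dict(zip(by_h1, by_h2))

R = {"A": ["a1", "a3", "a1", "a4"], "B": ["b2", "b4", "b1", "b3"],
     "C": ["c1", "c2", "c1", "c2"], "D": ["d1", "d2", "d2", "d3"]}
# Hypothetical S: W recodes B, X recodes D, Y recodes A, Z recodes C.
S = {"W": ["w2", "w4", "w1", "w3"], "X": ["x1", "x2", "x2", "x3"],
     "Y": ["y1", "y3", "y1", "y4"], "Z": ["z1", "z2", "z1", "z2"]}

match = entropy_only_match(R, S)
```

Columns A and D have the same entropy, so entropy alone cannot tell them apart, and here the rank pairing assigns them the wrong way round. Their mutual information with C differs, which is the extra signal the MI-based approach uses.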

Data Set I: Census Data (U.S. Census Bureau)

State census data files: NY and CA.

Can algorithm find mapping between attributes in NY and CA tables?

1      2     3   4   5   6    7   8     9     10
18091  1063  10  9   9   41   15  368   368   288
17511  3281  25  21  40  89   59  1211  1211  796
609    3424  29  13  15  148  26  1055  1055  861
3861   2884  18  7   4   114  11  670   670   568
18614  1478  12  10  15  40   16  630   630   459

Data Set II: Medical Data

Thrombosis lab exam data (12 years of patient records).

Range partitioned into two tables based on exam dates.

Can algorithm find mapping between attributes in resulting two tables?

1 2 3 4 5 6 7 8 9 10

970709 23 530 104 6.4 4 14 0.5 232 100

971022 26 564 108 6.8 5.3 13 0.55 250 103

971224 25 483 90 6.5 5.1 15 0.62

980120 26 578 101 7 4.6 16 0.49 224 93

980217 34 521 98 5.3 10 0.62 234 111

Results

[Figure: two panels, Thrombosis exam and Census data. x-axis: schema size (# of attributes), 2 to 20; y-axis: precision, 50% to 100%; series: Mutual Information, Entropy-only.]

Match precision deteriorates as the size of match increases. However, deterioration is small compared to the exponential increase in search space.

MI-based approach dominates entropy-only approach.

Why does mutual information-based approach dominate entropy-only approach?

[Figure: entropy (0 to 14) of each attribute (1 to 30); series: Census NY, Census CA.]

Cardinality Constraints in Schema Matching

One-to-one mapping (bijective): G1 = {A, B, C}, G2 = {A, B, C}

Onto mapping (surjective): G1 = {A, B, C}, G2 = {A, B, C, D}

Partial mapping: G1 = {A, B, C, E}, G2 = {A, B, C, D}

What about schemas that don’t match?

Examined how our matching algorithm reacts to the matching of unrelated schemas. (NY-CA vs. Lab1-CA)

Distinguishing Good and Bad Matches

Clearly detects the case where there is no good matching.

[Figure: metric value (0 to 70) versus schema size (# of attributes, 2 to 20); series: One-to-One NY-CA Euclidean, One-to-One Lab1-CA Euclidean.]

Summary

Identified a new class of schema matching problems that have not been addressed by existing solutions.

First to introduce an un-interpreted matching technique that addresses the new class of problems.

Evaluation suggests it may be useful as an addition to existing matching techniques.

Future Work

Find an efficient, accurate graph matching approximation algorithm.

Extend the techniques to nested structures such as XML, OO schemas.

See if the technique is applicable to the problems of schema classification / clustering.

Questions?

For more information: jaewoo@cs.wisc.edu http://www.cs.wisc.edu/~jaewoo
