Automatic Schema Matching

Seminar on Databases and the Internet

Yaron Naveh

January 2006

Automatic Schema Matching, SDBI, 2006

Articles

A survey of approaches to automatic schema matching Rahm & Bernstein (2001)

Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach He, Chen-Chuan Chang & Han (2004)

Contents

• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach

Match Definition

A match is a mapping between elements of two schemas that correspond semantically to each other.

Authors (ID, Name, NumOfBooks)
Authors (AID, AName, ANumOfBooks)

Example: ID --> AID, Name --> AName, NumOfBooks --> ANumOfBooks

Match Properties

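Authors (ID, Name, NumOfBooks)
Authors (ID, FName, LName, YearOfBirth)

• Cardinalities shown in the diagram: (1:1) and (1:n)
• (n:m) matching also possible
• "?" marks in the diagram flag elements whose match is unclear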

Match Properties (cont’d)

Authors (ID, Name, Salary ($))
Authors (ID, Name, Salary (NIS))

• Salary(NIS) = Salary($) * 4.55
• We will not find the function, just the attributes

Match Properties (cont’d)

One relation is mapped to two others:

Employees (EmpName, DeptName)
Join of Employees (EmpName, DeptID) and Departments (DeptID, DeptName)

Match Properties (cont’d)

Lessons (Teacher, StartTime, EndTime)
Lessons (Teacher, Time??)

• Too hard for a computer!
• The computer should only suggest mappings to the user

Match Properties (cont’d)

So maybe it can all be done manually?

[Diagram: two schemas, each with attributes Field1, Field2, Field3, Field4, Field5, Field6, Field7, Field8, Field9, field10]

An automated tool can be helpful here…

Match Generalization

We have defined a match for the relational model.

There are other interesting models:

<author>
  <id>1</id>
  <name>Calvino</name>
</author>

[Diagram labels: Authors, Books, ID, Authors, Name]

Match Generalization (cont’d)

Define a schema to be a set of elements connected by some structure.

Use the natural correspondence:
• nodes and edges in graphs
• elements, subelements, and IDREFs in XML

Contents

• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach

Data Migration

Migrate data from old DB to new DB

Old Forum (Date, From, Message)
New Forum (Time, Writer, Message, IsVisible, ResponseTo)

Special case: Data warehouse

E-Commerce

Map between different message formats

Book Store:
<book>
  <name>The Invisible Cities</name>
  <price>50</price>
</book>

General Store:
<product>
  <name>book</name>
  <price>50</price>
</product>

Global Query Interface

You want to build a Meta-Querier. However…

GOOGLE: <input name=search>, <select name=type>
MSN: <input name=q>
Yahoo: <input name=qry>, <input name=type>

Global Query Interface (cont’d)

Solution: Reduce the HTML form to its “schema”

GOOGLE (Search, Type)
MSN (q)
Yahoo (Qry, Type)
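The reduction of a form to its "schema" can be sketched with Python's standard `html.parser` module. This is only an illustration: the class and function names (`FormSchemaParser`, `form_schema`) are hypothetical, and treating the `name` attribute of each `input`, `select`, or `textarea` tag as the schema attribute is an assumption, not something the slides specify.

```python
from html.parser import HTMLParser


class FormSchemaParser(HTMLParser):
    """Collects the name attribute of every form field it sees."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag in self.FIELD_TAGS:
            name = dict(attrs).get("name")
            if name:
                self.fields.append(name)


def form_schema(html):
    """Return the list of field names found in an HTML fragment."""
    parser = FormSchemaParser()
    parser.feed(html)
    return parser.fields


# The three query forms from the slide, reduced to attribute lists.
google = form_schema("<input name=search><select name=type></select>")
msn = form_schema("<input name=q>")
yahoo = form_schema("<input name=qry><input name=type>")
```

Once each form is a plain list of attribute names, the matchers discussed in the rest of the talk can treat it like any other schema.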

Semantic Query Processing

Keywords search scenario

Authors (Id, Name)

Find: Author + Ram + Oren

Candidate queries:
SELECT * FROM Authors WHERE Id='Ram Oren' ?
SELECT * FROM Authors WHERE Name='Ram Oren' ?

How does this differ from previous examples?

Contents

• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach

Matchers

• There are a few algorithms to map attributes of two schemas
• Define such an algorithm as a matcher
• Define a hybrid matcher as a matcher that combines results from other matchers
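A minimal sketch of the hybrid idea, assuming each matcher returns a similarity score between 0 and 1 and that scores are combined by a weighted average. The two component matchers, the dictionary layout of an attribute, and the weights are all illustrative assumptions; the slides do not prescribe a combination rule.

```python
from difflib import SequenceMatcher


def name_matcher(a, b):
    """Similarity of two attribute names, ignoring case (0.0 to 1.0)."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a, b):
    """1.0 when the declared data types agree, otherwise 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def hybrid_matcher(a, b, matchers, weights):
    """Weighted average of the scores returned by the individual matchers."""
    total = sum(weights)
    return sum(w * m(a, b) for m, w in zip(matchers, weights)) / total


# Hypothetical attribute descriptions: a name plus a data type.
name_a = {"name": "Name", "type": "string"}
name_b = {"name": "AName", "type": "string"}
id_b = {"name": "AID", "type": "int"}

matchers = [name_matcher, type_matcher]
score_good = hybrid_matcher(name_a, name_b, matchers, [0.5, 0.5])
score_bad = hybrid_matcher(name_a, id_b, matchers, [0.5, 0.5])
```

Here `Name` scores higher against `AName` than against `AID`, because both the name similarity and the type agreement point the same way.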

Schema-based Vs. Instance-based

Two ways to perform a match:

• Use schema data (field name, type, constraints…)

• Use data from the table

Instance-based

Two options for using data from the table:

• Build a schema from instance data, then use schema matchers
• Use the data directly. Example:

Books
BookID  TotPages  TotPrice
1       500       50
2       400       40
3       450       90

Books
BookID  TotalP
1       6060

What is TotalP?

Instance-based (cont’d)

When will we use/not use instance-based matchers?

• Useful when no schema data is available
• Not useful when no instance data is available…

Schema-Based

What useful data is there in the schema?

• Element’s name
• Description
• Data Type
• Relationships
• Constraints

Schema-Based: Name Matching

Map elements with similar names:

• String equality

• Common substrings (Birthday --> DayOfBirth)

• Canonical names (CName --> Customer Name)

• Synonyms (Car --> Automobile)

• Hypernyms (Book is-a Publication)

• Soundex (ShipTo --> Ship2)

• User provided (Issue --> Bug)
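Three of the techniques listed above (string equality, common substrings, and a synonym or user-provided table) can be sketched with the standard library. The `SYNONYMS` table, the function names, and the `min_common` threshold of 4 characters are assumptions made for the example; a real matcher would more likely consult a thesaurus and tune the threshold.

```python
from difflib import SequenceMatcher

# Hypothetical hand-made dictionary covering synonyms and user-provided pairs.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"issue", "bug"})}


def longest_common_substring(a, b):
    """Longest block of characters that appears contiguously in both strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def name_match(a, b, min_common=4):
    """Return the kind of name match found, or None if there is none."""
    a, b = a.lower(), b.lower()
    if a == b:
        return "equal"
    if frozenset({a, b}) in SYNONYMS:
        return "synonym"
    if len(longest_common_substring(a, b)) >= min_common:
        return "common substring"
    return None


print(name_match("Birthday", "DayOfBirth"))  # shares the substring "birth"
print(name_match("Car", "Automobile"))
print(name_match("Name", "Salary"))
```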

Schema-Based: Description

Map elements based on description

Schema A: empn //employee name
Schema B: name //name of employee

Schema-Based: Constraint Based

Map elements based on constraints:

• Data Types
• Unique, Primary, Foreign

[Diagram: tables Employees, Permissions, and Payments with attributes Name, PID, ID, PLevel, Employees ID, Sum; a "?" marks the candidate match]

Reuse Previous Matching

Schema A (Name, Salary)
Schema B (AName, Income)
Schema C (Author, Money)

• Get mapping A --> C from mappings A --> B and B --> C
• A partial reuse is also possible (e.g. on some of the attributes)
• Be aware of the domain: salary and income are not always the same!
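Reuse by composition can be sketched as chaining two dictionaries. The attribute names below follow the diagram labels on this slide, and assigning Name and Salary to Schema A is a reading of that diagram; the `compose` function is a hypothetical helper, not part of any tool named in the talk.

```python
def compose(ab, bc):
    """Derive an A-to-C mapping from A-to-B and B-to-C mappings.

    Attributes of A whose B counterpart has no entry in bc are skipped,
    which corresponds to reusing only part of a previous matching.
    """
    return {a: bc[b] for a, b in ab.items() if b in bc}


ab = {"Name": "AName", "Salary": "Income"}
bc = {"AName": "Author", "Income": "Money"}

ac = compose(ab, bc)                             # full reuse
ac_partial = compose(ab, {"AName": "Author"})    # only one attribute known
```

The composed result is only as trustworthy as both inputs: if Income and Money mean different things in schema C's domain, the derived Salary mapping inherits the error.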

Complexity

• We must compare every subset of attributes in schema A to every subset in schema B
• Exponential in the number of attributes
• However, we can assume the number of attributes is bounded…
• Also check (n:m) matching only for n,m<C for some constant C

Contents

• Problem Definition
• Applications
• Classic Approaches
• Correlation Mining Approach

Data Mining

Data Mining is the process of discovering patterns in data, usually stored in a database.

Sells
TransID  Item
1        Book
1        Pencil
2        Book
2        Soap
3        Book
3        Soap

Which items are likely to co-appear?

Data Mining (cont’d)

Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)

Support of an itemset: the fraction of transactions that contain all items in the itemset.

What is the support for {Book}? 1

And for {Book, Soap}? 0.666

The A-Priori property: the support for any subset of an itemset is at least as large as the support for the itemset
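The support definition can be checked directly against the Sells table. This is a small sketch; the helper names `transactions` and `support` are chosen for the example.

```python
SELLS = [(1, "Book"), (1, "Pencil"), (2, "Book"), (2, "Soap"), (3, "Book"), (3, "Soap")]


def transactions(rows):
    """Group (TransID, Item) rows into one set of items per transaction."""
    baskets = {}
    for trans_id, item in rows:
        baskets.setdefault(trans_id, set()).add(item)
    return list(baskets.values())


def support(itemset, baskets):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


baskets = transactions(SELLS)
book = support({"Book"}, baskets)               # every transaction has a Book
book_soap = support({"Book", "Soap"}, baskets)  # two of the three transactions
```

Because {Book} is a subset of {Book, Soap}, its support can never be lower, which is the A-Priori property in miniature.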

Data Mining (cont’d)

Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)

Algorithm to find frequent itemsets:

1. Define a threshold minSupport for “frequent” itemsets
2. Calculate support for all itemsets of size 1
3. Calculate support for itemsets of size 2,3,4…
4. For each size k save the frequent itemsets
5. Stop when there are no frequent itemsets of size k.

Why can we stop?

Data Mining (cont’d)

Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)

Example:

1. Set minSupport = 0.5
2. S({Book})=1, S({Pencil})=0.33, S({Soap})=0.666
3. S({Book, Soap})=0.666
4. S({Book, Soap, Pencil})=0

Where is {Soap, Pencil}?
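The level-wise search can be sketched as follows, using the Sells transactions and minSupport = 0.5 from the example. This is one straightforward way to write it, not necessarily how the cited papers implement it; `frequent_itemsets` is a hypothetical name.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Level-wise search: size-k candidates are built only from frequent (k-1)-itemsets."""

    def support(itemset):
        return sum(itemset <= b for b in baskets) / len(baskets)

    items = sorted(set().union(*baskets))
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        # Join frequent itemsets into size-k candidates, then drop any
        # candidate with an infrequent (k-1)-subset (A-Priori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, k - 1))
                 and support(c) >= min_support]
    return result


baskets = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]
found = frequent_itemsets(baskets, min_support=0.5)
```

Since {Pencil} alone falls below the threshold, no candidate containing Pencil is ever generated, so pairs such as {Soap, Pencil} are never counted.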

Back to Schema Matching…

Schemas (Authors):
• (Id, First, Last)
• (Id, Salary, Name, Year)
• (Id, AuthorFirst, AuthorLast, YearBirth)
• (Id, Author)
• (Id, FirstName, LastName, Income)

Goal: Map {Name} to {Author}, {Salary} to {Income}…

Idea: {Name} and {Author} are unlikely to appear together

Solution: go to the supermarket, but instead of food buy attributes!

What is the difference from the supermarket example?

The Algorithm

Input: set of m schemas

• (Id, First, Last)
• (Id, Salary, Name, Year)
• (Id, AuthorFirst, AuthorLast, YearBirth)
• (Id, Author)
• (Id, FirstName, LastName, Income)

Output: set of n-ary mappings

{Name}:{Author}:{AuthorFirst, AuthorLast}:{First,Last}…
{Salary}:{Income}
{Year}:{YearBirth}

Algorithm

Naive Algorithm

1. Make a list L of all attributes from all schemas
   L = {Name, Salary, FirstName, LastName, Author, First, Last…}

2. For each pair of attributes, calculate their support (how often they appear together)
   S(Name, Salary) = 0.4
   S(First, Last) = 0.95
   S(Last, Name) = 0.1

Algorithm (Cont’d)

3. Choose groups with low support
   S(Name, Salary) = 0.4
   S(First, Last) = 0.95
   S(Last, Name) = 0.1

4. Using the A-Priori property, calculate support for groups of sizes 3,4,5…
   S(Name, LastName, Salary) = 0
   S(First, Last, Salary) = 0.1

5. Return all groups with low support
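Steps 1 to 3 of the naive algorithm can be sketched for attribute pairs. The five schemas are copied from the "Back to Schema Matching" slide, so the computed supports differ from the illustrative numbers shown above; the function names and the threshold of 0.0 are assumptions for the example.

```python
from itertools import combinations

SCHEMAS = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]


def pair_support(a, b, schemas):
    """Fraction of schemas in which attributes a and b appear together."""
    return sum(a in s and b in s for s in schemas) / len(schemas)


def low_support_pairs(schemas, threshold):
    """All attribute pairs whose co-occurrence support is at most the threshold."""
    attributes = sorted(set().union(*schemas))
    return [(a, b) for a, b in combinations(attributes, 2)
            if pair_support(a, b, schemas) <= threshold]


pairs = low_support_pairs(SCHEMAS, threshold=0.0)
```

The result contains the intended pairs such as (Author, Name) and (Income, Salary), but also unrelated ones such as (Author, First), which is one reason the talk calls this version naive.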

Algorithm (Cont’d)

The algorithm is naive.

Suppose we have negative correlation for this:
{name, author}

Then we also have negative correlation for this:
{name, author, salary}
{name, author, yearOfBirth}

Actually for any attribute X we have:
{name, author, X}

Improvement

Improvement: Define the support (s) of an itemset {a,b,c…} to be
MAX { s(a,b), s(b,c), s(a,c) … }

Example:
s(name, author)=0.1
s(name, salary)=0.5
s(salary, author)=0.6
s(name,author,salary)=MAX (0.1,0.5,0.6)=0.6

Now the support can go up so checking it is not trivial

What is the logic behind this?
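The redefined support is short enough to write out. The pairwise values are the ones from the example; `group_support` and the dictionary keyed by attribute pairs are illustrative choices.

```python
from itertools import combinations


def group_support(group, pair_support):
    """Support of a group = the largest support among its attribute pairs."""
    return max(pair_support[frozenset(p)] for p in combinations(sorted(group), 2))


pair_support = {
    frozenset({"name", "author"}): 0.1,
    frozenset({"name", "salary"}): 0.5,
    frozenset({"salary", "author"}): 0.6,
}

s_pair = group_support({"name", "author"}, pair_support)
s_triple = group_support({"name", "author", "salary"}, pair_support)
```

Adding salary raises the group's value from 0.1 to 0.6, so a low-support pair no longer drags every superset down with it.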

Generalizing the algorithm

Now the algorithm finds all groups of attributes (a,b,c…) s.t. none of the pairs appears together.

Hopefully these are attributes with the same semantics:
{name, author}
{salary, payments}

But what about this?
({first,last}, {name})

Currently we find only (1:1) matching

For (n:m) we need to preprocess…

Preprocess

Pre-Process for the algorithm:

1. Make a list L of all attributes from all schemas
   L = {Name, Salary, FirstName, LastName, Author, First, Last…}

2. Run the normal A-Priori algorithm (find all attributes that DO appear together)
   S(first, last)=0.9
   S(firstName,lastName)=0.85

Preprocess

3. For each schema S in the input:
     For each frequent attributes group A:
       If A intersects with S then add new attribute “A” to S

   S = (Id, First, Last), A = {First, Last}
   S’ = (Id, First, Last, {First,Last})

4. Run the previous algorithm on S1’, S2’… to find negative correlation

Now we can find groups like:
({first,last}, {name})
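Step 3 can be sketched by representing each frequent group as a `frozenset` that is added to the schema as one extra attribute. The representation and the name `add_group_attributes` are assumptions for the example; the intersection test follows the wording of the slide.

```python
def add_group_attributes(schema, frequent_groups):
    """Return a copy of the schema with one extra attribute per frequent group it intersects."""
    extended = set(schema)
    for group in frequent_groups:
        if set(group) & set(schema):
            # The whole group becomes a single new attribute of the schema.
            extended.add(frozenset(group))
    return extended


s = {"Id", "First", "Last"}
groups = [{"First", "Last"}, {"FirstName", "LastName"}]
s_prime = add_group_attributes(s, groups)
```

After this step the group {First, Last} is an ordinary item, so the negative-correlation pass can pair it with {name} like any single attribute.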

Still Not Perfect…

Suppose we found these mappings:

{first,last}:{name}:{author}
{first, yearOfBirth}:{birthDate}
{yearOfBirth, monthOfBirth}:{birthDate}

There is a contradiction!

Solution

Solution: rank the mappings according to the support of the lowest pair in each mapping

1. {first,last}:{name}:{author}
2. {first, yearOfBirth}:{birthDate}
3. {yearOfBirth, monthOfBirth}:{birthDate}

Add the top rank to the results:
1. {first,last}:{name}:{author}

Delete contradictions to this rank:
2. {first, yearOfBirth}:{birthDate} X

Process next mapping:
3. {yearOfBirth, monthOfBirth}:{birthDate}
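The selection loop can be sketched as a greedy pass over mappings that are already ranked. The slides do not define "contradiction" formally, so the rule used here is an assumption: two mappings contradict when a group in one overlaps a group in the other without being identical to it. All function names are hypothetical.

```python
def contradicts(m1, m2):
    """Assumed rule: a group of one mapping overlaps, but differs from, a group of the other."""
    return any(g1 & g2 and g1 != g2 for g1 in m1 for g2 in m2)


def select_mappings(ranked):
    """Greedy pass over mappings that are already sorted best-first."""
    remaining = list(ranked)
    chosen = []
    while remaining:
        top = remaining.pop(0)
        chosen.append(top)
        # Drop every lower-ranked mapping that contradicts the one just chosen.
        remaining = [m for m in remaining if not contradicts(top, m)]
    return chosen


def mapping(*groups):
    """Build a mapping as a tuple of attribute groups."""
    return tuple(frozenset(g) for g in groups)


ranked = [
    mapping({"first", "last"}, {"name"}, {"author"}),
    mapping({"first", "yearOfBirth"}, {"birthDate"}),
    mapping({"yearOfBirth", "monthOfBirth"}, {"birthDate"}),
]
result = select_mappings(ranked)
```

On the three mappings from the slide, the second is removed because `first` would belong to two different groups, and the first and third survive.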

Attributes with the same name

Step 1 of the algorithm (reminder):
Make a list S of all attributes from all schemas
S = {Name, Salary, FirstName, LastName, Author, First, Last…}

This means that two attributes with the same name are always considered the same.

Payment (longint)
Payment (datetime)?

Solution: add the type to the name

(Id, First, Last) --> (Id_Int, First_String, Last_String)
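A tiny sketch of the renaming, assuming each schema is given as a dictionary from attribute name to type name. The `qualify` helper and the type spellings are illustrative only.

```python
def qualify(schema):
    """Turn {attribute: type} into type-qualified names such as 'Id_Int'."""
    return {f"{name}_{type_name}" for name, type_name in schema.items()}


a = qualify({"Id": "Int", "Payment": "LongInt"})
b = qualify({"Id": "Int", "Payment": "DateTime"})
shared = a & b
```

After qualification only `Id_Int` is shared, so the two Payment attributes are no longer treated as the same item.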

Correlation Measure

The rare attribute problem:

• (Id, First, Last)
• (Id, Salary, Name, Year)
• (Id, AuthorFirst, AuthorLast, YearBirth)
• (Id, Author)
• (Id, FirstName, LastName, Income)

s(Income, Id)=0.2

So Income=Id?

Correlation Measure (cont’d)

The sparseness problem:

• (Id, First, Last)
• (Id, Salary, Name, Year)
• (Id, AuthorFirst, AuthorLast, YearBirth)
• (Id, Author)
• (Id, FirstName, LastName, Income)

s(Salary, Income)=0

If Salary=Income then what is their equivalent in the other tables?

Correlation Measure (cont’d)

Support of an itemset: the fraction of transactions that contain all items in the itemset.

There are other ways to calculate support:

Let A,B be two attributes. Define
f11: the number of schemas where both A and B appear
f10: the number of schemas where only A appears
f1+: f11+f10

       B     ^B
A      f11   f10   f1+
^A     f01   f00   f0+
       f+1   f+0   f++

Correlation Measure (cont’d)

• We used: support = f11/f++
• Lift: (f00·f11)/(f10·f01)
• H-measure: (f01·f10)/(f+1·f1+)

       B     ^B
A      f11   f10   f1+
^A     f01   f00   f0+
       f+1   f+0   f++

Every measure fits a different situation

For example, in the matching problem we want to “punish” attributes that co-appear

[Diagram: schema (Id, Salary, Name, Year)]
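Support and the H-measure can be compared on the five example schemas from the earlier slides. This is a sketch under assumptions: the function names are made up, the first attribute passed plays the role of A, and returning 0 when an attribute never appears is a choice made here to avoid division by zero.

```python
SCHEMAS = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]


def contingency(a, b, schemas):
    """Return (f11, f10, f01, f00) for attributes a and b over a list of schemas."""
    f11 = sum(a in s and b in s for s in schemas)
    f10 = sum(a in s and b not in s for s in schemas)
    f01 = sum(a not in s and b in s for s in schemas)
    f00 = len(schemas) - f11 - f10 - f01
    return f11, f10, f01, f00


def support(f11, f10, f01, f00):
    """f11 / f++"""
    return f11 / (f11 + f10 + f01 + f00)


def h_measure(f11, f10, f01, f00):
    """f01*f10 / (f+1 * f1+); defined here as 0 when either attribute never appears."""
    f1_plus, f_plus1 = f11 + f10, f11 + f01
    if f1_plus == 0 or f_plus1 == 0:
        return 0.0
    return (f01 * f10) / (f_plus1 * f1_plus)


income_id = contingency("Income", "Id", SCHEMAS)
salary_income = contingency("Salary", "Income", SCHEMAS)
```

With these schemas, Income and Id get an H-measure of 0 because they co-appear whenever Income is present, while Salary and Income get 1 because they never co-appear, which separates the two cases that plain support could not.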

Applications

This approach can only be used when we have many schemas

• Web query interfaces. Example:
  El-Al.Com: Adult, Child, Infant
  Arkia.Com: Adult, Child, Destination
  American Airlines.Com: Passengers, To
• Data Migration?

Is it possible to use the algorithm for migration by running it on many random schemas?

Complexity

The A-Priori algorithm is O(2^n)

Usually there are only a few correlations, so in step (k+1) we consider just a few of the groups of size k
