
Automatic Schema Matching

Seminar on Databases and the Internet

Yaron Naveh

January 2006

Automatic Schema Matching, SDBI, 2006

2

Articles

A survey of approaches to automatic schema matching Rahm & Bernstein (2001)

Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach He, Chen-Chuan Chang & Han (2004)


3

Contents

• Problem Definition

• Applications

• Classic Approaches

• Correlation Mining Approach


4

Match Definition

Authors: ID, Name, NumOfBooks

Authors: AID, AName, ANumOfBooks

A match is a mapping between elements of two schemas that correspond semantically to each other


5

Match Properties

Authors: ID, Name, NumOfBooks

Authors: ID, FName, LName, YearOfBirth

Matches in the figure are labeled (1:1) and (1:n); two elements are marked "?"

• (n:m) matching also possible


6

Match Properties (cont’d)

Authors: ID, Name, Salary ($)

Authors: ID, Name, Salary (NIS)

• Salary(NIS) = Salary($) * 4.55

• We will not find the function, just the attributes


7


One relation is mapped to two others

Employees: EmpName, DeptName

Employees: EmpName, DeptID

Departments: DeptID, DeptName

Join


8


Lessons: Teacher, StartTime, EndTime

Lessons: Teacher, Time??

• Too hard for PC!

• PC should only suggest mappings to the user


9


So maybe it can all be done manually?

[Figure: two schemas of ten fields each, Field1 to Field10]

An automated tool can be helpful here…


10

Match Generalization

We have defined a match for the relational model.

There are other interesting models:

…

<author>

<id>1</id>

<name>Calvino</name>

</author>

…

AuthorsBooks

ID

AuthorsName


11

Match Generalization (cont’d)

Define a Schema to be a set of elements connected by some structure

Use the natural correspondence:

• nodes and edges in graphs

• elements, subelements, and IDREFs in XML

…


12

Contents


Approach


13

Data Migration

Migrate data from old DB to new DB

Old Forum: Date, From, Message

New Forum: Time, Writer, Message, IsVisible, ResponseTo

Special case: Data warehouse


14

E-Commerce

Map between different message formats

<book>

<name>The Invisible Cities</name>

<price>50</price>

</book>

<product>

<name>book</name>

<price>50</price>

</product>

Book Store

General Store


15

Global Query Interface

GOOGLE

<input name=search>

<select name=type>

MSN

<input name=q>

Yahoo

<input name=qry>

<input name=type>

You want to build a Meta-Querier. However…


16

Global Query Interface (cont’d)

Solution: Reduce the HTML form to its “schema”

GOOGLE: Search, Type

MSN: q

Yahoo: Qry, Type


17

Semantic Query Processing

Keywords search scenario

Authors: Id, Name

Find: Author + Ram + Oren

SELECT * FROM Authors WHERE Id='Ram Oren' ?

SELECT * FROM Authors WHERE Name='Ram Oren' ?

How does this differ from previous examples?


18

Contents


Approach


19

Matchers

Several algorithms exist for mapping the attributes of two schemas

Define such an algorithm as a matcher

Define a hybrid matcher as a matcher that combines results from other matchers


20

Schema-based Vs. Instance-based

Two ways to perform a match:

• Use schema data (field name, type, constraints…)

• Use data from the table


21

Instance-based

Two options for using data from the table:

• Build a schema from instance data, then use schema matchers

• Use the data directly. Example:

Books
BookID  TotPages  TotPrice
1       500       50
2       400       40
3       450       90

Books
BookID  TotalP
1       6060

What is TotalP?


22

Instance-based (cont’d)

When will we use/not use instance-based matchers?

• Useful when no schema data is available

• Not useful when no instance data is available…


23

Schema-Based

What useful data is there in the schema?

• Element’s name

• Description

• Data Type

• Relationships

• Constraints


24

Schema-Based: Name Matching

Map elements with similar names:

• String equality

• Common substrings (Birthday --> DayOfBirth)

• Canonical names (CName --> Customer Name)

• Synonyms (Car --> Automobile)

• Hypernyms (Book is-a Publication)

• Soundex (ShipTo --> Ship2)

• User provided (Issue --> Bug)
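
A few of the name-matching techniques listed above can be sketched in Python. This is only an illustration: the `SYNONYMS` table, the `tokens` helper and the scoring rule are assumptions made for the example, not part of the surveyed matchers.

```python
import re
from difflib import SequenceMatcher

# Hypothetical synonym / user-provided table; a real matcher would use a thesaurus.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"issue", "bug"})}

def tokens(name):
    """Split a CamelCase element name into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]

def name_similarity(a, b):
    """Return a score in [0, 1]; 1.0 means equal names or known synonyms."""
    if a.lower() == b.lower():                        # string equality
        return 1.0
    if frozenset({a.lower(), b.lower()}) in SYNONYMS:  # synonyms / user provided
        return 1.0
    # Common substrings: compare the sorted tokens so word order does not matter.
    left = "".join(sorted(tokens(a)))
    right = "".join(sorted(tokens(b)))
    return SequenceMatcher(None, left, right).ratio()

score = name_similarity("Birthday", "DayOfBirth")
```

Here `Birthday` and `DayOfBirth` get a high but not perfect score, while `Car` and `Automobile` match only through the synonym table.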


25

Schema-Based: Description

Map elements based on description

empn //employee name

name //name of employee

Schema A Schema B


26

Schema-Based: Constraint Based

Map elements based on Constraints:

• Data Types

• Unique, Primary, Foreign

Name

PID

ID

PLevel

Name

PID

Employees

Permissions

Employees ID

Sum

Payments

?


27

Reuse Previous Matching

Schema A: Name, Salary

Schema B: AName, Income

Schema C: Author, Money

• Derive the mapping A --> C from the mappings A --> B and B --> C

• A partial reuse is also possible (e.g. on some of the attributes)

• Be aware of the domain: salary and income are not always the same!
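
Reusing previous matchings amounts to composing two mappings. A minimal sketch, assuming each mapping is a plain (1:1) dictionary from attribute name to attribute name; the function name `compose` is hypothetical:

```python
def compose(ab, bc):
    """Derive A->C from A->B and B->C; attributes with no B->C match are skipped."""
    return {a: bc[b] for a, b in ab.items() if b in bc}

ab = {"Name": "AName", "Salary": "Income"}
bc = {"AName": "Author", "Income": "Money"}
ac = compose(ab, bc)   # {"Name": "Author", "Salary": "Money"}
```

Partial reuse falls out naturally: if B --> C covers only some attributes, only those are carried over to A --> C.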


28

Complexity

• We must compare every subset of attributes in schema A to every subset in schema B

• Exponential in the number of attributes

• However, we can assume the number of attributes is bounded…

• Also check (n:m) matching only for n,m<C for some constant C


29

Contents


Approach


30

Data Mining

TransID

Item

1 Book

1 Pencil

2 Book

2 Soap

3 Book

3 Soap

Sells

Which items are likely to co-appear?

Data Mining is the process of discovering patterns in data, usually stored in a Database.


31

Data Mining (cont’d)

TransID

Item

1 Book

1 Pencil

2 Book

2 Soap

3 Book

3 Soap

Sells

Support of an itemset: the fraction of transactions that contain all items in the itemset.

What is the support for {Book}? 1

And for {Book, Soap}? 0.666

The A-Priori property: the support of any subset of an itemset is at least as large as the support of the itemset itself
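
The support computation for the Sells table can be checked with a few lines of Python. The data layout (a dictionary from TransID to a set of items) and the function name `support` are choices made for this sketch:

```python
# The Sells table, grouped by TransID.
sells = {
    1: {"Book", "Pencil"},
    2: {"Book", "Soap"},
    3: {"Book", "Soap"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

s_book = support({"Book"}, sells)               # 3 of 3 transactions
s_book_soap = support({"Book", "Soap"}, sells)  # 2 of 3 transactions
```

Since every transaction containing {Book, Soap} also contains {Book}, the support of the subset can never be smaller.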


32


TransID

Item

1 Book

1 Pencil

2 Book

2 Soap

3 Book

3 Soap

Sells

Algorithm to find frequent itemsets:

1. Define a threshold minSupport for “frequent” itemsets

2. Calculate support for all itemsets of size 1

3. Calculate support for itemsets of size 2,3,4…

4. For each size k save the frequent itemsets

5. Stop when there are no frequent itemsets of size k.

Why can we stop?
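
The five steps can be sketched as a small level-wise search. This is a simplified illustration rather than a tuned implementation; the function name `frequent_itemsets` and the candidate-generation details are assumptions of the sketch.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]   # candidates of size 1
    result = {}
    k = 1
    while current:
        frequent = {c: support(c) for c in current}
        frequent = {c: s for c, s in frequent.items() if s >= min_support}
        if not frequent:
            break   # no frequent itemset of size k, so none of size k+1 either
        result.update(frequent)
        k += 1
        # Candidates of size k: unions of frequent sets whose (k-1)-subsets are all frequent.
        unions = {a | b for a in frequent for b in frequent if len(a | b) == k}
        current = [c for c in unions
                   if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))]
    return result

sells = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]
found = frequent_itemsets(sells, 0.5)
```

With minSupport = 0.5 the search keeps {Book}, {Soap} and {Book, Soap}; {Pencil} is dropped at size 1, so no larger set containing it is ever generated.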


33


TransID

Item

1 Book

1 Pencil

2 Book

2 Soap

3 Book

3 Soap

Sells

Example:

1. Set minSupport = 0.5

2. S({Book})=1, S({Pencil})=0.33, S({Soap})=0.666

3. S({Book, Soap})=0.666

4. S({Book, Soap, Pencil})=0

Where is {Soap, Pencil}?


34

Back to Schema Matching…

Authors schemas from different sources:

{Id, First, Last}

{Id, Salary, Name, Year}

{Id, AuthorFirst, AuthorLast, YearBirth}

{Id, Author}

{Id, FirstName, LastName, Income}

Goal: Map {Name} to {Author}, {Salary} to {Income}…

Idea: {Name} and {Author} are unlikely to appear together

Solution: go to the supermarket, but instead of food buy attributes!

What is the difference from the supermarket example?


35

The Algorithm

Input: set of m schemas

{Id, First, Last}

{Id, Salary, Name, Year}

{Id, AuthorFirst, AuthorLast, YearBirth}

{Id, Author}

{Id, FirstName, LastName, Income}

Output: set of n-ary mappings

{Name}:{Author}:{AuthorFirst, AuthorLast}:{First,Last}…

{Salary}:{Income}

{Year}:{YearBirth}


36

Algorithm

Naive Algorithm

1. Make a list L of all attributes from all schemas

L = {Name, Salary, FirstName, LastName, Author, First, Last…}

2. For each pair of attributes, calculate their support (how often they appear together)

S(Name, Salary) = 0.4

S(First, Last) = 0.95

S(Last, Name) = 0.1
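
Steps 1 and 2 can be sketched as follows, using the five example Authors schemas as input. The function name `pair_support` is hypothetical, and the resulting values differ from the illustrative numbers above, which presumably describe a larger collection of schemas.

```python
from itertools import combinations

schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]

def pair_support(schemas):
    """Map each attribute pair to the fraction of schemas that contain both."""
    attributes = sorted(set().union(*schemas))        # step 1: the list L
    n = len(schemas)
    return {
        (a, b): sum(1 for s in schemas if a in s and b in s) / n   # step 2
        for a, b in combinations(attributes, 2)
    }

support = pair_support(schemas)
```

Pairs are keyed in alphabetical order, so `support[("Author", "Name")]` is 0.0 (the two never co-appear) while `support[("First", "Last")]` is 0.2.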


37

Algorithm (Cont’d)

S(Name, Salary) = 0.4

S(First, Last) = 0.95

S(Last, Name) = 0.1

3. Choose groups with low support

4. Using the A-Priori property calculate support for groups of sizes 3,4,5…

S(Name, LastName, Salary) = 0

S(First, Last, Salary) = 0.1

5. Return all groups with low support


38

Algorithm (Cont’d)

The algorithm is naive.

Suppose we have negative correlation for this:

{name, author}

Then we also have negative correlation for this:

{name, author, salary}

{name, author, yearOfBirth}

Actually for any attribute X we have:

{name, author, X}


39

Improvement

Improvement: Define the support (s) of an itemset {a,b,c…} to be

MAX { s(a,b), s(b,c), s(a,c) … }

Example:

s(name, author)=0.1

s(name, salary)=0.5

s(salary, author)=0.6

s(name,author,salary)=MAX (0.1,0.5,0.6)=0.6

Now the support can go up when an attribute is added, so a larger group no longer passes the low-support test trivially

What is the logic behind this?
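
The max-based definition can be written down directly. In this sketch the pairwise supports are stored in a dictionary keyed by unordered pairs; the function name `group_support` is hypothetical.

```python
from itertools import combinations

def group_support(group, pair_support):
    """Support of a group = the maximum support over all of its attribute pairs."""
    return max(pair_support[frozenset(pair)] for pair in combinations(group, 2))

s = {
    frozenset({"name", "author"}): 0.1,
    frozenset({"name", "salary"}): 0.5,
    frozenset({"salary", "author"}): 0.6,
}

low = group_support(["name", "author"], s)             # 0.1
high = group_support(["name", "author", "salary"], s)  # MAX(0.1, 0.5, 0.6) = 0.6
```

A group keeps a low value only if every pair in it rarely co-appears, so adding salary to {name, author} raises the value from 0.1 to 0.6.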


40

Generalizing the algorithm

Now the algorithm finds all groups of attributes (a,b,c…) s.t. none of the pairs appears together.

Hopefully these are attributes with the same semantics:

{name, author}

{salary, payments}

…

But what about this?

({first,last}, {name})

Currently we find only (1:1) matching

For (n:m) we need to preprocess…


41

Preprocess

Pre-Process for the algorithm:

1. Make a list L of all attributes from all schemas

L = {Name, Salary, FirstName,

2. Run the normal A-Priori algorithm (find all attributes that DO appear together)

S(first, last)=0.9

S(firstName,lastName)=0.85


42

Preprocess

3. For each schema S in the input:

For each frequent attributes group A:

If A intersects with S then add new attribute “A” to S

S: Id, First, Last

A: {First, Last}

S’: Id, First, Last, {First,Last}

4. Run the previous algorithm on S1’, S2’… to find negative correlation

Now we can find groups like:

({first,last}, {name})
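
Step 3 can be sketched as a function that extends one schema with every frequent group it intersects. The function name `annotate` and the representation (attribute groups as `frozenset`s stored next to plain attribute names) are assumptions of this sketch.

```python
def annotate(schema, groups):
    """Return S': the schema plus every frequent group that intersects it."""
    extended = set(schema)
    for group in groups:
        if group & schema:          # "A intersects with S"
            extended.add(group)     # the whole group becomes one new attribute
    return extended

first_last = frozenset({"First", "Last"})
s_prime = annotate({"Id", "First", "Last"}, [first_last])
untouched = annotate({"Id", "Author"}, [first_last])
```

The extended schema contains the group as a single item, so the negative-correlation step can later pair {First, Last} with an attribute such as Name.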


43

Still Not Perfect…

Suppose we found these mappings:

{first,last}:{name}:{author}

{first, yearOfBirth}:{birthDate}

{yearOfBirth, monthOfBirth}:{birthDate}

There is a contradiction!


44

Solution

Solution: rank the mappings according to the support of the lowest pair in each mapping

1. {first,last}:{name}:{author}

2. {first, yearOfBirth}:{birthDate}

3. {yearOfBirth, monthOfBirth}:{birthDate}

Add the top rank to the results:

1. {first,last}:{name}:{author}

Delete contradictions to this rank:

2. {first, yearOfBirth}:{birthDate} X

Process next mapping:

3. {yearOfBirth, monthOfBirth}:{birthDate}
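
The selection loop can be sketched as a greedy pass over the ranked list. The conflict test used here is an assumption of the sketch: two mappings contradict each other if they contain two different groups that share an attribute. The function names `conflicts` and `select` are hypothetical.

```python
def conflicts(m1, m2):
    """Assumed rule: two different groups that share an attribute contradict."""
    return any(g1 != g2 and g1 & g2 for g1 in m1 for g2 in m2)

def select(ranked):
    """Walk the ranked mappings; keep each one that contradicts no kept mapping."""
    chosen = []
    for mapping in ranked:
        if not any(conflicts(mapping, kept) for kept in chosen):
            chosen.append(mapping)
    return chosen

m1 = [frozenset({"first", "last"}), frozenset({"name"}), frozenset({"author"})]
m2 = [frozenset({"first", "yearOfBirth"}), frozenset({"birthDate"})]
m3 = [frozenset({"yearOfBirth", "monthOfBirth"}), frozenset({"birthDate"})]

result = select([m1, m2, m3])   # the input is assumed to be ranked already
```

Under this rule mapping 2 is dropped because {first, yearOfBirth} overlaps {first, last}, and mapping 3 is then accepted.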


45

Attributes with the same name

Step 1 of the algorithm (reminder):

Make a list S of all attributes from all schemas

S = {Name, Salary, FirstName,

This means that two attributes with the same name are always considered the same.

Payment (longint)

Payment (datetime)?

Solution: add the type to the name

Id, First, Last --> Id_Int, First_String, Last_String
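
Adding the type to the name is a one-line transformation. A minimal sketch, assuming each schema is given as a dictionary from attribute name to type name; the function name `qualify` is hypothetical.

```python
def qualify(schema):
    """Turn {name: type} into a set of type-qualified attribute names."""
    return {f"{name}_{dtype}" for name, dtype in schema.items()}

qualified = qualify({"Id": "Int", "First": "String", "Last": "String"})
old = qualify({"Payment": "longint"})
new = qualify({"Payment": "datetime"})
```

After qualification the two Payment attributes become `Payment_longint` and `Payment_datetime`, so they are no longer treated as the same item.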


46

Correlation Measure

The rare attribute problem:

{Id, First, Last}

{Id, Salary, Name, Year}

{Id, AuthorFirst, AuthorLast, YearBirth}

{Id, Author}

{Id, FirstName, LastName, Income}

s(Income, Id)=0.2

So Income=Id?


47

Correlation Measure (cont’d)

The sparseness problem:

{Id, First, Last}

{Id, Salary, Name, Year}

{Id, AuthorFirst, AuthorLast, YearBirth}

{Id, Author}

{Id, FirstName, LastName, Income}

s(Salary, Income)=0

If Salary=Income then what are their equivalents in the other tables?


48


Support of an itemset: the fraction of transactions that contain all items in the itemset.

There are other ways to calculate support:

Let A,B be two attributes. Define

f11: the number of schemas where both A and B appear

f10: the number of schemas where only A appears

…

f1+: f11+f10

      A     ^A
B     f11   f10   f1+
^B    f01   f00   f0+
      f+1   f+0   f++


49


We used:

support = f11 / f++

Lift:

(f00 · f11) / (f10 · f01)

H-measure:

(f01 · f10) / (f+1 · f1+)

      A     ^A
B     f11   f10   f1+
^B    f01   f00   f0+
      f+1   f+0   f++

Every measure fits a different situation

For example, in the matching problem we want to “punish” attributes that co-appear

{Id, Salary, Name, Year}
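
The counts and two of the measures can be computed directly from a list of schemas. A sketch under stated assumptions: the function names are hypothetical, f10 counts schemas with only A and f01 schemas with only B, and both attributes are assumed to appear in at least one schema so the H-measure denominator is nonzero.

```python
schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]

def contingency(a, b, schemas):
    """Return (f11, f10, f01, f00) for attributes a and b."""
    f11 = sum(1 for s in schemas if a in s and b in s)
    f10 = sum(1 for s in schemas if a in s and b not in s)
    f01 = sum(1 for s in schemas if a not in s and b in s)
    f00 = len(schemas) - f11 - f10 - f01
    return f11, f10, f01, f00

def support(a, b, schemas):
    """support = f11 / f++"""
    f11, f10, f01, f00 = contingency(a, b, schemas)
    return f11 / (f11 + f10 + f01 + f00)

def h_measure(a, b, schemas):
    """H = f01 * f10 / (f+1 * f1+), with f+1 = f11 + f01 and f1+ = f11 + f10."""
    f11, f10, f01, f00 = contingency(a, b, schemas)
    return (f01 * f10) / ((f11 + f01) * (f11 + f10))

h_salary_income = h_measure("Salary", "Income", schemas)
h_id_income = h_measure("Id", "Income", schemas)
```

On these five schemas the support of (Id, Income) is a low 0.2, yet its H-measure is 0 because Income never appears without Id; (Salary, Income) never co-appear and get the maximal H-measure of 1.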


50

Applications

This approach can only be used when we have many schemas

• Web query interfaces. Example:

El-Al.Com: Adult, Child, Infant

Arkia.Com: Adult, Child, Destination

American Airlines.Com: Passengers, To

• Data Migration?

Is it possible to use the algorithm for migration by running it on many random schemas?


51

Complexity

The A-Priori algorithm is O(2^n) in the worst case

Usually there are only a few correlations, so in step (k+1) we extend just a few of the groups of size k


52
