New England Database Society (NEDS)
Friday, April 23, 2004
Volen 101, Brandeis University
Sponsored by Sun Microsystems
Learning to Reconcile Semantic Heterogeneity
Alon Halevy
University of Washington, Seattle
NEDS, April 23, 2004
Large-Scale Data Sharing
Large-scale data sharing is pervasive:
• Big science (bio-medicine, astrophysics, …)
• Government agencies
• Large corporations
• The web (over 100,000 searchable data sources)
• “Enterprise Information Integration” industry
The vision:
• Content authoring by anyone, anywhere
• Powerful database-style querying
• Use relevant data from anywhere to answer the query
The Semantic Web
Fundamental problem: reconciling different models of the world.
Large-Scale Scientific Data Sharing
UW
UW Microbiology
Harvard Genetics
UW Genome Sciences
OMIM HUGO
Swiss-Prot
GeneClinics
Data Integration
OMIM Swiss-Prot
HUGO GO
Gene-Clinics
Entrez LocusLink
GEO
Entity
Sequenceable Entity
Gene
Phenotype
Structured Vocabulary
Experiment
Protein
Nucleotide Sequence
Microarray Experiment
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
www.biomediator.org [Tarczy-Hornoch, Mork]
Peer Data Management Systems
UW
Stanford
DBLP
M.I.T Brown
CiteSeer
Brandeis
Q Q1 Q2 Q3 Q4 Q5 Q6
Mappings specified locally
Map to most convenient nodes
Queries answered by traversing semantic paths.
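A minimal sketch of this path traversal, assuming toy peers and string-rewriting stand-ins for real schema mappings (the peer names follow the figure; the queries and rewrite rules are invented):

```python
from collections import deque

# Hypothetical sketch: each edge in the peer graph carries a mapping that
# rewrites a query from one peer's schema into a neighbor's schema.
mappings = {
    ("UW", "Stanford"): lambda q: q.replace("uw.", "stanford."),
    ("Stanford", "DBLP"): lambda q: q.replace("stanford.", "dblp."),
    ("UW", "CiteSeer"): lambda q: q.replace("uw.", "citeseer."),
}

def reformulate_everywhere(start_peer, query):
    """BFS over semantic mappings: return the query as rewritten at each reachable peer."""
    seen = {start_peer: query}
    frontier = deque([start_peer])
    while frontier:
        peer = frontier.popleft()
        for (src, dst), rewrite in mappings.items():
            if src == peer and dst not in seen:
                seen[dst] = rewrite(seen[peer])
                frontier.append(dst)
    return seen

results = reformulate_everywhere("UW", "SELECT * FROM uw.papers")
```

Each peer only maps to its most convenient neighbors; the traversal composes those local mappings into semantic paths.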
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
UW Stanford
DBLP
Roma Paris
CiteSeer
Vienna
Q
Q’
Q’ Q’’
Q’’
Q’’
Q’’
Mediated Schema
R1 R2 R3 R4 R5
• Data integration
• PDMS
• Message passing
• Web services
• Data warehousing
Data Sharing Architectures
Semantic Mappings
Mediated Schema
Q
Q’ Q’ Q’

SSN | Name | Category
123-45-6789 | Charles | undergrad
234-56-7890 | Dan | grad
… | … | …

SSN | CID
123-45-6789 | CSE444
234-56-7890 | CSE142
…

CID | Name | Quarter
CSE444 | Databases | fall
CSE541 | Operating systems | winter

[The same source tables appear once per source in the original figure.]
Formalism for mappings
Reformulation algorithms
How will we create them?
Semantic Mappings: Example
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Inventory Database A
Inventory Database B
Differences in: Names in schema Attribute grouping
Coverage of databases Granularity and format of attributes
Why is Schema Matching so Hard?
Because schemas never fully capture their intended meaning: schema elements are just symbols. We need to leverage any additional information we may have.
‘Theorem’: Schema matching is AI-complete. Hence, a human will always be in the loop. The goal is to improve the designer’s productivity, and the solution must be extensible.
Dimensions of the Problem (1)
Schema Matching: discovering correspondences between similar elements.
Schema Mapping: BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Inventory Database A
Inventory Database B
Matching vs. Mapping
Dimensions of the Problem (2)
Schema level vs. instance level: Alon Halevy, A. Halevy, Alon Y. Levy – same guy! Can’t always separate the two levels. Crucial for Personal Info Management (see Semex).
What are we mapping? Schemas Web service descriptions Business logic and processes Ontologies
Important Special Cases
Mapping to a common mediated schema? Or mapping two arbitrary schemas?
One schema may be a new version of the other.
The two schemas may be evolutions of the same original schema.
Web forms.
Horizontal integration: many sources talking about the same stuff.
Vertical integration: sources covering different parts of the domain, with only little overlap.
Problem Definition
Given: S1 and S2, a pair of schemas/DTDs/ontologies, …; possibly, accompanying data instances; and additional domain knowledge.
Find: a match between S1 and S2, i.e., a set of correspondences between their terms.
Outline
Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services.
Typical Matching Heuristics
[See Rahm & Bernstein, VLDB J. 2001, for a survey]
Build a model for every element from multiple sources of evidence in the schemas:
• Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
• Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book
• Data types and data instances: DateTime vs. Integer; addresses have similar formats
• Schema structure: all books have similar attributes
These models consider only the two schemas.
In isolation, these techniques are incomplete or brittle: a principled combination is needed. [See the Coma system]
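As an illustration of why combination helps, here is a minimal sketch that blends two of the above signals, name similarity and data-instance overlap, with fixed weights. The attribute names and sample values are invented; real systems learn the combination instead of hard-coding it:

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Normalize away case and common punctuation, then compare strings.
    a = a.lower().replace("-", "").replace("_", "")
    b = b.lower().replace("-", "").replace("_", "")
    return SequenceMatcher(None, a, b).ratio()

def instance_sim(values_a, values_b):
    # Jaccard overlap of sample data values.
    sa, sb = set(values_a), set(values_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def combined_score(attr_a, attr_b, data_a, data_b, w_name=0.5, w_inst=0.5):
    # Fixed-weight combination of the two pieces of evidence.
    return w_name * name_sim(attr_a, attr_b) + w_inst * instance_sim(data_a, data_b)
```

For example, `combined_score("Categories", "BookCategories", ["fiction", "history"], ["fiction", "travel"])` scores well above an unrelated pair like ("Price", "ISBN"), even though neither signal alone is conclusive.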
Matching to a Mediated Schema
[Doan et al., SIGMOD 2001; MLJ 2003]
Find houses with four bathrooms priced under $500,000
mediated schema
homes.com realestate.com
source schema 2
homeseekers.com
source schema 3
source schema 1
Query reformulation and optimization.
Finding Semantic Mappings
Source schemas = XML DTDs
house
  location
  contact
    name
    phone

house
  address
  num-baths
    full-baths
    half-baths
  contact-info
    agent-name
    agent-phone

1-1 mapping | non 1-1 mapping
Learning from Previous Matching
Every matching task is a learning opportunity.
Several types of knowledge are used in learning:
• Schema elements, e.g., attribute names
• Data elements: ranges, formats, word frequencies, value frequencies, length of texts
• Proximity of attributes
• Functional dependencies, number of attribute occurrences
Matching Real-Estate Sources

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments
  location: Miami, FL; Boston, MA; …
  listed-price: $250,000; $110,000; …
  phone: (305) 729 0831; (617) 253 1429; …
  comments: Fantastic house; Great location; …

homes.com:
  price: $550,000; $320,000; …
  contact-phone: (278) 345 7215; (617) 335 2315; …
  extra-info: Beautiful yard; Great beach; …

Learned hypotheses:
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description
Learning to Match Schemas
Mediated schema
Source schemas
Data listings
Constraint Handler
Mappings
User Feedback
Domain Constraints
Training Phase
Matching Phase
Base-Learner1 Base-Learnerk Meta-Learner
Multi-strategy Learning System
Multi-Strategy Learning
Use a set of base learners (name learner, Naïve Bayes, Whirl, XML learner) and a set of recognizers (county name, zip code, phone numbers).
Each base learner produces a prediction weighted by a confidence score. The base learners are combined with a meta-learner, using stacking.
Name Learner
Base Learners
(contact,agent-phone)
(contact-info,office-address)
(phone, agent-phone)
(listed-price, price)
contact-phone => (agent-phone,0.7), (office-address,0.3)
Naive Bayes Learner [Domingos&Pazzani 97] “Kent, WA” => (address,0.8), (name,0.2)
Whirl Learner [Cohen&Hirsh 98]
XML Learner exploits hierarchical structure of XML data
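As a toy illustration of one base learner, the sketch below trains a word-level Naive Bayes classifier to tag values as address vs. name. The five training rows are invented and far smaller than anything LSD would actually use:

```python
from collections import Counter, defaultdict
import math

# Invented toy training data: (text value, mediated-schema element).
training = [
    ("Seattle , WA", "address"), ("Kent , WA", "address"), ("Austin , TX", "address"),
    ("James Smith", "name"), ("Mike Doan", "name"),
]

word_counts = defaultdict(Counter)  # per-label word frequencies
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.lower().split())

def predict(text):
    """Return the most likely label under a multinomial Naive Bayes model."""
    vocab = {w for c in word_counts.values() for w in c}
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        logp = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.lower().split():
            # Laplace smoothing over the shared vocabulary.
            logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)
```

Tokens like “WA” and “,” end up far more probable under address than under name, which is exactly the kind of evidence the slide’s “Kent, WA” example relies on.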
(contact,agent-phone)
(contact-info,office-address)
(phone, agent-phone)
(listed-price, price)
(contact-phone, ? )
Meta-Learner: Stacking
Training the meta-learner produces a weight for every pair of (base-learner, mediated-schema element):
  weight(Name-Learner, address) = 0.1
  weight(Naive-Bayes, address) = 0.9
To combine the base learners’ predictions, the meta-learner computes a weighted sum of their confidence scores.
<area>Seattle, WA</>
  Name Learner: (address, 0.6)
  Naive Bayes: (address, 0.8)
  Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
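The weighted-sum combination in the example above can be spelled out directly; the weights and confidence scores are the slide’s own numbers:

```python
# Stacking: the meta-learner holds one learned weight per
# (base-learner, mediated-schema element) pair.
weights = {
    ("name_learner", "address"): 0.1,
    ("naive_bayes", "address"): 0.9,
}

# Confidence each base learner assigns to "address" for <area>Seattle, WA</>.
predictions = {
    "name_learner": 0.6,
    "naive_bayes": 0.8,
}

# Combined score = weighted sum of base-learner confidences.
combined = sum(weights[(learner, "address")] * conf
               for learner, conf in predictions.items())
# combined = 0.6*0.1 + 0.8*0.9 = 0.78, matching the slide's example
```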
Applying the Learners
<extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
<day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
<area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>

Name Learner: (address, 0.8), (description, 0.2)
Naive Bayes: (address, 0.6), (description, 0.4)
Meta-Learner: (address, 0.7), (description, 0.3)
Meta-Learner: (description, 0.8), (address, 0.2)
Meta-Learner: (agent-phone, 0.9), (description, 0.1)

Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
Empirical Evaluation
Four domains: Real Estate I & II, Course Offerings, Faculty Listings.
For each domain: create a mediated DTD & domain constraints; choose five sources. Mediated DTDs: 14 - 66 elements; source DTDs: 13 - 48.
Ten runs for each experiment. In each run: manually provide 1-1 mappings for 3 sources; ask LSD to propose mappings for the remaining 2 sources. Accuracy = % of 1-1 mappings correctly identified.
Matching Accuracy
[Bar chart: average matching accuracy (%) per domain: Real Estate I, Real Estate II, Course Offerings, Faculty Listings]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
Outline
Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services
Corpus-Based Schema Matching
[Madhavan, Doan, Bernstein, Halevy]
Can we use previous experience to match two new schemas?
Learn about a domain, rather than a mediated schema?
CDs Categories Artists
Items
Artists
Authors Books
Music
Information
Literature
Publisher
Authors
Corpus of Schemas and Matches
Reuse extracted knowledge to match new schemas.
Learn general-purpose knowledge.
A classifier for every corpus element.
Exploiting The Corpus
Given an element s ∈ S and an element t ∈ T, how do we determine whether s and t are similar?
The PIVOT method: elements are similar if they are similar to the same corpus concepts.
The AUGMENT method: enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement
For an element s of schema S, compute an interpretation I(s) w.r.t. the corpus: a vector with one entry per corpus concept, where P_k = Probability(s ~ c_k).
The interpretation captures how similar an element is to each corpus concept. To match s ∈ S with t ∈ T, compute Similarity(I(s), I(t)) using the cosine distance.
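A sketch of the comparison step, assuming interpretations are plain probability vectors over a toy 4-concept corpus (the vectors below are invented):

```python
import math

def interpretation_similarity(i_s, i_t):
    """Cosine similarity of two interpretation vectors, where entry k
    estimates Probability(element ~ corpus concept c_k)."""
    dot = sum(a * b for a, b in zip(i_s, i_t))
    norm = math.sqrt(sum(a * a for a in i_s)) * math.sqrt(sum(b * b for b in i_t))
    return dot / norm if norm else 0.0

# Two elements that agree on the same corpus concepts score high;
# an element pointing at different concepts scores low.
I_s = [0.9, 0.1, 0.0, 0.2]
I_t = [0.8, 0.2, 0.1, 0.3]
I_u = [0.0, 0.1, 0.9, 0.0]
```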
Augmenting element models
For an element s of schema S, build an element model M_s from its name, instances, type, …
Search for similar corpus concepts: pick the most similar ones from the interpretation.
Build augmented models M’_s: robust, since there is more training data to learn from.
Compare elements using the augmented models.
(Corpus of known schemas and mappings)
Experimental Results
Five domains: auto and real estate (web forms); invsmall and inventory (relational schemas); nameaddr (real XML schemas).
Performance measure: F-measure, with precision and recall measured in terms of the matches predicted:

  f = (2 * precision * recall) / (precision + recall)

Results averaged over hundreds of schema-matching tasks!
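The F-measure is just the harmonic mean of precision and recall; written out:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is dragged toward the weaker of the two components: a matcher cannot compensate for poor recall with high precision alone.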
Comparison over domains
[Bar chart: average F-measure for direct, augment, and pivot over auto, real estate, invsmall, inventory, nameaddr]
Corpus-based techniques perform better in all the domains.
“Tough” schema pairs
[Bar chart: average F-measure for direct, augment, and pivot over auto, real estate, invsmall, inventory, nameaddr]
Significant improvement on difficult-to-match schema pairs.
Mixed corpus
[Bar chart: average F-measure for direct, augment, and pivot over: auto + re + invsmall; difficult auto + invsmall]
A corpus with schemas from different domains can also be useful.
Other Corpus-Based Tools
A corpus of schemas can be the basis for many useful tools. Can we mirror the success of corpora in IR and NLP?
• Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
• Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
Outline
Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services
Searching for Web Services
[Dong, Madhavan, Nemes, Halevy, Zhang]
Over 1000 web services already on WWW.
Keyword search is not sufficient.
Search involves drill-down; we don’t want to repeat it. Hence:
• Find similar operations
• Find operations that compose with this one
1) Operations With Similar Functionality
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
=> Similar operations
2) Operations with Similar Inputs/Outputs
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode. Input: Zipcode. Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState. Input: ZipCode. Output: City, State
=> Similar inputs
3) Composable Operations
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode. Input: Zipcode. Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState. Input: ZipCode. Output: City, State
Op5: CityStateToZipCode. Input: City, State. Output: ZipCode
The input of Op2 is similar to the output of Op5 => composition
Why is this Hard?
There is little to go on:
• Input/output parameters (they don’t mean much)
• Method name
• Text descriptions of the operation or web service (typically bad)
Differences from schema matching:
• A web service is not a coherent schema
• Different level of granularity
Main Ideas
Measure the similarity of each component of the WS operation: inputs, outputs, description, and the web-service description.
Cluster parameter names into concepts. Heuristic: parameters occurring together tend to express the same concepts.
When comparing inputs/outputs, compare parameters and concepts separately, then combine the results.
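A rough sketch of the input/output comparison, with an invented toy “clustering” of parameters into concepts (the real system derives the clusters from parameter co-occurrence rather than a hand-written table):

```python
# Toy parameter-to-concept table standing in for learned clusters.
concept_of = {
    "Zip": "zipcode", "PostCode": "zipcode", "Zipcode": "zipcode", "ZipCode": "zipcode",
    "TemperatureF": "weather", "WindChill": "weather", "Humidity": "weather",
    "City": "place", "State": "place",
}

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def io_similarity(params1, params2, w_param=0.5, w_concept=0.5):
    """Compare two parameter lists at the parameter level and at the
    concept level, then combine the two scores."""
    concepts1 = {concept_of.get(p, p) for p in params1}
    concepts2 = {concept_of.get(p, p) for p in params2}
    return w_param * jaccard(params1, params2) + w_concept * jaccard(concepts1, concepts2)

zip_vs_postcode = io_similarity(["Zip"], ["PostCode"])
zip_vs_citystate = io_similarity(["Zip"], ["City", "State"])
```

The concept level is what lets Zip and PostCode match at all: they share no parameter name, but they fall into the same concept cluster.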
Operation Match Effectiveness
[Bar chart: precision, recall, and R-precision for four methods: description, simple, combine, structure aware]
Precision and Recall Results
Woogle
A collection of 790 web services; 431 active web services, 1262 operations.
Functions:
• Web service similarity search
• Keyword search on web service descriptions
• Keyword search on inputs/outputs
• Web service category browse
• Web service on-site try
• Web service status report
http://www.cs.washington.edu/woogle
Conclusion
Semantic reconciliation is crucial for data sharing.
Learning from experience is an important ingredient (see Transformic Inc.).
Current challenges: large schemas, GUIs, and dealing with other meta-data issues.