New England Database Society (NEDS)
Friday, April 23, 2004
Volen 101, Brandeis University
Sponsored by Sun Microsystems
Learning to Reconcile Semantic Heterogeneity
Alon Halevy
University of Washington, Seattle
NEDS, April 23, 2004
Large-Scale Data Sharing
Large-scale data sharing is pervasive:
• Big science (bio-medicine, astrophysics, …)
• Government agencies
• Large corporations
• The web (over 100,000 searchable data sources)
• “Enterprise Information Integration” industry
The vision:
• Content authoring by anyone, anywhere
• Powerful database-style querying
• Use relevant data from anywhere to answer the query
The Semantic Web
Fundamental problem: reconciling different models of the world.
Large-Scale Scientific Data Sharing
UW
UW Microbiology
Harvard Genetics
UW Genome Sciences
OMIM HUGO
Swiss-Prot
GeneClinics
Data Integration
OMIM Swiss-Prot
HUGO GO
Gene-Clinics
Entrez LocusLink
GEO
Entity
Sequenceable Entity
Gene
Phenotype
Structured Vocabulary
Experiment
Protein
Nucleotide Sequence
Microarray Experiment
Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
www.biomediator.org [Tarczy-Hornoch, Mork]
Peer Data Management Systems
UW
Stanford
DBLP
M.I.T Brown
CiteSeer
Brandeis
Q Q1 Q2 Q3 Q4 Q5 Q6
Mappings specified locally
Map to most convenient nodes
Queries answered by traversing semantic paths.
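A minimal sketch of this path traversal, assuming toy peers and string-rewriting stand-ins for real schema mappings (the peer names follow the figure; the queries and rewrite rules are invented):

```python
from collections import deque

# Hypothetical sketch: each edge in the peer graph carries a mapping that
# rewrites a query from one peer's schema into a neighbor's schema.
mappings = {
    ("UW", "Stanford"): lambda q: q.replace("uw.", "stanford."),
    ("Stanford", "DBLP"): lambda q: q.replace("stanford.", "dblp."),
    ("UW", "CiteSeer"): lambda q: q.replace("uw.", "citeseer."),
}

def reformulate_everywhere(start_peer, query):
    """BFS over semantic mappings: return the query as rewritten at each reachable peer."""
    seen = {start_peer: query}
    frontier = deque([start_peer])
    while frontier:
        peer = frontier.popleft()
        for (src, dst), rewrite in mappings.items():
            if src == peer and dst not in seen:
                seen[dst] = rewrite(seen[peer])
                frontier.append(dst)
    return seen

results = reformulate_everywhere("UW", "SELECT * FROM uw.papers")
```

Each peer only maps to its most convenient neighbors; the traversal composes those local mappings into semantic paths.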
Piazza: [Tatarinov, H., Ives, Suciu, Mork]
UW Stanford
DBLP
Roma Paris
CiteSeer
Vienna
Q
Q’
Q’ Q’’
Q’’
Q’’
Q’’
Mediated Schema
R1 R2 R3 R4 R5
• Data integration
• PDMS
• Message passing
• Web services
• Data warehousing
Data Sharing Architectures
Semantic Mappings
Mediated Schema
Q
Q’ Q’ Q’

SSN | Name | Category
123-45-6789 | Charles | undergrad
234-56-7890 | Dan | grad
… | … | …

SSN | CID
123-45-6789 | CSE444
234-56-7890 | CSE142
…

CID | Name | Quarter
CSE444 | Databases | fall
CSE541 | Operating systems | winter

[The same source tables appear once per source in the original figure.]
Formalism for mappings
Reformulation algorithms
How will we create them?
Semantic Mappings: Example
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Inventory Database A
Inventory Database B
Differences in: Names in schema Attribute grouping
Coverage of databases Granularity and format of attributes
Why is Schema Matching so Hard?
Because schemas never fully capture their intended meaning: schema elements are just symbols. We need to leverage any additional information we may have.
‘Theorem’: Schema matching is AI-complete. Hence, a human will always be in the loop. The goal is to improve the designer’s productivity, and the solution must be extensible.
Dimensions of the Problem (1)
Schema Matching: discovering correspondences between similar elements.
Schema Mapping: BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)
Inventory Database A
Inventory Database B
Matching vs. Mapping
Dimensions of the Problem (2)
Schema level vs. instance level: Alon Halevy, A. Halevy, Alon Y. Levy – same guy! Can’t always separate the two levels. Crucial for Personal Info Management (see Semex).
What are we mapping? Schemas Web service descriptions Business logic and processes Ontologies
Important Special Cases
Mapping to a common mediated schema? Or mapping two arbitrary schemas?
One schema may be a new version of the other.
The two schemas may be evolutions of the same original schema.
Web forms.
Horizontal integration: many sources talking about the same stuff.
Vertical integration: sources covering different parts of the domain, with only little overlap.
Problem Definition
Given: S1 and S2, a pair of schemas/DTDs/ontologies, …; possibly, accompanying data instances; and additional domain knowledge.
Find: a match between S1 and S2, i.e., a set of correspondences between their terms.
Outline
Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services.
Typical Matching Heuristics
[See Rahm & Bernstein, VLDB J. 2001, for a survey]
Build a model for every element from multiple sources of evidence in the schemas:
• Schema element names: BooksAndCDs/Categories ~ BookCategories/Category
• Descriptions and documentation: ItemID: unique identifier for a book or a CD; ISBN: unique identifier for any book
• Data types and data instances: DateTime vs. Integer; addresses have similar formats
• Schema structure: all books have similar attributes
These models consider only the two schemas.
In isolation, these techniques are incomplete or brittle: a principled combination is needed. [See the Coma system]
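As an illustration of why combination helps, here is a minimal sketch that blends two of the above signals, name similarity and data-instance overlap, with fixed weights. The attribute names and sample values are invented; real systems learn the combination instead of hard-coding it:

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Normalize away case and common punctuation, then compare strings.
    a = a.lower().replace("-", "").replace("_", "")
    b = b.lower().replace("-", "").replace("_", "")
    return SequenceMatcher(None, a, b).ratio()

def instance_sim(values_a, values_b):
    # Jaccard overlap of sample data values.
    sa, sb = set(values_a), set(values_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def combined_score(attr_a, attr_b, data_a, data_b, w_name=0.5, w_inst=0.5):
    # Fixed-weight combination of the two pieces of evidence.
    return w_name * name_sim(attr_a, attr_b) + w_inst * instance_sim(data_a, data_b)
```

For example, `combined_score("Categories", "BookCategories", ["fiction", "history"], ["fiction", "travel"])` scores well above an unrelated pair like ("Price", "ISBN"), even though neither signal alone is conclusive.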
Matching to a Mediated Schema
[Doan et al., SIGMOD 2001; MLJ 2003]
Find houses with four bathrooms priced under $500,000
mediated schema
homes.com realestate.com
source schema 2
homeseekers.com
source schema 3
source schema 1
Query reformulation and optimization.
Finding Semantic Mappings
Source schemas = XML DTDs
house
  location
  contact
    name
    phone

house
  address
  num-baths
    full-baths
    half-baths
  contact-info
    agent-name
    agent-phone

1-1 mapping | non 1-1 mapping
Learning from Previous Matching
Every matching task is a learning opportunity.
Several types of knowledge are used in learning:
• Schema elements, e.g., attribute names
• Data elements: ranges, formats, word frequencies, value frequencies, length of texts
• Proximity of attributes
• Functional dependencies, number of attribute occurrences
Matching Real-Estate Sources

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments
  location: Miami, FL; Boston, MA; …
  listed-price: $250,000; $110,000; …
  phone: (305) 729 0831; (617) 253 1429; …
  comments: Fantastic house; Great location; …

homes.com:
  price: $550,000; $320,000; …
  contact-phone: (278) 345 7215; (617) 335 2315; …
  extra-info: Beautiful yard; Great beach; …

Learned hypotheses:
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description
Learning to Match Schemas
Mediated schema
Source schemas
Data listings
Constraint Handler
Mappings
User Feedback
Domain Constraints
Training Phase
Matching Phase
Base-Learner1 Base-Learnerk Meta-Learner
Multi-strategy Learning System
Multi-Strategy Learning
Use a set of base learners (name learner, Naïve Bayes, Whirl, XML learner) and a set of recognizers (county name, zip code, phone numbers).
Each base learner produces a prediction weighted by a confidence score. The base learners are combined with a meta-learner, using stacking.
Name Learner
Base Learners
(contact,agent-phone)
(contact-info,office-address)
(phone, agent-phone)
(listed-price, price)
contact-phone => (agent-phone,0.7), (office-address,0.3)
Naive Bayes Learner [Domingos&Pazzani 97] “Kent, WA” => (address,0.8), (name,0.2)
Whirl Learner [Cohen&Hirsh 98]
XML Learner exploits hierarchical structure of XML data
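As a toy illustration of one base learner, the sketch below trains a word-level Naive Bayes classifier to tag values as address vs. name. The five training rows are invented and far smaller than anything LSD would actually use:

```python
from collections import Counter, defaultdict
import math

# Invented toy training data: (text value, mediated-schema element).
training = [
    ("Seattle , WA", "address"), ("Kent , WA", "address"), ("Austin , TX", "address"),
    ("James Smith", "name"), ("Mike Doan", "name"),
]

word_counts = defaultdict(Counter)  # per-label word frequencies
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.lower().split())

def predict(text):
    """Return the most likely label under a multinomial Naive Bayes model."""
    vocab = {w for c in word_counts.values() for w in c}
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        logp = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.lower().split():
            # Laplace smoothing over the shared vocabulary.
            logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = logp
    return max(scores, key=scores.get)
```

Tokens like “WA” and “,” end up far more probable under address than under name, which is exactly the kind of evidence the slide’s “Kent, WA” example relies on.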
(contact,agent-phone)
(contact-info,office-address)
(phone, agent-phone)
(listed-price, price)
(contact-phone, ? )
Meta-Learner: Stacking
Training the meta-learner produces a weight for every pair of (base-learner, mediated-schema element):
  weight(Name-Learner, address) = 0.1
  weight(Naive-Bayes, address) = 0.9
To combine the base learners’ predictions, the meta-learner computes a weighted sum of their confidence scores.
<area>Seattle, WA</>
  Name Learner: (address, 0.6)
  Naive Bayes: (address, 0.8)
  Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
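The weighted-sum combination in the example above can be spelled out directly; the weights and confidence scores are the slide’s own numbers:

```python
# Stacking: the meta-learner holds one learned weight per
# (base-learner, mediated-schema element) pair.
weights = {
    ("name_learner", "address"): 0.1,
    ("naive_bayes", "address"): 0.9,
}

# Confidence each base learner assigns to "address" for <area>Seattle, WA</>.
predictions = {
    "name_learner": 0.6,
    "naive_bayes": 0.8,
}

# Combined score = weighted sum of base-learner confidences.
combined = sum(weights[(learner, "address")] * conf
               for learner, conf in predictions.items())
# combined = 0.6*0.1 + 0.8*0.9 = 0.78, matching the slide's example
```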
Applying the Learners
<extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
<day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
<area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</>

Name Learner: (address, 0.8), (description, 0.2)
Naive Bayes: (address, 0.6), (description, 0.4)
Meta-Learner: (address, 0.7), (description, 0.3)
Meta-Learner: (description, 0.8), (address, 0.2)
Meta-Learner: (agent-phone, 0.9), (description, 0.1)

Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
Empirical Evaluation
Four domains: Real Estate I & II, Course Offerings, Faculty Listings.
For each domain: create a mediated DTD & domain constraints; choose five sources. Mediated DTDs: 14 - 66 elements; source DTDs: 13 - 48.
Ten runs for each experiment. In each run: manually provide 1-1 mappings for 3 sources; ask LSD to propose mappings for the remaining 2 sources. Accuracy = % of 1-1 mappings correctly identified.
Matching Accuracy
[Bar chart: average matching accuracy (%) per domain: Real Estate I, Real Estate II, Course Offerings, Faculty Listings]
LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%
Outline
Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services
Corpus-Based Schema Matching
[Madhavan, Doan, Bernstein, Halevy]
Can we use previous experience to match two new schemas?
Learn about a domain, rather than a mediated schema?
CDs Categories Artists
Items
Artists
Authors Books
Music
Information
Literature
Publisher
Authors
Corpus of Schemas and Matches
Reuse extracted knowledge to match new schemas.
Learn general-purpose knowledge.
A classifier for every corpus element.
Exploiting The Corpus
Given an element s ∈ S and an element t ∈ T, how do we determine whether s and t are similar?
The PIVOT method: elements are similar if they are similar to the same corpus concepts.
The AUGMENT method: enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement
For an element s of schema S, compute an interpretation I(s) w.r.t. the corpus: a vector with one entry per corpus concept, where P_k = Probability(s ~ c_k).
The interpretation captures how similar an element is to each corpus concept. To match s ∈ S with t ∈ T, compute Similarity(I(s), I(t)) using the cosine distance.
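A sketch of the comparison step, assuming interpretations are plain probability vectors over a toy 4-concept corpus (the vectors below are invented):

```python
import math

def interpretation_similarity(i_s, i_t):
    """Cosine similarity of two interpretation vectors, where entry k
    estimates Probability(element ~ corpus concept c_k)."""
    dot = sum(a * b for a, b in zip(i_s, i_t))
    norm = math.sqrt(sum(a * a for a in i_s)) * math.sqrt(sum(b * b for b in i_t))
    return dot / norm if norm else 0.0

# Two elements that agree on the same corpus concepts score high;
# an element pointing at different concepts scores low.
I_s = [0.9, 0.1, 0.0, 0.2]
I_t = [0.8, 0.2, 0.1, 0.3]
I_u = [0.0, 0.1, 0.9, 0.0]
```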
Augmenting element models
For an element s of schema S, build an element model M_s from its name, instances, type, …
Search for similar corpus concepts: pick the most similar ones from the interpretation.
Build augmented models M’_s: robust, since there is more training data to learn from.
Compare elements using the augmented models.
(Corpus of known schemas and mappings)
Experimental Results
Five domains: auto and real estate (web forms); invsmall and inventory (relational schemas); nameaddr (real XML schemas).
Performance measure: F-measure, with precision and recall measured in terms of the matches predicted:

  f = (2 * precision * recall) / (precision + recall)

Results averaged over hundreds of schema-matching tasks!
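The F-measure is just the harmonic mean of precision and recall; written out:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is dragged toward the weaker of the two components: a matcher cannot compensate for poor recall with high precision alone.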
Comparison over domains
[Bar chart: average F-measure for direct, augment, and pivot over auto, real estate, invsmall, inventory, nameaddr]
Corpus-based techniques perform better in all the domains.
“Tough” schema pairs
[Bar chart: average F-measure for direct, augment, and pivot over auto, real estate, invsmall, inventory, nameaddr]
Significant improvement on difficult-to-match schema pairs.
Mixed corpus
[Bar chart: average F-measure for direct, augment, and pivot over: auto + re + invsmall; difficult auto + invsmall]
A corpus with schemas from different domains can also be useful.
Other Corpus-Based Tools
A corpus of schemas can be the basis for many useful tools. Can we mirror the success of corpora in IR and NLP?
• Auto-complete: I start creating a schema (or show sample data), and the tool suggests a completion.
• Formulating queries on new databases: I ask a query using my terminology, and it gets reformulated appropriately.
Outline
Motivation and problem definition
Learning to match to a mediated schema
Matching arbitrary schemas using a corpus
Matching web services
Searching for Web Services
[Dong, Madhavan, Nemes, Halevy, Zhang]
Over 1000 web services already on WWW.
Keyword search is not sufficient.
Search involves drill-down; we don’t want to repeat it. Hence:
• Find similar operations
• Find operations that compose with this one
1) Operations With Similar Functionality
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
=> Similar operations
2) Operations with Similar Inputs/Outputs
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode. Input: Zipcode. Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState. Input: ZipCode. Output: City, State
=> Similar inputs
3) Composable Operations
Op1: GetTemperature. Input: Zip, Authorization. Output: Return
Op2: WeatherFetcher. Input: PostCode. Output: TemperatureF, WindChill, Humidity
Op3: LocalTimeByZipcode. Input: Zipcode. Output: LocalTimeByZipCodeResult
Op4: ZipCodeToCityState. Input: ZipCode. Output: City, State
Op5: CityStateToZipCode. Input: City, State. Output: ZipCode
The input of Op2 is similar to the output of Op5 => composition
Why is this Hard?
There is little to go on:
• Input/output parameters (they don’t mean much)
• Method name
• Text descriptions of the operation or web service (typically bad)
Differences from schema matching:
• A web service is not a coherent schema
• Different level of granularity
Main Ideas
Measure the similarity of each component of the WS operation: inputs, outputs, description, and the web-service description.
Cluster parameter names into concepts. Heuristic: parameters occurring together tend to express the same concepts.
When comparing inputs/outputs, compare parameters and concepts separately, then combine the results.
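A rough sketch of the input/output comparison, with an invented toy “clustering” of parameters into concepts (the real system derives the clusters from parameter co-occurrence rather than a hand-written table):

```python
# Toy parameter-to-concept table standing in for learned clusters.
concept_of = {
    "Zip": "zipcode", "PostCode": "zipcode", "Zipcode": "zipcode", "ZipCode": "zipcode",
    "TemperatureF": "weather", "WindChill": "weather", "Humidity": "weather",
    "City": "place", "State": "place",
}

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def io_similarity(params1, params2, w_param=0.5, w_concept=0.5):
    """Compare two parameter lists at the parameter level and at the
    concept level, then combine the two scores."""
    concepts1 = {concept_of.get(p, p) for p in params1}
    concepts2 = {concept_of.get(p, p) for p in params2}
    return w_param * jaccard(params1, params2) + w_concept * jaccard(concepts1, concepts2)

zip_vs_postcode = io_similarity(["Zip"], ["PostCode"])
zip_vs_citystate = io_similarity(["Zip"], ["City", "State"])
```

The concept level is what lets Zip and PostCode match at all: they share no parameter name, but they fall into the same concept cluster.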
Operation Match Effectiveness
[Bar chart: precision, recall, and R-precision for four methods: description, simple, combine, structure aware]
Precision and Recall Results
Woogle
A collection of 790 web services; 431 active web services, 1262 operations.
Functions:
• Web service similarity search
• Keyword search on web service descriptions
• Keyword search on inputs/outputs
• Web service category browse
• Web service on-site try
• Web service status report
http://www.cs.washington.edu/woogle
Conclusion
Semantic reconciliation is crucial for data sharing.
Learning from experience is an important ingredient (see Transformic Inc.).
Current challenges: large schemas, GUIs, and dealing with other meta-data issues.