Post on 18-Dec-2015
Creating semantic mappings
(based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)
What have we studied before?
  Formalisms for specifying source descriptions, and how to use these descriptions to reformulate queries
What is the goal now?
  A set of techniques that help a designer create semantic mappings and source descriptions for a particular data integration application
  This is a heuristic task
  The idea is to reduce the time it takes to create semantic mappings
Motivating example
DVD vendor schema
Product(title, productionYear, releaseDate, basePrice, listPrice, rating, customerReviews, saleLocation)
Movie(title, year, director, directorOrigin, mainActors, genre, awards)
Locations(name, taxRate, shippingCharge)
Online aggregator schema
Item(title, year, classification, genre, director, price, starring, angReviews)
Recap: semantic heterogeneities
Table and attribute names can differ
  The attributes rating and classification, as well as mainActors and starring
Multiple attributes in one schema can correspond to a single attribute in the other
  basePrice and taxRate from the vendor are used to compute the value of price in the aggregator schema
Tabular organization may be different
  The DVD vendor requires 3 tables; the aggregator needs only one
Coverage and level of detail may differ
  The DVD vendor models releaseDate and awards, which the aggregator does not
Process of creating schema mappings
1) Schema matching
Creating correspondences between elements of the 2 schemas
Ex: title and rating in the vendor schema correspond to title and classification in the aggregator schema
2) Creating schema mappings from the correspondences (and filling in missing details)
Specifies the transformations that have to be applied to the source data in order to produce the target data
Ex: To compute the value of price in the aggregator schema, we have to join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate
Schema matching
Goal: to automatically create a set of correspondences between the 2 schemas
Given two schemas S1 and S2, whose schema elements are their table and attribute names, a correspondence A -> B states that a set of elements A in S1 maps to a set of elements B in S2
Most common correspondences are 1-1, where A and B are singletons
Ex: for the target relation Item, there are the following correspondences:
  Product.title -> title
  Movie.year -> year
  {Product.basePrice, Locations.taxRate} -> price
Output of the schema matcher
Associates a confidence measure (in [0,1]) with every correspondence
  Because of the heuristic nature of the schema matching process
  Possible heuristics: examine the names of the schema elements; examine the data values; etc.
May associate a filter with a correspondence
  Ex: {Product.basePrice, Locations.taxRate} -> price may apply only to locations in the US
Components of a schema matcher
Basic Matcher: predicts correspondences based on cues available in schema and data
Combiner: combines the predictions of the basic matchers into a single similarity matrix
Constraint enforcer: applies domain knowledge and constraints to prune the possible matches
Match selector: chooses the best match or matches from the similarity matrix
Creating mappings from correspondences
Create the actual mappings from the matches (correspondences)
  Find how tuples from one source can be translated into tuples in the other
Challenge: there may be more than one possible way of joining the data
  Ex: To compute the value of price in the aggregator schema, we may join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate; or we may join Product with Movie to obtain the origin of the director, and compute the price based on the taxes in the director’s country of birth
Outline
Base matchers
Combining match predictions
Applying constraints and domain knowledge to candidate schema matches
Match selector
Applying machine learning techniques to enable the schema matcher to learn
Discovering m-m matches
From matches to mappings
Base matchers
Input:
  A pair of schemas S1 and S2, with elements A and B, respectively
  Additional available information, such as data instances or text descriptions
Output:
  Correspondence matrix that assigns to every pair of elements (Ai, Bj) a number between 0 and 1 predicting whether Ai corresponds to Bj
Classes of base matchers
1) Name-based matchers: based on comparing names of schema elements
2) Instance-based matchers: based on inspecting data instances
  Must look at large amounts of data
  Slower, but efficiency can be improved
  More precise
For specific domains, it is possible to develop more specialized and effective matchers
Name-based matchers
Compare the names of the elements, hoping that the names convey the true semantics of elements
Challenge: to find effective distance measures reflecting the distance between element names
  Names are never written in exactly the same way
Edit distance
Edit distance between the strings that represent the names of the elements
Levenshtein distance:
  Minimum number of operations (insertions, deletions or replacements) needed to transform one string into another
Given two schema elements represented by the strings s1 and s2 and their edit distance, denoted by editDistance(s1, s2):
  editSimilarity(s1, s2) = 1 – editDistance(s1, s2) / max(length(s1), length(s2))
Ex: The Levenshtein distance between the strings FullName and FName is 3; between TotalPrice and PriceSum it is 8
The editSimilarity between FullName and FName is 0.625; between TotalPrice and PriceSum it is 0.2
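The definitions above can be checked with a minimal Python sketch. The function names edit_distance and edit_similarity are illustrative choices, not part of the slides, and the comparison is case-sensitive, which is an assumption.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # replace c1 by c2 (free if equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(s1: str, s2: str) -> float:
    """1 - editDistance / max(length); two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(s1, s2) / longest


print(edit_distance("FullName", "FName"), edit_similarity("FullName", "FName"))
print(edit_distance("TotalPrice", "PriceSum"), edit_similarity("TotalPrice", "PriceSum"))
```

Running it reproduces the distances 3 and 8 and the similarities 0.625 and 0.2 given in the example.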
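Qgram distance
Qgrams: substrings within the names
Given a positive integer q, the set of q-grams of s, qgrams(s), consists of all the substrings of s of size q
  Ex: the 3-grams of price are: {pri, ric, ice}
Given a number q,
  $\mathrm{qgramSimilarity}(s_1, s_2) = \frac{\|\mathrm{qgrams}(s_1) \cap \mathrm{qgrams}(s_2)\|}{\|\mathrm{qgrams}(s_1) \cup \mathrm{qgrams}(s_2)\|}$
  Ex: pricemarked and markedprice each have 9 3-grams and share 7 of them, so the union has 11 elements and the 3-gram similarity is 7/11 ≈ 0.64 (the value 7/18 ≈ 0.39 is obtained only if the denominator is taken as the sum of the two set sizes rather than the size of the union)
Advantages over edit distance:
  Faster
  More resilient to word-order interchanging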
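A short Python sketch of the q-gram similarity, read as intersection over union of the two q-gram sets. The function names are illustrative. Under this set-based reading, pricemarked and markedprice share 7 of 11 distinct 3-grams, so the sketch yields 7/11 rather than 7/18.

```python
def qgrams(s: str, q: int) -> set:
    """All substrings of s of length q, as a set."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int) -> float:
    """Size of the q-gram intersection divided by the size of the union."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    union = g1 | g2
    if not union:  # both strings shorter than q
        return 0.0
    return len(g1 & g2) / len(union)


print(sorted(qgrams("price", 3)))
print(qgram_similarity("pricemarked", "markedprice", 3))
```

Because only the set of substrings matters, swapping the two words of a name changes few q-grams, which is why this measure tolerates word-order changes better than edit distance.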
Sound distance
Based on the way strings sound
Soundex algorithm: encodes names by their sound; can detect when two names sound the same even if the spelling varies
  Ex: ship2 is similar to shipto
Normalization
Element names can be composed of acronyms or short phrases to express their meanings
Normalization: replaces a single token by several tokens that can be compared
  Element names should be normalized before applying distance measures
Some normalization techniques:
  Expand known abbreviations
  Expand a string with its synonyms
  Remove articles, prepositions and conjunctions
Instance-based matchers
Data instances, if available, convey the meaning of a schema element more than its name does
  Use them for predicting correspondences between schema elements
Techniques:
  Develop a set of rules for inferring common types from the format of the data values
    Ex: phone numbers, prices, zip codes, etc.
  Value overlap
  Text-field analysis
Value overlap
Measuring the overlap of values in the two elements
Applies to categorical elements: those whose values range over some finite domain (e.g., movie ratings, country names)
Jaccard coefficient: fraction of the values of the two elements that are instances of both of them
  Also defined as: the conditional probability of a value being an instance of both elements given that it is an instance of at least one of them
  $\mathrm{JaccardSim}(e_1, e_2) = \Pr(e_1 \cap e_2 \mid e_1 \cup e_2) = \frac{\|D(e_1) \cap D(e_2)\|}{\|D(e_1) \cup D(e_2)\|}$
  where D(e) is the set of values for element e
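As a small illustration of the Jaccard coefficient, the Python sketch below compares two categorical columns. The rating values are hypothetical sample data, not taken from the slides.

```python
def jaccard_sim(values1, values2) -> float:
    """Shared distinct values divided by all distinct values of both elements."""
    d1, d2 = set(values1), set(values2)
    union = d1 | d2
    if not union:  # no data instances at all
        return 0.0
    return len(d1 & d2) / len(union)


# Hypothetical instances of the attributes rating and classification.
rating = ["G", "PG", "PG-13", "R", "PG"]
classification = ["PG", "PG-13", "R", "NC-17"]
print(jaccard_sim(rating, classification))  # 3 shared values out of 5 distinct ones
```

Duplicates in a column do not matter, because D(e) is a set of values.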
Text-field analysis
Applies to elements whose values are longer texts (e.g., house descriptions)
  Their values can vary drastically
  The probability of finding the exact same string in both elements is very low
Idea: compare the general topics these text fields are about
  Use text classifiers
Text classifiers
Classifier for a concept C: an algorithm that distinguishes instances of C from non-instances
  Creates an internal model based on training examples: positive examples that are known to be instances of C and negative examples that are known not to be instances of C
Given an example c, the classifier applies its model to decide whether c is an instance of C
Combining match predictions (1)
The results of the base matchers are summarized in a similarity cube:
  Suppose the schema matcher used l base matchers to predict correspondences between the elements A1, ..., An of S1 and the elements B1, ..., Bm of S2
  The similarity cube assigns each triple (b, i, j) a number Base(b, i, j) between 0 and 1 describing the prediction of base matcher b about the correspondence between Ai and Bj
Combining match predictions (2)
Output: a similarity matrix that combines the predictions of the base matchers
  For every pair (i, j), we want a value between 0 and 1, Combined(i, j), that gives a single prediction about the correspondence between Ai and Bj
Two possible combinations:
  $\mathrm{Combined}(i, j) = \max_{b=1}^{l} \mathrm{Base}(b, i, j)$
  $\mathrm{Combined}(i, j) = \frac{1}{l} \sum_{b=1}^{l} \mathrm{Base}(b, i, j)$
  Max is used when we trust a matcher that outputs a high value
  Avg is used otherwise
Also possible: multi-step combination functions, and giving weights to matchers
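The max and average combiners can be sketched in a few lines of Python. The cube is represented here as a nested list cube[b][i][j], and the numbers are made up for illustration.

```python
def combine(cube, how="max"):
    """Collapse a similarity cube cube[b][i][j] into a matrix combined[i][j]."""
    n_matchers = len(cube)
    n_rows, n_cols = len(cube[0]), len(cube[0][0])
    combined = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            scores = [cube[b][i][j] for b in range(n_matchers)]
            if how == "max":
                row.append(max(scores))
            else:  # "avg": plain average over the base matchers
                row.append(sum(scores) / n_matchers)
        combined.append(row)
    return combined


# Two hypothetical base matchers scoring 2 source elements against 2 target elements.
cube = [
    [[0.9, 0.1], [0.2, 0.6]],  # e.g. a name-based matcher
    [[0.5, 0.3], [0.4, 0.8]],  # e.g. an instance-based matcher
]
print(combine(cube, "max"))
print(combine(cube, "avg"))
```

A weighted variant would multiply each score by a per-matcher weight before summing, which corresponds to the "giving weights to matchers" option.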
Applying constraints and domain knowledge to candidate matches
There may exist domain-specific knowledge that is helpful in the process of schema matching
  Expressed as a set of constraints that enable pruning candidate matches
Hard constraints: must be applied; the schema matcher will not output any match that violates them
Soft constraints: of a more heuristic nature; may be violated in some schemas; the number of violations should be minimized
A cost is associated with each constraint: infinite for hard constraints; any positive number for soft constraints
Example
Schema:
Book(ISBN, publisher, pubCountry, title, review)
Item(code, name, brand, origin, desc)
Inventory(ISBN, quantity, location)
InStore(code, availQuant)
Constraints:
T1: if A -> Item.name, then A is a key. Cost = ∞
T2: if sim(A1,B1) ≈ sim(A2,B1), A1 is next to A3, B1 is next to B2 and sim(A3,B2) > 0.8, then match A1 to B1. Cost = 2
T3: Average(length(desc)) ≥ 20. Cost = 1.5
T4: if ||{Ai ∈ attributes(R1) | ∃ Bj ∈ attributes(R2) s.t. sim(Ai,Bj) > 0.8}|| >= ||attributes(R1)||/2, then match table R1 to table R2. Cost = 1
Algorithms for applying constraints to the similarity matrix
Applying constraints with A* search
  Guaranteed to find the optimal solution
  Computationally more expensive
Applying constraints with local propagation
  Faster
  May get stuck in a local minimum
Match selector
Input: similarity matrix
Output: a schema match or the top few matches
If the matching system is interactive, the user can choose among the top k correspondences
  The system computes several possible matches
Ex:
  schema1(shipAddr, shipPhone, billAddr, billPhone)
  schema2(addr, phone)
  The correspondences shipPhone -> phone and billPhone -> phone are both proposed until the user indicates shipAddr -> addr; then shipPhone -> phone becomes more likely than billPhone -> phone
The algorithm behind
The match selection problem can be formulated as an instance of finding a stable marriage
  Elements of S1: men; elements of S2: women
  sim(i,j): the degree to which Ai and Bj desire each other
Goal: find a stable match between men and women
  A match is unstable if it contains Ai -> Bj and Ak -> Bl such that sim(i,l) > sim(i,j) and sim(i,l) > sim(k,l)
  If such couples existed, then Ai and Bl would want to be matched together
To produce a schema match without unhappy couples, do:
  Match = {}
  Repeat:
    Let (i,j) be the pair with the highest value in sim such that neither Ai nor Bj is in Match
    Add Ai -> Bj to Match
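The greedy loop above can be sketched in Python as follows. The sketch assumes 1-1 matches and a similarity matrix given as a nested list; the stopping rule (stop when no unmatched pair remains) and the helper is_stable, which tests the instability condition stated above, are additions for illustration.

```python
def select_matches(sim):
    """Greedily pick the highest-scoring pair whose elements are both unmatched."""
    pairs = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[i]))),
        key=lambda t: -t[0],
    )
    used_rows, used_cols, match = set(), set(), {}
    for _score, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j  # Ai -> Bj
            used_rows.add(i)
            used_cols.add(j)
    return match


def is_stable(sim, match):
    """False if some Ai -> Bj and Ak -> Bl exist where Ai and Bl prefer each other."""
    for i, j in match.items():
        for k, l in match.items():
            if sim[i][l] > sim[i][j] and sim[i][l] > sim[k][l]:
                return False
    return True


sim = [[0.9, 0.8],
       [0.85, 0.1]]
match = select_matches(sim)
print(match, is_stable(sim, match))
print(is_stable(sim, {0: 1, 1: 0}))  # the alternative assignment is unstable
```

Sorting all pairs once and scanning them in decreasing order is equivalent to repeatedly searching for the current maximum, and keeps the sketch short.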
Applying machine learning techniques to enable the schema matcher to learn
Schema matching tasks are often repetitive
  When working in the same domain, one starts to identify how common domain concepts get expressed in schemas
  So the designer can create schema matches more quickly over time
So:
  Can schema matching also improve over time?
  Or: can a schema matcher learn from previous experience?
Machine learning techniques can be applied to schema matching, thus enabling the matcher to improve over time
Learning to match
Suppose there are n data sources s1, ..., sn whose schemas must be mapped into the mediated schema G
Goal:
  To train the system by manually providing it with schema matches on a small number of data sources (e.g., s1, ..., sm, where m is much smaller than n)
  The system generalizes from the training examples so that it is able to predict matches for the sources sm+1, ..., sn
Components of the system
Training phase (1)
1. Manually specify mappings for several sources
2. Extract source data
3. Create training data for each base learner
4. Train the base learners
5. Train the meta-learner
Training phase (2)
Learning classifiers for elements in the mediated schema
  The classifier for an element e in the mediated schema examines an element in a source schema and predicts whether it matches e or not
To create classifiers, the system employs a machine learning algorithm
  Each machine learning algorithm typically considers only one aspect of the schema and has its own advantages and drawbacks
  So, use a multi-strategy learning technique
Multi-strategy learning
Training phase:
  Employ a set of learners l1, ..., lk
    Each base learner creates a classifier for each element e of the mediated schema from its training examples
  Use a meta-learner to learn weights for the different base learners
    For each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l}
    It can do so because we are working with training examples
Matching phase:
  When presented with a schema S whose elements are e1’, ..., et’:
    Apply the base learners to e1’, ..., et’. Let p_{e,l}(e’) be the prediction of learner l on whether e’ matches e
    Combine the learners:
      $p_e(e') = \sum_{j=1}^{k} w_{e,l_j} \cdot p_{e,l_j}(e')$
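The combination step of the matching phase is a weighted sum, sketched below in Python. The learner names, the weights and the predictions are hypothetical values chosen for illustration; in the real system the weights come from the meta-learner.

```python
def combine_learners(weights, predictions):
    """Weighted sum over base learners of their predictions that e' matches e."""
    return sum(weights[learner] * predictions[learner] for learner in weights)


# Hypothetical weights for the mediated-schema element "address".
weights_for_address = {"name_learner": 0.5, "naive_bayes": 0.5}
# Hypothetical predictions of each learner that the source element "area" matches "address".
predictions_for_area = {"name_learner": 0.8, "naive_bayes": 0.6}

print(combine_learners(weights_for_address, predictions_for_area))
```

Keeping one weight per (mediated-schema element, learner) pair lets a learner be trusted for some elements, such as free-text descriptions, and discounted for others.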
Example
Charlie comes to town
  Find houses with 2 bedrooms priced under 300K
  Sources: homes.com, realestate.com, homeseekers.com
Example: data integration
[Figure: the query "Find houses with 2 bedrooms priced under 300K" is posed on the mediated schema; homes.com, realestate.com and homeseekers.com each expose a source schema (source schemas 1, 2 and 3) through a wrapper]
Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Data of realestate.com:
  location      listed-price   phone            comments
  Miami, FL     $250,000       (305) 729 0831   Fantastic house
  Boston, MA    $110,000       (617) 253 1429   Great location
  ...
Data of homes.com:
  price      contact-phone    extra-info
  $550,000   (278) 345 7215   Beautiful yard
  $320,000   (617) 335 2315   Great beach
  ...
Learned hypotheses (rule-based learner, Naive Bayes learner):
  If “phone” occurs in the name => agent-phone
  If “fantastic” & “great” occur frequently in data values => description
Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Training examples for the Name Learner:
  (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
Training examples for the Naive Bayes Learner:
  (“Miami, FL”, address), (“$ 250,000”, price), (“(305) 729 0831”, agent-phone), (“Fantastic house”, description), ...
Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
Data of homes.com:
  <area>Seattle, WA</><area>Kent, WA</><area>Austin, TX</>
  <day-phone>(278) 345 7215</><day-phone>(617) 335 2315</><day-phone>(512) 427 1115</>
  <extra-info>Beautiful yard</><extra-info>Great beach</><extra-info>Close to Seattle</>
[Figure: each data instance is passed to the Name Learner and the Naive Bayes learner, and the Meta-Learner combines their predictions. Predictions shown:]
  (address,0.8), (description,0.2)
  (address,0.6), (description,0.4)
  (address,0.7), (description,0.3)
  (address,0.6), (description,0.4)
  (address,0.7), (description,0.3)
  (agent-phone,0.9), (description,0.1)
Base Learners
Input:
  Schema information: name, proximity, structure, ...
  Data information: value, format, ...
Output: prediction weighted by confidence score
Examples:
  Name learner
    agent-name => (name,0.7), (phone,0.3)
  Naive Bayes learner
    “Kent, WA” => (address,0.8), (name,0.2)
    “Great location” => (description,0.9), (address,0.1)
Rule-based learner
Examines a set of training examples and computes a set of rules that can be applied to test instances
  Rules can be represented as logical formulae or as decision trees
Works well in domains where the set of rules can accurately characterize instances of the class (e.g., identifying elements that adhere to certain formats)
Example: rule-based learner for identifying phone numbers (1)
Positive and negative examples of phone numbers:
Example           instance?  # of digits  position of (  position of )  position of –
(608)435-2322     yes        10           1              5              9
(60)445-284       no         9            1              4              8
849-7394          yes        7            -              -              4
(1343) 429-441    no         10           1              6              10
43 43 (12 1285)   no         10           5              12             -
5549902           no         7            -              -              -
(212) 433 8842    yes        10           1              5              -
Example: rule-based learner for identifying phone numbers (2)
A common method to learn rules is to create a decision tree
Encodes rules such as:
  If i has 10 digits, a ‘(’ in position 1 and a ‘)’ in position 5, then yes
  If i has 7 digits, but no ‘-’ in position 4, then no
  ...
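The features and rules of this example can be sketched in Python. The rules below are written by hand from the two rules quoted above rather than learned from data, and positions are counted after removing whitespace, which is an assumption that matches most rows of the example table.

```python
def features(s: str) -> dict:
    """Number of digits and 1-based positions of '(', ')' and '-' (None if absent)."""
    compact = "".join(s.split())  # assumption: positions ignore whitespace

    def position(ch):
        index = compact.find(ch)
        return index + 1 if index >= 0 else None

    return {
        "digits": sum(c.isdigit() for c in compact),
        "open": position("("),
        "close": position(")"),
        "dash": position("-"),
    }


def is_phone(s: str) -> bool:
    """Hand-written decision rules mirroring the two rules quoted above."""
    f = features(s)
    if f["digits"] == 10 and f["open"] == 1 and f["close"] == 5:
        return True
    if f["digits"] == 7 and f["dash"] == 4:
        return True
    return False


for example in ["(608)435-2322", "849-7394", "(1343) 429-441", "5549902", "(212) 433 8842"]:
    print(example, is_phone(example))
```

A decision-tree learner would derive tests of this kind automatically from the labelled examples instead of having them written by hand.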
Naive Bayes Learner
Examines the tokens of a test instance and assigns to the instance the most likely class given the occurrences of tokens in the training set
  Effective for recognizing text fields
Given a test instance, the learner converts it into a bag of tokens
Naive Bayes learner at work
Given that c1, ..., cn are elements of the mediated schema, the learner is given a test instance d = {w1, ..., wk} to classify
Goal: assign d to the element c_d with the highest posterior probability given d:
  $c_d = \arg\max_{c_i} P(c_i \mid d)$
  $P(c_i \mid d) = P(d \mid c_i)\, P(c_i) / P(d)$
  $\Rightarrow c_d = \arg\max_{c_i} \left[ P(d \mid c_i)\, P(c_i) / P(d) \right] = \arg\max_{c_i} \left[ P(d \mid c_i)\, P(c_i) \right]$
P(d|ci) and P(ci) must be estimated from the training data
Estimation of P(d|ci) and P(ci)
P(ci) is approximated by the proportion of the training instances with label ci
To compute P(d|ci) assume that the tokens wj appear in d independently of each other given ci
P(d|ci) = P(w1|ci) P(w2|ci)... P(wk|ci)
P(wj|ci) = n(wj,ci)/n(ci), where
n(ci): total number of tokens in the training instances with label ci
n(wj,ci): number of times token wj appears in all training instances with label ci
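The estimates above translate directly into a small Python sketch. It uses the unsmoothed ratio n(wj,ci)/n(ci) exactly as stated, so a token never seen with a label drives that label's score to zero; practical implementations usually add smoothing, which is not shown here. The tokenizer and the tiny training set (built from the values of the running example) are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    def __init__(self, examples):
        self.label_counts = Counter()             # training instances per label
        self.token_counts = defaultdict(Counter)  # n(w, c)
        for text, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(text))
        self.total = sum(self.label_counts.values())

    def score(self, text: str, label: str) -> float:
        """P(d|c) * P(c) with P(w|c) = n(w,c) / n(c), no smoothing."""
        n_label = sum(self.token_counts[label].values())  # n(c)
        if n_label == 0:
            return 0.0
        p = self.label_counts[label] / self.total  # P(c)
        for w in tokenize(text):
            p *= self.token_counts[label][w] / n_label
        return p

    def classify(self, text: str) -> str:
        return max(self.label_counts, key=lambda c: self.score(text, c))


training = [
    ("Fantastic house", "description"),
    ("Great location", "description"),
    ("Miami, FL", "address"),
    ("Boston, MA", "address"),
]
nb = NaiveBayes(training)
print(nb.score("great house", "description"), nb.classify("great house"))
```

Dropping P(d) is safe because it is the same for every label and does not change which label has the highest score.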
Conclusion of Naive Bayes learner
Naive Bayes performs well in many domains in spite of the fact that the independence assumption is not always valid
Works best when:
  There are tokens strongly indicative of the correct label, because they appear in one element and not in the others
    Ex: “beautiful”, “fantastic” to describe houses
  There are only weakly suggestive tokens, but many of them
Doesn’t work well for:
  Short or numeric fields
Training the meta-learner
Learns the weights to attach to each of the base learners from the training examples
  The weights can be different for every mediated-schema element
How does it work?
  Asks the base learners for predictions on the training examples
  Judges how well each learner performed in providing the prediction for each mediated-schema element
  Assigns to each combination (mediated-schema element ci, base learner lj) a weight indicating how much it trusts that learner's predictions regarding ci
Can use any classification algorithm to compute the weights
From matches to mappings
Schema matches: correspondences between the source and the target schemas
Now: specifying the operations to be performed on the source data so that they can be transformed into the target data
  Use DBMSs as transformation engines
  Creating mappings becomes a process of query discovery
    Find the queries, using joins, unions, filtering and aggregates, that correctly transform the data into the desired schema
Algorithm that explores the space of possible schema mappings
  Used in the CLIO system
User interaction
Creating mappings is a complex process
The system generates the mapping expressions automatically
  The possible mappings are produced automatically using the semantics conveyed by constraints such as foreign keys
The system shows the designer example data instances so that she can verify which are the right mappings
Motivating example
Question:
  Union professor salaries with employee salaries, or
  Join salaries computed from the two correspondences?
Possible mappings
If attribute ProjRank is a foreign key of the relation PayRate, then the mapping would be:
SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W
WHERE P.Rank = W.ProjRank
Suppose instead that ProjRank is not a foreign key of the relation PayRate: the Name attribute of WorksOn is a foreign key of Student and the Yr attribute of Student is a foreign key of PayRate (the salary depends on the year of the student). Then the following query should be chosen:
SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W, Student S
WHERE W.Name = S.Name AND S.Yr = P.Rank
Not clear which join path to choose for mapping f1!
Possible mappings (2)
One interpretation of f2 is that values produced from f1 should be joined with those produced by f2
Then, most of the values in the source DB would not be mapped to the target
Another interpretation: there are two ways of computing the salary of employees, one applying to professors and another to other employees. The corresponding mapping is:
SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W, Student S
WHERE W.Name=S.Name AND S.Yr = P.Rank
UNION ALL
SELECT Salary
FROM Professor
Principles to guide the mapping construction
If possible, all values in the source appear in the target
  Choose a union rather than a join
If possible, a value from the source should only contribute once to the target
Associations between values that exist in the source should not be lost
  Use a join rather than a Cartesian product to compute the salary value using f1
Possible mappings (3)
Consider a few more correspondences:
f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)
They fall into two candidate sets of correspondences:
  f2, f3, f4 and f5: map from Professor to Personnel
  f1, f6: map from other employees to Personnel
The algorithm explores the possible joins within every candidate set and considers how to union the transformations corresponding to each candidate set.
Possible mappings (4)
f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)
Most reasonable mapping is:
SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL as ID, S.Name, P.HrRate*W.Hrs, NULL as Addr
FROM Student S, PayRate P, WorksOn W
WHERE S.name=W.name AND S.Yr = P.Rank
Possible mappings (5)
f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)
But this one is also possible:
SELECT NULL as ID, NULL as Name, NULL as Sal, Addr
FROM Address A
UNION ALL
SELECT P.Id, P.Name, P.Sal, NULL as Addr
FROM Professor P
UNION ALL
SELECT NULL as ID, NULL as Name, NULL as Sal, NULL as Addr
FROM Student S
...
Query discovery algorithm - Goal
Eliminates unlikely mappings from the large search space of candidate mappings and
identifies correct mappings a user might not otherwise have considered
Query discovery algorithm - Characteristics
Is interactive:
  Explores the space of possible mappings and proposes the most likely ones to the user
  Accepts user feedback
    To guide it in the right direction
Uses heuristics
  Can be replaced by better ones if available
Query discovery algorithm – Input
Set of correspondences M = {fi: (Ai -> Bi)}, where
  Ai: set of attributes in the source S1
  Bi: one attribute of the target S2
Possible filters on source attributes
  Range restriction on an attribute, aggregate of an attribute, etc.
Query discovery algorithm – 1st phase
Create the set V of all possible candidate sets (subsets of M) that contain at most one correspondence per attribute of S2
  Each candidate set represents one way of computing the attributes of S2
  If a set covers all attributes of S2, it is called a complete set
  Elements of V do not need to be disjoint
Example
Given the correspondences:
  f1: S1.A -> T.C
  f2: S2.A -> T.D
  f3: S2.B -> T.C
Then the complete candidate sets are: {{f1, f2}, {f2, f3}}
The singleton sets {f1}, {f2} and {f3} are also candidate sets.
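The first phase can be sketched in Python by enumerating the subsets of M and keeping those with at most one correspondence per target attribute. Representing M as a dictionary from correspondence name to target attribute, and treating the target attributes mentioned in M as the attributes of the target schema, are simplifying assumptions of this sketch.

```python
from itertools import combinations


def candidate_sets(correspondences: dict) -> list:
    """Non-empty subsets of M with at most one correspondence per target attribute."""
    names = sorted(correspondences)
    result = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            targets = [correspondences[f] for f in subset]
            if len(targets) == len(set(targets)):  # no target attribute used twice
                result.append(frozenset(subset))
    return result


def complete_sets(correspondences: dict) -> list:
    """Candidate sets that compute every target attribute mentioned in M."""
    all_targets = set(correspondences.values())
    return [v for v in candidate_sets(correspondences)
            if {correspondences[f] for f in v} == all_targets]


M = {"f1": "T.C", "f2": "T.D", "f3": "T.C"}
print([sorted(v) for v in candidate_sets(M)])
print([sorted(v) for v in complete_sets(M)])
```

For the three correspondences of the example this yields the three singletons plus {f1, f2} and {f2, f3}, of which only the last two are complete.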
Query discovery algorithm – 2nd phase
Consider the candidate sets in V (the set of all candidate sets) and search for the best set of joins within each candidate set
  Consider a candidate set v in V and suppose (Ai -> Bi) ∈ v, where Ai includes attributes from multiple relations in S1
  Then, search for a join path connecting the relations mentioned in Ai, using the following:
Heuristic: A join path can be either:
  A path through foreign keys,
  A path proposed by inspecting previous queries on S, or
  A path discovered by mining the data for joinable columns in S
Query discovery algorithm – 2nd phase
Only the candidate sets for which we find join paths are retained for the next phase
When there are multiple join paths, use the following for selecting join paths:
Heuristic: Prefer paths through foreign keys.
  If there are multiple such paths, choose one that involves an attribute on which there is a filter in a correspondence, if it exists.
  To further rank paths, favor the join path where the estimated difference between the outer join and the inner join is the smallest
    Favors joins with the least number of dangling tuples
Query discovery algorithm – 3rd phase
Examine the candidate sets retained in the 2nd phase (those for which join paths were found), and try to combine them by union so that they cover all the correspondences in M.
  Search for covers of the correspondences
  A subset T of the retained candidate sets is a cover if it includes all the correspondences in M and it is minimal (we cannot remove a candidate set from T and still obtain a cover)
Example: retained candidate sets = {{f1, f2}, {f2, f3}, {f1}, {f2}, {f3}}
  Possible covers include
    T1 = {{f1}, {f2, f3}}
    T2 = {{f1, f2}, {f2, f3}}
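The search for covers can be sketched in Python by brute-force enumeration, which is adequate for a handful of candidate sets. The function name minimal_covers and the representation of candidate sets as frozensets of correspondence names are illustrative assumptions.

```python
from itertools import combinations


def minimal_covers(candidates, correspondences):
    """Sub-collections of candidate sets that include every correspondence and
    stop doing so as soon as any single candidate set is removed."""
    needed = set(correspondences)

    def is_cover(sets):
        return set().union(*sets) >= needed

    result = []
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if not is_cover(subset):
                continue
            # Minimal: dropping any one candidate set must break the cover.
            if any(is_cover(subset[:k] + subset[k + 1:]) for k in range(len(subset))):
                continue
            result.append(set(subset))
    return result


candidates = [frozenset(s) for s in (
    {"f1", "f2"}, {"f2", "f3"}, {"f1"}, {"f2"}, {"f3"},
)]
for cover in minimal_covers(candidates, {"f1", "f2", "f3"}):
    print(sorted(sorted(v) for v in cover))
```

For this input the enumeration finds four minimal covers, including T1 and T2 from the example; checking single removals is enough because any sub-collection of a non-cover is also a non-cover.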
Query discovery algorithm – 3rd phase
If there are multiple possible covers, use the following:
Heuristic:
  Choose the cover with the smallest number of candidate sets (a simpler mapping should be more appropriate)
  If there is more than one with the same number of candidate sets, choose the one that includes more attributes of S2 (to cover more of that schema)
Query discovery algorithm – 4th phase
Creates a schema mapping expression as an SQL query
  First creates an SQL query for each candidate set in the selected cover and then unions them
Example: Suppose v is a candidate set:
  Attributes of S2 in v are put in the SELECT clause
  Each of the relations in the join paths found for v is put in the FROM clause
  The corresponding join predicates are put in the WHERE clause
  Any filters associated with the correspondences in v are also added to the WHERE clause
  Finally, takes the union of the queries for each candidate set in the cover
Computing the SQL mapping expressions for T1 and T2 is left as an exercise
The CLIO tool
http://www.almaden.ibm.com/cs/projects/criollo/
http://birte08.stanford.edu/ppts/11-ho.pd
References
Chapter 4, draft of the book “Principles of Data Integration” by AnHai Doan, Alon Halevy and Zachary Ives (in preparation).
Erhard Rahm and Philip A. Bernstein. “A survey of approaches to automatic schema matching”. VLDB Journal, 10(4):334–350, 2001.
AnHai Doan, Pedro Domingos and Alon Halevy. “Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach”. SIGMOD, 2001.
R.J. Miller, L.M. Haas and M. Hernandez. “Schema Matching as Query Discovery”. VLDB, 2000.
Renée J. Miller, Mauricio A. Hernandez, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin and Lucian Popa. “The Clio Project: Managing Heterogeneity”. SIGMOD Record 30(1), March 2001, pp. 78-83.
Virtual Data Integration in industry
Known as EII: Enterprise Information Integration
  Different from EAI: Enterprise Application Integration
  Different from ETL in DW: Extraction, Transformation and Loading in Data Warehousing
Good reference to take a look at:
  A. Halevy et al., “Enterprise Information Integration: Successes, Challenges and Controversies”, SIGMOD’05.