1
Building Data Integration Queries by Demonstration
Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock
IUI’07
Reporter: Chao-Ting Ting
2
Outline
Introduction
Approach
  Building a single column table
    Set intersection constraint
  Building a multiple column table
    Reachable attributes
    Partial plans
Experiment
Conclusion
3
Introduction
With the proliferation of the Internet, most information can now be found online
The information we need is usually scattered among multiple websites
It is very time-consuming to access, combine, filter, and make sense of that data manually
4
Example
A particular restaurant may have
  rave reviews on a restaurant review website
  a ‘C’ rating on a government health inspection website
A health-conscious person would need information from both websites
5
Limitation
For computer-literate users, the choices are limited to
  finding the information on their own by browsing websites
  relying on data integration providers to supply web interfaces for accessing the integrated information
6
Goals
Karma
  A user interface where any computer-literate user could easily build his/her own mashups
  A service that integrates information from multiple data sources
7
Problems to solve
Data retrieval
Data cleaning and schema matching
Data integration
Filtering and visualization
8
In this paper
Focus on the data integration aspect
  Translate partially-filled rows into queries & retrieve data from multiple sources
  Use constraints and partial plans to
    decrease the number of possible attributes/values
    generate consistent queries that always return data
9
Hypothesis
The data is assumed to be clean and aligned
  Misspellings are fixed and format inconsistencies are resolved
  Attribute names are unified
10
A snapshot of Karma
11
Approach - Index tables and data source definition
S : the set of all available web sources
A : a lookup hashtable that maps each attribute to its sources, a -> {s}, where 1) {s} ⊆ S and 2) ∀ s ∈ {s} : a ∈ att(s)
V : a lookup hashtable that maps each value to its (attribute, source) pairs, v -> {(a,s)}, where ∀ (a,s) : v ∈ val(a,s) ∧ a ∈ att(s)
att(s) : a procedure that returns the set of attributes of the source s
val(a,s) : a procedure that returns the set of values associated with the attribute a in the source s
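The two index tables can be sketched in a few lines of Python. This is only an illustration under assumptions: the `SOURCES` dictionary, its sample rows, and the helper name `build_indexes` are made up here and are not taken from Karma itself.

```python
from collections import defaultdict

# Toy stand-in for S: source -> attribute -> list of values.
# The restaurant and country rows are invented for illustration.
SOURCES = {
    "Zagat": {
        "restaurant name": ["Chez Louis", "Pho Saigon"],
        "cuisine": ["French", "Vietnamese"],
    },
    "Asian_food_review": {
        "restaurant name": ["Pho Saigon"],
        "cuisine": ["Vietnamese"],
    },
    "EU_country_info": {
        "country name": ["France"],
        "language": ["French"],
    },
}


def att(s):
    """Return the set of attributes of source s."""
    return set(SOURCES[s])


def val(a, s):
    """Return the set of values of attribute a in source s."""
    return set(SOURCES[s][a])


def build_indexes():
    """Build A (attribute -> sources) and V (value -> (attribute, source) pairs)."""
    A = defaultdict(set)
    V = defaultdict(set)
    for s in SOURCES:
        for a in att(s):
            A[a].add(s)
            for v in val(a, s):
                V[v].add((a, s))
    return A, V


A, V = build_indexes()
print(sorted(A["cuisine"]))
print(sorted(V["French"]))
```

With these toy rows, looking up "French" in V yields both a cuisine entry and a language entry, which is the ambiguity the later slides resolve.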
12
The data sources in the scenario
Zagat($restaurant name, $cuisine, $address, $city, $state, $zipcode, review rating)
Asian_food_review($restaurant name, $cuisine, $price, $address, $city, $state, $zipcode, review rating)
LA_health_rating($restaurant name, $address, $city, $state, $zipcode, inspection date, health rating)
EU_country_info($country name, language, population, gdp, date, location)
(The attributes marked with $ in the data source model act as the primary key)
13
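Building a single column table - start by entering an attribute
The suggested value set is {v} = val(a,s) for every s ∈ {s}, where {s} is the set of sources that A lists for the attribute a
Example for a = Cuisine:
(SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review)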
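The query above could be generated from the attribute index roughly as follows. This is a minimal sketch: the dictionary `A` is a hand-written stand-in for the attribute index, and `suggestion_query` is a hypothetical helper name.

```python
# Hand-written stand-in for the attribute index A: attribute -> sources.
A = {
    "Cuisine": ["Zagat", "Asian_food_review"],
    "Language": ["EU_country_info"],
}


def suggestion_query(attribute, index):
    """Build one SELECT per source listed for the attribute and UNION them."""
    selects = [f"(SELECT {attribute} FROM {s})" for s in index[attribute]]
    return " UNION ".join(selects)


query = suggestion_query("Cuisine", A)
print(query)
```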
14
The set intersection constraint
Cuisine
15
The set intersection constraint
{x} = the set intersection of the candidate attribute sets {a} over all the rows that contain a value
To enter a value, the possible value set is:
{v} = val(a,s) where a ∈ {x} ∧ s is any source where att(s) ∩ {x} ≠ {}
16
Example: start by entering a value & use the set intersection constraint
Row 1: French
  {(a,s)} = {(Cuisine, Zagat), (Language, EU_country_info)}
  Suggested values for the next row:
  (SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review) UNION (SELECT Language FROM EU_country_info)
Row 2: Vietnamese
  {(a,s)} = {(Cuisine, Zagat), (Cuisine, Asian_food_review)}
  Set intersection over both rows: Cuisine
  Suggested values for the next row:
  (SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review)
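The narrowing in this example can be reproduced with a short Python sketch. The index contents are copied from the slide; the function names `candidate_attributes` and `suggestion_query` are hypothetical, and attributes are sorted only to make the generated query deterministic.

```python
# Value index V for the two entered values, as listed on the slide.
V = {
    "French": {("Cuisine", "Zagat"), ("Language", "EU_country_info")},
    "Vietnamese": {("Cuisine", "Zagat"), ("Cuisine", "Asian_food_review")},
}
# Attribute index A restricted to the attributes involved.
A = {
    "Cuisine": ["Zagat", "Asian_food_review"],
    "Language": ["EU_country_info"],
}


def candidate_attributes(values, value_index):
    """Intersect, over all entered values, the attributes each value may belong to."""
    per_row = [{a for a, _ in value_index[v]} for v in values]
    return set.intersection(*per_row) if per_row else set()


def suggestion_query(attributes, attribute_index):
    """UNION one SELECT per (attribute, source) pair that survives the constraint."""
    selects = [
        f"(SELECT {a} FROM {s})"
        for a in sorted(attributes)
        for s in attribute_index[a]
    ]
    return " UNION ".join(selects)


after_row_1 = candidate_attributes(["French"], V)
after_row_2 = candidate_attributes(["French", "Vietnamese"], V)
print(suggestion_query(after_row_1, A))
print(suggestion_query(after_row_2, A))
```

After the second row only Cuisine remains, so the Language branch drops out of the generated query.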
17
Building a multiple column table
Enter an attribute
Enter a value when the user has selected the attribute
Enter a value when the user hasn’t selected the attribute
18
Reachable attributes
Column: Cuisine
Row 1: French
{(a,s)} = {(Cuisine, Zagat), (Language, EU_country_info)}
Reachable attributes of row 1 : {restaurant name, cuisine, address, city, state, zipcode, review rating, country name, language, population, gdp, date, location, inspection date, health rating}
The attributes inspection date and health rating come from the source “LA_health_rating”
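One way to reproduce this reachable set is sketched below. The rule used is an assumption made for illustration: a further source counts as reachable when all of its primary key attributes appear in a source that has already been reached. The slide does not spell the rule out, and the names `SOURCES`, `attributes`, and `reachable_attributes` are hypothetical.

```python
# Source models from the scenario: source -> (primary key attributes, other attributes).
SOURCES = {
    "Zagat": (
        {"restaurant name", "cuisine", "address", "city", "state", "zipcode"},
        {"review rating"},
    ),
    "Asian_food_review": (
        {"restaurant name", "cuisine", "price", "address", "city", "state", "zipcode"},
        {"review rating"},
    ),
    "LA_health_rating": (
        {"restaurant name", "address", "city", "state", "zipcode"},
        {"inspection date", "health rating"},
    ),
    "EU_country_info": (
        {"country name"},
        {"language", "population", "gdp", "date", "location"},
    ),
}


def attributes(source):
    """All attributes of a source: key attributes plus the rest."""
    keys, others = SOURCES[source]
    return keys | others


def reachable_attributes(start_sources):
    """Collect attributes of the start sources and of every source joinable from them.

    Assumed rule: a source is joinable when its whole primary key is contained
    in the attributes of some already reached source.
    """
    reached = set(start_sources)
    changed = True
    while changed:
        changed = False
        for name, (keys, _) in SOURCES.items():
            if name not in reached and any(keys <= attributes(r) for r in reached):
                reached.add(name)
                changed = True
    return set().union(*(attributes(s) for s in reached))


# Row 1 (French) matches Zagat and EU_country_info.
row_1 = reachable_attributes({"Zagat", "EU_country_info"})
print(sorted(row_1))
```

Under this rule LA_health_rating is pulled in through Zagat, while Asian_food_review is not, because its key includes price, which neither start source provides.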
19
Constraint for entering a new attribute
The candidate attributes are the set intersection of the “reachable” attributes of all partially filled rows
Each attribute in that set intersection must produce a non-empty suggested value set
  Execute multiple queries through partial plans
20
Partial plan tree
Blue : value node
White : place holder node
Black : hidden node
f(a,s,v) = (cuisine, Zagat, French)
a(a,s,v) = ( restaurant_name, Zagat, _PLACE_HOLDER )
21
Tree evaluation
A value node implies an equality (“=”) condition on its value
A place holder node can only be included in the SELECT part of the query
A node with multiple parents implies a join condition over its parents
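The first two rules can be illustrated with a small Python sketch that turns a single-source plan into SQL. It is a simplification under assumptions: the `Node` class and `plan_to_sql` are hypothetical names, the plan is a flat list rather than a tree, and the join rule for nodes with multiple parents is deliberately left out.

```python
from dataclasses import dataclass

PLACE_HOLDER = "_PLACE_HOLDER"


@dataclass
class Node:
    attribute: str
    source: str
    value: str


def plan_to_sql(nodes):
    """Translate a single-source plan: place holders go to SELECT, values to WHERE."""
    sources = {n.source for n in nodes}
    if len(sources) != 1:
        raise ValueError("this sketch handles single-source plans only")
    selected = [n.attribute for n in nodes if n.value == PLACE_HOLDER]
    if not selected:
        raise ValueError("the plan needs at least one place holder node")
    # Values are inlined for readability; real code should use query parameters.
    conditions = [
        f"{n.attribute} = '{n.value}'" for n in nodes if n.value != PLACE_HOLDER
    ]
    sql = f"SELECT {', '.join(selected)} FROM {sources.pop()}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# The two nodes shown on the partial plan tree slide.
plan = [
    Node("cuisine", "Zagat", "French"),
    Node("restaurant_name", "Zagat", PLACE_HOLDER),
]
sql = plan_to_sql(plan)
print(sql)
```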
22
Example (find the possible candidate set for the attribute “restaurant name”)
23
Tree Construction - points for attention
Row 1: French
  root, with 1st-level child f(a,s,v) = (cuisine, Zagat, French)
  root, with 1st-level child f(a,s,v) = (language, EU_country_info, French)
Row 2: Vietnamese
  root, with 1st-level child f(a,s,v) = (cuisine, Zagat, Vietnamese)
  root, with 1st-level child f(a,s,v) = (cuisine, Asian_food_review, Vietnamese)
24
Construction Step 1
Compute the set intersection of the reachable attributes of all the rows & keep only the reachable attributes that return non-empty suggested value sets
Tree so far: root, with 1st-level child f(a,s,v) = (cuisine, Zagat, French)
25
Construction Step 2
User selects an attribute & the attribute is in the same data source
  add the new node as a child of the root
User selects an attribute & the attribute is in a different source
  create the necessary hidden nodes according to the primary key constraints and set the new node as the child of those hidden nodes
User decides to enter a value
  permute the partial plans over the attribute set computed in Step 1
26
Experiment
Retrieving the restaurant health rating
  only retrieves data from one data source (LA_health_rating)
Retrieving the restaurant information with reviews and health ratings
  includes the join between two data sources
  limits the number of sources so that there is no union operation
Retrieving the restaurant information with reviews and health ratings
  includes union and join as discussed in the previous section
27
Experiment
Baseline : Microsoft Access, which integrates the QBE approach into its query design view
Typing in a value or selecting a value = 1t unit
Selecting a data source to use = 1d unit
Selecting an attribute = 1a unit
28
Conclusion
An approach to data integration with the following characteristics:
  does not require the user to know details about data sources or existing values
  suggests valid possible values to the user
  creates consistent queries that always return values