Building Data Integration Queries by Demonstration

Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock

IUI’07

Reporter: Chao-Ting Ting


Post on 11-Jan-2015



Page 2: Building Data Integration Queries By Demonstration

Outline

Introduction
Approach
    Building a single column table
        Set intersection constraint
    Building a multiple column table
        Reachable attributes
        Partial plans
Experiment
Conclusion

Page 3: Building Data Integration Queries By Demonstration

Introduction

With the proliferation of the Internet, most information can now be found online

The information we need is usually scattered across multiple websites

Accessing, combining, filtering, and making sense of that data manually is very time consuming

Page 4: Building Data Integration Queries By Demonstration

Example

A particular restaurant may have
    rave reviews on a restaurant review website
    a ‘C’ rating on a government health inspection website

A health-conscious person would require information from both websites

Page 5: Building Data Integration Queries By Demonstration

Limitation

For computer-literate users, the choices are limited to
    finding the information on their own by browsing web sites
    relying on data integration providers to supply web interfaces to access the integrated information

Page 6: Building Data Integration Queries By Demonstration

Goals

Karma
    A user interface where any computer-literate user could easily build his/her own mashups
    A service that integrates information from multiple data sources

Page 7: Building Data Integration Queries By Demonstration

Problems to solve

Data retrieval
Data cleaning and schema matching
Data integration
Filtering and visualization

Page 8: Building Data Integration Queries By Demonstration

In this paper

Focus on the data integration aspect

Translate partially-filled rows into queries & retrieve data from multiple sources

Use constraints and partial plans to
    decrease possible attributes/values
    generate consistent queries that always return data

Page 9: Building Data Integration Queries By Demonstration

Hypothesis

The data is assumed to be clean and aligned
    misspellings are fixed and format inconsistencies are resolved
    attribute names are unified

Page 10: Building Data Integration Queries By Demonstration


A snapshot of Karma

Page 11: Building Data Integration Queries By Demonstration

Approach - Index tables and data source definition

S : the set of all available web sources

A : a lookup hashtable mapping a -> {s}, where 1) {s} ⊆ S and 2) ∀ s ∈ {s} : a ∈ att(s)

V : a lookup hashtable mapping v -> {(a,s)}, where ∀ (a,s) : v ∈ val(a,s) ∧ a ∈ att(s)

att(s) : a procedure that returns the set of attributes of the source s

val(a,s) : a procedure that returns the set of values associated with the attribute a in the source s
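
The two index tables can be sketched in a few lines of Python. This is only an illustration of the definitions above, not Karma's implementation: the in-memory `SOURCES` dictionary, the restaurant names in it, and the helper `build_indexes` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy data: source name -> list of rows (attribute -> value).
# The restaurant names are invented for illustration.
SOURCES = {
    "Zagat": [
        {"restaurant name": "Chez Nous", "cuisine": "French"},
        {"restaurant name": "Pho 79", "cuisine": "Vietnamese"},
    ],
    "Asian_food_review": [
        {"restaurant name": "Pho 79", "cuisine": "Vietnamese"},
    ],
    "EU_country_info": [
        {"country name": "France", "language": "French"},
    ],
}


def att(s):
    """Return the set of attributes of source s."""
    return {a for row in SOURCES[s] for a in row}


def val(a, s):
    """Return the set of values of attribute a in source s."""
    return {row[a] for row in SOURCES[s] if a in row}


def build_indexes():
    """Build A (attribute -> sources) and V (value -> (attribute, source) pairs)."""
    A = defaultdict(set)
    V = defaultdict(set)
    for s in SOURCES:
        for a in att(s):
            A[a].add(s)
            for v in val(a, s):
                V[v].add((a, s))
    return A, V


A, V = build_indexes()
print(sorted(A["cuisine"]))
print(sorted(V["French"]))
```

With this toy data, looking up "French" in V yields one pair per source in which the value occurs, which is what lets the system propose both Cuisine and Language as candidate attributes for that value.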

Page 12: Building Data Integration Queries By Demonstration

The data sources in the scenario

Zagat($restaurant name, $cuisine, $address, $city, $state, $zipcode, review rating)

Asian_food_review($restaurant name, $cuisine, $price, $address, $city, $state, $zipcode, review rating)

LA_health_rating($restaurant name, $address, $city, $state, $zipcode, inspection date, health rating)

EU_country_info($country name, language, population, gdp, date, location)

(The attributes marked with $ in the data source model act as primary keys)

Page 13: Building Data Integration Queries By Demonstration

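Building a single column table - start by entering an attribute

The suggested value set is {v} = val(a,s) for every s ∈ {s}, where {s} is the set of sources that A lists for a

Example for the attribute Cuisine:

(SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review)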
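
The query above can be generated mechanically from the index table A. The sketch below assumes A is available as a plain dictionary; the function name `union_query` is hypothetical and does not come from the paper.

```python
def union_query(attribute, sources):
    """Build one SELECT per source and combine them with UNION."""
    parts = [f"(SELECT {attribute} FROM {s})" for s in sources]
    return " UNION ".join(parts)


# Hypothetical contents of the index table A for the scenario's sources.
A = {"Cuisine": ["Zagat", "Asian_food_review"]}

query = union_query("Cuisine", A["Cuisine"])
print(query)
```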

Page 14: Building Data Integration Queries By Demonstration


The set intersection constraint

Cuisine

Page 15: Building Data Integration Queries By Demonstration

The set intersection constraint

{x} = Set intersection({a}) over all the value rows

To enter a value, the possible value set is: {v} = val(a,s) where a ∈ {x} ∧ s is any source where att(s) ∩ {x} ≠ {}
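
A minimal Python sketch of the constraint, assuming the index table V and the att(s) sets are held in dictionaries. The attribute lists are abbreviated, and the names `common_attributes` and `candidate_sources` are hypothetical.

```python
# V maps a value to the (attribute, source) pairs in which it occurs.
V = {
    "French": {("Cuisine", "Zagat"), ("Language", "EU_country_info")},
    "Vietnamese": {("Cuisine", "Zagat"), ("Cuisine", "Asian_food_review")},
}

# att(s) for each source, abbreviated to a few attributes for the example.
ATT = {
    "Zagat": {"Restaurant name", "Cuisine", "Review rating"},
    "Asian_food_review": {"Restaurant name", "Cuisine", "Price"},
    "LA_health_rating": {"Restaurant name", "Health rating"},
    "EU_country_info": {"Country name", "Language"},
}


def common_attributes(values):
    """{x}: attributes shared by every value already entered in the column."""
    attr_sets = [{a for a, _ in V[v]} for v in values]
    return set.intersection(*attr_sets) if attr_sets else set()


def candidate_sources(x):
    """Sources s with att(s) ∩ {x} ≠ {}."""
    return sorted(s for s, attrs in ATT.items() if attrs & x)


x = common_attributes(["French", "Vietnamese"])
print(x, candidate_sources(x))
```

With only "French" entered, both Cuisine and Language remain candidates; adding "Vietnamese" narrows {x} to Cuisine, so only sources that have a Cuisine attribute contribute suggested values.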

Page 16: Building Data Integration Queries By Demonstration

Example : start by entering a value & use the set intersection constraint

Row 1: French

{(a,s)} = {(Cuisine, Zagat), (Language, EU_country_info)}

(SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review) UNION (SELECT Language FROM EU_country_info)

Row 2: Vietnamese

{(a,s)} = {(Cuisine, Zagat), (Cuisine, Asian_food_review)}

Set intersection of the attributes: Cuisine

(SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review)

Page 17: Building Data Integration Queries By Demonstration

Building a multiple column table

Enter attribute
Enter value when the user has selected the attribute
Enter value when the user hasn’t selected the attribute

Page 18: Building Data Integration Queries By Demonstration

Reachable attributes

Column: Cuisine, row 1 value: French

{(a,s)} = {(Cuisine, Zagat), (Language, EU_country_info)}

Reachable attributes of row 1 : {restaurant name, cuisine, address, city, state, zipcode, review rating, country name, language, population, gdp, date, location, inspection date, health rating}

(inspection date and health rating are obtained from the source “LA_health_rating”)
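
The slide does not spell out the rule for reachability. The sketch below assumes one plausible reading: a source can be joined in once all of its primary-key attributes (those marked with $ in the scenario) are already reachable. Under that assumption the code reproduces the set listed for row 1, including the absence of price. The names `SCHEMAS`, `reachable_from`, and `reachable_attributes` are hypothetical.

```python
# Source schemas from the scenario: (primary-key attributes, other attributes).
SCHEMAS = {
    "Zagat": (
        {"restaurant name", "cuisine", "address", "city", "state", "zipcode"},
        {"review rating"},
    ),
    "Asian_food_review": (
        {"restaurant name", "cuisine", "price", "address", "city", "state", "zipcode"},
        {"review rating"},
    ),
    "LA_health_rating": (
        {"restaurant name", "address", "city", "state", "zipcode"},
        {"inspection date", "health rating"},
    ),
    "EU_country_info": (
        {"country name"},
        {"language", "population", "gdp", "date", "location"},
    ),
}


def att(s):
    keys, others = SCHEMAS[s]
    return keys | others


def reachable_from(source):
    """Assumed rule: add a source once all of its key attributes are reachable."""
    reached = set(att(source))
    changed = True
    while changed:
        changed = False
        for s, (keys, _) in SCHEMAS.items():
            if keys <= reached and not att(s) <= reached:
                reached |= att(s)
                changed = True
    return reached


def reachable_attributes(pairs):
    """Union of the reachable sets over every (attribute, source) pair of a row."""
    result = set()
    for _, s in pairs:
        result |= reachable_from(s)
    return result


row1 = reachable_attributes({("cuisine", "Zagat"), ("language", "EU_country_info")})
print(sorted(row1))
```

Asian_food_review is not joined in from Zagat here because its key includes price, which Zagat does not provide; that is why price is missing from the result.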

Page 19: Building Data Integration Queries By Demonstration

Constraint for entering a new attribute

The set intersection of “reachable” attributes of all partially filled rows

Each attribute in the set intersection of the “reachable” attributes set must produce a non-empty suggested value set
    Execute multiple queries through partial plans

Page 20: Building Data Integration Queries By Demonstration

Partial plan tree

Blue : value node
White : place holder node
Black : hidden node

f(a,s,v) = (cuisine, Zagat, French)

a(a,s,v) = (restaurant_name, Zagat, _PLACE_HOLDER)

Page 21: Building Data Integration Queries By Demonstration

Tree evaluation

A value node implies an equality (“=”) condition

A place holder node can only be included in the SELECT part of the query

A node with multiple parents implies a join condition over its parents
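
The three rules can be turned into a small query generator. This is a sketch under assumptions, not the paper's algorithm: the `Node` class and `to_sql` are hypothetical, the join rule is read as "each parent attribute is equated across the parent's source and the child's source", and values are quoted naively without escaping.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    attribute: str
    source: str
    kind: str  # "value", "placeholder", or "hidden"
    value: Optional[str] = None
    parents: List["Node"] = field(default_factory=list)


def to_sql(nodes):
    """Translate the nodes of one partial plan into a SQL string."""
    # Rule 2: placeholder nodes go into the SELECT part.
    select = [f"{n.source}.{n.attribute}" for n in nodes if n.kind == "placeholder"]
    # Rule 1: value nodes become equality conditions (no escaping, sketch only).
    conditions = [
        f"{n.source}.{n.attribute} = '{n.value}'" for n in nodes if n.kind == "value"
    ]
    # Rule 3 (assumed reading): a node with several parents joins on them.
    for n in nodes:
        if len(n.parents) > 1:
            conditions += [
                f"{p.source}.{p.attribute} = {n.source}.{p.attribute}"
                for p in n.parents
            ]
    sources = sorted({n.source for n in nodes})
    sql = f"SELECT {', '.join(select)} FROM {', '.join(sources)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


f = Node("cuisine", "Zagat", "value", "French")
a = Node("restaurant_name", "Zagat", "placeholder")
print(to_sql([f, a]))

# Hidden key nodes in Zagat become the parents of a node from another source.
h1 = Node("restaurant_name", "Zagat", "hidden")
h2 = Node("address", "Zagat", "hidden")
r = Node("health_rating", "LA_health_rating", "placeholder", parents=[h1, h2])
print(to_sql([f, h1, h2, r]))
```

The first call corresponds to finding candidate restaurant names for the value French in Zagat; the second shows how hidden key nodes would introduce a join with LA_health_rating.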

Page 22: Building Data Integration Queries By Demonstration

Example (find the possible candidate set for the attribute “restaurant name”)

Page 23: Building Data Integration Queries By Demonstration

Tree Construction - points for attention

Row 1: French

root -> f1st, f(a,s,v) = (cuisine, Zagat, French)
root -> f1st, f(a,s,v) = (language, EU_country_info, French)

Row 2: Vietnamese

root -> f1st, f(a,s,v) = (cuisine, Zagat, Vietnamese)
root -> f1st, f(a,s,v) = (cuisine, Asian_food_review, Vietnamese)
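
Each row therefore starts with one partial plan per (attribute, source) pair in which its value occurs. A minimal sketch, assuming the index table V is a dictionary and representing each plan only by the (a,s,v) triple of its first value node; `initial_plans` is a hypothetical name.

```python
# V maps a value to the (attribute, source) pairs in which it occurs.
V = {
    "French": {("cuisine", "Zagat"), ("language", "EU_country_info")},
    "Vietnamese": {("cuisine", "Zagat"), ("cuisine", "Asian_food_review")},
}


def initial_plans(value):
    """One partial plan per (attribute, source) pair: root -> first value node."""
    return [(a, s, value) for a, s in sorted(V[value])]


for row, value in enumerate(["French", "Vietnamese"], start=1):
    print(f"Row {row}:", initial_plans(value))
```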

Page 24: Building Data Integration Queries By Demonstration

Construction Step 1

Compute the set intersection of the reachable attributes of all the rows & keep only reachable attributes that return non-empty suggested value sets

root -> f1st, f(a,s,v) = (cuisine, Zagat, French)

Page 25: Building Data Integration Queries By Demonstration

Construction Step 2

User selects the attribute & the attribute is in the same data source
    add the new node as the child of the root

User selects the attribute & the attribute is in a different source
    create necessary hidden nodes according to the primary key constraints and set the new node as the child of those hidden nodes

User decides to enter a value
    permute the partial plan over the attribute set from Step 1

Page 26: Building Data Integration Queries By Demonstration

Experiment

Retrieving the restaurant health rating
    only retrieves data from one data source (LA_health_rating)

Retrieving the restaurant information with reviews and health ratings
    includes the join between two data sources
    limits the number of sources so there would be no union operation

Retrieving the restaurant information with reviews and health ratings
    includes union and join as discussed in the previous section

Page 27: Building Data Integration Queries By Demonstration

Experiment

Baseline : Microsoft Access, which integrates the QBE (query by example) approach into its query design view

Typing in a value or selecting a value = 1t unit
Selecting a data source to use = 1d unit
Selecting an attribute = 1a unit

Page 28: Building Data Integration Queries By Demonstration

Conclusion

An approach to data integration with the following characteristics
    does not require the user to know details about data sources or existing values
    suggests valid possible values to the user
    creates consistent queries that always return values