data integration introduction dbms

7
1 Data Integration Alon Halevy Google Inc. University of Aalborg September, 2007 Introduction What is Data Integration and Why is it Hard? DBMS: it’s all about abstraction Logical vs. Physical; What vs. How. SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 Students: Takes: CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter Courses: SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid Data Integration: A Higher-level Abstraction Mediated Schema Query S1 S2 S3 SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter Semantic Mappings Independence of: source & location • data model, syntax • semantic variations • … <cd> <title> The best of … </title> <artist> Carreras </artist> <artist> Pavarotti </artist> <artist> Domingo </artist> <price> 19.95 </price> </cd> Application Area 1: Business Single Mediated View Legacy Databases Services and Applications Enterprise Databases CRM ERP Portals EII Apps: 50% of all IT $$$ spent here! Application Area 2: Science OMIM Swiss- Prot HUGO GO Gene- Clinics Entrez Locus- Link GEO Sequenceable Entity Gene Phenotype Structured Vocabulary Experiment Protein Nucleotide Sequence Microarray Experiment Hundreds of biomedical data sources available; growing rapidly!

Upload: abdulgani11

Post on 07-Apr-2015

640 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Integration Introduction DBMS

1

Data Integration

Alon Halevy

Google Inc.

University of Aalborg

September, 2007

Introduction

What is Data Integration

and

Why is it Hard?

DBMS: it’s all about abstractionLogical vs. Physical; What vs. How.

SSN Name Category

123-45-6789 Charles undergrad

234-56-7890 Dan grad

… …

SSN CID

123-45-6789 CSE444

123-45-6789 CSE444

234-56-7890 CSE142

Students: Takes:

CID Name Quarter

CSE444 Databases fall

CSE541 Operating systems winter

Courses:

SELECT C.nameFROM Students S, Takes T, Courses CWHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid

Data Integration:A Higher-level Abstraction

Mediated Schema

Query

S1 S2 S3SSN Name Category

123-45-6789 Charles undergrad

234-56-7890 Dan grad

… …

SSN CID

123-45-6789 CSE444

123-45-6789 CSE444

234-56-7890 CSE142

CID Name Quarter

CSE444 Databases fall

CSE541 Operating systems winter

… …

SemanticMappings

Independence of:• source & location• data model, syntax• semantic variations• …

<cd> <title> The best of … </title>

<artist> Carreras </artist>

<artist> Pavarotti </artist>

<artist> Domingo </artist>

<price> 19.95 </price>

</cd>

Application Area 1: Business

Single Mediated View

Legacy DatabasesServices and Applications

Enterprise Databases

CRMERPPortals…

EII Apps:

50% of all IT $$$ spent here!

Application Area 2: Science

OMIM Swiss-Prot

HUGO GO

Gene-Clinics EntrezLocus-

Link GEO

SequenceableEntity

GenePhenotype StructuredVocabulary

Experiment

ProteinNucleotideSequence

MicroarrayExperiment

Hundreds of biomedical data sources available; growingrapidly!

Page 2: Data Integration Introduction DBMS

2

Application Area 3: The WebFullServe Corporation

Employees

Training

Sales

Resumes

Services

HelpLine

FullTimeEmpHireTempEmployees

CoursesEnrollments

ProductsSales

InterviewCV

ServicesCustomersContracts

Calls

EuroCard CorporationEmployees

Credit Cards

Resumes

HelpLine

EmployeesHire

CustomerCustDetail

Interview

Calls

Heterogeneity 101EmployeesFullTimeEmp ssn, empId, firstName middleName, lastName

Hire empId, hireDate, recruiter

TempEmployees ssn, hireStart, hireEnd

EmployeesEmployees ID, firstNameMiddleInitial, lastName

Hire ID, hireDate, recruiter

Find all employees (making over $100K)

Customer Call Center

Sales

Services

ProductsSales

ServicesCustomersContracts

Credit CardsCustomerCustDetail

Customer: I lost my credit card!!

Agent: would you like to buy an espresso machine?

Challenges & Opportunities

Create a (useful) web site for trackingservicesCollaborate with third parties E.g., create branded services

Comply with government regulations Find “risky” employees

Business intelligence What’s really wrong with our products?

Page 3: Data Integration Introduction DBMS

3

The Deep Web

Millions of (good) forms out there

Each form has its own special interface Hard to explore data across sites.

Goal (for some domains): A single interface into a multitude of deep-

web sources.

The Semantic Web

Knowledge sharing at web scale.Web resources are described by ontologies: Rich domain models; allow reasoning. RDF/OWL are the emerging standards

OWL-lite may actually be useful.

Issues: Too complex for users? Killer apps? Scalability of reasoning?

Proposal: let’s build the SW bottom up.

Goal of Data Integration

Uniform query access to a set of datasources

Handle: Scale of sources: from tens to millions

Heterogeneity

Autonomy

Semi-structure

Page 4: Data Integration Introduction DBMS

4

Why is it Hard?

Systems-level reasons: Managing different platforms SQL across multiple systems is not so simple Distributed query processing

Logical reasons: Schema (and data) heterogeneity

‘Social’ reasons: Locating and capturing relevant data in the enterprise. Convincing people to share (data fiefdoms)

Security, privacy and performance implications.

Setting Expectations

Data integration is AI-Complete. Completely automated solutions unlikely.

Goal 1: Reduce the effort needed to set up an

integration application.

Goal 2: Enable the system to perform gracefully

with uncertainty (e.g., see the web)

Terminology

Relations and Queries

Relational Terminology

Relational schemas Tables, attributes

Relation instances Sets (or multi-sets) of tuples

Integrity constraints Keys, foreign keys, inclusion dependencies

HitachiHousehold$203.99MultiTouch

CanonPhotography$149.99SingleTouch

GizmoWorksGadgets$29.99Powergizmo

GizmoWorksGadgets$19.99Gizmo

ManufacturerCategoryPricePName

Product

Attribute names

Tuples or rows

Table/relation name SQL (very basic)Interview: candidate, date, recruiter, hireDecision, grade

EmployeePerf: empID, name, reviewQuarter, grade, reviewer

select recruiter, candidatefrom Interview, EmployeePerfwhere recuiter=name AND grade < 2.5

Page 5: Data Integration Introduction DBMS

5

SQL (w/aggregation)

EmployeePerf: empID, name, reviewQuarter, grade, reviewer

select reviewer, Avg(grade)from EmployeePerfwhere reviewQuarter=“1/2007”

Conjunctive QueriesQ(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5.

select recruiter, candidatefrom Interview, EmployeePerfwhere recuiter=name AND grade < 2.5

Unions of Conjunctive QueriesQ(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5.

Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), Manager(y), W > 3.9.

Datalog (recursion)

Path(X,Y) :- edge(X,Y)

Path(X,Y) :- edge(X,Z), path(Z,Y)

Database: edge(X,Y)

Warmup Exercise

Virtual Data IntegrationArchitecture

Page 6: Data Integration Introduction DBMS

6

wrapper wrapper wrapper wrapper wrapper

Mediated Schema

Source Descriptions

Cinemas: place, movie,

start

Reviews: title, date

grade, review

Movies: name, actors, director, genre

Cinemas in NYC: cinema, title,

startTime

Cinemas in SF: location, movie,

startingTime

Movie: Title, director, year, genreActors: title, actorPlays: movie, location, startTimeReviews: title, rating, description

S1 S2 S3 S4 S5

Wrappers<cd> <title> The best of … </title>

<artist> Abiteboul </artist>

<artist> Pavarotti </artist>

<artist> Domingo </artist>

<price> 19.95 </price>

</cd>

•Lixto [Vienna]•Fetch [ISI]•XQWrap [Georgia Tech]•Wrapper induction [Dublin]

Mediation Languages

BooksTitleISBNPriceDiscountPriceEdition

CDsAlbumASINPriceDiscountPriceStudio

BookCategoriesISBNCategory

CDCategoriesASINCategory

ArtistsASINArtistNameGroupName

AuthorsISBNFirstNameLastName

CD: ASIN, Title, Genre,…Artist: ASIN, name, …

Mediated Schema

logic

Execution engine

source

wrapper

source

wrapper

source

wrapper

source

wrapper

source

wrapper

Query optimizer

Query reformulationQuery

Logical query plan

Physical query plan

Replanning request

Query Processing Woody Allen Comedies in NY

select title, startTimefrom Movie, Playswhere Movie.title=Plays.movie AND location=“New York” AND director=“Woody Allen”

Movie: Title, director, year, genreActors: title, actorPlays: movie, location, startTimeReviews: title, rating, description

Page 7: Data Integration Introduction DBMS

7

Cinemas: place, movie,

start

Reviews: title, date

grade, review

Movies: name, actors, director, genre

Cinemas in NYC: cinema, title,

startTime

Cinemas in SF: location, movie,

startingTime

Movie: Title, director, year, genreActors: title, actorPlays: movie, location, startTimeReviews: title, rating, description

S1 S2 S3 S4 S5

select title, startTimefrom Movie, Playswhere Movie.title=Plays.movie AND location=“New York” AND director=“Woody Allen”

Outline (1)

Mediation languages: theoretical foundations (containment,

answering queries using views)

Creating schema mappingsQuery processing Adaptive query processing

XML and its role in data integration

Outline (2)

Other architectures for data integration Warehousing, data exchange, p2p dbms

Advanced (i.e., current) topics: Dataspaces

The deep web

Uncertainty in data integration