Data Integration
Alon Halevy
Google Inc.
University of Aalborg
September, 2007
Introduction
What is Data Integration and Why is it Hard?
DBMS: it’s all about abstraction
Logical vs. Physical; What vs. How.

Students:
SSN           Name      Category
123-45-6789   Charles   undergrad
234-56-7890   Dan       grad
…             …

Takes:
SSN           CID
123-45-6789   CSE444
123-45-6789   CSE444
234-56-7890   CSE142
…

Courses:
CID       Name                Quarter
CSE444    Databases           fall
CSE541    Operating systems   winter
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
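A minimal sketch of the query above, run against an in-memory SQLite database. The slide's sample rows contain no student named Mary, so the two rows marked "hypothetical" are assumptions added only to make the query return something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);

    INSERT INTO Students VALUES ('123-45-6789', 'Charles', 'undergrad');
    INSERT INTO Students VALUES ('234-56-7890', 'Dan', 'grad');
    INSERT INTO Students VALUES ('345-67-8901', 'Mary', 'grad');  -- hypothetical row

    INSERT INTO Takes VALUES ('123-45-6789', 'CSE444');
    INSERT INTO Takes VALUES ('234-56-7890', 'CSE142');
    INSERT INTO Takes VALUES ('345-67-8901', 'CSE444');           -- hypothetical row

    INSERT INTO Courses VALUES ('CSE444', 'Databases', 'fall');
    INSERT INTO Courses VALUES ('CSE541', 'Operating systems', 'winter');
""")

# The query says WHAT is wanted (names of Mary's courses);
# the DBMS decides HOW to join the three tables.
mary_courses = [row[0] for row in conn.execute("""
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
""")]
print(mary_courses)
```

The application never mentions indexes, join order, or file layout, which is the logical vs. physical separation the slide refers to.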
Data Integration: A Higher-level Abstraction
Figure: a Query is posed against the Mediated Schema; Semantic Mappings connect the mediated schema to the sources S1, S2, S3.
The sources hold the same example data as the DBMS slide: (SSN, Name, Category), (SSN, CID), (CID, Name, Quarter), alongside the XML fragment that follows.
Independence of:
• source & location
• data model, syntax
• semantic variations
• …
<cd> <title> The best of … </title>
<artist> Carreras </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>
Application Area 1: Business
Single Mediated View over:
• Legacy Databases
• Services and Applications
• Enterprise Databases
EII Apps: CRM, ERP, Portals, …
50% of all IT $$$ spent here!
Application Area 2: Science
Sources: OMIM, Swiss-Prot, HUGO, GO, Gene-Clinics, Entrez, LocusLink, GEO
Schema entities: Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment
Hundreds of biomedical data sources available; growing rapidly!
Application Area 3: The Web

FullServe Corporation
Employees: FullTimeEmp, Hire, TempEmployees
Training: Courses, Enrollments
Sales: Products, Sales
Resumes: Interview, CV
Services: Services, Customers, Contracts
HelpLine: Calls

EuroCard Corporation
Employees: Employees, Hire
Credit Cards: Customer, CustDetail
Resumes: Interview
HelpLine: Calls
Heterogeneity 101

Employees (FullServe)
FullTimeEmp: ssn, empId, firstName, middleName, lastName
Hire: empId, hireDate, recruiter
TempEmployees: ssn, hireStart, hireEnd

Employees (EuroCard)
Employees: ID, firstNameMiddleInitial, lastName
Hire: ID, hireDate, recruiter
Find all employees (making over $100K)
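To make the heterogeneity concrete, here is a small sketch that maps rows from the two employee tables into one record shape. The `Employee` type, the mapping functions, and all sample rows are hypothetical; the format assumed for `firstNameMiddleInitial` (first name, a space, then an initial) is a guess, not something the slides specify.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """Hypothetical unified record; not part of the original slides."""
    company: str
    emp_id: str
    first_name: str
    middle_initial: str
    last_name: str


def from_fullserve(row: dict) -> Employee:
    # FullTimeEmp keeps first and middle names in separate columns.
    return Employee("FullServe", str(row["empId"]), row["firstName"],
                    row["middleName"][:1], row["lastName"])


def from_eurocard(row: dict) -> Employee:
    # Employees packs both into firstNameMiddleInitial (assumed "Bo L." style).
    first, _, initial = row["firstNameMiddleInitial"].partition(" ")
    return Employee("EuroCard", str(row["ID"]), first,
                    initial.rstrip("."), row["lastName"])


# Hypothetical sample rows, one per company.
fullserve_rows = [{"ssn": "000-00-0001", "empId": 17, "firstName": "Anna",
                   "middleName": "Karin", "lastName": "Holm"}]
eurocard_rows = [{"ID": "E-42", "firstNameMiddleInitial": "Bo L.",
                  "lastName": "Madsen"}]

all_employees = ([from_fullserve(r) for r in fullserve_rows]
                 + [from_eurocard(r) for r in eurocard_rows])
for emp in all_employees:
    print(emp)
```

Even this tiny case needs decisions about identifiers (empId vs. ID) and name formats. The salary condition in the slide's query cannot be expressed at all here, because neither schema as listed carries a salary attribute.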
Customer Call Center
Sales: Products, Sales
Services: Services, Customers, Contracts
Credit Cards: Customer, CustDetail
Customer: I lost my credit card!!
Agent: would you like to buy an espresso machine?
Challenges & Opportunities
Create a (useful) web site for tracking services
Collaborate with third parties: e.g., create branded services
Comply with government regulations: find “risky” employees
Business intelligence: what’s really wrong with our products?
The Deep Web
Millions of (good) forms out there
Each form has its own special interface
Hard to explore data across sites.
Goal (for some domains): a single interface into a multitude of deep-web sources.
The Semantic Web
Knowledge sharing at web scale.
Web resources are described by ontologies:
• Rich domain models; allow reasoning.
• RDF/OWL are the emerging standards
• OWL-lite may actually be useful.
Issues:
• Too complex for users?
• Killer apps?
• Scalability of reasoning?
Proposal: let’s build the SW bottom up.
Goal of Data Integration
Uniform query access to a set of data sources
Handle:
Scale of sources: from tens to millions
Heterogeneity
Autonomy
Semi-structure
Why is it Hard?
Systems-level reasons:
• Managing different platforms
• SQL across multiple systems is not so simple
• Distributed query processing
Logical reasons:
• Schema (and data) heterogeneity
‘Social’ reasons:
• Locating and capturing relevant data in the enterprise.
• Convincing people to share (data fiefdoms)
Security, privacy and performance implications.
Setting Expectations
Data integration is AI-Complete. Completely automated solutions unlikely.
Goal 1: Reduce the effort needed to set up an integration application.
Goal 2: Enable the system to perform gracefully with uncertainty (e.g., see the web)
Terminology
Relations and Queries
Relational Terminology
Relational schemas: tables, attributes
Relation instances: sets (or multi-sets) of tuples
Integrity constraints: keys, foreign keys, inclusion dependencies
Table/relation name: Product
Attribute names: PName, Price, Category, Manufacturer
Tuples or rows:
PName         Price     Category      Manufacturer
Gizmo         $19.99    Gadgets       GizmoWorks
Powergizmo    $29.99    Gadgets       GizmoWorks
SingleTouch   $149.99   Photography   Canon
MultiTouch    $203.99   Household     Hitachi

SQL (very basic)
Interview: candidate, date, recruiter, hireDecision, grade
EmployeePerf: empID, name, reviewQuarter, grade, reviewer
select recruiter, candidate
from Interview, EmployeePerf
where recruiter = name AND EmployeePerf.grade < 2.5
SQL (w/aggregation)
EmployeePerf: empID, name, reviewQuarter, grade, reviewer
select reviewer, Avg(grade)
from EmployeePerf
where reviewQuarter = '1/2007'
group by reviewer
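Both SQL examples can be tried in an in-memory SQLite database, as in the sketch below. All rows are hypothetical. Two details are assumptions about the intended queries rather than something the slides state: `grade` is qualified as `EmployeePerf.grade` in the join (both tables have a `grade` column), and the aggregate is grouped by `reviewer`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Interview (candidate TEXT, date TEXT, recruiter TEXT,
                            hireDecision TEXT, grade REAL);
    CREATE TABLE EmployeePerf (empID INTEGER, name TEXT, reviewQuarter TEXT,
                               grade REAL, reviewer TEXT);

    -- Hypothetical sample data.
    INSERT INTO Interview VALUES ('Alice', '2007-03-01', 'Bob', 'hire', 3.5);
    INSERT INTO Interview VALUES ('Carol', '2007-03-02', 'Dana', 'no hire', 2.0);
    INSERT INTO EmployeePerf VALUES (1, 'Bob', '1/2007', 2.0, 'Eve');
    INSERT INTO EmployeePerf VALUES (2, 'Dana', '1/2007', 3.0, 'Eve');
    INSERT INTO EmployeePerf VALUES (3, 'Bob', '4/2006', 4.0, 'Frank');
""")

# Join: recruiters with a performance grade below 2.5, and whom they interviewed.
weak_recruiters = conn.execute("""
    SELECT recruiter, candidate
    FROM Interview, EmployeePerf
    WHERE recruiter = name AND EmployeePerf.grade < 2.5
""").fetchall()

# Aggregation: average grade handed out by each reviewer in one quarter.
avg_by_reviewer = conn.execute("""
    SELECT reviewer, AVG(grade)
    FROM EmployeePerf
    WHERE reviewQuarter = '1/2007'
    GROUP BY reviewer
""").fetchall()

print(weak_recruiters)
print(avg_by_reviewer)
```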
Conjunctive Queries
Q(Y,X) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5.
select recruiter, candidate
from Interview, EmployeePerf
where recruiter = name AND EmployeePerf.grade < 2.5
Unions of Conjunctive Queries
Q(Y,X) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5.
Q(Y,X) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), Manager(Y), W > 3.9.
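As a rough illustration, each rule body can be read as a nested loop in which a repeated variable forces equality and the comparison acts as a filter. The sketch assumes the head is meant to return the recruiter (the shared variable Y) and the candidate (X), matching the SQL version. The relation contents and the `manager` set are hypothetical.

```python
# Hypothetical instances. Tuple positions follow the slide's schemas:
# Interview(candidate, date, recruiter, hireDecision, grade)
# EmployeePerf(empID, name, reviewQuarter, grade, reviewer)
interview = [
    ("Alice", "2007-03-01", "Bob", "hire", 3.5),
    ("Carol", "2007-03-02", "Dana", "no hire", 2.0),
]
employee_perf = [
    (1, "Bob", "1/2007", 2.0, "Eve"),
    (2, "Dana", "1/2007", 4.0, "Eve"),
]
manager = {"Dana"}


def rule_low_grade():
    # Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5
    return {(y, x)
            for (x, d, y, h, f) in interview
            for (e, y2, t, w, z) in employee_perf
            if y == y2 and w < 2.5}


def rule_top_manager():
    # Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), Manager(Y), W > 3.9
    return {(y, x)
            for (x, d, y, h, f) in interview
            for (e, y2, t, w, z) in employee_perf
            if y == y2 and y in manager and w > 3.9}


# A union of conjunctive queries is the set union of the individual answers.
answers = rule_low_grade() | rule_top_manager()
print(sorted(answers))
```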
Datalog (recursion)
Path(X,Y) :- edge(X,Y).
Path(X,Y) :- edge(X,Z), Path(Z,Y).
Database: edge(X,Y)
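One simple way to evaluate the recursive program is naive bottom-up iteration: start from the edge facts and keep applying the second rule until no new Path facts appear. The sketch below does this over a hypothetical three-edge graph; it illustrates the semantics and is not an efficient evaluation strategy.

```python
def transitive_closure(edges):
    """Naive fixpoint evaluation of the two Path rules over a set of edge pairs."""
    path = set(edges)  # Rule 1: Path(X,Y) :- edge(X,Y).
    while True:
        # Rule 2: Path(X,Y) :- edge(X,Z), Path(Z,Y).
        derived = {(x, y)
                   for (x, z) in edges
                   for (z2, y) in path
                   if z == z2}
        new_facts = derived - path
        if not new_facts:
            return path  # fixpoint reached: nothing new can be derived
        path |= new_facts


# Hypothetical database: a chain a -> b -> c -> d.
edge = {("a", "b"), ("b", "c"), ("c", "d")}
path = transitive_closure(edge)
print(sorted(path))
```

The loop terminates because the set of possible pairs over a finite database is finite and `path` only grows.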
Warmup Exercise
Virtual Data Integration Architecture
Mediated Schema:
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

Source Descriptions

Sources S1 to S5 (one wrapper per source):
Movies: name, actors, director, genre
Cinemas: place, movie, start
Cinemas in NYC: cinema, title, startTime
Cinemas in SF: location, movie, startingTime
Reviews: title, date, grade, review
Wrappers
<cd> <title> The best of … </title>
<artist> Abiteboul </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>
…
• Lixto [Vienna]
• Fetch [ISI]
• XQWrap [Georgia Tech]
• Wrapper induction [Dublin]
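A hand-written wrapper for the `<cd>` fragment might look like the sketch below, which turns the XML into flat (title, artist, price) tuples using Python's standard `xml.etree.ElementTree`. The function name `wrap_cd` and the choice of output shape are assumptions. The tools listed above aim to generate this kind of extraction code rather than have it written by hand, and the sketch does not reflect how any of them works.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, kept as-is (including the elided title).
CD_XML = """
<cd> <title> The best of … </title>
  <artist> Abiteboul </artist>
  <artist> Pavarotti </artist>
  <artist> Domingo </artist>
  <price> 19.95 </price>
</cd>
"""


def wrap_cd(xml_text):
    """Return one (title, artist, price) tuple per <artist> element."""
    cd = ET.fromstring(xml_text.strip())
    title = cd.findtext("title").strip()
    price = float(cd.findtext("price"))
    return [(title, artist.text.strip(), price)
            for artist in cd.findall("artist")]


rows = wrap_cd(CD_XML)
for row in rows:
    print(row)
```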
Mediation Languages
Source tables:
Books: Title, ISBN, Price, DiscountPrice, Edition
CDs: Album, ASIN, Price, DiscountPrice, Studio
BookCategories: ISBN, Category
CDCategories: ASIN, Category
Artists: ASIN, ArtistName, GroupName
Authors: ISBN, FirstName, LastName

Mediated Schema:
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …

logic (relates the source tables to the mediated schema)
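One possible reading of the "logic" between the source tables and the mediated schema is a set of view definitions. The sketch below writes the mediated relations `CD` and `Artist` as SQLite views over the source tables. The correspondences (Title from Album, Genre from Category, name from ArtistName) are assumptions made for illustration, not mappings given on the slide, and all data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source tables, with the attribute lists from the slide.
    CREATE TABLE CDs (Album TEXT, ASIN TEXT, Price REAL,
                      DiscountPrice REAL, Studio TEXT);
    CREATE TABLE CDCategories (ASIN TEXT, Category TEXT);
    CREATE TABLE Artists (ASIN TEXT, ArtistName TEXT, GroupName TEXT);

    -- Assumed mappings: each mediated relation is a view over the sources.
    CREATE VIEW CD AS
        SELECT c.ASIN AS ASIN, c.Album AS Title, k.Category AS Genre
        FROM CDs c JOIN CDCategories k ON c.ASIN = k.ASIN;
    CREATE VIEW Artist AS
        SELECT ASIN, ArtistName AS name FROM Artists;

    -- Hypothetical data.
    INSERT INTO CDs VALUES ('Hypothetical Album', 'B000TEST01', 19.95, 17.95,
                            'Hypothetical Studio');
    INSERT INTO CDCategories VALUES ('B000TEST01', 'Opera');
    INSERT INTO Artists VALUES ('B000TEST01', 'Hypothetical Artist', NULL);
""")

# A query phrased only in terms of the mediated schema.
opera_cds = conn.execute("""
    SELECT CD.Title, Artist.name
    FROM CD JOIN Artist ON CD.ASIN = Artist.ASIN
    WHERE CD.Genre = 'Opera'
""").fetchall()
print(opera_cds)
```

Defining mediated relations as views over sources is only one style of mediation language; the slides do not say which style is intended here.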
Figure: query processing pipeline
Query
→ Query reformulation
→ Logical query plan
→ Query optimizer
→ Physical query plan
→ Execution engine (can send a replanning request back to the query optimizer)
→ wrappers, one per source
Query Processing
Woody Allen Comedies in NY
select title, startTime
from Movie, Plays
where Movie.title = Plays.movie AND location = 'New York' AND director = 'Woody Allen'

Mediated schema:
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description
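The sketch below shows roughly what reformulation has to accomplish for this query: the mediated relations `Movie` and `Plays` are computed from source relations, and the query is then answered over them. Everything about the mappings is an assumption for illustration (for instance, that the NYC cinema source implies location "New York", and that `year` is unavailable from the Movies source). All rows are hypothetical.

```python
# Hypothetical source contents; column order follows the source schemas.
movies_source = [            # Movies: name, actors, director, genre
    ("Hypothetical Film A", "Actor One, Actor Two", "Woody Allen", "comedy"),
    ("Hypothetical Film B", "Actor Three", "Another Director", "drama"),
]
cinemas_nyc_source = [       # Cinemas in NYC: cinema, title, startTime
    ("Cinema X", "Hypothetical Film A", "19:30"),
    ("Cinema Y", "Hypothetical Film B", "20:00"),
]
cinemas_sf_source = [        # Cinemas in SF: location, movie, startingTime
    ("San Francisco", "Hypothetical Film A", "21:00"),
]


def movie():
    """Mediated Movie(title, director, year, genre); year is not in the source."""
    return [(name, director, None, genre)
            for (name, actors, director, genre) in movies_source]


def plays():
    """Mediated Plays(movie, location, startTime), unioned from two sources."""
    nyc = [(title, "New York", start)
           for (cinema, title, start) in cinemas_nyc_source]
    sf = [(film, location, start)
          for (location, film, start) in cinemas_sf_source]
    return nyc + sf


def woody_allen_in_ny():
    # The slide's query, evaluated over the mediated relations.
    return [(title, start)
            for (title, director, year, genre) in movie()
            for (film, location, start) in plays()
            if title == film and location == "New York"
            and director == "Woody Allen"]


showtimes = woody_allen_in_ny()
print(showtimes)
```

A real system would not materialize every source first; it would use the source descriptions to decide that only the Movies and NYC cinema sources are relevant, then push the selections down to them.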
Outline (1)
Mediation languages: theoretical foundations (containment, answering queries using views)
Creating schema mappings
Query processing
Adaptive query processing
XML and its role in data integration
Outline (2)
Other architectures for data integration: warehousing, data exchange, P2P DBMS
Advanced (i.e., current) topics:
• Dataspaces
• The deep web
• Uncertainty in data integration