dr.-ing. eike schallehn · 2017. 11. 2. · distributed data management organization organization...

85
Distributed Data Management Dr.-Ing. Eike Schallehn OvG Universität Magdeburg Fakultät für Informatik Institut für Technische und Betriebliche Informationssysteme 2007/2008

Upload: others

Post on 11-Aug-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Dr.-Ing. Eike Schallehn

OvG Universität MagdeburgFakultät für Informatik

Institut für Technische und Betriebliche Informationssysteme

2007/2008

Page 2: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Organization

Organization

Lecture: Friday, 13:00 - 14:30, Raum G10-110Teacher: Eike Schallehn

Consultation hours: on appointment via emailRoom: G29-125Telephone: (0391) 67-12845Email: [email protected]

Exercise: Thursday, 11:15 - 13:45, Raum G22A-218Starting October 24thTeacher: Eike Schallehn

Page 3: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Organization

Prerequisites

Required: knowledge about database basics fromdatabase introduction course

Basic principles, Relational Model, SQL, database design,ER Model

Helpful: advanced knowledege about database internalsQuery processing, storage structures

Helpful hands-on experience:SQL queries, DDL and DML

Page 4: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Organization

Overview

1 Foundations2 Distributed DBMS:

architectures, distribution, query processing, transactionmanagement, replication

3 Parallel DBMS:architectures, query processing

4 Federated DBS:architectures, conflicts, integration, query processing

5 Peer-to-peer Data Management

Page 5: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Organization

English Literature /1

M. Tamer Özsu, P. Valduriez: Principles of DistributedDatabase Systems. Second Edition, Prentice Hall, UpperSaddle River, NJ, 1999.S. Ceri and G. Pelagatti: Distributed Databases Principlesand Systems, McGraw Hill Book Company, 1984.C. T. Yu, W. Meng: Principles of Database QueryProcessing for Advanced Applications. Morgan KaufmannPublishers, San Francisco, CA, 1998.

Page 6: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Organization

English Literature /2

Elmasri, R.; Navathe, S.: Fundamentals of DatabaseSystems, Addison Wesley, 2003C. Dye: Oracle Distributed Systems, O’Reilly, Sebastopol,CA, 1999.D. Kossmann: The State of the Art in Distributed QueryProcessing, ACM Computing Surveys, Vol. 32, No. 4,2000, S. 422-469.

Page 7: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Organization

German Literature

P. Dadam: Verteilte Datenbanken undClient/Server-Systeme, Springer-Verlag, Berlin, Heidelberg1996.E. Rahm: Mehrrechner-Datenbanksysteme.Addison-Wesley, Bonn, 1994.http://dbs.uni-leipzig.de/buecher/mrdbs/S. Conrad: Föderierte Datenbanksysteme: Konzepte derDatenintegration. Springer-Verlag, Berlin/Heidelberg,1997.

Page 8: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Part I

Introduction

Page 9: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Overview

MotivationRecapitulationClassification of Multi-Processor DBMS

Page 10: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Centralized Data Management

TXN management

Data manipulation

Data definition

Application

DBMS

Application

Application

...

New requirementsSupport for de-centralized organization structuresHigh availabilityHigh performanceScalability

Page 11: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Data Management in a Network

Node

Network

Node

Node

Node

Page 12: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Distributed Data Management

Node

Node

Network

Node

Node

Page 13: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Distributed Data Management: Example

Network

Products San Francisco, Sydney

Customers San Francisco, Sydney

Customers SydneyProducts München, Magdeburg,

Sydney

Customers München

Customers Magdeburg

Products Magdeburg, San Francisco

Products Sydney

Sydney

Magdeburg

München

San Francisco

Page 14: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Advantages of Distributed DBMS

Transparent management of distributed/replicated dataAvailability and fault tolerancePerformanceScalability

Page 15: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Transparent Data Management

Transparency: "‘hide"’ implementation detailsFor (distributed) database systems

Data independence (physical, logical)Network transparency

"‘hide"’ existence of the network"‘hide"’ physical location of data

To applications a distributed DBS looks just like acentralized DBS

Page 16: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Transparent Data Management/2

continued:Replication transparency

Replication: managing copies of remote data (performance,availability, fault-tolerance)Hiding the existence of copies (e.g. during updates)

Fragmentation transparencyFragmentation: decomposition of relations and distribution ofresulting fragmentsHiding decomposition of global relation

Page 17: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Who provides Transparency?

ApplicationDifferent parts/modules of distributed applicationCommunication / data exchange using standard protocols(RPC, CORBA, HTTP, SOAP, . . . )

DBMSTransparent SQL-access to data on remote DB-instancesRequires query decomposition, transaction coordination,replication

Operating systemOperating systems provides network transparency e.g. onfile system level (NFS) or through standard protocols(TCP/IP)

Page 18: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Fault-Tolerance

Failure of one single node can be compensatedRequires

Replicated copies on different nodesDistributed transactions

Page 19: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Performance

Data can be stored, where they are most likely used →reduction of transfer costsParallel processing in distributed systems

Inter-transaction-parallelism: parallel processing of differenttransactionsInter-query-parallelism: parallel processing of differentqueriesIntra-query-parallelism: parallel of one or several operationswithin one query

Page 20: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Scalability

Requirements raised by growing databases or necessaryperformance improvement

Addition of new nodes/processors often cheaper thandesign of new system or complex tuning measures

Page 21: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Differentiation

Distributed Information SystemApplication components communicate for purpose of dataexchange (distribution on application level)

Distributed DBSDistribution solely realized on the DBS-level

Page 22: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Differentiation /2

DBMS on distributed file systemAll data must be read from blocks stored on different disksProcessing is performed only within DBMS node (notdistributed)Distribution handled by operating system

Page 23: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Special Case: Parallel DBS

Data management on simultaneous computer (multiprocessor, special hardware)Processing capacities are used for performanceimprovementExample

100 GB relation, sequential read with 10 MB/s 17minutesparallel read on 10 nodes (without considering coordinationoverhead) 1:40 minutes

Page 24: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Special Case: Heterogeneous DBS

Motivation: integration of previously existing DBS (legacysystems)

Integrated access: global queries, relationships betweendata objects in different databases, global integrity

ProblemsHeterogeneity on different levels: system, data model,schema, data

Special importance on the WWW: integration of Websources Mediator concept

Page 25: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Motivation

Special Case: Peer-to-Peer-Systems

Peer-to-Peer (P2P): networks without centralized serversAll / many nodes (peers) store dataEach node knows only some "‘close"’ neighbors

No global viewNo centralized coordination

Examples: Napster, Gnutella, Freenet, BitTorrent, . . .Distributed management of data (e.g. MP3-Files)Lookup using centralized servers (Napster) or distributed(Gnutella)

Page 26: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Database Management Systems (DBMS)

Nowadays commonly usedto store huge amounts of data persistently,in collaborative scenarios,to fulfill high performance requirements,to fulfill high consistency requirements,as a basic component of information systems,to serve as a common IT infrastructure for departments ofan organization or company.

Page 27: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Database Management Systems

A database management system (DBMS) is a suite ofcomputer programs designed to manage a database and runoperations on the data requested by numerous clients.

A database (DB) is an organized collection of data.

A database system (DBS) is the concrete instance of adatabase managed by a database management system.

Page 28: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Codd’s 9 Rules for DBMS /1

Differentiate DBMS from other systems managing datapersistently, e.g. file systems

1 Integration: homogeneous, non-redundant managementof data

2 Operations: means for accessing, creating, modifying,and deleting data

3 Catalog: the data description must be accessible as partof the database itself

4 User views: different users/applications must be able tohave a different perception of the data

Page 29: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Codd’s 9 Rules for DBMS /2

5 Integrity control: the systems must provide means togrant the consistency of data

6 Data security: the system must grant only authorizedaccesses

7 Transactions: multiple operations on data can be groupedinto a logical unit

8 Synchronization: parallel accesses to the database aremanaged by the system

9 Data backups: the system provides functionality to grantlong-term accessibility even in case of failures

Page 30: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

3 Level Schema Architecture

Internal Schema

Data R

epresentation

Query P

rocessing

Conceptual Schema

External Schema N. . .External Schema 1

Page 31: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

3 Level Schema Architecture /2

Important concept of DBMSProvides

transparency, i.e. non-visibility, of storage implementationdetailsease of usedecreased application maintenance effortsconceptual foundation for standardsportability

Describes abstraction steps:Logical data independencePhysical data independence

Page 32: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Data Independence

Logical data independence: Changes to the logical schemalevel must not require a change to an application (externalschema) based on the structure.

Physical data independence: Changes to the physicalschema level (how data is stored) must not require a changeto the logical schema.

Page 33: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Architecture of a DBS

Schema architecture roughly con-forms to general architecture of adatabase systems

Applications accessdatabase using specificviews (external schema)

The DBMS provides accessfor all applications using thelogical schema

The database is stored onsecondary storage accordingto an internal schema

Application n. . .

Database

DBMS

Application 1

Page 34: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Client Server Architecture

Schema architecture does notdirectly relate to client server archi-tecture (communication/networkarchitecture)

Clients may run severalapplications

Applications may run onseveral clients

DB servers may bedistributed

...

DB Server

. . .

Database

Client 1 Client n

Page 35: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Relational Model

Developed by Edgar F. Codd (1923-2003) in 1970Derived from mathematical model of n-ary relationsColloquial: data is organized as tables (relations) ofrecords (n-tuples) with columns (attributes)Currently most commonly used database modelRelational Database Management Systems (RDBMS)First prototype: IBM System R in 1974Implemented as core of all major DBMS since late ’70s:IBM DB2, Oracle, MS SQL Server, Informix, Sybase,MySQL, PostgreSQL, etc.Database model of the DBMS language standard SQL

Page 36: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Basic Constructs

A relational database is a database that is structuredaccording to the relational database model. It consists of aset of relations.

Relation

. . .

. . .

. . .

. . .R 1 nA A

Tuple

Relation schema

Relation name Attributes

}

Page 37: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Integrity Constraints

Static integrity constraints describe valid tuples of arelation

Primary key constraintForeign key constraint (referential integrity)Value range constraints...

In SQL additionally: uniqueness and not-NULLTransitional integrity constraints describe valid changes toa database

Page 38: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Relational Algebra

A relational algebra is a set of operations that are closedover relations.

Each operation has one or more relations as inputThe output of each operation is a relation

Page 39: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Relational Operations

Primitive operations:Selection σ

Projection π

Cartesian product(cross product) ×Set union ∪Set difference −Rename β

Non-primitive operationsNatural Join ./

θ-Join and Equi-Join./ϕ

Semi-Join nOuter-Joins = ×Set intersection ∩. . .

Page 40: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Notation for Relations and Tuples

If R denotes a relation schema (set of attributes), than thefunction r(R) denotes a relation conforming to that schema(set of tuples)R and r(R) are often erroneously used synonymously todenote a relation, assuming that for a given relationschema only one relation existst(R) denotes a tuple conforming to a relation schemat(R.a) denotes an attribute value of a tuple for an attributea ∈ R

Page 41: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Selection Operation σ

Select tuples based on predicate or complex condition

PROJECTPNAME PNUMBER PLOCATION DNUMProductX 1 Bellaire 5ProductY 2 Sugarland 5ProductZ 3 Houston 5Computerization 10 Stafford 4Reorganization 20 Houston 1Newbenefits 30 Stafford 4

σPLOCATION=′Stafford ′(r(PROJECT ))

PNAME PNUMBER PLOCATION DNUMComputerization 10 Stafford 4Newbenefits 30 Stafford 4

Page 42: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Projection Operation π

Project to set of attributes - remove duplicates if necessaryPROJECT

PNAME PNUMBER PLOCATION DNUMProductX 1 Bellaire 5ProductY 2 Sugarland 5ProductZ 3 Houston 5Computerization 10 Stafford 4Reorganization 20 Houston 1Newbenefits 30 Stafford 4

πPLOCATION,DNUM(r(PROJECT ))

PLOCATION DNUMBellaire 5Sugarland 5Houston 5Stafford 4Houston 1

Page 43: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Cartesian or cross product ×

Create all possible combinations of tuples from the two inputrelations

RA B1 23 4

SC D E5 6 78 9 1011 12 13

r(R)× r(S)

A B C D E1 2 5 6 71 2 8 9 101 2 11 12 133 4 5 6 73 4 8 9 103 4 11 12 13

Page 44: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Set: Union, Intersection, Difference

All require compatible schemas: attribute names anddomainsUnion: duplicate entries are removedIntersection and Difference: ∅ as possible result

Page 45: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Natural Join Operation ./

Combine tuples from two relations r(R) and r(S) where forall attributes a = R ∩ S (defined in both relations)is t(R.a) = t(S.a).

Basic operation for following key relationshipsIf there are no common attributes result is CartesianproductR ∩ S = ∅ =⇒ r(R) ./ r(S) = r(R)× r(S)

Can be expressed as combination of π, σ and ×r(R) ./ r(S) = πR∪S(σV

a∈R∩S t(R.a)=t(S.a)(r(R)× r(S)))

Page 46: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Natural Join Operation ./ /2

RA B1 23 45 6

SB C D4 5 66 7 88 9 10

r(R) ./ r(S)A B C D3 4 5 65 6 7 8

Page 47: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

The Semi-Join Operation n

Results all tuples from one relation having a (natural) joinpartner in the other relationr(R) n r(S) = πR(r(R) ./ r(S))

PERSONPID NAME

1273 Dylan2244 Cohen3456 Reed

CARPID BRAND

1273 Cadillac1273 VW Beetle3456 Stutz Bearcat

r(PERSON) n r(CAR)

PID NAME1273 Dylan3456 Reed

Page 48: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Other Join Operations

Conditional join: join condition ϕ is explicitly specifiedr(R) ./ϕ r(S) = σϕ(r(R)× r(S))

θ-Join: special conditional join, where ϕ is a singlepredicate of the form aθb with a ∈ R, b ∈ S, andθ ∈ {=, 6=, >, <,≤,≥, . . . }Equi-Join: special θ-Join where θ is =.(Left or Right) Outer Join: union of natural join result andtuples from the left or right input relation which could notbe joined (requires NULL-values to grant compatibleschemas).

Page 49: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Relational Database Management Systems

A Relational Database Management System (RDBMS) is adatabase management system implementing the relationaldatabase model.

Today, most relational DBMS implement the SQL databasemodelThere are some significant differences between therelational model and SQL (duplicate rows, tuple ordersignificant, anonymous column names, etc.)Most distributed and parallel DBMS have a relational(SQL) data model

Page 50: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL Data Model

Said to implement relational database modelDefines own terms

Table

. . .

. . .

. . .

. . .R 1 nA A

Row

Table head

Table name Columns

}

Some significant differences exist

Page 51: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Structured Query Language

Structured Query Language (SQL): declarative languageto describe requested query resultsRealizes relational operations (with the mentioneddiscrepancies)Basic form: SELECT-FROM-WHERE-block (SFW)

SELECT FNAME, LNAME, MGRSTARTDATEFROM EMPLOYEE, DEPARTMENTWHERE SSN=MGRSSN

Page 52: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Selection σ

σDNO=5∧SALARY>30000(r(EMPLOYEE))

SELECT *FROM EMPLOYEEWHERE DNO=5 AND SALARY>30000

Page 53: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Projection π

πLNAME ,FNAME(r(EMPLOYEE))

SELECT LNAME,FNAMEFROM EMPLOYEE

Difference to RM: does not remove duplicatesRequires additional DISTINCT

SELECT DISTINCT LNAME,FNAMEFROM EMPLOYEE

Page 54: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Cartesian Product ×

r(EMPLOYEE)× r(PROJECT )

SELECT *FROM EMPLOYEE, PROJECT

Page 55: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Natural Join ./

r(DEPARTMENT ) ./ r(DEPARTMENT_LOCATIONS)

SELECT *FROM DEPARTMENT

NATURAL JOIN DEPARTMEN_LOCATIONS

Page 56: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Equi-Join

r(EMPLOYEE) ./SSN=MGRSSN r(DEPARTMENT )

SELECT *FROM EMPLOYEE, DEPARTMENTWHERE SSN=MGRSSN

Page 57: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Union

r(R) ∪ r(S)

SELECT * FROM RUNIONSELECT * FROM S

Other set operations: INTERSECT, MINUSDoes remove duplicates (in compliance with RM)If duplicates required:

SELECT * FROM RUNION ALLSELECT * FROM S

Page 58: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL: Other Features

SQL provides several features not in the relational algebraGrouping And Aggregation Functions, e.g. SUM, AVG,COUNT, . . .Sorting

SELECT PLOCATION, AVG(HOURS)FROM EMPLOYEE, WORKS_ON, PROJECTWHERE SSN=ESSN AND PNO=PNUMBERGROUP BY PLOCATIONHAVING COUNT(*) > 1ORDER BY PLOCATION

Page 59: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL DDL

Data Definition Language to create, modify, and deleteschema objects

CREATE DROP ALTER TABLE mytable ( id INT, ...)DROP TABLE ...ALTER TABLE ...CREATE VIEW myview AS SELECT ...DROP VIEW ...CREATE INDEX ...DROP INDEX ......

Page 60: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Simple Integrity Constraints

CREATE TABLE employee(ssn INTEGER,lname VARCHAR2(20) NOT NULL,dno INTEGER,...FOREIGN KEY (dno)

REFERENCES department(dnumber),PRIMARY KEY (ssn)

)

Additionally: triggers, explicit value domains, ...

Page 61: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

SQL DML

Data Manipulation Language to create, modify, anddelete tuples

INSERT INTO (<COLUMNS>) mytable VALUES (...)

INSERT INTO (<COLUMNS>) mytable SELECT ...

UPDATE mytableSET ...WHERE ...

DELETE FROM mytableWHERE ...

Page 62: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Other Parts of SQL

Data Control Language (DCL):GRANT, REVOKE

Transaction management:START TRANSACTION, COMMIT, ROLLBACK

Stored procedures and imperative programming conceptsCursor definition and management

Page 63: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Transactions

Sequence of database operationsRead and write operationsIn SQL sequence of INSERT, UPDATE, DELETE, SELECTstatements

Build a semantic unit, e.g. transfer of an amount from onebank account to anotherHas to conform to ACID properties

Page 64: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Transactions: ACID Properties

Atomicity means that a transaction can not be interruptedor performed only partially

TXN is performed in its entirety or not at allConsistency to preserve data integrity

A TXN starts from a consistent database state and endswith a consistent database state

IsolationResult of a TXN must be independent of other possiblyrunning parallel TXNs

Durability or persistenceAfter a TXN finished successfully (from the user’s view) itsresults must be in the database and the effect can not bereversed

Page 65: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Functional Dependencies

A functional dependency (FD) X→Y within a relationbetween sets r(R) of attributes X ⊆ R and Y ⊆ R exists, iffor each tuple the values of X determine the values for Yi.e.

∀t1, t2 ∈ r(R) : t1(X ) = t2(X ) ⇒ t1(Y ) = t2(Y )

Page 66: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Derivation Rules for FDs

R1 Reflexivity if X ⊇ Y =⇒ X→YR2 Accumulation {X→Y} =⇒ XZ→YZR3 Transitivity {X→Y , Y →Z} =⇒ X→ZR4 Decomposition {X→YZ} =⇒ X→YR5 Unification {X→Y , X→Z} =⇒ X→YZR6 Pseudotransitivity {X→Y , WY →Z} =⇒ WX→Z

R1-R3 known as Armstrong-Axioms (sound, complete)

Page 67: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Normal Forms

Formal criteria to force schemas to be free of redundancyFirst Normal Form (1NF) allows only atomic attributevalues

i.e. all attribute values ar of basic data types like integeror string but not further structured like e.g. an array or aset of values

Second Normal Form (2NF) avoids partial dependenciesA partial dependency exist, if a non-key attribute isfunctionally dependent on a real subset of the primary keyof the relation

Page 68: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Recapitulation

Normal Forms /2

Third Normal Form (3NF) avoids transitive dependenciesDisallows functional dependencies between non-keyattributes

Boyce-Codd-Normal Form (BCNF) disallows transitivedependencies also for primary key attributes

Page 69: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Multi-Processor DBMS

In general: DBMS which are able to use multipleprocessors or DBMS-instances to process databaseoperations [Rahm 94]Can be classified according to different criteria

Processors with same or different functionalityAccess to external storageSpatial distributionProcessor connectionIntegrated vs. federated architectureHomogeneous vs. heterogeneous architecture

Page 70: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Processor Functionality

Assumption: each processor provides the samefunctionalityClassification [Rahm94]

Connection

External Storage

local

loose (close) loose

Shared−NothingShared−DiskShared−Everything

Multi−Processor DBMS

partitioned

local distributed

tight close loose

Spatial

shared

Distribution

Processor

Page 71: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Access to External Storage

Partitioned accessExternal storage is divided among processors/nodes

Each processor has only access to local storageAccessing different partitions requires communication

Shared accessEach processor has access to full databaseRequires synchronisation

Page 72: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Spatial Distribution

Locally distributed: DB-ClusterFast inter-processor communicationFault-toleranceDynamic load balancing possibleLittle administration effortsApplication: parallel DBMS, solutions for high availabilty

Remotely distributed: distributed DBS in WAN scenariosSupport for distributed organization structuresFault-tolerant (even to major catastrophes)Application: distributed DBS

Page 73: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Processor Connection

Tight connectionProcessors share main memoryEfficient co-operationLoad-balancing by means of operating systemProblems: Fault-tolerance, cache coherence, limitednumber of processors (≤ 20)Parallel multi-processor DBMS

Page 74: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Processor Connection /2

Loose connection:Independent nodes with own main memory and DBMSinstancesAdvantages: failure isolation, scalabilityProblems: expensive network communication, costly DBoperations, load balancing

Close connection:Mix of the aboveIn addition to own main memory there is connection viashared memoryManaged by operating system

Page 75: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Class: Shared-Everything

Cache

CPU

Shared Main Memory

Cache

CPU CPU

DBMS Buffer

Shared Hard Disks

Cache

Page 76: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Class: Shared-Everything /2

Simple realization of DBMS

Distribution transparency provided by operating system

Expensive synchronization

Extended implementation of query processing

Page 77: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Class: Shared-Nothing

CPU

Cache

CPU

Cache

CPU

DBMS−

Cache

DBMS−

Buffer

DBMS−

BufferBuffer

Main Memory Main Memory Main Memory

Page 78: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Class: Shared-Nothing /2

Distribution of DB across various nodes

Distributed/parallel execution plans

TXN management across participating nodes

Management of catalog and replicas

Page 79: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Class: Shared-Disk

Cache

CPU

Main Memory

Cache

CPU

gemeinsame Festplatten

Highspeed Communication

DBMS−

Buffer

DBMS−

Puffer

DBMS−

Buffer

CPU

Main Memory Main Memory

Cache

Page 80: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Class: Shared-Disk /2

Avoids physical data distribution

No distributed TXNs and query processing

Requires buffer invalidation

Page 81: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Integrated vs. Federated DBS

Integrated:Shared database for all nodes one conceptual schemaHigh distribution transparency: access to distributed DB vialocal DBMSRequires co-operation of DBMS nodes restrictedautonomy

Federated:Nodes with own DB and own conceptual schemaRequires schema integration global conceptual schemaHigh degree of autonomy of nodes

Page 82: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Integrated vs. Federates DBS /2

homogeneous

Multi−Processor−DBS

Shared−Disk Shared−Nothing

integrated integrated federated

heterogeneoushomogeneous homogeneous

Page 83: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Centralized vs. De-centralized Coordination

Centralized:Each node has global view on database (directly of viamaster)Central coordinator: initiator of query/transaction → knowsall participating nodesProvides typical DBS properties (ACID, resultcompleteness, etc.)Applications: distributed and parallel DBS

Limited availability, fault-tolerance, scalabilty

Page 84: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Criterion: Centralized vs. De-centralized Coordination/2

De-centralized:No global view on schema → peer knows only neighborsAutonomous peers; global behavior depends on localinteractionCan not provide typical DBMS propertiesApplication: P2P systems

Advantages: availability, fault-tolerance, scalabilty

Page 85: Dr.-Ing. Eike Schallehn · 2017. 11. 2. · Distributed Data Management Organization Organization Lecture: Friday, 13:00 - 14:30, Raum G10-110 Teacher: Eike Schallehn Consultation

Distributed Data Management

Introduction

Classification of Multi-Processor DBMS

Comparison

Parallel Distributed FederatedDBS DBS DBS

High TXN rates ↑ →↗ →Intra-TXN-Parallelism ↑ →↗ ↘→Scalability ↗ →↗ →Availability ↗ ↗ ↘Geogr. Distribution ↘ ↗ ↗Node Autonomy ↘ → ↗DBS-Heterogeneity ↘ ↘ ↗Administration → ↘ ↘↓