Parallel and distributed databases

R & G Chapter 22

What is a distributed database?

Why distribute a database

Scalability and performance

Resilience to failures

[Figure: throughput versus data size]

Why distribute a database

Data is already distributed, or needs to be distributed

Data is in multiple systems

Why not distribute a database

You must earn your complexity!

Communication needed: must build a complex infrastructure; unpredictable latencies must be masked

More types of failures: more components to fail; network failures; congestion, timeouts

More complex planning: communication cost plus I/O cost

May have to deal with heterogeneity: different types of systems; different schemas, possibly incompatible; different administrative domains

Types of distributed databases

The old days: mainframes

Definitely not distributed!

Client-server

User interaction

Data processing

Network

Parallel database

Primary/secondary


Multidatabase

How do they work?

What is shared? How to distribute the data? How to process the data? How to update the data?

What is shared?

Memory

CPUs RAM Disk

Most modern DBMSs

What is shared?

Disk

RAM

Oracle RAC

What is shared?

Nothing

RAM

Search engines, Teradata

How to distribute the data?

Server 1  Server 2  Server 3  Server 4

Bike      $86     6/2/07    636353
Chair     $10     6/5/07    662113
Couch     $570    6/1/07    424252
Car       $1123   6/1/07    256623
Lamp      $19     6/7/07    121113
Bike      $56     6/9/07    887734
Scooter   $18     6/11/07   252111
Hammer    $8000   6/11/07   116458


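Hash partitioning: Hash() is applied to the key of each (key,value) pair, and the result picks the server

Range partitioning: the key of each (key,value) pair is compared with a split point X; keys <= X go to one server, keys > X to another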
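A minimal sketch of the two placement rules, for illustration only. The function names, the use of CRC32 as the hash, and the split points in the usage note are assumptions, not something the slides specify.

```python
import bisect
import zlib

N_SERVERS = 4  # assumption: four servers, as in the slide figures


def hash_partition(key: str, n_servers: int = N_SERVERS) -> int:
    """Return a server index computed from a stable hash of the key."""
    # zlib.crc32 gives the same value on every run, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % n_servers


def range_partition(key, split_points) -> int:
    """Return a server index by comparing the key with sorted split points.

    Keys <= split_points[0] go to server 0, keys in
    (split_points[0], split_points[1]] go to server 1, and so on;
    keys greater than the last split point go to the last server.
    """
    return bisect.bisect_left(split_points, key)
```

With hypothetical split points [20, 100, 1000] on price, the $19 Lamp lands on server index 0 and the $8000 Hammer on index 3. Hash partitioning spreads keys evenly but scatters neighbouring keys; range partitioning keeps neighbouring keys together, which helps range queries but can skew load.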

[Figure: the item rows above distributed across Server 1, Server 2, Server 3, and Server 4]

Query processing

Intra-operator parallelism

Inter-operator parallelism

Parallel scanning

[Figure: six filter operators run in parallel and feed a single Result]
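Parallel scanning can be sketched as one filter per partition whose outputs are merged into a single result. This is an illustration under assumptions: threads stand in for separate servers, and the names `scan_partition` and `parallel_scan` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """Filter one partition; in a real system this runs on the server holding it."""
    return [row for row in rows if predicate(row)]


def parallel_scan(partitions, predicate):
    """Run the same filter over every partition at once and merge the results."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in partition order, so the merge is deterministic.
        partial = pool.map(lambda rows: scan_partition(rows, predicate), partitions)
        return [row for part in partial for row in part]
```

Each partition is scanned independently, so the work needs no coordination until the final merge.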

Sorting

Parallel hash join

Hash()
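A rough sketch of a parallel hash join, assuming both inputs are repartitioned with the same hash of the join key so that matching rows meet on the same worker. Threads stand in for servers, rows are tuples, and all names here are hypothetical.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, key_index, n):
    """Send each row to bucket Hash(join key) mod n."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        h = zlib.crc32(str(row[key_index]).encode("utf-8")) % n
        buckets[h].append(row)
    return buckets


def local_hash_join(left, right, left_key, right_key):
    """Classic build-and-probe hash join on one worker's buckets."""
    table = defaultdict(list)
    for l_row in left:                      # build phase
        table[l_row[left_key]].append(l_row)
    return [l_row + r_row                   # probe phase
            for r_row in right
            for l_row in table.get(r_row[right_key], [])]


def parallel_hash_join(left, right, left_key, right_key, n=4):
    """Repartition both inputs, then join each pair of buckets in parallel."""
    left_parts = partition_by_key(left, left_key, n)
    right_parts = partition_by_key(right, right_key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(
            lambda pair: local_hash_join(pair[0], pair[1], left_key, right_key),
            zip(left_parts, right_parts),
        )
        return [row for part in results for row in part]
```

Because equal keys always hash to the same bucket, no worker needs to look at another worker's data during the join.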

Semi-join
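One common way to use a semi-join between two sites is sketched below: ship only the join keys to the remote site, let it return just the rows that match, then finish the join locally. The function names and the in-process "shipping" are assumptions made for illustration.

```python
def keys_to_ship(local_rows, key_index):
    """Step 1: project the local relation onto its join column."""
    return {row[key_index] for row in local_rows}


def reduce_at_remote(remote_rows, key_index, shipped_keys):
    """Step 2 (at the remote site): keep only rows that will find a partner."""
    return [row for row in remote_rows if row[key_index] in shipped_keys]


def semi_join_then_join(local_rows, local_key, remote_rows, remote_key):
    """Step 3: join the reduced remote rows with the local rows.

    Returns the joined rows and how many remote rows were shipped back.
    """
    shipped = keys_to_ship(local_rows, local_key)
    reduced = reduce_at_remote(remote_rows, remote_key, shipped)
    by_key = {}
    for row in local_rows:
        by_key.setdefault(row[local_key], []).append(row)
    joined = [l_row + r_row for r_row in reduced for l_row in by_key[r_row[remote_key]]]
    return joined, len(reduced)
```

The trade is extra computation for less communication: the approach pays off when few remote rows match, since only the key set and the matching rows cross the network.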

Inter-operator parallelism

Updating distributed data

Synchronous: read-any-write-all

Reads are fast
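A toy model of read-any-write-all, assuming in-memory dictionaries as replicas and an `up` list to simulate unreachable copies. The class and attribute names are hypothetical.

```python
import random


class ReadAnyWriteAll:
    """Every write goes to all replicas; a read may use any single replica."""

    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.up = [True] * n_replicas  # simulated reachability of each copy

    def write(self, key, value):
        # A write must reach every copy, so one unreachable replica blocks it.
        if not all(self.up):
            raise RuntimeError("write-all failed: a replica is unreachable")
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Any reachable copy is up to date, so one local lookup is enough.
        live = [i for i, ok in enumerate(self.up) if ok]
        if not live:
            raise RuntimeError("no replica reachable")
        return self.replicas[random.choice(live)][key]
```

Reads stay cheap and available as long as one copy is reachable; the cost is that writes are as slow and as fragile as the slowest or least available replica.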


Synchronous: voting



Writes tolerant to disconnection
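A small sketch of voting with version numbers, assuming N replicas, a read quorum R and a write quorum W chosen so that R + W > N and 2W > N. The class name, the `up` list, and the choice of which reachable replicas to contact are simplifications for illustration.

```python
class VotingStore:
    """Quorum reads and writes over versioned in-memory replicas."""

    def __init__(self, n, read_quorum, write_quorum):
        # R + W > N makes every read overlap the latest write;
        # 2W > N makes any two writes overlap each other.
        if read_quorum + write_quorum <= n or 2 * write_quorum <= n:
            raise ValueError("quorums must overlap")
        self.replicas = [{} for _ in range(n)]  # key -> (version, value)
        self.up = [True] * n                    # simulated reachability
        self.r, self.w = read_quorum, write_quorum

    def _reachable(self, needed):
        live = [i for i, ok in enumerate(self.up) if ok]
        if len(live) < needed:
            raise RuntimeError("quorum not reachable")
        return live[:needed]

    def write(self, key, value):
        sites = self._reachable(self.w)
        newest = max(self.replicas[i].get(key, (0, None))[0] for i in sites)
        for i in sites:
            self.replicas[i][key] = (newest + 1, value)

    def read(self, key):
        sites = self._reachable(self.r)
        copies = [self.replicas[i].get(key, (0, None)) for i in sites]
        return max(copies, key=lambda copy: copy[0])[1]  # highest version wins
```

With N = 3 and R = W = 2, a write still succeeds when one replica is disconnected, and a later read sees it because any two replicas include at least one that took the write.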

Consistency of distributed data

Should provide ACID

Primary/secondary

Two-phase commit

PREPARE -> PREPARED, PREPARED -> COMMIT

Two-phase commit

PREPARE -> PREPARED, ABORT -> ABORT

Two-phase commit

PREPARE -> PREPARED -> ABORT

Two-phase commit

PREPARE -> PREPARED, PREPARED -> X
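The message exchanges above can be sketched from the coordinator's side as follows. This is a simplified model: the `Participant` class, its `responds` flag, and the use of `TimeoutError` for a missing reply are assumptions, and a crash of the coordinator itself is not modeled.

```python
class Participant:
    """Hypothetical participant that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, vote="PREPARED", responds=True):
        self.vote = vote          # "PREPARED" or "ABORT"
        self.responds = responds  # False simulates a lost or late reply
        self.state = "ACTIVE"

    def prepare(self):
        if not self.responds:
            raise TimeoutError("no reply to PREPARE")
        self.state = self.vote
        return self.vote

    def finish(self, decision):
        self.state = decision


def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: broadcast COMMIT only if all voted PREPARED."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except TimeoutError:
            votes.append("ABORT")  # a missing vote is treated as a no
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    for p in participants:
        p.finish(decision)
    return decision
```

Two PREPARED votes give COMMIT; one ABORT vote or one missing reply gives ABORT for everyone.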

Conclusion

Parallelism and distribution very useful: performance, fault tolerance, scale

But complex! Rethink lots of aspects of the system. Must earn the complexity
