parallel and distributed databases r & g chapter 22
Post on 21-Dec-2015
Parallel and distributed databases
R & G Chapter 22
What is a distributed database?
Why distribute a database
Scalability and performance
Resilience to failures
[Figure: throughput versus data size]
Why distribute a database
Data is already distributed, or needs to be distributed
Data is in multiple systems
Why not distribute a database
You must earn your complexity!
Communication needed: must build a complex infrastructure; unpredictable latencies must be masked
More types of failures: more components to fail; network failures, congestion, timeouts
More complex planning: communication cost plus I/O cost
May have to deal with heterogeneity: different types of systems; different schemas, possibly incompatible; different administrative domains
Types of distributed databases
The old days: mainframes
Definitely not distributed!
Client-server
User interaction
Data processing
Network
Parallel database
Primary/secondary
X
Multidatabase
How do they work?
What is shared? How to distribute the data? How to process the data? How to update the data?
What is shared?
Memory
CPUs RAM Disk
Most modern DBMSs
What is shared?
Disk
RAM
Oracle RAC
What is shared?
Nothing
RAM
Search engines, Teradata
How to distribute the data?
Server 1 Server 2 Server 3 Server 4
Bike      $86     6/2/07    636353
Chair     $10     6/5/07    662113
Couch     $570    6/1/07    424252
Car       $1123   6/1/07    256623
Lamp      $19     6/7/07    121113
Bike      $56     6/9/07    887734
Scooter   $18     6/11/07   252111
Hammer    $8000   6/11/07   116458
How to distribute the data?
Hash partitioning Range partitioning
(key,value)
Hash()
(key,value)
<= X > X
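A minimal sketch of the two schemes on this slide, to make the routing rule concrete. The function names, the use of `zlib.crc32` as a stand-in for the slide's `Hash()`, and the split points are all assumptions made for illustration.

```python
import bisect
import zlib

NUM_SERVERS = 4  # matches the four servers shown in the slides


def hash_partition(key, num_servers=NUM_SERVERS):
    """Pick a server by hashing the key (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def range_partition(key, split_points):
    """Pick a server by comparing the key against sorted split points.

    Keys <= split_points[0] go to server 0, keys in
    (split_points[0], split_points[1]] go to server 1, and so on.
    """
    return bisect.bisect_left(split_points, key)


# Hypothetical split points on price: <= 20, <= 100, <= 1000, > 1000.
SPLITS = [20, 100, 1000]
prices = [86, 10, 570, 1123, 19, 56, 18, 8000]
placement = [range_partition(p, SPLITS) for p in prices]
```

Hash partitioning spreads keys evenly but scatters neighbouring keys, while range partitioning keeps neighbouring keys on one server, which helps range queries but can skew the load.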
How to distribute the data?
Server 1   Server 2   Server 3   Server 4
Bike       $86        6/2/07     636353
Chair      $10        6/5/07     662113
Couch      $570       6/1/07     424252
Car        $1123      6/1/07     256623
Lamp       $19        6/7/07     121113
Bike       $56        6/9/07     887734
Scooter    $18        6/11/07    252111
Hammer     $8000      6/11/07    116458
Query processing
Intra-operator parallelism
Inter-operator parallelism
Parallel scanning
filter filter filter filter filter filter
Result
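Parallel scanning can be sketched as the same filter running on every partition, with the partial results merged at the end. The assignment of two rows per server and the function names below are assumptions; threads stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placement: each inner list stands for the rows on one server.
PARTITIONS = [
    [("Bike", 86), ("Chair", 10)],
    [("Couch", 570), ("Car", 1123)],
    [("Lamp", 19), ("Bike", 56)],
    [("Scooter", 18), ("Hammer", 8000)],
]


def scan_partition(rows, predicate):
    """Scan one partition and keep the rows that pass the filter."""
    return [row for row in rows if predicate(row)]


def parallel_scan(partitions, predicate):
    """Run the same filter on every partition, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(
            lambda rows: scan_partition(rows, predicate), partitions
        )
        return [row for part in partial_results for row in part]


# Items priced under $50.
cheap = parallel_scan(PARTITIONS, lambda row: row[1] < 50)
```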
Sorting
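One common way to sort in parallel, sketched here as an assumption about what the slide illustrates: range-partition the values, sort each partition independently, then concatenate the partitions in range order. The split points and the function name are hypothetical.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(values, split_points):
    """Range-partition values, sort each partition, concatenate in order."""
    # One bucket per range: len(split_points) + 1 buckets in total.
    buckets = [[] for _ in range(len(split_points) + 1)]
    for value in values:
        buckets[bisect.bisect_left(split_points, value)].append(value)

    # Each bucket could be sorted on a different server; threads stand in here.
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))

    # Every value in bucket i is <= every value in bucket i + 1,
    # so plain concatenation yields a fully sorted list.
    return [value for bucket in sorted_buckets for value in bucket]


prices = [86, 10, 570, 1123, 19, 56, 18, 8000]
result = parallel_sort(prices, split_points=[20, 100, 1000])
```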
Parallel hash join
Hash()
Join
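A sketch of the parallel hash join idea: hash both relations on the join key so matching rows land on the same server, then run an ordinary hash join on each server. The relations, function names, and the use of Python's built-in `hash` are assumptions for illustration.

```python
from collections import defaultdict


def partition_by_key(rows, key_index, num_servers):
    """Send each row to the server chosen by hashing its join key."""
    partitions = [[] for _ in range(num_servers)]
    for row in rows:
        partitions[hash(row[key_index]) % num_servers].append(row)
    return partitions


def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Build a hash table on the left input, then probe it with the right."""
    table = defaultdict(list)
    for left_row in left_rows:
        table[left_row[left_key]].append(left_row)
    return [
        left_row + right_row
        for right_row in right_rows
        for left_row in table.get(right_row[right_key], [])
    ]


def parallel_hash_join(left, right, left_key, right_key, num_servers=4):
    """Partition both inputs the same way, then join partition by partition."""
    left_parts = partition_by_key(left, left_key, num_servers)
    right_parts = partition_by_key(right, right_key, num_servers)
    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        # In a real system each of these joins runs on its own server.
        result.extend(local_hash_join(left_part, right_part, left_key, right_key))
    return result


# Made-up relations joined on their first column.
left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
joined = sorted(parallel_hash_join(left, right, 0, 0))
```

Because both inputs use the same hash function, no pair of matching rows can end up on different servers, so the per-server joins never need to talk to each other.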
Semi-join
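A sketch of the semi-join technique for joining relations stored at two sites: ship only the join-column values to the remote site, get back only the matching rows, and finish the join locally. The function names and the sample rows are hypothetical.

```python
def semi_join_reduce(remote_rows, remote_key, local_keys):
    """Runs at the remote site: keep only rows whose key appears locally."""
    return [row for row in remote_rows if row[remote_key] in local_keys]


def distributed_join(local_rows, local_key, remote_rows, remote_key):
    """Join two relations while shipping as few remote rows as possible."""
    # Step 1: project the join column; only these keys cross the network.
    local_keys = {row[local_key] for row in local_rows}
    # Step 2: the remote site ships back only the rows that can match.
    shipped = semi_join_reduce(remote_rows, remote_key, local_keys)
    # Step 3: finish the join locally against the reduced relation.
    joined = [
        local_row + remote_row
        for local_row in local_rows
        for remote_row in shipped
        if local_row[local_key] == remote_row[remote_key]
    ]
    return joined, len(shipped)


# Made-up relations: the remote site holds four rows, only two can match.
local = [(1, "a"), (2, "b")]
remote = [(2, "x"), (5, "y"), (1, "z"), (9, "w")]
joined, rows_shipped = distributed_join(local, 0, remote, 0)
```

The trade-off is an extra round trip for the keys in exchange for not shipping remote rows that would never join.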
Inter-operator parallelism
Updating distributed data
Synchronous: read-any-write-all
Reads are fast
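A toy simulation of read-any-write-all, assuming an in-memory dictionary per replica and a hypothetical `available` flag to model unreachable replicas. It shows why reads are fast (one replica suffices) and why a single unreachable replica blocks writes.

```python
class ReadAnyWriteAll:
    """Replicated key-value store: read one copy, write every copy."""

    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]
        self.available = [True] * num_replicas  # hypothetical failure flags

    def write(self, key, value):
        # A write must reach every replica, so one failure blocks it.
        if not all(self.available):
            raise RuntimeError("write blocked: a replica is unreachable")
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Any reachable replica is up to date, so the first one will do.
        for replica, up in zip(self.replicas, self.available):
            if up:
                return replica[key]
        raise RuntimeError("no replica reachable")


store = ReadAnyWriteAll(num_replicas=3)
store.write("bike", 86)
store.available[0] = False        # one replica becomes unreachable
value = store.read("bike")        # still served by another replica
```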
Updating distributed data
Synchronous: voting
Writes tolerant to disconnection
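A toy simulation of voting with version numbers, assuming `n` replicas, a write quorum `w` and a read quorum `r` chosen so that any two quorums overlap. The class, its parameters, and the `available` flags are hypothetical; the point is that a write succeeds while some replicas are disconnected, and a later read still finds the newest value.

```python
class QuorumStore:
    """Replicated key-value store with read/write quorums (toy simulation)."""

    def __init__(self, n, w, r):
        # Overlap rules: a read quorum must meet the latest write quorum,
        # and two write quorums must share a replica holding the latest version.
        if not (r + w > n and 2 * w > n):
            raise ValueError("need r + w > n and 2 * w > n")
        self.replicas = [{} for _ in range(n)]  # key -> (version, value)
        self.available = [True] * n             # hypothetical failure flags
        self.w, self.r = w, r

    def _reachable(self):
        return [rep for rep, up in zip(self.replicas, self.available) if up]

    def write(self, key, value):
        reachable = self._reachable()
        if len(reachable) < self.w:
            raise RuntimeError("write quorum not reachable")
        quorum = reachable[:self.w]
        # The new version must exceed every version seen in the quorum.
        version = 1 + max(rep.get(key, (0, None))[0] for rep in quorum)
        for rep in quorum:
            rep[key] = (version, value)

    def read(self, key):
        reachable = self._reachable()
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not reachable")
        quorum = reachable[-self.r:]
        # The copy with the highest version number wins the vote.
        copies = [rep.get(key, (0, None)) for rep in quorum]
        return max(copies, key=lambda copy: copy[0])[1]


store = QuorumStore(n=3, w=2, r=2)
store.write("bike", 86)           # lands on replicas 0 and 1
store.available[0] = False
store.write("bike", 56)           # replica 0 is down; replicas 1 and 2 accept
store.available[0] = True
store.available[2] = False
latest = store.read("bike")       # replica 0 is stale, replica 1 has the newest copy
```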
Consistency of distributed data
Should provide ACID
Primary/secondary
Two-phase commit
PREPARE
PREPARED PREPARED
COMMIT
Two-phase commit
PREPARE
PREPARED ABORT
ABORT
Two-phase commit
PREPARE
PREPARED
ABORT
Two-phase commit
PREPARE
PREPARED PREPARED
X
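The message sequences on these slides can be sketched as a small simulation. The `Participant` class and `two_phase_commit` function are hypothetical names, and a vote of `None` is an assumption used to model a participant that never replies. The coordinator commits only if every participant answers PREPARED; an ABORT vote or a missing reply leads to ABORT.

```python
PREPARED, ABORT, COMMIT = "PREPARED", "ABORT", "COMMIT"


class Participant:
    """One site taking part in the transaction (toy model)."""

    def __init__(self, name, vote):
        self.name = name
        self.vote = vote      # PREPARED, ABORT, or None to model no reply
        self.outcome = None   # decision received in phase 2

    def prepare(self):
        # Phase 1: answer the coordinator's PREPARE message.
        return self.vote

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.outcome = decision


def two_phase_commit(participants):
    """Coordinator logic: collect votes, then broadcast one decision."""
    # Phase 1: send PREPARE to everyone and gather the votes.
    votes = [p.prepare() for p in participants]
    # Commit only on a unanimous PREPARED; anything else means abort.
    decision = COMMIT if all(v == PREPARED for v in votes) else ABORT
    # Phase 2: tell every participant the outcome.
    for p in participants:
        p.finish(decision)
    return decision


all_yes = two_phase_commit([Participant("A", PREPARED), Participant("B", PREPARED)])
one_no = two_phase_commit([Participant("A", PREPARED), Participant("B", ABORT)])
no_reply = two_phase_commit([Participant("A", PREPARED), Participant("B", None)])
```

The sketch does not model the last slide, which appears to show the coordinator failing after both participants have answered PREPARED. In that situation the participants cannot decide on their own and must wait, which is the blocking weakness of two-phase commit.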
Conclusion
Parallelism and distribution are very useful: performance, fault tolerance, scale
But complex! Must rethink lots of aspects of the system, and must earn the complexity