CS 347
Parallel and Distributed Data Processing
Spring 2016
Notes 1: Introduction
CS 245
CS 347 Notes 1 2
Centralized Database
[Figure: a single processor (P) with memory (M)]
  Application
  Front end (SQL)
  Query processor
  Transaction processor
  File access
Simplifications
  Single front end
  One place to keep locks
  If processor fails then system fails
CS 347
  Multiple processors (and memories)
  Heterogeneous and autonomous components
Multiple Processors
  Opportunities for parallelism
  Opportunities for reliability
  Challenges with synchronization
Illustration: Two Generals’ Problem
One General’s Problem
[Figure: a single general commands one army facing the enemy]
Two Generals’ Problem
[Figure: Army A (General A) and Army B (General B) on either side of the enemy, exchanging messengers]
Rules
  A and B must attack at the same time
  A and B synchronize through messengers
  Messengers can get lost
How Many Messages Do We Need?
  General A to General B: Attack at 9am!
  General B to General A: Ack, attacking at 9am!
  General A to General B: Got your ack!
Theorem
  No protocol with a finite number of messages can solve the stated two generals’ problem
Proof idea
  Assume there is a protocol
  → There is a solution (some sequence of delivered/lost messages)
  → A and B attack at the same time
  Now assume the last delivered message in that solution is lost instead
  → The sender cannot tell the difference, so it still attacks, while the receiver may not
  → Sender would attack alone
  → Protocol is incorrect
Alternatives?
  Need to relax rules
Probabilistic Approach
[Figure: General A sends "Attack at 9am!" to General B four times]
Send as many messages as possible and hope at least one gets through
Eventual Commit
  General A to General B: Attack ASAP! (retransmit) (retransmit)
  General B to General A: On my way!
Eventually both armies attack: A attacks right away and keeps resending until it receives confirmation from B
One message per time unit
Each message delivered with probability p
What is the probability that B commits by time t?
Let c(t) be the probability that B commits by time t:
  c(1) = p
  c(2) = p + (1 – p) p
  c(3) = p + (1 – p) p + (1 – p)^2 p
  c(4) = p + (1 – p) p + (1 – p)^2 p + (1 – p)^3 p
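The sums above are partial geometric series, so they collapse to c(t) = 1 – (1 – p)^t: B has not committed by time t only if all t messages were lost. The following Python sketch checks the term-by-term sum against that closed form, assuming independent losses as stated above. The function names are illustrative and do not come from the notes.

```python
def commit_prob_sum(p: float, t: int) -> float:
    """Sum the series term by term: p + (1 - p) p + ... + (1 - p)^(t - 1) p."""
    return sum((1 - p) ** k * p for k in range(t))


def commit_prob_closed(p: float, t: int) -> float:
    """Closed form: B fails to commit by time t only if all t messages are lost."""
    return 1 - (1 - p) ** t


if __name__ == "__main__":
    # Print c(1)..c(5) for a few delivery probabilities.
    for p in (0.3, 0.5, 0.7, 0.9):
        row = [f"{commit_prob_closed(p, t):.2f}" for t in range(1, 6)]
        print(f"p = {p}: " + "  ".join(row))
```

Even with p = 0.3, five retransmissions already push the commit probability above 0.8.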
Eventual Commit
[Chart: c(t) versus t, one series per value of p]
  p     t=1   t=2   t=3   t=4   t=5
  0.3   0.30  0.51  0.66  0.76  0.83
  0.5   0.50  0.75  0.88  0.94  0.97
  0.7   0.70  0.91  0.97  0.99  1.00
  0.9   0.90  0.99  1.00  1.00  1.00
How expensive is the protocol?
  E(p) = expected number of messages (function of p)
Exercise: compute E(p)
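One way to sanity-check an answer to this exercise is simulation. The sketch below estimates the expected message count by Monte Carlo under assumptions the notes do not spell out: B replies once to every message it receives, replies are lost with the same probability as requests, every message sent by either general is counted, and p > 0. The function names are hypothetical.

```python
import random


def simulate_messages(p: float, rng: random.Random) -> int:
    """Count all messages sent in one run of the eventual-commit protocol.

    Assumptions (not from the notes): B replies to every message it
    receives, and replies are delivered with the same probability p.
    Requires p > 0, otherwise the loop never ends.
    """
    sent = 0
    while True:
        sent += 1                  # A sends (or retransmits) "Attack ASAP!"
        if rng.random() < p:       # A's message reaches B
            sent += 1              # B replies "On my way!"
            if rng.random() < p:   # B's reply reaches A, so A stops
                return sent


def estimate_expected_messages(p: float, trials: int = 20_000, seed: int = 0) -> float:
    """Average the message count over many independent runs."""
    rng = random.Random(seed)
    return sum(simulate_messages(p, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7, 0.9):
        print(f"p = {p}: about {estimate_expected_messages(p):.2f} messages")
```

Changing the assumptions (for example, counting only A's messages) changes the estimate, so the simulation should match whatever model the analytical answer uses.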
Two-Phase Eventual Commit
Phase 1
  General A to General B: Ready to attack?
  General B to General A: Yes!
Phase 2
  General A to General B: Attack ASAP! (retransmit) (retransmit)
  General B to General A: On my way!
Eventually both armies attack: A attacks in phase 2 and keeps resending until it receives confirmation from B
Commit Protocols
  We will study commit (aka consensus) protocols in detail
CS 347
  Multiple processors (and memories) ✔
  Heterogeneous and autonomous components
Heterogeneity
[Figure: an application that selects investments draws on heterogeneous sources: an RDBMS (portfolio), files (history of dividends, ratios, …), and a stock ticker tape]
Autonomy
  Example: Unable to get statistics for query optimization
  Example: Components may not agree to perform computation
CS 347
We study data management with multiple processors and possible heterogeneity and autonomy, impacting:
  Data organization
  Query processing
  Access structures
  Concurrency control
  Recovery
CS 347
Huge practical importance:
Massive data sets, managed by many computers
  E.g., how to crawl and search the Web?
Need for integration of data from many sources
  E.g., how to do comparison shopping?
Need for collaborative computing
  E.g., collecting and analyzing data in sensor networks
  E.g., massively multiplayer online games (World of Warcraft)
Logistics
Lectures
  Mondays and Wednesdays 1:30pm – 2:50pm / Huang 018
Instructor
  Zoltan Fern
  Office hours: Mondays 3pm – 4pm / Gates 459
Course assistants
  Neeral Dodhia
  Lili Yang
  Office hours: Wednesdays TBD
Logistics
Website
  https://cs347.stanford.edu
Videos
  https://mvideox.stanford.edu/course/707
Forum
  https://piazza.com/stanford/spring2016/cs347
Logistics
Assignments (20%)
  4-5 automated quizzes (out Wednesday, due next Friday)
  1-2 homeworks (read paper and answer questions)
Midterm exam (35%)
  In class
  1:30pm on Wednesday, April 27 (tentative)
Final exam (45%)
  12:15pm on Friday, June 3
Syllabus
Part 1: Concepts
  Data fragmentation
  Query processing
  Query optimization
  Concurrency control, failures
  Reliable data management
  Replication
  Network partitions
  Time and clocks
Syllabus
Part 2: Systems
  Peer-to-peer systems
  Publish-subscribe systems
  MapReduce
  Distributed information retrieval
  Other popular systems
    Distributed data stores (Dynamo, Memcached, TAO)
    Non-relational databases (BigTable, Cassandra, HBase, Spanner)
    Massively parallel processing on MapReduce (Hive, Impala, Pig)
    Cluster computing (Spark)
    Event processing (Storm)
    Graph processing (Pregel)
Prerequisites
Database basics (CS 145)
  SQL, indexing, transactions
Database implementation (CS 245)
  Query plans, cost estimation, join algorithms, recovery, logging
Distributed computing basics
  Interconnection networks, LAN vs. WAN
Introductory Topics
  Database architectures
  Client-server systems
  Parallel vs. distributed databases
  Cloud computing
Database Architectures
[1] Shared memory
[Figure: several processors (P) sharing one memory (M)]
Database Architectures
[2] Shared disk
[Figure: several processors (P), each with its own memory (M), sharing disks]
Database Architectures
[2’] Shared data storage
[Figure: several processors (P), each with its own memory (M), sharing data storage]
  Storage area network (SAN)
  Distributed file system (e.g., HDFS)
Database Architectures
[3] Shared nothing
[Figure: several nodes, each with its own processor (P) and memory (M)]
Database Architectures
[4] Hybrid: example A
[Figure: two groups of processors (P), each group sharing a memory (M)]
Database Architectures
[4] Hybrid: example B
[Figure: two LANs (LAN 1, LAN 2), each with several processor/memory (P, M) nodes and a node labeled R, connected over a WAN]
Database Architectures
[4] Hybrid: tandem
  E.g., Microsoft SQL Server Parallel Data Warehouse
[Figure: several processor/memory (P, M) nodes]
Database Architectures
[5] Unusual: Datacycle (broadcast disk)
[Figure: several processor/memory (P, M) nodes]
Database Architectures
[5] Unusual: per-disk processor
  E.g., Oracle Exadata Database Machine
[Figure: a processor (P) with memory (M), plus per-disk processors (P’)]
  P’: small processors with some (tiny) memory
Database Architectures
[5] Unusual: sensor networks
[Figure: sensor nodes, each with a processor (P), memory (M), sensor, and battery (B), reporting to a data collection node (P, M)]
Database Architectures
Issues for selecting architecture
  Performance
  Cost
  Reliability
  Scalability
  Geographic distribution of data
  Data patterns/clusters
Introductory Topics
  Database architectures ✔
  Client-server systems
  Parallel vs. distributed databases
  Cloud computing
Client-Server Systems
[Figure: the layers Application, Front end, Query processor, Transaction processor, and File access, divided between Client and Server]
Database server
  Client sends (complex) script
Transaction server
  Client sends (simple) transaction
Data server
  Client sends (basic) record requests
Client-Server Systems
Issues for selecting software partitioning
  Object granularity (~ data properties)
  Location of locks (~ operation properties)
  Location of caches (~ workload properties)
Client-Server Systems
Basic tradeoff
  Offloading work to clients vs. data transmitted
[Figure: two client (C) / server (S) configurations, labeled "Retrieve file blocks" and "Reserve hotel room"]
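A rough way to see this tradeoff is to count what crosses the network for the same query under two partitionings. The sketch below is a toy model, not a real client-server stack: the `Record` type, the sample data, and both functions are hypothetical. With a data server the client pulls every record and filters locally; with a database server the query runs on the server and only the matches are sent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Record:
    hotel: str
    city: str
    free_rooms: int


Predicate = Callable[[Record], bool]


def data_server_query(table: List[Record], pred: Predicate) -> Tuple[List[Record], int]:
    """Client requests basic records and filters them itself."""
    shipped = list(table)                      # every record crosses the network
    result = [r for r in shipped if pred(r)]   # filtering work done on the client
    return result, len(shipped)


def database_server_query(table: List[Record], pred: Predicate) -> Tuple[List[Record], int]:
    """Server evaluates the query and ships only the answer."""
    result = [r for r in table if pred(r)]     # filtering work done on the server
    return result, len(result)


# Hypothetical sample data.
TABLE = [
    Record("Alpha", "Paris", 0),
    Record("Beta", "Paris", 3),
    Record("Gamma", "Rome", 5),
    Record("Delta", "Oslo", 1),
]


def has_room_in_paris(r: Record) -> bool:
    return r.city == "Paris" and r.free_rooms > 0


if __name__ == "__main__":
    for name, run in [("data server", data_server_query),
                      ("database server", database_server_query)]:
        rows, shipped = run(TABLE, has_room_in_paris)
        print(f"{name}: {len(rows)} result(s), {shipped} record(s) transmitted")
```

Both partitionings return the same answer; they differ in how many records are transmitted and in which side does the filtering.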
Introductory Topics
  Database architectures ✔
  Client-server systems ✔
  Parallel vs. distributed databases
  Cloud computing
Parallel vs. Distributed
  More similarities than differences!
Parallel Databases
Typical system properties
  Fast interconnect
  Homogeneous software (and even hardware)
Typical goals
  High performance
  Transparency
Distributed Databases
Typical system properties
  Geographical distribution
  Heterogeneity and autonomy
Typical goals
  Data sharing
  Disconnected operation capabilities
Cloud Computing
“The fight to dominate cloud computing will increase competition and innovation”
  (Battle of the clouds, October 2009)
“Tech giants are waging a price war to win other firms’ computing business”
  (Silver lining, August 2014)
“As cloud-computing prices keep falling, the whole IT business will change”
  (The cheap, convenient cloud, April 2015)
Cloud Computing
What is it?
  On-demand access to shared computing resources
  Infrastructure/platform/software as a service
  Hardware virtualization
  Business model (focus on business, not infra, differentiators)
Cloud Computing
How does it relate to?
  Client-server model
  Software as a service
  Grid computing
  Distributed computing
  Massively parallel computing
  Peer-to-peer computing
  Utility computing
Next
  Describing distributed data
  Query processing in parallel databases
  Query processing in distributed databases
Query Processing in Parallel DBs
  Typically, we can distribute/partition/sort/etc. data to make DB operations (e.g., joins) fast
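As one illustration of partitioning data to speed up a join, the following sketch hash-partitions two relations on the join key so that each pair of partitions can be joined independently (and, on a real system, on a different processor). It is a single-process toy with relations as lists of tuples; all names are hypothetical and nothing here is taken from the notes.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Row = Tuple[Hashable, str]  # (join key, payload)


def partition(rows: List[Row], n: int) -> List[List[Row]]:
    """Assign each row to one of n partitions by hashing its join key."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts


def local_hash_join(r_part: List[Row], s_part: List[Row]) -> List[Tuple[Hashable, str, str]]:
    """Join one pair of partitions: build a hash table on R, probe it with S."""
    table: Dict[Hashable, List[str]] = defaultdict(list)
    for key, payload in r_part:
        table[key].append(payload)
    return [(key, r_payload, s_payload)
            for key, s_payload in s_part
            for r_payload in table.get(key, [])]


def partitioned_join(r: List[Row], s: List[Row], n: int = 4) -> List[Tuple[Hashable, str, str]]:
    """Equal keys hash to the same partition, so each pair joins independently."""
    out: List[Tuple[Hashable, str, str]] = []
    for r_part, s_part in zip(partition(r, n), partition(s, n)):
        out.extend(local_hash_join(r_part, s_part))  # would run on its own processor
    return out
```

Because both relations are partitioned with the same hash function, no tuple ever needs to be compared with a tuple from a different partition.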
Query Processing in Distributed DBs
  Typically, data distribution is given; we need to find a query processing strategy that minimizes cost (e.g., communication cost)
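When data placement is fixed, even a two-site join leaves a choice about what to ship. The sketch below compares two simple strategies under a deliberately crude cost model that counts only bytes sent. The site names, sizes, and function are hypothetical, and real optimizers weigh many more plans and cost factors.

```python
from typing import Dict, Tuple


def cheapest_shipping_plan(r_bytes: int, s_bytes: int, result_bytes: int,
                           query_site: str) -> Tuple[str, int]:
    """Pick the plan that transmits the fewest bytes.

    Assumed setup: R is stored at site "A", S at site "B", and the answer
    is needed at query_site. The join runs wherever both inputs end up.
    """
    plans: Dict[str, int] = {
        # Move R to B, join there, then return the result if the user is at A.
        "ship R to B": r_bytes + (result_bytes if query_site == "A" else 0),
        # Move S to A, join there, then return the result if the user is at B.
        "ship S to A": s_bytes + (result_bytes if query_site == "B" else 0),
    }
    best = min(plans, key=plans.get)
    return best, plans[best]


if __name__ == "__main__":
    print(cheapest_shipping_plan(r_bytes=10_000, s_bytes=500_000,
                                 result_bytes=2_000, query_site="B"))
```

In this model the smaller relation is usually the one to move, unless the join result is large and would have to travel back to the query site.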