CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 1: Introduction


CS 245

CS 347 Notes 1 2

Centralized Database
  Application
  Front end (SQL)
  Query processor
  Transaction processor
  File access
  [Figure: a single processor (P) with memory (M)]
Simplifications
  Single front end
  One place to keep locks
  If processor fails then system fails

CS 347
  Multiple processors (and memories)
  Heterogeneous and autonomous components

Multiple Processors
  Opportunities for parallelism
  Opportunities for reliability
  Challenges with synchronization
  Illustration: Two Generals’ Problem

One General’s Problem
  [Figure: one General commands an Army facing the Enemy]
Two Generals’ Problem
  [Figure: Army A (General A) and Army B (General B) on either side of the Enemy; the generals communicate through messengers]

Two Generals’ Problem
  Rules
    A and B must attack at the same time
    A and B synchronize through messengers
    Messengers can get lost

How Many Messages Do We Need?
  General A → General B: "Attack at 9am!"
  General B → General A: "Ack, attacking at 9am!"
  General A → General B: "Got your ack!"

How Many Messages Do We Need?
  Theorem
    No protocol with a finite number of messages can solve the two generals’ problem as stated
  Proof idea
    Assume there is such a protocol
    → It has a successful run (some sequence of delivered and lost messages)
    → In that run A and B attack at the same time
    Now suppose the last delivered message of that run is lost instead
    → Its sender sees exactly what it saw before, so it still attacks
    → The receiver never got that message, so it may not attack, and the sender would attack alone
    → The protocol is incorrect

Alternatives?
  Need to relax rules

Probabilistic Approach
  General A → General B: "Attack at 9am!" (sent four times in the figure)
  Send as many messages as possible, hope at least one gets through

Eventual Commit
  General A → General B: "Attack ASAP!" (retransmit, retransmit, ...)
  General B → General A: "On my way!"
  Eventually both armies attack: A attacks right away and keeps resending until it receives confirmation from B
  One message per time unit
  Each message delivered with probability p
  What is the probability that B commits by time t?

Eventual Commit
  Probability c(t) that B commits by time t:
    $c(1) = p$
    $c(2) = p + (1-p)\,p$
    $c(3) = p + (1-p)\,p + (1-p)^2\,p$
    $c(4) = p + (1-p)\,p + (1-p)^2\,p + (1-p)^3\,p$
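The terms of c(1) through c(4) form a finite geometric series, so they can be collapsed into a closed form. The derivation below is a sketch that assumes, as the slides do, that each message is delivered independently with the same probability p > 0.

```latex
\begin{align*}
c(t) &= \sum_{k=1}^{t} (1-p)^{k-1}\, p \\
     &= p \cdot \frac{1-(1-p)^{t}}{1-(1-p)} \\
     &= 1-(1-p)^{t}
\end{align*}
```

Read the other way round: B has still not committed at time t only if all t messages were lost, which happens with probability $(1-p)^t$.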

Eventual Commit
  c(t), the probability that B commits by time t, for four delivery probabilities p (data labels from the chart)

    t    p = 0.3   p = 0.5   p = 0.7   p = 0.9
    1    0.30      0.50      0.70      0.90
    2    0.51      0.75      0.91      0.99
    3    0.66      0.88      0.97      1.00
    4    0.76      0.94      0.99      1.00
    5    0.83      0.97      1.00      1.00
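The plotted values can be recomputed directly from the term-by-term definition of c(t). The snippet below is a minimal sketch; the function name `commit_probability` is an illustrative choice, not something from the course material, and it assumes independent deliveries with a fixed probability p.

```python
def commit_probability(p: float, t: int) -> float:
    """Probability that B has committed by time t.

    Summed term by term: the k-th message is the first to arrive with
    probability (1 - p)^(k - 1) * p.
    """
    return sum((1 - p) ** (k - 1) * p for k in range(1, t + 1))


if __name__ == "__main__":
    probabilities = [0.3, 0.5, 0.7, 0.9]
    print("t  " + "  ".join(f"p={p:.1f}" for p in probabilities))
    for t in range(1, 6):
        row = "  ".join(f"{commit_probability(p, t):5.2f}" for p in probabilities)
        print(f"{t}  {row}")
```

Running it should reproduce the chart's data labels up to rounding, which is a quick sanity check on the formula.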

Eventual Commit
  How expensive is the protocol?
  E(p) = expected number of messages (function of p)
  Exercise: compute E(p)
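The slides do not pin down every detail of the message model behind E(p), so the Monte Carlo sketch below fixes one possible model purely as an assumption: A sends one "Attack ASAP!" per time unit, every copy that reaches B triggers one "On my way!" reply, messages in both directions are delivered independently with probability p > 0, A stops once a reply arrives, and messages from both generals are counted. All function names are illustrative. Under a different model (for example, counting only A's messages) the answer changes.

```python
import random


def simulate_eventual_commit(p: float, rng: random.Random) -> int:
    """Run the assumed protocol once and return the number of messages sent."""
    messages = 0
    while True:
        messages += 1          # A sends (or retransmits) "Attack ASAP!"
        if rng.random() >= p:
            continue           # lost on the way to B, A retransmits
        messages += 1          # B replies "On my way!"
        if rng.random() < p:
            return messages    # reply reached A, so A stops resending


def estimate_expected_messages(p: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E(p) under the assumptions stated above."""
    rng = random.Random(seed)
    return sum(simulate_eventual_commit(p, rng) for _ in range(trials)) / trials


def closed_form(p: float) -> float:
    """(1 + p) / p^2 for this model only: A sends 1/p^2 messages on average, B sends 1/p."""
    return (1 + p) / p ** 2
```

Comparing `estimate_expected_messages(p)` with `closed_form(p)` for a few values of p is one way to check a hand-derived answer against the assumed model.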

Two-Phase Eventual Commit
  Phase 1
    General A → General B: "Ready to attack?"
    General B → General A: "Yes!"
  Phase 2
    General A → General B: "Attack ASAP!"
    General B → General A: "On my way!"
  Lost messages are retransmitted (the figure marks two retransmissions)
  Eventually both armies attack: A attacks in phase 2 and keeps resending until it receives confirmation from B

Commit Protocols
  We will study commit (aka consensus) protocols in detail

CS 347
  Multiple processors (and memories) ✔
  Heterogeneous and autonomous components

Heterogeneity
  [Figure: an application that selects investments draws on several kinds of sources]
    RDBMS: portfolio
    Files: history of dividends, ratios, …
    Stock ticker tape

Autonomy
  Example: Unable to get statistics for query optimization
  Example: Components may not agree to perform computation

CS 347
  We study data management with multiple processors and possible heterogeneity and autonomy, impacting:
    Data organization
    Query processing
    Access structures
    Concurrency control
    Recovery

CS 347
  Huge practical importance:
    Massive data sets, managed by many computers
      E.g., how to crawl and search the Web?
    Need for integration of data from many sources
      E.g., how to do comparison shopping?
    Need for collaborative computing
      E.g., collecting and analyzing data in sensor networks
      E.g., massively multiplayer online games (World of Warcraft)

Logistics
  Lectures
    Mondays and Wednesdays 1:30pm – 2:50pm / Huang 018
  Instructor
    Zoltan Fern
    Office hours: Mondays 3pm – 4pm / Gates 459
  Course assistants
    Neeral Dodhia
    Lili Yang
    Office hours: Wednesdays TBD

Logistics
  Website
    https://cs347.stanford.edu
  Videos
    https://mvideox.stanford.edu/course/707
  Forum
    https://piazza.com/stanford/spring2016/cs347

Logistics
  Assignments (20%)
    4-5 automated quizzes (out Wednesday, due next Friday)
    1-2 homeworks (read paper and answer questions)
  Midterm exam (35%)
    In class
    1:30pm on Wednesday, April 27 (tentative)
  Final exam (45%)
    12:15pm on Friday, June 3

Syllabus
  Part 1: Concepts
    Data fragmentation
    Query processing
    Query optimization
    Concurrency control, failures
    Reliable data management
    Replication
    Network partitions
    Time and clocks

Syllabus
  Part 2: Systems
    Peer-to-peer systems
    Publish-subscribe systems
    MapReduce
    Distributed information retrieval
    Other popular systems
      Distributed data stores (Dynamo, Memcached, TAO)
      Non-relational databases (BigTable, Cassandra, HBase, Spanner)
      Massively parallel processing on MapReduce (Hive, Impala, Pig)
      Cluster computing (Spark)
      Event processing (Storm)
      Graph processing (Pregel)

Prerequisites
  Database basics (CS 145)
    SQL, indexing, transactions
  Database implementation (CS 245)
    Query plans, cost estimation, join algorithms, recovery, logging
  Distributed computing basics
    Interconnection networks, LAN vs. WAN

Introductory Topics
  Database architectures
  Client-server systems
  Parallel vs. distributed databases
  Cloud computing

Database Architectures
  [1] Shared memory
    [Figure: several processors (P) attached to one shared memory (M)]
  [2] Shared disk
    [Figure: several processors (P), each with its own memory (M), sharing the same disks]
  [2’] Shared data storage
    [Figure: several processors (P), each with its own memory (M), on top of shared storage]
    Storage area network (SAN)
    Distributed file system (e.g., HDFS)
  [3] Shared nothing
    [Figure: several processors (P), each with its own memory (M) and its own disks]

Database Architectures
  [4] Hybrid: example A
    [Figure: two groups, each with several processors (P) attached to one memory (M)]
  [4] Hybrid: example B
    [Figure: processor/memory (P, M) nodes on LAN 1 and LAN 2, connected over a WAN through nodes labeled R]
  [4] Hybrid: tandem
    E.g., Microsoft SQL Server Parallel Data Warehouse
    [Figure: four processor/memory (P, M) nodes]

Database Architectures
  [5] Unusual: Datacycle (broadcast disk)
    [Figure: two processor/memory (P, M) nodes]
  [5] Unusual: per-disk processor
    E.g., Oracle Exadata Database Machine
    [Figure: a processor (P) with memory (M), plus per-disk units (P’)]
    P’: small processors, some (tiny) memory
  [5] Unusual: sensor networks
    [Figure: nodes, each with a processor (P), memory (M), battery (B), and sensor, reporting to a data collection node]

Database Architectures
  Issues for selecting architecture
    Performance
    Cost
    Reliability
    Scalability
    Geographic distribution of data
    Data patterns/clusters

Introductory Topics
  Database architectures ✔
  Client-server systems
  Parallel vs. distributed databases
  Cloud computing

Client-Server Systems
  [Figure: the software stack (Application, Front end, Query processor, Transaction processor, File access) split between Client and Server at a different layer in each variant]
  Database server
    Client sends (complex) script
  Transaction server
    Client sends (simple) transaction
  Data server
    Client sends (basic) record requests

Client-Server Systems
  Issues for selecting software partitioning
    Object granularity (~ data properties)
    Location of locks (~ operation properties)
    Location of caches (~ workload properties)

Client-Server Systems
  Basic tradeoff
    Offloading work to clients vs. data transmitted
  [Figure: two client (C) / server (S) pairs, labeled "Retrieve file blocks" and "Reserve hotel room"]

Introductory Topics
  Database architectures ✔
  Client-server systems ✔
  Parallel vs. distributed databases
  Cloud computing

Parallel vs. Distributed
  More similarities than differences!

Parallel Databases
  Typical system properties
    Fast interconnect
    Homogeneous software (and even hardware)
  Typical goals
    High performance
    Transparency

Distributed Databases
  Typical system properties
    Geographical distribution
    Heterogeneity and autonomy
  Typical goals
    Data sharing
    Disconnected operation capabilities

Cloud Computing
  “The fight to dominate cloud computing will increase competition and innovation”
    (Battle of the clouds, October 2009)
  “Tech giants are waging a price war to win other firms’ computing business”
    (Silver lining, August 2014)
  “As cloud-computing prices keep falling, the whole IT business will change”
    (The cheap, convenient cloud, April 2015)

Cloud Computing
  What is it?
    On-demand access to shared computing resources
    Infrastructure/platform/software as a service
    Hardware virtualization
    Business model (focus on business, not infra, differentiators)

Cloud Computing
  How does it relate to?
    Client-server model
    Software as a service
    Grid computing
    Distributed computing
    Massively parallel computing
    Peer-to-peer computing
    Utility computing

Next
  Describing distributed data
  Query processing in parallel databases
  Query processing in distributed databases

Query Processing in Parallel DBs
  Typically, we can distribute/partition/sort/etc. data to make DB operations (e.g., joins) fast
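One common way to exploit that freedom is to hash-partition both join inputs on the join key, so that matching rows land on the same node and each node can join its fragments locally. The sketch below illustrates the idea in a single process. The "nodes" are plain lists, the function names are hypothetical, and this is not presented as the algorithm used in the course.

```python
def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its join key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions


def local_hash_join(left, right, key):
    """Join two co-located fragments using an in-memory hash table on the left input."""
    table = {}
    for row in left:
        table.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


def partitioned_join(left, right, key, num_nodes=4):
    """Partition both inputs on the join key, then join each pair of fragments.

    A parallel DBMS would run the per-node joins concurrently; here the
    nodes are simulated by a sequential loop.
    """
    left_parts = hash_partition(left, key, num_nodes)
    right_parts = hash_partition(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return result
```

Because rows with equal keys always hash to the same node, no node needs to see another node's fragment, which is what makes the per-node joins independent.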

Query Processing in Distributed DBs
  Typically, data distribution is given; we need to find a query processing strategy that minimizes cost (e.g., communication cost)
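As a toy illustration of choosing a strategy by communication cost, the sketch below picks the site at which to evaluate a join so that the fewest bytes have to be shipped. It rests on simplifying assumptions that are not from the slides: whole relations are shipped, cost is just bytes moved, and the cost of returning the result is ignored. The function name and input layout are hypothetical.

```python
def plan_join_site(relations):
    """Pick the site where a join should run to minimize bytes shipped.

    `relations` maps a relation name to (home_site, size_in_bytes).
    Every relation not already stored at the chosen site must be shipped there.
    Returns (best_site, bytes_shipped); ties go to the alphabetically first site.
    """
    sites = {home for home, _ in relations.values()}
    costs = {
        site: sum(size for home, size in relations.values() if home != site)
        for site in sites
    }
    best = min(sorted(costs), key=costs.get)
    return best, costs[best]


if __name__ == "__main__":
    placement = {"R": ("site1", 1_000_000), "S": ("site2", 20_000)}
    print(plan_join_site(placement))
```

With the sample placement, shipping the small relation S to the large relation R's site is cheaper than the reverse, so the join is planned at `site1`.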
