Page 1

Data-Intensive Distributed Computing

The Final Part

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 431/631 451/651 (Winter 2019)

Adam Roegiest
Kira Systems

April 4, 2019

These slides are available at http://roegiest.com/bigdata-2019w/

Page 2

“Big ideas”

The datacenter is the computer!

Scale “out”, not “up”
Limits of SMP and large shared-memory machines

Move processing to the data
Clusters have limited bandwidth, code is a lot smaller

Process data sequentially, avoid random access
Seeks are expensive, disk throughput is good

Assume that components will break
Engineer software around hardware failures
Page 3

Source: NASA/JPL

Page 4

Humans will colonize Mars
Sooner than you think

Source: https://www.newscientist.com/article/dn23542-how-to-build-a-mars-colony-that-lasts-forever/

Page 5

Source: http://observer.com/2016/06/elon-musk-charts-path-to-colonizing-mars-within-a-decade/

Source: https://twitter.com/SpaceX/status/725351354537906176

Source: https://www.theguardian.com/science/2015/aug/27/buzz-aldrin-colonize-mars-within-25-years

Page 6

“The Pilgrims on the Mayflower came here to live and stay. They didn’t wait around Plymouth Rock for the return trip, and neither will people building up a population and a settlement [on Mars].”

“Mars can’t just be a one-shot mission” – Buzz Aldrin

Source: Mayflower in Plymouth Harbor by William Halsall (1882)

Page 7

Needs

Maslow's hierarchy of needs

Produce breathable air
Grow food
Build shelter
Mine fuel and materials
Connect with family and friends
Engage in leisure activities
Search the web
Conduct science

Page 8

The fundamental problem: Latency

speed of light: 2-24 minutes
rockets: 5-10 months

Bandwidth is “reasonable”

Lunar Laser Communications Demonstration: 622-Mbps downlink, 20-Mbps uplink
SneakerNet on rockets: Easily PBs

Searching the web should be as easy from Mars as it is from Marseille!
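A rough arithmetic sketch of the bandwidth comparison. It takes the slide's 622 Mbps figure as an assumed link rate and treats the payload size (10 PB) and transit time (150 days) of the rocket as hypothetical values picked only to make the comparison concrete.

```python
SECONDS_PER_DAY = 86_400
PB = 10**15  # bytes, decimal petabyte


def link_transfer_days(num_bytes, link_bps):
    """Days needed to push num_bytes over a link of link_bps bits per second."""
    return num_bytes * 8 / link_bps / SECONDS_PER_DAY


def sneakernet_effective_bps(num_bytes, transit_days):
    """Average bits per second achieved by physically shipping num_bytes."""
    return num_bytes * 8 / (transit_days * SECONDS_PER_DAY)


if __name__ == "__main__":
    days = link_transfer_days(1 * PB, 622e6)
    rate = sneakernet_effective_bps(10 * PB, 150)
    print(f"1 PB over a 622 Mbps link: {days:.0f} days")
    print(f"10 PB on a 150-day rocket: {rate / 1e9:.1f} Gbps effective")
```

With these assumptions a single petabyte already takes months over the link, which is why shipping storage media on rockets stays competitive for bulk data even though its latency is far worse.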

Page 9

What’s doable, what’s not?

Page 10

Example: How do I grow potatoes in recycled organic waste?

Source: 20th Century Fox

Page 11

Search from Mars: Implementation

It’s a caching problem!

Step 1. Rocket SneakerNet
Step 2. Beam the diffs
Step 3. User model activate!

We know exactly how to do this!
We have a good idea how to do this!

We’ve worked out some simulations already…

C. Clarke, G. Cormack, J. Lin, and A. Roegiest. Ten Blue Links on Mars. WWW 2017.
J. Lin, C. Clarke, and G. Baruah. Searching from Mars. IEEE Internet Computing, 20(1):78-82, 2016.
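A toy sketch of the "caching problem" framing, not the design from the cited papers. The class name `MarsCache`, its methods, and the convention that a diff value of `None` means "page deleted" are all hypothetical. Step 3 (the user model) is only hinted at by recording misses that could inform what to request next.

```python
class MarsCache:
    """Hypothetical local copy of web pages held on Mars."""

    def __init__(self, snapshot):
        # Step 1: bulk-load a snapshot that arrived physically (SneakerNet).
        self.pages = dict(snapshot)
        self.misses = []

    def apply_diffs(self, diffs):
        # Step 2: apply changes beamed from Earth; None marks a deleted page.
        for url, content in diffs.items():
            if content is None:
                self.pages.pop(url, None)
            else:
                self.pages[url] = content

    def lookup(self, url):
        # A hit is served locally; a miss is recorded because fetching it
        # would cost at least one Earth round trip.
        if url not in self.pages:
            self.misses.append(url)
            return None
        return self.pages[url]


if __name__ == "__main__":
    cache = MarsCache({"potatoes.html": "v1", "news.html": "v1"})
    cache.apply_diffs({"news.html": "v2", "weather.html": "v1"})
    print(cache.lookup("news.html"), cache.lookup("recipes.html"), cache.misses)
```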

Page 12

For the truly skeptical…

Search from Mars ~ Search from regions on Earth with poor connectivity

Easter Island
Canadian Arctic
Villages in rural India

Page 13

Source: Wikipedia (Everest)

Big Data

Page 14

First, a story…

What’s growing faster?

Moore’s Law (What do I mean here?)
Big Data (What do I mean here?)

Big Data > Moore’s Law
Big Data < Moore’s Law
Big Data ~ Moore’s Law

J. Lin. Is Big Data a Transient Problem? IEEE Internet Computing, 19(5):86-90, 2015.

Page 15

What’s growing faster?

Moore’s Law
Big Data

Let’s restrict to human-generated data

Bounds?
Human population
Data generation per unit time
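One way to read the "Bounds?" question is as a product: human-generated data per year is at most population times per-person generation rate. The sketch below compares such a bound against capacity that doubles at a fixed interval. Every number in it (10 billion people, 1 MB/s per person, 1 ZB of capacity today, a two-year doubling period) is a hypothetical placeholder, not a figure from the slides or the cited paper.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def human_data_bound_bytes_per_year(population, bytes_per_person_per_second):
    """Upper bound on human-generated data produced in one year."""
    return population * bytes_per_person_per_second * SECONDS_PER_YEAR


def years_until_capacity_covers(bound_bytes, capacity_bytes_now, doubling_years=2.0):
    """Years of exponential capacity growth needed to reach a fixed bound."""
    if capacity_bytes_now >= bound_bytes:
        return 0.0
    return doubling_years * math.log2(bound_bytes / capacity_bytes_now)


if __name__ == "__main__":
    bound = human_data_bound_bytes_per_year(10**10, 10**6)
    years = years_until_capacity_covers(bound, 10**21)
    print(f"bound: {bound:.2e} bytes/year, covered after about {years:.0f} years")
```

The point of the sketch is only the shape of the argument: a quantity with a fixed ceiling is eventually overtaken by anything that keeps doubling, whatever the specific numbers are.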

Page 16

Back to my story…

Big Data > Moore’s Law
Big Data < Moore’s Law
Big Data ~ Moore’s Law

Implications?

Page 17

Page 18

What’s growing faster?

Moore’s Law
Big Data

Let’s restrict to human-generated data

Bounds?
Human population
Data generation per unit time

Page 19

Source: Google

Serverless Architectures

Page 20

[Diagram: Server]

Page 21

[Diagram: four servers, each labeled Processor, Memory, Disk]

Page 22

[Diagram: Cloud (I’m going to illustrate with AWS), containing four (Virtualized) Servers, each labeled Processor, Memory, Disk]

Page 23

[Diagram: Cloud, containing four (Virtualized) Servers, each labeled Processor, Memory, Disk, plus a Persistent Store (S3)]

Page 24

[Diagram: Cloud, containing four (Virtualized) Servers, each labeled Processor, Memory, (scratch) Disk, plus a Persistent Store (S3)]

Page 25

[Diagram: Cloud, containing four (Virtualized) Servers, each labeled Processor, Memory, (scratch) Disk, plus “State” as a service (S3, RDS, SQS, …); annotated with “?”]

Page 26

[Diagram: Cloud, containing four (Virtualized) Servers, each labeled Processor, Memory, plus “State” as a service (S3, RDS, SQS, …); annotated with “?”]

Page 27

[Diagram: Cloud, containing three (Virtualized) Servers and one Function, each labeled Processor, Memory, plus “State” as a service (S3, RDS, SQS, …)]

Page 28

[Diagram: Cloud, containing four FaaS units, each labeled Processor, Memory, plus “State” as a service (S3, RDS, SQS, …)]
Page 29

Serverless Architectures

Doesn’t mean you don’t have servers
Just that managing them is the cloud provider’s problem

Write functions with well-defined entry and exit points
Cloud provider handles all other aspects of execution
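A minimal sketch of a function with a single entry point and a single exit point. The `handler(event, context)` signature follows the AWS Lambda convention for Python. The `"text"` field of the event and the word-count task are hypothetical, chosen only for illustration.

```python
def handler(event, context):
    """Entry point: the platform invokes this with an event payload."""
    # "text" is an assumed field name; real events depend on the trigger.
    words = event.get("text", "").split()
    # Exit point: the return value is handed back to the platform.
    return {"word_count": len(words)}


if __name__ == "__main__":
    # Local stand-in for a platform invocation; context is unused here.
    print(handler({"text": "the datacenter is the computer"}, None))
```

Everything outside the function body (provisioning, scaling, retries, logging) is left to the provider, which is why the function keeps no state of its own between calls.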

Page 30

Source: Amazon Web Services

Page 31

(Current) Serverless Architectures

Asynchronous, loosely-coupled, event-driven
Functions touch relatively little data

What about serverless data analytics?

Design goal: pure pay-as-you-go, zero costs for idle capacity
Compared to current options?
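A small cost sketch of what "zero costs for idle capacity" means next to an always-on cluster. The two billing models are simplified, and every price and workload number below is a made-up placeholder, not a real provider rate.

```python
def cluster_cost(hours_provisioned, nodes, price_per_node_hour):
    """Provisioned cluster: billed for every hour it exists, busy or idle."""
    return hours_provisioned * nodes * price_per_node_hour


def faas_cost(invocations, seconds_each, memory_gb, price_per_gb_second):
    """Pay-as-you-go functions: billed only for time actually spent executing."""
    return invocations * seconds_each * memory_gb * price_per_gb_second


if __name__ == "__main__":
    # Hypothetical month: a 4-node cluster kept up for 720 hours, versus
    # 10,000 function invocations of 60 s each at 1 GB of memory.
    print("cluster:", cluster_cost(720, 4, 0.25))
    print("faas:   ", faas_cost(10_000, 60, 1.0, 0.00002))
    print("faas, idle month:", faas_cost(0, 60, 1.0, 0.00002))
```

Which option is cheaper depends entirely on utilization; the sketch only shows that the function-based bill falls to zero when nothing runs, while the cluster bill does not.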

Page 32

Flint: PySpark execution backend

[Architecture diagram. Client side: Spark Context, Flint Scheduler Backend. Amazon Web Services side: Lambda, S3, SQS; Flint Executors; Input Partitions; Queues; Intermediate Stage; Final Stage; Output Partitions. Legend: Data Movement, Control Flow.]

Youngbin Kim and Jimmy Lin. Serverless Data Analytics with Flint. IEEE Cloud 2018.
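A toy, single-process sketch of the two-stage flow the diagram labels suggest: intermediate-stage executors read input partitions and push records onto queues, and final-stage executors drain the queues into output partitions. This is not Flint's code. In-memory `deque` objects stand in for SQS queues, plain strings stand in for S3 input partitions, and all function names are hypothetical.

```python
from collections import Counter, deque


def intermediate_executor(partition, queues):
    """Count words in one input partition and route each count to a queue by key."""
    for word, n in Counter(partition.split()).items():
        # Deterministic key-to-queue routing, so a word always lands in one queue.
        queues[sum(map(ord, word)) % len(queues)].append((word, n))


def final_executor(queue):
    """Drain one queue and sum the partial counts into an output partition."""
    totals = Counter()
    while queue:
        word, n = queue.popleft()
        totals[word] += n
    return dict(totals)


def run_job(input_partitions, num_queues=2):
    """Run the intermediate stage over all partitions, then the final stage."""
    queues = [deque() for _ in range(num_queues)]
    for partition in input_partitions:
        intermediate_executor(partition, queues)
    return [final_executor(q) for q in queues]


if __name__ == "__main__":
    print(run_job(["a b a", "b c a"]))
```

In a serverless setting each executor call would be a separate function invocation, which is why the stages exchange data only through external queues and storage rather than through shared memory.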

Page 33

“Big ideas”

The datacenter is the computer!

Scale “out”, not “up”
Limits of SMP and large shared-memory machines

Move processing to the data
Clusters have limited bandwidth, code is a lot smaller

Process data sequentially, avoid random access
Seeks are expensive, disk throughput is good

Assume that components will break
Engineer software around hardware failures

Page 34

Source: flickr (https://www.flickr.com/photos/39414578@N03/16042029002)