
Page 1: Solving the Scalability Dilemma with Clouds, Crowds, and Algorithms Michael Franklin UC  Berkeley

Solving the Scalability Dilemma with Clouds, Crowds, and Algorithms

Michael Franklin
UC Berkeley

Joint work with: Michael Armbrust, Peter Bodik, Kristal Curtis, Armando Fox, Randy Katz, Mike Jordan, Nick Lanham, David Patterson, Scott Shenker, Ion Stoica, Beth Trushkowsky, Stephen Tu, and Matei Zaharia

Image: John Curley http://www.flickr.com/photos/jay_que/1834540/

Page 2

Save the Date(s): CIDR 2011 Conference

• Abstracts Due: Sept 24, 2010
• Papers Due: October 1, 2010
• Focus: innovative and visionary approaches to data systems architecture and use.
• Regular CIDR track plus CCC-sponsored “outrageous ideas” track.
• Website coming soon!

5th Biennial Conference on Innovative Data Systems Research

CIDR 2011, Jan 9-12, Asilomar, CA

Page 3

Continuous Improvement of Client Devices

Page 4

Computing as a Commodity

Page 5

Ubiquitous Connectivity

Page 6

AMP: Algorithms, Machines, People

• Adaptive/Active Machine Learning and Analytics
• Cloud Computing
• CrowdSourcing
• Massive and Diverse Data

Page 7

The Scalability Dilemma

• State-of-the-art Machine Learning techniques do not scale to large data sets.

• Data Analytics frameworks can’t handle lots of incomplete, heterogeneous, dirty data.

• Processing architectures struggle with increasing diversity of programming models and job types.

• Adding people to a late project makes it later.

Exactly the opposite of what we expect and need.

Page 8

RAD Lab 5-year Mission

Enable 1 person to develop, deploy, operate a next-generation Internet application at scale

Initial Technical Bet:
• Machine Learning to make large-scale systems self-managing

Multi-area faculty, postdocs, & students
• Systems, Networks, Databases, Security, Statistical Machine Learning all in a single, open, collaborative space

Corporate Sponsorship and intensive industry interaction
– Bi-annual 2.5 day offsite research retreats with sponsors

Page 9

PIQL + SCADS

• SCADS: Distributed Key Value Store
• PIQL: Query Interface & Executor
• Flexible Consistency Management
• “Active PIQL” (don’t ask)

Page 10

SCADS: Scale Independent Storage

Page 11

Scale Independence

• As a site’s user base grows and workload volatility increases:
– No changes to application required
– Cost per user remains constant
– Request latency SLA is unchanged

• Key techniques
– Model-Driven Scale Up and Scale Down
– Performance Insightful Query Language
– Declarative Performance/Consistency Tradeoffs

Page 12

Over-provisioning a stateless system

Wikipedia example: overprovision by 25% to handle spike

[Figure annotation: Michael Jackson dies]

Page 13

Over-provisioning a stateful system

Wikipedia example: overprovision by 300% to handle spike (assuming data stored on ten servers)

Page 14

Data storage configuration

• Shared-nothing storage cluster
– (key,value) pairs in a namespace, e.g. (user,email)
– Each node stores a set of data ranges
– Data ranges can be split until some minimum size promised by PIQL, to ensure range queries don’t touch more than one node

[Figure: key ranges (A-C, D-E, F, F-G, G) placed on storage nodes, with some ranges stored on more than one node]
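The range layout above can be sketched as a lookup from key to range to nodes. This is only an illustration, not SCADS code: `KeyRange`, the node names, and the three-range layout are hypothetical, loosely following the A-C / D-E / F-G labels in the figure.

```scala
object RangeLookup {
  // One contiguous key range [lower, upper) and the nodes holding a replica of it.
  // All names here are hypothetical, chosen for illustration only.
  final case class KeyRange(lower: String, upper: String, nodes: List[String])

  // Assumed layout: three ranges, two of them replicated on two nodes.
  val ranges: Vector[KeyRange] = Vector(
    KeyRange("a", "d", List("node1", "node2")),
    KeyRange("d", "f", List("node2", "node3")),
    KeyRange("f", "h", List("node3"))
  )

  // Find the range containing a key; ranges are assumed sorted and non-overlapping.
  def rangeFor(key: String): Option[KeyRange] =
    ranges.find(r => key >= r.lower && key < r.upper)

  // A range query [lo, hi] can be served by one node when both ends
  // fall inside the same stored range.
  def singleNodeQuery(lo: String, hi: String): Boolean =
    (rangeFor(lo), rangeFor(hi)) match {
      case (Some(a), Some(b)) => a == b
      case _                  => false
    }
}
```

For example, `RangeLookup.singleNodeQuery("alice", "bob")` stays inside the first range, while `RangeLookup.singleNodeQuery("carol", "dave")` crosses a range boundary.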

Page 15

Workload-based policy stages

Stage 1: Replicate

[Figure labels: Storage nodes, Workload, threshold, Bins]

Page 16

Workload-based policy stages

Stage 2: Data Movement

[Figure labels: Storage nodes, Workload, threshold, Bins, destination]

Page 17

Workload-based policy stages

Stage 3: Server Allocation

[Figure labels: Storage nodes, Workload, threshold, Bins]

Page 18

Workload-based policy

Policy input:
– Workload per histogram bin
– Cluster configuration

Policy output:
– Short actions (per bin)

Considerations:
– Performance model
– Overprovision buffer

Action Executor
– Limit actions to X kb/s

[Figure labels: SCADS namespace, sampled workload as histogram, Workload Histogram, smoothed workload, config, Performance Model, Policy, actions, Action Executor]
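One way to read the three policy stages (replicate, move data, allocate a server) is as a function from per-bin workload and a threshold to a list of short actions. The sketch below is a guess at that shape, not the SCADS policy itself: the `Action` types, the single `threshold` parameter, and the rule of one action per overloaded server per round are all assumptions.

```scala
object WorkloadPolicy {
  // Hypothetical action types, one per policy stage.
  sealed trait Action
  final case class Replicate(bin: String) extends Action
  final case class Move(bin: String, from: String, to: String) extends Action
  final case class AllocateServer(forBin: String) extends Action

  // load(server)(bin) = smoothed requests/sec for that bin on that server.
  type Load = Map[String, Map[String, Double]]

  // Assumed rule: for every server above the threshold, act on its hottest bin.
  def plan(load: Load, threshold: Double): List[Action] = {
    val totals: Map[String, Double] =
      load.map { case (server, bins) => server -> bins.values.sum }
    load.toList.sortBy(_._1).collect {
      // A server over a non-negative threshold has at least one bin, so maxBy is safe.
      case (server, bins) if totals(server) > threshold =>
        val (hotBin, rate) = bins.maxBy(_._2)
        if (rate > threshold) Replicate(hotBin) // stage 1: bin too hot for any one server
        else {
          // Stage 2: least-loaded other server that can take the bin.
          val dest = totals.toList.sortBy(_._2).collectFirst {
            case (s, t) if s != server && t + rate <= threshold => s
          }
          dest match {
            case Some(d) => Move(hotBin, server, d)
            case None    => AllocateServer(hotBin) // stage 3: no room anywhere
          }
        }
    }
  }
}
```

A real policy would also account for the performance model and the overprovision buffer listed above; here both are folded into the one `threshold` value for brevity.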

Page 19

Example Experiment

Workload
• Ebates.com + Wikipedia’s MJ spike (see Bodik et al. SOCC 2010 for workload generation)
• One million (key,value) pairs, each ~256 bytes

Model: max sustainable workload per server

Cost:
• machine cost: 1 unit/10 minutes
• SLA: 99th percentile of get/put latency

Deployment
• using m1.small instances on EC2, 1GB of RAM
• server boot up time: 48 seconds
• Delay server removal until 2 minutes left

Page 20

Goal: selectively absorb hotspot

[Figure: request rate, in thousand req / sec]

Page 21

Actions during the spike

[Figure: data movement (Kb/s) and actions during the spike, 10:00 to 10:14. Actions shown: Add replica; Move data, partition; Move data, coalesce]

Page 22

Configuration at end of Spike

Per server workload and # keys after added replicas

Page 23

Cost-comparison to fixed and optimal

• Fixed allocation policy: 648 server units
• Optimal policy: 310 server units

Overprovision factor | get/put SLA (ms) | # server units | % savings (vs fixed alloc)
0.5 | 180/250 | 358 | 48
0.6 | 140/225 | 389 | 40
0.7 | 120/200 | 422 | 35
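The savings column appears to be the reduction in server units relative to the 648-unit fixed allocation. A minimal sketch of that arithmetic, assuming that formula (the slide does not state it), is shown below. It reproduces the 40 and 35 figures for 389 and 422 units; for 358 units it gives about 45 rather than the listed 48, and the sketch does not try to resolve which of those two numbers is off. The 310-unit optimal policy works out to about 52.

```scala
object CostSavings {
  // Server units used by the fixed allocation policy (from the slide).
  val fixedUnits = 648.0

  // Assumed formula: percentage of server units saved relative to fixed allocation.
  def savingsPercent(units: Double): Double =
    100.0 * (fixedUnits - units) / fixedUnits
}
```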

Page 24

PIQL [Armbrust et al. SIGMOD 2010 (demo) and SOCC 2010 (design paper)]

• “Performance Insightful” language subset
• Compiler reasons about operation bounds
– Unbounded queries are disallowed
– Queries above specified threshold generate a warning
– Predeclare query templates: Optimizer decides what indexes are needed (i.e., materialized views)
• Provides: Bounded number of operations
• + Strong SLAs = Predictable performance?

[Figure labels: RDBMS, NoSQL]
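The compiler's three outcomes (disallow, warn, accept) can be sketched as a check over the fan-out limit of each step in a query plan. This is not the PIQL compiler: the `Verdict` types, the list-of-limits plan representation, and the cost model (records reached per step, summed) are assumptions made for illustration.

```scala
object OpBounds {
  // Hypothetical outcomes mirroring the three cases on the slide.
  sealed trait Verdict
  case object Rejected extends Verdict                  // some step has no static bound
  final case class Warned(ops: Long) extends Verdict    // bounded, but above the threshold
  final case class Accepted(ops: Long) extends Verdict  // bounded and within the threshold

  // Each plan step fans out to at most `limit` records per input record;
  // None means no limit was declared for that step.
  def check(fanOuts: List[Option[Long]], warnAbove: Long): Verdict =
    if (fanOuts.exists(_.isEmpty)) Rejected
    else {
      // Records reached after each step: running product of the limits.
      val reached = fanOuts.flatten.scanLeft(1L)(_ * _).tail
      val ops = reached.sum
      if (ops > warnAbove) Warned(ops) else Accepted(ops)
    }
}
```

For instance, a step limited to 5000 records followed by a step limited to 10 records each gives 5000 + 50000 = 55000 operations under this cost model, while any step without a limit makes the whole query unbounded.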

Page 25

PIQL DDL

ENTITY User {
  string username,
  string password,
  PRIMARY KEY(username)
}

ENTITY Subscription {
  boolean approved,
  string owner,
  string target,
  FOREIGN KEY owner REF User,
  FOREIGN KEY target REF User MAX 5000,
  PRIMARY KEY(owner, target)
}

ENTITY Thought {
  int timestamp,
  string owner,
  string text,
  FOREIGN KEY owner REFERENCES User,
  PRIMARY KEY(owner, timestamp)
}

F.K.s are required for joins.

Cardinality limits are required for un-paginated joins.

Page 26

More Queries

“Return the most recent thoughts from all of my “approved” subscriptions.”

Operations are bounded via schema and limit max
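The meaning of the query above can be illustrated over in-memory collections. The case classes mirror the `Subscription` and `Thought` entities from the DDL slide; the function name, the `limit` parameter, and the evaluation strategy are assumptions, and this is plain Scala rather than PIQL syntax. Unlike a PIQL plan it scans every thought, so it shows only what the query returns, not how the bound is enforced.

```scala
object RecentThoughts {
  // Mirrors the Subscription and Thought entities from the DDL slide.
  final case class Subscription(owner: String, target: String, approved: Boolean)
  final case class Thought(owner: String, timestamp: Int, text: String)

  // The schema's MAX 5000 caps how many subscriptions one owner can have, and
  // `limit` caps the result size; together they bound the work per query.
  def recent(me: String,
             subs: Seq[Subscription],
             thoughts: Seq[Thought],
             limit: Int): Seq[Thought] = {
    // Users whose thoughts I follow through an approved subscription.
    val followed = subs.filter(s => s.owner == me && s.approved).map(_.target).toSet
    thoughts
      .filter(t => followed.contains(t.owner))
      .sortBy(_.timestamp)(Ordering[Int].reverse) // newest first
      .take(limit)
  }
}
```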

Page 27

PIQL: Help Fix “Bad” Queries

• Interactive Query Visualizer
– Shows record counts and # ops
– Highlights unbounded parts of query
– SIGMOD’10 Demo: piql.knowsql.org

[Figure labels: RDBMS, NoSQL]

Page 28

PIQL + SCADS

• Goals are “Scale Independence” and “Performance Insightfulness”

• SCADS provides scalable foundation with SLA adherence

• PIQL uses language restrictions, schema limits, and precomputed views to bound # of SCADS operations per query.

• These work together to bridge the gap between “SQL” and “NoSQL” worlds.

Page 29

Spark: Support for Iterative Data-Intensive Computing

M. Zaharia et al., HotCloud Workshop 2010

Page 30

Analytics: Logistic Regression

Goal: find best line separating 2 datasets

[Figure: two sets of points (+ and –), a random initial line, and the target line]

Page 31

Serial Version

val data = readData(...)

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}

println("Final w: " + w)
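The serial loop relies on a `Vector` class and a `readData` helper that the slide does not define. A self-contained variant using plain arrays might look like the sketch below. `Point`, `dot`, `train`, the seeded initializer drawing from [-1, 1), and the implicit step size of 1 are assumptions standing in for the slide's helpers; the update rule is the one on the slide.

```scala
import scala.math.exp
import scala.util.Random

object SerialLogReg {
  // A labelled point: features x and label y, where y is +1 or -1.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Batch gradient descent with the same update as the slide (step size 1).
  def train(data: Seq[Point], dims: Int, iterations: Int, seed: Long = 42L): Array[Double] = {
    val rng = new Random(seed)
    // Assumed stand-in for Vector.random(D): uniform values in [-1, 1).
    var w = Array.fill(dims)(2 * rng.nextDouble() - 1)
    for (_ <- 1 to iterations) {
      val gradient = Array.fill(dims)(0.0)
      for (p <- data) {
        val scale = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y
        for (j <- 0 until dims) gradient(j) += scale * p.x(j)
      }
      w = w.zip(gradient).map { case (wj, gj) => wj - gj }
    }
    w
  }
}
```

A point `p` is classified correctly by the result `w` when `p.y * SerialLogReg.dot(w, p.x)` is positive.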

Page 32

Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  for (p <- data) {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}

println("Final w: " + w)


Page 34

Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  data.foreach(p => {
    val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  })
  w -= gradient.value
}

println("Final w: " + w)

Page 35

Iterative Processing Dataflow

[Figure: dataflow for Hadoop / Dryad compared with Spark, showing inputs x and w feeding f(x,w) on each iteration]

Page 36

Performance

40s / iteration

first iteration 60s
further iterations 2s
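The two timing profiles can be compared with a little arithmetic. The extracted text does not say which system each number belongs to. The sketch assumes the flat 40 s per iteration is the Hadoop-style run and the 60 s then 2 s profile is the cached Spark run, in line with the preceding dataflow slide. The function names are hypothetical.

```scala
object IterationCost {
  // Assumed attribution: flat 40 s per iteration (data reloaded every time).
  def flat(iterations: Int): Int = 40 * iterations

  // Assumed attribution: 60 s for the first iteration (load and cache), then 2 s each.
  def cached(iterations: Int): Int =
    if (iterations <= 0) 0 else 60 + 2 * (iterations - 1)

  // Ratio of total running times; meaningful for iterations >= 1.
  def speedup(iterations: Int): Double =
    flat(iterations).toDouble / cached(iterations)
}
```

Under these assumptions the cached profile is slower for a single iteration (60 s against 40 s) and ahead from the second iteration on (62 s against 80 s); at 10 iterations the totals are 78 s and 400 s.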

Page 37

What about the People?

Page 38

Participatory Culture – “Indirect”

John Murrell: GM SV 9/17/09

…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.

Page 39

Participatory Culture - Direct

Page 40

Crowdsourcing Example

From: Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones, MobiSys 2010.

Page 41

Mechanical Turk vs. Cluster Computing

• What challenges are similar?
• What challenges are new?
• Allocation, Cost, Reliability, Quality, Bias, Making jobs appealing, ….

Page 42

AMP: Algorithms, Machines, People

• Adaptive/Active Machine Learning and Analytics
• Cloud Computing
• CrowdSourcing
• Massive and Diverse Data

Page 43

Clouds and Crowds

| Interactive Cloud | Analytic Cloud | People Cloud
Data Acquisition | Transactional systems, Data entry | … + Sensors (physical & software) | … + Web 2.0
Computation | Get and Put | Map Reduce, Parallel DBMS, Stream Processing | … + Collaborative Structures (e.g., Mechanical Turk, Intelligence Markets)
Data Model | Records | Numbers, Media | … + Text, Media, Natural Language
Response Time | Seconds | Hours/Days | … + Continuous

The Future Cloud will be a Hybrid of These.

Page 44

AMPLab Technical Plan

• Machine Learning & Analytics (Jordan, Fox, Franklin)
– Error Bars on all Answers
– Active learning, continuous/adaptive improvement

• Data Management (Franklin, Joseph)
– Pay-as-you-go integration and structure
– Privacy

• Infrastructure (Stoica, Shenker, Patterson, Katz)
– Nexus cloud OS and analytics languages

• Hybrid Crowd/Cloud Systems (Bayen, Waddell)
– Incentive structures, systems aspects

Page 45

Guiding Use Cases

• Crowdsourced Sensing, Work, Policy, Journalism

• Urban Micro-Simulation

Page 46

Algorithms, Machines & People

• A holistic view of the entire stack.

• Highly interdisciplinary faculty & students

• Developing a five-year plan; will dovetail with RADLab completion

For more information: [email protected]


Enable many people to collaborate to collect, generate, clean, make sense of and utilize lots of data.

Data Visualization, Collaboration, HCI, Policies
Text analytics
Machine Learning and Stats
Database, OLAP, MapReduce
Security and Privacy
MPP, Data Centers, Networks
Multi-Core Parallelism