

Post on 23-Feb-2016








Solving the Scalability Dilemma with Clouds, Crowds, and Algorithms

Michael Franklin, UC Berkeley

Joint work with: Michael Armbrust, Peter Bodik, Kristal Curtis, Armando Fox, Randy Katz, Mike Jordan, Nick Lanham, David Patterson, Scott Shenker, Ion Stoica, Beth Trushkowsky, Stephen Tu and Matei Zaharia

Image: John Curley Berkeley

Save the Date(s): CIDR 2011 Conference
Abstracts Due: Sept 24, 2010
Papers Due: October 1, 2010
Focus: innovative and visionary approaches to data systems architecture and use.
Regular CIDR track plus CCC-sponsored "outrageous ideas" track.
Website coming soon!

5th Biennial Conference on Innovative Data Systems Research
CIDR 2011, Jan 9-12, Asilomar, CA

Continuous Improvement of Client Devices

Computing as a Commodity

44200KW racked HW, 7TB RAM, 1.5TB disk, 250 servers

In tents Computing

Ubiquitous Connectivity

AMP: Algorithms, Machines, People

Massive and Diverse Data

The Scalability Dilemma
State-of-the-art Machine Learning techniques do not scale to large data sets.
Data Analytics frameworks can't handle lots of incomplete, heterogeneous, dirty data.
Processing architectures struggle with increasing diversity of programming models and job types.
Adding people to a late project makes it later.

Exactly Opposite of what we Expect and Need

RAD Lab 5-year Mission
Enable 1 person to develop, deploy, operate a next-generation Internet application at scale
Initial Technical Bet: Machine Learning to make large-scale systems self-managing
Multi-area faculty, postdocs, & students
  Systems, Networks, Databases, Security, Statistical Machine Learning all in a single, open, collaborative space
Corporate Sponsorship and intensive industry interaction
  Bi-annual 2.5 day offsite research retreats with sponsors

1/14/11 AMPLab Overview - franklin@cs.berkeley.edu

PIQL + SCADS

SCADS: Distributed Key-Value Store
PIQL: Query Interface & Executor
Flexible Consistency Management
Active PIQL (don't ask)

SCADS: Scale Independent Storage
UC Berkeley

Scale Independence
As a site's user base grows and workload volatility increases:
  No changes to application required
  Cost per user remains constant
  Request latency SLA is unchanged
Key techniques
  Model-Driven Scale Up and Scale Down
  Performance Insightful Query Language
  Declarative Performance/Consistency Tradeoffs

Over-provisioning a stateless system
Wikipedia example
Overprovision by 25% to handle spike (Michael Jackson dies)

Notes: What are the new challenges if we want to apply scale up/down to a *stateful* system instead?

Over-provisioning a stateful system
Wikipedia example
Overprovision by 300% to handle spike (assuming data stored on ten servers)
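The gap between the stateless and stateful figures comes from where the extra load lands. As a back-of-the-envelope sketch (the symbols and the uniform-baseline assumption are illustrative, not from the talk): let $L$ be the normal total load, spread evenly over $n$ servers, and let a spike on a single hot item add $sL$. A stateless tier can send the extra requests to any server, while a stateful tier must serve them from the one server that holds the item. The peak per-server load relative to normal is then:

```latex
\text{stateless: } \frac{(1+s)\,L/n}{L/n} = 1 + s
\qquad\qquad
\text{stateful: } \frac{L/n + sL}{L/n} = 1 + n\,s
```

Under this simple model the required headroom is $s$ for the stateless system and $n s$ for the stateful one if every server is sized alike, so with ten servers it is ten times larger. That is the same order of magnitude as the gap between the 25% and 300% figures on the slides, which come from the real Wikipedia trace rather than from this model.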

Data storage configuration
  Shared-nothing storage cluster
  (key, value) pairs in a namespace, e.g. (user, email)
  Each node stores a set of data ranges
  Data ranges can be split down to some minimum size promised by PIQL, to ensure range queries don't touch more than one node
[Figure: storage nodes labeled with the key ranges they hold, e.g. A-C, D-E, F-G]
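The range layout described above can be sketched as a sorted list of range start keys, each mapped to the node that stores it. This is a minimal illustration, not SCADS code: `RangeMap`, `nodeFor`, and `split` are hypothetical names, replication is left out, and the minimum range size promised by PIQL is not modeled.

```scala
// Illustrative only: RangeMap, nodeFor and split are hypothetical, not SCADS APIs.
object RangePartitioning {
  // Each entry maps the inclusive start key of a range to the node storing it.
  // A range extends up to (but not including) the next start key.
  final case class RangeMap(starts: Vector[(String, String)]) {
    require(starts.nonEmpty, "at least one range is needed")
    require(starts.map(_._1) == starts.map(_._1).sorted, "start keys must be sorted")

    // The owner of `key` is the last range whose start key is <= key.
    def nodeFor(key: String): String = {
      val idx = starts.lastIndexWhere { case (start, _) => start.compareTo(key) <= 0 }
      require(idx >= 0, s"key $key precedes the first range")
      starts(idx)._2
    }

    // Split the range containing `at`: keys >= `at` (up to the next start) go to `newNode`.
    def split(at: String, newNode: String): RangeMap = {
      val (before, after) = starts.span { case (start, _) => start.compareTo(at) < 0 }
      val rest = after.dropWhile(_._1 == at) // replace an existing range that starts at `at`
      RangeMap((before :+ (at -> newNode)) ++ rest)
    }
  }
}
```

For example, with ranges starting at "a", "d", and "f" on three nodes, `nodeFor("carol")` returns the first node, and after `split("e", "node4")` the key "erin" is served by `node4` while "dave" stays where it was. Because each range lives on one node, a range query that stays inside one range touches a single node.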

Notes: Directing the get/put requests of the underlying key-value store. Recall that PIQL promises certain range query boundaries, so the Director's job is to respect those boundaries so that range queries don't have to touch more than one node in those cases.

Workload-based policy stages
Stage 1: Replicate
Stage 2: Data Movement
Stage 3: Server Allocation
[Figures: workload per bin on each storage node against a workload threshold; Stage 2 marks the destination node. Y-axis = workload; black box = standbys.]

Workload-based policy
Policy input:
  Workload per histogram bin
  Cluster configuration
Policy output:
  Short actions (per bin)
Considerations:
  Performance model
  Overprovision buffer
[Diagram labels: SCADS namespace, Workload Histogram, Performance Model, Policy, Action Executor; edges: sampled workload as histogram, smoothed workload, config, actions]

Action Executor
  Limit actions to X kb/s

Notes:
Sample workload using real-time reporting feature of Chukwa
Bin: smallest unit of data movement
Smooth observed workload with hysteresis
Hysteresis up and hysteresis down: learned reasonable values with policy simulator

Data movement affects performance
Restrict how fast data is copied
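One plausible reading of the policy described above can be sketched as follows. Everything here is an assumption made for illustration: the names (`WorkloadPolicy`, `smooth`, `plan`, the `Action` cases), the interpretation of "hysteresis up/down" as two smoothing weights, and the simple greedy rules for the three stages. It is not the Director's actual algorithm, and it ignores the performance model, the overprovision buffer, and the copy-rate limit.

```scala
// Illustrative sketch: all names are hypothetical, not SCADS/Director APIs.
object WorkloadPolicy {
  sealed trait Action
  final case class Replicate(bin: Int) extends Action
  final case class Move(bin: Int, from: String, to: String) extends Action
  final case class AddServer(forBin: Int) extends Action

  // Asymmetric smoothing: weight `up` when the load rises, `down` when it falls.
  def smooth(previous: Double, observed: Double, up: Double, down: Double): Double = {
    val alpha = if (observed > previous) up else down
    previous + alpha * (observed - previous)
  }

  // One planning round. `placement` maps server -> (bin -> smoothed req/s).
  def plan(placement: Map[String, Map[Int, Int]], threshold: Int): List[Action] = {
    require(threshold > 0, "threshold must be positive")
    val servers = placement.toList.sortBy(_._1) // deterministic order
    val actions = scala.collection.mutable.ListBuffer.empty[Action]

    // Stage 1: a bin that alone exceeds the threshold needs another replica.
    val hotBins = servers.flatMap { case (_, bins) =>
      bins.toList.sortBy(_._1).collect { case (bin, l) if l > threshold => bin }
    }.distinct
    hotBins.foreach(bin => actions += Replicate(bin))

    // Load per server, ignoring bins already handled by replication.
    val load = scala.collection.mutable.Map.empty[String, Int]
    servers.foreach { case (server, bins) =>
      load(server) = bins.toList.collect { case (bin, l) if !hotBins.contains(bin) => l }.sum
    }

    servers.foreach { case (server, bins) =>
      if (load(server) > threshold) {
        // Pick the busiest remaining bin on the overloaded server.
        val (bin, binLoad) =
          bins.toList.filterNot(p => hotBins.contains(p._1)).sortBy(_._1).maxBy(_._2)
        val target = load.toList.filter(_._1 != server).sortBy(p => (p._2, p._1)).headOption
        target match {
          case Some((dest, destLoad)) if destLoad + binLoad <= threshold =>
            actions += Move(bin, server, dest) // Stage 2: move data to a server with room
            load(dest) = destLoad + binLoad
          case _ =>
            actions += AddServer(bin) // Stage 3: no room anywhere, allocate a server
        }
        load(server) = load(server) - binLoad
      }
    }
    actions.toList
  }
}
```

With a threshold of 500 req/s, a server `s1` holding bins 1, 2, 3 at 900, 350, and 250 req/s and a server `s2` holding bin 4 at 100 req/s, `plan` returns `List(Replicate(1), Move(2, "s1", "s2"))`. Each round emits only a few short per-bin actions, which matches the slide's "short actions (per bin)" output.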

Example
  + Wikipedia's MJ spike (see Bodik et al. SOCC 2010 for workload generation)
  One million (key, value) pairs, each ~256 bytes
  Model: max sustainable workload per server

Cost:
  Machine cost: 1 unit/10 minutes
  SLA: 99th percentile of get/put latency

Deployment
  Using m1.small instances on EC2, 1GB of RAM
  Server boot up time: 48 seconds
  Delay server removal until 2 minutes left
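The "delay server removal until 2 minutes left" rule follows from the cost model: if a machine is paid for in 10-minute units, releasing it early gives back nothing, so it can keep serving until the paid unit is nearly over. A minimal sketch, assuming that partial units are billed in full and using hypothetical names (`ServerRemoval`, `shouldRemoveNow`):

```scala
// Illustrative sketch; names are hypothetical and billing is assumed to round up.
object ServerRemoval {
  val BillingUnitSeconds: Long = 10 * 60  // machine cost: 1 unit per 10 minutes
  val RemovalWindowSeconds: Long = 2 * 60 // release only in the last 2 minutes of a unit

  // Seconds remaining in the billing unit the server is currently in.
  def secondsLeftInUnit(uptimeSeconds: Long): Long =
    BillingUnitSeconds - (uptimeSeconds % BillingUnitSeconds)

  // A server the policy no longer needs is kept until its paid unit is almost used up.
  def shouldRemoveNow(uptimeSeconds: Long, noLongerNeeded: Boolean): Boolean =
    noLongerNeeded && secondsLeftInUnit(uptimeSeconds) <= RemovalWindowSeconds

  // Units charged for a given uptime when partial units are billed in full.
  def unitsCharged(uptimeSeconds: Long): Long =
    (uptimeSeconds + BillingUnitSeconds - 1) / BillingUnitSeconds
}
```

For instance, an unneeded server that has been up for 450 seconds has 150 seconds of paid time left and is kept, while one that has been up for 500 seconds is within the two-minute window and is released.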

Goal: selectively absorb hotspot
[Figure: workload over time, in thousand req/sec]


Actions during the spike
[Figure: data movement (Kb/s) and actions during the spike, 10:00 to 10:14. Actions shown: add replica; move data, partition; move data, coalesce.]

Configuration at end of Spike
[Figure: per-server workload and # keys after added replicas]

Cost comparison to fixed and optimal
Fixed allocation policy: 648 server units
Optimal policy: 310 server units

Overprovision factor   get/put SLA (ms)   # server units   % savings (vs fixed alloc)
0.5                    180/250            358              48
0.6                    140/225            389              40
0.7                    120/200            422              35

PIQL [Armbrust et al. SIGMOD 2010 (demo) and SOCC 2010 (design paper)]
"Performance Insightful" language subset
Compiler reasons about operation bounds
  Unbounded queries are disallowed
  Queries above specified threshold generate a warning
Predeclare query templates: Optimizer decides what indexes are needed (i.e., materialized views)
Provides: Bounded number of operations + Strong SLAs = Predictable performance?
[Figure labels: RDBMS, NoSQL]

PIQL DDL

    ENTITY User {
      string username,
      string password,
      PRIMARY KEY(username)
    }

    ENTITY Subscription {
      boolean approved,
      string owner,
      string target,
      FOREIGN KEY owner REF User,
      FOREIGN KEY target REF User MAX 5000,
      PRIMARY KEY(owner, target)
    }

    ENTITY Thought {
      int timestamp,
      string owner,
      string text,
      FOREIGN KEY owner REFERENCES User,
      PRIMARY KEY(owner, timestamp)
    }

F.K.s are required for joins.
Cardinality limits are required for un-paginated joins.

More Queries

Return the most recent thoughts from all of my approved subscriptions.

Operations are bounded via schema and limit max
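To see how a schema limit and a query limit combine into an operation bound, here is a small sketch of the kind of reasoning involved. It is not the PIQL compiler: the plan model (`Fetch`, `ForEach`), the function names, and the choice to count one operation per record read are all assumptions, and the limit of 10 thoughts per subscription is a made-up value for illustration.

```scala
// Illustrative sketch of bound reasoning; not the PIQL compiler's actual model.
object OperationBounds {
  sealed trait Plan
  // Read records of one entity; bounded by a schema cardinality limit and/or a query LIMIT.
  final case class Fetch(entity: String, schemaMax: Option[Long], limit: Option[Long]) extends Plan
  // For every record produced by `outer`, run `inner` once (a join).
  final case class ForEach(outer: Plan, inner: Fetch) extends Plan

  // Most records a single fetch can return; None means unbounded.
  def maxRows(f: Fetch): Option[Long] = (f.schemaMax, f.limit) match {
    case (Some(a), Some(b)) => Some(math.min(a, b))
    case (a, b)             => a.orElse(b)
  }

  def rows(plan: Plan): Option[Long] = plan match {
    case f: Fetch          => maxRows(f)
    case ForEach(o, inner) => for (n <- rows(o); m <- maxRows(inner)) yield n * m
  }

  // Upper bound on records read; None means the query is unbounded and would be rejected.
  def operations(plan: Plan): Option[Long] = plan match {
    case f: Fetch => maxRows(f)
    case ForEach(o, inner) =>
      for (ops <- operations(o); n <- rows(o); m <- maxRows(inner)) yield ops + n * m
  }
}
```

With the `MAX 5000` limit on subscriptions and the assumed limit of 10 thoughts each, `operations(ForEach(Fetch("Subscription", Some(5000L), None), Fetch("Thought", None, Some(10L))))` is `Some(55000)`. Dropping the limit on thoughts makes the result `None`, which corresponds to an unbounded query.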

PIQL: Help Fix Bad Queries
Interactive Query Visualizer
  Shows record counts and # ops
  Highlights unbounded parts of query
SIGMOD'10 Demo: piql.knowsql.org
[Figure labels: RDBMS, NoSQL]

Notes: Highlight unbounded edge, show the

PIQL + SCADS
Goals are Scale Independence and Performance Insightfulness
SCADS provides scalable foundation with SLA adherence
PIQL uses language restrictions, schema limits, and precomputed views to bound # of SCADS operations per query.
These work together to bridge the gap between SQL and NoSQL worlds.

Spark: Support for Iterative Data-Intensive Computing

M. Zaharia et al. HotCloud Workshop 2010

UC Berkeley

Analytics: Logistic Regression
Goal: find best line separating 2 datasets
[Figure: scatter plot of the two datasets, with the target line and a random initial line]
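As a self-contained illustration of the goal stated above (and not the talk's own code), the following sketch trains a separating line by batch gradient descent on the logistic loss, with labels of +1 and -1. The names `Point`, `train`, and `predict`, and the `step`, `iterations`, and `seed` parameters, are assumptions made for this example. Every iteration makes a full pass over the same dataset, which is the access pattern that makes this workload iterative.

```scala
// Illustrative sketch; names and hyperparameters are assumptions, not from the talk.
object LogisticRegressionSketch {
  final case class Point(x: Array[Double], y: Double) // y is +1.0 or -1.0

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def train(data: Seq[Point], dims: Int, iterations: Int, step: Double, seed: Long): Array[Double] = {
    val rng = new scala.util.Random(seed)
    var w = Array.fill(dims)(rng.nextDouble() * 2 - 1) // random initial line
    for (_ <- 1 to iterations) {
      val gradient = new Array[Double](dims)
      for (p <- data) { // full pass over the dataset on every iteration
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        for (j <- 0 until dims) gradient(j) += scale * p.x(j)
      }
      w = w.zip(gradient).map { case (wj, gj) => wj - step * gj }
    }
    w
  }

  // Classify a point by which side of the line it falls on.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0
}
```

Here the first coordinate of each point can be fixed to 1.0 so that the corresponding weight acts as the line's offset.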

Notes: Note that dataset is reused on each gradient computation

Serial Version

val data = readData(...)

var w = Vector.random(D)

for (i