Low Latency Execution for Apache Spark
TRANSCRIPT
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
Who am I?
PhD candidate, AMPLab, UC Berkeley
Dissertation: System design for large-scale machine learning
Apache Spark PMC Member. Contributions to Spark core, MLlib, SparkR
Low latency: SPARK STREAMING
Low latency: execution
Scheduler
Few milliseconds
Large Clusters
Low latency execution engine for iterative workloads
Machine learning Stream processing
State
This talk
Execution Models
Centralized task scheduling
Lineage, Parallel Recovery
Microsoft Dryad
Execution models: batch processing
Task
Control Message Driver
Driver: Coordinate shuffles, data sharing
SHUFFLE
Network Transfer
Iteration
BROADCAST
Execution models: parallel operators
Long-lived operators
Checkpointing based fault tolerance
Naiad
SHUFFLE
SHUFFLE
Iteration
Task
Control Message Driver
Network Transfer
Low Latency
SCALING behavior
[Figure: Median-task time breakdown. Time (ms, 0–250) vs. machines (4–128); components: Network Wait, Compute, Task Fetch, Scheduler Delay]
Cluster: 4-core r3.xlarge machines. Workload: Sum of 10k numbers per core
Can we achieve low latency with Apache Spark?
DESIGN INSIGHT
Fine-grained execution with coarse-grained scheduling
DRIZZLE
SHUFFLE
BROADCAST
Iteration
Batch Scheduling Pre-Scheduling Shuffles Distributed Shared Variables
DAG scheduling
Assign tasks to hosts using (a) locality preferences (b) straggler mitigation (c) fair sharing etc.
Tasks Host1
Host2
Driver
Host1
Host2
Serialize & Launch
Host Metadata
Scheduler
Same DAG structure for many iterations
Can reuse scheduling decisions
Batch scheduling
Schedule a batch of iterations at once
Fault tolerance, scheduling at batch boundaries
1 stage in each iteration
b = 2
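The batch-scheduling idea can be sketched in Scala. `scheduleBatch`, `placeTasks`, and `Assignment` are hypothetical names for illustration, not Drizzle's actual internals:

```scala
// Hypothetical sketch of batch scheduling: compute task placement once,
// then reuse it for b consecutive iterations instead of rescheduling each one.
case class Assignment(taskId: Int, host: String)

def placeTasks(numTasks: Int, hosts: Seq[String]): Seq[Assignment] =
  // Round-robin placement stands in for locality / fairness / straggler logic.
  (0 until numTasks).map(t => Assignment(t, hosts(t % hosts.size)))

def scheduleBatch(b: Int, numTasks: Int, hosts: Seq[String]): Seq[Seq[Assignment]] = {
  val placement = placeTasks(numTasks, hosts) // one scheduling decision...
  Seq.fill(b)(placement)                      // ...amortized over b iterations
}

// With b = 2, both iterations share the same placement.
val batch = scheduleBatch(b = 2, numTasks = 4, hosts = Seq("host1", "host2"))
assert(batch(0) == batch(1))
```

The point of the sketch: the expensive decision (where each task runs) is made once per batch, so the per-iteration scheduling cost drops by a factor of b.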
How much does this help ?
[Figure: Time / iter (ms, log scale 1–1000) vs. machines (4–128) for Apache Spark, Drizzle-10, Drizzle-50, Drizzle-100]
Workload: Sum of 10k numbers per core
Single-stage job, 100 iterations, varying Drizzle batch size
DRIZZLE
SHUFFLE
BROADCAST
Iteration
Batch Scheduling Pre-Scheduling Shuffles Distributed Shared Variables
coordinating shuffles: APACHE SPARK
Task
Control Message
Data Message
Driver
Intermediate Data
Driver sends metadata; tasks pull data
coordinating shuffles: PRE-SCHEDULING
Pre-schedule downstream tasks on executors
Trigger tasks once dependencies are met
Task
Control Message Data Message
Driver
Intermediate Data
Pre-scheduled task
Push-metadata, pull-data
Coalesce metadata across cores; transfer during downstream stage
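One way to picture pre-scheduling is a dependency counter on the executor: the downstream task is placed before its inputs exist and fires once every upstream output has been announced. `PreScheduledTask` is a hypothetical illustration, not Drizzle's implementation:

```scala
// Hypothetical sketch of a pre-scheduled task: it sits on the executor and
// runs as soon as all upstream dependencies report in, with no extra
// round-trip through the driver.
class PreScheduledTask(numDeps: Int, body: () => Unit) {
  private var remaining = numDeps

  // Called when an upstream task pushes its "output ready" metadata.
  def dependencyMet(): Unit = {
    remaining -= 1
    if (remaining == 0) body() // all inputs available: trigger locally
  }
}

var ran = false
val task = new PreScheduledTask(2, () => ran = true)
task.dependencyMet()
assert(!ran) // still waiting on one upstream output
task.dependencyMet()
assert(ran)  // all dependencies met: task runs
```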
Data Message Intermediate Data Metadata Message
Micro-benchmark: 2-stages
[Figure: Time / iter (ms, 0–350) vs. machines (4–128) for Apache Spark and Drizzle-Pull]
Workload: Sum of 10k numbers per core, 100 iterations
DRIZZLE
SHUFFLE
BROADCAST
Iteration
Batch Scheduling Pre-Scheduling Shuffles Distributed Shared Variables
Shared variables
Fine-grained updates to shared data Distributed shared variables
- MPI using MPI_AllReduce
- Bloom, CRDTs
- Parameter servers, key-value stores
- reduce followed by broadcast
Drizzle: Replicated shared variables
Custom merge strategies
Commutative updates, e.g. vector addition
[Diagram: each task contributes an update of 1 to the shared variable; MERGE combines them into 2 per entry]
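A minimal sketch of a replicated shared variable with a commutative merge (element-wise vector addition, matching the diagram). `SharedVector` is a hypothetical name, not Drizzle's API:

```scala
// Hypothetical replicated shared variable with a commutative merge.
// Because addition commutes, updates from replicas can arrive in any order
// and still produce the same final value.
class SharedVector(size: Int) {
  val values: Array[Double] = Array.fill(size)(0.0)

  // Merge one task's delta into this replica (element-wise addition).
  def merge(delta: Array[Double]): Unit =
    for (i <- values.indices) values(i) += delta(i)
}

val sv = new SharedVector(2)
sv.merge(Array(1.0, 1.0)) // update from task 1
sv.merge(Array(1.0, 1.0)) // update from task 2
assert(sv.values.toSeq == Seq(2.0, 2.0)) // each entry merged to 2
```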
Enabling async updates
Asynchronous semantics within a batch
Synchronous semantics enforced at batch boundaries
[Diagram: synchronous updates vs. async updates reading stale versions of the shared variable, merged at the batch boundary]
Evaluation
Micro-benchmarks
- Single stage
- Multiple stages
- Shared variables
End-to-end experiments
- Streaming benchmarks
- Logistic Regression
Implemented on Apache Spark 1.6.1 Integrations with Spark Streaming, MLlib
USING DRIZZLE API
DAGScheduler.scala
def runJob(rdd: RDD[T], func: Iterator[T] => U)

Internal
def runBatchJob(rdds: Seq[RDD[T]], funcs: Seq[Iterator[T] => U])
Using drizzle
StreamingContext.scala
def this(sc: SparkContext, batchDuration: Duration)

Spark Streaming (Drizzle)
def this(sc: SparkContext, batchDuration: Duration, jobsPerBatch: Int)
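The extended constructor might be used as below. This is illustrative only: `jobsPerBatch` is Drizzle's extension and not part of stock Spark Streaming, so this fragment will not compile against a standard Spark release.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Duration, StreamingContext}

val conf = new SparkConf().setAppName("drizzle-demo").setMaster("local[4]")
val sc = new SparkContext(conf)

// Drizzle extension (illustrative): group 10 micro-batch jobs of 100ms each
// into one scheduling batch, so the driver schedules once per 10 jobs.
val ssc = new StreamingContext(sc, Duration(100), jobsPerBatch = 10)
```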
b = 1 → Batch processing
Choosing batch size
Higher overhead; smaller window for fault tolerance
In practice: pick a batch large enough that scheduling overhead stays below a fixed threshold
e.g., for 128 machines, a batch of a few seconds is enough
b = N → Parallel operators
Lower overhead; larger window for fault tolerance
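One way to read the heuristic above: the amortized scheduling overhead per iteration is the scheduling delay divided by the batch size, so find the first batch size that drives it below a target fraction of iteration time. `chooseBatchSize` is an illustrative sketch, not Drizzle's actual tuning code:

```scala
// Illustrative sketch: amortized per-iteration scheduling overhead for batch
// size b is schedulingDelayMs / b. Return the first b where that overhead
// drops below targetOverhead * iterationMs (e.g. 5% of iteration time).
def chooseBatchSize(schedulingDelayMs: Double,
                    iterationMs: Double,
                    targetOverhead: Double): Int = {
  var b = 1
  while (schedulingDelayMs / b > targetOverhead * iterationMs) b += 1
  b
}

// e.g. 100ms scheduling delay, 10ms iterations, 5% overhead target:
// need 100 / b <= 0.5, i.e. b >= 200.
assert(chooseBatchSize(100.0, 10.0, 0.05) == 200)
```

Keeping the batch no larger than this keeps the fault-tolerance window small while still amortizing the centralized scheduling cost.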
Evaluation
Micro-benchmarks
- Single stage
- Multiple stages
- Shared variables
End-to-end experiments
- Streaming benchmarks
- Logistic Regression
Implemented on Apache Spark 1.6.1 Integrations with Spark Streaming, MLlib
Streaming BENCHMARK
[Figure: CDF (0–100%) of event latency (ms, 0–1500); Drizzle: 90ms, Spark Streaming: 560ms]
Yahoo Streaming Benchmark: 1M JSON ad-events / second, 64 machines
Event latency: difference between window end and processing end
MLlib: LOGISTIC REGRESSION
[Figure: Time / iter (ms, 0–600) vs. machines (4–128) for Apache Spark and Drizzle]
Sparse Logistic Regression on RCV1: 677K examples, 47K features
Work In progress
Automatic batch size tuning
Open source release
Apache Spark JIRA to discuss potential contribution
conclusion
Overheads when using Apache Spark for streaming, ML workloads
Drizzle: Low Latency Execution Engine
- Decouple execution from centralized scheduling - Milliseconds latency for iterative workloads
Workloads / Questions / Contributions
Shivaram Venkataraman [email protected]