Low Latency Execution for Apache Spark
TRANSCRIPT
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout
Who am I?
PhD candidate, AMPLab, UC Berkeley
Dissertation: System design for large-scale machine learning
Apache Spark PMC Member. Contributions to Spark core, MLlib, SparkR
Low latency: SPARK STREAMING
Low latency: execution
Scheduler
Few milliseconds
Large Clusters
Low latency execution engine for iterative workloads
Machine learning Stream processing
State
This talk
Execution Models
Centralized task scheduling
Lineage, Parallel Recovery
Microsoft Dryad
Execution models: batch processing
Task
Control Message Driver
Driver: Coordinate shuffles, data sharing
SHUFFLE
Network Transfer
Iteration
BROADCAST
Execution models: parallel operators
Long-lived operators
Checkpointing based fault tolerance
Naiad
SHUFFLE
SHUFFLE
Iteration
Task
Control Message Driver
Network Transfer
Low Latency
SCALING behavior
[Figure: Median-task time breakdown. Time (ms, 0–250) vs. machines (4–128); components: Network Wait, Compute, Task Fetch, Scheduler Delay]
Cluster: 4-core r3.xlarge machines. Workload: Sum of 10k numbers per core
Can we achieve low latency with Apache Spark?
DESIGN INSIGHT
Fine-grained execution with coarse-grained scheduling
DRIZZLE
SHUFFLE
BROADCAST
Iteration
Batch Scheduling Pre-Scheduling Shuffles Distributed Shared Variables
DAG scheduling
Assign tasks to hosts using (a) locality preferences (b) straggler mitigation (c) fair sharing etc.
Tasks Host1
Host2
Driver
Host1
Host2
Serialize & Launch
Host Metadata
Scheduler
Same DAG structure for many iterations
Can reuse scheduling decisions
Batch scheduling
Schedule a batch of iterations at once
Fault tolerance, scheduling at batch boundaries
1 stage in each iteration
b = 2
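The batch-scheduling idea can be sketched in Scala. `scheduleBatch`, `placeTasks`, and `Assignment` are hypothetical names for illustration, not Drizzle's actual internals:

```scala
// Hypothetical sketch of batch scheduling: compute task placement once,
// then reuse it for b consecutive iterations instead of rescheduling each one.
case class Assignment(taskId: Int, host: String)

def placeTasks(numTasks: Int, hosts: Seq[String]): Seq[Assignment] =
  // Round-robin placement stands in for locality / fairness / straggler logic.
  (0 until numTasks).map(t => Assignment(t, hosts(t % hosts.size)))

def scheduleBatch(b: Int, numTasks: Int, hosts: Seq[String]): Seq[Seq[Assignment]] = {
  val placement = placeTasks(numTasks, hosts) // one scheduling decision...
  Seq.fill(b)(placement)                      // ...amortized over b iterations
}

// With b = 2, both iterations share the same placement.
val batch = scheduleBatch(b = 2, numTasks = 4, hosts = Seq("host1", "host2"))
assert(batch(0) == batch(1))
```

The point of the sketch: the expensive decision (where each task runs) is made once per batch, so the per-iteration scheduling cost drops by a factor of b.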
How much does this help ?
[Figure: Time / iter (ms, log scale 1–1000) vs. machines (4–128) for Apache Spark, Drizzle-10, Drizzle-50, Drizzle-100]
Workload: Sum of 10k numbers per core
Single-stage job, 100 iterations, varying Drizzle batch size
DRIZZLE
SHUFFLE
BROADCAST
Iteration
Batch Scheduling Pre-Scheduling Shuffles Distributed Shared Variables
coordinating shuffles: APACHE SPARK
Task
Control Message
Data Message
Driver
Intermediate Data
Driver sends metadata; tasks pull data
coordinating shuffles: PRE-SCHEDULING
Pre-schedule downstream tasks on executors
Trigger tasks once dependencies are met
Task
Control Message Data Message
Driver
Intermediate Data
Pre-scheduled task
Push-metadata, pull-data
Coalesce metadata across cores; transfer during downstream stage
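One way to picture pre-scheduling is a dependency counter on the executor: the downstream task is placed before its inputs exist and fires once every upstream output has been announced. `PreScheduledTask` is a hypothetical illustration, not Drizzle's implementation:

```scala
// Hypothetical sketch of a pre-scheduled task: it sits on the executor and
// runs as soon as all upstream dependencies report in, with no extra
// round-trip through the driver.
class PreScheduledTask(numDeps: Int, body: () => Unit) {
  private var remaining = numDeps

  // Called when an upstream task pushes its "output ready" metadata.
  def dependencyMet(): Unit = {
    remaining -= 1
    if (remaining == 0) body() // all inputs available: trigger locally
  }
}

var ran = false
val task = new PreScheduledTask(2, () => ran = true)
task.dependencyMet()
assert(!ran) // still waiting on one upstream output
task.dependencyMet()
assert(ran)  // all dependencies met: task runs
```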
Data Message Intermediate Data Metadata Message
Micro-benchmark: 2-stages
[Figure: Time / iter (ms, 0–350) vs. machines (4–128) for Apache Spark and Drizzle-Pull]
Workload: Sum of 10k numbers per core, 100 iterations
DRIZZLE
SHUFFLE
BROADCAST
Iteration
Batch Scheduling Pre-Scheduling Shuffles Distributed Shared Variables
Shared variables
Fine-grained updates to shared data Distributed shared variables
- MPI using MPI_AllReduce
- Bloom, CRDTs
- Parameter servers, key-value stores
- reduce followed by broadcast
Drizzle: Replicated shared variables
Custom merge strategies
Commutative updates, e.g. vector addition
[Diagram: each task contributes an update of 1 to the shared variable; MERGE combines them into 2 per entry]
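A minimal sketch of a replicated shared variable with a commutative merge (element-wise vector addition, matching the diagram). `SharedVector` is a hypothetical name, not Drizzle's API:

```scala
// Hypothetical replicated shared variable with a commutative merge.
// Because addition commutes, updates from replicas can arrive in any order
// and still produce the same final value.
class SharedVector(size: Int) {
  val values: Array[Double] = Array.fill(size)(0.0)

  // Merge one task's delta into this replica (element-wise addition).
  def merge(delta: Array[Double]): Unit =
    for (i <- values.indices) values(i) += delta(i)
}

val sv = new SharedVector(2)
sv.merge(Array(1.0, 1.0)) // update from task 1
sv.merge(Array(1.0, 1.0)) // update from task 2
assert(sv.values.toSeq == Seq(2.0, 2.0)) // each entry merged to 2
```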
Enabling async updates
Asynchronous semantics within a batch
Synchronous semantics enforced at batch boundaries
[Diagram: synchronous updates vs. async updates reading stale versions of the shared variable, merged at the batch boundary]
Evaluation
Micro-benchmarks
- Single stage
- Multiple stages
- Shared variables
End-to-end experiments
- Streaming benchmarks
- Logistic Regression
Implemented on Apache Spark 1.6.1 Integrations with Spark Streaming, MLlib
USING DRIZZLE API
DAGScheduler.scala
def runJob(rdd: RDD[T], func: Iterator[T] => U)

Internal
def runBatchJob(rdds: Seq[RDD[T]], funcs: Seq[Iterator[T] => U])
Using drizzle
StreamingContext.scala
def this(sc: SparkContext, batchDuration: Duration)

Spark Streaming (Drizzle)
def this(sc: SparkContext, batchDuration: Duration, jobsPerBatch: Int)
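The extended constructor might be used as below. This is illustrative only: `jobsPerBatch` is Drizzle's extension and not part of stock Spark Streaming, so this fragment will not compile against a standard Spark release.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Duration, StreamingContext}

val conf = new SparkConf().setAppName("drizzle-demo").setMaster("local[4]")
val sc = new SparkContext(conf)

// Drizzle extension (illustrative): group 10 micro-batch jobs of 100ms each
// into one scheduling batch, so the driver schedules once per 10 jobs.
val ssc = new StreamingContext(sc, Duration(100), jobsPerBatch = 10)
```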
b = 1 → Batch processing
Choosing batch size
Higher overhead; smaller window for fault tolerance
In practice: pick a batch large enough that scheduling overhead stays below a fixed threshold
e.g., for 128 machines, a batch of a few seconds is enough
b = N → Parallel operators
Lower overhead; larger window for fault tolerance
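One way to read the heuristic above: the amortized scheduling overhead per iteration is the scheduling delay divided by the batch size, so find the first batch size that drives it below a target fraction of iteration time. `chooseBatchSize` is an illustrative sketch, not Drizzle's actual tuning code:

```scala
// Illustrative sketch: amortized per-iteration scheduling overhead for batch
// size b is schedulingDelayMs / b. Return the first b where that overhead
// drops below targetOverhead * iterationMs (e.g. 5% of iteration time).
def chooseBatchSize(schedulingDelayMs: Double,
                    iterationMs: Double,
                    targetOverhead: Double): Int = {
  var b = 1
  while (schedulingDelayMs / b > targetOverhead * iterationMs) b += 1
  b
}

// e.g. 100ms scheduling delay, 10ms iterations, 5% overhead target:
// need 100 / b <= 0.5, i.e. b >= 200.
assert(chooseBatchSize(100.0, 10.0, 0.05) == 200)
```

Keeping the batch no larger than this keeps the fault-tolerance window small while still amortizing the centralized scheduling cost.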
Evaluation
Micro-benchmarks
- Single stage
- Multiple stages
- Shared variables
End-to-end experiments
- Streaming benchmarks
- Logistic Regression
Implemented on Apache Spark 1.6.1 Integrations with Spark Streaming, MLlib
Streaming BENCHMARK
[Figure: CDF (0–100%) of event latency (ms, 0–1500); Drizzle: 90ms, Spark Streaming: 560ms]
Yahoo Streaming Benchmark: 1M JSON ad-events / second, 64 machines
Event latency: difference between window end and processing end
MLlib: LOGISTIC REGRESSION
[Figure: Time / iter (ms, 0–600) vs. machines (4–128) for Apache Spark and Drizzle]
Sparse Logistic Regression on RCV1: 677K examples, 47K features
Work In progress
Automatic batch size tuning
Open source release
Apache Spark JIRA to discuss potential contribution
conclusion
Overheads when using Apache Spark for streaming, ML workloads
Drizzle: Low Latency Execution Engine
- Decouple execution from centralized scheduling - Milliseconds latency for iterative workloads
Workloads / Questions / Contributions
Shivaram Venkataraman [email protected]