what is spark? generalcs162/fa17/static/lectures/25.pdf · spark: unified engine across spark sql...

CS162�Operating Systems and�Systems Programming�

Lecture 25 �Apache Spark – Berkeley Data Analytics Stack (BDAS)

November 27th, 2017Prof. Ion Stoica

http://cs162.eecs.Berkeley.edu

Lec 24.211/27/17 CS162 © UCB Fall 2017

A Short History

•  Started at UC Berkeley in 2009

•  Open Source: 2010

•  Apache Project: 2013

•  Today: most popular big data processing engine

Lec 24.311/27/17 CS162 © UCB Fall 2017

What Is Spark?

•  Parallel execution engine for big data processing

•  Easy to use: 2-5x less code than Hadoop MR– High level API’s in Python, Java, and Scala

•  Fast: up to 100x faster than Hadoop MR– Can exploit in-memory when available–  Low overhead scheduling, optimized engine

•  General: support multiple computation models

Lec 24.411/27/17 CS162 © UCB Fall 2017

General

•  Unifies batch, interactive computations

Spark Core

Spark SQL

Lec 24.511/27/17 CS162 © UCB Fall 2017

General

•  Unifies batch, interactive, streaming computations

Spark Core

Spark SQL Spark Streaming

Lec 24.611/27/17 CS162 © UCB Fall 2017

General

•  Unifies batch, interactive, streaming computations•  Easy to build sophisticated applications

–  Support iterative, graph-parallel algorithms–  Powerful APIs in Scala, Python, Java

Spark Core

Spark Streaming Spark SQL MLlib GraphX

Lec 24.711/27/17 CS162 © UCB Fall 2017

Easy to Write Code

WordCount in 50+ lines of Java MR

WordCount in 3 lines of Spark

Lec 24.811/27/17 CS162 © UCB Fall 2017

Large-Scale Usage

Largest cluster: 8000 nodes

Largest single job: 1 petabyte

Top streaming intake: 1 TB/hour

2014 on-disk sort record

Lec 24.911/27/17 CS162 © UCB Fall 2017

Fast: Time to sort 100TB

2100 machines 2013 Record: Hadoop

2014 Record: Spark

Source: Daytona GraySort benchmark, sortbenchmark.org

72 minutes

207 machines

23 minutes

Also sorted 1PB in 4 hours

Lec 24.1011/27/17 CS162 © UCB Fall 2017

RDD: Core Abstraction

•  Resilient Distributed Datasets (RDDs)– Collections of objects distr. across a

cluster, stored in RAM or on Disk–  Built through parallel transformations– Automatically rebuilt on failure

Operations•  Transformations�

(e.g. map, filter, groupBy)•  Actions�

(e.g. count, collect, save)

Write programs in terms of distributed datasets and operations on them

Lec 24.1111/27/17 CS162 © UCB Fall 2017

Operations on RDDs

•  Transformations f(RDD) => RDD§  Lazy (not computed immediately)§  E.g. “map”

•  Actions:§  Triggers computation§  E.g. “count”, “saveAsTextFile”

Lec 24.1211/27/17 CS162 © UCB Fall 2017

Working With RDDs

RDD

textFile = sc.textFile(”SomeFile.txt”) !

Lec 24.1311/27/17 CS162 © UCB Fall 2017

Working With RDDs

RDDRDDRDDRDD

Transformations

linesWithSpark = textFile.filter(lambda line: "Spark” in line) !


Lec 24.1411/27/17 CS162 © UCB Fall 2017

Working With RDDs

RDDRDDRDDRDD

Transformations

Action Value

linesWithSpark = textFile.filter(lambda line: "Spark” in line) !

linesWithSpark.count()!74!!linesWithSpark.first()!# Apache Spark!


Lec 24.1511/27/17 CS162 © UCB Fall 2017

Load error messages from a log into memory, then interactively search for various patterns

Example: Log Mining

Lec 24.1611/27/17 CS162 © UCB Fall 2017


Worker

Worker

Worker

Driver

Example: Log Mining

Lec 24.1711/27/17 CS162 © UCB Fall 2017


Worker

Worker

Worker

Driver

lines = spark.textFile(“hdfs://...”)

Example: Log Mining

Lec 24.1811/27/17 CS162 © UCB Fall 2017


Worker

Worker

Worker

Driver


Base RDD

Example: Log Mining

Lec 24.1911/27/17 CS162 © UCB Fall 2017



errors = lines.filter(lambda s: s.startswith(“ERROR”))

Worker

Worker

Worker

Driver

Example: Log Mining

Lec 24.2011/27/17 CS162 © UCB Fall 2017




Worker

Worker

Worker

Driver

Transformed RDD

Example: Log Mining

Lec 24.2111/27/17 CS162 © UCB Fall 2017




messages = errors.map(lambda s: s.split(“\t”)[2])

messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Example: Log Mining

Lec 24.2211/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count() Action

Example: Log Mining

Lec 24.2311/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker

Worker

Driver

messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Example: Log Mining

Lec 24.2411/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker

Worker messages.filter(lambda s: “mysql” in s).count()

Block 1

Block 2

Block 3

Driver tasks

tasks

tasks

Example: Log Mining

Lec 24.2511/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Driver

ReadHDFS Block

ReadHDFS Block

Read HDFS Block

Example: Log Mining

Lec 24.2611/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

Process& Cache Data

Process& Cache Data

Process& Cache Data

Example: Log Mining

Lec 24.2711/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3

results

results

results

Example: Log Mining

Lec 24.2811/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Driver

Cache 1

Cache 2

Cache 3 messages.filter(lambda s: “php” in s).count()

Example: Log Mining

Lec 24.2911/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Cache 1

Cache 2


tasks

tasks

tasks

Driver

Example: Log Mining

Lec 24.3011/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Cache 1

Cache 2


Driver

ProcessfromCache

ProcessfromCache

ProcessfromCache

Example: Log Mining

Lec 24.3111/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Cache 1

Cache 2


Driver results

results

results

Example: Log Mining

Lec 24.3211/27/17 CS162 © UCB Fall 2017





messages.cache()

Worker

Worker


Block 1

Block 2

Block 3

Cache 1

Cache 2


Driver

Cache your data è Faster Results Full-text search of Wikipedia •  60GB on 20 EC2 machines •  0.5 sec from mem vs. 20s for on-disk

Example: Log Mining

Lec 24.3311/27/17 CS162 © UCB Fall 2017

Language Support

Standalone Programs• Python, Scala, & Java

Interactive Shells• Python & Scala

Performance• Java & Scala are faster due to

static typing• …but Python is often fine

Python

lines = sc.textFile(...) lines.filter(lambda s: “ERROR” in s).count()

Scala

val lines = sc.textFile(...) lines.filter(x => x.contains(“ERROR”)).count()

Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); } }).count();

Lec 24.3411/27/17 CS162 © UCB Fall 2017

Expressive API

•  map reduce

Lec 24.3511/27/17 CS162 © UCB Fall 2017

Expressive API

•  map •  filter •  groupBy •  sort •  union •  join •  leftOuterJoin •  rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save ...

Lec 24.3611/27/17 CS162 © UCB Fall 2017

Fault Recovery

RDDs track lineage information that can be used to efficiently reconstruct lost partitions

Lec 24.3711/27/17 CS162 © UCB Fall 2017

Fault Recovery Example

•  Two-partition RDD A={A1, A2} stored on disk1)  filter and cache à RDD B2)  joinà RDD C3)  aggregate à RDD D

A1

A2

RDD

A

agg. D

filter

filter B2

B1 join

join C2

C1

Lec 24.3811/27/17 CS162 © UCB Fall 2017


•  C1 lost due to node failure before reduce finishes

A1

A2

RDD

A

agg. D

filter

filter B2

B1 join

join C2

C1

Lec 24.3911/27/17 CS162 © UCB Fall 2017


•  C1 lost due to node failure before reduce finishes•  Reconstruct C1, eventually, on different node

A1

A2

RDD

A

agg. D

filter

filter B2

B1

join C2

agg. D

join C1

Lec 24.4011/27/17 CS162 © UCB Fall 2017

•  Many important apps must process large data streams at second-scale latencies

–  Site statistics, intrusion detection, online ML

•  To build and scale these apps users want:–  Integration: with offline analytical stack–  Fault-tolerance: both for crashes and stragglers

Spark Streaming: Motivation

Lec 24.4111/27/17 CS162 © UCB Fall 2017

How does it work?

•  Data streams are chopped into batches – A batch is an RDD holding a few 100s ms worth of data

•  Each batch is processed in Spark

data streams

rece

iver

s

batches

Lec 24.4211/27/17 CS162 © UCB Fall 2017

Streaming

How does it work?

•  Data streams are chopped into batches – A batch is an RDD holding a few 100s ms worth of data

•  Each batch is processed in Spark•  Results pushed out in batches

data streams

rece

iver

s

batches results

Lec 24.4311/27/17 CS162 © UCB Fall 2017

Streaming Word Count

val lines = context.socketTextStream(“localhost”, 9999)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()

ssc.start() print some counts on screen

count the words

split lines into words

create DStream from data over socket

start processing the stream

Lec 24.4411/27/17 CS162 © UCB Fall 2017

Word Count

object NetworkWordCount { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName("NetworkWordCount") val context = new StreamingContext(sparkConf, Seconds(1)) val lines = context.socketTextStream(“localhost”, 9999) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } }

Lec 24.4511/27/17 CS162 © UCB Fall 2017

Word Count

object NetworkWordCount { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName("NetworkWordCount") val context = new StreamingContext(sparkConf, Seconds(1)) val lines = context.socketTextStream(“localhost”, 9999) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } }

Spark Streaming

public class WordCountTopology { public static class SplitSentence extends ShellBolt implements IRichBolt { public SplitSentence() { super("python", "splitsentence.py"); } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } @Override public Map<String, Object> getComponentConfiguration() { return null; } } public static class WordCount extends BaseBasicBolt { Map<String, Integer> counts = new HashMap<String, Integer>(); @Override public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.get(word); if (count == null) count = 0; count++; counts.put(word, count); collector.emit(new Values(word, count)); } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word", "count")); } }

Storm

public static void main(String[] args) throws Exception { TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); Config conf = new Config(); conf.setDebug(true); if (args != null && args.length > 0) { conf.setNumWorkers(3); StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology()); } else { conf.setMaxTaskParallelism(3); LocalCluster cluster = new LocalCluster(); cluster.submitTopology("word-count", conf, builder.createTopology()); Thread.sleep(10000); cluster.shutdown(); } } }

Lec 24.4611/27/17 CS162 © UCB Fall 2017

119

57 56 58 5881

57 59 57 59

020406080100120140

1 2 3 4 5 6 7 8 9 10

Iteratrion

time(s)

Iteration

Failurehappens


Lec 24.4711/27/17 CS162 © UCB Fall 2017

Administrivia (1/2)

• Midterm 3 coming up on Wen 11/29 6:30-8PM– All topics up to and including Lecture 24

» Focus will be on Lectures 17 – 24 and associated readings, and Projects 3

» But expect 20-30% questions from materials from Lectures 1-16

– Closed book– 2 sides hand-written notes both sides

Lec 24.4811/27/17 CS162 © UCB Fall 2017

Administrivia (2/2)

•  Exam location: Assigned based on Cal student ID number: –  ends in 0, 1, 2, 3, 4: Li Ka Shing 245–  ends in 5, 6, 7: GPB 100–  ends in 8, 9: Kroeber 160

• Rest of today– 7:45pm: HKN class survey– 8-9:30pm Review (VLSB 2050)

Lec 24.4911/27/17 CS162 © UCB Fall 2017

BREAK

Lec 24.5011/27/17 CS162 © UCB Fall 2017

From RDDs to DataFrames

Spark early adopters

Data Engineers Data Scientists Statisticians R users PyData …

Users

Understands MapReduce

& functional APIs

Lec 24.5111/27/17 CS162 © UCB Fall 2017 Lec 24.5211/27/17 CS162 © UCB Fall 2017

DataFrames in Spark

Distributed collection of data grouped into named columns �(i.e. RDD with schema)Domain-specific functions designed for common tasks

– Metadata–  Sampling–  Project, filter, aggregation, join, …– UDFs

Available in Python, Scala, Java, and R

Lec 24.5311/27/17 CS162 © UCB Fall 2017

Spark DataFrame

•  >head(filter(df,df$waiting<50))#anexampleinR•  ##eruptionswaiting•  ##11.75047•  ##21.75047•  ##31.86748

Similar APIs as single-node tools like Pandas, R, i.e., easy to learn

*Old Faithful geyser data from http://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat

Lec 24.5411/27/17 CS162 © UCB Fall 2017

Spark RDD Execution

Java/Scala frontend

JVM backend

Python frontend

Python backend

opaque closures (user-defined functions)

Lec 24.5511/27/17 CS162 © UCB Fall 2017

Spark DataFrame Execution

DataFrame frontend

Logical Plan

Physical execution

Catalyst optimizer

Intermediate representation for computation

Lec 24.5611/27/17 CS162 © UCB Fall 2017

Spark DataFrame Execution

Python DF

Logical Plan

Physical execution

Catalyst optimizer

Java/Scala DF

R DF

Intermediate representation for computation

Simple wrappers to create logical plan

Lec 24.5711/27/17 CS162 © UCB Fall 2017

Performance

0 2 4 6 8 10

Java/Scala

Python

Runtime for an example aggregation workload (sec)

RDD

Lec 24.5811/27/17 CS162 © UCB Fall 2017

Benefit of Logical Plan: �Performance Parity Across Languages

0 2 4 6 8 10

Java/Scala

Python

Java/Scala

Python

R

SQL

Runtime for an example aggregation workload (sec)

DataFrame

RDD

Lec 24.5911/27/17 CS162 © UCB Fall 2017

Further Optimizations

•  Whole-stage code generation–  Remove expensive iterator calls–  Fuse across multiple operators

Spark 1.6 14M rows/s

Spark 2.0 125M

rows/s

Parquet in 1.6

11M rows/s

Parquet in 2.0

90M rows/s

Optimized input / output •  Parquet + built-in cache

Automatically applies to SQL, DataFrames, Datasets Lec 24.6011/27/17 CS162 © UCB Fall 2017

0

100

200

300

400

500

600

q1 q11

q13

q15

q17

q19

q2

q21

q25

q26

q28

q29 q3

q3

1 q3

4 q3

7 q3

8 q3

9a q3

9b

q4

q40

q42

q43

q46

q48

q50

q52

q55

q59 q6

1 q6

2 q6

6 q6

8 q7

q71

q73

q74

q75

Runt

ime

(sec

onds

)

TPC-DS Spark 2.0 vs 1.6 – Lower is Better

Time (1.6)

Time (2.0)

Lec 24.6111/27/17 CS162 © UCB Fall 2017

Processing Business logic change & new ops

(windows, sessions)

From Streaming to Structured Streaming

Output How do we define

output over time & correctness?

Data Late arrival, varying distribution over time, …

Lec 24.6211/27/17 CS162 © UCB Fall 2017

The simplest way to perform streaming analytics is not having to reason about streaming

Lec 24.6311/27/17 CS162 © UCB Fall 2017

Spark 2.0 Infinite DataFrames

Spark 1.3 Static DataFrames

Single API ! Lec 24.6411/27/17 CS162 © UCB Fall 2017

Structured Streaming

High-level streaming API built on DataFrames• Event time, windowing, sessions, sources & sinks

Also supports interactive & batch queries• Aggregate data in a stream, then serve using JDBC• Change queries at runtime• Build and apply ML models

Not just streaming, but “continuous applications”

Lec 24.6511/27/17 CS162 © UCB Fall 2017

logs = ctx.read.format("json").open("s3://logs")

logs.groupBy(“userid”, “hour”).avg(“latency”)

.write.format("jdbc")

.save("jdbc:mysql//...")

Example: Batch Aggregation

Lec 24.6611/27/17 CS162 © UCB Fall 2017

logs = ctx.read.format("json").stream("s3://logs")

logs.groupBy(“userid”, “hour”).avg(“latency”)

.write.format("jdbc")

.startStream("jdbc:mysql//...")

Example: Continuous Aggregation

Lec 24.6711/27/17 CS162 © UCB Fall 2017

Goal: end-to-end continuous applications

Example

Reporting

Applications

ML Model

Ad-hoc Queries

Traditional streaming Other processing types

Kafka Database ETL

Lec 24.6811/27/17 CS162 © UCB Fall 2017

Apache Spark Today

•  > 1,500 contributors•  > 400K meetup members world wide•  > 300K students trained world wide

•  1,000s deployments in productions–  Virtually every large enterprise– Available in all clouds (e.g., AWS, Google Compute Engine, MS

Azure, IBM, …)– Distributed by IBM, Cloudera, Hortonworks, MapR, Oracle, …

•  Databricks, startup to commercialize Apache Spark

Lec 24.6911/27/17 CS162 © UCB Fall 2017

{JSON}

Data Sources

Spark Core

DataFrames ML Pipelines

Spark Streaming

Spark SQL MLlib GraphX Spark: unified engine across data sources, workloads and environments

Lec 24.7011/27/17 CS162 © UCB Fall 2017

Summary

•  Server à Datacenter•  OS à Datacenter OS (e.g., Apache Mesos)•  Applications à Big data / ML applications (e.g., Apache Spark)

•  AMPLab– Massive success in industry,…–  and, academia:

»  Faculty at MIT, Stanford, CMU, Cornell, etc»  Two ACM Dissertation Awards

•  New lab starting: RISELab

RISELab �(Real-time Intelligent Secure Execution)

Lec 24.7211/27/17 CS162 © UCB Fall 2017

From batch data to advanced analytics

AMPLab

From live data to real-time decisions

RISELab

Lec 24.7311/27/17 CS162 © UCB Fall 2017

RISE stack

scheduler object store RISE μkernel

Ray Clipper …

Ground (data context service)

Tegra

optimizer

Secure Analytics &

ML

Tim

e M

achi

ne

SGX

what is spark? generalcs162/fa17/static/lectures/25.pdf · spark: unified engine across spark sql...

Documents