Big Data Processing with Spark - Part II
TRANSCRIPT
SIKS Big Data Course, Part Two. Prof.dr.ir. Arjen P. de Vries ([email protected]), December 7, 2016
Recap: Spark Data Sharing
Crucial for:
- Interactive analysis
- Iterative machine learning algorithms

Spark RDDs
- Distributed collections, cached in memory across cluster nodes

Keep track of lineage
- To ensure fault tolerance
- To optimize processing based on knowledge of the data partitioning
RDDs in More Detail
RDDs additionally provide:
- Control over partitioning, which can be used to optimize data placement across queries
  - usually more efficient than the sort-based approach of MapReduce
- Control over persistence (e.g. store on disk vs in RAM)
- Fine-grained reads (treat RDD as a big table)
Slide by Matei Zaharia, creator of Spark, http://spark-project.org
Scheduling Process

rdd1.join(rdd2).groupBy(…).filter(…)

- RDD Objects: build the operator DAG
- DAGScheduler: splits the graph into stages of tasks and submits each stage as ready; agnostic to operators!
- TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages. A failed stage is reported back to the DAGScheduler.
- Workers: execute tasks in threads, and store and serve blocks via the Block manager

[Diagram: RDD Objects -> (DAG) -> DAGScheduler -> (TaskSet) -> TaskScheduler -> (Task) -> Cluster manager / Workers]
RDD API Example
// Read input file
val input = sc.textFile("input.txt")

val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)   // remove empty lines

val counts = tokenized               // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2)   // 2 = number of result partitions
RDD API Example
// Read input file
val input = sc.textFile("input.txt")

val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)   // remove empty lines

val counts = tokenized               // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b)      // default number of partitions
Transformations
sc.textFile().map().filter().map().reduceByKey()
DAG View of RDDs
textFile() map() filter() map() reduceByKey()
[Diagram: the lineage of the example. HadoopRDD (3 partitions) -> MappedRDD -> FilteredRDD -> MappedRDD (3 partitions each) -> ShuffledRDD (2 partitions); the labels input, tokenized and counts mark the corresponding RDDs.]
Transformations build up a DAG, but don’t “do anything”
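To make this lazy evaluation concrete, here is a small sketch (assuming the input/tokenized/counts definitions from the example above): nothing executes until an action is called.

// The transformations above only recorded a lineage DAG; inspect it:
println(counts.toDebugString)

// An action triggers a job; only now does the whole chain actually run.
val result = counts.collect()   // Array[(String, Int)]
result.foreach(println)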
How runJob Works
Needs to compute its parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).
[Diagram: runJob(counts) walks the lineage back from counts through the mapped and filtered RDDs (tokenized, input) to the HadoopRDD, which has no dependencies.]
Physical Optimizations
1. Certain types of transformations can be pipelined.
2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated.
Pipelining and truncation produce a set of stages, where each stage is composed of tasks.
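As a sketch of the second optimization (reusing the tokenized RDD from the example above), caching an intermediate RDD lets later jobs start from it instead of recomputing the full lineage:

// Mark the intermediate RDD to be kept in memory.
tokenized.cache()

// Job 1: runs the full lineage (read file -> split -> filter) and fills the cache.
val numLines = tokenized.count()

// Job 2: the scheduler finds tokenized's partitions already cached,
// so the graph is truncated there and the input file is not read again.
val levelCounts = tokenized
  .map(words => (words(0), 1))
  .reduceByKey(_ + _)
  .collect()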
Scheduler Optimizations
- Pipelines narrow ops. within a stage
- Picks join algorithms based on partitioning (minimize shuffles)
- Reuses previously cached data

[Diagram: a DAG of RDDs A-G grouped into Stage 1 (groupBy), Stage 2 (map, union) and Stage 3 (join); previously computed partitions are marked as done and are not recomputed.]
Task Details
Stage boundaries are only at input RDDs or "shuffle" operations. So, each task looks like this:

[Diagram: a task fetches map outputs and/or reads external storage, applies the pipelined functions f1, f2, ..., and writes a map output file or returns a result to the master.]
How runJob Works
Needs to compute its parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).

[Diagram: runJob(counts) again, now with the full lineage: HadoopRDD -> MappedRDD -> FilteredRDD -> MappedRDD -> ShuffledRDD (input, tokenized, counts).]
Stage Graph

[Diagram: the job splits into Stage 1 (Tasks 1-3) and Stage 2 (Tasks 1-2); Stage 1 starts with an input read, and the two stages are connected by a shuffle write followed by a shuffle read.]

Each Stage 1 task will:
1. Read Hadoop input
2. Perform maps and filters
3. Write partial sums

Each Stage 2 task will:
1. Read partial sums
2. Invoke the user function passed to runJob.
Physical Execution Model
Distinguish between:
- Jobs: the complete work to be done
- Stages: bundles of work that can execute together
- Tasks: the unit of work; a task corresponds to one RDD partition

Defining stages and tasks should not require deep knowledge of what these actually do
- A goal of Spark is to be extensible, letting users define new RDD operators
RDD Interface
- Set of partitions ("splits")
- List of dependencies on parent RDDs
- Function to compute a partition given its parents
- Optional preferred locations
- Optional partitioning info (Partitioner)

Captures all current Spark operations!
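As a rough sketch in Scala (simplified signatures; the real interface in org.apache.spark.rdd.RDD differs in details such as protected methods and ClassTags):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of the RDD interface listed above.
abstract class SimpleRDD[T] {
  // Set of partitions ("splits")
  def getPartitions: Array[Partition]

  // List of dependencies on parent RDDs
  def getDependencies: Seq[Dependency[_]]

  // Compute one partition, given its parent(s)
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Optional preferred locations (e.g. HDFS block locations)
  def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Optional partitioning info
  val partitioner: Option[Partitioner] = None
}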
Example: HadoopRDD
- partitions = one per HDFS block
- dependencies = none
- compute(partition) = read corresponding block
- preferredLocations(part) = HDFS block location
- partitioner = none
Example: FilteredRDD
- partitions = same as parent RDD
- dependencies = "one-to-one" on parent
- compute(partition) = compute parent and filter it
- preferredLocations(part) = none (ask parent)
- partitioner = none
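A sketch of such a filtered RDD against Spark's real RDD base class (illustrative only; current Spark implements filter via MapPartitionsRDD):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative FilteredRDD: a one-to-one dependency on its parent.
class FilteredRDD[T: ClassTag](prev: RDD[T], pred: T => Boolean)
    extends RDD[T](prev) {   // this constructor records a one-to-one dependency

  // partitions = same as parent RDD
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // compute(partition) = compute the parent partition and filter it
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context).filter(pred)

  // preferred locations: none of its own (by default, the parent is asked)
  // partitioner: none (None by default)
}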
Example: JoinedRDD
- partitions = one per reduce task
- dependencies = "shuffle" on each parent
- compute(partition) = read and join shuffled data
- preferredLocations(part) = none
- partitioner = HashPartitioner(numTasks)

Spark will now know this data is hashed!
Dependency Types

"Narrow" deps (each parent partition is used by at most one child partition):
- map, filter
- union
- join with inputs co-partitioned

"Wide" (shuffle) deps (a parent partition may be used by many child partitions):
- groupByKey
- join with inputs not co-partitioned
Improving Efficiency
Basic principle: avoid shuffling!

- Filter input early
- Avoid groupByKey on pair RDDs: all key-value pairs will be shuffled across the network to a reducer, where the values are collected together (see the sketch below)
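A sketch of the difference for the log-level counting example from earlier: reduceByKey (like aggregateByKey and combineByKey below) pre-aggregates within each partition before shuffling, whereas groupByKey shuffles every single pair.

val pairs = tokenized.map(words => (words(0), 1))

// groupByKey: all (key, 1) pairs cross the network, and all values of a key
// are collected on one reducer before they are summed.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums are computed inside each partition first,
// so only one (key, partialSum) per key and partition is shuffled.
val countsViaReduce = pairs.reduceByKey(_ + _)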
aggregateByKey
Three inputs:
- Zero element
- Merging function within a partition
- Merging function across partitions

// Count values per key (kv is a pair RDD such as RDD[(String, String)])
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2

val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
Combiners!

combineByKey

// Per-key average (input is a pair RDD such as RDD[(String, Int)])
val result = input.combineByKey(
    (v) => (v, 1),                                        // create combiner
    (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),     // merge a value into a combiner
    (acc1: (Int, Int), acc2: (Int, Int)) =>
      (acc1._1 + acc2._1, acc1._2 + acc2._2)              // merge combiners
  ).map {
    case (key, value) => (key, value._1 / value._2.toFloat)
  }

result.collectAsMap().map(println(_))
Control the Degree of Parallelism
Repartition
- Concentrate effort: increase the number of partitions to use more of the cluster's nodes

Coalesce
- Reduce the number of partitions (and thus tasks)
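A small sketch of both (the RDD name records is illustrative):

// repartition: full shuffle into a chosen number of partitions,
// e.g. to spread a few large partitions over more of the cluster's nodes.
val spread = records.repartition(200)

// coalesce: merge into fewer partitions without a full shuffle,
// e.g. after a very selective filter left many nearly empty partitions (and tasks).
val compact = records
  .filter(line => line.contains("ERROR"))
  .coalesce(8)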
Broadcast Values
In case of a join with a small RHS or LHS, broadcast the small set to every node in the cluster.
Broadcast Variables
- Create with SparkContext.broadcast(initVal)
- Access with .value inside tasks
- Immutable! If you modify the broadcast value after creation, that change is local to the node.
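A sketch of such a broadcast ("map-side") join as mentioned on the previous slide, assuming a small lookup table countryNames: Map[String, String] and a large pair RDD visits: RDD[(String, Int)] (both names are illustrative):

// Ship the small side once to every node instead of shuffling the large side.
val bcNames = sc.broadcast(countryNames)

// Map-side join: each task reads the broadcast value locally; no shuffle is needed.
val joined = visits.map { case (countryCode, count) =>
  (bcNames.value.getOrElse(countryCode, "unknown"), count)
}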
Maintaining Partitioning
- mapValues instead of map
- flatMapValues instead of flatMap
  - Good for tokenization! (see the sketch below)
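A sketch of why this matters (the input logLines and the pair RDD byUser are illustrative): mapValues cannot change the keys, so the partitioner is kept, while a plain map drops it.

import org.apache.spark.HashPartitioner

val byUser = logLines
  .map(line => (line.split(" ")(0), line))     // (user, line)
  .partitionBy(new HashPartitioner(16))

// Keeps the HashPartitioner: a later reduceByKey/join on the same keys
// needs no extra shuffle.
val lengths = byUser.mapValues(_.length)

// Produces the same pairs, but the partitioner information is lost.
val lengthsNoPart = byUser.map { case (user, line) => (user, line.length) }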
The best trick of all, however…
Use Higher-Level APIs!
- DataFrame APIs for core processing
  - Work across Scala, Java, Python and R
- Spark ML for machine learning
- Spark SQL for structured query processing
Higher-Level Libraries
Built on the Spark core:
- Spark Streaming: real-time processing
- Spark SQL: structured data
- MLlib: machine learning
- GraphX: graph processing
Combining Processing Types

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.predict(t.location), 1))
  .reduceByWindow("5s", (a, b) => a + b)
Performance of Composition
Separate computing frameworks: each step reads its input from HDFS and writes its output back to HDFS (HDFS read -> HDFS write -> HDFS read -> HDFS write -> ...).

Spark: one HDFS read at the start and one HDFS write at the end; intermediate results stay in memory.
Encode Domain Knowledge
In essence, these libraries are nothing more than pre-cooked code that still operates over the abstraction of RDDs.

They focus on optimizations that require domain knowledge.
Spark MLlib
Data Sets
Challenge: Data Representation
Java objects are often many times larger than the underlying data:

class User(name: String, friends: Array[Int])
User("Bobby", Array(1, 2))

[Diagram: the in-memory layout of this User object: pointers to a String (which itself points to a char[] holding "Bobby") and to an int[] for the friends, adding object headers and references around a few bytes of actual data.]
DataFrames / Spark SQL
Efficient library for working with structured data
- Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs
- Optimized computation and storage underneath

Spark SQL added in 2014, DataFrames in 2015
Spark SQL Architecture

[Diagram: SQL queries and DataFrame programs are parsed into a Logical Plan; the Optimizer, using the Catalog, turns it into a Physical Plan; the Code Generator then produces code that executes over RDDs, with the Data Source API connecting to external data sources.]
DataFrame API
DataFrames hold rows with a known schema and offer relational operations through a DSL:

c = HiveContext()
users = c.sql("select * from users")

ma_users = users[users.state == "MA"]

ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda row: row.user.toUpper())

[Annotation: the expression users.state == "MA" is captured as an expression AST.]
What DataFrames Enable
1. Compact binary representation
   - Columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc.; see the sketch below)
3. Runtime code generation
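For example (a sketch assuming a Spark 2.x SparkSession named spark and a hypothetical Parquet file of users), explain() shows how the optimizer pushes the filter and the column pruning into the scan:

val users = spark.read.parquet("/data/users.parquet")   // path is hypothetical

users
  .filter(users("age") > 20)     // becomes a pushed-down predicate
  .select("name", "age")         // only these columns are read from Parquet
  .explain()                     // prints the optimized physical plan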
Performance

[Performance comparison charts]
Data Sources
Uniform way to access structured data
- Apps can migrate across Hive, Cassandra, JSON, …
- Rich semantics allows query pushdown into data sources

[Diagram: Spark SQL translates the DataFrame expression users[users.age > 20] into a query on the data source (select * from users ...).]
Examples

JSON (tweets.json):
    { "text": "hi", "user": { "name": "bob", "id": 15 } }
    select user.id, text from tweets

JDBC:
    select age from users where lang = "en"

Together:
    select t.text, u.age
    from tweets t, users u
    where t.user.id = u.id
      and u.lang = "en"

Spark SQL joins the {JSON} tweets with the JDBC users table, pushing the query
    select id, age from users where lang = "en"
down into the JDBC data source.
Thanks
Matei Zaharia, MIT (https://cs.stanford.edu/~matei/)
Patrick Wendell, Databricks
http://spark-project.org