Big Data Processing with Spark - Part II
TRANSCRIPT
SIKS Big Data Course, Part Two. Prof.dr.ir. Arjen P. de Vries ([email protected]), December 7, 2016
Recap: Spark Data Sharing
Crucial for:
- Interactive analysis
- Iterative machine learning algorithms

Spark RDDs
- Distributed collections, cached in memory across cluster nodes

Keep track of lineage
- To ensure fault tolerance
- To optimize processing based on knowledge of the data partitioning
RDDs in More Detail
RDDs additionally provide:
- Control over partitioning, which can be used to optimize data placement across queries
  - usually more efficient than the sort-based approach of MapReduce
- Control over persistence (e.g. store on disk vs in RAM)
- Fine-grained reads (treat RDD as a big table)
Slide by Matei Zaharia, creator of Spark, http://spark-project.org
Scheduling Process

rdd1.join(rdd2).groupBy(…).filter(…)

- RDD Objects: build the operator DAG
- DAGScheduler: splits the graph into stages of tasks and submits each stage as ready; agnostic to operators!
- TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages. A failed stage is reported back to the DAGScheduler.
- Workers: execute tasks in threads, and store and serve blocks via the Block manager

[Diagram: RDD Objects -> (DAG) -> DAGScheduler -> (TaskSet) -> TaskScheduler -> (Task) -> Cluster manager / Workers]
RDD API Example
// Read input file
val input = sc.textFile("input.txt")

val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)   // remove empty lines

val counts = tokenized               // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2)   // 2 = number of result partitions
RDD API Example
// Read input file
val input = sc.textFile("input.txt")

val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)   // remove empty lines

val counts = tokenized               // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b)      // default number of partitions
Transformations
sc.textFile().map().filter().map().reduceByKey()
DAG View of RDDs
textFile() map() filter() map() reduceByKey()
[Diagram: the lineage of the example. HadoopRDD (3 partitions) -> MappedRDD -> FilteredRDD -> MappedRDD (3 partitions each) -> ShuffledRDD (2 partitions); the labels input, tokenized and counts mark the corresponding RDDs.]
Transformations build up a DAG, but don’t “do anything”
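To make this lazy evaluation concrete, here is a small sketch (assuming the input/tokenized/counts definitions from the example above): nothing executes until an action is called.

// The transformations above only recorded a lineage DAG; inspect it:
println(counts.toDebugString)

// An action triggers a job; only now does the whole chain actually run.
val result = counts.collect()   // Array[(String, Int)]
result.foreach(println)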
How runJob Works
Needs to compute its parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).
[Diagram: runJob(counts) walks the lineage back from counts through the mapped and filtered RDDs (tokenized, input) to the HadoopRDD, which has no dependencies.]
Physical Optimizations
1. Certain types of transformations can be pipelined.
2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated.
Pipelining and truncation produce a set of stages, where each stage is composed of tasks.
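As a sketch of the second optimization (reusing the tokenized RDD from the example above), caching an intermediate RDD lets later jobs start from it instead of recomputing the full lineage:

// Mark the intermediate RDD to be kept in memory.
tokenized.cache()

// Job 1: runs the full lineage (read file -> split -> filter) and fills the cache.
val numLines = tokenized.count()

// Job 2: the scheduler finds tokenized's partitions already cached,
// so the graph is truncated there and the input file is not read again.
val levelCounts = tokenized
  .map(words => (words(0), 1))
  .reduceByKey(_ + _)
  .collect()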
Scheduler Optimizations
- Pipelines narrow ops. within a stage
- Picks join algorithms based on partitioning (minimize shuffles)
- Reuses previously cached data

[Diagram: a DAG of RDDs A-G grouped into Stage 1 (groupBy), Stage 2 (map, union) and Stage 3 (join); previously computed partitions are marked as done and are not recomputed.]
Task Details
Stage boundaries are only at input RDDs or "shuffle" operations. So, each task looks like this:

[Diagram: a task fetches map outputs and/or reads external storage, applies the pipelined functions f1, f2, ..., and writes a map output file or returns a result to the master.]
How runJob Works
Needs to compute its parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).

[Diagram: runJob(counts) again, now with the full lineage: HadoopRDD -> MappedRDD -> FilteredRDD -> MappedRDD -> ShuffledRDD (input, tokenized, counts).]
Stage Graph

[Diagram: the job splits into Stage 1 (Tasks 1-3) and Stage 2 (Tasks 1-2); Stage 1 starts with an input read, and the two stages are connected by a shuffle write followed by a shuffle read.]

Each Stage 1 task will:
1. Read Hadoop input
2. Perform maps and filters
3. Write partial sums

Each Stage 2 task will:
1. Read partial sums
2. Invoke the user function passed to runJob.
Physical Execution Model
Distinguish between:
- Jobs: the complete work to be done
- Stages: bundles of work that can execute together
- Tasks: the unit of work; a task corresponds to one RDD partition

Defining stages and tasks should not require deep knowledge of what these actually do
- A goal of Spark is to be extensible, letting users define new RDD operators
RDD Interface
- Set of partitions ("splits")
- List of dependencies on parent RDDs
- Function to compute a partition given its parents
- Optional preferred locations
- Optional partitioning info (Partitioner)

Captures all current Spark operations!
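As a rough sketch in Scala (simplified signatures; the real interface in org.apache.spark.rdd.RDD differs in details such as protected methods and ClassTags):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of the RDD interface listed above.
abstract class SimpleRDD[T] {
  // Set of partitions ("splits")
  def getPartitions: Array[Partition]

  // List of dependencies on parent RDDs
  def getDependencies: Seq[Dependency[_]]

  // Compute one partition, given its parent(s)
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Optional preferred locations (e.g. HDFS block locations)
  def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Optional partitioning info
  val partitioner: Option[Partitioner] = None
}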
Example: HadoopRDD
- partitions = one per HDFS block
- dependencies = none
- compute(partition) = read corresponding block
- preferredLocations(part) = HDFS block location
- partitioner = none
Example: FilteredRDD
- partitions = same as parent RDD
- dependencies = "one-to-one" on parent
- compute(partition) = compute parent and filter it
- preferredLocations(part) = none (ask parent)
- partitioner = none
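A sketch of such a filtered RDD against Spark's real RDD base class (illustrative only; current Spark implements filter via MapPartitionsRDD):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative FilteredRDD: a one-to-one dependency on its parent.
class FilteredRDD[T: ClassTag](prev: RDD[T], pred: T => Boolean)
    extends RDD[T](prev) {   // this constructor records a one-to-one dependency

  // partitions = same as parent RDD
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // compute(partition) = compute the parent partition and filter it
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context).filter(pred)

  // preferred locations: none of its own (by default, the parent is asked)
  // partitioner: none (None by default)
}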
Example: JoinedRDD
- partitions = one per reduce task
- dependencies = "shuffle" on each parent
- compute(partition) = read and join shuffled data
- preferredLocations(part) = none
- partitioner = HashPartitioner(numTasks)

Spark will now know this data is hashed!
Dependency Types

"Narrow" deps (each parent partition is used by at most one child partition):
- map, filter
- union
- join with inputs co-partitioned

"Wide" (shuffle) deps (a parent partition may be used by many child partitions):
- groupByKey
- join with inputs not co-partitioned
Improving Efficiency
Basic principle: avoid shuffling!

- Filter input early
- Avoid groupByKey on pair RDDs: all key-value pairs will be shuffled across the network to a reducer, where the values are collected together (see the sketch below)
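A sketch of the difference for the log-level counting example from earlier: reduceByKey (like aggregateByKey and combineByKey below) pre-aggregates within each partition before shuffling, whereas groupByKey shuffles every single pair.

val pairs = tokenized.map(words => (words(0), 1))

// groupByKey: all (key, 1) pairs cross the network, and all values of a key
// are collected on one reducer before they are summed.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums are computed inside each partition first,
// so only one (key, partialSum) per key and partition is shuffled.
val countsViaReduce = pairs.reduceByKey(_ + _)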
aggregateByKey
Three inputs:
- Zero element
- Merging function within a partition
- Merging function across partitions

// Count values per key (kv is a pair RDD such as RDD[(String, String)])
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2

val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
Combiners!

combineByKey

// Per-key average (input is a pair RDD such as RDD[(String, Int)])
val result = input.combineByKey(
    (v) => (v, 1),                                        // create combiner
    (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),     // merge a value into a combiner
    (acc1: (Int, Int), acc2: (Int, Int)) =>
      (acc1._1 + acc2._1, acc1._2 + acc2._2)              // merge combiners
  ).map {
    case (key, value) => (key, value._1 / value._2.toFloat)
  }

result.collectAsMap().map(println(_))
Control the Degree of Parallelism
Repartition
- Concentrate effort: increase the number of partitions to use more of the cluster's nodes

Coalesce
- Reduce the number of partitions (and thus tasks)
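A small sketch of both (the RDD name records is illustrative):

// repartition: full shuffle into a chosen number of partitions,
// e.g. to spread a few large partitions over more of the cluster's nodes.
val spread = records.repartition(200)

// coalesce: merge into fewer partitions without a full shuffle,
// e.g. after a very selective filter left many nearly empty partitions (and tasks).
val compact = records
  .filter(line => line.contains("ERROR"))
  .coalesce(8)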
Broadcast Values
In case of a join with a small RHS or LHS, broadcast the small set to every node in the cluster.
Broadcast Variables
- Create with SparkContext.broadcast(initVal)
- Access with .value inside tasks
- Immutable! If you modify the broadcast value after creation, that change is local to the node.
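A sketch of such a broadcast ("map-side") join as mentioned on the previous slide, assuming a small lookup table countryNames: Map[String, String] and a large pair RDD visits: RDD[(String, Int)] (both names are illustrative):

// Ship the small side once to every node instead of shuffling the large side.
val bcNames = sc.broadcast(countryNames)

// Map-side join: each task reads the broadcast value locally; no shuffle is needed.
val joined = visits.map { case (countryCode, count) =>
  (bcNames.value.getOrElse(countryCode, "unknown"), count)
}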
Maintaining Partitioning
- mapValues instead of map
- flatMapValues instead of flatMap
  - Good for tokenization! (see the sketch below)
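A sketch of why this matters (the input logLines and the pair RDD byUser are illustrative): mapValues cannot change the keys, so the partitioner is kept, while a plain map drops it.

import org.apache.spark.HashPartitioner

val byUser = logLines
  .map(line => (line.split(" ")(0), line))     // (user, line)
  .partitionBy(new HashPartitioner(16))

// Keeps the HashPartitioner: a later reduceByKey/join on the same keys
// needs no extra shuffle.
val lengths = byUser.mapValues(_.length)

// Produces the same pairs, but the partitioner information is lost.
val lengthsNoPart = byUser.map { case (user, line) => (user, line.length) }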
The best trick of all, however…
Use Higher-Level APIs!
- DataFrame APIs for core processing
  - Work across Scala, Java, Python and R
- Spark ML for machine learning
- Spark SQL for structured query processing
Higher-Level Libraries
Built on the Spark core:
- Spark Streaming: real-time processing
- Spark SQL: structured data
- MLlib: machine learning
- GraphX: graph processing
Combining Processing Types

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.predict(t.location), 1))
  .reduceByWindow("5s", (a, b) => a + b)
Performance of Composition
Separate computing frameworks: each step reads its input from HDFS and writes its output back to HDFS (HDFS read -> HDFS write -> HDFS read -> HDFS write -> ...).

Spark: one HDFS read at the start and one HDFS write at the end; intermediate results stay in memory.
Encode Domain Knowledge
In essence, these libraries are nothing more than pre-cooked code that still operates over the abstraction of RDDs.

They focus on optimizations that require domain knowledge.
Spark MLlib
Data Sets
Challenge: Data Representation
Java objects are often many times larger than the underlying data:

class User(name: String, friends: Array[Int])
User("Bobby", Array(1, 2))

[Diagram: the in-memory layout of this User object: pointers to a String (which itself points to a char[] holding "Bobby") and to an int[] for the friends, adding object headers and references around a few bytes of actual data.]
DataFrames / Spark SQL
Efficient library for working with structured data
- Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs
- Optimized computation and storage underneath

Spark SQL added in 2014, DataFrames in 2015
Spark SQL Architecture

[Diagram: SQL queries and DataFrame programs are parsed into a Logical Plan; the Optimizer, using the Catalog, turns it into a Physical Plan; the Code Generator then produces code that executes over RDDs, with the Data Source API connecting to external data sources.]
DataFrame API
DataFrames hold rows with a known schema and offer relational operations through a DSL:

c = HiveContext()
users = c.sql("select * from users")

ma_users = users[users.state == "MA"]

ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda row: row.user.toUpper())

[Annotation: the expression users.state == "MA" is captured as an expression AST.]
What DataFrames Enable
1. Compact binary representation
   - Columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc.; see the sketch below)
3. Runtime code generation
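For example (a sketch assuming a Spark 2.x SparkSession named spark and a hypothetical Parquet file of users), explain() shows how the optimizer pushes the filter and the column pruning into the scan:

val users = spark.read.parquet("/data/users.parquet")   // path is hypothetical

users
  .filter(users("age") > 20)     // becomes a pushed-down predicate
  .select("name", "age")         // only these columns are read from Parquet
  .explain()                     // prints the optimized physical plan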
Performance

[Performance comparison charts]
Data Sources
Uniform way to access structured data
- Apps can migrate across Hive, Cassandra, JSON, …
- Rich semantics allows query pushdown into data sources

[Diagram: Spark SQL translates the DataFrame expression users[users.age > 20] into a query on the data source (select * from users ...).]
Examples

JSON (tweets.json):
    { "text": "hi", "user": { "name": "bob", "id": 15 } }
    select user.id, text from tweets

JDBC:
    select age from users where lang = "en"

Together:
    select t.text, u.age
    from tweets t, users u
    where t.user.id = u.id
      and u.lang = "en"

Spark SQL joins the {JSON} tweets with the JDBC users table, pushing the query
    select id, age from users where lang = "en"
down into the JDBC data source.
Thanks
Matei Zaharia, MIT (https://cs.stanford.edu/~matei/)
Patrick Wendell, Databricks
http://spark-project.org