Apache Spark: Lightning Fast Cluster Computing
Michael Armbrust (@michaelarmbrust)
Reflections | Projections 2015
What is Apache Spark?
Fast and general computing engine for clusters, created by students at UC Berkeley
• Makes it easy to process large (GB-PB) datasets
• Support for Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop Map/Reduce for some applications
Spark Model
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, …)
> Automatically rebuilt on failure
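As a concrete illustration of this model, here is a minimal, self-contained PySpark sketch; the local master setting and the toy data are illustrative, not from the talk.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-model-sketch")   # hypothetical local setup

# A base RDD: a distributed collection built from an in-memory list.
nums = sc.parallelize(range(1, 1001))

# Transformations are lazy and run in parallel across partitions.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# cache() asks Spark to keep this RDD in memory across the cluster.
evens.cache()

# Actions trigger computation; lost partitions are rebuilt from the lineage above.
print(evens.count())
print(evens.take(5))

sc.stop()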
Example: Log Mining
Load messages from a log file into memory, then interactively search for the problem
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

[Diagram: the driver ships tasks to three workers; each worker reads one block of the file (Block 1-3), caches its partition of messages in memory (Cache 1-3), and returns results to the driver]
messages.filter(lambda x: "foo" in x).count()
messages.filter(lambda x: "bar" in x).count()
. . .
(lines is the base RDD; errors and messages are transformed RDDs; count() is an action)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda kv: kv[1] > 10)
RDDs track lineage info to rebuild lost data
[Diagram: lineage graph: input file → map → reduceByKey → filter]
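A runnable version of the snippet above, for readers following along; the input file name and the tab-separated record format are assumptions, and the final lambda uses indexing because Python 3 lambdas cannot unpack tuples.

from collections import namedtuple
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-sketch")

# Hypothetical input: tab-separated lines whose first field is the record type.
Record = namedtuple("Record", ["type", "payload"])
records = sc.textFile("records.tsv") \
    .map(lambda line: Record(*line.split("\t", 1)))

counts = records.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda kv: kv[1] > 10)

# If a partition of `counts` is lost, Spark replays map -> reduceByKey -> filter
# over the needed input partitions instead of failing the job.
print(counts.collect())
sc.stop()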
Speed-up ML Using Memory
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: ~110 s per iteration; Spark: ~80 s for the first iteration, ~1 s for further iterations]
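The gain comes from keeping the training data in memory between iterations. A hedged sketch of the pattern; the file name, feature layout, and update rule are made up for illustration.

# Each iteration re-scans `points`; with cache() only the first scan hits disk.
points = sc.textFile("points.txt") \
    .map(lambda line: [float(v) for v in line.split()]) \
    .cache()
n = points.count()

w = 0.0
for i in range(20):
    # capture the current weight via a default argument so each job sees it
    grad = points.map(lambda p, w=w: (w * p[0] - p[1]) * p[0]).sum()
    w -= 0.1 * grad / n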
On-Disk Sort Record: Time to sort 100 TB
2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes
Also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
Higher-Level Libraries
Built on the Spark core engine:
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph processing
Seamlessly mix components:
// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...) \
    .map(lambda t: (model.predict(t.location), 1)) \
    .reduceByWindow("5s", lambda a, b: a + b)
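The snippet above is slide-level pseudocode (there is no twitterStream method on a SparkContext). Here is a hedged, runnable sketch of the same windowed-count idea using the public PySpark Streaming API, with a local socket standing in for the tweet stream.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, 1)                      # 1-second batches

# Hypothetical source: lines from a local socket instead of Twitter.
lines = ssc.socketTextStream("localhost", 9999)

counts = lines.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKeyAndWindow(lambda a, b: a + b, None, 5, 1)   # 5 s window, sliding every 1 s

counts.pprint()
ssc.start()
ssc.awaitTermination()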
Powerful Stack – Agile Development
[Chart: non-test, non-example source lines of code (0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), and Giraph (Graph), compared with Spark plus its Streaming, SparkSQL, and GraphX components; each Spark library is a small addition on top of the shared core. "Your App?"]
Open Source Ecosystem
[Slide: logos of applications, environments, and data sources that integrate with Spark]
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org
Spark Community
Get involved: check us out at github.com/apache/spark and contribute code through pull requests
Best way to get started is to fix a bug
Don't forget to write a test!
About Databricks
• The hardest part of using Spark is managing 100s of machines.
• Databricks makes this easy.
Founded by creators of Spark and remains largest contributor.
Demo
Using Spark to analyze emoji use on Twitter
What's next for Spark?
Spark + declarative programming
Creating and Running Spark Programs Faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
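A small PySpark sketch of that abstraction; the rows and the SQLContext setup are illustrative, not from the talk.

from pyspark.sql import SQLContext, Row

sqlCtx = SQLContext(sc)

# A DataFrame: rows with named columns, distributed across the cluster.
people = sqlCtx.createDataFrame([
    Row(name="Alice", age=34),
    Row(name="Bob", age=29),
])

people.filter(people.age > 30).select("name").show()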
Write Less Code: Compute an Average

Hadoop MapReduce:

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (RDDs):

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Write Less Code: Compute an Average

Using RDDs:

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:

sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()

Using SQL:

SELECT name, avg(age) FROM people GROUP BY name
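To run the SQL version from Python, the DataFrame can be exposed as a table first; a sketch assuming the `people` DataFrame and a Spark 1.x SQLContext.

# (the DataFrame variant above also needs: from pyspark.sql.functions import avg)
people.registerTempTable("people")
averages = sqlCtx.sql("SELECT name, avg(age) FROM people GROUP BY name")
print(averages.collect())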
Not Just Less Code: Faster Implementations
[Chart: time to aggregate 10 million int pairs (secs), 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3; fitting the pipeline produces a Pipeline Model containing lr.model]
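Once fit, the whole pipeline is applied in one call; a sketch, with the test-data path and output columns assumed.

test_df = sqlCtx.load("/path/to/test")
predictions = model.transform(test_df)        # runs tokenizer -> hashingTF -> lr
predictions.select("text", "prediction").show()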
Optimization happens as late as possible, therefore Spark SQL can optimize across functions.
def add_demographics(events):
    u = sqlCtx.table("users")                          # Load Hive table
    return (events
            .join(u, events.user_id == u.user_id)      # Join on user_id
            .withColumn("city", zipToCity(u.zip)))     # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Champaign") \
    .select(events.timestamp) \
    .collect()

[Diagram: logical plan: a filter on top of a join of the events file with the users table (the join is expensive; ideally only the relevant users are joined). Physical plan: join of scan (events) with a filter over scan (users)]
The same query, with the events stored as Parquet:

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Champaign") \
    .select(events.timestamp) \
    .collect()

[Diagram: the logical plan is unchanged, but the physical plan now uses predicate pushdown and column pruning: a join of an optimized scan (events) with an optimized scan (users)]
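One way to see these optimizations for yourself is to print the plans; a sketch assuming the `events` DataFrame above.

events.where(events.city == "Champaign") \
    .select(events.timestamp) \
    .explain(True)    # prints the parsed, analyzed, optimized, and physical plans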
Plan Optimization & Execution
[Diagram: a SQL AST or a DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation turns it into RDDs]
DataFrames and SQL share the same optimization/execution pipeline
Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
[Diagram: Original Plan: Project(name) over Filter(id = 1) over Project(id, name) over People; after Filter Push-Down, the filter moves below Project(id, name)]
Prior Work: Optimizer Generators (Volcano / Cascades)
• Create a custom language for expressing rules that rewrite trees of relational operators.
• Build a compiler that generates executable code for these rules.
Cons: developers need to learn this custom language, and the language might not be powerful enough.
Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if (f.references subsetOf grandChild.output) =>
    p.copy(child = f.copy(child = grandChild))
}

The rule is a partial function applied over the query plan tree:
• The case pattern finds a Filter on top of a Project (Scala pattern matching).
• The guard checks that the filter can be evaluated without the result of the project (Catalyst attribute reference tracking).
• If so, the operators are switched (Scala copy constructors).
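For readers who do not know Scala, here is a toy Python analogue of the same rule (not Catalyst's API): operators are plain records, and the rewrite swaps a Filter sitting on a Project when the filter only needs the grandchild's columns.

from collections import namedtuple

Project = namedtuple("Project", ["columns", "child"])
Filter = namedtuple("Filter", ["references", "condition", "child"])
Table = namedtuple("Table", ["name", "output"])

def push_down_filter(plan):
    # find a Filter on top of a Project
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project, grand_child = plan.child, plan.child.child
        # check the filter only references columns the grandchild produces
        if set(plan.references) <= set(grand_child.output):
            # switch the operators: Project(Filter(grandChild))
            return project._replace(child=plan._replace(child=grand_child))
    return plan

people = Table("people", output=["id", "name"])
plan = Filter(references=["id"], condition="id = 1",
              child=Project(columns=["id", "name"], child=people))
print(push_down_filter(plan))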
Optimizing with Rules
[Diagram: Original Plan: Project(name) over Filter(id = 1) over Project(id, name) over People. Filter Push-Down moves the filter below the projections; Combine Projection merges the two projections; the final Physical Plan is an index lookup on id = 1 returning name]
Coming Soon: Datasets
• Type-safe: operate on domain objects with compiled lambda functions
• Fast: code-generated encoders for fast serialization
• Interoperable: easily convert DataFrames to Datasets without boilerplate

val df = ctx.read.json("people.json")

// Convert to custom objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
ds.groupBy(_.name).mapGroups { case (name, people) =>
  val buckets = new Array[Int](10)
  people.map(_.age).foreach { a =>
    buckets(a / 10) += 1
  }
  (name, buckets)
}
Questions?
https://databricks.com/company/careers
https://github.com/apache/spark