Spark 1.1 and Beyond
Patrick Wendell
About Me
Work at Databricks leading the Spark team
Spark 1.1 Release manager
Committer on Spark since AMPLab days
This Talk
Spark 1.1 (and a bit about 1.2)
A few notes on performance
Q&A with myself, Tathagata Das, and Josh Rosen
A Bit about Spark…

The Spark stack:
Spark Core (RDD API)
Spark Streaming (real-time): DStreams, streams of RDDs
MLlib (machine learning): RDD-based matrices
GraphX (graph, alpha): RDD-based graphs
Spark SQL: RDD-based tables
Storage: HDFS, S3, Cassandra
Cluster managers: YARN, Mesos, Standalone
Spark Release Process
~3 month release cycle, time-scoped: 2 months of feature development, 1 month of QA
Maintain older branches with bug fixes
Upcoming release: 1.1.0 (previous was 1.0.2)
Master: more features, less stable
branch-1.1: release branch for V1.1.0
branch-1.0: release branch for V1.0.0 and V1.0.1
Release branches become more stable over time.
For any proof-of-concept or non-production cluster, we always recommend running off of the head of a release branch.
Spark 1.1
1,297 patches
200+ contributors (still counting)
Dozens of organizations
To get updates, join our dev list: e-mail [email protected]
Roadmap
Spark 1.1 and 1.2 have similar themes
Spark core: usability, stability, and performance
MLlib/SQL/Streaming: expanded feature set and performance. Around 40% of mailing list traffic is about these libraries.
Spark Core in 1.1
Performance “out of the box”: sort-based shuffle, efficient broadcasts, disk spilling in Python, YARN usability improvements
Usability: task progress and user-defined counters, UI behavior for failing or large jobs
1.0 was the first “preview” release
1.1 provides an upgrade path for Shark: replaced Shark in our benchmarks with 2-3X perf gains, and can perform optimizations with 10-100X less effort than Hive.
Spark SQL in 1.1
Turning an RDD into a Relation

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
Querying using SQL

// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (a la LINQ).
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
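The parse-then-query pipeline above can be sketched in plain Python (this is an illustration of the data flow, not the Spark API; the sample rows mirror the `people.txt` file shipped with the Spark examples):

```python
# Illustrative sketch (plain Python, not Spark): split each line into
# fields, build typed records (a namedtuple in place of the case class),
# then apply the teenager query from the SQL example.
from collections import namedtuple

Person = namedtuple("Person", ["name", "age"])

lines = ["Michael, 29", "Andy, 30", "Justin, 19"]

# Equivalent of .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people = [Person(f[0], int(f[1].strip()))
          for f in (line.split(",") for line in lines)]

# Equivalent of "SELECT name FROM people WHERE age >= 13 AND age <= 19"
teenagers = [p.name for p in people if 13 <= p.age <= 19]
assert teenagers == ["Justin"]
```

The same filter can be written three ways in Spark SQL (raw SQL, the language-integrated DSL, or plain RDD operations); all of them reduce to record-at-a-time predicates like the list comprehension above.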
Spark SQL in 1.1

JDBC server for multi-tenant access and BI tools
Native JSON support
Public types API: “make your own” SchemaRDDs
Improved operator performance
Native Parquet support and optimizations
Spark Streaming
Stability improvements across the board
Amazon Kinesis support
Rate limiting for streams
Support for polling Flume streams
Streaming + ML: Streaming linear regressions
What’s new in MLlib v1.1
• Contributors: 40 (v1.0) -> 68
• Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression
• Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec
• Statistics: sampling (core), correlations, hypothesis testing, random data generation
• Performance and scalability: major improvement to decision tree, tree aggregation
• Python API: decision tree, statistics, linear methods
Performance (v1.0 vs. v1.1)
Sort-based Shuffle

Old shuffle: each mapper opens a file for each reducer and writes output simultaneously. Files = # mappers * # reducers.

New shuffle: each mapper buffers reduce output in memory, spills to disk, then sort-merges the on-disk data.
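The file-count arithmetic, and the spill-then-merge step, can be sketched in a few lines of plain Python (an illustration of the idea, not Spark's actual shuffle code):

```python
# Illustrative sketch (not Spark's implementation): compare shuffle file
# counts, and emulate one mapper's sort-merge over spilled runs using the
# standard library.
import heapq

def hash_shuffle_files(n_mappers: int, n_reducers: int) -> int:
    # Old hash-based shuffle: one file per (mapper, reducer) pair.
    return n_mappers * n_reducers

def sort_shuffle_files(n_mappers: int, n_reducers: int) -> int:
    # Sort-based shuffle: each mapper writes a single sorted output file,
    # regardless of the number of reducers.
    return n_mappers

# With 1000 mappers and 1000 reducers the old scheme needs a million files.
assert hash_shuffle_files(1000, 1000) == 1_000_000
assert sort_shuffle_files(1000, 1000) == 1000

def sort_merge_spills(spills):
    # Each spill is a run of (partition, key, value) records sorted by
    # reducer partition; merging the runs yields one file ordered so each
    # reducer can read its partition as a contiguous range.
    return list(heapq.merge(*spills))

spills = [
    sorted([(2, "b", 1), (0, "a", 1)]),
    sorted([(1, "c", 1), (0, "d", 1)]),
]
merged = sort_merge_spills(spills)
assert [rec[0] for rec in merged] == [0, 0, 1, 2]
```

Keeping one sorted file per mapper is what makes the file count scale with the number of mappers rather than with mappers times reducers.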
GroupBy Operator

Spark groupByKey != SQL GROUP BY.

NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }

YES:
people.map(p => (p.zipCode, p.getIncome))
  .reduceByKey(_ + _)
GroupBy Operator

Spark groupByKey != SQL GROUP BY.

NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }

YES:
people.groupBy('zipCode).select(sum('income))

GroupBy Operator

Spark groupByKey != SQL GROUP BY.

NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }

YES:
SELECT sum(income) FROM people GROUP BY zipCode;
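The reason groupByKey is the wrong tool for aggregation can be shown in plain Python (an illustration of the semantics, not Spark code): the "NO" pattern buffers every value for a key before summing, while the "YES" pattern keeps only one running total per key, which is what gets shuffled across the network.

```python
# Illustrative sketch (plain Python, not Spark): groupByKey-style
# aggregation materializes all values per key; reduceByKey-style
# aggregation combines incrementally and holds one value per key.
from collections import defaultdict

records = [("94110", 50), ("94110", 70), ("10001", 30)]

def group_by_key_then_sum(pairs):
    groups = defaultdict(list)          # every value buffered per key
    for k, v in pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}

def reduce_by_key(pairs, combine):
    totals = {}                         # a single running value per key
    for k, v in pairs:
        totals[k] = combine(totals[k], v) if k in totals else v
    return totals

assert group_by_key_then_sum(records) == {"94110": 120, "10001": 30}
assert reduce_by_key(records, lambda a, b: a + b) == {"94110": 120, "10001": 30}
```

Both produce the same answer; the difference is memory and shuffle volume, which is why reduceByKey (or SQL's GROUP BY, which the planner optimizes the same way) is preferred.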
Other efforts
Pig on Spark
Hive on Spark
Ooyala Job Server
Looking Ahead to 1.2+
[Core]: Scala 2.11 support, debugging tools (task progress, visualization), Netty-based communication layer
[SQL]: portability across Hive versions, performance optimizations (TPC-DS and Parquet), planner integration with Cassandra and other sources
Looking Ahead to 1.2+
[Streaming]: Python support, lower-level Kafka API with recoverability
[MLlib]: multi-model training, many new algorithms, faster internal linear solver
Q and A
Josh Rosen: PySpark and Spark Core
Tathagata Das: Spark Streaming lead