Spark 1.1 and Beyond
Patrick Wendell


Page 1: Spark 1.1 and Beyond

Spark 1.1 and Beyond

Patrick Wendell

Page 2: Spark 1.1 and Beyond

About Me

Work at Databricks leading the Spark team

Spark 1.1 Release manager

Committer on Spark since AMPLab days

Page 3: Spark 1.1 and Beyond

This Talk

Spark 1.1 (and a bit about 1.2)

A few notes on performance

Q&A with myself, Tathagata Das, and Josh Rosen

Page 4: Spark 1.1 and Beyond

A Bit about Spark…

[Stack diagram: the Spark RDD API, with libraries built on RDDs:
Spark Streaming: real-time (DStreams: streams of RDDs)
GraphX: graph processing, alpha (RDD-based graphs)
MLlib: machine learning (RDD-based matrices)
Spark SQL: RDD-based tables
Runs over HDFS, S3, Cassandra; deploys on YARN, Mesos, Standalone]

Page 5: Spark 1.1 and Beyond

Spark Release Process

~3 month release cycle, time-scoped:
2 months of feature development
1 month of QA

Maintain older branches with bug fixes

Upcoming release: 1.1.0 (previous was 1.0.2)

Page 6: Spark 1.1 and Beyond

[Branch diagram: master has more features; release branches are more stable.
branch-1.0 carries v1.0.0, v1.0.1; branch-1.1 carries v1.1.0]

For any P.O.C. or non-production cluster, we always recommend running off of the head of a release branch.

Page 7: Spark 1.1 and Beyond

Spark 1.1

1,297 patches

200+ contributors (still counting)

Dozens of organizations

To get updates, join our dev list: e-mail [email protected]

Page 8: Spark 1.1 and Beyond

Roadmap

Spark 1.1 and 1.2 have similar themes

Spark core: usability, stability, and performance

MLlib/SQL/Streaming: expanded feature set and performance
Around 40% of mailing list traffic is about these libraries.

Page 9: Spark 1.1 and Beyond

Spark Core in 1.1

Performance “out of the box”:
Sort-based shuffle
Efficient broadcasts
Disk spilling in Python
YARN usability improvements

Usability:
Task progress and user-defined counters
UI behavior for failing or large jobs

Page 10: Spark 1.1 and Beyond

Spark SQL in 1.1

1.0 was the first “preview” release

1.1 provides upgrade path for Shark:
Replaced Shark in our benchmarks with 2-3X perf gains
Can perform optimizations with 10-100X less effort than Hive.

Page 11: Spark 1.1 and Beyond

Turning an RDD into a Relation

• // Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

Page 12: Spark 1.1 and Beyond

Querying using SQL

• // SQL statements can be run directly on RDDs
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

• // Language-integrated queries (à la LINQ)
val teenagers = people.where('age >= 13).where('age <= 19).select('name)

Page 13: Spark 1.1 and Beyond

Spark SQL in 1.1

JDBC server for multi-tenant access and BI tools

Native JSON support

Public types API: “make your own” SchemaRDDs

Improved operator performance

Native Parquet support and optimizations

Page 14: Spark 1.1 and Beyond

Spark Streaming

Stability improvements across the board

Amazon Kinesis support

Rate limiting for streams

Support for polling Flume streams

Streaming + ML: Streaming linear regressions
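The streaming + ML item above pairs naturally with a sketch of the underlying idea: on each arriving mini-batch, take one gradient step on the model. The data, learning rate, and `update` helper below are illustrative assumptions in plain Scala, not MLlib's actual API (which operates on DStreams):

```scala
// One gradient step per mini-batch for the model y = w * x,
// using the mean squared-error gradient over the batch.
def update(w: Double, batch: Seq[(Double, Double)], lr: Double): Double = {
  val grad = batch.map { case (x, y) => (w * x - y) * x }.sum / batch.size
  w - lr * grad
}

// Simulate 200 mini-batches drawn from the line y = 2x.
val batches = Seq.fill(200)(Seq.tabulate(10) { i =>
  val x = i / 10.0
  (x, 2.0 * x)
})

// Fold the stream of batches through the update rule, starting from w = 0.
val learned = batches.foldLeft(0.0)((w, b) => update(w, b, 0.5))
println(f"learned weight: $learned%.3f")  // converges toward 2.0
```

The streaming version keeps exactly this kind of running model, but updates it as batches arrive from a live stream rather than a prebuilt collection.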

Page 15: Spark 1.1 and Beyond

What’s new in MLlib v1.1

• Contributors: 40 (v1.0) -> 68

• Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression

• Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec

• Statistics: sampling (core), correlations, hypothesis testing, random data generation

• Performance and scalability: major improvement to decision tree, tree aggregation

• Python API: decision tree, statistics, linear methods
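The “tree aggregation” item above refers to combining partial results in rounds of pairs rather than pulling everything to one place at once. A toy plain-Scala sketch of the idea on a local collection (MLlib's real implementation aggregates across executors, which this does not show):

```scala
// Pairwise ("tree") reduction: sum adjacent pairs, then recurse on the
// halved collection until few enough elements remain to sum directly.
def treeSum(xs: Vector[Double]): Double =
  if (xs.size <= 2) xs.sum
  else treeSum(xs.grouped(2).map(_.sum).toVector)

val total = treeSum(Vector.tabulate(100)(_.toDouble))  // 0 + 1 + ... + 99
println(s"tree-reduced sum: $total")
```

The payoff in a cluster is that no single node has to receive all partial results at once; each round only combines pairs.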

Page 16: Spark 1.1 and Beyond

Performance (v1.0 vs. v1.1)

Page 17: Spark 1.1 and Beyond

Sort-based Shuffle

Old shuffle: each mapper opens a file for each reducer and writes output simultaneously.
Files = # mappers * # reducers

New shuffle: each mapper buffers reduce output in memory, spills, then sort-merges the on-disk data.
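The file-count arithmetic above is the whole story of why this matters at scale; here it is spelled out for a hypothetical 1000-mapper, 1000-reducer job (the cluster sizes are made up for illustration):

```scala
// Hash-based shuffle: one file per (mapper, reducer) pair.
// Sort-based shuffle: one sorted, indexed file per mapper.
val mappers = 1000
val reducers = 1000
val hashShuffleFiles = mappers * reducers  // 1,000,000 files
val sortShuffleFiles = mappers             // 1,000 files
println(s"hash shuffle: $hashShuffleFiles files, sort shuffle: $sortShuffleFiles files")
```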

Page 18: Spark 1.1 and Beyond

GroupBy Operator

Spark groupByKey != SQL GROUP BY

NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }

YES:
people.map(p => (p.zipCode, p.getIncome))
  .reduceByKey(_ + _)
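A plain-Scala analogue of the two patterns above, on hypothetical income data. Both produce identical per-zip sums; the difference in Spark is that reduceByKey combines values map-side before the shuffle, while groupByKey ships every individual value across the network first:

```scala
val incomes = Seq(("94110", 50000), ("94110", 62000), ("10001", 48000))

// groupByKey-style: materialize every value per key, then sum.
val viaGroup: Map[String, Int] =
  incomes.groupBy(_._1).map { case (zip, xs) => zip -> xs.map(_._2).sum }

// reduceByKey-style: fold each value into a running per-key sum,
// never holding all of a key's values at once.
val viaReduce: Map[String, Int] =
  incomes.foldLeft(Map.empty[String, Int]) { case (acc, (zip, inc)) =>
    acc.updated(zip, acc.getOrElse(zip, 0) + inc)
  }
```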

Page 19: Spark 1.1 and Beyond

GroupBy Operator

Spark groupByKey != SQL GROUP BY

NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }

YES:
people.groupBy('zipCode).select(sum('income))

Page 20: Spark 1.1 and Beyond

GroupBy Operator

Spark groupByKey != SQL GROUP BY

NO:
people.map(p => (p.zipCode, p.getIncome))
  .groupByKey()
  .map { case (zip, incomes) => (zip, incomes.sum) }

YES:
SELECT sum(income) FROM people GROUP BY zipCode;

Page 21: Spark 1.1 and Beyond

Other efforts

[Stack diagram, repeated from Page 4: the Spark RDD API with Spark Streaming (real-time, DStreams: streams of RDDs), GraphX (graph, alpha; RDD-based graphs), MLlib (machine learning; RDD-based matrices), and Spark SQL (RDD-based tables), over HDFS, S3, Cassandra; YARN, Mesos, Standalone]

Pig on Spark

Hive on Spark

Ooyala Job Server

Page 22: Spark 1.1 and Beyond

Looking Ahead to 1.2+

[Core]
Scala 2.11 support
Debugging tools (task progress, visualization)
Netty-based communication layer

[SQL]
Portability across Hive versions
Performance optimizations (TPC-DS and Parquet)
Planner integration with Cassandra and other sources

Page 23: Spark 1.1 and Beyond

Looking Ahead to 1.2+

[Streaming]
Python support
Lower-level Kafka API w/ recoverability

[MLlib]
Multi-model training
Many new algorithms
Faster internal linear solver

Page 24: Spark 1.1 and Beyond

Q and A

Josh Rosen: PySpark and Spark Core

Tathagata Das: Spark Streaming Lead