Matei Zaharia: Spark Community Update
TRANSCRIPT
An Exciting Year for Spark
                              May 2013    May 2014
Developers contributing             60         200
Companies contributing              17          50
Total lines of code             49,000     155,000
Commercial support                none     all major Hadoop distros
Community Growth
Spark 0.6 (Oct ’12): 17 contributors
Spark 0.7 (Feb ’13): 31 contributors
Spark 0.8 (Sept ’13): 67 contributors
Spark 0.9 (Feb ’14): 83 contributors
Spark 1.0 (May ’14): 110 contributors
Community Growth
[Charts: activity in the last 30 days (patches, lines added, lines removed) for MapReduce, Storm, YARN, and Spark]
Events
Spark Summit 2013 (December 2-3, 2013)
» Talks from 22 organizations
» 450 attendees
Spark Summit 2014 (June 30 - July 2, 2014)
» Talks from 50+ organizations
» Sign up now!
Videos, slides, registration: spark-summit.org
Next-Gen MapReduce
Influential bloggers:
» “Leading successor of MapReduce” – Mike Olson, Cloudera
» “Two years ago and last year were about Hadoop; this year is about Spark” – Derrick Harris, GigaOM
» “Just about everybody seems to agree” that … “Spark will be the replacement of Hadoop MapReduce” – Curt Monash, DBMS2
Many Features Added to Core…
APIs
» Full parity in Java & Python, Java 8 lambda support
Management
» High availability, YARN security
Monitoring
» Greatly improved UI, metrics
But Most Action Now in Libraries
An expressive API is good, but it’s even better to call your algorithm in one line!
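For instance, a minimal sketch (not from the deck) of calling MLlib's k-means in one line, assuming Spark 1.0's API and a hypothetical points.txt of space-separated coordinates:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector.
val points = sc.textFile("points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// The algorithm itself is one line: k = 10 clusters, 20 iterations.
val model = KMeans.train(points, 10, 20)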
Additions to the Stack
Spark Core
» Spark Streaming (real-time)
» Shark (SQL)
» MLlib (machine learning)
» GraphX (graph)

Spark SQL
Overview
Spark SQL = Catalyst optimizer framework + implementations of SQL & HiveQL on Spark
Provides native support for executing relational queries (SQL) in Spark
Alpha version in Spark 1.0
Led by another AMP alum: Michael Armbrust
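As a quick orientation (a sketch, assuming the Spark 1.0 API), the two dialects are exposed through two entry points:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// SQLContext runs plain SQL over Spark data.
val sqlContext = new SQLContext(sc)

// HiveContext also understands HiveQL and Hive's metastore.
val hiveContext = new HiveContext(sc)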
Relationship to Shark
Shark modified the Hive backend to run over Spark, but had two challenges:
» Limited integration with Spark programs
» Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark:
Borrows:
• Hive data loading
• In-memory column store
Adds:
• RDD-aware optimizer
• Rich language interfaces
Hive Compatibility
Interfaces to access data and code in the Hive ecosystem:
o Support for writing queries in HQL
o Catalog info from Hive MetaStore
o Tablescan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, UDTFs
Parquet Compatibility
Native support for reading data in Parquet:
• Columnar storage avoids reading unneeded data.
• RDDs can be written to Parquet files, preserving the schema.
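A minimal sketch of the round trip (assuming a SQLContext named sqlContext and the people SchemaRDD built later in this deck; method names from the Spark 1.0 API):

// Write a SchemaRDD out as Parquet; the schema travels with the file.
people.saveAsParquetFile("people.parquet")

// Read it back: the result is again a SchemaRDD that can be queried.
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerAsTable("parquetPeople")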
Abstraction: SchemaRDDs
Resilient Distributed Datasets (RDDs) are Spark’s core abstraction.
• Pro: Distributed coarse-grained transformations
• Con: Operations opaque to engine
SchemaRDDs add:
• Awareness of names & types of data stored
• Optimization using database techniques
Examples
Consider a text file filled with people’s names and ages:
Michael, 30
Andy, 31
Justin Bieber, 19
…
Turning an RDD into a Relation
// Create a SQLContext; its implicits convert RDDs of case classes
// into SchemaRDDs (per the Spark 1.0 API).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
Querying Using SQL
// SQL statements can be run by using the sql method provided
// by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs but also
// support normal RDD operations.
// The columns of a row in the result are accessed by ordinal.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()
SQL + Machine Learning
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since sql returns an RDD, the results can be easily used in MLlib.
// (Columns are pulled out of each Row by ordinal, as typed values.)
val trainingData = trainingDataTable.map { row =>
  val features = Array(row.getDouble(1), row.getDouble(2), row.getDouble(3))
  LabeledPoint(row.getDouble(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)
Joining Diverse Sources
val hiveContext = new HiveContext(sc)
import hiveContext._

// Data in Hive
hql("CREATE TABLE IF NOT EXISTS hiveTable (key INT, val STRING)")
hql("LOAD DATA LOCAL INPATH 'kv.txt' INTO TABLE hiveTable")

// Data in existing RDDs
val rdd = sc.parallelize((1 to 100).map(i => Record(i, "val" + i)))
rdd.registerAsTable("rddTable")

// Data in Parquet
hiveContext.loadParquetFile("f.parquet").registerAsTable("parqTable")

// Query all sources at once!
sql("SELECT * FROM hiveTable JOIN rddTable JOIN parqTable WHERE ...")
Spark SQL in Java
public class Person implements Serializable {
  public String getName() {...}
  public void setName(String name) {...}
  public int getAge() {...}
  public void setAge(int age) {...}
}

JavaRDD<Person> people = sc.textFile("people.txt").map(line -> {
  String[] parts = line.split(",");
  Person person = new Person();
  person.setName(parts[0]);
  person.setAge(Integer.parseInt(parts[1]));
  return person;
});

JavaSQLContext ctx = new JavaSQLContext(sc);
JavaSchemaRDD peopleTable = ctx.applySchema(people, Person.class);
Spark SQL in Python
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

lines = sc.textFile("people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")

teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenNames = teenagers.map(lambda p: "Name: " + p.name)
Spark SQL Research
Catalyst framework: compact optimizer based on functional language techniques
» Pattern-matching, fixpoint convergence of rules
Complex analytics: expose and optimize MLlib and GraphX algos in SQL
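To make the Catalyst idea concrete, a toy sketch (deliberately not Catalyst's actual API): an optimizer rule is a pattern match over an expression tree, applied repeatedly until a fixed point is reached.

sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object ToyOptimizer {
  // Rule: constant folding, expressed as a pattern match on the tree.
  def foldConstants(e: Expr): Expr = e match {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
    case Add(l, r)           => Add(foldConstants(l), foldConstants(r))
    case other               => other
  }

  // Apply the rule until the tree stops changing (fixpoint convergence).
  def optimize(e: Expr): Expr = {
    val next = foldConstants(e)
    if (next == e) e else optimize(next)
  }
}

// Example: (1 + 2) + 3 folds to Lit(6).
// ToyOptimizer.optimize(Add(Add(Lit(1), Lit(2)), Lit(3)))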
Learn More
Visit spark.apache.org for the latest Spark news, docs & tutorials
Join us at this year’s Summit: spark-summit.org