spark 2013-04-17

The Spark Ecosystem, Michael Malak, technicaltidbit.com

Upload: michaelmalak

Post on 26-Jan-2015


Page 1: Spark 2013-04-17

The Spark Ecosystem

Michael Malak

technicaltidbit.com

Page 2: Spark 2013-04-17

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  – Berkeley Team
  – BDAS (Berkeley Data Analytics Stack)
  – RDDs (Resilient Distributed Datasets)
  – Shark
  – Spark Streaming
  – Other Spark subsystems

Global Big Data Apr 23, 2013 technicaltidbit.com

Page 3: Spark 2013-04-17

What Hadoop Gives Us

• HDFS
• Map/Reduce

Page 4: Spark 2013-04-17

Hadoop: HDFS

Image from mark.chmarny.com

Page 5: Spark 2013-04-17

Hadoop: Map/Reduce

Image from people.apache.org/~rdonkin

Image from blog.octo.com

Page 6: Spark 2013-04-17

Map/Reduce Tools

Stack diagram (layers, bottom to top):
Linux
Hadoop
HBase, App
Pig, Hive
Pig Script, HiveQL

Page 7: Spark 2013-04-17

Hadoop Distribution Dogs in the Race

Hadoop Distribution Query Tool

Stinger

Apache Drill

Page 8: Spark 2013-04-17

Other Open Source Solutions

• Druid
• Spark

Page 9: Spark 2013-04-17

Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming

Page 10: Spark 2013-04-17

Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation

Page 11: Spark 2013-04-17

BDAS (Berkeley Data Analytics Stack)

Stack diagram (layers, roughly bottom to top):
Linux
Mesos
Hadoop/HDFS
Spark
Bagel, Shark, Spark Streaming
Bagel App, Shark App, Spark Streaming App, Spark App

Page 12: Spark 2013-04-17

RDDs (Resilient Distributed Datasets)

Image from Matei Zaharia’s paper

Page 13: Spark 2013-04-17

RDDs: Laziness

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))  // shorthand for x => x.startsWith("ERROR")
  .map(_.split('\t')(2))
  .filter(_.contains("foo"))                      // everything up to here is lazy
val cnt = errors.count                            // action! this triggers the computation
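The same lazy-then-action behaviour can be tried without a cluster. The sketch below is only an analogy using a plain Scala collection view, not Spark itself: `LazinessSketch`, the sample log lines, and the `evaluated` counter are all made up for illustration. It shows that chaining `filter` and `map` does no work until something asks for a result.

```scala
// Analogy only: a Scala collection view stands in for an RDD.
// No Spark is involved; all names and data here are hypothetical.
object LazinessSketch {
  // Returns (predicate calls before the action, predicate calls after it, count).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    val lines = Seq(
      "ERROR\tdb\tfoo timeout",
      "INFO\tweb\tok",
      "ERROR\tweb\tbar",
      "ERROR\tdb\tfoo again")

    // "Transformations": nothing is computed while the chain is built.
    val errors = lines.view
      .filter { l => evaluated += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

    val before = evaluated   // still 0: the first filter has not run yet
    val cnt = errors.size    // "action": forces the whole chain to run
    (before, evaluated, cnt)
  }
}
```

Calling `LazinessSketch.run()` returns the counter before and after the action, so the deferral is directly observable.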

Page 14: Spark 2013-04-17

RDDs: Transformations vs. Actions

Transformations

map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions

reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala plays the same role as <K,V> in C++ templates and Java generics
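To make the `[K,V]` notation and the meaning of `reduceByKey` concrete, here is a small local imitation on ordinary Scala collections. It is a sketch, not Spark's implementation: `ReduceByKeySketch` and its `reduceByKey` helper are hypothetical names that only mirror the semantics of merging all values that share a key.

```scala
// Hypothetical local helper; it imitates the semantics of reduceByKey
// on an in-memory Seq and is not part of Spark's API.
object ReduceByKeySketch {
  // K and V are type parameters, written [K, V] in Scala.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs
      .groupBy(_._1)                       // collect the pairs for each key
      .map { case (key, kvs) =>
        key -> kvs.map(_._2).reduce(func)  // merge that key's values with func
      }
}
```

For example, `ReduceByKeySketch.reduceByKey(Seq(("a", 1), ("b", 2), ("a", 3)))(_ + _)` sums the values per key.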

Page 15: Spark 2013-04-17

Hive vs. Shark

[Diagram: Hive takes HiveQL over HDFS files; Shark takes HiveQL over HDFS files + RDDs]

Page 16: Spark 2013-04-17

Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Either statement creates a table that is stored in the cluster’s memory using RDD.cache(); the second relies on the _cached suffix in the table name.

Page 17: Spark 2013-04-17

Shark: Just a Shim

Images from Reynold Xin’s presentation

Page 18: Spark 2013-04-17

What about “Big Data”?

[Chart: Shark effectiveness across data sizes; axis ticks KB, MB, GB, TB, PB]

Page 19: Spark 2013-04-17

Median Hadoop job input size

Image from Reynold Xin’s presentation

Page 20: Spark 2013-04-17

Spark Streaming: Motivation

[Diagram labels: clients ×1,000,000; HDFS]

Page 21: Spark 2013-04-17

Spark Streaming: DStream

• “A series of small batches”

[Diagram: a DStream drawn as a sequence of RDDs, one per 2 sec interval]

Sample events in the stream:
{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
. . .
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}
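The "series of small batches" idea can be sketched locally by bucketing timestamped events into 2-second intervals, with each bucket standing in for one RDD of the DStream. This is an illustration under assumptions, not how Spark Streaming is implemented: `MicroBatchSketch`, `Event`, the `timeMs` field, and `toBatches` are hypothetical names.

```scala
// Illustration only: each inner Seq stands in for one RDD of a DStream.
// All names are hypothetical and nothing here touches Spark.
object MicroBatchSketch {
  final case class Event(timeMs: Long, id: String, eventType: String)

  // Group events by the batch interval their timestamp falls in.
  // Simplification: an interval with no events yields no batch here.
  def toBatches(events: Seq[Event], batchMs: Long = 2000L): Seq[Seq[Event]] =
    events
      .groupBy(_.timeMs / batchMs)  // interval index, 0 = first 2 seconds
      .toSeq
      .sortBy(_._1)                 // restore time order of the batches
      .map(_._2)
}
```

Feeding it events stamped at 100 ms, 1500 ms, 2500 ms and 4100 ms yields three batches, the first holding two events.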

Page 22: Spark 2013-04-17

Spark Streaming: DAG

The DAG

Kafka
  -> DStream[String] (JSON)
  -> DStream.transform
  -> DStream[EvObj]
       -> DStream.filter(_.eventType == "error") -> DStream[EvObj] -> DStream.foreach(println)
       -> DStream.filter(_.eventType == "buyGoods") -> DStream.map((_.id,1)) -> DStream.groupByKey -> DStream.foreach(println)

Page 23: Spark 2013-04-17

Spark Streaming: Example Code

// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

val usersBuying = events.filter(_.eventType == "buyGoods")
  .map(e => (e.id, 1))  // explicit lambda: (_.id, 1) would not compile as a pair
  .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start

Page 24: Spark 2013-04-17

Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None  // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _ =>  // ignore other event types
    })
    Option(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
// keep logOff events too, so updateFunc can see them and clear the state
val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = errorsAndLogOffs.map(e => (e.id, e))
  .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors: " + rdd.count))
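How per-key state evolves from batch to batch can be traced locally. The sketch below is a plain-Scala stand-in, not Spark's `updateStateByKey`: `StatefulSketch`, `EvObj`, `update`, and `step` are hypothetical names, and it assumes the intended rule is "count errors per user, and forget the user on logOff". As in the slide's update function, returning `None` removes the key.

```scala
// Plain-Scala stand-in for per-key streaming state; all names are hypothetical.
object StatefulSketch {
  final case class EvObj(id: String, eventType: String)
  final case class ErrorsPerUser(numErrors: Int = 0)

  // Update rule for one key: None drops the key, Some(...) keeps or creates it.
  def update(values: Seq[EvObj], state: Option[ErrorsPerUser]): Option[ErrorsPerUser] =
    if (values.exists(_.eventType == "logOff")) None
    else {
      val prev = state.getOrElse(ErrorsPerUser())
      Some(prev.copy(numErrors = prev.numErrors + values.count(_.eventType == "error")))
    }

  // Apply `update` to every key seen so far or present in this batch.
  def step(state: Map[String, ErrorsPerUser], batch: Seq[EvObj]): Map[String, ErrorsPerUser] = {
    // Assumption: only error and logOff events feed the state.
    val relevant = batch.filter(e => e.eventType == "error" || e.eventType == "logOff")
    val byKey = relevant.groupBy(_.id)
    (state.keySet ++ byKey.keySet).toList.flatMap { k =>
      update(byKey.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s).toList
    }.toMap
  }
}
```

Folding a sequence of batches with `batches.foldLeft(Map.empty[String, StatefulSketch.ErrorsPerUser])(StatefulSketch.step)` gives the state after the last batch; its size corresponds to the "users experiencing errors" count.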

Page 25: Spark 2013-04-17

Other Spark Subsystems

• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)

Page 26: Spark 2013-04-17

Teaser

• Future Meetup: Machine learning from real-time data streams