Spark Summit 2014: Spark Streaming for Realtime Auctions
DESCRIPTION
Spark Summit (2014-06-30) http://spark-summit.org/2014/talk/building-a-data-processing-system-for-real-time-auctions

What do you do when you need to update your models sooner than your existing batch workflows allow? At Sharethrough we faced the same question. Although we use Hadoop extensively for batch processing, we needed a system to process clickstream data as soon as possible for our real-time ad auction platform. We found Spark Streaming to be a perfect fit for us because of its easy integration into the Hadoop ecosystem, powerful functional programming API, and low-friction interoperability with our existing batch workflows. In this talk, we'll present some of the use cases that led us to choose a stream processing system and why we chose Spark Streaming in particular. We'll also discuss how we organized our jobs to promote reusability between batch and streaming workflows, as well as improve testability.

TRANSCRIPT
![Page 1: Spark Summit 2014: Spark Streaming for Realtime Auctions](https://reader033.vdocuments.site/reader033/viewer/2022050708/53ed73158d7f7289708b5cb5/html5/thumbnails/1.jpg)
Spark Streaming for Realtime Auctions @russellcardullo
Sharethrough
Agenda
• Sharethrough?
• Streaming use cases
• How we use Spark
• Next steps
Sharethrough
vs
The Sharethrough Native Exchange
How can we use streaming data?
Use Cases
• Creative Optimization
• Spend Tracking
• Operational Monitoring

µ* = max_k { µ_k }
Creative Optimization
• Choose best performing variant
• Short feedback cycle required
```
impression: {
  device: "iOS",
  geocode: "C967",
  pkey: "p193",
  …
}
```
[Diagram: three creative variants with different ad copy, each with an unknown click rate]
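The selection rule µ* = max_k { µ_k } can be sketched in plain Scala. The variant names and counts below are purely illustrative, not data from the talk:

```scala
// Hypothetical per-variant stats; names and numbers are made up for illustration.
case class VariantStats(name: String, impressions: Long, clicks: Long) {
  def ctr: Double = if (impressions == 0) 0.0 else clicks.toDouble / impressions
}

// mu* = max_k { mu_k }: serve the variant with the highest observed click rate.
def bestVariant(variants: Seq[VariantStats]): VariantStats =
  variants.maxBy(_.ctr)

val variants = Seq(
  VariantStats("Variant 1", 1000, 12),
  VariantStats("Variant 2", 1000, 30),
  VariantStats("Variant 3", 1000, 21)
)
```

A production optimizer would typically use a bandit-style estimator rather than the raw maximum; this only illustrates the selection step.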
Spend Tracking
• Spend on visible impressions and clicks
• Actual spend happens asynchronously
• Want to correct prediction for optimal serving
[Chart: predicted spend vs. actual spend over time]
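Since actual spend arrives asynchronously, one simple way to correct the prediction is to scale it by the observed actual-to-predicted ratio. This is a hypothetical sketch, not the algorithm from the talk:

```scala
// Ratio of observed actual spend to predicted spend so far (1.0 if nothing predicted yet).
def correctionFactor(predicted: Double, actual: Double): Double =
  if (predicted == 0.0) 1.0 else actual / predicted

// Apply the factor to the next prediction to keep serving close to real spend.
def correctedPrediction(nextPredicted: Double, factor: Double): Double =
  nextPredicted * factor
```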
Operational Monitoring
• Detect issues with content served on third party sites
• Use same logs as reporting
[Chart: click rates across placements P1–P8]
We can directly measure the business impact of using this data sooner
Why use Spark to build these features?
Why Spark?
• Scala API
• Supports batch and streaming
• Active community support
• Easily integrates into the existing Hadoop ecosystem
• But it doesn't require Hadoop in order to run
How we’ve integrated Spark
Existing Data Pipeline

[Diagram: web servers write log files, Flume ships them to HDFS, and an analytics web service loads results into a database]
Pipeline with Streaming

[Diagram: the same pipeline (web servers, log files, Flume, HDFS, analytics web service, database), now with a streaming path added]
Batch vs. Streaming

Streaming:
• "Real-time" reporting
• Low latency to use data
• Only as reliable as the source
• Low latency > correctness

Batch:
• Daily reporting
• Billing / earnings
• Anything with a strict SLA
• Correctness > low latency
Spark Job Abstractions
Job Organization
Source → Transform → Sink
Job
Sources

```scala
case class BeaconLogLine(
  timestamp: String,
  uri: String,
  beaconType: String,
  pkey: String,
  ckey: String
)

object BeaconLogLine {

  def newDStream(ssc: StreamingContext, inputPath: String): DStream[BeaconLogLine] = {
    ssc.textFileStream(inputPath).map { parseRawBeacon(_) }
  }

  def parseRawBeacon(b: String): BeaconLogLine = {
    ...
  }
}
```
Callouts: case class for pattern matching; generate a DStream; encapsulate common operations.
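The body of parseRawBeacon is elided on the slide. A sketch of what such a parser could look like, assuming a made-up space-separated line with a query string (the real Sharethrough log format is not shown in the talk):

```scala
case class BeaconLogLine(
  timestamp: String,
  uri: String,
  beaconType: String,
  pkey: String,
  ckey: String
)

// Hypothetical format: "<timestamp> <uri> <k=v&k=v...>". Illustrative only.
def parseRawBeacon(b: String): BeaconLogLine = {
  val parts = b.split(" ", 3)
  val params = parts(2).split("&").map { kv =>
    val i = kv.indexOf('=')
    kv.take(i) -> kv.drop(i + 1)
  }.toMap
  BeaconLogLine(parts(0), parts(1), params("type"), params("pkey"), params("ckey"))
}
```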
Transformations
```scala
def visibleByPlacement(source: DStream[BeaconLogLine]): DStream[(String, Long)] = {
  source.
    filter(data => {
      data.uri == "/strbeacon" && data.beaconType == "visible"
    }).
    map(data => (data.pkey, 1L)).
    reduceByKey(_ + _)
}
```
Callout: type safety from the case class.
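Because the transformation is just filter/map/reduceByKey, its logic can be checked on plain collections too. A sketch of the same pipeline over a Seq, with groupBy-and-sum standing in for reduceByKey:

```scala
case class BeaconLogLine(timestamp: String, uri: String, beaconType: String, pkey: String, ckey: String)

// Same shape as visibleByPlacement, but on a Seq instead of a DStream.
def visibleByPlacementSeq(source: Seq[BeaconLogLine]): Map[String, Long] =
  source
    .filter(data => data.uri == "/strbeacon" && data.beaconType == "visible")
    .map(data => (data.pkey, 1L))
    .groupBy(_._1)                                            // reduceByKey analogue:
    .map { case (pkey, ones) => pkey -> ones.map(_._2).sum }  // sum counts per key

val sample = Seq(
  BeaconLogLine("t1", "/strbeacon", "visible", "p193", "c1"),
  BeaconLogLine("t2", "/strbeacon", "visible", "p193", "c2"),
  BeaconLogLine("t3", "/strbeacon", "click",   "p200", "c3")
)
```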
Sinks
```scala
class RedisSink @Inject()(store: RedisStore) {

  def sink(result: DStream[(String, Long)]) = {
    result.foreachRDD { rdd =>
      rdd.foreach { element =>
        val (key, value) = element
        store.merge(key, value)
      }
    }
  }
}
```
Callout: custom sinks for new stores.
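The store.merge call is the only Redis-specific piece. A sketch with an in-memory stand-in for RedisStore (merge-as-addition semantics are an assumption here; the talk doesn't define merge):

```scala
import scala.collection.mutable

// Hypothetical stand-in for RedisStore: merge accumulates counts per key.
class InMemoryStore {
  val data = mutable.Map.empty[String, Long].withDefaultValue(0L)
  def merge(key: String, value: Long): Unit = data(key) = data(key) + value
}

// The same loop the sink runs inside foreachRDD, applied to one batch of results.
def sinkBatch(store: InMemoryStore, result: Seq[(String, Long)]): Unit =
  result.foreach { case (key, value) => store.merge(key, value) }
```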
Jobs

```scala
object ImpressionsForPlacements {

  def run(config: Config, inputPath: String) {
    val conf = new SparkConf().
      setMaster(config.getString("master")).
      setAppName("Impressions for Placement")

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val source = BeaconLogLine.newDStream(ssc, inputPath)
    val visible = visibleByPlacement(source)
    sink(visible)

    ssc.start
    ssc.awaitTermination
  }
}
```
Callouts: source, transform, sink.
Advantages?
Code Reuse

```scala
object PlacementVisibles {
  …
  val source = BeaconLogLine.newDStream(ssc, inputPath)
  val visible = visibleByPlacement(source)
  sink(visible)
  …
}

…

object PlacementEngagements {
  …
  val source = BeaconLogLine.newDStream(ssc, inputPath)
  val engagements = engagementsByPlacement(source)
  sink(engagements)
  …
}
```
Callout: composable jobs.
Readability
```scala
ssc.textFileStream(inputPath).
  map { parseRawBeacon(_) }.
  filter(data => {
    data._2 == "/strbeacon" && data._3 == "visible"
  }).
  map(data => (data._4, 1L)).
  reduceByKey(_ + _).
  foreachRDD { rdd =>
    rdd.foreach { element =>
      store.merge(element._1, element._2)
    }
  }
```
?
Readability
```scala
val source = BeaconLogLine.newDStream(ssc, inputPath)
val visible = visibleByPlacement(source)
redis.sink(visible)
```
Testing

```scala
def assertTransformation[T: Manifest, U: Manifest](
  transformation: DStream[T] => DStream[U],
  input: Seq[T],
  expectedOutput: Seq[U]
): Unit = {
  val ssc = new StreamingContext("local[1]", "Testing", Seconds(1))
  val rddQueue = new SynchronizedQueue[RDD[T]]()
  val source = ssc.queueStream(rddQueue)
  val results = transformation(source)

  var output = Array[U]()
  results.foreachRDD { rdd => output = output ++ rdd.collect() }
  ssc.start
  rddQueue += ssc.sparkContext.makeRDD(input, 2)
  Thread.sleep(jobCompletionWaitTimeMillis)
  ssc.stop(true)

  assert(output.toSet === expectedOutput.toSet)
}
```
Callouts: function, input, expectation; test.
Testing
```scala
test("#visibleByPlacement") {

  val input = Seq(
    "pkey=abcd, …",
    "pkey=abcd, …",
    "pkey=wxyz, …"
  )

  val expectedOutput = Seq(("abcd", 2), ("wxyz", 1))

  assertTransformation(visibleByPlacement, input, expectedOutput)
}
```
Callout: use our test helper.
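The same function/input/expectation pattern works for any pure transformation, without spinning up a StreamingContext. A minimal collection-based analogue of the helper:

```scala
// Collection-based analogue of assertTransformation: apply the function
// and compare results as sets, ignoring ordering like the DStream helper does.
def assertSeqTransformation[T, U](
  transformation: Seq[T] => Seq[U],
  input: Seq[T],
  expectedOutput: Seq[U]
): Unit =
  assert(transformation(input).toSet == expectedOutput.toSet)
```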
Other Learnings
Other Learnings
• Keeping your driver program healthy is crucial
  • 24/7 operation and monitoring
  • Spark on Mesos? Use Marathon.
• Pay attention to settings for spark.cores.max
  • Monitor data rate and increase as needed
• Serialization on classes
  • Java
  • Kryo
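On the Java vs. Kryo point: with Spark's default Java serializer, classes shipped in closures must be serializable, which Scala case classes already are. A minimal round-trip sketch of that mechanism (Kryo registration itself is omitted since it needs the Kryo library):

```scala
import java.io._

case class Beacon(pkey: String, count: Long) // case classes are Serializable by default

// Java-serialization round trip, the mechanism Spark's default serializer uses.
def roundTrip[A <: Serializable](a: A): A = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(a)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  try in.readObject().asInstanceOf[A] finally in.close()
}
```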
What’s next?
Twitter Summingbird
• Write-once, run anywhere
• Supports:
  • Hadoop MapReduce
  • Storm
  • Spark (maybe?)
Amazon Kinesis
[Diagram: web servers, mobile devices, app logs, and other applications feeding into Kinesis]
Thanks!