Introduction to Spark Streaming
TRANSCRIPT
Introduction to Spark Streaming
Real-time processing on Apache Spark
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Real-time analytics in Big data
● Unification
● Spark streaming
● DStream
● DStream and RDD
● Stream processing
● DStream transformation
● Hands on
3 V's of Big data
● Volume
○ TBs and PBs of files
○ Drives the need for batch processing systems
● Velocity
○ TBs of stream data
○ Drives the need for stream processing systems
● Variety
○ Structured, semi-structured and unstructured
○ Drives the need for SQL and graph processing systems
Velocity
● The speed at which we
○ Collect the data
○ Process it to get insights
● More and more big data analytics is becoming real time
● Primary drivers
○ Social media
○ IoT
○ Mobile applications
Use cases
● Twitter needs to crunch a few billion tweets/s to publish trending topics
● Credit card companies need to crunch millions of transactions/s to identify fraud
● Mobile applications like WhatsApp need to constantly crunch logs for service availability and performance
Real-time analytics
● Ability to collect and process TBs of streaming data to get insights
● Data will be consumed from one or more streams
● Need to combine historical data with real-time data
● Ability to stream data to downstream applications
Stream processing using M/R
● Map/Reduce is inherently a batch processing system, which makes it unsuitable for streaming
● Needing disk as the data source puts latency into the processing
● Streams need multiple transformations, which cannot be expressed effectively in M/R
● The overhead of launching a new M/R job is too high
Apache Storm
● Apache Storm is a stream processing system built on top of HDFS
● Apache Storm has its own APIs and does not use Map/Reduce
● Its core processes one message at a time; micro-batching is built on top of it (Trident)
● Built by Twitter
Limitations of streaming on Hadoop
● M/R is not suitable for streaming
● Apache Storm requires learning new APIs and a new paradigm
● No way to combine batch results from M/R with Apache Storm streams
● Maintaining two runtimes is always hard
Unified Platform for Big Data Apps

[Diagram: Apache Spark as a unified layer running Batch, Interactive and Streaming workloads on top of Hadoop, Mesos and NoSQL]
Spark streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Micro batch
● Spark streaming is a fast batch processing system
● Spark streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1s or as big as multiple hours
● Spark's job creation and execution overhead is so low that it can do all of that under a second
● These batches are called DStreams
Discretized streams (DStream)

The input stream is divided into multiple discrete batches. The batch interval is configurable.

[Diagram: Input Stream → Spark Streaming → batch @ t1, batch @ t2, batch @ t3]
DStream
● Discretized stream
● Incoming data is divided into small discrete batches
● Batch size can be from 1s to multiple minutes
● A DStream can be constructed from
○ Sockets
○ Kafka
○ HDFS
○ Custom receivers
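The sources above can be wired up roughly as follows; this is a minimal sketch, where the host, port and directory path are placeholder values, and Kafka is omitted because it needs the separate spark-streaming-kafka artifact:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSources {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sources")
    // Batch interval of 5 seconds: each DStream batch covers 5s of data
    val ssc = new StreamingContext(conf, Seconds(5))

    // Text lines read from a TCP socket (host/port are placeholders)
    val socketLines = ssc.socketTextStream("localhost", 50050)

    // New text files appearing in an HDFS directory (path is a placeholder)
    val fileLines = ssc.textFileStream("hdfs:///tmp/streaming-input")

    socketLines.print()
    fileLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```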
DStream to RDD

[Diagram: Input Stream → Spark Streaming → batch @ t1, batch @ t2, batch @ t3, backed by RDD @ t1, RDD @ t2, RDD @ t3]
DStream to RDD
● Each batch of a DStream is represented as an RDD underneath
● These RDDs are replicated in the cluster for fault tolerance
● Every DStream operation results in an RDD transformation
● There are APIs to access these RDDs directly
● Can combine stream and batch processing
DStream transformation

val ssc = new StreamingContext(args(0), "wordcount", Seconds(5))
val lines = ssc.socketTextStream("localhost", 50050)
val words = lines.flatMap(_.split(" "))

[Diagram: Socket Stream → Spark Streaming → batch @ t1, batch @ t2, batch @ t3 → RDD @ t1, RDD @ t2, RDD @ t3 → flatMap → FlatMapRDD @ t1, FlatMapRDD @ t2, FlatMapRDD @ t3]
Socket stream
● Ability to listen to any socket on a remote machine
● Need to configure host and port
● Both raw and text representations of the socket stream are available
● Built-in retry mechanism
Wordcount example
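The demo itself is not in the transcript; completing the snippet from the previous slide, a minimal network word count might look like this (host and port are placeholder values):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("wordcount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 50050)
    val words = lines.flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    counts.print()  // prints the counts for each 5s batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```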
File stream
● File streams allow tracking new files in a given directory on HDFS
● Whenever a new file appears, Spark streaming will pick it up
● Only works for new files; modifications to existing files will not be considered
● Tracked using file creation time
FileStream example
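The demo is not in the transcript; a minimal sketch of a file-stream word count, assuming a placeholder HDFS input directory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("filestream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Picks up only files created in the directory after the stream starts
    val lines = ssc.textFileStream("hdfs:///tmp/streaming-input")
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```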
Receiver architecture

[Diagram: On the Spark cluster, a Receiver receives the stream and stores each mini batch via the Block Manager as a BlockRDD; the Job Generator in the streaming application (driver) then applies the DStream transformations to each mini batch]
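A custom receiver (the last source type listed earlier) implements this receive-and-store path itself. A minimal sketch using Spark's Receiver API, with placeholder host and port:

```scala
import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Reads lines from a socket and hands them to the Block Manager via store()
class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately
    new Thread("LineReceiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = ()  // the receiving loop exits once isStopped() is true

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    try {
      val lines = Source.fromInputStream(socket.getInputStream).getLines()
      for (line <- lines.takeWhile(_ => !isStopped())) {
        store(line)  // stored blocks become BlockRDDs in the mini batches
      }
    } finally {
      socket.close()
      restart("Retrying connection")  // ask Spark to restart the receiver
    }
  }
}

// Used as: ssc.receiverStream(new LineReceiver("localhost", 50050))
```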
Stateful operations
● Ability to maintain arbitrary state across multiple batches
● Fault tolerant
● Exactly-once semantics
● WAL (Write Ahead Log) for receiver crashes
StatefulWordcount example
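The demo is not in the transcript; a minimal sketch of a stateful word count using updateStateByKey, with placeholder host, port and checkpoint path:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("stateful-wordcount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Checkpointing is mandatory for stateful operations (path is a placeholder)
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val lines = ssc.socketTextStream("localhost", 50050)
    val pairs = lines.flatMap(_.split(" ")).map((_, 1))

    // fn(oldState, newInfo) => newState: add this batch's counts to the running total
    val updateFunc = (newCounts: Seq[Int], runningCount: Option[Int]) =>
      Some(newCounts.sum + runningCount.getOrElse(0))

    val totals = pairs.updateStateByKey[Int](updateFunc)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```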
How do stateful operations work?
● Generally, state is mutable
● But in functional programming, state is represented with a state machine going from one state to another:
  fn(oldState, newInfo) => newState
● In Spark, state is represented using RDDs
● A change in the state is represented using transformations of RDDs
● Fault tolerance of RDDs helps with fault tolerance of state
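The fn(oldState, newInfo) => newState idea can be illustrated without Spark at all: folding batches of new information over an immutable state value walks the state machine from state to state, just as DStream batches transform one state RDD into the next.

```scala
object StateMachineSketch {
  // fn(oldState, newInfo) => newState: merge a batch of words into running counts
  def updateState(oldState: Map[String, Int], newInfo: Seq[String]): Map[String, Int] =
    newInfo.foldLeft(oldState) { (state, word) =>
      state + (word -> (state.getOrElse(word, 0) + 1))
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("spark", "streaming"), Seq("spark"))
    // Each batch produces a new immutable state from the previous one
    val finalState = batches.foldLeft(Map.empty[String, Int])(updateState)
    println(finalState)  // Map(spark -> 2, streaming -> 1)
  }
}
```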
Transform API
● In stream processing, the ability to combine stream data with batch data is extremely important
● Both the batch API and the stream API share the RDD as an abstraction
● The transform API of DStream allows us to access the underlying RDDs directly

Ex: Combine customer sales data with customer information
CartCustomerJoin example
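The demo is not in the transcript; a sketch of the idea, joining a streaming cart feed against a static customer file via transform. The file paths, socket address and "id,name" / "id,item" record layouts are all placeholder assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CartCustomerJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cart-customer-join")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Static batch data: lines of "customerId,customerName" (path is a placeholder)
    val customers = ssc.sparkContext
      .textFile("hdfs:///tmp/customers.csv")
      .map { line => val parts = line.split(","); (parts(0), parts(1)) }

    // Streaming cart events: lines of "customerId,itemId"
    val carts = ssc.socketTextStream("localhost", 50050)
      .map { line => val parts = line.split(","); (parts(0), parts(1)) }

    // transform exposes each batch's RDD, so ordinary RDD joins work
    val joined = carts.transform(cartRdd => cartRdd.join(customers))
    joined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```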
Window-based operations
Window wordcount
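The demo is not in the transcript; a sketch of a windowed word count using reduceByKeyAndWindow, with placeholder host and port. Both the window and slide durations must be multiples of the batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("window-wordcount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val pairs = ssc.socketTextStream("localhost", 50050)
      .flatMap(_.split(" "))
      .map((_, 1))

    // Count words over the last 30 seconds, recomputed every 10 seconds
    val windowedCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```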
References
● http://www.slideshare.net/pacoid/qcon-so-paulo-realtime-analytics-with-spark-streaming
● http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
● https://spark.apache.org/docs/latest/streaming-programming-guide.html