Structured Streaming in Spark

Structured Streaming Spark Streaming 2.0 Giri R Varatharajan

Upload: giri-r-varatharajan

Post on 16-Apr-2017






Page 1: Structured streaming in Spark

Structured Streaming
Spark Streaming 2.0

https://hadoopist.wordpress.com
Giri R Varatharajan


What is Structured Streaming in Apache Spark

● Continuous data flow programming model, introduced in Spark 2.0

● Fault-tolerant, high-throughput system

● Exactly-once semantics: no duplicates

● Stateful aggregation over time, event, window, or record

● A streaming platform built on top of Spark SQL

● Express your streaming computation the same way you would express a batch computation in Spark SQL

● Alpha release shipped with Spark 2.0

● Supports HDFS and S3 now; support for Kafka, Kinesis, and other sources is coming very soon.


Spark Streaming < 2.0 Behavior

● Micro batching: streams are represented as Discretized Streams (DStreams).

● Running aggregations must be specified explicitly with the updateStateByKey method.

● Requires careful construction of fault tolerance.

Micro Batching
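For comparison with the newer API, here is a minimal sketch of a running word count on the pre-2.0 DStream API. The object name, the 5-second batch interval, the socket address, and the checkpoint path are illustrative assumptions, not taken from the slides.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamRunningWordCount {
  // State update function: add this batch's counts for a key to its running total.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(newValues.sum + running.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamRunningWordCount")
    // Assumed batch interval: one micro batch every 5 seconds.
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey needs a checkpoint directory; this path is hypothetical.
    ssc.checkpoint("/tmp/dstream-checkpoint")

    // Assumed input: text lines from a NetCat server on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // The running aggregation has to be wired up by hand.
    val runningCounts = pairs.updateStateByKey[Int](updateCount _)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The state handling and the checkpoint directory are both the developer's responsibility here, which is the burden the bullets above refer to.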


Streaming Model

● Live data streams keep appending to a DataFrame called the Unbounded Table.

● Incremental aggregates run on the Unbounded Table.


Spark Streaming



● Continuous data flow: streams are appended to an Unbounded Table that is queried through the DataFrame APIs.

● No need to call a special method to run aggregates over time, window, or record.

● See the network socket word count program.

● Results are emitted in Complete, Append, or Update output mode.

Continuous Data Flow

lines = Input Table

wordCounts = Result Table
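The network socket word count can be sketched as follows, with `lines` as the input table and `wordCounts` as the result table. This is a minimal sketch: the object name and the `countWords` helper are illustrative, and it assumes a NetCat server (`nc -lk 9999`) on localhost.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object StructuredNetworkWordCount {
  // The same code works on a static Dataset and on a streaming one.
  def countWords(lines: Dataset[String]): DataFrame = {
    import lines.sparkSession.implicits._
    lines.flatMap(_.split(" ")).groupBy("value").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("StructuredNetworkWordCount")
      .getOrCreate()
    import spark.implicits._

    // Input table: one row per line received on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]

    // Result table: running count per word, maintained incrementally by Spark.
    val wordCounts = countWords(lines)

    // Complete mode prints the whole result table after every trigger.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Note that there is no explicit state update function: the running aggregate is just a `groupBy(...).count()`.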


Streaming Model

// Socket stream: read lines as they arrive on the NetCat channel
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()


Streaming Model

val windowedCounts = words.groupBy(
  window($"timestamp", windowDuration, slideDuration),
  $"word"
).count().orderBy("window")
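The snippet above leaves `words`, `windowDuration`, and `slideDuration` undefined. One way to fill them in is sketched below. The object name, the `countByWindow` helper, and the 10-second/5-second durations are assumptions, and the sketch assumes a Spark version whose socket source supports the `includeTimestamp` option.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.window

object WindowedWordCount {
  // Expects columns "word" (string) and "timestamp" (timestamp).
  def countByWindow(words: DataFrame, windowDuration: String, slideDuration: String): DataFrame = {
    import words.sparkSession.implicits._
    words
      .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
      .count()
      .orderBy("window")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("WindowedWordCount")
      .getOrCreate()
    import spark.implicits._

    // Ask the socket source to attach an arrival timestamp to every line.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()

    // Split each line into words, keeping the line's timestamp with every word.
    val words = lines.as[(String, Timestamp)]
      .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
      .toDF("word", "timestamp")

    // Assumed durations: 10-second windows sliding every 5 seconds.
    val windowedCounts = countByWindow(words, "10 seconds", "5 seconds")

    // Sorting a streaming result requires an aggregation and Complete mode.
    windowedCounts.writeStream
      .outputMode("complete")
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```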


Create/Read Streams


● File Source (HDFS, S3, Text, Parquet, Csv,


● Socket Stream (NetCat)

● Kafka, Kinesis, and other input sources are still in development, so cross your fingers.

● DataStreamReader API
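As an illustration of the DataStreamReader API with a file source, here is a minimal sketch that streams CSV files from a directory. The schema, the input directory, and the object and method names are all hypothetical. File sources need an explicit schema unless schema inference is enabled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object FileSourceExample {
  // Hypothetical schema for the incoming CSV files.
  val userSchema: StructType = new StructType()
    .add("name", StringType)
    .add("age", IntegerType)

  // spark.readStream returns a DataStreamReader; csv(dir) picks up new files in dir.
  def readCsvStream(spark: SparkSession, dir: String): DataFrame =
    spark.readStream
      .schema(userSchema)
      .option("sep", ",")
      .csv(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("FileSourceExample")
      .getOrCreate()

    // Hypothetical input directory; an HDFS or S3 URI would be used the same way.
    val users = readCsvStream(spark, "/tmp/users-in")

    users.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```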





Outputting Streams


Output Sink Types:

● Parquet Sink - HDFS, S3, Parquet

● Console Sink - Terminal

● Memory Sink - In memory table that can be queried over time interactively

● Foreach Sink

● DataStreamWriter



Output Modes:

● Append Mode (default)

○ Only new rows are appended to the sink.

○ Applicable only to non-aggregated queries (select, where, filter, join, etc.)

● Complete Mode

○ The whole result table is written to the sink.

○ Applicable only to aggregated queries (groupBy, etc.)

● Update Mode

○ Only the rows that were updated are written to the output sink.
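To make the sink and output mode choices concrete, here is a minimal sketch that writes an aggregated stream to the memory sink in Complete mode and then queries it interactively. The table name `word_counts`, the helper name, the 5-second polling interval, and the socket address are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object MemorySinkExample {
  // Memory sink: results land in an in-memory table named after the query.
  def startInMemoryTable(counts: DataFrame, tableName: String): StreamingQuery =
    counts.writeStream
      .outputMode("complete") // aggregated query, so the whole result is rewritten
      .format("memory")
      .queryName(tableName)
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("MemorySinkExample")
      .getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
    val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()

    val query = startInMemoryTable(wordCounts, "word_counts")

    // Query the in-memory table while the stream keeps updating it.
    while (query.isActive) {
      spark.sql("SELECT * FROM word_counts ORDER BY `count` DESC").show()
      Thread.sleep(5000)
    }
  }
}
```

Swapping `format("memory")` for `format("console")` prints each result to the terminal instead. The memory sink keeps its table in driver memory, so it only suits small results.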


CheckPointing

● On failure, recover the progress and state of the previous query and continue where it left off.

● Configure a checkpoint location on the DataStreamWriter returned by writeStream.

● Must be configured for the Parquet sink and the file sink.
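A checkpoint location is set as an option on the writer. The sketch below writes a non-aggregated stream to a Parquet sink in Append mode. The output and checkpoint paths and the object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object CheckpointedParquetSink {
  def startParquetSink(df: DataFrame, outputDir: String, checkpointDir: String): StreamingQuery =
    df.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", outputDir)
      // Progress and state are saved here so a restarted query can resume.
      .option("checkpointLocation", checkpointDir)
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("CheckpointedParquetSink")
      .getOrCreate()

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Hypothetical paths; HDFS or S3 URIs would be used the same way.
    val query = startParquetSink(lines, "/tmp/lines-parquet", "/tmp/lines-checkpoint")
    query.awaitTermination()
  }
}
```

Restarting the same query with the same checkpoint directory is what lets it pick up where the previous run stopped.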


Operations Not Yet Supported

● Sort, Limit of First N rows, Distinct on Input


● Joins between two streaming datasets

● Outer joins (full, left, right) between two streaming


● ds.count() ⇒ use ds.groupBy().count() instead
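The `count()` workaround in the last bullet can be sketched as follows. The object and helper names are illustrative, and the socket address is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object StreamingRunningCount {
  // Grouping by no columns yields a single-row table holding the total count.
  def runningCount(ds: Dataset[_]): DataFrame = ds.groupBy().count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("StreamingRunningCount")
      .getOrCreate()

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // lines.count() is rejected on a streaming Dataset; the aggregate below
    // is a streaming DataFrame whose single row is updated on every trigger.
    runningCount(lines).writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```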


Key Takeaways

● Structured Streaming is still experimental, but please try it out.

● Streaming events are gathered and appended to an infinite DataFrame (the Unbounded Table), and queries run on top of it.

● Development is very similar to developing against the static DataFrame/Dataset APIs.

● Execute ad-hoc queries, run aggregates, update databases, track session data, prepare dashboards, etc.

● readStream: the schema of a streaming DataFrame is checked only at run time, hence it is untyped.

● writeStream: various output modes and output sinks are available. Always keep in mind which type of output to use when.

● Kafka, Kinesis, MLlib integrations, sessionization, and watermarks are upcoming features being developed by the open source community.

● Structured Streaming is not recommended for production workloads at this point, even for file or socket streaming.


Thank You

Spark code is available on my GitHub:

Other Spark related repositories:

My blogs and Learning in Spark: