Real time data processing with Spark & Cassandra @ NoSQL Matters 2015, Paris


TRANSCRIPT

Real time data processing with Spark & Cassandra
DuyHai DOAN, Technical Advocate

Who am I?
Duy Hai DOAN, Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact

duy_hai.doan@datastax.com / @doanduyhai

Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquartered in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Spark & Cassandra Integration

•  Spark & its eco-system
•  Cassandra & token ranges
•  Stand-alone cluster deployment

What is Apache Spark?
•  Created at UC Berkeley's AMPLab, an Apache project since 2010
•  General data processing framework
•  MapReduce is not the A & Ω
•  One-framework-many-components approach

Spark characteristics
Fast
•  10x-100x faster than Hadoop MapReduce
•  in-memory storage
•  single JVM process per node, multi-threaded

Easy
•  rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  interactive shell

Spark code example
Setup

val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")

val sc = new SparkContext(conf)

Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)

val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))

RDDs
RDD = Resilient Distributed Dataset

val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
  .map(tuple => (tuple._3, tuple))

val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

val countByAge: Map[Int, Long] = groupByAge.countByKey()

RDDs
RDD[A] = distributed collection of A
•  RDD[Person]
•  RDD[(String, Int)], …

An RDD[A] is split into partitions
Partitions are distributed over n workers → parallel computing
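To make the partition idea concrete, here is a minimal sketch (the partition count of 3 is illustrative):

// parallelize the same people list into an explicit number of partitions
val partitionedPeople = sc.parallelize(people, numSlices = 3)

partitionedPeople.partitions.length                                 // 3
// each partition is processed by its own task, in parallel
partitionedPeople.mapPartitions(it => Iterator(it.size)).collect()  // e.g. Array(1, 1, 1)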


Spark eco-system

[Diagram: Spark Streaming, MLLib, GraphX and Spark SQL sit on top of the Spark Core Engine (Scala/Java/Python), which runs on a cluster manager (Local, Standalone cluster, YARN or Mesos) over a persistence layer]


What is Apache Cassandra?
•  Created at Facebook, an Apache project since 2009
•  Distributed NoSQL database
•  Eventual consistency (A & P of the CAP theorem)
•  Distributed table abstraction

Cassandra data distribution reminder
•  Random distribution: token = hash(#partition)
•  Hash range: ]-X, X], where X is a huge number (2^64 / 2)

[Diagram: ring of 8 nodes, n1 … n8]

Cassandra token ranges
Murmur3 hash function

A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

[Diagram: ring of 8 nodes n1 … n8, each owning one token range A … H]
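To make the mapping from partition key to token range concrete, here is a minimal sketch; the hash below is a deterministic stand-in, not the real Murmur3 function, and the ring is simplified to 8 equal ranges over ]-X, X]:

// tokens live in ]-X, X]; each of the 8 nodes owns one contiguous sub-range
val X = Long.MaxValue // ~ 2^64 / 2

// stand-in hash: any deterministic function of the partition key would do
def token(partitionKey: String): Long =
  partitionKey.getBytes("UTF-8").foldLeft(1125899906842597L)((h, b) => 31 * h + b)

// map a token to one of the 8 equal ranges, labelled A … H
def range(t: Long): Char = {
  val bucket = (((t.toDouble / X) + 1.0) / 2.0 * 8).toInt.min(7)
  ('A' + bucket).toChar
}

Seq("jdoe", "hsue", "rsmith").foreach { k =>
  println(s"$k -> token ${token(k)} -> range ${range(token(k))}")
}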

Linear scalability

[Diagram: partition keys user_id1 … user_id5 are hashed to tokens and land on whichever of the nodes n1 … n8 owns the matching range A … H]


Cassandra Query Language (CQL)

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

UPDATE users SET age = 34 WHERE login = 'jdoe';

DELETE age FROM users WHERE login = 'jdoe';

SELECT age FROM users WHERE login = 'jdoe';


Why Spark on Cassandra?

For Spark:
•  reliable persistent store (HA)
•  structured data (Cassandra CQL → DataFrame API)
•  multi data-center!

For Cassandra:
•  cross-table operations (JOIN, UNION, etc.)
•  real-time/batch processing
•  complex analytics (e.g. machine learning)

Use cases

•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize data (sketched below)
•  Schema migration, data conversion
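A minimal sketch of the sanitize-and-load use case, assuming the connector import and saveToCassandra API shown later in this deck; the file path, schema and table are illustrative, and well-formed integer ages are assumed:

import com.datastax.driver.spark._

case class User(login: String, name: String, age: Int)

val users = sc.textFile("/data/users.csv")
  .map(_.split(","))
  .filter(_.length == 3)                                  // validate: expect 3 fields
  .map(f => User(f(0).trim, f(1).trim, f(2).trim.toInt))  // normalize
  .filter(u => u.age >= 0 && u.age < 150)                 // sanitize

users.saveToCassandra("test", "users", Seq("login", "name", "age"))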

Cluster deployment
Stand-alone cluster

[Diagram: five nodes, each running a Cassandra process (C*) co-located with a Spark worker (SparkW); one node also hosts the Spark master (SparkM)]

Cluster deployment
Cassandra – Spark placement: 1 Cassandra process ⟷ 1 Spark worker

[Diagram: the Driver Program talks to the Spark Master; each Spark Worker spawns an Executor on a node that also runs a Cassandra (C*) process]

Spark & Cassandra Connector

•  Core API
•  Spark SQL
•  Spark Streaming

Connector architecture
•  All Cassandra types supported and converted to Scala types
•  Server-side data filtering (SELECT … WHERE …; sketched below)
•  Uses the Java driver underneath
•  Scala and Java support
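A minimal sketch of server-side filtering, using the test.words table defined later in this deck; the where clause is executed by Cassandra rather than Spark, and filtering on a non-key column like count assumes a secondary index or ALLOW FILTERING:

// only the selected columns and matching rows leave the Cassandra nodes
sc.cassandraTable("test", "words")
  .select("word", "count")
  .where("count > ?", 25)
  .collect()
  .foreach(println)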

Connector architecture – Core API
Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects (sketched below)
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples
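A minimal sketch of the three mapping styles, again on test.words:

// 1) generic CassandraRow, columns accessed by name
val rows = sc.cassandraTable("test", "words")
rows.map(row => (row.getString("word"), row.getInt("count")))

// 2) Scala case class via the object mapper (columns matched by name)
case class WordCount(word: String, count: Int)
val typed = sc.cassandraTable[WordCount]("test", "words")

// 3) Scala tuples (columns matched by position)
val tuples = sc.cassandraTable[(String, Int)]("test", "words")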

Connector architecture – Spark SQL
Mapping of a Cassandra table to a SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates down to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
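A minimal sketch of the Spark SQL side, assuming the connector-era CassandraSQLContext (the exact class and package vary by connector version):

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")

// the WHERE predicate is pushed down to CQL for early, server-side filtering
val filtered = cc.sql("SELECT word, count FROM words WHERE word = 'bar'")
filtered.collect().foreach(println)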

Connector architecture – Spark Streaming
Streaming data INTO a Cassandra table (sketched below)
•  trivial setup
•  be careful about your Cassandra data model!

Streaming data OUT of Cassandra tables?
•  work in progress …
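A minimal sketch of streaming INTO Cassandra; the socket source and batch interval are illustrative, and the import paths follow the released connector (com.datastax.spark.connector…), which differ from the alpha-era import shown later in this deck:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(5))

// count words arriving on a socket and upsert the counts into test.words
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("test", "words", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()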

Connector API

•  Connector API
•  Data Locality Implementation

Connector API
Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

Connector API
Preparing test data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Connector API
Reading from Cassandra

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames    // Stream(word, count)
rdd.size           // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.getInt("count")  // Int = 30

Connector API
Writing data to Cassandra

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

Remember token ranges?

A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

[Diagram: the 8-node ring again, nodes n1 … n8 owning ranges A … H]

Data locality

[Diagram: the stand-alone cluster again (C* + SparkW on each node, SparkM on one); each Spark RDD partition is built from the Cassandra token ranges owned by the local node]

Data locality
Use the Murmur3Partitioner

[Diagram: the same cluster; the partition-to-node mapping relies on the Murmur3Partitioner]

Read data locality

[Diagram: reads are served by the local Cassandra node; subsequent Spark shuffle operations move data across the network]

Repartition before write
Write to Cassandra

rdd.repartitionByCassandraReplica("keyspace", "table")

Or async batch writes

[Diagram: after Spark shuffle operations, asynchronous batches fan out the writes to the Cassandra nodes]

Write data locality
•  either stream data with Spark using repartitionByCassandraReplica()
•  or flush data to Cassandra in async batches
•  in any case there will be data movement on the network (sorry, no magic)

A sketch of both options follows below.
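A minimal sketch of the two options; the keyspace and table names are illustrative, and the rdd elements are assumed to map onto the table's partition key:

// Option 1: co-locate data with its replicas before writing
rdd.repartitionByCassandraReplica("ks", "table")
   .saveToCassandra("ks", "table")

// Option 2: plain save — the connector groups rows into asynchronous
// batches and fans them out to the owning Cassandra nodes
rdd.saveToCassandra("ks", "table")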

Joins with data locality

CREATE TABLE artists(name text, style text, …, PRIMARY KEY(name));

CREATE TABLE albums(title text, artist text, year int, …, PRIMARY KEY(title));

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only the columns useful for the join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    // Repartition the RDD by the "artists" partition key, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with the "artists" table, selecting only "name" and "country"
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))

Joins pipeline with data locality

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    .select("artist", "year")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .map(…)
    .filter(…)
    .groupByKey()
    .mapValues(…)
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
    .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
    …

Perfect data locality scenario
•  read locally from Cassandra
•  use only operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica() → to a table having the same partition key as the original table
•  save back into this Cassandra table

A sketch of this pipeline follows below.
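A minimal sketch of that scenario, assuming an input and an output table sharing the same partition key (names are illustrative):

sc.cassandraTable[(String, Int)]("ks", "words")          // read locally
  .filter { case (_, count) => count > 10 }              // no shuffle
  .map { case (word, count) => (word, count * 2) }       // no shuffle
  .repartitionByCassandraReplica("ks", "words_doubled")  // same partition key: word
  .saveToCassandra("ks", "words_doubled")                // local writes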

Demo

https://github.com/doanduyhai/Cassandra-Spark-Demo

What's in the future?
Datastax Enterprise 4.7
•  Cassandra + Spark + Solr as your analytics platform
•  filter out as much data as possible with Solr inside Cassandra
•  fetch the filtered data into Spark and perform aggregations
•  save the final data back into Cassandra

What's in the future?
What about data locality?

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only the columns useful for the join and processing,
    // filtering server-side with Solr
    .select("artist", "year")
    .where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .where("solr_query = 'age:[20 TO 30]'")

What's in the future?
1.  compute Spark partitions using Cassandra token ranges
2.  on each partition, use Solr for local data filtering (no fan-out!)
3.  fetch the data back into Spark for aggregations
4.  repeat 1 – 3 as many times as necessary

What's in the future?

SELECT … FROM …
WHERE token(#partition) > 3X/8
  AND token(#partition) <= 4X/8
  AND solr_query = 'full text search expression';

Advantages of the same-JVM Cassandra + Solr integration:
1.  the query is restricted to one local token range (e.g. D: ]3X/8, 4X/8])
2.  single-pass local full-text search (no fan-out)
3.  data retrieval stays local

Q & A

Thank You @doanduyhai

duy_hai.doan@datastax.com

https://academy.datastax.com/
