Real time data processing with Spark & Cassandra @ NoSQLMatters 2015 Paris

@doanduyhai Real time data processing with Spark & Cassandra DuyHai DOAN, Technical Advocate


TRANSCRIPT

Page 1: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Real time data processing with Spark & Cassandra DuyHai DOAN, Technical Advocate

Page 2: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Who Am I ?!
Duy Hai DOAN, Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact

[email protected] @doanduyhai

2

Page 3: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Datastax!
•  Founded in April 2010

•  We contribute a lot to Apache Cassandra™

•  400+ customers (25 of the Fortune 100), 200+ employees

•  Headquarters in the San Francisco Bay Area

•  EU headquarters in London, offices in France and Germany

•  Datastax Enterprise = OSS Cassandra + extra features

3

Page 4: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

Spark & Cassandra Integration!

Spark & its eco-system
Cassandra & token ranges
Stand-alone cluster deployment

Page 5: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

What is Apache Spark ?!
An Apache project since 2010
General data processing framework
MapReduce is not the A & Ω
One-framework-many-components approach

5

Page 6: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Spark characteristics!
Fast
•  10x-100x faster than Hadoop MapReduce
•  In-memory storage
•  Single JVM process per node, multi-threaded

Easy
•  Rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  Interactive shell

6

Page 7: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Spark code example!
Setup

val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")

val sc = new SparkContext(conf)

Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)

val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))

7

Page 8: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

RDDs!
RDD = Resilient Distributed Dataset

val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
  .map(tuple => (tuple._3, tuple))

val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

val countByAge: Map[Int, Long] = groupByAge.countByKey()
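To see what this pipeline computes, here is a plain-Scala stand-in (ordinary collections instead of RDDs — not the Spark API, but the same grouping logic):

```scala
// Plain-Scala stand-in for the RDD pipeline above: group the people tuples
// by age (the third field) and count the members of each group.
val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))

val countByAge: Map[Int, Long] =
  people.groupBy(_._3)                                   // like groupByKey on (age, tuple)
        .map { case (age, ps) => age -> ps.size.toLong } // like countByKey

// countByAge == Map(33 -> 2, 24 -> 1)
```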

8

Page 9: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

RDDs!
RDD[A] = distributed collection of A
•  RDD[Person]
•  RDD[(String, Int)], …

RDD[A] is split into partitions
Partitions are distributed over n workers → parallel computing
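The idea above can be sketched with plain Scala collections standing in for an RDD (the dataset and the partition count of 3 are arbitrary choices for illustration):

```scala
// Split a dataset into partitions; each partition could then be processed
// independently on a different worker, and the partial results merged.
val data = (1 to 10).toList
val numPartitions = 3
val partitionSize = (data.size + numPartitions - 1) / numPartitions  // ceil(10/3) = 4
val partitions = data.grouped(partitionSize).toList                  // sizes 4, 4, 2

val total = partitions.map(_.sum)  // per-partition partial sums
                      .sum         // merged result: 55
```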

9

Page 10: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Spark eco-system!

Components: Spark Streaming, MLlib, GraphX, Spark SQL
Spark Core Engine (Scala/Java/Python)
Cluster Manager: Local, Standalone cluster, YARN, Mesos
Persistence

10


Page 12: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

What is Apache Cassandra?!
An Apache project since 2009
Distributed NoSQL database
Eventual consistency (A & P of the CAP theorem)
Distributed table abstraction

12

Page 13: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Cassandra data distribution reminder!
Random distribution: token = hash(#partition)
Token range: ]-X, X], X = huge number (2⁶⁴/2)

(ring diagram: 8 nodes, n1 … n8)

13

Page 14: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Cassandra token ranges!
Murmur3 hash function
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

(ring diagram: nodes n1 … n8 own token ranges A … H)
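The range arithmetic above can be sketched in a few lines of plain Scala (illustrative only: the simplified ]0, X] token space, X = 2⁶⁴/2 and n = 8 come from the slide; this is not the connector's split logic):

```scala
// Cut the token space ]0, X] into n contiguous ranges ](i*X)/n, ((i+1)*X)/n]
val X = BigInt(2).pow(63)          // the slide's "huge number" (2^64 / 2)
val n = 8
val bounds = (0 to n).map(i => (X * i) / n)
val ranges = bounds.sliding(2).collect { case Seq(lo, hi) => (lo, hi) }.toList

// ranges.head == (0, X/8)   -> range A
// ranges.last == (7*X/8, X) -> range H
```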

14

Page 15: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Linear scalability!

(ring diagram: nodes n1 … n8 own token ranges A … H; keys user_id1 … user_id5 are hashed onto the ring)

15


Page 17: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Cassandra Query Language (CQL)!

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

UPDATE users SET age = 34 WHERE login = 'jdoe';

DELETE age FROM users WHERE login = 'jdoe';

SELECT age FROM users WHERE login = 'jdoe';

17


Page 19: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Why Spark on Cassandra ?!
For Spark:
•  Reliable persistent store (HA)
•  Structured data (Cassandra CQL → Dataframe API)
•  Multi data-center !!!

For Cassandra:
•  Cross-table operations (JOIN, UNION, etc.)
•  Real-time/batch processing
•  Complex analytics (e.g. machine learning)

19

Page 20: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Use Cases!

Load data from various sources

Analytics (join, aggregate, transform, …)

Sanitize, validate, normalize data

Schema migration, Data conversion

20

Page 21: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Cluster deployment!

Stand-alone cluster: the Spark Master (SparkM) runs on one node; a Spark Worker (SparkW) is co-located with each Cassandra (C*) node

21

Page 22: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Cluster deployment!

Driver Program → Spark Master → Spark Workers
Each Spark Worker runs an Executor next to its local Cassandra (C*) process

Cassandra – Spark placement: 1 Cassandra process ⟷ 1 Spark worker

22

Page 23: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

Spark & Cassandra Connector!

Core API
SparkSQL
SparkStreaming

Page 24: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector architecture!
All Cassandra types supported and converted to Scala types
Server-side data filtering (SELECT … WHERE …)
Uses the Java driver underneath
Scala and Java support

24

Page 25: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector architecture – Core API!
Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples
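As a sketch of the case-class mapping target (the commented-out cassandraTable[T] call follows the connector style described above but needs a live cluster, so only the plain-Scala part runs here):

```scala
// A case class whose fields mirror the columns of the test.words table
case class WordCount(word: String, count: Int)

// With the connector's object mapper, rows map onto the case class by
// column name, e.g. (needs a SparkContext wired to a Cassandra cluster):
// val words = sc.cassandraTable[WordCount]("test", "words")

// The mapping target itself is ordinary Scala:
val row = WordCount("bar", 30)
```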

25

Page 26: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector architecture – Spark SQL!

Mapping of Cassandra table to SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
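A toy sketch of what "push predicates to CQL" means (illustrative only: the pushDownEquality helper is hypothetical; the real connector derives the clause from the Spark SQL query plan and handles escaping and bind markers):

```scala
// Turn an equality predicate into a CQL WHERE clause so filtering
// happens server-side in Cassandra instead of inside Spark.
def pushDownEquality(table: String, column: String, value: String): String =
  s"SELECT * FROM $table WHERE $column = '$value';"

val cql = pushDownEquality("user_emails", "login", "jdoe")
// cql == "SELECT * FROM user_emails WHERE login = 'jdoe';"
```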

26

Page 27: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector architecture – Spark Streaming!

Streaming data INTO Cassandra tables
•  trivial setup
•  be careful about your Cassandra data model !!!

Streaming data OUT of Cassandra tables?
•  work in progress …

27

Page 28: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

Connector API!

Connector API
Data Locality Implementation

Page 29: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector API!
Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

29

Page 30: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector API!
Preparing test data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

30

Page 31: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector API!
Reading from Cassandra

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames   // Stream(word, count)
rdd.size          // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.getInt("count")   // Int = 30

31

Page 32: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Connector API!
Writing data to Cassandra

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

32

Page 33: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Remember token ranges ?!
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

(ring diagram: nodes n1 … n8 own token ranges A … H)

33

Page 34: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Data Locality!

(diagram: a Spark Worker co-located with each Cassandra node)

Spark RDD partitions map onto Cassandra token ranges

34

Page 35: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Data Locality!

(diagram: a Spark Worker co-located with each Cassandra node)

Use Murmur3Partitioner

35

Page 36: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Read data locality!
Read from Cassandra

Spark shuffle operations

36

Page 37: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Repartition before write !

Write to Cassandra

rdd.repartitionByCassandraReplica("keyspace", "table")

37

Page 38: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Or async batch writes!

Async batches fan-out writes to Cassandra

Spark shuffle operations

38

Page 39: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Write data locality!

39

•  either stream data with Spark using repartitionByCassandraReplica()
•  or flush data to Cassandra by async batches
•  in any case, there will be data movement on the network (sorry, no magic)

Page 40: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Joins with data locality!

40

CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));

CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    // Repartition RDDs by the "artists" PK, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with the "artists" table, selecting only "name" and "country" columns
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name", "country"))
    .on(SomeColumns("name"))

Page 41: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Joins pipeline with data locality!

41

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name", "country"))
    .on(SomeColumns("name"))
    .map(…)
    .filter(…)
    .groupByKey()
    .mapValues(…)
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
    .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
    …

Page 42: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

Perfect data locality scenario!

42

•  read locally from Cassandra
•  use operations that do not require a shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica() → to a table having the same partition key as the original table
•  save back into this Cassandra table

Page 43: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

Demo

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 44: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

What's for future ?!
Datastax Enterprise 4.7
•  Cassandra + Spark + Solr as your analytics platform

Filter out as much data as possible with Solr inside Cassandra
Fetch the filtered data into Spark and perform aggregations
Save the final data back into Cassandra

44

Page 45: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

What's for future ?!
What about data locality?

45

Page 46: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

What's for future ?!

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name", "country"))
    .on(SomeColumns("name"))
    .where("solr_query = 'age:[20 TO 30]'")

1.  compute Spark partitions using Cassandra token ranges
2.  on each partition, use Solr for local data filtering (no fan out !)
3.  fetch data back into Spark for aggregations
4.  repeat 1 – 3 as many times as necessary

46

Page 47: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

@doanduyhai

What's for future ?!

47

SELECT … FROM …
WHERE token(#partition) > 3X/8
  AND token(#partition) <= 4X/8
  AND solr_query = 'full text search expression';

Advantages of same-JVM Cassandra + Solr integration:
1.  Single-pass local full text search (no fan out)
2.  Data retrieval

(diagram: the query targets token range D: ]3X/8, 4X/8])

Page 48: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

Q & A

Page 49: Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris

Thank You @doanduyhai

[email protected]

https://academy.datastax.com/