big data-driven applications with cassandra and spark

46
Big Data-Driven Applications with Cassandra and Spark Artem Chebotko, Ph.D. Solution Architect

Upload: artem-chebotko

Post on 23-Jan-2018

530 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Big Data-Driven Applications  with Cassandra and Spark

Big Data-Driven Applications

with Cassandra and Spark

Artem Chebotko, Ph.D.

Solution Architect

Page 2: Big Data-Driven Applications  with Cassandra and Spark

1 Modern Big Data and Cloud Applications

2 Cassandra and Spark Highlights

3 Architecture Overview

4 Languages and APIs

5 Live Demo

2

Page 3: Big Data-Driven Applications  with Cassandra and Spark

Modern Application Requirements

• Numerous Endpoints

• Geographically Distributed

• Continuously Available

• Instantaneously Responsive

• Immediately Decisive

• Predictably Scalable

3

Page 4: Big Data-Driven Applications  with Cassandra and Spark

Applications by Response Time and Workload

4

Analytical (OLAP)Operational (OLTP)

Page 5: Big Data-Driven Applications  with Cassandra and Spark

Applications by Response Time and Workload

5

Real-time transactions

Analytical (OLAP)Operational (OLTP)

• Web and IoT apps

• Financial transactions

Page 6: Big Data-Driven Applications  with Cassandra and Spark

Applications by Response Time and Workload

6

Real-time transactions

Real-time analytics

Analytical (OLAP)Operational (OLTP)

• Web and IoT apps

• Financial transactions

• Recommendations

Page 7: Big Data-Driven Applications  with Cassandra and Spark

Applications by Response Time and Workload

7

Real-time transactions

Real-time analytics

Streaming analytics

Analytical (OLAP)Operational (OLTP)

• Web and IoT apps

• Financial transactions

• Recommendations

• Fraud prevention

Page 8: Big Data-Driven Applications  with Cassandra and Spark

Applications by Response Time and Workload

8

Real-time transactions

Real-time analytics

Streaming analytics

Batch analytics

Analytical (OLAP)Operational (OLTP)

• Web and IoT apps

• Financial transactions

• Recommendations

• Fraud prevention

• Predictive models

• Fraud detection

Page 9: Big Data-Driven Applications  with Cassandra and Spark

Roles of Cassandra and Spark

9

Real-time transactions

Real-time analytics

Streaming analytics

Batch analytics

Analytical (OLAP)Operational (OLTP)

• Web and IoT apps

• Financial transactions

• Recommendations

• Fraud prevention

• Predictive models

• Fraud detection

Numerous Endpoints

Geographically Distributed

Continuously Available

Instantaneously Responsive

Immediately Decisive

Predictably Scalable

Page 10: Big Data-Driven Applications  with Cassandra and Spark

1 Modern Big Data and Cloud Applications

2 Cassandra and Spark Highlights

3 Architecture Overview

4 Languages and APIs

5 Live Demo

10

Page 11: Big Data-Driven Applications  with Cassandra and Spark

Cassandra – Operational Database

• Millions of concurrent users

• Millisecond response time

• Linear scalability

• Always on

11

Page 12: Big Data-Driven Applications  with Cassandra and Spark

Cassandra – Operational Database

12

Page 13: Big Data-Driven Applications  with Cassandra and Spark

Spark – Analytics Platform

• Real-time, streaming and batch analytics

• Up to 100x faster than Hadoop

• Scalability, fault-resilience

• Versatile and rich API

13

Page 14: Big Data-Driven Applications  with Cassandra and Spark

Spark – Analytics Platform

14

SQL Streaming MLlib GraphX

Cluster Manager

Standalone YARN Mesos

Page 15: Big Data-Driven Applications  with Cassandra and Spark

Spark-Cassandra Connector

Open-Source Package for Spark

• Routine Spark-Cassandra interactions

– Read from and write into Cassandra

• Profound optimizations

– Predicate pushdown

– Data locality

– Cassandra-optimized joins

– Cassandra-aware partitioning

– Shuffle-free grouping

15

Page 16: Big Data-Driven Applications  with Cassandra and Spark

1 Modern Big Data and Cloud Applications

2 Cassandra and Spark Highlights

3 Architecture Overview

4 Languages and APIs

5 Live Demo

16

Page 17: Big Data-Driven Applications  with Cassandra and Spark

C*: Distributed, Shared Nothing, Peer-to-Peer

17

C* Client

C*

C*

C*

-263

+263-1

Driver

C* Client Driver

transaction

transaction

transaction

transaction

Page 18: Big Data-Driven Applications  with Cassandra and Spark

C*: Partitioning and Replication

18

replica 2

replica 1

replica 3

coordinatorpartitioner

partition

partition

key

write request

acknowledgment

CL=QUORUM

RF=3

...

...

TABLE

Page 19: Big Data-Driven Applications  with Cassandra and Spark

C*: Partitioning and Replication

19

replica 2

replica 1

replica 3

partition

partition

key

result

CL=ONE

RF=3

coordinatorpartitioner

read request

Page 20: Big Data-Driven Applications  with Cassandra and Spark

Spark: Master-Worker, Failover Masters

20

Spark

ClientDriver

Master

Worker

SparkContext

Spark

ClientDriver

SparkContextWorker

Worker

Executor Executor

Executor Executor

Executor Executor

Page 21: Big Data-Driven Applications  with Cassandra and Spark

Spark: Computation Scheduling

21

Driver

SparkContextDAG Job 0

Stage 1

task task

task

Stage 0

task task

task

Stage 2

task task

task

Job 1

Stage 4

task task

task

Stage 3

task task

task

Stage 5

task task

task

Executor

task

cache

task

Executor

task

cache

task

Executor

task

cache

task

Page 22: Big Data-Driven Applications  with Cassandra and Spark

Spark-Cassandra Connector

22

C*

C*

C*

Master WorkerExecutor

Executor

Spark-Cassandra Connector

WorkerExecutor

Executor

Spark-Cassandra

Connector

WorkerExecutor

Executor

Spark-Cassandra

Connector

Page 23: Big Data-Driven Applications  with Cassandra and Spark

Spark-Cassandra Connector

23

C*

C*

C*

Master WorkerExecutor

Executor

Spark-Cassandra Connector

WorkerExecutor

Executor

Spark-Cassandra

Connector

WorkerExecutor

Executor

Spark-Cassandra

Connector

Clu

ste

r N

ode

Spark NodeMaster JVM

Connector.jar

Worker JVMExecutor JVM

Executor JVM

C* NodeC* JVM

Page 24: Big Data-Driven Applications  with Cassandra and Spark

Multi-DC Deployment and Workload Separation

24

C* Client Driver Spark

ClientDriver

SparkContextC* Client Driver

C*

C*

C*

Master

WorkerExecutor

WorkerWorkerC*

C*

C*

Executor

Executor

ExecutorExecutor Executor

Executor

C*C*

C*

Data

Replication

C* Client Driver

Spark

ClientDriver

SparkContext

real-time

transactions

interactive and

batch analytics

DC

Operations

DC

Analytics

Page 25: Big Data-Driven Applications  with Cassandra and Spark

1 Modern Big Data and Cloud Applications

2 Cassandra and Spark Highlights

3 Architecture Overview

4 Languages and APIs

5 Live Demo

25

Page 26: Big Data-Driven Applications  with Cassandra and Spark

Getting Started with Cassandra and Spark Applications

• Data Model and Cassandra Query Language

• Core Spark and Spark-Cassandra Connector

26

Page 27: Big Data-Driven Applications  with Cassandra and Spark

Keyspace and Replication

27

CREATE KEYSPACE iot

WITH replication = {'class': 'NetworkTopologyStrategy',

'DC-Kyiv-Operations' : 3,

'DC-Houston-Analytics': 2};

USE iot;

Page 28: Big Data-Driven Applications  with Cassandra and Spark

Table with Single-Row Partitions

28

username age address

Alice 28 Santa Clara, CA

Alex 37 Austin, TX

users CREATE TABLE users (

username TEXT,

age INT,

address TEXT,

PRIMARY KEY(username)

);

SELECT * FROM users

WHERE username = ?;

Page 29: Big Data-Driven Applications  with Cassandra and Spark

Table with Single-Row Partitions

29

id type settings owner

1 phone {gps ⇒ on,

pedometer ⇒ on}

Alice

2 wristband {heart rate ⇒ on, …} Alice

3 thermostat {temp ⇒ 75, …} Alice

4 security {…} Alex

5 phone {…} Alex

sensors CREATE TABLE sensors (

id INT,

type TEXT,

settings MAP<TEXT,TEXT>,

owner TEXT,

PRIMARY KEY(id)

);

SELECT * FROM sensors

WHERE id = ?;

Page 30: Big Data-Driven Applications  with Cassandra and Spark

Table with Multi-Row Partitions

30

username id type settings age address

Alice 1 phone {gps ⇒ on, …} 28 Santa Clara, CA

Alice 2 wristband {heart rate ⇒ on, …} 28 Santa Clara, CA

Alice 3 thermostat {temp ⇒ 75, …} 28 Santa Clara, CA

Alex 4 security … 37 Austin, TX

Alex 5 phone … 37 Austin, TX

sensors_by_user

AS

CA

SC

Page 31: Big Data-Driven Applications  with Cassandra and Spark

Table with Multi-Row Partitions

CREATE TABLE sensors_by_user (

username TEXT, age INT STATIC, address TEXT STATIC,

id INT, type TEXT, settings MAP<TEXT,TEXT>,

PRIMARY KEY(username, id)

) WITH CLUSTERING ORDER BY (id ASC);

SELECT * FROM sensors_by_user WHERE username = ?;

SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;

SELECT * FROM sensors_by_user WHERE username = ? AND id > ?

ORDER BY id DESC;

31

Page 32: Big Data-Driven Applications  with Cassandra and Spark

Retrieving Data from C*

• SparkContext, RDD, Connector

32

val rdd = sc.cassandraTable("iot","sensors_by_user")

.select("username","id","type")

Page 33: Big Data-Driven Applications  with Cassandra and Spark

Predicate Pushdown

33

sc.cassandraTable("iot","sensors_by_user")

.select("username","id","type")

.filter(row => row.getString("username") == "Alice")

• Suboptimal code

Page 34: Big Data-Driven Applications  with Cassandra and Spark

Predicate Pushdown

34

sc.cassandraTable("iot","sensors_by_user")

.select("username","id","type")

.filter(row => row.getString("username") == "Alice")

.where("username = 'Alice'")

• Predicate pushed down to C*

Page 35: Big Data-Driven Applications  with Cassandra and Spark

Data Locality

input.split.size_in_mb input.consistency.level input.fetch.size_in_rows

35

Cassandra Spark

Node

(64) (LOCAL_ONE) (1000)

Page 36: Big Data-Driven Applications  with Cassandra and Spark

• Standard Spark join = shuffle + shuffle

Cassandra-Optimized Joins

36

val s = sc.cassandraTable("iot","sensors")

.keyBy(row => row.getString("owner"))

val u = sc.cassandraTable("iot","users")

.keyBy(row => row.getString("username"))

s.join(u)

Page 37: Big Data-Driven Applications  with Cassandra and Spark

Cassandra-Optimized Joins

37

• Shuffle

B

Partition 1

A C D

Map Task

Partition A

Reduce Task

B

Partition 2

A C D

Map Task

Partition B

Reduce Task

B

Partition 3

A C D

Map Task

Partition D

Reduce Task

Partition C

Reduce Task

Buckets:

memory

memory

Shuffle write

Shuffle read

disk

disk

Aggregation Aggregation Aggregation

Aggregation AggregationAggregationAggregation

Page 38: Big Data-Driven Applications  with Cassandra and Spark

• Connector join = no shuffle + no data locality

Cassandra-Optimized Joins

38

sc.cassandraTable("iot","sensors")

.select("id","type","owner".as("username"))

.joinWithCassandraTable("iot","users")

.on(SomeColumns("username"))

Page 39: Big Data-Driven Applications  with Cassandra and Spark

Cassandra-Optimized Joins

39

id type owner

username

1 … Alice

4 … Alex

3 ... Alice

2 … Alice

5 … Alexusername age address

Alex 37 …

username age address

Alice 28 …

Page 40: Big Data-Driven Applications  with Cassandra and Spark

• Connector join + CAP = shuffle + data locality

Cassandra-Aware Partitioning

40

sc.cassandraTable("iot","sensors")

.select("id","type","owner".as("username"))

.repartitionByCassandraReplica("iot","users")

.joinWithCassandraTable("iot","users")

.on(SomeColumns("username"))

Page 41: Big Data-Driven Applications  with Cassandra and Spark

Cassandra-Aware Partitioning

41

id type owner

username

1 … Alice

2 … Alice

3 ... Alice

username age address

Alex 37 …

username age address

Alice 28 …

id type owner

username

4 … Alex

5 … Alex

Page 42: Big Data-Driven Applications  with Cassandra and Spark

• Suboptimal code

Shuffle-Free Grouping

42

sc.cassandraTable("iot","sensors_by_user")

.select("username","id","type")

.as((u:String,i:Int,t:String)=>(u,(i,t)))

.groupByKey

Page 43: Big Data-Driven Applications  with Cassandra and Spark

• Shuffling eliminated at no extra cost

Shuffle-Free Grouping

43

sc.cassandraTable("iot","sensors_by_user")

.select("username","id","type")

.as((u:String,i:Int,t:String)=>(u,(i,t)))

.groupByKeyspanByKey

Page 44: Big Data-Driven Applications  with Cassandra and Spark

Saving Data to C*

44

rdd.saveToCassandra("iot","users",

SomeColumns("username", "age"))

output.consistency.level (LOCAL_QUORUM)

output.batch.grouping.key (Partition)

output.batch.size.bytes (1024)

output.batch.grouping.buffer.size (1000)

output.concurrent.writes (5)

Page 45: Big Data-Driven Applications  with Cassandra and Spark

1 Modern Big Data and Cloud Applications

2 Cassandra and Spark Highlights

3 Architecture Overview

4 Languages and APIs

5 Live Demo

45

Page 46: Big Data-Driven Applications  with Cassandra and Spark

Artem Chebotko

[email protected]

www.linkedin.com/in/artemchebotko

46