Spark + Cassandra = Real-Time Analytics on Operational Data


Real-Time Analytics on Transactional Data = Cassandra + Spark

20/02/15

Victor Coustenoble, Solutions Engineer
[email protected]
@vizanalytics

How do you use Cassandra?

• By monitoring your energy consumption
• By streaming movies
• By browsing websites
• By shopping online
• By paying via smartphone
• By playing well-known video games

• Collections / playlists
• Recommendation / personalization
• Fraud detection
• Messaging
• Connected devices

Overview

• Founded in April 2010
• Offices in Santa Clara, Austin, New York, London, Paris, Sydney
• 400+ employees, ~35 percent, 500+ customers

Straightening the road: Cassandra/DataStax vs. relational databases

• CQL ↔ SQL
• OpsCenter / DevCenter ↔ management tools
• DSE for search & analytics ↔ integration
• security ↔ security
• support, consulting & training ↔ 30 years of ecosystem

Apache Cassandra™

• Apache Cassandra™ is an open-source, distributed NoSQL database built for modern, online, mission-critical applications that must scale massively.
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless (peer-to-peer), no single point of failure (no SPOF)
• Distributed, with multi-data-center support
• 100% available
• Massively scalable
• Linear scalability
• High performance (reads AND writes)
• Multi data center
• Simple to operate
• CQL language (SQL-like)
• OpsCenter / DevCenter tools


• Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
• BigTable: http://research.google.com/archive/bigtable-osdi06.pdf

[Diagram: a five-node Cassandra ring (Node 1 … Node 5)]

High Availability and Consistency

• The failure of a single node must not bring the system down
• Consistency is chosen on the client side
• Replication Factor (RF) + Consistency Level (CL) = success
• Example:
  • RF = 3
  • CL = QUORUM (= 51% of the replicas)


[Diagram: with RF = 3, a write at CL=QUORUM is sent in parallel to the three replicas (1st copy on Node 1, 2nd copy on Node 2, 3rd copy on Node 3); acks come back in microseconds (5 μs, 12 μs); once more than 51% of the replicas have answered, the request succeeds]

CL(Read) + CL(Write) > RF => immediate/strong consistency
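A minimal Scala sketch of the arithmetic behind this rule (the names are illustrative, not a connector API):

def quorum(rf: Int): Int = rf / 2 + 1   // QUORUM = a majority of the replicas
val rf      = 3
val readCL  = quorum(rf)                // 2 replicas answer each read
val writeCL = quorum(rf)                // 2 replicas acknowledge each write
// 2 + 2 > 3: every read set overlaps every write set on at least one
// replica, so a QUORUM read always sees the latest QUORUM write
assert(readCL + writeCL > rf)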

DataStax Enterprise

Certified Cassandra, ready for the enterprise

• Enterprise features: security, analytics, search, visual monitoring, management services, in-memory option
• Confidence in use: development IDE & drivers, professional services, support & training

DataStax Enterprise - Analytics

• Designed for analytics on Cassandra data
• There are four ways to run analytics on Cassandra data:

1. Search (Solr)

2. Batch analytics (Hadoop)

3. Batch analytics with external tools (Cloudera, Hortonworks)

4. Real-time analytics (Spark)


Partnership


Why Spark on Cassandra?

• Analytics on transactional data and operational applications

• Data model independent queries

• Cross-table operations (JOIN, UNION, etc.; see the sketch after this list)

• Complex analytics (e.g. machine learning)

• Data transformation, aggregation, etc.

• Stream processing

• Better performance than Hadoop MapReduce
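Cassandra itself has no JOIN; the connector makes cross-table joins possible at the Spark level. A minimal sketch, assuming a hypothetical shop keyspace with users and orders tables (all names illustrative):

import org.apache.spark.SparkContext._   // pair-RDD operations such as join
import com.datastax.spark.connector._    // cassandraTable on the SparkContext

// Key both tables by user id, then join them in Spark
val users  = sc.cassandraTable("shop", "users").keyBy(row => row.getString("id"))
val orders = sc.cassandraTable("shop", "orders").keyBy(row => row.getString("user_id"))
val joined = users.join(orders)          // RDD of (userId, (userRow, orderRow))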

Real-time Big Data


[Diagram: data lands in Cassandra and Spark with NO ETL, feeding data enrichment, batch processing, machine learning, and pre-computed aggregates]

Real-Time Big Data Use Cases

• Recommendation Engine

• Internet of Things

• Fraud Detection

• Risk Analysis

• Buyer Behaviour Analytics

• Telematics, Logistics

• Business Intelligence

• Infrastructure Monitoring

• …


Spark Components

• Shark or Spark SQL: structured data
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph processing
• Spark: the general execution engine

All Cassandra-compatible.

Resource Isolation

[Diagram: on each node, Cassandra runs next to a Spark Worker (JVM) that hosts the Spark Executors, keeping Spark and Cassandra resources isolated]

DSE Spark Integration Architecture

[Diagram: a four-node cluster; every node runs Cassandra plus a Spark Worker (JVM) with its Executors; one node also hosts the Spark Master (JVM), to which the application Driver connects]

Spark Cassandra Connector

[Diagram: the user application runs inside a Spark Executor and reaches the Cassandra ring through the Spark-Cassandra Connector, which is built on the Cassandra Java driver]

Cassandra Spark Driver

• Cassandra tables exposed as Spark RDDs
• Load data from Cassandra into Spark
• Write data from Spark to Cassandra
• Object mapper: maps C* tables and rows to Scala objects
• Type conversions: all Cassandra types supported and converted to Scala types
• Server-side data selection
• Virtual nodes support
• Scala and Java APIs

DSE Spark Interactive Shell

$ dse spark
...
Spark context available as sc.
HiveSQLContext available as hc.
CassandraSQLContext available as csc.

scala> sc.cassandraTable("test", "kv")
res5: com.datastax.spark.connector.rdd.CassandraRDD[com.datastax.spark.connector.CassandraRow] = CassandraRDD[2] at RDD at CassandraRDD.scala:48

scala> sc.cassandraTable("test", "kv").collect
res6: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{k: 1, v: foo})

cqlsh> select * from test.kv;

 k | v
---+-----
 1 | foo

(1 rows)

Connecting to Cassandra

// Spark classes used below
import org.apache.spark.{SparkConf, SparkContext}

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

Reading Data

val table = sc
  .cassandraTable[CassandraRow]("db", "tweets") // row representation; keyspace and table
  .select("user_name", "message")               // server-side column selection
  .where("user_name = ?", "ewa")                // server-side row selection

Writing Data

CREATE TABLE test.words (word TEXT PRIMARY KEY, count INT);

val collection = sc.parallelize(Seq(("foo", 2), ("bar", 5)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))

cqlsh:test> select * from words;

 word | count
------+-------
  bar |     5
  foo |     2

(2 rows)

Mapping Rows to Objects

CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);

case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala case classes
* CQL underscore_case columns map to Scala camelCase properties
* Custom mapping functions (see the sketch below and the docs)
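As one such custom mapping, a minimal sketch using the connector's as method on the test.cars table above (the output format is illustrative):

// Map the selected columns directly into a formatted label per row
val labels = sc.cassandraTable("test", "cars")
  .select("model", "year")
  .as((model: String, year: Int) => s"$model ($year)")
labels.collect().foreach(println)   // e.g. Ford Mondeo (2009)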

Type Mapping

CQL type       | Scala type
ascii          | String
bigint         | Long
boolean        | Boolean
counter        | Long
decimal        | BigDecimal, java.math.BigDecimal
double         | Double
float          | Float
inet           | java.net.InetAddress
int            | Int
list           | Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map            | Map, TreeMap, java.util.HashMap
set            | Set, TreeSet, java.util.HashSet
text, varchar  | String
timestamp      | Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid       | java.util.UUID
uuid           | java.util.UUID
varint         | BigInt, java.math.BigInteger

* Nullable values map to Option
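For instance, a minimal sketch of the nullable-value rule, assuming a hypothetical test.users table with a nullable age column:

// A NULL age in Cassandra becomes None; a set value becomes Some(n)
case class User(id: String, age: Option[Int])
val users = sc.cassandraTable[User]("test", "users")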

Shark

• SQL query engine on top of Spark

• Not part of Apache Spark

• Hive compatible (JDBC, UDFs, types, metadata, etc.)

• Supports in-memory tables

• Available as part of DataStax Enterprise

Spark SQL

• Spark SQL supports a subset of the SQL-92 language
• Spark SQL is optimized for Spark internals (e.g. RDDs), giving better performance than Shark
• Support for in-memory computation

• From the Spark command line
• Mapping of Cassandra keyspaces and tables
• Read and write on Cassandra tables

Usage of Spark SQL & HiveQL queries

import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)

// Create a Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute an SQL query
val rdd = cc.sql("INSERT INTO ks.t1 SELECT c1, c2 FROM ks.t2")

// Execute a HiveQL query
val rdd2 = cc.hql("SELECT * FROM keyspace.table JOIN ... WHERE ...")

Spark Streaming

• For real time analytics

• Push or pull model

• Stream TO and FROM Cassandra

• Micro batching (each batch represented as RDD)

• Fault tolerant

• Data processed in small batches

• Exactly-once processing

• Unified stream and batch processing framework

• Supports Kafka, Flume, ZeroMQ, Kinesis, MQTT producers (see the Kafka sketch below)
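A minimal sketch of a Kafka input stream, assuming the spark-streaming-kafka artifact is on the classpath and a StreamingContext ssc as in the next slide (the ZooKeeper address, group id, and topic name are illustrative):

import org.apache.spark.streaming.kafka.KafkaUtils

// Subscribe to the "events" topic; messages arrive as (key, value) pairs
val kafkaLines = KafkaUtils
  .createStream(ssc, "zk1:2181", "demo-group", Map("events" -> 1))
  .map(_._2)   // keep only the message value

The rest of the pipeline (flatMap, reduceByKey, saveToCassandra) is the same as in the socket example that follows.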

Usage of Spark Streaming

• Due to the unifying Spark architecture, portions of batch and streaming development can be reused

• Given that Spark Streaming is backed by Cassandra, there is no need to depend upon solutions like Apache ZooKeeper™ in production

import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)...

// Streaming with a 1-second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// Stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// Count words
val wordCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Stream output
wordCounts.saveToCassandra("test", "words")

// Start processing
ssc.start()
ssc.awaitTermination()

Python API

$ dse pyspark
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Python version 2.7.8 (default, Oct 20 2014 15:05:19)
SparkContext available as sc.
>>> sc.cassandraTable("test", "kv").collect()
[Row(k=1, v=u'foo')]

DataStax Enterprise + Spark Special Features

• Easy setup and config
  • no need to set up a separate Spark cluster
  • no need to tweak classpaths or config files
• High availability of the Spark Master
• Enterprise security
  • password / Kerberos / LDAP authentication
  • SSL for all Spark-to-Cassandra connections
• CFS integration (a distributed file system with no SPOF)
• Cassandra access through the Spark Python API
• Certified and supported on Cassandra
• Shark availability

DataStax Enterprise - High Availability

• All nodes are Spark Workers
• Resilient to Worker failures by default
• The first Spark node is promoted to Spark Master (state saved in CFS, no SPOF)
• A standby Master is promoted on failure (the new Spark Master reconnects to the Workers and to the driver app, and the job continues)

Without DataStax Enterprise

[Diagram: five C* nodes, each running a Spark Worker; one node also runs the Spark Master]

With DataStax Enterprise

[Diagram: five C* nodes, each running a Spark Worker; the Master state is stored in C*, and a spare Master stands by for high availability]

Spark Use Cases

• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion (see the sketch below)
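A minimal sketch of the schema-migration / data-conversion case (the shop keyspace, orders_v1/orders_v2 tables, and their columns are hypothetical):

import com.datastax.spark.connector._

// Read the old table, convert cents to a decimal amount, write the new table
sc.cassandraTable("shop", "orders_v1")
  .map(row => (row.getString("id"), row.getInt("amount_cents") / 100.0))
  .saveToCassandra("shop", "orders_v2", SomeColumns("id", "amount"))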


[Diagram: DataStax Enterprise architecture. A Cassandra cluster (node ring, column-family storage: high performance, always available, massively scalable, with advanced security and in-memory options) backs the operational application and serves real-time search, real-time analytics, and batch analytics; OpsCenter provides monitoring and operations services; data is exchanged with an RDBMS and with an external Hadoop distribution (Cloudera, Hortonworks) for analytics and transformations]

How to Spark on Cassandra?

DataStax Cassandra Spark driver

https://github.com/datastax/cassandra-driver-spark

Compatible with

• Spark 1.2
• Cassandra 2.0.x and 2.1.x
• DataStax Enterprise 4.5 and 4.6

DataStax Enterprise 4.6 = Cassandra 2.0 + driver + Spark 1.1

Spark 1.2 in the next DSE 4.7 release (March)

Thank you! Questions?

We power the big data apps that transform business.


[email protected]

@vizanalytics