Real-Time Analytics with Apache Cassandra
Cassandra Day Berlin, 11.2.2016
Guido Schmutz

Upload: guido-schmutz

Post on 21-Jan-2017


Page 1: Real Time Analytics with Apache Cassandra - Cassandra Day Berlin

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Real-Time Analytics with Apache Cassandra
Cassandra Day Berlin, 11.2.2016

Guido Schmutz

Page 2

Guido Schmutz

Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-author of different books
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://de.slideshare.net/gschmutz
Twitter: gschmutz

Page 3

Our company.

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on [...] and [...] and Open Source technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the interacting operation of your IT systems.

OPERATION

Page 4

[Map of Trivadis locations: Copenhagen, Munich, Lausanne, Bern, Zurich, Brugg, Geneva, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Vienna]

With over 600 specialists and IT experts in your region.

14 Trivadis branches and more than 600 employees

200 Service Level Agreements

Over 4,000 training participants

Research and development budget: CHF 5.0 million

Financially self-supporting and sustainably profitable

Experience from more than 1,900 projects per year at over 800 customers

Page 5

Agenda

1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Titan:db for Graph Data

Page 6

Customer Use Case and Architecture

Page 7

Data Science Lab @ Armasuisse W+T

W+T flagship project, standing for innovation & tech transfer

Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data Technologies & Architectures

Investing in new, innovative and not yet widely proven technology:
• Batch / Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• Graph Data
• …

3 Phases: June 2013 – June 2015

Page 8

SOCMINT System – Time Dimension

Major data model: Time series (TS)

TS reflect user behaviors over time

Activities correlate with events

Anomaly detection

Event detection & prediction

Page 9

SOCMINT System – Social Dimension

User-user networks (social graphs);

Twitter: follower, retweet and mention graphs

Who is central in a social network?

Who has retweeted a given tweet to whom?
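
The centrality question can be illustrated with a very simple measure. The following is a minimal Python sketch, assuming that "central" is approximated by in-degree in the retweet graph (how many distinct users retweeted a given user). The talk does not say which measure the project used; the function name, the edge layout and the sample users are hypothetical.

```python
from collections import Counter

def in_degree_centrality(edges):
    """Rank users by the number of distinct users who retweeted them.

    `edges` holds (retweeter, original_author) pairs. This layout is an
    assumption made for the sketch, not the project's data model.
    """
    distinct_edges = set(edges)  # a repeated retweet by the same user counts once
    counts = Counter(author for _, author in distinct_edges)
    return counts.most_common()  # highest in-degree first

# Hypothetical sample data
retweets = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("bob", "alice"),   # duplicate edge, ignored
    ("alice", "dave"),
]
ranking = in_degree_centrality(retweets)
```

In-degree only looks at direct neighbours; measures that take the whole graph into account need a traversal engine, which is where a graph database comes in.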


Page 10

SOCMINT System - “Lambda Architecture” for Big Data

[Figure: Lambda architecture diagram. Labels: Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine), Data Collection, Channel, Messaging, (Analytical) Batch Data Processing, Raw Data (Reservoir), Batch compute, Computed Information, Batch Result Store, (Analytical) Real-Time Data Processing, Stream/Event Processing, Real-Time Result Store, Data Access, Query Engine, Result Store, Service, Reports, Analytic Tools, Alerting Tools. Legend: Data in Motion, Data at Rest]

Page 11

SOCMINT System – Frameworks & Components in Use

[Figure: Lambda architecture diagram showing the frameworks and components in use. Labels: Data Sources (Social), Data Collection, Channel, Messaging, (Analytical) Batch Data Processing, Raw Data (Reservoir), Batch compute, Computed Information, Batch Result Store, (Analytical) Real-Time Data Processing, Stream/Event Processing, Real-Time Result Store, Data Access, Query Engine, Result Store, Service, Reports, Analytic Tools, Alerting Tools. Legend: Data in Motion, Data at Rest]

Page 12

Streaming Analytics Processing Pipeline

Kafka provides reliable and efficient queuing

Storm does the processing (rollups, counts)

Cassandra stores the results at the same speed

[Figure: pipeline stages Queuing, Processing, Storing; inputs Twitter Sensor 1, Twitter Sensor 2, Twitter Sensor 3; two Visualization Application boxes]
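
The rollup step of such a pipeline can be pictured as follows. This is a minimal Python sketch, not Storm code: the function name `rollup_counts`, the event layout and the sample events are assumptions made for illustration. It truncates each event timestamp to the hour and counts events per (sensor, hour), which is the kind of aggregate the processing stage would hand to the storing stage.

```python
from collections import Counter
from datetime import datetime

def rollup_counts(events):
    """Count events per (sensor_id, hour) bucket.

    `events` is an iterable of (sensor_id, timestamp) pairs; the layout
    and the function name are hypothetical, chosen for this sketch.
    """
    counts = Counter()
    for sensor_id, ts in events:
        # Truncate the timestamp to the start of its hour
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(sensor_id, hour)] += 1
    return counts

# Hypothetical sample events from one sensor
events = [
    ("ABC-001", datetime(2015, 10, 14, 10, 5)),
    ("ABC-001", datetime(2015, 10, 14, 10, 59)),
    ("ABC-001", datetime(2015, 10, 14, 11, 0)),
]
hourly = rollup_counts(events)
```

In a streaming setting the counts would be emitted incrementally rather than computed over a finished list; the batch form is used here only to keep the sketch short.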

Page 13

Cassandra Data Modeling

Page 14

Cassandra Data Modeling

• Don’t think relational!
• Denormalize, denormalize, denormalize …
• Rows are gigantic and sorted = one row is stored on one node
• Know your application/use cases => from query to model
• Index is not an afterthought anymore => “index” upfront
• Control physical storage structure

Page 15

“Static” Tables – “Skinny Row”

CREATE TABLE skinny (
    rowkey text,
    c1 text,
    c2 text,
    c3 text,
    PRIMARY KEY (rowkey)
);

Partition key: rowkey. Grows up to billions of rows.

rowkey   | c1       | c2       | c3
rowkey-1 | value-c1 | value-c2 | value-c3
rowkey-2 | value-c1 |          | value-c3
rowkey-3 | value-c1 | value-c2 | value-c3

Page 16

“Dynamic” Tables – “Wide Row”

CREATE TABLE wide (
    rowkey text,
    ckey text,
    c1 text,
    c2 text,
    PRIMARY KEY (rowkey, ckey)
) WITH CLUSTERING ORDER BY (ckey ASC);

Partition key: rowkey. Clustering key: ckey. Billions of rows, each with 1 to 2 billion columns.

rowkey-1
  ckey-1:c1 = value-c1, ckey-1:c2 = value-c2
  ckey-2:c1 = value-c1, ckey-2:c2 = value-c2
  ckey-3:c1 = value-c1, ckey-3:c2 = value-c2
rowkey-2
  ckey-1:c1 = value-c1, ckey-1:c2 = value-c2
  ckey-2:c1 = value-c1, ckey-2:c2 = value-c2
rowkey-3
  ckey-1:c1 = value-c1, ckey-1:c2 = value-c2
  ckey-2:c1 = value-c1, ckey-2:c2 = value-c2
  ckey-3:c1 = value-c1, ckey-3:c2 = value-c2
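
One way to picture the wide-row layout is a map from partition key to a list of rows kept sorted by clustering key. The following is a toy Python model under that assumption, not a description of Cassandra internals; the class and method names are hypothetical.

```python
import bisect

class WidePartitionStore:
    """Toy model: rowkey -> rows kept sorted by clustering key (ckey)."""

    def __init__(self):
        self._partitions = {}  # rowkey -> list of (ckey, columns) in ckey order

    def insert(self, rowkey, ckey, columns):
        rows = self._partitions.setdefault(rowkey, [])
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, ckey)
        if i < len(rows) and rows[i][0] == ckey:
            rows[i][1].update(columns)             # same key: overwrite columns (upsert)
        else:
            rows.insert(i, (ckey, dict(columns)))  # keep ascending ckey order

    def read_slice(self, rowkey, start=None, end=None):
        """Return rows of one partition with start <= ckey <= end, in ckey order."""
        rows = self._partitions.get(rowkey, [])
        return [(k, c) for k, c in rows
                if (start is None or k >= start) and (end is None or k <= end)]

store = WidePartitionStore()
store.insert("rowkey-1", "ckey-2", {"c1": "b"})
store.insert("rowkey-1", "ckey-1", {"c1": "a"})
store.insert("rowkey-1", "ckey-3", {"c1": "c"})
ordered = [k for k, _ in store.read_slice("rowkey-1")]
tail = [k for k, _ in store.read_slice("rowkey-1", start="ckey-2")]
```

The point of the model is that a range read inside one partition is a contiguous slice of already sorted data, which is why the clustering key is chosen to match the query.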

Page 17

Cassandra for Timeseries Data

Page 18

Show Timeseries: Provide list of metrics

CREATE TABLE tweet_count (
    sensor_id text,
    bucket_id text,
    key text,
    time_id timestamp,
    count counter,
    PRIMARY KEY ((sensor_id, bucket_id), key, time_id)
) WITH CLUSTERING ORDER BY (key ASC, time_id DESC);

Partition key: (sensor_id, bucket_id). Clustering key: key, time_id.

Use of “Static” Table

bucket_id defines buckets of values
• HOUR-2015-10 = values collected hourly in one partition for one month

Partition ABC-001:HOUR-2015-10
  dse:10:00:count = 1’550
  dse:09:00:count = 2’299
  nosql:10:00:count = 25

Partition ABC-001:DAY-2015-10
  dse:14-OCT:count = 105’999
  dse:13-OCT:count = 120’344
  nosql:14-OCT:count = 2’532

30 d * 24 h * n keys = n * 720 cols

Open source time series DBs over Cassandra:
KairosDB: https://kairosdb.github.io/
Heroic: http://spotify.github.io/heroic
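
The bucketing scheme and the sizing arithmetic can be written down in a few lines. This is a small Python sketch under stated assumptions: the helper names `bucket_id` and `cells_per_partition` are hypothetical, and the bucket format is inferred from the example value HOUR-2015-10.

```python
from datetime import datetime

def bucket_id(granularity, ts):
    """Build a bucket id such as 'HOUR-2015-10': one partition per month.

    The '<GRANULARITY>-<YYYY>-<MM>' format is inferred from the slide's
    example value, not from a documented convention.
    """
    return f"{granularity}-{ts:%Y-%m}"

def cells_per_partition(days, keys, values_per_day=24):
    """Sizing rule from the slide: 30 d * 24 h * n keys = n * 720 cols."""
    return days * values_per_day * keys

hour_bucket = bucket_id("HOUR", datetime(2015, 10, 14, 10, 0))
one_key = cells_per_partition(days=30, keys=1)
five_keys = cells_per_partition(days=30, keys=5)
```

Bucketing by month keeps a partition bounded: with hourly values the partition size depends only on the number of keys, not on how long the sensor has been running.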

Page 19

Show Timeseries: Provide list of metrics

UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
  AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
  AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

 sensor_id | bucket_id    | key | time_id                  | count
-----------+--------------+-----+--------------------------+--------
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230

Partition key: (sensor_id, bucket_id). Clustering key: key, time_id.
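
The behaviour of the counter update and the range read can be imitated in memory. The following Python sketch is a toy stand-in for the tweet_count table, not Cassandra client code; the class and method names are hypothetical. It increments a counter addressed by the full primary key and reads back one key's values from a lower time bound, newest first, matching the descending clustering order on time_id.

```python
from collections import defaultdict
from datetime import datetime

class TweetCountModel:
    """Toy in-memory stand-in for the tweet_count table."""

    def __init__(self):
        # (sensor_id, bucket_id) -> {(key, time_id): count}
        self._partitions = defaultdict(lambda: defaultdict(int))

    def increment(self, sensor_id, bucket_id, key, time_id, delta=1):
        """Mirror of UPDATE ... SET count = count + 1 for one primary key."""
        self._partitions[(sensor_id, bucket_id)][(key, time_id)] += delta

    def select_since(self, sensor_id, bucket_id, key, since):
        """Mirror of the SELECT: one partition, one key, time_id >= since."""
        cells = self._partitions.get((sensor_id, bucket_id), {})
        rows = [(t, n) for (k, t), n in cells.items() if k == key and t >= since]
        return sorted(rows, reverse=True)  # newest first, like time_id DESC

model = TweetCountModel()
for hour in (8, 9, 10, 10):  # two events fall into the 10:00 slot
    model.increment("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, hour))
rows = model.select_since("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, 9))
```

Both operations name the complete partition key, so each touches exactly one partition; only the clustering columns are used for the range condition.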

Page 20

Titan:db & Cassandra for Graph Data

Page 21

Supporting Graph Data with Titan:db and Cassandra

http://thinkaurelius.github.io/titan/

Page 22

Gremlin in Action – Creating the Graph

Page 23

Gremlin in Action – Graph Traversal

Page 24

Gremlin in Action – Graph Traversal (II)

Page 25

Summary - Know your domain

[Figure: data stores positioned by connectedness of data, from low to high. Labels: Document Data Store, Key-Value Stores, Wide-Column Store, Graph Databases, Relational Databases]

Page 26

Guido Schmutz
Email: [email protected]
+41 79 412 05 39