codecentric ag: using cassandra and clojure for data crunching backends

@ifesdjeen

Cassandra Monitoring

Precision

is not the same as

Semantics

is not the same as

Anomaly detection

Do you see the elephant being swallowed by the snake?

Agenda

Ad-hoc queries

Fast Aggregations

Machine Learning

Step 1 parallel queries

Sequence ID

+---------------+---------------+
| timestamp     | sequenceId    |
+---------------+---------------+

Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
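As a minimal sketch of the idea, assuming millisecond timestamps and a single writer, a sequence id can be bumped whenever two events share a timestamp. The class name `SequencedClock` and its method are hypothetical and not taken from the talk; snapshotting on overflow or timeout is left out.

```python
class SequencedClock:
    """Hypothetical helper: pairs each timestamp with a per-timestamp sequence id."""

    def __init__(self):
        self._last_ts = None
        self._seq = 0

    def next_key(self, ts_millis):
        # Same timestamp as the previous event: bump the sequence id so the
        # composite key stays unique and keeps arrival order.
        if ts_millis == self._last_ts:
            self._seq += 1
        else:
            self._last_ts = ts_millis
            self._seq = 0
        return (ts_millis, self._seq)


clock = SequencedClock()
# Three events collide on timestamp 1000, one arrives at 1001.
keys = [clock.next_key(ts) for ts in (1000, 1000, 1000, 1001)]
```

Because the composite keys are unique and sort in arrival order, replaying a write with the same key would overwrite the same row instead of adding a new one, which is one way to read the "idempotence" point.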

Fighting Dispersion

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Range Tables

Full Table Scan


Start End

Open Range

Start End


“Between” Range


Start End

(rich query API)
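A "Between" range can be sketched as a list of sub-ranges that are queried in parallel and concatenated. This is only an illustration under stated assumptions: `ROWS` is an in-memory stand-in for a table keyed by timestamp, and `split_range`, `query_range` and `parallel_query` are hypothetical names, not part of any Cassandra driver or of the API shown in the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a table keyed by timestamp: (timestamp, value) rows.
ROWS = [(ts, ts * 2) for ts in range(0, 100)]


def split_range(start, end, step):
    """Split [start, end) into consecutive sub-ranges at most `step` wide."""
    if step <= 0:
        raise ValueError("step must be positive")
    ranges = []
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def query_range(bounds):
    """Hypothetical per-range query: here a filter over the in-memory rows."""
    lo, hi = bounds
    return [row for row in ROWS if lo <= row[0] < hi]


def parallel_query(start, end, step):
    """Run one query per sub-range in parallel and concatenate in range order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(query_range, split_range(start, end, step)))
    return [row for part in parts for row in part]


result = parallel_query(10, 35, 10)
```

`pool.map` returns results in submission order, so the concatenated rows stay sorted by timestamp even though the sub-ranges run concurrently.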

Step 2 add some algebra

Stream Fusion for rich ad-hoc queries

What even is Stream Fusion?

map

filter

reduce

single step mapFilterReduce

data Step a cursor = Yield a !cursor
                   | Skip !cursor
                   | Done

data Stream a = ∃cursor. Stream (cursor → Step a cursor) cursor

Stream Beginning: reading from the DB

map

Yield x cursor → Yield (f x) cursor
Skip cursor    → Skip cursor
Done           → Done

maps :: (a → b) → Stream a → Stream b

filter

Yield x cursor | p x       → Yield x cursor
               | otherwise → Skip cursor
Skip cursor                → Skip cursor
Done                       → Done

filters :: (a → Bool) → Stream a → Stream a

reduce/fold

Yield x cursor → loop (f acc x) cursor
Skip cursor    → loop acc cursor
Done           → acc

foldls :: (Monoid acc) => (acc → a → acc) → acc → Stream a → acc
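The three step functions above can be sketched in Python to show how they compose. This is an illustration only: Python has no compiler-level fusion, so the point is just that a single loop in `foldls` drives the composed step functions without building intermediate lists. `from_list` is a hypothetical stand-in for the "reading from the DB" stream beginning.

```python
from collections import namedtuple

# The three Step constructors and the Stream (step function + cursor).
Yield = namedtuple("Yield", ["value", "cursor"])
Skip = namedtuple("Skip", ["cursor"])
Done = namedtuple("Done", [])
Stream = namedtuple("Stream", ["step", "cursor"])


def from_list(rows):
    """Stream beginning: a list stands in for rows read from the DB."""
    def step(i):
        return Yield(rows[i], i + 1) if i < len(rows) else Done()
    return Stream(step, 0)


def maps(f, stream):
    def step(cursor):
        s = stream.step(cursor)
        if isinstance(s, Yield):
            return Yield(f(s.value), s.cursor)
        return s  # Skip and Done pass through unchanged
    return Stream(step, stream.cursor)


def filters(p, stream):
    def step(cursor):
        s = stream.step(cursor)
        if isinstance(s, Yield) and not p(s.value):
            return Skip(s.cursor)  # rejected element: advance without yielding
        return s
    return Stream(step, stream.cursor)


def foldls(f, acc, stream):
    """The only loop: pulls one step at a time through the composed pipeline."""
    cursor = stream.cursor
    while True:
        s = stream.step(cursor)
        if isinstance(s, Done):
            return acc
        if isinstance(s, Yield):
            acc = f(acc, s.value)
        cursor = s.cursor


# Sum of squares of the even numbers, computed in a single pass.
evens = filters(lambda x: x % 2 == 0, from_list([1, 2, 3, 4, 5, 6]))
total = foldls(lambda acc, x: acc + x, 0, maps(lambda x: x * x, evens))
```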

Append

class Monoid a where
  mempty  :: a            -- ^ Identity of 'mappend'
  mappend :: a -> a -> a  -- ^ An associative operation

Combine

class (Monoid intermediate) => Aggregate intermediate end where
  combine :: intermediate -> end

Count Example

data Count = Count Int

instance Monoid Count where
  mempty = Count 0
  mappend (Count a) (Count b) = Count $ a + b

instance Aggregate Count Int where
  combine (Count a) = a
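To show why the intermediate value and the end result are kept separate, here is a rough Python sketch of the same pattern. `Count` mirrors the example above; `Mean` and the `aggregate` helper are hypothetical additions for illustration, assuming partial results arrive from parallel queries.

```python
from functools import reduce


class Count:
    """Intermediate value for counting, mirroring the Count monoid."""

    def __init__(self, n=0):
        self.n = n

    @staticmethod
    def mempty():
        return Count(0)

    def mappend(self, other):
        return Count(self.n + other.n)

    def combine(self):
        return self.n


class Mean:
    """Hypothetical second aggregate: the intermediate is (sum, count)."""

    def __init__(self, total=0.0, n=0):
        self.total = total
        self.n = n

    @staticmethod
    def mempty():
        return Mean(0.0, 0)

    def mappend(self, other):
        return Mean(self.total + other.total, self.n + other.n)

    def combine(self):
        # The end value is only computed once all partials are merged.
        return self.total / self.n if self.n else None


def aggregate(monoid_cls, partials):
    """Merge partial results with mappend, then convert with combine."""
    merged = reduce(lambda a, b: a.mappend(b), partials, monoid_cls.mempty())
    return merged.combine()


# Partial results, e.g. one per parallel range query.
total_count = aggregate(Count, [Count(3), Count(4), Count(0)])
overall_mean = aggregate(Mean, [Mean(10.0, 2), Mean(20.0, 3)])
```

A mean cannot be merged from per-partition means alone, which is why the intermediate carries the sum and the count and `combine` runs only at the end.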

Step 3 add some ML

Storing Models

Support Vector Machines

Hyperplane: α·x - φ = 1

[ α1 α2 α3 ... αn ] ρ

Option 1: list<double>

CREATE TABLE support_vectors(
  path varchar,
  alpha list<double>,
  phi int,
  PRIMARY KEY(path))

Problems

High deserialisation overhead
Need to add PK specifiers for multiple SVs

Alternative: blob & byte buffers

Vector Representation

0    8    16   24   32   40    n*8      byte address
+----+----+----+----+----+-----+----+
| α  | α  | α  | α  | α  | ... | α  |
+----+----+----+----+----+-----+----+
  0    1    2    3    4          n      points

Matrix Representation

        0    8    16   24   32   40    n*8
        +----+----+----+----+----+-----+----+
        | α  | α  | α  | α  | α  | ... | α  |
        +----+----+----+----+----+-----+----+
          00   01   02   03   04         0n

n*8+    0    8    16   24   32   40    n*8
        +----+----+----+----+----+-----+----+
        | α  | α  | α  | α  | α  | ... | α  |
        +----+----+----+----+----+-----+----+
          10   11   12   13   14         1n

m*n*8+  0    8    16   24   32   40    n*8
        +----+----+----+----+----+-----+----+
        | α  | α  | α  | α  | α  | ... | α  |
        +----+----+----+----+----+-----+----+
          m0   m1   m2   m3   m4         mn

Advantages

“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations
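The blob layout can be sketched with Python's `struct` module standing in for a byte buffer. This is a sketch under assumptions: doubles are stored big-endian, rows are laid out one after another, and the function names `pack_matrix` and `read_cell` are hypothetical.

```python
import struct


def pack_matrix(rows):
    """Pack equal-length rows of floats into one blob of 8-byte doubles."""
    if not rows:
        return b""
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    flat = [value for row in rows for value in row]
    # ">" = big-endian, "d" = 8-byte double, repeated len(flat) times.
    return struct.pack(">%dd" % len(flat), *flat)


def read_cell(blob, n_cols, row, col):
    """Relative access: jump straight to the cell without decoding the rest."""
    offset = (row * n_cols + col) * 8
    return struct.unpack_from(">d", blob, offset)[0]


alphas = [[0.5, -1.25, 2.0],
          [3.0, 4.5, -6.0]]
blob = pack_matrix(alphas)
cell = read_cell(blob, n_cols=3, row=1, col=2)
```

Each row starts at `row * n_cols * 8` bytes, so a single vector is just the one-row case of the same layout.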

Bayesian Classifiers

P(X | blue) = (number of blue near X) / (total number of blue)

P(X | red) = (number of red near X) / (total number of red)

[[Mean(x1), Var(x1)]
 [Mean(x2), Var(x2)]
 ...
 [Mean(xn), Var(xn)]]

Advantages

“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to implement atomic in-memory operations
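One possible use of the stored [mean, variance] pairs is a Gaussian naive Bayes likelihood. The Gaussian assumption, the equal class priors, and the names `pack_stats`, `log_likelihood` and `classify` are all assumptions made for this sketch, not details confirmed by the talk.

```python
import math
import struct


def pack_stats(stats):
    """Pack [(mean, var), ...] into a blob of consecutive double pairs."""
    flat = [value for pair in stats for value in pair]
    return struct.pack(">%dd" % len(flat), *flat)


def log_likelihood(blob, x):
    """Sum of per-feature Gaussian log densities, read straight from the blob."""
    total = 0.0
    for i, xi in enumerate(x):
        # Each feature occupies 16 bytes: mean followed by variance.
        mean, var = struct.unpack_from(">dd", blob, i * 16)
        total += -0.5 * math.log(2 * math.pi * var) - (xi - mean) ** 2 / (2 * var)
    return total


def classify(models, x):
    """Pick the label with the highest likelihood (equal priors assumed)."""
    return max(models, key=lambda label: log_likelihood(models[label], x))


models = {
    "blue": pack_stats([(0.0, 1.0), (0.0, 1.0)]),
    "red": pack_stats([(5.0, 1.0), (5.0, 1.0)]),
}
label = classify(models, [0.5, -0.2])
```

Variances must be strictly positive for the log density to be defined; a real model would also need to handle class priors.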

Step 4 make it rocket-fast

Approximate Data Structures

Bloom Filters are basically long arrays / vectors

BitSet

0                               8
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
8                               16
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
16                              24
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
24                              32
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+

...

bit address

Advantages

64 bits per 8-byte Long
Easy to represent by the long-array using offsets, bit shifts and masks
Easy to implement atomic in-memory operations
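The offsets, shifts and masks can be sketched as follows. Python integers stand in for 64-bit longs, and the hashing scheme (two values derived from one SHA-256 digest) is an assumption chosen for the example. `LongBitSet` and `BloomFilter` are hypothetical names.

```python
import hashlib


class LongBitSet:
    """Bit set stored as a list of 64-bit words."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.words = [0] * ((n_bits + 63) // 64)

    def set(self, i):
        # i >> 6 selects the word, i & 63 the bit inside that word.
        self.words[i >> 6] |= 1 << (i & 63)

    def get(self, i):
        return ((self.words[i >> 6] >> (i & 63)) & 1) == 1


class BloomFilter:
    """Bloom filter on top of the bit set; items are strings."""

    def __init__(self, n_bits, n_hashes):
        self.bits = LongBitSet(n_bits)
        self.n_hashes = n_hashes

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step size
        return [(h1 + k * h2) % self.bits.n_bits for k in range(self.n_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits.set(pos)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits.get(pos) for pos in self._positions(item))


bitset = LongBitSet(128)
bitset.set(70)  # word 1, bit 6

bloom = BloomFilter(n_bits=1024, n_hashes=3)
bloom.add("cassandra")
```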

Count-min sketches are basically int matrices

        0    8    16   24   32   40    n*8
        +----+----+----+----+----+-----+----+
        | α  | α  | α  | α  | α  | ... | α  |
        +----+----+----+----+----+-----+----+
          00   01   02   03   04         0n

n*8+    0    8    16   24   32   40    n*8
        +----+----+----+----+----+-----+----+
        | α  | α  | α  | α  | α  | ... | α  |
        +----+----+----+----+----+-----+----+
          10   11   12   13   14         1n

m*n*8+  0    8    16   24   32   40    n*8
        +----+----+----+----+----+-----+----+
        | α  | α  | α  | α  | α  | ... | α  |
        +----+----+----+----+----+-----+----+
          m0   m1   m2   m3   m4         mn
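A count-min sketch over a flat, row-major counter array might look like the sketch below. The per-row hash (SHA-256 salted with the row number) and the class name `CountMinSketch` are assumptions for illustration; a production implementation would more likely use cheaper hash functions.

```python
import hashlib


class CountMinSketch:
    """depth x width matrix of counters stored as one flat, row-major list."""

    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.cells = [0] * (depth * width)

    def _index(self, row, item):
        # One hash per row, obtained by salting the item with the row number.
        digest = hashlib.sha256(("%d:%s" % (row, item)).encode("utf-8")).digest()
        col = int.from_bytes(digest[:8], "big") % self.width
        return row * self.width + col

    def add(self, item, count=1):
        for row in range(self.depth):
            self.cells[self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate counters, so the minimum is the best guess.
        return min(self.cells[self._index(row, item)] for row in range(self.depth))


sketch = CountMinSketch(depth=4, width=64)
sketch.add("GET /index", 3)
sketch.add("GET /login")
estimate = sketch.estimate("GET /index")  # never below the true count of 3
```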

Histograms are basically long vectors

Longs (counts)

0    8    16   24   32   40    n*8      byte address
+----+----+----+----+----+-----+----+
| α  | α  | α  | α  | α  | ... | α  |
+----+----+----+----+----+-----+----+
  00   01   02   03   04         0n

Doubles (bin start number)

0    8    16   24   32   40    n*8      byte address
+----+----+----+----+----+-----+----+
| α  | α  | α  | α  | α  | ... | α  |
+----+----+----+----+----+-----+----+
  00   01   02   03   04         0n
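The two parallel vectors can be sketched as a list of bin start values and a list of counts, with a binary search to locate the bin. The class name `Histogram` and the choice to leave the last bin open-ended are assumptions for this example.

```python
import bisect


class Histogram:
    """Two parallel vectors: bin start values (doubles) and counts (longs)."""

    def __init__(self, bin_starts):
        self.bin_starts = sorted(float(b) for b in bin_starts)
        self.counts = [0] * len(self.bin_starts)

    def add(self, value):
        # Index of the last bin whose start is <= value.
        i = bisect.bisect_right(self.bin_starts, value) - 1
        if i < 0:
            raise ValueError("value is below the first bin start")
        self.counts[i] += 1


hist = Histogram([0, 10, 20])
for value in (3, 10, 15, 25):
    hist.add(value)
```

Here 3 falls into the first bin, 10 and 15 into the second, and 25 into the open-ended last bin.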

Conclusions

Ad-hoc queries
Parallelism
Lightweight DSs representation
Optimisations and good API fits

@ifesdjeen

http://bit.ly/cassandrasummit2015
