codecentric ag: using cassandra and clojure for data crunching backends
TRANSCRIPT
@ifesdjeen
Cassandra Monitoring
Precision
is not the same as
Semantics
is not the same as
Anomaly detection
Do you see the elephant being swallowed by the snake?
Agenda
Ad-hoc queries
Fast Aggregations
Machine Learning
Step 1: parallel queries
+---------------+---------------+
| timestamp     | sequenceId    |
+---------------+---------------+
Sequence ID
Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
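The composite key above can be sketched as follows. This is a minimal Java illustration, not the speaker's implementation: `EventKey` and `SequenceAllocator` are hypothetical names, and the sketch assumes the sequence restarts at 0 whenever the timestamp changes. Snapshotting on overflow or timeout is not covered.

```java
/** Hypothetical composite key: timestamp first, sequence ID as the tie-breaker. */
final class EventKey implements Comparable<EventKey> {
    final long timestamp;
    final int sequenceId;

    EventKey(long timestamp, int sequenceId) {
        this.timestamp = timestamp;
        this.sequenceId = sequenceId;
    }

    @Override
    public int compareTo(EventKey other) {
        int byTimestamp = Long.compare(timestamp, other.timestamp);
        // Events that collide on the timestamp are ordered by their sequence ID.
        return byTimestamp != 0 ? byTimestamp : Integer.compare(sequenceId, other.sequenceId);
    }
}

/** Hypothetical allocator: events sharing a timestamp get increasing sequence IDs. */
final class SequenceAllocator {
    private long lastTimestamp = Long.MIN_VALUE;
    private int nextSequence = 0;

    synchronized EventKey next(long timestamp) {
        if (timestamp != lastTimestamp) {
            // Assumption: the sequence restarts for every new timestamp value.
            lastTimestamp = timestamp;
            nextSequence = 0;
        }
        return new EventKey(timestamp, nextSequence++);
    }
}
```

Two calls to `next(100L)` yield sequence IDs 0 and 1, so both events keep distinct, ordered keys even though their timestamps collide.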
Fighting Dispersion
Range Tables
[Diagrams: timelines ts1 ... ts13 with Start and End markers, one per query type]
Full Table Scan
Open Range
“Between” Range
(rich query API)
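The three query shapes can be illustrated on an in-memory sorted map. This is only a sketch: a `TreeMap` keyed by timestamp stands in for a table clustered by timestamp, `RangeQueries` is a hypothetical helper, and reading "open range" as "lower bound only" is an assumption, since the slides show only diagrams.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

/** Hypothetical helper showing the three range shapes on a timestamp-sorted map. */
final class RangeQueries {
    /** Full table scan: every row, in timestamp order. */
    static <V> List<V> fullScan(NavigableMap<Long, V> rows) {
        return new ArrayList<>(rows.values());
    }

    /** Open range (assumed meaning): everything from start onwards, no upper bound. */
    static <V> List<V> openRange(NavigableMap<Long, V> rows, long start) {
        return new ArrayList<>(rows.tailMap(start, true).values());
    }

    /** "Between" range: start inclusive, end exclusive. */
    static <V> List<V> between(NavigableMap<Long, V> rows, long start, long end) {
        return new ArrayList<>(rows.subMap(start, true, end, false).values());
    }
}
```

With keys 1 to 13, `openRange(rows, 10)` returns the last four rows and `between(rows, 3, 6)` returns the rows for keys 3, 4 and 5.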
Step 2: add some algebra
Stream Fusion for rich ad-hoc queries
What even is Stream Fusion?
map
filter
reduce
single step mapFilterReduce
data Step data cursor = Yield data !cursor | Skip !cursor | Done
data Stream data = ∃s. Stream (cursor → Step data cursor) cursor
Stream Beginning: reading from the DB
map
Yield data cursor → Yield (f data) cursor
Skip cursor       → Skip cursor
Done              → Done
maps :: (a → b) → Stream a → Stream b
filter
Yield data cursor | p data    → Yield data cursor
                  | otherwise → Skip cursor
Skip cursor                   → Skip cursor
Done                          → Done
filters :: (a → Bool) → Stream a → Stream a
reduce/fold
Yield x cursor → loop (f acc x) cursor
Skip cursor    → loop acc cursor
Done           → acc
foldls :: (Monoid acc) => (acc → a → acc) → acc → Stream a → acc
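The Step/Stream idea can also be sketched on the JVM. The Java below is an illustrative port, not the speaker's code: `Step`, `FusedStream`, `emit` and `range` are hypothetical names, and `range` stands in for a cursor that would normally read from the database. `map` and `filter` only wrap the step function, so the single loop in `foldl` does all the work in one pass.

```java
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Predicate;

/** One step of a stream: a value plus the next cursor, a skip, or the end. */
final class Step<A, S> {
    enum Kind { YIELD, SKIP, DONE }
    final Kind kind;
    final A value;   // set only for YIELD
    final S cursor;  // set for YIELD and SKIP

    private Step(Kind kind, A value, S cursor) {
        this.kind = kind; this.value = value; this.cursor = cursor;
    }
    static <A, S> Step<A, S> emit(A value, S cursor) { return new Step<A, S>(Kind.YIELD, value, cursor); }
    static <A, S> Step<A, S> skip(S cursor) { return new Step<A, S>(Kind.SKIP, null, cursor); }
    static <A, S> Step<A, S> done() { return new Step<A, S>(Kind.DONE, null, null); }
}

/** A stream is a step function plus a starting cursor. */
final class FusedStream<A, S> {
    private final Function<S, Step<A, S>> next;
    private final S start;

    FusedStream(Function<S, Step<A, S>> next, S start) { this.next = next; this.start = start; }

    /** map: transform yielded values, pass Skip and Done through. */
    <B> FusedStream<B, S> map(Function<A, B> f) {
        return new FusedStream<B, S>(s -> {
            Step<A, S> st = next.apply(s);
            switch (st.kind) {
                case YIELD: return Step.<B, S>emit(f.apply(st.value), st.cursor);
                case SKIP:  return Step.<B, S>skip(st.cursor);
                default:    return Step.<B, S>done();
            }
        }, start);
    }

    /** filter: turn rejected Yields into Skips, no intermediate collection. */
    FusedStream<A, S> filter(Predicate<A> p) {
        return new FusedStream<A, S>(s -> {
            Step<A, S> st = next.apply(s);
            if (st.kind == Step.Kind.YIELD && !p.test(st.value)) {
                return Step.<A, S>skip(st.cursor);
            }
            return st;
        }, start);
    }

    /** fold: the only loop; it drives the composed step function to Done. */
    <R> R foldl(BiFunction<R, A, R> f, R zero) {
        R acc = zero;
        S cursor = start;
        while (true) {
            Step<A, S> st = next.apply(cursor);
            switch (st.kind) {
                case YIELD: acc = f.apply(acc, st.value); cursor = st.cursor; break;
                case SKIP:  cursor = st.cursor; break;
                default:    return acc;
            }
        }
    }

    /** Hypothetical source: integers from..toExclusive, standing in for a DB cursor. */
    static FusedStream<Integer, Integer> range(int from, int toExclusive) {
        return new FusedStream<Integer, Integer>(
            i -> i < toExclusive ? Step.<Integer, Integer>emit(i, i + 1) : Step.<Integer, Integer>done(),
            from);
    }
}
```

For example, `FusedStream.range(1, 11).map(x -> x * 2).filter(x -> x % 4 == 0).foldl((acc, x) -> acc + x, 0)` evaluates to 60 while traversing the source once.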
Append
class Monoid a where
  mempty  :: a            -- ^ Identity of 'mappend'
  mappend :: a -> a -> a  -- ^ An associative operation
class (Monoid intermediate) => Aggregate intermediate end where
  combine :: intermediate -> end
Combine
data Count = Count Int
instance Monoid Count where
  mempty = Count 0
  mappend (Count a) (Count b) = Count $ a + b
instance Aggregate Count Int where
  combine (Count a) = a
Count Example
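The same Append/Combine split can be sketched in Java to show why it suits parallel queries: each partition produces a partial count, the partials are appended in any grouping, and `combine` extracts the final answer. The names `Aggregate`, `CountAggregate` and `Aggregations` are hypothetical, and the interface folds the Monoid and Aggregate classes into one type for brevity.

```java
import java.util.List;

/** Hypothetical interface merging the Monoid and Aggregate classes from the slides. */
interface Aggregate<I, E> {
    I empty();            // mempty: identity of append
    I append(I a, I b);   // mappend: must be associative
    E combine(I total);   // turn the intermediate value into the end result
}

/** Count: partial counts are added together, the end result is the total. */
final class CountAggregate implements Aggregate<Long, Long> {
    public Long empty() { return 0L; }
    public Long append(Long a, Long b) { return a + b; }
    public Long combine(Long total) { return total; }
}

final class Aggregations {
    /** Merge per-partition partial results, then produce the end value. */
    static <I, E> E run(Aggregate<I, E> aggregate, List<I> partials) {
        I acc = aggregate.empty();
        for (I partial : partials) {
            acc = aggregate.append(acc, partial);
        }
        return aggregate.combine(acc);
    }
}
```

Partial counts of 3, 2 and 0 from three partitions combine to 5, and an empty list of partials yields the identity, 0.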
Step 3: add some ML
Storing Models
Support Vector Machines
Hyperplane: α·x - φ = 1
[ α1 α2 α3 ... αn ] ρ
Option 1: list<double>
CREATE TABLE support_vectors(
  path varchar,
  alpha list<double>,
  phi int,
  PRIMARY KEY(path))
Problems
High deserialisation overhead
Need to add PK specifiers for multiple SVs
Alternative: blob & byte buffers
Vector Representation
byte address  0    8    16   24   32   40        n*8
              +----+----+----+----+----+---------+----+
              | α  | α  | α  | α  | α  |   ...   | α  |
              +----+----+----+----+----+---------+----+
points          0    1    2    3    4              n
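A minimal Java sketch of this layout, assuming 8-byte doubles in the buffer's default big-endian order; `VectorCodec` is a hypothetical helper, and the resulting buffer is what would be written to a blob column.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec: a double vector stored as consecutive 8-byte values. */
final class VectorCodec {
    /** Pack the vector; element i lives at byte offset i * 8. */
    static ByteBuffer encode(double[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Double.BYTES);
        for (double x : vector) {
            buf.putDouble(x);
        }
        buf.flip(); // make the written bytes readable from position 0
        return buf;
    }

    /** Random access to one element without decoding the rest. */
    static double get(ByteBuffer buf, int index) {
        return buf.getDouble(buf.position() + index * Double.BYTES);
    }

    /** Decode the whole vector; the source buffer's position is left untouched. */
    static double[] decode(ByteBuffer buf) {
        double[] out = new double[buf.remaining() / Double.BYTES];
        buf.duplicate().asDoubleBuffer().get(out);
        return out;
    }
}
```

Reading element 1 of an encoded `{0.5, -1.25, 3.0}` touches only bytes 8 to 15, which is the "fast relative access" the layout is after.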
Matrix Representation
          0     8     16    24    32    40        n*8
          +-----+-----+-----+-----+-----+---------+-----+
          | α00 | α01 | α02 | α03 | α04 |   ...   | α0n |
          +-----+-----+-----+-----+-----+---------+-----+
n*8 +     0     8     16    24    32    40        n*8
          +-----+-----+-----+-----+-----+---------+-----+
          | α10 | α11 | α12 | α13 | α14 |   ...   | α1n |
          +-----+-----+-----+-----+-----+---------+-----+
m*n*8 +   0     8     16    24    32    40        n*8
          +-----+-----+-----+-----+-----+---------+-----+
          | αm0 | αm1 | αm2 | αm3 | αm4 |   ...   | αmn |
          +-----+-----+-----+-----+-----+---------+-----+
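The row-major addressing can be sketched in Java as below. `MatrixBuffer` is a hypothetical class; it assumes 8-byte doubles and computes the byte offset of cell (row, col) as `(row * cols + col) * 8`, so row r starts `r * cols * 8` bytes into the buffer.

```java
import java.nio.ByteBuffer;

/** Hypothetical row-major matrix of doubles backed by a single byte buffer. */
final class MatrixBuffer {
    final int rows;
    final int cols;
    final ByteBuffer buf;

    MatrixBuffer(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.buf = ByteBuffer.allocate(rows * cols * Double.BYTES); // zero-filled
    }

    /** Byte offset of cell (row, col): whole rows first, then columns within the row. */
    private int offset(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("cell (" + row + ", " + col + ")");
        }
        return (row * cols + col) * Double.BYTES;
    }

    double get(int row, int col) {
        return buf.getDouble(offset(row, col));
    }

    void set(int row, int col, double value) {
        buf.putDouble(offset(row, col), value);
    }
}
```

In a 3 by 4 matrix, cell (1, 2) sits at byte offset (1 * 4 + 2) * 8 = 48, so a single absolute read fetches it without touching any other row.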
Advantages
“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations
Bayesian Classifiers
P(X | blue) = (Number of Blue near X) / (Total number of Blue)
P(X | red) = (Number of Red near X) / (Total number of Red)
[[Mean(x1), Var(x1)]
 [Mean(x2), Var(x2)]
 ...
 [Mean(xn), Var(xn)]]
byte address  0         8         16
              +---------+---------+
payloads      | Mean(x0)| Var(x0) |
              +---------+---------+
byte address  16        24        32
              +---------+---------+
payloads      | Mean(x1)| Var(x1) |
              +---------+---------+
byte address  2n*8      (2n+1)*8
              +---------+---------+
payloads      | Mean(xn)| Var(xn) |
              +---------+---------+
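A sketch of how such a buffer might be used for classification. The slides only show the mean/variance layout; treating each feature as an independent Gaussian (Gaussian naive Bayes) is an assumption made here, and `GaussianModel` is a hypothetical name. Feature i keeps its mean at byte offset 16 * i and its variance at 16 * i + 8.

```java
import java.nio.ByteBuffer;

/** Hypothetical per-class model: [mean, variance] pairs packed as doubles. */
final class GaussianModel {
    /** Pack the pairs: mean of feature i at 16 * i, variance at 16 * i + 8. */
    static ByteBuffer encode(double[] means, double[] variances) {
        ByteBuffer buf = ByteBuffer.allocate(means.length * 2 * Double.BYTES);
        for (int i = 0; i < means.length; i++) {
            buf.putDouble(16 * i, means[i]);
            buf.putDouble(16 * i + 8, variances[i]);
        }
        return buf;
    }

    /** Log-likelihood of x, assuming independent Gaussian features (an assumption). */
    static double logLikelihood(ByteBuffer model, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double mean = model.getDouble(16 * i);
            double variance = model.getDouble(16 * i + 8);
            double diff = x[i] - mean;
            sum += -0.5 * (Math.log(2 * Math.PI * variance) + diff * diff / variance);
        }
        return sum;
    }
}
```

A point is assigned to the class whose model gives the larger log-likelihood, optionally after adding the log of the class prior.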
Advantages
“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to implement atomic in-memory operations
Step 4: make it rocket-fast
Approximate Data Structures
Bloom Filters are basically long arrays / vectors
BitSet
bit address
0                               8
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
8                               16
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
16                              24
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
24                              32
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
...
Advantages
64 bits per 8-byte Long
Easy to represent by the long-array using offsets, bit shifts and masks
Easy to implement atomic in-memory operations
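A minimal Java sketch of a Bloom filter over a long array, using the offsets, shifts and masks mentioned above. `LongArrayBloomFilter` is a hypothetical class; the way the k bit positions are derived from `hashCode()` is an illustrative choice, not something taken from the slides. `AtomicLongArray` provides the atomic in-memory update.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical Bloom filter: the bit set is an array of 64-bit words. */
final class LongArrayBloomFilter {
    private final AtomicLongArray words;
    private final int numBits;
    private final int numHashes;

    LongArrayBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.words = new AtomicLongArray((numBits + 63) >>> 6); // 64 bits per word
    }

    /** i-th bit position for an item; the hash mixing is an illustrative assumption. */
    private int bitIndex(Object item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(Object item) {
        for (int i = 0; i < numHashes; i++) {
            int bit = bitIndex(item, i);
            int word = bit >>> 6;            // which long holds the bit
            long mask = 1L << (bit & 63);    // position of the bit inside that long
            words.getAndUpdate(word, w -> w | mask); // atomic set
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(Object item) {
        for (int i = 0; i < numHashes; i++) {
            int bit = bitIndex(item, i);
            if ((words.get(bit >>> 6) & (1L << (bit & 63))) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

Added items are always reported as possibly present; items that were never added may occasionally be reported as present too, which is the usual Bloom filter trade-off.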
Count-min sketches are basically int matrices
          0     8     16    24    32    40        n*8
          +-----+-----+-----+-----+-----+---------+-----+
          | α00 | α01 | α02 | α03 | α04 |   ...   | α0n |
          +-----+-----+-----+-----+-----+---------+-----+
n*8 +     0     8     16    24    32    40        n*8
          +-----+-----+-----+-----+-----+---------+-----+
          | α10 | α11 | α12 | α13 | α14 |   ...   | α1n |
          +-----+-----+-----+-----+-----+---------+-----+
m*n*8 +   0     8     16    24    32    40        n*8
          +-----+-----+-----+-----+-----+---------+-----+
          | αm0 | αm1 | αm2 | αm3 | αm4 |   ...   | αmn |
          +-----+-----+-----+-----+-----+---------+-----+
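A count-min sketch over a flat int array can be sketched as follows. `CountMinSketch` is a hypothetical class, and the per-row hash mixing is an illustrative assumption. Each row hashes the item to one column; an update increments one cell per row, and the estimate is the minimum over those cells.

```java
/** Hypothetical count-min sketch: a depth x width int matrix stored row-major. */
final class CountMinSketch {
    private final int depth;
    private final int width;
    private final int[] counts; // cell (row, col) lives at row * width + col

    CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new int[depth * width];
    }

    /** Column for an item in a given row; the mixing constants are an assumption. */
    private int column(Object item, int row) {
        int h = item.hashCode() ^ (row * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    /** Record n occurrences: one cell per row is incremented. */
    void add(Object item, int n) {
        for (int row = 0; row < depth; row++) {
            counts[row * width + column(item, row)] += n;
        }
    }

    /** Estimated count: never below the true count, possibly above it. */
    int estimate(Object item) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row * width + column(item, row)]);
        }
        return min;
    }
}
```

Collisions can only inflate a cell, so taking the minimum across rows gives the tightest available upper bound on the true count.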
Histograms are basically long vectors
Longs (counts)
byte address  0    8    16   24   32   40        n*8
              +----+----+----+----+----+---------+----+
              | α0 | α1 | α2 | α3 | α4 |   ...   | αn |
              +----+----+----+----+----+---------+----+
Doubles (bin start number)
byte address  0    8    16   24   32   40        n*8
              +----+----+----+----+----+---------+----+
              | α0 | α1 | α2 | α3 | α4 |   ...   | αn |
              +----+----+----+----+----+---------+----+
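The two parallel vectors can be sketched in Java as a sorted array of bin start values plus an array of counts. `Histogram` is a hypothetical class; dropping values below the first bin and putting everything at or above the last bin start into the last bin are assumptions of this sketch.

```java
import java.util.Arrays;

/** Hypothetical histogram: doubles mark where bins start, longs hold the counts. */
final class Histogram {
    private final double[] binStarts; // sorted ascending
    private final long[] counts;      // counts[i] belongs to the bin starting at binStarts[i]

    Histogram(double... binStarts) {
        this.binStarts = binStarts.clone();
        Arrays.sort(this.binStarts);
        this.counts = new long[binStarts.length];
    }

    /** Count a value; returns false if it lies below the first bin (assumed policy). */
    boolean add(double value) {
        int idx = Arrays.binarySearch(binStarts, value);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one: the last bin start <= value
        }
        if (idx < 0) {
            return false;
        }
        counts[idx]++;
        return true;
    }

    long count(int bin) {
        return counts[bin];
    }
}
```

With bin starts 0, 10 and 20, the values 5, 10, 15 and 25 produce counts of 1, 2 and 1, and -1 is rejected.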
Conclusions
Ad-hoc queries
Parallelism
Lightweight DSs representation
Optimisations and good API fits