TRANSCRIPT

Page 1: Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra (Sam Bisbee, Threat Stack) | C* Summit 2016

Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra

Sam Bisbee, Threat Stack CTO

Page 2:

Typical [time series] problems on C*

● Disk utilization creates a scaling pattern of lighting money on fire

– Only works for a month or two, even with 90% disk utilization

● Every write-up we found focused on schema design for tracking integers across time

– There are days we wish we only tracked integers

● Data drastically loses value over time, but C*'s design doesn't acknowledge this

– TTLs only address zero-value states, not partial value

– E.g., 99% of reads are for data in its first day

● Not all sensors are equal

Page 3:

Categories of Time Series Data

[Chart: data categories plotted by Volume of Tx's vs. Size of Tx's — CRUD / Web 2.0, System Monitoring (CPU, etc.), traditional object store, and Threat Stack]

Page 4:

Categories of Time Series Data

[Same chart as the previous slide, annotated: "Traditional time series on C*, what everyone writes about" and "We're going to need a bigger boat. Or disks."]

Page 5:

We care about this thing called margins

(see: we're in Boston, not the Valley)

Page 6:

Data at Threat Stack

● 5 to 10 TB per day of raw data

– Crossed several TB per day in first few months of production with ~4 people

● 80,000 to 150,000 Tx per second, analyzed in real time

– Internal goal of analyzing, persisting, and firing alerts in <1s

● 90% write to 10% read tx

● Pre-compute query results for 70% of queries for UI

– Optimized lookup tables & complex data structures, not just “query & cache”

● 100% AWS, distrust of remote storage in our DNA

– This is not just EBS bashing. This applies to all databases on all platforms, even a cage in a data center.

● By the way, we're on DSE 4.8.4 (C* 2.1)

Page 7:

Generic data model

● Entire platform assumes that events form a partially ordered, eventually consistent, write ahead log

– A wonderful C* use case, so long as you only INSERT

● UPDATE is a dirty word and C* counters are “banned”

– We do our big counts elsewhere (“right tool for the right job”)

● No DELETEs, too many key permutations and don't want tombstones

● Duplicate writes will happen

– Legitimate: fully or partially failed batches of writes

– Legitimate: sensor resends data because it doesn't see platform's acknowledgement of data

– How-do-you-even-computer: people cannot configure NTP, so have fun constantly receiving data from 1970

● TTL on insert time, store and query on event time
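
To make the model concrete, here is a minimal CQL sketch of an insert-only table keyed on event time with the TTL applied at insert time. The table and column names (events, sensor_id, event_day, payload) and the 2-day TTL are illustrative assumptions, not Threat Stack's actual schema.

    -- Hypothetical event table: partitioned by sensor and event day (see the
    -- partition key discussion later in the deck), clustered by event time.
    CREATE TABLE events (
        sensor_id  uuid,
        event_day  text,       -- event day, e.g. '2016-09-08'
        event_ts   timestamp,  -- event time (what we store and query on)
        event_id   timeuuid,
        payload    blob,
        PRIMARY KEY ((sensor_id, event_day), event_ts, event_id)
    ) WITH CLUSTERING ORDER BY (event_ts DESC, event_id DESC);

    -- Insert only: no UPDATE, no DELETE, no counters. Duplicates are tolerated.
    -- The TTL counts from insert time, even though the row is keyed on event time.
    INSERT INTO events (sensor_id, event_day, event_ts, event_id, payload)
    VALUES (?, ?, ?, ?, ?)
    USING TTL 172800;  -- illustrative 2-day TTL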

Page 8:

We need to show individual events or slices, so we cannot use time-granularity rows (1 min, 15 min, 30 min, 1 hr, etc.)

Page 9:

Creating and updating tables' schema

● ALTER TABLE isn't fun, so we support dual writes instead

– Create new schema, performing dual reads for new & old

– Cut writes over to new schema

– After TTL time, DROP TABLE old

● Each step is verifiable with unit tests and metrics

● Maintains the insert-only data model, at the cost of temporary extra disk utilization

● Allows trivial testing of analysis and A/B'ing of schema

– Just toss a new schema in, gather some insights, and then feel free to drop it
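
A sketch of the same migration in CQL, assuming hypothetical table names events_v1 and events_v2:

    -- Step 1: stand the new schema up alongside the old one
    -- (the application reads from both events_v1 and events_v2 during the transition).
    CREATE TABLE events_v2 (
        sensor_id  uuid,
        event_day  text,
        event_ts   timestamp,
        event_id   timeuuid,
        payload    blob,
        new_column text,  -- whatever the schema change was
        PRIMARY KEY ((sensor_id, event_day), event_ts, event_id)
    );

    -- Step 2: cut writes over to events_v2 (an application config change, no CQL).

    -- Step 3: once the TTL has passed and events_v1 only holds expired data,
    -- drop it instead of ever running ALTER TABLE or DELETE.
    DROP TABLE events_v1;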

Page 10:

AWS Instance Types & EBS

● EBS is generally banned on our platform

– Too many of us lived through the great outage

– Too many of us cannot live with unpredictable I/O patterns

– Biggest reason: you cannot RI EBS

● Originally used i2.2xlarge's in 2014/2015

– Considering the amount of "learning" we did, we were very grateful for SSDs given the amount of streaming we had to do

● Moved to d2.xlarge's and d2.2xlarge's in 2015

– RAID 0 the spindles with xfs

– We like the CPU and RAM to disk ratio, especially since compaction stops after a few hours

Page 11:

$/TB on AWS

                    i2.2xlarge              d2.2xlarge              c3.2xlarge + 6 x 2TB io1 EBS
    No Prepay       $619.04 / 1.6TB         $586.92 / 12TB          $1,713.16 / 12TB
                    = $386.90 / TB / month  = $48.91 / TB / month   = $142.77 / TB / month
    Partial Prepay  $530.37 / 1.6TB         $502.12 / 12TB          $1,684.59 / 12TB
                    = $331.48 / TB / month  = $41.85 / TB / month   = $140.39 / TB / month
    Full Prepay     $519.17 / 1.6TB         $492.00 / 12TB          $1,680.84 / 12TB
                    = $324.48 / TB / month  = $41.00 / TB / month   = $140.07 / TB / month

● Amortizes one-time RI across 1yr, focusing on cost instead of cash out of pocket

● Does not account for N=3 replication in the cluster, so ×3 for each record, then ×2 for worst-case compaction headroom (realistically you need MUCH LESS)

● The c3 column is sized to match the d2's disk size, so it is not a fair comparison versus the i2

Page 12:

We only store some raw data in C*

● Deleting data proved too difficult in the early days, even with DTCS (slides coming on how we solved this)

● Re-streaming due to regular maintenance could take a week or more

– Dropping instance size doesn't solve throughput problem since all resources are cut, not just disk size

– Another reason not to use EBS since you'll “never” get close to 100% disk utilization

● Due to the aforementioned C* durability design, the cost of data for days 2..N is too high even if you drop the replica count

Page 13:

Tying C* to raw data

● Every query must constrain a minimum of:

– Sensor ID

– Event Day

● Every query result must include a minimum of:

– Sensor ID

– Event Day

– Event ID

● Batches of (sensor_id, event_day, event_id) triples are then used to look up the raw events from raw data storage

– This isn't always necessary (aggregates, correlations, etc.)

– Even with additional hops, full reads are still <1s
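
A sketch of what such a query looks like against the hypothetical events table from the earlier slides; the literal values are illustrative:

    -- Every query constrains at least the sensor and the event day (the partition
    -- key), and every result carries the (sensor_id, event_day, event_id) triple
    -- needed to fetch the raw event from raw data storage.
    SELECT sensor_id, event_day, event_id, event_ts
    FROM events
    WHERE sensor_id = 3f2504e0-4f89-11d3-9a0c-0305e82c3301
      AND event_day = '2016-09-08'
      AND event_ts >= '2016-09-08 00:00:00'
      AND event_ts <  '2016-09-08 01:00:00';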

Page 14:

Using triples to batch writes

● Partition key starts with sensor id and event day

– Bonus: you get a fresh ring location every day! Helps average out your schema mistakes over the TTL

● Event batches off of RabbitMQ are already constrained to a single sensor id and event day

– Allows mapping a single AMQP read to a single C* write (RabbitMQ is podded, not clustered)

– Flow state of pipeline becomes trivial to understand

● Batch C* writes on partition key, then on data size (soft cap at 5120 bytes, Cassandra's internal batch size warning threshold)
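
Because every statement in such a batch targets the same (sensor id, event day) partition, an unlogged batch stays a single mutation. A sketch against the hypothetical events table; the 5120-byte soft cap corresponds to Cassandra's default batch_size_warn_threshold_in_kb of 5:

    -- All statements in the batch share one partition (one sensor, one event day),
    -- so an UNLOGGED batch is effectively a single mutation and stays cheap.
    -- Keep the batch under ~5120 bytes, Cassandra's default batch size warning threshold.
    BEGIN UNLOGGED BATCH
      INSERT INTO events (sensor_id, event_day, event_ts, event_id, payload)
      VALUES (3f2504e0-4f89-11d3-9a0c-0305e82c3301, '2016-09-08',
              '2016-09-08 13:37:00', now(), 0xCAFE) USING TTL 172800;
      INSERT INTO events (sensor_id, event_day, event_ts, event_id, payload)
      VALUES (3f2504e0-4f89-11d3-9a0c-0305e82c3301, '2016-09-08',
              '2016-09-08 13:37:01', now(), 0xBEEF) USING TTL 172800;
    APPLY BATCH;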

Page 15:

Compaction woes, STCS & DTCS

● Used STCS in 2014/2015, expired data would get stuck ∞

– “We could rotate tables” → eh, no

– “We could rotate clusters” → oh c'mon, hell no

– “We could generate every historic permutation of keys within that time bucket with Spark and run DELETEs” →...............

● Used DTCS in 2015, but expired data still got stuck ∞

– When deciding whether an SSTable is too old to compact, DTCS compares "now" against the SSTable's max timestamp (most recent write)

– If you write constantly (time series), then SSTables will rarely or never stop compacting

– This means you never realize the true value of DTCS for time series: the ability to unlink whole expired SSTables from disk

Page 16:

Cluster disk states assuming constant sensor count

[Chart: Disk Util over Time, comparing "what you want" against "what you get" after the initial build up to the retention period]

Page 17:

MTCS, fixing DTCS

https://github.com/threatstack/mtcs

Now compare with min time (oldest write)

Page 18:

MTCS settings

● Never run repairs (never worked on STCS or DTCS anyway) and hinted handoff is off (great way to kill a cluster anyway)

● max_sstable_age_days = 1

base_time_seconds = 1 hour

● Results in roughly hour-bucketed, sequential SSTables

– Reads are happy due to the day or hour resolution, which we have to provide in the partition key anyway

● Rest of DTCS sub-properties are default

● Not worried about really old and small SSTables since those are simply unlinked “soon”
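
Roughly how these settings map onto CQL. The sketch below uses the stock DTCS class name for illustration; with MTCS the 'class' entry would instead point at the compaction strategy class shipped in the threatstack/mtcs jar (its exact name is not given in the slides):

    -- Hour buckets, SSTables stop being compaction candidates after one day.
    ALTER TABLE events WITH compaction = {
      'class': 'DateTieredCompactionStrategy',
      'max_sstable_age_days': '1',
      'base_time_seconds': '3600'   -- 1 hour
    };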

Page 19:

MTCS + sstablejanitor.sh

● Even with MTCS, SSTables were still not getting unlinked

● Enter sstablejanitor.sh

– Cron job fires it once per hour

– Iterates over each SSTable on disk for MTCS tables (chef/cron feeds it a list of tables and their TTLs)

– Uses sstablemetadata to determine max timestamp

– If past TTL, then uses JMX to invoke CompactionManager's forceUserDefinedCompaction on the table

● Hack? Yes, cron + sed + awk + JMX qualifies as a hack, but it works like a charm and we don't carry expired data

● Bonus: don't need to reserve half your disks for compaction

Page 20:

Page 21:

Discussion

@threatstack @sbisbee