cassandra summit 2010 performance tuning

Cassandra Summit 1.0: Performance Tuning

Brandon Williams

Riptano, [email protected]

[email protected]

@faltering

driftx on freenode

August 10, 2010


Tuning Writes

Tuning Reads

Making writes faster

Use a separate IO device for the commit log.
  Hard to accomplish in the cloud
    Rackspace: one IO device, but it’s persistent (RAID array underneath)
    EC2: EBS is slow, local disk is not persistent
      You could put the commitlog on the ephemeral drive anyway, at the price of durability
      But then, why have a commitlog at all?
      Maybe you can disable it in 0.7/0.8
  Real servers: one RAID array, bad RAID options
    Will anyone ever offer SSDs?
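A quick way to check the first point on an existing node is to compare the device IDs of the two directories. The sketch below is only an illustration: the directory paths are hypothetical placeholders, and matching `st_dev` values only tell you that both paths sit on the same filesystem. Different values do not guarantee separate physical disks (two partitions on one disk also differ).

```python
import os


def on_same_device(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Hypothetical directories; substitute the ones from your own configuration.
    commitlog_dir = "/var/lib/cassandra/commitlog"
    data_dir = "/var/lib/cassandra/data"
    if os.path.isdir(commitlog_dir) and os.path.isdir(data_dir):
        if on_same_device(commitlog_dir, data_dir):
            print("commit log and data share a device")
        else:
            print("commit log and data report different devices")
```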







What else?

concurrent writers (concurrent readers for reads)
  increase if you have lots of cores
memtable flush writers
  increase if you have lots of IO
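As a rough starting point for these two settings, the sketch below applies a rule of thumb. Note the assumptions: the multipliers (8 per core for writes, 16 per data drive for reads) are taken from comments in later `cassandra.yaml` files, not from these slides, and the dictionary keys are illustrative names. Treat the output as a first guess to measure against, not a recommendation.

```python
def suggest_concurrency(cores, data_drives):
    """Return starting values for write and read concurrency.

    Assumed rule of thumb (not from the slides): writes scale with
    CPU cores, reads scale with the number of data drives.
    """
    return {
        "concurrent_writes": 8 * cores,
        "concurrent_reads": 16 * data_drives,
    }


if __name__ == "__main__":
    # Example: an 8-core box with 2 data drives.
    print(suggest_concurrency(cores=8, data_drives=2))
```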



What else?

concurrent writers (concurrent readers forreads)

increase if you have lots of coresmemtable flush writers

increase if you have lots of IO



What are all these options?

memtable throughput in mb

memtable operations in millions

memtable flush after mins
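The three options are independent flush triggers: a memtable is flushed when it crosses whichever threshold it reaches first. The following is a toy model of that logic, not Cassandra's implementation; the class and parameter names are made up for illustration.

```python
import time


class Memtable:
    """Toy memtable that tracks the three flush thresholds."""

    def __init__(self, throughput_mb, operations_millions, flush_after_mins,
                 now=time.time):
        self.max_bytes = throughput_mb * 1024 * 1024
        self.max_ops = operations_millions * 1_000_000
        self.max_age_s = flush_after_mins * 60
        self._now = now              # injectable clock, handy for testing
        self.created = now()
        self.bytes = 0
        self.ops = 0

    def write(self, nbytes):
        """Account for one write of nbytes."""
        self.bytes += nbytes
        self.ops += 1

    def should_flush(self):
        """True as soon as any one of the three thresholds is reached."""
        return (self.bytes >= self.max_bytes
                or self.ops >= self.max_ops
                or self._now() - self.created >= self.max_age_s)
```

With many small columns the operations threshold tends to trigger first; with few large columns the throughput threshold does.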

bigger memtables improve writes?
  no, but they can improve reads
    what?










Compaction: the slayer of reads

a necessary evil
IO contention hell
you can reduce compaction priority in 0.6.4 or later
  -Dcassandra.compaction.priority=1
  if writes constantly outstrip compaction, you need more nodes
  reducing the priority affects CPU usage, not IO
avoid reading from slow hosts
  dynamic snitch
  accrual failure detector
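The idea behind avoiding slow hosts can be sketched as ranking replicas by recently observed latency, so a node busy compacting drops to the back of the list. This is a toy illustration of the concept only. It is not the dynamic snitch's actual algorithm, and the class name and smoothing factor are assumptions.

```python
class LatencyRanker:
    """Rank replicas by an exponentially weighted moving average of latency."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha      # weight given to the newest sample
        self.scores = {}        # host -> smoothed latency in ms

    def record(self, host, latency_ms):
        """Fold one observed latency into the host's score."""
        prev = self.scores.get(host)
        if prev is None:
            self.scores[host] = float(latency_ms)
        else:
            self.scores[host] = (self.alpha * latency_ms
                                 + (1 - self.alpha) * prev)

    def ranked(self, replicas):
        """Return replicas fastest first; unseen hosts sort to the front."""
        return sorted(replicas, key=lambda h: self.scores.get(h, 0.0))
```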







Compaction (cont’d)

bigger memtables absorb more overwrites
  fewer sstables make for more efficient compaction
if you are write once then read-only, you *could* turn it off
  merge-on-read and bloomfilters save you
  someday, you’ll want to repair
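The overwrite-absorption point can be seen in a small simulation. This is a simplified model under stated assumptions: the memtable is a plain dict and flushes when it holds a given number of distinct keys, whereas real memtables flush on size, operation count, or age. The function name is made up for illustration.

```python
def flush_stats(writes, memtable_keys):
    """Replay (key, value) writes through a toy memtable.

    The memtable flushes once it holds memtable_keys distinct keys.
    Returns (sstables_flushed, entries_written_to_disk).
    """
    memtable = {}
    sstables = 0
    entries = 0
    for key, value in writes:
        memtable[key] = value            # an overwrite replaces the entry in memory
        if len(memtable) >= memtable_keys:
            sstables += 1
            entries += len(memtable)
            memtable = {}
    if memtable:                         # flush whatever is left at the end
        sstables += 1
        entries += len(memtable)
    return sstables, entries


if __name__ == "__main__":
    # 16 writes that keep overwriting the same 4 keys.
    writes = [(i % 4, i) for i in range(16)]
    print(flush_stats(writes, memtable_keys=2))   # small memtable
    print(flush_stats(writes, memtable_keys=8))   # large memtable
```

In this workload the small memtable produces many sstables that compaction later has to merge, while the large one absorbs the overwrites in memory and flushes once.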







Know your read pattern

how much data is in the working set?
  disk is slow: you want that in memory
  sometimes you can’t afford the cost
how many reads are repeats?
doing lots of random IO within a row?
  column index size in kb







Caches

on a cold hit, each row requires two seeks
  one to find the row’s position in the index
    key cache eliminates this
  another to read the row
    row cache eliminates this, too
  columns in the row are contiguous afterwards
    make fat rows
    but not too fat, since the row is the unit of distribution
the OS file cache
  use a good OS











Caching Strategies

key cache
  excellent bang for your buck
  half your seeks are gone
  a lot of keys fit in a relatively small amount of memory
row cache
  all seeks are gone
  but more heap usage = more GC pressure
    trying to use 32GB of row cache will wreck you
  estimating the correct size can be difficult
    use the average row size in cfstats as a starting point
    in 0.7, each SSTable has a persistent row size histogram
    the penalty for being wrong can be catastrophic: OOM
    can’t be done programmatically in Java, or Cassandra would do it for you
    this is why you can’t set an absolute amount in bytes
  if you enable it on very fat rows, it can be bad
    keep your indexes in a different column family
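A back-of-the-envelope estimate for the row cache, starting from the average row size reported by cfstats, might look like the sketch below. The overhead factor (on-heap objects take more space than the serialized row) and the maximum heap fraction are assumptions chosen for illustration, not measured values. Given the OOM penalty for guessing wrong, err on the small side and watch the heap.

```python
def row_cache_heap_bytes(rows_cached, avg_row_bytes, overhead_factor=2.0):
    """Rough heap estimate for a row cache.

    overhead_factor is an assumed multiplier for JVM object overhead.
    """
    return int(rows_cached * avg_row_bytes * overhead_factor)


def fits_in_heap(rows_cached, avg_row_bytes, heap_bytes,
                 max_fraction=0.25, overhead_factor=2.0):
    """True if the estimate stays under an assumed fraction of the heap."""
    estimate = row_cache_heap_bytes(rows_cached, avg_row_bytes, overhead_factor)
    return estimate <= heap_bytes * max_fraction


if __name__ == "__main__":
    heap = 4 * 1024 ** 3                       # 4 GiB heap
    for rows in (100_000, 1_000_000):
        est = row_cache_heap_bytes(rows, 2048)
        print(rows, est, fits_in_heap(rows, 2048, heap))
```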










Caching Strategies (cont’d)

OS file cache: it’s free
  no size estimation needed
  mmap is great
    unless it makes you swap
      switch to mmap index only
      why do you have swap enabled, anyway?
Absolute numbers vs percentages
  percentages can be an OOM time bomb
  harder to calculate how much memory the cache will use









lookup order:
  row cache
  key cache
  disk (file cache?)
sizing your caches:
  large key cache
  smaller row cache for very hot rows
  leave the rest to the OS
don’t make your heap larger than needed
monitor hit rates via JMX
  actually, monitor everything you can
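The lookup order and the seek counts from the earlier Caches slide can be put together in a small model. It is a sketch only: the caches are plain sets of keys, the OS file cache is ignored (so "disk" always costs the full seeks), and the function names are illustrative.

```python
def seeks_for_read(key, row_cache, key_cache):
    """Number of disk seeks for one row read under the simplified model."""
    if key in row_cache:
        return 0        # row served from memory, no seeks
    if key in key_cache:
        return 1        # index position known, one seek to read the row
    return 2            # cold: one seek for the index, one for the row


def hit_rate(hits, requests):
    """Cache hit rate as a fraction; 0.0 when there were no requests."""
    return hits / requests if requests else 0.0


if __name__ == "__main__":
    row_cache = {"hot"}
    key_cache = {"hot", "warm"}
    reads = ["hot", "warm", "cold", "hot"]
    print(sum(seeks_for_read(k, row_cache, key_cache) for k in reads))
    print(hit_rate(sum(k in key_cache for k in reads), len(reads)))
```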









Test, Measure, Tweak, Repeat

use stress.py as a baseline
  make sure you have multiprocessing

move to real world data
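When moving from stress.py to your own workload, a small timing harness makes each tweak comparable with the last. The sketch below is a generic illustration and not part of stress.py. The `op` callable is a hypothetical stand-in for one read or write against your cluster, and `iterations` must be at least 1.

```python
import time


def measure(op, iterations):
    """Time op() repeatedly and return latency statistics in seconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p99_index = min(len(samples) - 1, int(len(samples) * 0.99))
    return {
        "median": samples[len(samples) // 2],
        "p99": samples[p99_index],
        "max": samples[-1],
    }


if __name__ == "__main__":
    # Placeholder operation; replace with a real request to your cluster.
    print(measure(lambda: sum(range(1000)), iterations=200))
```

Run it before and after each change, and change one setting at a time so a difference can be attributed.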



Settings you don’t need to touch

commitlog rotation threshold in mb

SlicedBufferSizeInKB

FlushIndexBufferSizeInMB



The End

Questions?

