Understanding Cassandra internals to solve real-world problems
TRANSCRIPT
Cassandra Internals
Cassandra London Meetup – July 2013
Nicolas Favre-Felix
Software Engineer
@yowgi – @acunu
Nicolas Favre-Felix – Cassandra London July 2013
A lot to talk about
• Memtable
• SSTable
• Commit log
• Row Cache
• Key Cache
• Compaction
• Secondary indexes
• Bloom Filters
• Index samples
• Column indexes
• Thrift
• CQL
Four real-world problems
1. High latency in a read-heavy workload
2. High CPU usage with little activity on the cluster
3. nodetool repair taking too long to complete
4. Optimising for the highest insert throughput
Context
• Acunu professional services for Apache Cassandra
• 24x7 support for questions and emergencies
• Cluster “health check” sessions
• Cassandra Training & Workshop
“Reading takes too long”
Symptoms
• High latency observed in read operations
• Thousands of read requests per second
Staged Event-Driven Architecture (SEDA)
SEDA in Cassandra
• Stages in Cassandra have different roles
• MutationStage for writes
• ReadStage for reads
• ... 10 or so in total
• Each Stage is backed by a thread pool
• Not all task queues are bounded
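The stage-plus-thread-pool model above can be sketched in a few lines of Python. This `Stage` class is illustrative only, not Cassandra's actual implementation; the names and queue behaviour are assumptions modelled on the description above:

```python
import queue
import threading

class Stage:
    """A minimal SEDA-style stage: a named task queue drained by a
    fixed pool of worker threads (a sketch, not Cassandra's classes)."""
    def __init__(self, name, pool_size, max_queue=0):
        # max_queue=0 gives an unbounded queue, like some Cassandra stages
        self.name = name
        self.tasks = queue.Queue(maxsize=max_queue)
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(pool_size)]
        for w in self.workers:
            w.start()

    def submit(self, task):
        self.tasks.put(task)  # blocks if a bounded queue is full

    def _run(self):
        while True:
            task = self.tasks.get()
            task()
            self.tasks.task_done()

# Usage: a "ReadStage"-like pool with 16 workers
read_stage = Stage("ReadStage", pool_size=16)
results = []
for i in range(100):
    read_stage.submit(lambda i=i: results.append(i))
read_stage.tasks.join()  # wait until all queued reads are done
```

The key property this models: throughput is capped by the pool size, and work beyond it piles up in the queue as "pending" tasks.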
ReadStage
• Not all reads are equal:
• Some served from in-memory data structures
• Some served from the Linux page cache
• Some need to hit disk, possibly more than once
• Read operations can be disk-bound
• Avoid saturating disk with random reads
• Recommended pool size: 16×number_of_drives
nodetool tpstats

Pool Name                 Active  Pending  Completed
ReadStage                 16      3197     733819430
RequestResponseStage      0       0        3381277
MutationStage             5       0        1130984
ReadRepairStage           0       0        80095473
ReplicateOnWriteStage     0       0        4728857
GossipStage               0       0        20252373
AntiEntropyStage          0       0        2228
MigrationStage            0       0        19
MemtablePostFlusher       0       0        839
StreamStage               0       0        40
FlushWriter               0       0        2349
MiscStage                 0       0        0
commitlog_archiver        0       0        0
AntiEntropySessions       0       0        11
InternalResponseStage     0       0        7
HintedHandoff             0       0        6018
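A backlog like the ReadStage line above (thousands of pending tasks against zero or near-zero everywhere else) is easy to flag programmatically. A rough parser for this four-column output might look like the sketch below; the column layout is assumed from the sample, and the threshold is arbitrary:

```python
def parse_tpstats(text):
    """Parse nodetool tpstats-style output into
    {pool: (active, pending, completed)}."""
    pools = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            name = parts[0]
            active, pending, completed = (int(p) for p in parts[1:4])
            pools[name] = (active, pending, completed)
    return pools

sample = """Pool Name Active Pending Completed
ReadStage 16 3197 733819430
MutationStage 5 0 1130984"""

stats = parse_tpstats(sample)
# Flag any stage whose queue is backing up
backlogged = [p for p, (a, pend, c) in stats.items() if pend > 100]
```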
Solution
• iostat: little I/O activity
• free: large amount of memory used to cache pages
• → Increased concurrent_reads to 32
• → Latency dropped to reasonable levels
• Recommendations:
• Reduce the number of reads
• Keep an eye on I/O as data grows
• Buy more disks or RAM when falling out of cache
“Cassandra is busy doing nothing”
Context
• 2-node cluster
• Little activity on the cluster
• Very high CPU usage on the nodes
• Storing metadata on published web content
nodetool cfhistograms
• Node-local histograms, kept per column family
• Distribution of number of files accessed per read
• Distribution of read and write latencies
• Distribution of row sizes and column counts
• Buckets are approximate but still very useful
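The "approximate buckets" point can be illustrated with exponentially spaced bucket boundaries, in the spirit of the estimated histograms behind cfhistograms. The ~20% growth factor here is an assumption for illustration, not a guaranteed match to Cassandra's exact offsets:

```python
def bucket_offsets(n=20, growth=1.2):
    """Exponentially spaced bucket boundaries: each boundary is roughly
    20% larger than the previous one (growth factor is an assumption)."""
    offsets = [1]
    while len(offsets) < n:
        nxt = max(offsets[-1] + 1, int(offsets[-1] * growth))
        offsets.append(nxt)
    return offsets

def bucket_for(value, offsets):
    """Index of the first bucket boundary >= value."""
    for i, b in enumerate(offsets):
        if value <= b:
            return i
    return len(offsets)  # overflow bucket

offs = bucket_offsets()
# Nearby values land in the same bucket, so counts are approximate
# but the cost of recording a sample is tiny and bounded.
```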
SSTables accessed per read
[Chart: number of reads (0–3,000,000) vs. SSTables accessed per read (0–10)]
Row size distribution (bytes)
[Chart: number of rows (0–5) vs. row size in bytes (up to 25,000,000)]
Column count distribution
[Chart: number of rows (0–10) vs. number of columns (up to 6,000,000)]
Read latency distribution (µsec)
[Chart: number of reads (up to 900,000) vs. latency in µsec, log scale from 1 to 1,000,000]
Data model issue
• Row key was “views”
• Column names were item names, values were counters
• Cassandra stored only a few massive rows
• → Reading from many SSTables
• → De-serialising large column indexes
views → { post-1234: 77, post-1240: 8, post-1250: 3 }
CF read latency & column index (taken from Aaron Morton’s talk at Cassandra SF 2012)
[Chart: latency in microseconds at the 85th, 95th, and 99th percentiles, comparing
reading the first column from a 1,200-column row vs. from a 1,000,000-column row]
Solution
• “Transpose” the table:
• Make the item name the row key
• Have a few counters per item
• Distribute the rows across the whole cluster
post-123 → { views: 9078, comments: 3 }
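The "transpose" can be sketched as a plain data transformation: one wide row keyed by metric becomes one small row per item. This is a schematic model of the schema change, not actual Cassandra code:

```python
def transpose(rows):
    """Turn one wide row per metric ("views" -> {item: count})
    into one narrow row per item ({item: {metric: count}})."""
    out = {}
    for metric, columns in rows.items():
        for item, count in columns.items():
            out.setdefault(item, {})[metric] = count
    return out

# Before: a single massive row holding every item's counter
before = {"views": {"post-1234": 77, "post-1240": 8, "post-1250": 3}}
# After: one tiny row per item, spread across the cluster by row key
after = transpose(before)
```

Each item now hashes to its own position on the ring, so the counters are distributed across all nodes instead of concentrated in a handful of huge rows.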
“nodetool repair takes ages”
Nodetool repair
• “Active Anti-Entropy” mechanism in Cassandra
• Synchronises replicas
• Running repair is important to replicate tombstones
• Should run at least once every gc_grace_seconds (10 days by default)
• Repair was taking a week to complete
Two phases
1. Contact replicas, ask for Merkle Trees
   • They scan their local data and send a tree back
2. Compare Merkle Trees between replicas
   • Identify differences
   • Stream blocks of data out to other nodes
   • Stream data in and merge locally
Merkle Trees
• Hashes of hashes of ... data
• A top hash covers hash-0 and hash-1; hash-00, hash-01, hash-10, and hash-11
  each cover one data block (blocks 0–3)
• The tree (2^15 = 32,768 leaf nodes) is held in memory; the data blocks live on disk
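The two repair phases can be sketched with a toy Merkle tree: hash each block, hash pairs of hashes up to a top hash, then compare leaf hashes between replicas to find the ranges that need streaming. This is a simplified model, not Cassandra's MerkleTree class:

```python
import hashlib

def leaf_hashes(blocks):
    """Hash each block of data; these are the tree's leaves."""
    return [hashlib.sha256(b).digest() for b in blocks]

def build_tree(hashes):
    """Hash pairs of hashes bottom-up until a single top hash remains.
    Returns all levels, leaves first (assumes a power-of-two leaf count)."""
    levels = [hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels

def out_of_sync(blocks_a, blocks_b):
    """Leaf ranges whose hashes differ between two replicas --
    the data that phase 2 would stream."""
    a, b = leaf_hashes(blocks_a), leaf_hashes(blocks_b)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

replica1 = [b"block0", b"block1", b"block2", b"block3"]
replica2 = [b"block0", b"block1-stale", b"block2", b"block3"]
diffs = out_of_sync(replica1, replica2)  # only one range needs streaming
```

If the top hashes match, the replicas agree and nothing is streamed; every differing leaf translates into data sent over the network, which is why "4,700 ranges out of sync" means hours of streaming.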
Cassandra logs
• Merkle Tree requests and responses
• Check how long it took
• Differences found, in number of leaf nodes
• More differences ⇒ more data to stream
• Streaming sessions starting and ending
Diagnostic
• Building Merkle Trees: 20-30 minutes
• “4,700 ranges out of sync” (~14% of 32,768)
• Streaming session to repair the range: 4.5 hours
• Much slower rate than expected
Solutions
• Increase consistency level from ONE
• Rely on read repair to decrease entropy
• Fix problem of dropped writes
• Review data model and cluster size
• Add more disks and RAM, maybe more nodes
• Investigate network issues (speed, partitions?)
• Monitor both phases of the repair process
“How can we write faster?”
Context
• Time-series data from 1 million sensors
• 40 data points (e.g. temperature, pressure...)
• Sent in one batch every 5 minutes
• 40M cols / 5 min = 133,000 cols/sec
• One node...
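The arithmetic behind that target rate, spelled out:

```python
sensors = 1_000_000
metrics_per_sensor = 40
interval_seconds = 5 * 60  # one batch every 5 minutes

columns_per_batch = sensors * metrics_per_sensor         # 40 million columns
columns_per_second = columns_per_batch / interval_seconds  # ~133,000 cols/sec
```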
Data model 1
• One row per (sensor, day)
• Metrics columns grouped by minute within the row
• Range queries between minutes A and B within a day
CREATE TABLE sensor_data (
    sensor_id text,
    day int,
    hour int,
    minute int,
    metric1 int,
    [...]
    metric40 int,
    PRIMARY KEY ((sensor_id, day), minute)
);
Data model 1
• At 12:00, insert 40 cols into row (sensor1, 2013-07-11)
• At 12:05, insert 40 cols into row (sensor1, 2013-07-11)
• These columns might not be written to the same file
• Compaction process needs to merge them together:
• Large amounts of overlap between SSTables
• Rate is around 500 KB/sec
• 30% CPU usage spent compacting; no issues with I/O
Data model 2
• One row per (sensor, day, minute)
• No range query within the day (need to enumerate)
• Compaction now reaching 7 MB/sec
• Tests show a 10-20% increase in throughput
- PRIMARY KEY ((sensor_id, day), minute);
+ PRIMARY KEY ((sensor_id, day, minute));
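The effect of moving `minute` into the partition key can be sketched by counting the partitions each model produces for one sensor over one day. This is a schematic model of the two schemas, not CQL:

```python
def partitions(inserts, partition_fields):
    """Group inserts by their partition key fields to count how many
    storage rows each model creates."""
    return {tuple(ins[f] for f in partition_fields) for ins in inserts}

# One sensor over one day, a batch of readings every 5 minutes:
inserts = [{"sensor_id": "sensor1", "day": "2013-07-11", "minute": m}
           for m in range(0, 24 * 60, 5)]

# Model 1: ((sensor_id, day), minute) -- one wide row, rewritten all day
model1 = partitions(inserts, ("sensor_id", "day"))
# Model 2: ((sensor_id, day, minute)) -- one small row per batch
model2 = partitions(inserts, ("sensor_id", "day", "minute"))
```

Model 1 appends to the same row in every flush, so successive SSTables overlap heavily and compaction must keep merging them; model 2 writes each row once, giving compaction far less to do.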
Next steps
• Workload is CPU-bound, disks are not a problem
• Larger memtables mean lower write amplification
• Managed to flush after 400k ops instead of 200k
• Track time spent in GC with jstat -gcutil
• At this rate, consider adding more nodes
Four problems, four solutions
1. Interactions between Cassandra and the hardware
2. Implications of a bad data model at the storage layer
3. Internal data structures and processes
4. Work involved in arranging data on disk
Guidelines
• Monitor Cassandra, OS, JVM, hardware
• Learn how to use nodetool
• Follow best practices in data modelling and sizing
• Keep an eye on the Cassandra logs
• Think of the available resources (CPU, disk, RAM) as sharing the cluster’s “work”
Thank you!