Understanding Cassandra internals to solve real-world problems
TRANSCRIPT
Cassandra Internals
Cassandra London Meetup – July 2013
Nicolas Favre-Felix
Software Engineer
@yowgi – @acunu
Nicolas Favre-Felix – Cassandra London July 2013
A lot to talk about
• Memtable
• SSTable
• Commit log
• Row Cache
• Key Cache
• Compaction
• Secondary indexes
• Bloom Filters
• Index samples
• Column indexes
• Thrift
• CQL
Four real-world problems
1. High latency in a read-heavy workload
2. High CPU usage with little activity on the cluster
3. nodetool repair taking too long to complete
4. Optimising for the highest insert throughput
Context
• Acunu professional services for Apache Cassandra
• 24x7 support for questions and emergencies
• Cluster “health check” sessions
• Cassandra Training & Workshop
“Reading takes too long”
Symptoms
• High latency observed in read operations
• Thousands of read requests per second
Staged Event-Driven Architecture (SEDA)
SEDA in Cassandra
• Stages in Cassandra have different roles
• MutationStage for writes
• ReadStage for reads
• ... 10 or so in total
• Each Stage is backed by a thread pool
• Not all task queues are bounded
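The stage-plus-thread-pool model above can be sketched in a few lines of Python. This `Stage` class is illustrative only, not Cassandra's actual implementation; the names and queue behaviour are assumptions modelled on the description above:

```python
import queue
import threading

class Stage:
    """A minimal SEDA-style stage: a named task queue drained by a
    fixed pool of worker threads (a sketch, not Cassandra's classes)."""
    def __init__(self, name, pool_size, max_queue=0):
        # max_queue=0 gives an unbounded queue, like some Cassandra stages
        self.name = name
        self.tasks = queue.Queue(maxsize=max_queue)
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(pool_size)]
        for w in self.workers:
            w.start()

    def submit(self, task):
        self.tasks.put(task)  # blocks if a bounded queue is full

    def _run(self):
        while True:
            task = self.tasks.get()
            task()
            self.tasks.task_done()

# Usage: a "ReadStage"-like pool with 16 workers
read_stage = Stage("ReadStage", pool_size=16)
results = []
for i in range(100):
    read_stage.submit(lambda i=i: results.append(i))
read_stage.tasks.join()  # wait until all queued reads are done
```

The key property this models: throughput is capped by the pool size, and work beyond it piles up in the queue as "pending" tasks.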
ReadStage
• Not all reads are equal:
• Some served from in-memory data structures
• Some served from the Linux page cache
• Some need to hit disk, possibly more than once
• Read operations can be disk-bound
• Avoid saturating disk with random reads
• Recommended pool size: 16×number_of_drives
nodetool tpstats

Pool Name                 Active  Pending  Completed
ReadStage                 16      3197     733819430
RequestResponseStage      0       0        3381277
MutationStage             5       0        1130984
ReadRepairStage           0       0        80095473
ReplicateOnWriteStage     0       0        4728857
GossipStage               0       0        20252373
AntiEntropyStage          0       0        2228
MigrationStage            0       0        19
MemtablePostFlusher       0       0        839
StreamStage               0       0        40
FlushWriter               0       0        2349
MiscStage                 0       0        0
commitlog_archiver        0       0        0
AntiEntropySessions       0       0        11
InternalResponseStage     0       0        7
HintedHandoff             0       0        6018
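A backlog like the ReadStage line above (thousands of pending tasks against zero or near-zero everywhere else) is easy to flag programmatically. A rough parser for this four-column output might look like the sketch below; the column layout is assumed from the sample, and the threshold is arbitrary:

```python
def parse_tpstats(text):
    """Parse nodetool tpstats-style output into
    {pool: (active, pending, completed)}."""
    pools = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            name = parts[0]
            active, pending, completed = (int(p) for p in parts[1:4])
            pools[name] = (active, pending, completed)
    return pools

sample = """Pool Name Active Pending Completed
ReadStage 16 3197 733819430
MutationStage 5 0 1130984"""

stats = parse_tpstats(sample)
# Flag any stage whose queue is backing up
backlogged = [p for p, (a, pend, c) in stats.items() if pend > 100]
```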
Solution
• iostat: little I/O activity
• free: large amount of memory used to cache pages
• → Increased concurrent_reads to 32
• → Latency dropped to reasonable levels
• Recommendations:
• Reduce the number of reads
• Keep an eye on I/O as data grows
• Buy more disks or RAM when falling out of cache
“Cassandra is busy doing nothing”
Context
• 2-node cluster
• Little activity on the cluster
• Very high CPU usage on the nodes
• Storing metadata on published web content
nodetool cfhistograms
• Node-local histograms, kept per column family
• Distribution of number of files accessed per read
• Distribution of read and write latencies
• Distribution of row sizes and column counts
• Buckets are approximate but still very useful
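The "approximate buckets" point can be illustrated with exponentially spaced bucket boundaries, in the spirit of the estimated histograms behind cfhistograms. The ~20% growth factor here is an assumption for illustration, not a guaranteed match to Cassandra's exact offsets:

```python
def bucket_offsets(n=20, growth=1.2):
    """Exponentially spaced bucket boundaries: each boundary is roughly
    20% larger than the previous one (growth factor is an assumption)."""
    offsets = [1]
    while len(offsets) < n:
        nxt = max(offsets[-1] + 1, int(offsets[-1] * growth))
        offsets.append(nxt)
    return offsets

def bucket_for(value, offsets):
    """Index of the first bucket boundary >= value."""
    for i, b in enumerate(offsets):
        if value <= b:
            return i
    return len(offsets)  # overflow bucket

offs = bucket_offsets()
# Nearby values land in the same bucket, so counts are approximate
# but the cost of recording a sample is tiny and bounded.
```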
SSTables accessed per read
[Chart: number of reads (0–3,000,000) vs. SSTables accessed per read (0–10)]
Row size distribution (bytes)
[Chart: number of rows (0–5) vs. row size in bytes (up to 25,000,000)]
Column count distribution
[Chart: number of rows (0–10) vs. number of columns (up to 6,000,000)]
Read latency distribution (µsec)
[Chart: number of reads (up to 900,000) vs. latency in µsec, log scale from 1 to 1,000,000]
Data model issue
• Row key was “views”
• Column names were item names, values were counters
• Cassandra stored only a few massive rows
• → Reading from many SSTables
• → De-serialising large column indexes
views → { post-1234: 77, post-1240: 8, post-1250: 3 }
CF read latency & column index (taken from Aaron Morton’s talk at Cassandra SF 2012)
[Chart: latency in microseconds at the 85th, 95th, and 99th percentiles, comparing
reading the first column from a 1,200-column row vs. from a 1,000,000-column row]
Solution
• “Transpose” the table:
• Make the item name the row key
• Have a few counters per item
• Distribute the rows across the whole cluster
post-123 → { views: 9078, comments: 3 }
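The "transpose" can be sketched as a plain data transformation: one wide row keyed by metric becomes one small row per item. This is a schematic model of the schema change, not actual Cassandra code:

```python
def transpose(rows):
    """Turn one wide row per metric ("views" -> {item: count})
    into one narrow row per item ({item: {metric: count}})."""
    out = {}
    for metric, columns in rows.items():
        for item, count in columns.items():
            out.setdefault(item, {})[metric] = count
    return out

# Before: a single massive row holding every item's counter
before = {"views": {"post-1234": 77, "post-1240": 8, "post-1250": 3}}
# After: one tiny row per item, spread across the cluster by row key
after = transpose(before)
```

Each item now hashes to its own position on the ring, so the counters are distributed across all nodes instead of concentrated in a handful of huge rows.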
“nodetool repair takes ages”
Nodetool repair
• “Active Anti-Entropy” mechanism in Cassandra
• Synchronises replicas
• Running repair is important to replicate tombstones
• Should run at least once every gc_grace_seconds (10 days by default)
• Repair was taking a week to complete
Two phases
1. Contact replicas, ask for Merkle Trees
   • They scan their local data and send a tree back
2. Compare Merkle Trees between replicas
   • Identify differences
   • Stream blocks of data out to other nodes
   • Stream data in and merge locally
Merkle Trees
• Hashes of hashes of ... data
• A top hash covers hash-0 and hash-1; hash-00, hash-01, hash-10, and hash-11
  each cover one data block (blocks 0–3)
• The tree (2^15 = 32,768 leaf nodes) is held in memory; the data blocks live on disk
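The two repair phases can be sketched with a toy Merkle tree: hash each block, hash pairs of hashes up to a top hash, then compare leaf hashes between replicas to find the ranges that need streaming. This is a simplified model, not Cassandra's MerkleTree class:

```python
import hashlib

def leaf_hashes(blocks):
    """Hash each block of data; these are the tree's leaves."""
    return [hashlib.sha256(b).digest() for b in blocks]

def build_tree(hashes):
    """Hash pairs of hashes bottom-up until a single top hash remains.
    Returns all levels, leaves first (assumes a power-of-two leaf count)."""
    levels = [hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels

def out_of_sync(blocks_a, blocks_b):
    """Leaf ranges whose hashes differ between two replicas --
    the data that phase 2 would stream."""
    a, b = leaf_hashes(blocks_a), leaf_hashes(blocks_b)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

replica1 = [b"block0", b"block1", b"block2", b"block3"]
replica2 = [b"block0", b"block1-stale", b"block2", b"block3"]
diffs = out_of_sync(replica1, replica2)  # only one range needs streaming
```

If the top hashes match, the replicas agree and nothing is streamed; every differing leaf translates into data sent over the network, which is why "4,700 ranges out of sync" means hours of streaming.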
Cassandra logs
• Merkle Tree requests and responses
• Check how long it took
• Differences found, in number of leaf nodes
• More differences ⇒ more data to stream
• Streaming sessions starting and ending
Diagnostic
• Building Merkle Trees: 20-30 minutes
• “4,700 ranges out of sync” (~14% of 32,768)
• Streaming session to repair the range: 4.5 hours
• Much slower rate than expected
Solutions
• Increase consistency level from ONE
• Rely on read repair to decrease entropy
• Fix problem of dropped writes
• Review data model and cluster size
• Add more disks and RAM, maybe more nodes
• Investigate network issues (speed, partitions?)
• Monitor both phases of the repair process
“How can we write faster?”
Context
• Time-series data from 1 million sensors
• 40 data points (e.g. temperature, pressure...)
• Sent in one batch every 5 minutes
• 40M cols / 5 min = 133,000 cols/sec
• One node...
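The arithmetic behind that target rate, spelled out:

```python
sensors = 1_000_000
metrics_per_sensor = 40
interval_seconds = 5 * 60  # one batch every 5 minutes

columns_per_batch = sensors * metrics_per_sensor         # 40 million columns
columns_per_second = columns_per_batch / interval_seconds  # ~133,000 cols/sec
```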
Data model 1
• One row per (sensor, day)
• Metrics columns grouped by minute within the row
• Range queries between minutes A and B within a day
CREATE TABLE sensor_data (
    sensor_id text,
    day int,
    hour int,
    minute int,
    metric1 int,
    [...]
    metric40 int,
    PRIMARY KEY ((sensor_id, day), minute)
);
Data model 1
• At 12:00, insert 40 cols into row (sensor1, 2013-07-11)
• At 12:05, insert 40 cols into row (sensor1, 2013-07-11)
• These columns might not be written to the same file
• Compaction process needs to merge them together:
• Large amounts of overlap between SSTables
• Rate is around 500 KB/sec
• 30% CPU usage spent compacting; no issues with I/O
Data model 2
• One row per (sensor, day, minute)
• No range query within the day (need to enumerate)
• Compaction now reaching 7 MB/sec
• Tests show a 10-20% increase in throughput
- PRIMARY KEY ((sensor_id, day), minute);
+ PRIMARY KEY ((sensor_id, day, minute));
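The effect of moving `minute` into the partition key can be sketched by counting the partitions each model produces for one sensor over one day. This is a schematic model of the two schemas, not CQL:

```python
def partitions(inserts, partition_fields):
    """Group inserts by their partition key fields to count how many
    storage rows each model creates."""
    return {tuple(ins[f] for f in partition_fields) for ins in inserts}

# One sensor over one day, a batch of readings every 5 minutes:
inserts = [{"sensor_id": "sensor1", "day": "2013-07-11", "minute": m}
           for m in range(0, 24 * 60, 5)]

# Model 1: ((sensor_id, day), minute) -- one wide row, rewritten all day
model1 = partitions(inserts, ("sensor_id", "day"))
# Model 2: ((sensor_id, day, minute)) -- one small row per batch
model2 = partitions(inserts, ("sensor_id", "day", "minute"))
```

Model 1 appends to the same row in every flush, so successive SSTables overlap heavily and compaction must keep merging them; model 2 writes each row once, giving compaction far less to do.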
Next steps
• Workload is CPU-bound, disks are not a problem
• Larger memtables mean lower write amplification
• Managed to flush after 400k ops instead of 200k
• Track time spent in GC with jstat -gcutil
• At this rate, consider adding more nodes
Four problems, four solutions
1. Interactions between Cassandra and the hardware
2. Implications of a bad data model at the storage layer
3. Internal data structures and processes
4. Work involved in arranging data on disk
Guidelines
• Monitor Cassandra, OS, JVM, hardware
• Learn how to use nodetool
• Follow best practices in data modelling and sizing
• Keep an eye on the Cassandra logs
• Think of the available resources (CPU, disk, RAM) as sharing the cluster’s “work”
Thank you!