Understanding Cassandra internals to solve real-world problems

Upload: acunu

Post on 10-May-2015

TRANSCRIPT

Page 1: Understanding Cassandra internals to solve real-world problems

Cassandra Internals

Cassandra London Meetup – July 2013

Nicolas Favre-Felix, Software Engineer

@yowgi – @acunu

Page 2: Understanding Cassandra internals to solve real-world problems

Nicolas Favre-Felix – Cassandra London July 2013

A lot to talk about

• Memtable

• SSTable

• Commit log

• Row Cache

• Key Cache

• Compaction

• Secondary indexes

• Bloom Filters

• Index samples

• Column indexes

• Thrift

• CQL

Page 3: Understanding Cassandra internals to solve real-world problems

Four real-world problems

1. High latency in a read-heavy workload

2. High CPU usage with little activity on the cluster

3. nodetool repair taking too long to complete

4. Optimising for the highest insert throughput

Page 4: Understanding Cassandra internals to solve real-world problems

Context

• Acunu professional services for Apache Cassandra

• 24x7 support for questions and emergencies

• Cluster “health check” sessions

• Cassandra Training & Workshop

Page 5: Understanding Cassandra internals to solve real-world problems

“Reading takes too long”

Page 6: Understanding Cassandra internals to solve real-world problems

Symptoms

• High latency observed in read operations

• Thousands of read requests per second

Page 7: Understanding Cassandra internals to solve real-world problems

Staged Event-Driven Architecture (SEDA)

Page 8: Understanding Cassandra internals to solve real-world problems

SEDA in Cassandra

• Stages in Cassandra have different roles:

  • MutationStage for writes

  • ReadStage for reads

  • ... 10 or so in total

• Each Stage is backed by a thread pool

• Not all task queues are bounded
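
The stage model described on this slide can be pictured with a toy sketch in Python. It is not Cassandra's code (Cassandra is written in Java): the `Stage` class, its method names, the worker count and the `demo` helper are all hypothetical and exist only to show how a small thread pool in front of an unbounded queue builds up "pending" tasks.

```python
import queue
import threading


class Stage:
    """Toy SEDA stage: a task queue drained by a fixed pool of worker threads.

    Hypothetical illustration only; not Cassandra's implementation.
    """

    def __init__(self, name, workers, max_pending=0):
        self.name = name
        # max_pending=0 gives an unbounded queue, so a slow stage can
        # accumulate pending tasks without ever pushing back on callers.
        self.tasks = queue.Queue(maxsize=max_pending)
        self.completed = 0
        self._lock = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self.tasks.get()
            try:
                task()
            finally:
                with self._lock:
                    self.completed += 1
                self.tasks.task_done()

    def submit(self, task):
        self.tasks.put(task)

    def pending(self):
        return self.tasks.qsize()


def demo():
    """Submit 10 slow tasks to a 2-worker stage and observe the backlog."""
    stage = Stage("ReadStage", workers=2)
    started = threading.Semaphore(0)
    release = threading.Event()

    def slow_read():
        started.release()   # signal that a worker picked this task up
        release.wait()      # stand-in for a read blocked on disk

    for _ in range(10):
        stage.submit(slow_read)

    started.acquire()
    started.acquire()       # both workers are now busy
    backlog = stage.pending()

    release.set()           # let every task finish
    stage.tasks.join()
    return backlog, stage.completed
```

With two workers and ten blocked tasks, `demo()` sees eight tasks waiting in the queue while two are active, which is the same shape as an Active/Pending pair in a thread pool report.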

Page 9: Understanding Cassandra internals to solve real-world problems

ReadStage

• Not all reads are equal:

  • Some served from in-memory data structures

  • Some served from the Linux page cache

  • Some need to hit disk, possibly more than once

• Read operations can be disk-bound

• Avoid saturating disk with random reads

• Recommended pool size: 16×number_of_drives

Page 10: Understanding Cassandra internals to solve real-world problems

nodetool tpstats

Pool Name                 Active   Pending    Completed
ReadStage                     16      3197    733819430
RequestResponseStage           0         0      3381277
MutationStage                  5         0      1130984
ReadRepairStage                0         0     80095473
ReplicateOnWriteStage          0         0      4728857
GossipStage                    0         0     20252373
AntiEntropyStage               0         0         2228
MigrationStage                 0         0           19
MemtablePostFlusher            0         0          839
StreamStage                    0         0           40
FlushWriter                    0         0         2349
MiscStage                      0         0            0
commitlog_archiver             0         0            0
AntiEntropySessions            0         0           11
InternalResponseStage          0         0            7
HintedHandoff                  0         0         6018
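
Spotting the backlogged pool in this output can be automated. The following is a minimal Python sketch that assumes the four-column layout shown on the slide (pool name, active, pending, completed); real `nodetool tpstats` output may carry extra columns depending on the Cassandra version, and the helper names `parse_tpstats` and `backlogged` are made up for this example.

```python
# A few rows copied from the slide; the column layout is an assumption.
TPSTATS = """\
Pool Name                 Active   Pending    Completed
ReadStage                     16      3197    733819430
RequestResponseStage           0         0      3381277
MutationStage                  5         0      1130984
ReadRepairStage                0         0     80095473
GossipStage                    0         0     20252373
"""


def parse_tpstats(text):
    """Parse 'name active pending completed' rows, skipping anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are a pool name followed by exactly three integers;
        # the header line splits into five words and is skipped.
        if len(fields) != 4 or not all(f.isdigit() for f in fields[1:]):
            continue
        name, active, pending, completed = fields
        rows.append({
            "pool": name,
            "active": int(active),
            "pending": int(pending),
            "completed": int(completed),
        })
    return rows


def backlogged(rows, max_pending=0):
    """Return the names of pools with more than max_pending queued tasks."""
    return [row["pool"] for row in rows if row["pending"] > max_pending]


rows = parse_tpstats(TPSTATS)
flagged = backlogged(rows)
```

Here `flagged` contains only `ReadStage`: all 16 of its threads are active and 3,197 reads are queued behind them.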

Page 11: Understanding Cassandra internals to solve real-world problems

Solution

• iostat: little I/O activity

• free: large amount of memory used to cache pages

• → Increased concurrent_reads to 32

• → Latency dropped to reasonable levels

• Recommendations:

  • Reduce the number of reads

  • Keep an eye on I/O as data grows

  • Buy more disks or RAM when falling out of cache

Page 12: Understanding Cassandra internals to solve real-world problems

“Cassandra is busy doing nothing”

Page 13: Understanding Cassandra internals to solve real-world problems

Context

• 2-node cluster

• Little activity on the cluster

• Very high CPU usage on the nodes

• Storing metadata on published web content

Page 14: Understanding Cassandra internals to solve real-world problems

nodetool cfhistograms

• Node-local histogram stored per CF, per node

• Distribution of number of files accessed per read

• Distribution of read and write latencies

• Distribution of row sizes and column counts

• Buckets are approximate but still very useful

Page 15: Understanding Cassandra internals to solve real-world problems

SSTables accessed per read

[Chart: number of reads (y-axis, 0 to 3,000,000) by number of SSTables accessed (x-axis, 0 to 10).]

Page 16: Understanding Cassandra internals to solve real-world problems

Row size distribution (bytes)

[Chart: number of rows (y-axis, 0 to 5) by row size in bytes (x-axis, 0 to 25,000,000).]

Page 17: Understanding Cassandra internals to solve real-world problems

Column count distribution

[Chart: number of rows (y-axis, 0 to 10) by number of columns (x-axis, 0 to 6,000,000).]

Page 18: Understanding Cassandra internals to solve real-world problems

Read latency distribution (µsec)

[Chart: number of reads (y-axis, 0 to 900,000) by latency in µsec (x-axis, logarithmic, 1 to 1,000,000).]

Page 19: Understanding Cassandra internals to solve real-world problems

Data model issue

• Row key was “views”

• Column names were item names; values were counters

• Cassandra stored only a few massive rows

• → Reading from many SSTables

• → De-serialising large column indexes

views : post-1234: 77, post-1240: 8, post-1250: 3

Page 20: Understanding Cassandra internals to solve real-world problems

CF read latency & column index (taken from Aaron Morton’s talk at Cassandra SF 2012)

[Chart: latency in microseconds (y-axis, 0 to 6,000) at the 85th, 95th and 99th percentiles (x-axis), for two series: “First column from 1,200” and “First column from 1,000,000”.]

Page 21: Understanding Cassandra internals to solve real-world problems

Solution

• “Transpose” the table:

• Make the item name the row key

• Have a few counters per item

• Distribute the rows across the whole cluster

post-123 : views: 9078, comments: 3
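
The transposition can be sketched with plain Python dictionaries standing in for rows and columns. This only illustrates the change in layout, using the sample values from the “Data model issue” slide; the `transpose` helper is hypothetical and no Cassandra API is involved.

```python
def transpose(wide_rows):
    """Turn {counter_name: {item: value}} into {item: {counter_name: value}}."""
    by_item = {}
    for counter_name, columns in wide_rows.items():
        for item, value in columns.items():
            by_item.setdefault(item, {})[counter_name] = value
    return by_item


# Original layout: one row key ("views") holding one column per item.
wide = {"views": {"post-1234": 77, "post-1240": 8, "post-1250": 3}}

# Transposed layout: one row key per item, a few counters in each row.
narrow = transpose(wide)
```

The wide layout has a single row key, so every read and write lands on the same row; the transposed layout has one small row per item, and those row keys can be spread across the nodes of the cluster.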

Page 22: Understanding Cassandra internals to solve real-world problems

“nodetool repair takes ages”

Page 23: Understanding Cassandra internals to solve real-world problems

Nodetool repair

• “Active Anti-Entropy” mechanism in Cassandra

• Synchronises replicas

• Running repair is important to replicate tombstones

• Should run at least once every 10 days (the default gc_grace_seconds)

• Repair was taking a week to complete

Page 24: Understanding Cassandra internals to solve real-world problems

Two phases

1. Contact replicas, ask for Merkle Trees

   1. They scan their local data and send a tree back

2. Compare Merkle Trees between replicas

   1. Identify differences

   2. Stream blocks of data out to other nodes

   3. Stream data in and merge locally

Page 25: Understanding Cassandra internals to solve real-world problems

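Merkle Trees

[Diagram: a binary hash tree. The top hash covers hash-0 and hash-1; these cover hash-00, hash-01, hash-10 and hash-11, which in turn hash data blocks 0 to 3. The hash levels are labelled “(memory)” and the data blocks “(disk)”.]

• Hashes of hashes of ... data

• 2^15 = 32,768 leaf nodes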
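
A small Python sketch of the idea, following the four-block diagram: each replica hashes its blocks into a tree, and two trees are compared from the top down so that only subtrees with different hashes are explored. SHA-256 and the function names are choices made for this example; the slides do not say which hash function or tree layout Cassandra uses.

```python
import hashlib


def _h(data: bytes) -> bytes:
    # SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels, leaves first and root last.

    The number of blocks is assumed to be a power of two.
    """
    level = [_h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(tree_a, tree_b):
    """Walk both trees from the root, descending only where hashes differ."""
    top = len(tree_a) - 1
    suspects = [0] if tree_a[top][0] != tree_b[top][0] else []
    for depth in range(top - 1, -1, -1):
        suspects = [
            child
            for node in suspects
            for child in (2 * node, 2 * node + 1)
            if tree_a[depth][child] != tree_b[depth][child]
        ]
    return suspects


replica_a = [b"block 0", b"block 1", b"block 2", b"block 3"]
replica_b = [b"block 0", b"block 1", b"stale block 2", b"block 3"]

out_of_sync = differing_leaves(build_tree(replica_a), build_tree(replica_b))
```

Only leaf 2 is reported, so only that block would need to be streamed between the two replicas. With identical data the root hashes match and nothing below them is compared.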

Page 26: Understanding Cassandra internals to solve real-world problems

Cassandra logs

• Merkle Tree requests and responses

• Check how long it took

• Differences found, in number of leaf nodes

• More differences ⇒ more data to stream

• Streaming sessions starting and ending

Page 27: Understanding Cassandra internals to solve real-world problems

Diagnostic

• Building Merkle Trees: 20-30 minutes

• “4,700 ranges out of sync” (~14% of 32,768)

• Streaming session to repair the range: 4.5 hours

• Much slower rate than expected
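
The “~14%” figure is simply the share of Merkle Tree leaves reported as different. As a quick check in Python (variable names are illustrative):

```python
LEAVES = 2 ** 15       # 32,768 leaf nodes per Merkle Tree
out_of_sync = 4_700    # ranges reported out of sync in the logs

fraction = out_of_sync / LEAVES
print(f"{fraction:.1%} of leaf ranges differ")  # prints "14.3% of leaf ranges differ"
```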

Page 28: Understanding Cassandra internals to solve real-world problems

Solutions

• Increase consistency level from ONE

• Rely on read repair to decrease entropy

• Fix problem of dropped writes

• Review data model and cluster size

• Add more disks and RAM, maybe more nodes

• Investigate network issues (speed, partitions?)

• Monitor both phases of the repair process

Page 29: Understanding Cassandra internals to solve real-world problems

“How can we write faster?”

Page 30: Understanding Cassandra internals to solve real-world problems

Context

• Time-series data from 1 million sensors

• 40 data points (e.g. temperature, pressure...)

• Sent in one batch every 5 minutes

• 40M cols / 5 min ≈ 133,000 cols/sec

• One node...
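
The target write rate follows directly from the numbers on this slide. Spelled out in Python (variable names are illustrative):

```python
sensors = 1_000_000
points_per_sensor = 40        # one column per data point
batch_interval_s = 5 * 60     # one batch every 5 minutes

cols_per_batch = sensors * points_per_sensor         # 40,000,000 columns
cols_per_sec = cols_per_batch / batch_interval_s     # roughly 133,333 per second
```

This is the sustained average; if all sensors report at the same moment, the instantaneous rate the single node has to absorb will be higher.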

Page 31: Understanding Cassandra internals to solve real-world problems

Data model 1

• One row per (sensor, day)

• Metrics columns grouped by minute within the row

• Range queries between minutes A and B within a day

CREATE TABLE sensor_data (
    sensor_id text,
    day int,
    hour int,
    minute int,
    metric1 int,
    [...]
    metric40 int,
    PRIMARY KEY ((sensor_id, day), minute)
);

Page 32: Understanding Cassandra internals to solve real-world problems

Data model 1

• At 12:00, insert 40 cols into row (sensor1, 2013-07-11)

• At 12:05, insert 40 cols into row (sensor1, 2013-07-11)

• These columns might not be written to the same file

• Compaction process needs to merge them together:

• Large amounts of overlap between SSTables

• Rate is around 500 KB/sec

• 30% CPU usage spent compacting; no issues with I/O

Page 33: Understanding Cassandra internals to solve real-world problems

Data model 2

• One row per (sensor, day, minute)

• No range query within the day (need to enumerate)

• Compaction now reaching 7 MB/sec

• Tests show a 10-20% increase in throughput

- PRIMARY KEY ((sensor_id, day), minute)
+ PRIMARY KEY ((sensor_id, day, minute))
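
Why the second key leaves less for compaction to merge can be seen in a toy simulation. The Python sketch below models each flushed SSTable as a set of row keys and counts how many SSTables each row ends up in. The flush interval, the three sensors and the function name are made up for illustration; real memtables flush on size and real compaction is far more involved.

```python
from collections import defaultdict


def sstables_per_row(row_key, sensors, minutes, batches_per_flush):
    """Return {row key: number of SSTables containing it} for a toy workload."""
    sstables = []      # each flushed SSTable is modelled as a set of row keys
    memtable = set()
    for batch_number, minute in enumerate(minutes, start=1):
        for sensor in sensors:
            memtable.add(row_key(sensor, "2013-07-11", minute))
        if batch_number % batches_per_flush == 0:
            sstables.append(memtable)   # pretend the memtable is flushed here
            memtable = set()
    if memtable:
        sstables.append(memtable)

    counts = defaultdict(int)
    for table in sstables:
        for key in table:
            counts[key] += 1
    return dict(counts)


sensors = ["sensor1", "sensor2", "sensor3"]
minutes = range(0, 60, 5)    # one batch every 5 minutes for an hour

# Data model 1: one row per (sensor, day)
model1 = sstables_per_row(lambda s, d, m: (s, d), sensors, minutes, 4)
# Data model 2: one row per (sensor, day, minute)
model2 = sstables_per_row(lambda s, d, m: (s, d, m), sensors, minutes, 4)
```

In this toy run every row of data model 1 appears in all three SSTables, so compaction has to merge fragments of the same row, while every row of data model 2 appears in exactly one SSTable.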

Page 34: Understanding Cassandra internals to solve real-world problems

Next steps

• Workload is CPU-bound, disks are not a problem

• Larger memtables mean lower write amplification

• Managed to flush after 400k ops instead of 200k

• Track time spent in GC with jstat -gcutil

• At this rate, consider adding more nodes

Page 35: Understanding Cassandra internals to solve real-world problems

Four problems, four solutions

1. Interactions between Cassandra and the hardware

2. Implications of a bad data model at the storage layer

3. Internal data structures and processes

4. Work involved in arranging data on disk

Page 36: Understanding Cassandra internals to solve real-world problems

Guidelines

• Monitor Cassandra, OS, JVM, hardware

• Learn how to use nodetool

• Follow best practices in data modelling and sizing

• Keep an eye on the Cassandra logs

• Consider available resources as sharing “work”

Page 37: Understanding Cassandra internals to solve real-world problems

Thank you!
