Cassandra TK 2014 - Large Nodes
DESCRIPTION
A discussion of running Cassandra with a large data load per node.
TRANSCRIPT
CASSANDRA TK 2014
LARGE NODES WITH CASSANDRA
Aaron Morton @aaronmorton
!
Co-Founder & Principal Consultant www.thelastpickle.com
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
About The Last Pickle. Work with clients to deliver and improve
Apache Cassandra based solutions.
Apache Cassandra Committer, DataStax MVP, Hector Maintainer, Apache Usergrid
Committer. Based in New Zealand & USA.
Large Node? !
“Avoid storing more than 500GB per node” !
(Originally said about EC2 nodes.)
Large Node? !
“You may have issues if you have over 1 Billion keys per node.”
Before version 1.2 large nodes had operational and
performance concerns.
After version 1.2 large nodes have fewer operational and
performance concerns.
Issues Pre 1.2 Work Arounds Pre 1.2
Improvements 1.2 to 2.1 !
Memory Management. Some in-memory structures grow with the number of rows and the size of
the data.
Bloom Filter. Stores a bitset used to determine, with a certain probability, whether a key exists
in an SSTable. !
Size depends on the number of rows and bloom_filter_fp_chance.
Bloom Filter. Allocates pages of 4096 longs in a
long[][] array.
[Chart: Bloom Filter Size in MB (0 to 1,200) vs Millions of Rows (1 to 1,000), for bloom_filter_fp_chance of 0.01 and 0.10]
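The curve above follows the standard Bloom filter sizing formula. A rough sketch of the estimate (an approximation; Cassandra's actual allocation rounds up to pages of longs):

```python
import math

def bloom_filter_mb(num_rows, fp_chance):
    # Standard Bloom filter sizing: bits per key = -ln(p) / (ln 2)^2
    bits_per_key = -math.log(fp_chance) / (math.log(2) ** 2)
    return num_rows * bits_per_key / 8 / 1024 / 1024

# 1 billion rows at the default 0.01 fp chance needs on the order of a gigabyte
print(round(bloom_filter_mb(1_000_000_000, 0.01)))  # ~1143 MB
print(round(bloom_filter_mb(1_000_000_000, 0.10)))  # ~571 MB
```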
Compression Metadata. Stores a long offset into the compressed
-Data.db file for each chunk_length_kb (default 64) of uncompressed data.
!
Size depends on the uncompressed data size.
Compression Metadata. Allocates pages of 4096 longs in a
long[][] array.
[Chart: Compression Metadata Size in MB (0 to 1,400) vs Uncompressed Size in GB (1 to 10,000), Snappy Compressor]
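The metadata cost can be estimated as one 8-byte offset per chunk of uncompressed data (a sketch of the arithmetic, not the exact page-aligned allocation):

```python
def compression_metadata_mb(uncompressed_gb, chunk_length_kb=64):
    # One long (8 bytes) offset is stored per chunk of uncompressed data
    chunks = uncompressed_gb * 1024 * 1024 / chunk_length_kb
    return chunks * 8 / 1024 / 1024

# 10 TB of uncompressed data keeps over a gigabyte of offsets resident
print(round(compression_metadata_mb(10_000)))  # 1250 MB
```

Increasing chunk_length_kb shrinks this linearly, which is the basis of the workaround discussed later.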
Index Samples. Stores an offset into -Index.db for every
index_interval (default 128) keys. !
Size depends on the number of rows and the size of the keys.
!
Index Samples. Allocates a long[] for offsets and a byte[][] for row keys. !
(Version 1.2 uses on-heap structures.)
[Chart: Index Sample Total Size in MB (0 to 300) vs Millions of Rows (1 to 1,000), position offsets plus row keys (25 bytes long)]
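Each sample holds a position offset plus a copy of the row key, so the total can be sketched as follows (an estimate, using the 25-byte keys from the chart):

```python
def index_sample_mb(num_rows, key_bytes=25, index_interval=128):
    samples = num_rows / index_interval
    # each sample stores an 8-byte long offset plus the row key bytes
    return samples * (8 + key_bytes) / 1024 / 1024

# 1 billion rows with 25-byte keys
print(round(index_sample_mb(1_000_000_000)))  # ~246 MB
```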
Memory Management. Larger heaps (above 8GB) take
longer to GC. !
A large working set results in frequent, prolonged GC.
Bootstrap. The joining node requests data from one replica of each token range it will own.
!
Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 Mbit/s, roughly 25 MB/s).
Bootstrap. With RF 3, only three nodes will send data to
a bootstrapping node. !
Maximum send rate is 75 MB/s (3 × 25 MB/s).
Moving Nodes. Copy data from existing node to new node.
!
At 50 MB/s, transferring 100GB takes roughly 33 minutes.
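The arithmetic behind that figure, as a quick sketch:

```python
def transfer_minutes(data_gb, rate_mb_per_sec):
    # decimal units: 1 GB = 1000 MB
    return data_gb * 1000 / rate_mb_per_sec / 60

print(round(transfer_minutes(100, 50)))   # ~33 minutes for 100GB at 50 MB/s
print(round(transfer_minutes(1000, 50)))  # ~333 minutes for 1TB at the same rate
```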
Disk Management. Need a multi TB volume or use multiple
volumes.
Disk Management with RAID-0. A single disk failure results in total node failure.
Disk Management with RAID-10. Requires double the raw capacity.
Disk Management with Multiple Volumes. Specified via data_file_directories.
!
Write load is not distributed. !
A single disk failure will shut down the node.
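Multiple volumes are listed in cassandra.yaml, one entry per volume (paths here are illustrative):

```yaml
# cassandra.yaml
data_file_directories:
    - /mnt/disk1/cassandra/data
    - /mnt/disk2/cassandra/data
    - /mnt/disk3/cassandra/data
```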
Repair. Compare data between nodes and exchange
differences. !
Comparing Data for Repair. Calculate a Merkle Tree hash by reading all
rows in a Table. (Validation Compaction)
!
Single comparator, throttled by compaction_throughput_mb_per_sec
(default 16).
Comparing Data for Repair. Time taken grows as the size of the data per
node grows.
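Since validation reads every row at the compaction throughput cap, the time can be roughly approximated (a sketch; real repairs also stream differences and may overlap ranges):

```python
def validation_hours(data_gb, throughput_mb_per_sec=16):
    # Merkle tree build reads all data, throttled by
    # compaction_throughput_mb_per_sec (default 16 MB/s)
    return data_gb * 1000 / throughput_mb_per_sec / 3600

print(round(validation_hours(500), 1))  # ~8.7 hours for 500GB at the default cap
```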
Exchanging Data for Repair. Ranges of rows with differences are
streamed. !
Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 Mbit/s, roughly 25 MB/s).
Compaction. Requires free space to write new SSTables.
SizeTieredCompactionStrategy. Groups SSTables by size, assumes no
reduction in size. !
In theory requires 50% free space; in practice it can work beyond 50%, though this is not
recommended.
LeveledCompactionStrategy. Groups SSTables by “level” and groups row
fragments per level. !
Requires approximately 25% free space.
Issues Pre 1.2 Work Arounds Pre 1.2 Improvements 1.2 to 2.1
!
Memory Management Work Arounds. Reduce Bloom Filter size by increasing
bloom_filter_fp_chance from 0.01 to 0.1.
!
May increase read latency.
Memory Management Work Arounds. Reduce Compression Metadata size by
increasing chunk_length_kb. !
May increase read latency.
Memory Management Work Arounds. Reduce Index Samples size by increasing
index_interval to 512. !
May increase read latency.
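The first two settings above are per-table schema options, while index_interval lives in cassandra.yaml in the 1.x line. An illustrative CQL fragment (keyspace, table name, and the chunk length value are placeholders):

```sql
-- Trade read latency for memory: wider fp chance, larger compression chunks
ALTER TABLE my_keyspace.my_table
  WITH bloom_filter_fp_chance = 0.1
  AND compression = {'sstable_compression': 'SnappyCompressor',
                     'chunk_length_kb': 256};
```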
Memory Management Work Arounds. When necessary, use a 12GB
MAX_HEAP_SIZE. !
Keep HEAP_NEWSIZE “reasonable”, e.g. less than 1200MB.
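In cassandra-env.sh terms, using the values from the slide:

```shell
# cassandra-env.sh: fixed 12GB heap with a modest new generation
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="1200M"
```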
Bootstrap Work Arounds. Increase streaming throughput via
nodetool setstreamthroughput whenever possible.
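For example (the value is in megabits per second; 400 here is illustrative):

```shell
# Double the streaming cap while a node bootstraps...
nodetool setstreamthroughput 400
# ...and restore the default once it has joined
nodetool setstreamthroughput 200
```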
Moving Node Work Arounds. Copy nodetool snapshot while the
original node is operational. !
Copy only a delta when the original node is stopped.
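A sketch of that copy, assuming default data paths and rsync (hostname and paths are placeholders):

```shell
# 1. With the node still serving traffic, snapshot and copy the bulk of the data
nodetool snapshot -t move
rsync -a /var/lib/cassandra/data/ newhost:/var/lib/cassandra/data/

# 2. Stop Cassandra, then a second rsync transfers only the delta
rsync -a /var/lib/cassandra/data/ newhost:/var/lib/cassandra/data/
```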
Disk Management Work Arounds. Use RAID-0 and over provision nodes
anticipating failure. !
Use RAID-10 and accept additional costs.
Repair Work Arounds. Only use repair if data is deleted; otherwise rely on Consistency Level for distribution.
!
Run frequent, small repairs using token ranges.
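Subrange repair limits each run to a slice of the ring via nodetool's start/end token flags (the tokens and keyspace name below are arbitrary examples):

```shell
# Repair a single narrow token range of one keyspace
nodetool repair -st -9223372036854775808 -et -9200000000000000000 my_keyspace
```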
Compaction Work Arounds. Over provision disk capacity when using SizeTieredCompactionStrategy.
!
Reduce min_compaction_threshold (default 4) and max_compaction_threshold (default 32) to reduce the number of SSTables per compaction.
Compaction Work Arounds. Use LeveledCompactionStrategy
where appropriate.
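Switching a table to LCS is a schema change (the table name and SSTable size are illustrative):

```sql
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 160};
```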
Issues Pre 1.2 Work Arounds Pre 1.2
Improvements 1.2 to 2.1
Memory Management Improvements. Version 1.2 moved Bloom Filters and
Compression Metadata off the JVM Heap to native memory.
!
Version 2.0 moved Index Samples off the JVM Heap.
Bootstrap Improvements. Virtual Nodes increase the number of Token
Ranges per node from 1 to 256. !
A bootstrapping node can request data from 256 different nodes.
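Virtual nodes are enabled in cassandra.yaml:

```yaml
# cassandra.yaml: 256 tokens per node instead of a single initial_token
num_tokens: 256
```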
Disk Layout Improvements. “JBOD” support distributes concurrent
writes to multiple data_file_directories.
Disk Layout Improvements. disk_failure_policy adds support for
handling disk failure: !
ignore
stop
best_effort
Repair Improvements. “Avoid repairing already-repaired data by
default” CASSANDRA-5351 !
Scheduled for 2.1
Compaction Improvements. “Avoid allocating overly large bloom filters”
CASSANDRA-5906 !
Included in 2.1
Thanks. !
Aaron Morton @aaronmorton
www.thelastpickle.com
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License