Cassandra Summit 2014: Performance Tuning Cassandra in AWS

DESCRIPTION

Presenter: Michael Nelson, Development Manager at FamilySearch. A recent research project at FamilySearch.org pushed Cassandra to very high scale and performance limits in AWS using a real application. Come see how we achieved 250K reads/sec with latencies under 5 milliseconds on a 400-core cluster holding 6 TB of data while maintaining transactional consistency for users. We'll cover tuning of Cassandra's caches, other server-side settings, the client driver, AWS cluster placement and instance types, and the tradeoffs between regular and SSD storage.

TRANSCRIPT

  • 1. Performance Tuning Cassandra in AWS. Cassandra Summit 2014. Michael Nelson. © 2014 by Intellectual Reserve, Inc. All rights reserved.
  • 2. Outline. The App: FamilySearch Family Tree. The Test: Borland Silk Performer. The Findings: row cache, token-aware driver, networking issues, etc.
  • 3. What Is FamilySearch? The FamilySearch.org website; a very large single pedigree (Family Tree); the largest collection of free genealogical records; the largest genealogical library; The Church of Jesus Christ of Latter-day Saints (Mormons).
  • 4. Why does FamilySearch exist? Visit http://mormon.org/family-history/
  • 5. Family Tree Data. Family Tree: 900M+ person records (open-edit); 500M+ relationships (open-edit); 8.4B change-log entries (~1M/day); 7 TB in Cassandra (13 TB in Oracle). A dynamic OLTP system with data-dependent performance issues.
  • 6. Family Tree: Example 9-Gen Pedigree. Up to 511 person slots; dynamic content.
  • 7. Family Tree: Example Pedigree App. 31+ persons per section; dynamic content.
  • 8. Family Tree: Example Ancestor Page. 10+ persons in families; 100-1000+ changes; dynamic content.
  • 9. Cassandra Reimplementation. Event-sourced data model (journal / views); new data model (no indexes); new consistency model (satisfies consistency). [Diagram: journal entries (JE) feeding per-person views for P1 and P2.]
  • 10. 77% Reads / 23% Writes. Reads: LOCAL_ONE, simple queries. Writes: LOCAL_QUORUM, atomic batches across multiple tables and multiple rows, business logic. (A driver-level sketch of this read/write setup follows the transcript.)
  • 11. A Little Optimization Goes a Long Way. 28-node cluster: 250,000 op/sec with an optimized app. 8-node cluster: 200,000 op/sec with an optimized app, row cache, and token-aware driver.
  • 12. Test System.
        Cassandra (Community Ed. 2.0.5): 8 hi1.4xlarge (16 CPU, 61 GB RAM, 2 TB SSD, 10 Gb net)
        Family Tree app servers (DataStax 2.0.0): 60 m2.2xlarge (4 CPU, 34 GB RAM, moderate net)
        Silk Performer load agents: 25 m2.xlarge (2 CPU, 17 GB RAM, moderate net)
  • 13. 2x Throughput Increase. [Chart: reads and writes in op/sec, 0 to 200,000, under Defaults, Row Cache, Token Aware, and concurrent_reads settings.]
  • 14. Row Cache = 35% More Throughput. Default key cache: caches the disk location, data comes from the disk cache, ~11 ms reads. Row cache: caches the row contents, ~7 ms reads.
  • 15. Configuring Row Cache. In cassandra.yaml:
        # Maximum size of the row cache in memory.
        # Default value is 0, to disable row caching.
        row_cache_size_in_mb: 32768
    Enable for each table explicitly:
        ALTER TABLE person_view WITH caching = 'ALL';
  • 16. 90% Row Cache Hit Rate.
  • 17. Token Aware = 50% More Throughput. Default round robin: the coordinator is a middleman, adds network hops, puts load on multiple nodes, ~7 ms. Token aware: reads from replicas, no network hops, ~2 ms.
  • 18. Configuring Token Aware. Default load balancing policy: new RoundRobinPolicy(). Better: new TokenAwarePolicy(new RoundRobinPolicy()). (A cluster-builder sketch follows the transcript.)
  • 19. concurrent_reads = 5% More Throughput.
    Defaults:
        concurrent_reads: 32
        concurrent_writes: 32
        native_transport_max_threads: 128
    Improved:
        concurrent_reads: 256
        concurrent_writes: 256
        native_transport_max_threads: 256
  • 20. Now Where's the Bottleneck? 181,000 reads/sec; 21,000 writes/sec. CPU = 80%; network = 10%; disk < 5%.
  • 21. Network Mystery: C* at 800 Mb. Cassandra never exceeded 800 Mb/s on the 10 Gb network.
  • 22. Network Mystery: Cyclic Net Queues. Roughly a 5-second cycle of network queues backing up; client machines seemed OK. Tweaking the network stack had no impact: net.core.wmem_max, net.core.rmem_max, net.ipv4.tcp_wmem, net.ipv4.tcp_rmem, net.core.somaxconn, net.core.netdev_max_backlog, net.ipv4.tcp_tw_recycle, net.ipv4.tcp_max_syn_backlog, net.ipv4.ip_local_port_range, txqueuelen.
  • 23. Network Mystery: Cyclic Net Queues. Send-Qs back up.
  • 24. Network Mystery: Cyclic Net Queues. Recv-Qs back up.
  • 25. Network Mystery: Cyclic Net Queues. Somewhat normal, then it starts again.
  • 26. 2x Throughput Increase. [Chart: reads and writes in op/sec, 0 to 200,000, under Defaults, Row Cache, Token Aware, and concurrent_reads settings.]
  • 27. Contact Info. Michael Nelson, Development Manager, nelsonmi@familysearch.org. Thanks to the FamilySearch team! Thanks to the awesome presenters & organizers at #CassandraSummit!
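
For illustration of the read/write split on slide 10: a minimal sketch using the DataStax Java driver 2.0 (the client version listed on the test-system slide). The contact point, keyspace, column names, and values here are hypothetical stand-ins, not FamilySearch's actual schema; the point is reads issued as simple queries at LOCAL_ONE and writes issued as one logged (atomic) batch across multiple tables at LOCAL_QUORUM.

    import java.util.UUID;

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.querybuilder.QueryBuilder;

    public class ReadWriteSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")          // hypothetical contact point
                    .build();
            Session session = cluster.connect("tree");     // hypothetical keyspace

            UUID personId = UUID.randomUUID();

            // Reads: a simple single-partition query at LOCAL_ONE.
            Statement read = QueryBuilder.select().all()
                    .from("person_view")
                    .where(QueryBuilder.eq("person_id", personId))
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
            for (Row row : session.execute(read)) {
                System.out.println(row);
            }

            // Writes: a logged (atomic) batch touching multiple tables/rows at LOCAL_QUORUM.
            BatchStatement batch = new BatchStatement();   // logged batch by default
            batch.add(new SimpleStatement(
                    "INSERT INTO journal (person_id, entry) VALUES (?, ?)",
                    personId, "created"));
            batch.add(new SimpleStatement(
                    "INSERT INTO person_view (person_id, name) VALUES (?, ?)",
                    personId, "Jane Doe"));
            batch.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(batch);

            cluster.close();
        }
    }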
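
And a minimal sketch of the token-aware configuration from slide 18, again with the DataStax Java driver 2.0 and a hypothetical contact point. Token-aware routing only kicks in when the driver can determine a statement's routing key (for example, with prepared statements or the query builder).

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.RoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class TokenAwareSketch {
        public static void main(String[] args) {
            // Wrap the default policy so statements whose partition key is known are
            // sent straight to a replica, skipping the extra coordinator hop.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")   // hypothetical contact point
                    .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                    .build();

            // ... create a Session and run queries as usual ...

            cluster.close();
        }
    }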