Aerospike & GCE (LSPE talk)
TRANSCRIPT
Database Landscape
• TRANSACTIONS (OLTP), structured data: response time in seconds, gigabytes of data, balanced reads/writes
• ANALYTICS (OLAP), structured data: response time in seconds, terabytes of data, read intensive
• BIG DATA ANALYTICS, unstructured data: response time in hours/minutes, TB to PB, compute intensive
• REAL-TIME BIG DATA, unstructured data: real-time transactions, response time < 10 ms, 1-20 TB, balanced reads/writes, 24x7x365 availability
Minimalistic Architecture
DNA: No Hotspots
• Data distribution
• Node-Node communication
• Node-Client communication
• Thread level
• CPU level
• Network level
• SSD level
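The hotspot-free data distribution can be sketched as follows. Aerospike maps every record key to one of 4096 fixed partitions via a uniform digest; the sketch below uses SHA-1 as a stand-in for the RIPEMD-160 digest Aerospike actually uses.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike's fixed partition count

def partition_id(key: bytes) -> int:
    """Map a record key to a partition via a uniform digest."""
    # SHA-1 is a stand-in here; any uniform hash gives the same
    # hotspot-free spread of keys across partitions.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS

# Every key lands deterministically in one partition, and partitions
# are spread evenly across the cluster nodes.
print(partition_id(b"user:42"))
```
Because keys hash uniformly, no single node ends up owning a disproportionate share of the traffic.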
GCE Network
• Andromeda : SDN (software-defined networking)
• VPC : Virtual Private Cloud
  • No multicast though
• KVM-based virtio
• DPDK but no SR-IOV
The Challenge (Oct 2014)
• 1 million write TPS
• Google's Cassandra benchmark : 300 nodes
  • Median latency = 10.3 ms
  • 95% < 23 ms latency
• Aerospike : 50 nodes
  • Median latency = 7 ms
  • 83% < 16 ms & 96.5% < 32 ms latency
CPU
• Not able to use the CPU fully; it saturates at around 50-60%
• Too much CPU going to system time
• Software interrupts are high (up to 30%) under high network load
top - 08:03:52 up 25 min, 2 users, load average: 2.25, 2.12, 1.57
Tasks: 84 total, 1 running, 83 sleeping, 0 stopped, 0 zombie
%Cpu0 : 22.8 us, 26.7 sy, 0.0 ni, 43.9 id, 0.0 wa, 0.0 hi, 6.7 si, 0.0 st
%Cpu1 : 23.5 us, 24.5 sy, 0.0 ni, 44.8 id, 0.0 wa, 0.0 hi, 7.2 si, 0.0 st
%Cpu2 : 24.0 us, 24.7 sy, 0.0 ni, 45.5 id, 0.0 wa, 0.0 hi, 5.7 si, 0.0 st
%Cpu3 : 24.5 us, 23.8 sy, 0.0 ni, 45.5 id, 0.0 wa, 0.0 hi, 6.3 si, 0.0 st
Signature of Network Bottleneck (CPU-based, Non-GCE)
top - 12:51:38 up 5:40, 4 users, load average: 2.86, 2.13, 1.15
Tasks: 152 total, 2 running, 150 sleeping, 0 stopped, 0 zombie
Cpu0  : 1.9%us, 4.4%sy, 0.0%ni, 88.7%id, 0.0%wa, 2.5%hi, 0.0%si, 2.5%st
Cpu1  : 5.0%us, 2.5%sy, 0.0%ni, 87.6%id, 0.0%wa, 2.5%hi, 0.0%si, 2.5%st
Cpu2  : 2.5%us, 4.4%sy, 0.0%ni, 88.7%id, 0.0%wa, 2.5%hi, 0.0%si, 1.9%st
Cpu3  : 5.0%us, 2.5%sy, 0.0%ni, 87.0%id, 0.0%wa, 3.1%hi, 0.0%si, 2.5%st
Cpu4  : 1.3%us, 15.3%sy, 0.0%ni, 0.3%id, 0.0%wa, 1.3%hi, 81.7%si, 0.0%st
Cpu5  : 2.0%us, 0.7%sy, 0.0%ni, 92.8%id, 0.0%wa, 2.6%hi, 0.0%si, 2.0%st
Cpu6  : 2.5%us, 3.2%sy, 0.0%ni, 89.8%id, 0.0%wa, 2.5%hi, 0.0%si, 1.9%st
Cpu7  : 0.3%us, 14.8%sy, 0.0%ni, 0.3%id, 0.0%wa, 1.0%hi, 83.6%si, 0.0%st
Cpu8  : 1.2%us, 1.9%sy, 0.0%ni, 92.5%id, 0.0%wa, 1.9%hi, 0.0%si, 2.5%st
Cpu9  : 1.9%us, 1.3%sy, 0.0%ni, 92.9%id, 0.0%wa, 1.9%hi, 0.0%si, 1.9%st
Cpu10 : 1.3%us, 0.7%sy, 0.0%ni, 93.5%id, 0.0%wa, 2.0%hi, 0.0%si, 2.6%st
Cpu11 : 1.3%us, 1.3%sy, 0.0%ni, 94.2%id, 0.0%wa, 1.3%hi, 0.0%si, 1.9%st
Cpu12 : 2.8%us, 4.8%sy, 0.0%ni, 88.3%id, 0.0%wa, 2.1%hi, 0.0%si, 2.1%st
Cpu13 : 0.6%us, 1.3%sy, 0.0%ni, 94.3%id, 0.0%wa, 1.9%hi, 0.0%si, 1.9%st
Cpu14 : 1.9%us, 2.5%sy, 0.0%ni, 91.8%id, 0.0%wa, 1.3%hi, 0.0%si, 2.5%st
Cpu15 : 2.9%us, 3.6%sy, 0.0%ni, 89.8%id, 0.0%wa, 1.5%hi, 0.0%si, 2.2%st
Mem: 30620324k total, 2384264k used, 28236060k free, 27308k buffers
Swap: 0k total, 0k used, 0k free, 190364k cached
Tricks
• Use standard instances
  • Balances network & CPU
• Use taskset and leave out 1 or 2 cores
• Result
  • Latencies improve
  • Throughput marginally improved
  • Less CPU going to system
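The taskset trick can also be done in-process. A minimal sketch, assuming Linux (the CPU-affinity calls are Linux-only) and that reserving the highest-numbered core is good enough for illustration; which core to leave free for softirq handling is workload dependent.

```python
import os

# In-process equivalent of `taskset`: pin this process off one core
# so network softirq handling gets a CPU to itself.
all_cpus = os.sched_getaffinity(0)      # CPUs we may currently run on
if len(all_cpus) > 1:
    reserved = {max(all_cpus)}          # core left free for softirqs
    os.sched_setaffinity(0, all_cpus - reserved)
print(sorted(os.sched_getaffinity(0)))
```
The same effect at the shell is `taskset -c 0-2 asd ...` on a 4-vCPU instance.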
Network Virtualization
DPDK
3x Improvement (Aug 15)
• 20 Aerospike nodes
• 1.2M write TPS, 94% < 4 ms latency
• 4.2M read TPS, 90% < 4 ms latency
• Changes
  • DPDK
  • NIC queue depth : 256 -> 16k
  • ??
• Takeaway
  • Don't blindly trust top, iostat
  • Keep pushing till you see a bottleneck (and resolve it if possible)
Live Migrations
Live Migrations : Implications
• Blackout period depends on workload
  • The higher the memory dirty rate, the longer the blackout
• Timeouts in application code will get triggered
• Affects clustering-based solutions (Aerospike, …)
  • Missing heartbeats
• Clock times of VMs jump
  • Implications for any code tightly dependent on clock time
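One concrete way to write code that survives the clock jump: compute timeouts from a monotonic clock rather than wall-clock time, as in this sketch.

```python
import time

# time.time() can jump forward when a VM is live migrated, spuriously
# firing any timeout computed from it. time.monotonic() is immune to
# such jumps, so deadlines based on it survive a migration.
def deadline_passed(start: float, timeout_s: float) -> bool:
    return time.monotonic() - start > timeout_s

start = time.monotonic()
print(deadline_passed(start, 60.0))  # just started, so not expired yet
```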
Live Migrations : Handling
• Write better code
• Scheduling policy offered by GCE
  • onHostMaintenance : Migrate/Terminate
  • automaticRestart : True/False
• Since June 2016 : live migration notification
  • 60 seconds prior intimation
  • Via metadata server
  • MIGRATE_ON_HOST_MAINTENANCE
  • SHUTDOWN_ON_HOST_MAINTENANCE
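Watching for that notification boils down to a hanging GET against the GCE metadata server's `maintenance-event` key; a sketch (the endpoint and header are from GCE's metadata API, the surrounding code is illustrative):

```python
import urllib.request

# Hanging GET: the metadata server holds the request open and responds
# when the value changes (e.g. to MIGRATE_ON_HOST_MAINTENANCE), giving
# roughly 60 seconds of warning before the migration blackout.
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event?wait_for_change=true")

def wait_for_maintenance() -> str:
    req = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req) as resp:  # blocks until a change
        return resp.read().decode()
```
On notification, the application can quiesce the node, e.g. stop treating missed heartbeats as node failure for the duration of the blackout.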
Local SSDs
• Similar to ephemeral SSDs in AWS (non-persistent)
• Not to be confused with persistent SSD (network-attached)
• Good cost alternative to RAM
• Can be attached to any instance type
• Spec
  • NVMe / SCSI options
  • Available in chunks of 375 GB
  • ~1 ms latency
  • 680K read IOPS
  • 360K write IOPS
Aerospike benchmark of Local SSD
• Summary : They are pretty good
Local SSD with Aerospike
• Use the shadow device configuration in Aerospike
  • All reads are from the local SSD
  • All writes (buffered) go to both the local SSD & persistent HDD/SSD (network-attached)
• Bcache is no longer recommended by Aerospike
  • Saw some kernel-level implementation bugs
  • Saw drive lockups in rare occurrences
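A shadow device setup like the one described might look as follows in aerospike.conf; device paths and namespace name are illustrative, and the second path on the `device` line is the shadow device.

```
namespace test {
    storage-engine device {
        # first path : local SSD (serves reads, receives writes)
        # second path: network-attached persistent disk (writes only,
        #              survives instance loss)
        device /dev/nvme0n1 /dev/sdb
        write-block-size 128K
    }
}
```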
(Diagram: Aerospike reads from the Local SSD; writes go to both the Local SSD and network storage)
Work in progress
• Aerospike on Docker containers in GCE
Thank You