Ship It!!! Coding Reliable Couchbase Applications for Production: Couchbase Connect 2015
Matt Ingenthron, Couchbase
Michael Nitschinger, Couchbase
©2015 Couchbase Inc. 2
Warning
In this session you will hear stories of lost packets, corrupted data, confused administrators sending terabytes of logs to even more confused developers and many other insanely scary things. If the thought of a bit flip frightens you because you have only parity checking and no error correction, this session may not be for you.
Computers were harmed while preparing this talk.
If what you typically type after “catch” involves only the word “log”, this session may help you. If you hope to learn how an HTTP 503 can be useful, this presentation is for you.
Question One
System: Virtual machines at a public cloud provider. Node.js application.
Observation: Under load testing, saw high latencies (>100ms).
Causes?
A) Bugs in Couchbase.
B) The system software wasn’t well matched and tested.
C) Running too many Node.js processes for the number of OS CPU cores.
D) It’s the “cosmic rays”, man.
Root cause: The Ethernet device driver in the Linux distro didn’t work well with the virtualized hardware interface, causing high latencies.
Solution: Swap out the Linux OS distribution. Went from one that was less common but had better user tooling to one of the most common ones in production deployments.
Question Two
System: Private virtual machines on a private cloud. Strong monitoring and control of the environment.
Observation: As daily load would ramp, latencies would rise and failure to meet the SLA would ensue.
Causes?
A) Bugs in Couchbase.
B) JVM garbage collection pauses.
C) Virtualization is overprovisioned.
D) The NSA wiretap program was slowing things down.
Root cause: Memory resources were overprovisioned on the private cloud.
Solution: Adjust the memory allocation within the environment. Also found that the number of Tomcat workers was set rather unusually: thousands of worker processes for systems with 8 virtual cores.
Question Three
System: Database running on physical hardware, applications on VMs across the network. SLA need was 50ms or less.
Observation: Regular heartbeat of high latency in the 300-400ms range.
Causes?
A) Bugs in Couchbase.
B) Misconfigured load balancer sending all traffic to one app JVM.
C) Monitoring system interrogating the kernel, causing lock contention.
D) Standing waves from running a 50Hz power supply under 60Hz.
Root cause: The monitoring system was inspecting kernel counters on a regular basis and was somehow hitting a hot lock.
Solution: Disable that one poller in the monitor. There were no other apps in that environment that had the same latency requirements, so it was assumed that the environment was clean.
Define & Measure!
Develop
Test
Measure
Evaluate
Requirements
If it’s not defined, you can’t measure it.
SLAs: throughput at max. latency
Ideally from the get-go:
Error Detection
Error Recovery
Error Mitigation
Not just unit testing.
Stress Tests
Load Tests
Failure Tests
You can’t manage what you don’t measure.
Evaluate, rinse, repeat.
Service Level Required
100% Uptime not easily achievable
For instance, is it 100% available if 50% of your users are leaving because it’s too slow?
The question must always be:
“At max latency, what throughput do I get?”
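One way to answer that question from load-test data is sketched below. It is a minimal illustration, not something shown in the talk: the shape of `runs`, the choice of the 99th percentile, and the sample numbers are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def max_throughput_within_sla(runs, sla_ms, pct=99.0):
    """Return the highest throughput whose pct-th percentile latency meets the SLA.

    `runs` maps a measured throughput (ops/s) to the latency samples (ms)
    recorded at that load level. Returns None if no run meets the SLA.
    """
    passing = [tput for tput, samples in runs.items()
               if percentile(samples, pct) <= sla_ms]
    return max(passing) if passing else None


# Made-up numbers for illustration only.
example_runs = {
    1000: [2, 3, 3, 4, 5],
    5000: [3, 4, 6, 9, 20],
    9000: [5, 12, 40, 90, 300],
}
```

Calling `max_throughput_within_sla(example_runs, sla_ms=50)` picks the highest load level whose tail latency still fits the 50ms budget.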
Avoid the Coffin Corner
[Figure: coffin corner diagram with axes Height and Speed. Source: http://de.wikipedia.org/wiki/Coffin_Corner#/media/File:CoffinCorner.png]
Neither airplanes nor your applications like the extremes
Resource contention and overload conditions result in high latency
Keep some headroom to fly smoothly
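To see why headroom matters, here is a rough sketch that assumes a textbook M/M/1 queue (one server, random arrivals). That model is an assumption made for illustration, not something the talk states, and real systems will behave differently in detail. The function names are hypothetical.

```python
def mm1_mean_latency_ms(service_rate, arrival_rate):
    """Mean time in system (ms) for an M/M/1 queue; rates are in ops/s."""
    if arrival_rate >= service_rate:
        return float("inf")  # overloaded: the queue grows without bound
    return 1000.0 / (service_rate - arrival_rate)


def utilization_table(service_rate, utilizations=(0.5, 0.8, 0.9, 0.99)):
    """Pair each utilization level with its modeled mean latency in ms."""
    return [(u, mm1_mean_latency_ms(service_rate, u * service_rate))
            for u in utilizations]
```

Under this model the mean latency climbs slowly at moderate utilization and then steeply as the arrival rate approaches the service rate, which is the application-side version of the coffin corner.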
Prepare for bad weather
https://stephenthepilot.files.wordpress.com/2015/03/aircraft_deicing.jpg
with Error Detection
System Monitors
Periodic Checking
Watchdogs
Voting
Auditing
with Error Recovery
Timeouts
Failover
Retries
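As an illustration of the retry part of error recovery, the sketch below retries an operation with exponential backoff and jitter. The helper name `retry`, its parameters, and the choice of which exceptions count as retryable are assumptions for this example, not an SDK API.

```python
import random
import time


def retry(operation, attempts=3, base_delay=0.05, max_delay=1.0,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Waits with exponential backoff plus jitter between attempts and
    re-raises the last error once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Bounding the number of attempts keeps retries from adding load to a system that is already struggling.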
with Error Mitigation
Intelligent Data Structures
Failing Fast
Circuit Breakers
Backpressure
Timeouts
Are your last resort when calling external resources.
So: always use them.
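A minimal sketch of putting a timeout around a call to an external resource, using Python's standard `concurrent.futures`. The helper `call_with_timeout` and the stand-in `slow_lookup` are hypothetical names for illustration. Real client libraries usually offer their own timeout settings, which are preferable where they exist.

```python
import concurrent.futures
import time


def call_with_timeout(executor, func, timeout_s, *args):
    """Run `func(*args)` on `executor`, waiting at most `timeout_s` seconds.

    Raises concurrent.futures.TimeoutError if no result arrives in time.
    The worker thread is not interrupted; only the caller stops waiting.
    """
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only helps if the task has not started yet
        raise


def slow_lookup(delay_s):
    """Hypothetical stand-in for a call to an external resource."""
    time.sleep(delay_s)
    return "value"
```

The caller gets control back after `timeout_s` even if the underlying call hangs, so it can fail over, retry, or report an error.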
Circuit Breakers
Monitor traffic
Open if errors happen
Latency, throughput, wrong results
Close in a controlled fashion
Expose metrics
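The behaviour listed above can be sketched as a small class. This is a simplified, single-threaded illustration under assumed names (`CircuitBreaker`, `CircuitOpenError`); it only counts raised exceptions as errors and does not track latency or throughput.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Minimal breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        # Exposed metrics: counters a monitoring system could scrape.
        self.metrics = {"success": 0, "failure": 0, "rejected": 0}

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                self.metrics["rejected"] += 1
                raise CircuitOpenError("circuit open, failing fast")
            # Cooldown elapsed: let a trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.metrics["failure"] += 1
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.metrics["success"] += 1
        self.failures = 0
        self.opened_at = None  # close again after a successful call
        return result
```

While the breaker is open, callers fail fast instead of piling more requests onto a backend that is already in trouble.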
Backpressure
Allows for coordinated flow control under stress conditions
Is used to shed load and provide a partially good experience
Source: http://mechanical-sympathy.blogspot.co.at/2011/10/smart-batching.html
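A minimal sketch of load shedding with a bounded queue. The class name `BoundedIntake` and the use of HTTP-style status codes (202 for accepted, 503 for rejected) are assumptions chosen for illustration; this is one place where an HTTP 503 is useful, because it tells the caller to back off.

```python
import queue


class BoundedIntake:
    """Accept work only while a bounded queue has room; otherwise shed it."""

    def __init__(self, capacity):
        self.pending = queue.Queue(maxsize=capacity)

    def submit(self, request):
        """Return an HTTP-style status: 202 if queued, 503 if shedding load."""
        try:
            self.pending.put_nowait(request)
        except queue.Full:
            return 503  # tell the caller to back off and retry later
        return 202

    def take(self):
        """Hand the oldest queued request to a worker (frees one slot)."""
        return self.pending.get_nowait()
```

Requests that are accepted still get served within bounded queueing delay, while the rest are rejected quickly instead of timing out.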
Benchmarking
Benchmarks assert expectations while tests verify correctness
Like with statistics, almost always wrong and biased
Two, no, three hard problems in computer science: Cache Invalidation, Naming Things, Benchmarking
The appropriate Workload: Concurrency, Think Time
The right Environment: Hardware, OS, external effects
The proper Tool: Measure NOOPs; be aware of GC, Coordinated Omission, ...
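Coordinated omission can be shown with a small simulation. The sketch below is an assumption-laden illustration, not a real benchmarking tool: a single-threaded load generator intends to send one request every `interval_ms`, and one slow response delays the requests behind it. The function name `simulate` and the sample numbers are hypothetical.

```python
def simulate(service_times_ms, interval_ms):
    """Replay a single-threaded load generator against given service times.

    Requests are meant to start every `interval_ms`. Returns two lists:
    latencies measured from the actual send time (what a naive tool records)
    and latencies measured from the intended start time (corrected).
    """
    naive, corrected = [], []
    clock = 0.0
    for i, service in enumerate(service_times_ms):
        intended_start = i * interval_ms
        actual_start = max(clock, intended_start)  # blocked behind the previous call
        finish = actual_start + service
        naive.append(finish - actual_start)
        corrected.append(finish - intended_start)
        clock = finish
    return naive, corrected
```

With service times of 1, 100, 1 and 1 ms at a 10 ms interval, the naive measurement reports a single slow request, while the corrected one shows that the two requests queued behind it were slow as well.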
And the industry?
Yahoo! Cloud Serving Benchmark (YCSB)
Industry standard
Makes it easy to compare solutions
Be aware of the (many) pitfalls!
Pioneering a new fork: https://github.com/YCSB/YCSB
Maintained NoSQL versions
Coordinated Omission fixes
...
Java Microbenchmark Harness (JMH): http://openjdk.java.net/projects/code-tools/jmh/
http://shipilev.net/talks/jvmls-July2013-benchmarking.pdf
Load & Stress Testing
Load Testing
Determine behaviour during normal traffic
Stress Testing
Traffic heavily increased (to the “Coffin Corner”)
Explicitly test edge cases
Knowing where and how it breaks is important
Failure Testing
Test specific failure cases
Node failures
Netsplits
Firewall issues (dropped packets, closed sockets)
Failures will happen; better to prepare for them early.
http://www.bloomberg.com/ss/09/04/0427_mdea_awards/image/002_lifepak15monitorde_220a.jpg
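Some failure cases can be rehearsed in application tests by injecting faults in front of the client. The wrapper below is a hypothetical sketch: `FlakyClient` is not part of any Couchbase SDK, and `client` is assumed to be any object with a `get(key)` method (a plain dict works for a demo).

```python
import random


class FlakyClient:
    """Wraps a client object and makes a fraction of its `get` calls fail."""

    def __init__(self, client, failure_rate, rng=None):
        self.client = client
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def get(self, key):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure for key %r" % key)
        return self.client.get(key)
```

Running the application's own error handling against such a wrapper shows whether timeouts, retries and fallbacks behave as intended before a real node failure or netsplit does.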
Tools of the trade
Run tools to validate a setup with a reasonably known workload.
libcouchbase’s cbc pillowfight
Java’s RoadRunner
.NET’s MeepMeep
Isolate performance statistics at different layers.
libcouchbase and Java SDKs have performance profiling abilities
Couchbase has cbstats timings