pagerduty: one year of cassandra failures
TRANSCRIPT
2015-09-30
PagerDuty (simplified, circa early 2014)
ONE YEAR OF CASSANDRA FAILURES
[Architecture diagram. Components shown: Monitoring system, events.pagerduty.com, Enqueuer, Cassandra, Dequeuer, Event Processing, Notifier, XtraDB. Notification channels: Phone, SMS, Push, HTTP. Parties: PagerDuty, Customer.]
Background
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 1
• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s
Outage 1 - Foreshadowing
• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes
[Graph: Coordinator Read Latency (in ms, by host). Annotations: 6 seconds; ~25 ms.]
The Plan
• Trigger repair… with lots of people watching
• Use our load shedding strategies for any problems:
  • Proactively disable non-critical services
  • Disable thrift
Surprise!
• Cron triggers a different repair
• Plus a compaction for a large CF
[Graph: Outgoing Notification Backlog Size. Labeled levels: Normal, Bad, Horrible. Final annotation: :(]
[Graph: Cassandra Pending Tasks: ReadStage (by host). Annotation: Over 9000.]
Aftermath: The Investigation
• Huge investigation
• Silver lining: learned a lot
• Host metrics (CPU, network, etc.) fine most of the time
• Need to look at Cassandra metrics for leading indicators
Investigation Conclusion
• Under-provisioned (mainly CPU)
• No partial progress
Lessons
• Capacity planning
  • Important even with low volume
• Cassandra-specific monitoring
• Isolation
Lessons - Metrics For Cassandra
• Dropped messages (leading)
• Blocked flush writers (leading)
• GC behavior (leading)
• Pending tasks: ReadStage, ResponseStage, etc. (lagging)
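Most of the metrics listed above are reported by `nodetool tpstats`. The sketch below parses that output and flags the leading and lagging indicators named on the slide. The sample text, its numbers, and the helper names (`parse_tpstats`, `find_warnings`) are made up for illustration. The pending-task threshold of 5 is borrowed from the "Should be small, < 5" note in Outage 2. GC behavior is not part of this output and is not covered here.

```python
# Illustrative sample only; the numbers are invented, the layout follows
# the two tables that `nodetool tpstats` prints.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32      9120        4567890         0                 0
MutationStage                     0         2        1234567         0                 0
FlushWriter                       1         3           8123         1                57

Message type           Dropped
READ                       412
MUTATION                     0
"""


def parse_tpstats(text):
    """Return (pools, dropped) dictionaries parsed from tpstats-style text."""
    pools, dropped = {}, {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Pool":          # header of the thread pool table
            section = "pools"
        elif fields[0] == "Message":     # header of the dropped message table
            section = "dropped"
        elif section == "pools" and len(fields) >= 6:
            pools[fields[0]] = {
                "pending": int(fields[2]),
                "all_time_blocked": int(fields[5]),
            }
        elif section == "dropped" and len(fields) == 2:
            dropped[fields[0]] = int(fields[1])
    return pools, dropped


def find_warnings(pools, dropped, max_pending=5):
    """List warnings, leading indicators first, then lagging ones."""
    out = []
    for name, count in sorted(dropped.items()):
        if count > 0:
            out.append(f"leading: {count} dropped {name} messages")
    flush = pools.get("FlushWriter")
    # "All time blocked" is cumulative; a real check would alert on its growth.
    if flush and flush["all_time_blocked"] > 0:
        out.append(f"leading: flush writers blocked {flush['all_time_blocked']} times")
    for name, stats in sorted(pools.items()):
        if stats["pending"] > max_pending:
            out.append(f"lagging: {name} has {stats['pending']} pending tasks")
    return out


pools, dropped = parse_tpstats(SAMPLE_TPSTATS)
for warning in find_warnings(pools, dropped):
    print(warning)
```

In practice the counters would be sampled on a schedule and compared against the previous sample, since several of them only ever increase.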
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 2
Changes
• Isolated clusters for everyone
• New service: heaviest Cassandra user so far
• Upgrade Cassandra version
Application Logs
ERR [20141202-23:14:02.808] #222 -- queue: There was a problem running the workqueue task for SimpleQueueable[entityId=deliveryProcessor_XXXXXXX]
com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=##.###.##.1(##.###.##.1):9160, latency=24(24), attempts=1]InvalidRequestException(why:(String didn't validate.) [Artemis][MaterializedNotification][artemisAcceptedAt] failed validation)
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
cassandra-cli - “describe cluster” - Bad Output
[default@Artemis] describe cluster;
Cluster Information:
Name: prod-artemis
Snitch: org.apache.cassandra.locator.PropertyFileSnitch
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Schema versions:
52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
cassandra-cli - “describe cluster” - Good Output
[default@unknown] describe cluster;
Cluster Information:
Name: prod-artemis
Snitch: org.apache.cassandra.locator.PropertyFileSnitch
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Schema versions:
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
[Graph: Application-Measured Cassandra Call Latency (in ms, by CF). Annotation: 15 seconds.]
[Graph: Pending Tasks: MutationStage. Annotations: 22,000; should be small, < 5.]
Actions
17:01:21 disable thrift
17:02:08 kill repair
17:02:35 kill -9
[Graph: Cassandra Operations (cluster-wide, by CF). Marked events: disable thrift, kill repair, kill -9.]
Puzzle
• Why did one bad Cassandra node have such a huge effect?
Bad Coordinator
Timeout vs average request: 10,000 ms / 25 ms = 400
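A rough worked version of this ratio, to show why a single slow coordinator can dominate. The assumptions here are not from the talk: requests are spread evenly over 7 coordinators (the host count in the "describe cluster" output), and every request sent to the bad node runs for the full 10,000 ms timeout. The function names are illustrative.

```python
def mean_latency_ms(nodes, bad_nodes, normal_ms, timeout_ms):
    """Average request latency when requests are spread evenly over coordinators."""
    bad_fraction = bad_nodes / nodes
    return (1 - bad_fraction) * normal_ms + bad_fraction * timeout_ms


def time_share_on_bad(nodes, bad_nodes, normal_ms, timeout_ms):
    """Fraction of total client waiting time spent on the bad coordinators."""
    bad_fraction = bad_nodes / nodes
    total = mean_latency_ms(nodes, bad_nodes, normal_ms, timeout_ms)
    return bad_fraction * timeout_ms / total


# One timed-out request costs as much as this many normal requests.
print(10_000 / 25)

# Average latency with 1 bad coordinator out of 7 (assumed even spread).
print(round(mean_latency_ms(7, 1, 25, 10_000)))

# Share of client time spent waiting on the single bad coordinator.
print(round(time_share_on_bad(7, 1, 25, 10_000), 3))
```

Under these assumptions, one node in seven accounts for nearly all of the time clients spend waiting, so a fixed-size pool of client threads ends up mostly parked on the bad coordinator.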
Lessons
• Isolated clusters pay off
• How to do schema changes:
  1. describe cluster;
  2. <schema change for one CF>
  3. describe cluster;
• Monitor for schema disagreement
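A minimal sketch of a schema disagreement check, based only on the "describe cluster" output shown in the slides: more than one UUID under "Schema versions:" means the nodes disagree. The function names are illustrative, and how the output is captured (for example, by running cassandra-cli from a monitoring job) is left out. Lines that do not look like `<uuid>: [hosts]` are ignored.

```python
import re

# A schema version line looks like "<uuid>: [host, host, ...]".
SCHEMA_LINE = re.compile(r"^\s*([0-9a-f-]{36}):\s*\[(.*)\]\s*$")

BAD_OUTPUT = """\
Schema versions:
52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""

GOOD_OUTPUT = """\
Schema versions:
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""


def schema_versions(describe_output):
    """Map each schema version UUID to the list of hosts reporting it."""
    versions = {}
    for line in describe_output.splitlines():
        match = SCHEMA_LINE.match(line)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions[match.group(1)] = hosts
    return versions


def in_agreement(describe_output):
    """True only when exactly one schema version is reported."""
    return len(schema_versions(describe_output)) == 1


for label, output in (("bad", BAD_OUTPUT), ("good", GOOD_OUTPUT)):
    print(label, "agreement:", in_agreement(output))
```

The same check fits both uses in the lessons above: as a gate before and after each schema change, and as a periodic monitor.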
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 3
[Graph: Application-Measured Cassandra Call Latency (ms, by CF). Annotations: 8 seconds; Normal: ~25 ms.]
Cassandra Logs (on working hosts)
INFO [HintedHandoff:2] 2014-12-18 03:21:39,396 HintedHandOffManager.java (line 427) Timed out replaying hints to /##.###.##.6; aborting (9079 delivered)
commitlog Directory
ls -la /var/lib/cassandra/commitlog/
total 1015360
drwxr-xr-x 2 cassandra root 4096 2014-12-18 03:36 .
drwxr-xr-x 6 cassandra root 4096 2014-08-19 17:00 ..
-- SNIP --
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533553.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533554.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:34 CommitLog-2-1418873533555.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:40 CommitLog-2-1418873533556.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737850.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737851.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737852.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:39 CommitLog-2-1418873737853.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873800630.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873812840.log
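The listing above shows two root-owned segments, dated 2014-11-26, sitting among files owned by cassandra. A small check like the following could flag such files before they cause trouble. This is a sketch, not something from the talk: the function name is hypothetical, it relies on POSIX ownership information, and the demo runs against a temporary directory rather than a real commitlog directory.

```python
import os
import tempfile
from pathlib import Path


def files_not_owned_by(directory, expected_uid):
    """Return names of regular files in `directory` whose owner is not `expected_uid`."""
    foreign = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and entry.stat().st_uid != expected_uid:
            foreign.append(entry.name)
    return foreign


# Demo on a scratch directory so the sketch runs anywhere on a POSIX system.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("CommitLog-a.log", "CommitLog-b.log"):
        Path(scratch, name).write_bytes(b"")
    me = os.getuid()
    print(files_not_owned_by(scratch, me))      # files we just created
    print(files_not_owned_by(scratch, me + 1))  # pretend another owner is expected
```

Against a real node, the expected uid could be looked up with `pwd.getpwnam("cassandra").pw_uid` and the directory would be `/var/lib/cassandra/commitlog`, as in the listing.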
The Culprit
Nov 26 21:39:53 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10035-Data.db
Nov 26 21:40:12 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10037-Data.db
sstable2json
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
ERROR 14:11:08,067 Cannot open /var/lib/cassandra/data/system/peer_events/system-peer_events-ic-57; partitioner org.apache.cassandra.dht.RandomPartitioner does not match system partitioner org.apache.cassandra.dht.Murmur3Partitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.
sstable2json
export CASSANDRA_CONF=/etc/cassandra
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
Exception in thread "COMMIT-LOG-ALLOCATOR" FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:84)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:251)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:105)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log (Permission denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)
sstable2json
export CASSANDRA_CONF=/etc/cassandra
sudo sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
Success!
Cassandra Thread Dump
"MutationStage:30" daemon prio=10 tid=0x00007fec64ed9000 nid=0x1fe3 waiting on condition [0x00007fe3b56da000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000061406ffe8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:93)
at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:191)
at org.apache.cassandra.db.Table.apply(Table.java:375)
at org.apache.cassandra.db.Table.apply(Table.java:354)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:283)
at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Cassandra Thread Dump
"COMMIT-LOG-WRITER" prio=10 tid=0x00007fec64293800 nid=0x1f8b waiting on condition [0x00007fec687d0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000061417d0d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:126)
at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:305)
at org.apache.cassandra.db.commitlog.CommitLog.access$100(CommitLog.java:44)
at org.apache.cassandra.db.commitlog.CommitLog$LogRecordAdder.run(CommitLog.java:356)
at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService$1.runMayThrow(PeriodicCommitLogExecutorService.java:46)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Cassandra Logs - Commit Log Allocator
Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main] FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1442099692080.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:197)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Rename from /var/lib/cassandra/commitlog/CommitLog-2-1418868735344.log to 1418873812840 failed
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:113)
... 4 more
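One reading of the thread dumps and the allocator exception above: the MutationStage threads are parked in `LinkedBlockingQueue.put`, the commit log writer is parked in `LinkedBlockingQueue.take` waiting for a new segment, and the allocator thread that would supply it has thrown. The sketch below is not Cassandra code. It only illustrates the general pattern, using Python's `queue.Queue` as a stand-in: once nothing drains a bounded queue, producers fill it and then stall. The function name, capacity, and timeout are arbitrary, and the timeout exists only so the demo terminates.

```python
import queue


def simulate_stalled_pipeline(capacity=4, writes=10, timeout_s=0.05):
    """Push `writes` items into a bounded queue that nobody drains.

    Returns (accepted, stalled): how many puts succeeded before the queue
    filled, and how many gave up after waiting `timeout_s`.
    """
    pending = queue.Queue(maxsize=capacity)  # stand-in for a bounded work queue
    accepted = 0
    stalled = 0
    for item in range(writes):
        try:
            # Without a timeout this call would block forever once the queue is full.
            pending.put(item, timeout=timeout_s)
            accepted += 1
        except queue.Full:
            stalled += 1
    return accepted, stalled


accepted, stalled = simulate_stalled_pipeline()
print(f"accepted={accepted} stalled={stalled}")
```

In the dumps the blocked calls have no timeout, which is consistent with the write path staying stuck until the process was restarted.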
Lessons
• Be careful what habits you develop
• Tools should be as isolated & focused as possible
• Process startup code can create time bombs