pagerduty: one year of cassandra failures

20150923 One Year of Cassandra Failures [email protected] #CassandraSummit

2015−09−23

One Year of Cassandra Failures [email protected] #CassandraSummit

2015-09-30

PagerDuty (simplified, circa early 2014)

ONE YEAR OF CASSANDRA FAILURES

[Architecture diagram. Components shown: customer monitoring system, events.pagerduty.com, Cassandra, Enqueuer, Dequeuer, Event Processing, Notifier, XtraDB. Notification channels: Phone, SMS, Email, Push, HTTP.]

Span the WAN? Yes you can!

Tomorrow at 9:50 AM

Paul Rechsteiner

Outage 1

“The Backlog”

Background

• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s

Outage 1 - Foreshadowing

• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes

[Graph: Coordinator Read Latency (in ms, by host). Annotated values: ~25 ms and 6 seconds.]

The Next Day…

The Plan

• Trigger repair… with lots of people watching
• Use our load shedding strategies for any problems:
  • Proactively disable non-critical services
  • Disable thrift

Surprise!

• Cron triggers a different repair
• Plus a compaction for a large CF

[Graph: Outgoing Notification Backlog Size. Levels marked: Normal, Bad, Horrible. Annotation: :(]

[Graph: Cassandra Pending Tasks: ReadStage (by host). Annotation: Over 9000.]

[Graph: Cassandra CPU (by host). Annotation: 100%.]

Factory Reset

Success… kind of

Aftermath: The Investigation

• Huge investigation
• Silver lining: learned a lot
• Host metrics (CPU, network, etc.) fine most of the time
• Need to look at Cassandra metrics for leading indicators

Investigation Conclusion

• Under-provisioned (mainly CPU)
• No partial progress

Lessons

• Capacity planning
  • Important even with low volume
• Cassandra-specific monitoring
• Isolation

Lessons - Metrics For Cassandra

• Dropped messages (leading)
• Blocked flush writers (leading)
• GC behavior (leading)
• Pending tasks: ReadStage, ResponseStage, etc. (lagging)
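A minimal sketch of how the indicators listed above might be pulled out of thread-pool statistics. It assumes text shaped like `nodetool tpstats` output; the sample text, the function names, and the pending-task threshold of 5 (borrowed from the "Should be small, < 5" annotation later in the talk) are illustrative assumptions, and GC behavior is not covered here.

```python
# Illustrative text modeled on `nodetool tpstats` output. The exact layout
# varies by Cassandra version, so the column positions here are an assumption.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32      9001        1234567         0                 0
MutationStage                     0         0        7654321         0                 0
FlushWriter                       1         2            345         1                17

Message type           Dropped
READ                       120
MUTATION                     0
"""


def parse_tpstats(text):
    """Return (pools, dropped): per-pool counters and dropped-message counts."""
    pools, dropped = {}, {}
    in_dropped = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Pool":        # header of the thread-pool table
            in_dropped = False
            continue
        if fields[0] == "Message":     # header of the dropped-message table
            in_dropped = True
            continue
        if in_dropped:
            if len(fields) == 2 and fields[1].isdigit():
                dropped[fields[0]] = int(fields[1])
        elif len(fields) == 6 and all(f.isdigit() for f in fields[1:]):
            pools[fields[0]] = {
                "active": int(fields[1]),
                "pending": int(fields[2]),
                "all_time_blocked": int(fields[5]),
            }
    return pools, dropped


def danger_signals(pools, dropped, max_pending=5):
    """List warnings for dropped messages, blocked flush writers, pending tasks."""
    warnings = []
    for name, count in sorted(dropped.items()):
        if count > 0:
            warnings.append(f"dropped {name}: {count}")
    flush = pools.get("FlushWriter")
    if flush and flush["all_time_blocked"] > 0:
        warnings.append(f"blocked flush writers (all time): {flush['all_time_blocked']}")
    for name, stats in sorted(pools.items()):
        if stats["pending"] > max_pending:
            warnings.append(f"pending {name}: {stats['pending']}")
    return warnings


pools, dropped = parse_tpstats(SAMPLE_TPSTATS)
for warning in danger_signals(pools, dropped):
    print(warning)
```

In practice the text would come from each node rather than a constant, and the thresholds would need tuning per cluster.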

Outage 2

“Aliens”

Changes

• Isolated clusters for everyone
• New service: heaviest Cassandra user so far
• Upgrade Cassandra version

Application Logs

ERR [20141202-23:14:02.808] #222 -- queue: There was a problem running the workqueue task for SimpleQueueable[entityId=deliveryProcessor_XXXXXXX]
com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=##.###.##.1(##.###.##.1):9160, latency=24(24), attempts=1]InvalidRequestException(why:(String didn't validate.) [Artemis][MaterializedNotification][artemisAcceptedAt] failed validation)
    at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
    at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)

“Cassandra Danger Metrics” (Partial)

cassandra-cli - “describe cluster” - Bad Output

[default@Artemis] describe cluster;
Cluster Information:
   Name: prod-artemis
   Snitch: org.apache.cassandra.locator.PropertyFileSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
      52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]

cassandra-cli - “describe cluster” - Good Output

[default@unknown] describe cluster;
Cluster Information:
   Name: prod-artemis
   Snitch: org.apache.cassandra.locator.PropertyFileSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]

[Graph: Notifications Sent]

[Graph: Application-Measured Cassandra Call Latency (in ms, by CF). Annotation: 15 seconds.]

[Graph: Pending Tasks: MutationStage. Annotations: 22,000; should be small, < 5.]

Actions

17:01:21 disable thrift
17:02:08 kill repair
17:02:35 kill dash nine

[Graph: Cassandra Operations (cluster-wide, by CF). Markers: disable thrift, kill repair, kill -9.]

Puzzle

• Why did one bad Cassandra node have such a huge effect?

Bad Coordinator

Timeout vs average request: 10,000 ms / 25 ms = 400
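One way to read the ratio above is as a worked calculation. The sketch below is an assumption-laden illustration, not the speaker's analysis: it assumes seven nodes (the host count in the "describe cluster" output), coordinators chosen uniformly at random, every request through the bad coordinator running to the full timeout, and a fixed pool of blocking client workers.

```python
TIMEOUT_MS = 10_000   # request timeout, from the slide
NORMAL_MS = 25        # typical request latency, from the slide
NODES = 7             # assumption: hosts listed by "describe cluster"
BAD_NODES = 1         # assumption: a single misbehaving coordinator

# One timed-out request costs as much worker time as this many normal ones.
slowdown = TIMEOUT_MS / NORMAL_MS


def mean_latency_ms(nodes, bad_nodes, normal_ms, timeout_ms):
    """Mean latency when bad coordinators always time out and others are normal."""
    bad_fraction = bad_nodes / nodes
    return (1 - bad_fraction) * normal_ms + bad_fraction * timeout_ms


mean_ms = mean_latency_ms(NODES, BAD_NODES, NORMAL_MS, TIMEOUT_MS)

# With a fixed number of blocking workers, throughput scales inversely
# with mean latency.
throughput_fraction = NORMAL_MS / mean_ms

print(f"slowdown per bad request: {slowdown:.0f}x")
print(f"mean latency: {mean_ms:.0f} ms")
print(f"throughput vs normal: {throughput_fraction:.1%}")
```

Under these assumptions, one bad node in seven pushes mean latency to about 1,450 ms and leaves blocking clients with under 2% of their normal throughput.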

What Happened To Cassandra?

Lessons

• Isolated clusters pay off
• How to do schema changes:
  1. describe cluster;
  2. <schema change for one CF>
  3. describe cluster;
• Monitor for schema disagreement
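The "monitor for schema disagreement" lesson can be sketched as a check over captured "describe cluster" output. The parser below is an illustrative assumption based only on the output format shown in this talk; the function names are hypothetical, and any line under "Schema versions:" that does not start with a UUID is ignored.

```python
import re

# Matches lines such as "52eee0b6-...: [host1, host2]" under "Schema versions:".
SCHEMA_LINE = re.compile(r"^\s*([0-9a-f-]{36}):\s*\[(.*)\]\s*$")

# The "bad output" from the talk, with host addresses masked as on the slide.
BAD_OUTPUT = """\
Cluster Information:
   Name: prod-artemis
   Snitch: org.apache.cassandra.locator.PropertyFileSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
      52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""


def schema_versions(describe_output):
    """Map each schema version UUID to the list of hosts reporting it."""
    versions = {}
    in_section = False
    for line in describe_output.splitlines():
        if line.strip() == "Schema versions:":
            in_section = True
            continue
        if in_section:
            match = SCHEMA_LINE.match(line)
            if match:
                hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
                versions[match.group(1)] = hosts
    return versions


def has_schema_disagreement(describe_output):
    """True when the nodes report more than one schema version."""
    return len(schema_versions(describe_output)) > 1


for version, hosts in schema_versions(BAD_OUTPUT).items():
    print(version, len(hosts), "host(s)")
print("disagreement:", has_schema_disagreement(BAD_OUTPUT))
```

Running the same check before and after each schema change mirrors the three-step procedure in the list above.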

Outage 3

“Human Error”

[Graph: Application-Measured Cassandra Call Latency (ms, by CF). Annotations: 8 seconds; normal: ~25 ms.]

“Cassandra Danger Metrics” (partial)

Cassandra Logs (on working hosts)

INFO [HintedHandoff:2] 2014-12-18 03:21:39,396 HintedHandOffManager.java (line 427) Timed out replaying hints to /##.###.##.6; aborting (9079 delivered)

commitlog Directory

ls -la /var/lib/cassandra/commitlog/
total 1015360
drwxr-xr-x 2 cassandra root 4096 2014-12-18 03:36 .
drwxr-xr-x 6 cassandra root 4096 2014-08-19 17:00 ..
-- SNIP --
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533553.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533554.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:34 CommitLog-2-1418873533555.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:40 CommitLog-2-1418873533556.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737850.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737851.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737852.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:39 CommitLog-2-1418873737853.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873800630.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873812840.log
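The root-owned segments in the listing above are the kind of thing a periodic check could flag before they cause trouble. A minimal sketch, assuming a POSIX host; the function names are hypothetical and the talk does not say such a check was deployed.

```python
import os
import pwd


def uid_of(user):
    """Look up the numeric UID for a user name (raises KeyError if unknown)."""
    return pwd.getpwnam(user).pw_uid


def files_not_owned_by(directory, expected_uid):
    """Return the sorted names of entries in `directory` owned by another UID."""
    offenders = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # lstat inspects the entry itself rather than following symlinks.
        if os.lstat(path).st_uid != expected_uid:
            offenders.append(entry)
    return offenders
```

With the directory and user from the listing, `files_not_owned_by("/var/lib/cassandra/commitlog", uid_of("cassandra"))` would be expected to report the two root-owned files.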

The Culprit

Nov 26 21:39:53 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10035-Data.db
Nov 26 21:40:12 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10037-Data.db

sstable2json

sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db

ERROR 14:11:08,067 Cannot open /var/lib/cassandra/data/system/peer_events/system-peer_events-ic-57; partitioner org.apache.cassandra.dht.RandomPartitioner does not match system partitioner org.apache.cassandra.dht.Murmur3Partitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.

export CASSANDRA_CONF=/etc/cassandra
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db

Exception in thread "COMMIT-LOG-ALLOCATOR" FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
    at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:84)
    at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:251)
    at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
    at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:105)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log (Permission denied)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)

export CASSANDRA_CONF=/etc/cassandra
sudo sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db

Success!

Cassandra Thread Dump

"MutationStage:30" daemon prio=10 tid=0x00007fec64ed9000 nid=0x1fe3 waiting on condition [0x00007fe3b56da000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000061406ffe8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
    at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:93)
    at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:191)
    at org.apache.cassandra.db.Table.apply(Table.java:375)
    at org.apache.cassandra.db.Table.apply(Table.java:354)
    at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:283)
    at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Cassandra Thread Dump

"COMMIT-LOG-WRITER" prio=10 tid=0x00007fec64293800 nid=0x1f8b waiting on condition [0x00007fec687d0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000061417d0d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:126)
    at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:305)
    at org.apache.cassandra.db.commitlog.CommitLog.access$100(CommitLog.java:44)
    at org.apache.cassandra.db.commitlog.CommitLog$LogRecordAdder.run(CommitLog.java:356)
    at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService$1.runMayThrow(PeriodicCommitLogExecutorService.java:46)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:745)

Cassandra Logs - Commit Log Allocator

Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main] FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1442099692080.log
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
    at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:197)
    at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Rename from /var/lib/cassandra/commitlog/CommitLog-2-1418868735344.log to 1418873812840 failed
    at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:113)
    ... 4 more

Lessons

• Be careful what habits you develop
• Tools should be as isolated & focused as possible
• Process startup code can create time bombs
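One possible guard in the spirit of these lessons, offered only as a sketch: a wrapper around an offline tool could refuse to start unless it runs as the service's own user, so that shared startup code cannot leave files owned by root. The function and exception names are hypothetical, the approach is an assumption rather than something the talk describes, and it relies on the POSIX-only `pwd` module.

```python
import os
import pwd


class WrongUserError(RuntimeError):
    """Raised when a tool is started under an unexpected user account."""


def require_user(expected_user, effective_uid=None):
    """Raise WrongUserError unless the process runs as `expected_user`.

    `effective_uid` defaults to the current effective UID; it is a
    parameter only so the check can be exercised without switching users.
    """
    if effective_uid is None:
        effective_uid = os.geteuid()
    expected_uid = pwd.getpwnam(expected_user).pw_uid
    if effective_uid != expected_uid:
        raise WrongUserError(
            f"refusing to run as uid {effective_uid}; expected user "
            f"{expected_user!r} (uid {expected_uid})"
        )
```

A wrapper script would call `require_user("cassandra")` (user name assumed from the directory listing earlier in the talk) before launching the tool.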

Concluding Thoughts

[email protected]

Thank you.