webinar: diagnosing apache cassandra problems in production

37
©2013 DataStax Confidential. Do not distribute without consent. Jon Haddad, Technical Evangelist @rustyrazorblade Diagnosing Problems in Production 1

Upload: planet-cassandra

Post on 02-Jul-2015

1.255 views

Category:

Technology


2 download

DESCRIPTION

This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course to basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.

TRANSCRIPT

Page 1: Webinar: Diagnosing Apache Cassandra Problems in Production

©2013 DataStax Confidential. Do not distribute without consent.

Jon Haddad, Technical Evangelist @rustyrazorblade

Diagnosing Problems in Production

1

Page 2: Webinar: Diagnosing Apache Cassandra Problems in Production

First Step: Preparation

Page 3: Webinar: Diagnosing Apache Cassandra Problems in Production

DataStax OpsCenter•Will help with 90% of problems you

encounter • Should be first place you look when

there's an issue • Community version is free • Enterprise version has additional

features

Page 4: Webinar: Diagnosing Apache Cassandra Problems in Production

Server Monitoring & Alerts•Monit • monitor processes • monitor disk usage • send alerts

•Munin / collectd • system perf statistics

•Nagios / Icinga • Various 3rd party services • Use whatever works for

you

Page 5: Webinar: Diagnosing Apache Cassandra Problems in Production

Application Metrics• Statsd / Graphite • Grafana • Gather constant metrics from

your application •Measure anything & everything •Microtimers, counters • Graph events • user signup • error rates

• Cassandra Metrics Integration • jmxtrans

Page 6: Webinar: Diagnosing Apache Cassandra Problems in Production

Log Aggregation•Hosted - Splunk, Loggly • OSS - Logstash + Kibana, Greylog •Many more… • For best results all logs should be

aggregated here • Oh yeah, and log your errors.

Page 7: Webinar: Diagnosing Apache Cassandra Problems in Production

Gotchas

Page 8: Webinar: Diagnosing Apache Cassandra Problems in Production

Incorrect Server Times• Everything is written with a timestamp • Last write wins • Usually supplied by coordinator • Can also be supplied by client •What if your timestamps are wrong

because your clocks are off? • Always install ntpd!

server time: 10

server time: 20

INSERTreal time: 12

DELETEreal time: 15

insert:20

delete:10

Page 9: Webinar: Diagnosing Apache Cassandra Problems in Production

Tombstones• Tombstones are a marker that data

no longer exists • Tombstones have a timestamp just

like normal data • They say "at time X, this no longer

exists"

Page 10: Webinar: Diagnosing Apache Cassandra Problems in Production

Tombstone Hell• Queries on partitions with a lot of tombstones require a lot of filtering • This can be reaaaaaaally slow • Consider: • 100,000 rows in a partition • 99,999 are tombstones • How long to get a single row?

• Cassandra is not a queue!

read 99,999 tombstones

finally get the right data

Page 11: Webinar: Diagnosing Apache Cassandra Problems in Production

Not using a Snitch• Snitch lets us distribute data in a fault tolerant way • Changing this with a large cluster is time

consuming • Dynamic Snitching • use the fastest replica for reads

• RackInferring (uses IP to pick replicas) • DC aware • PropertyFileSnitch (cassandra-topology.properties) • EC2Snitch & EC2MultiRegion • GoogleCloudSnitch • GossipingPropertyFileSnitch (recommended)

Page 12: Webinar: Diagnosing Apache Cassandra Problems in Production

Version Mismatch• SSTable format changed between

versions, making streaming incompatible • Version mismatch can break bootstrap,

repair, and decommission • Introducing new nodes? Stick w/ the

same version • Upgrade nodes in place • One at a time • One rack / AZ at a time (requires proper snitch)

Page 13: Webinar: Diagnosing Apache Cassandra Problems in Production

Disk Space not Reclaimed•When you add new nodes, data is

streamed from existing nodes • … but it's not deleted from them after • You need to run a nodetool cleanup • Otherwise you'll run out of space just by

adding nodes

Page 14: Webinar: Diagnosing Apache Cassandra Problems in Production

Using Shared Storage• Single point of failure •High latency • Expensive • Performance is about latency • Can increase throughput with more

disks • Avoid EBS, SAN, NAS

Page 15: Webinar: Diagnosing Apache Cassandra Problems in Production

Compaction• Compaction merges SSTables • Too much compaction? • Opscenter provides insight into compaction

cluster wide • nodetool • compactionhistory • getcompactionthroughput

• Leveled vs Size Tiered • Leveled on SSD + Read Heavy • Size tiered on Spinning rust • Size tiered is great for write heavy time series workloads

Page 16: Webinar: Diagnosing Apache Cassandra Problems in Production

Diagnostic Tools

Page 17: Webinar: Diagnosing Apache Cassandra Problems in Production

htop• Process overview - nicer than top

Page 18: Webinar: Diagnosing Apache Cassandra Problems in Production

iostat• Disk stats • Queue size, wait times

• Ignore %util

Page 19: Webinar: Diagnosing Apache Cassandra Problems in Production

vmstat• virtual memory statistics • Am I swapping? • Reports at an interval, to an optional count

Page 20: Webinar: Diagnosing Apache Cassandra Problems in Production

dstat• Flexible look at network, CPU, memory, disk

Page 21: Webinar: Diagnosing Apache Cassandra Problems in Production

strace•What is my process doing? • See all system calls • Filterable with -e • Can attach to running

processes

Page 22: Webinar: Diagnosing Apache Cassandra Problems in Production

tcpdump•Watch network traffic

Page 23: Webinar: Diagnosing Apache Cassandra Problems in Production

nodetool tpstats•What's blocked? •MemtableFlushWriter? - Slow

disks! • also leads to GC issues

• Dropped mutations? • need repair!

Page 24: Webinar: Diagnosing Apache Cassandra Problems in Production

Histograms• proxyhistograms • High level read and write times • Includes network latency

• cfhistograms <keyspace> <table> • reports stats for single table on a single

node • Used to identify tables with

performance problems

Page 25: Webinar: Diagnosing Apache Cassandra Problems in Production

Query Tracing

Page 26: Webinar: Diagnosing Apache Cassandra Problems in Production

JVM Garbage Collection

Page 27: Webinar: Diagnosing Apache Cassandra Problems in Production

JVM GC Overview•What is garbage collection? • Manual vs automatic memory management

• Generational garbage collection (ParNew & CMS) • New Generation • Old Generation

Page 28: Webinar: Diagnosing Apache Cassandra Problems in Production

New Generation•New objects are created in the new gen •Minor GC • Occurs when new gen fills up • Stop the world • Dead objects are removed • Live objects are promoted into survivor (S0 & S1) then old gen • Removing objects is fast, promoting objects is slow

Page 29: Webinar: Diagnosing Apache Cassandra Problems in Production

Old Generation• Objects are promoted to new gen from old gen •Major GC • Old generations fills up some percentage. • Mostly concurrent • 2 short stop the world pauses

• Full GC • Occurs when old gen fills up or objects can’t be promoted • Stop the world • Collects all generations • These are bad!

Page 30: Webinar: Diagnosing Apache Cassandra Problems in Production

Problem 1: Early Promotion• Short lived objects being promoted into old gen • Lots of minor GCs • Read heavy workloads on SSD • Results in frequent full GC

new gen old gen (full of short lived objects)

early promotion

fills up quickly

Page 31: Webinar: Diagnosing Apache Cassandra Problems in Production

Problem 2: Long Minor GC•New gen too big? • Remember: promoting objects is slow! •Huge new gen = potentially a lot of promotion

new gen old gen

too much promotion

Page 32: Webinar: Diagnosing Apache Cassandra Problems in Production

GC Profiling• Opscenter gc stats • Look for correlations between gc spikes

and read/write latency

• Cassandra GC Logging • Can be activated in cassandra-env.sh

• jstat • prints gc activity

Page 33: Webinar: Diagnosing Apache Cassandra Problems in Production

GC Profiling•What to look out for: • Long, multi-second pauses • Caused by Full GCs. Old gen is filling up faster than the concurrent GC can keep up with

it. Typically means garbage is being promoted out of the new gen too soon • Long minor GC • Many of the objects in the new gen are being promoted to the old gen. • Most commonly caused by new gen being too big • Sometimes caused by objects being promoted prematurely

Page 34: Webinar: Diagnosing Apache Cassandra Problems in Production

How much does it matter?

Page 35: Webinar: Diagnosing Apache Cassandra Problems in Production

Stuff is broken, fix it!

Page 36: Webinar: Diagnosing Apache Cassandra Problems in Production

Narrow Down the Problem• Is it even Cassandra? Check your

metrics! •Nodes flapping / failing • Check ops center • Dig into system metrics

• Slow queries • Find your bottleneck • Check system stats • JVM GC • Compaction • Histograms • Tracing

Page 37: Webinar: Diagnosing Apache Cassandra Problems in Production

©2013 DataStax Confidential. Do not distribute without consent. 37