monitoring and scaling redis at datadog - ilan rabinovitch, datadog

76
Redis at RedisConf 2016 San Francisco, CA Ilan Rabinovitch Director, Community Datadog

Upload: redis-labs

Post on 16-Jan-2017

403 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Redis at

RedisConf 2016San Francisco, CA

Ilan Rabinovitch Director, CommunityDatadog

Page 2: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

$ finger ilan@datadog

[datadoghq.com]Name: Ilan RabinovitchRole: Director, Technical CommunityInterests: * Open Source * Web Operations & Infra Automation * Monitoring and Metrics * FL/OSS Community Events

Page 3: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

$ cat ~/.plan

1. An Overview of Datadog

2. Monitoring 101

3. How Datadog uses Redis

4. Key Metrics and Examples

Page 4: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

• Infrastructure and App monitoring as a service. • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Powered by Redis! • We’re hiring! (www.datadoghq.com/careers/)

Datadog Overview

Page 5: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores, Caches, Queues and more...

Monitor Everything

Page 6: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 7: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 8: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Queueing Caching Session State

Redis Use Cases

Page 9: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Monitoring 101

Page 10: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 11: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Follow @honest_update on Twitter

Page 12: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 13: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 14: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Collecting data is cheap; not having it when you need it can be expensive

Page 15: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Instrument all the things!

Page 16: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operational Complexity Increases with..

• Number of things to measure

• Velocity of change

Page 17: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

How much we measure?

1 instance• 10 metrics from CloudWatch

1 operating system (e.g., Linux)• 100 metrics

50~ metrics per redis instance

Page 18: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 19: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operational Complexity

100instances

400containers

Page 20: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operational Complexity: Scale

160metrics per host

640metrics per host

Assuming 4 containers per host

Page 21: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operational Complexity: Scale

100instances

64,000metrics

Assuming 4 containers per host

Page 22: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

How much we measure?

1 instance• 10 metrics from CloudWatch

1 operating system (e.g., Linux)• 100 metrics

50~ metrics per application N containers

• 150*N metricsMetrics Overload!

Page 23: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operational Complexity Increases with..

• Number of things to measure

• Velocity of change

Page 24: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Source: Datadog

Page 25: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Source: http://bit.ly/1qFylWK

Page 26: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Operational Complexity Increases with..

• Number of things to measure

• Velocity of change

Page 27: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 28: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 29: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

Page 30: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Finding Signal - Categorizing Your Metrics

Page 31: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 32: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 33: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 34: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 35: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 36: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Examples: Web Application

Work Metrics:

• Requests Per Second Dropped Connections

• Request Response Time • Error Rates (4xx or 5xx) • Success (2xx)

Resource Metrics:

• Disk I/O • Network Bandwidth • Memory • CPU • Queue Length

Page 37: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Examples: Web Application - Events

Work Metrics:

• Configuration Change • Code Deployment / Release • Add / Remove Nodes • Cache Purge

Page 38: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

When to let a sleeping engineer lie?

Page 39: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

When to alert?

Page 40: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Recurse until you find root cause

Page 41: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

The Life of a Metric

Page 42: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Monitor Everything

Page 43: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 44: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

• Billions of data points per day • Time series data (time stamps and values)

• Millions of events per day (text) • Metadata • 1 second resolution • Stored at full resolution for 13 months.

Metrics

Page 45: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Primarily

ElasticSearch

Page 46: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Examples: Events

Page 47: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

A metric is born…

• Open Source Python agent

• SDKs, Libraries, RESTful APIs

Page 48: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

A metric is born…

• Open Source Python agent

• SDKs, Libraries, RESTful APIs

• SaaS Integrations

Page 49: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

{ "series": [ { "metric": “<metric-name>”, "points": [ [ <timestamp>, <value> ] ], "type": "gauge", "host": “<host name>", "tags": [ “<tags>" ] } ]}

Page 50: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 51: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Tags All the Way Down

Page 52: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 53: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Asking Better Questions

“Monitor all containers running image web in region us-west-2 across all availability zones that use more than 1.5x the average memory on c3.xlarge”

Page 54: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Asking Better Questions

“90% of all web requests are taking more than 0.5s to process and respond.”

Page 55: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

{ "series": [ { "metric": “system.cpu.system”, "points": [ [ 419707187, 5 ] ], "type": "gauge", "host": "test.example.com", "tags": [ “environment:prod" ] } ]}

Page 56: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Intake APIs

• Constant stream of data

• Low latency requirements (30-60s)

• Data needed by multiple consumers/systems.

Page 57: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 58: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Redis Use Case #1: Queuing

Page 59: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Why Redis?

• Easy to scale vertically

• Simple push/pop interaction for queues

• Simple clustering via twemproxy

Page 60: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Work Metrics - Queuing

• Queue Depth • Message Latency • Cmd Latency • Read vs Write Calls per sec

Page 61: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Work Metrics - Queuing

Page 62: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Work Metrics - Queuing

Page 63: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Resource Metrics - Queuing

• Network Utilization • used_memory • Disk IO • Disk Space • connected_clients • keyspace • rejected_connection

Page 64: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 65: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Use Case #2 Caching

Page 66: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Initial Architecture

• Cache Aside

• Single Redis per Worker

• Local Cache

• LRU Caching (allkeys-lru)

Page 67: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Dogpound

Source: http://bit.ly/1NoW6aj

• Uniform API for caching backends • Pluggable Architecture

• memory • redis • sharded redis • tiered caching

• Service Discovery via Consul

Page 68: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Sharded Redis

HashRing

Page 69: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Tiered Cache Redis

• Local Redis • Shared Cache

• Tiered • HA Pair

• Fallback to primary data store

Page 70: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Work Metrics - Caching

• Cache Hit to Miss Ratio • cmds / second • Keys stored by host • request latency

Page 71: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Resource Metrics - Caching

• evicted_keys • used_memory • network utilization • connected_clients • keyspace • utilization on backend data store

Page 72: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 73: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Page 74: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Events

• Adding / Removing Nodes • Incidents • Code Deploys • Config Changes

Page 75: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Source: http://bit.ly/1NoW6aj

Page 76: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

ResourcesMonitoring 101: Alerting https://www.datadoghq.com/blog/monitoring-101-alerting/

Monitoring 101: Collecting the Right Data https://www.datadoghq.com/blog/monitoring-101-collecting-data/

Monitoring 101: Investigating performance issues https://www.datadoghq.com/blog/monitoring-101-investigation/

The Power of Tagged Metrics https://www.datadoghq.com/blog/the-power-of-tagged-metrics/

Monitoring Redis: Collecting Performance Metrics https://www.datadoghq.com/blog/how-to-monitor-redis-performance-metrics/

HashRing https://pypi.python.org/pypi/hash_ring