(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. November 12, 2014 | Las Vegas, NV APP309 Monitoring and Running Docker Containers at Scale Alexis Lê-Quôc, Datadog

Upload: amazon-web-services

Post on 01-Jul-2015


Category:

Technology



DESCRIPTION

If you have tried Docker but are unsure about how to run it at scale, you will benefit from this session. Like virtualization before, containerization (à la Docker) is increasing the elastic nature of cloud infrastructure by an order of magnitude. But maybe you still have questions: How many containers can you run on a given Amazon EC2 instance type? Which metric should you look at to measure contention? How do you manage fleets of containers at scale? Datadog is a monitoring service for IT, operations, and development teams who write and run applications at scale. In this session, the cofounder of Datadog presents the challenges and benefits of running containers at scale and how to use quantitative performance patterns to monitor your infrastructure at this magnitude and complexity. Sponsored by Datadog.

TRANSCRIPT

Page 1: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 12, 2014 | Las Vegas, NV

APP309

Monitoring and Running Docker Containers at Scale

Alexis Lê-Quôc, Datadog

Page 2: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

@alq — CTO at Datadog

Page 3: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Datadog

• Monitoring service

• Made for the cloud

• Aggregates everything

• Support for Docker

(since 1.0)

Page 4: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Goals

1. Present key Docker metrics

2. Explain operational complexity

3. Rethink monitoring of Docker containers

Page 5: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Agenda

• A (very) brief history of containers

• Docker containers on AWS

• Key Docker metrics

• Operational complexity

• Monitoring Docker effectively

• Demo

Page 6: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

A brief history of containers

Page 7: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Containers in a nutshell

• Been around for a long time

– jails, zones, cgroups

• No full-virtualization overhead

• Used for runtime isolation (e.g., jails)

• Docker: escape from dependency hell

Page 8: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Escape from dependency hell

a.out

shared libs

packages

omnibus

Docker ~

Page 9: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Container ~ single static binary

Process    Container          Host
Source     Dockerfile         Chef/Puppet, Kickstart
.TEXT      /var/lib/docker    Full distro
PID        Name/ID            Hostname

Page 10: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Docker on AWS: some numbers

Page 11: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

(Some) Docker use cases

• Continuous integration

– eliminate dependency variance

– same code from dev laptop to production

– Git-like workflow

• Continuous delivery

– (quasi) stateless components

– web workers, video encoders, etc.

– not for data stores (Amazon RDS a better fit)

Page 12: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Instance types

c3.2xl    m3.medium    m3.large    m3.xlarge    m1.large    the rest
20%       20%          19%         13%          8%          21%

Source: Datadog, October 2014

Page 13: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Containers per instance

• Average: 5 (October 2014)

• Highly dependent on the workload

• This is just the beginning…

• Expect higher container density going forward

Source: Datadog, October 2014

Page 14: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Key Docker metrics

Page 15: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Docker containers consume…

• Memory

• CPU

• I/O

• Network

Page 16: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Memory

Name                       Why it matters
pgmajfault                 Paging to/from disk is slow
pgfault                    Context switches hurt application performance
resident set size (rss)    Too much RSS causes paging and swapping
swap                       Swapping in/out is slow
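
A minimal sketch of how these four counters could be read, assuming the container's memory cgroup exposes a cgroup v1 style `memory.stat` file made of `name value` lines. The file location is host-dependent (on many hosts of that era it sat under `/sys/fs/cgroup/memory/docker/<container id>/`, which is an assumption here), and `swap` only appears when swap accounting is enabled. The sample text and its numbers are invented for illustration.

```python
# Sample text in the "name value" layout of a cgroup v1 memory.stat file.
# The numbers are made up for illustration.
SAMPLE_MEMORY_STAT = """\
cache 1048576
rss 524288000
pgfault 120034
pgmajfault 12
swap 0
"""

# The four counters named on the slide.
KEY_FIELDS = ("pgmajfault", "pgfault", "rss", "swap")


def parse_memory_stat(text):
    """Return the slide's four memory counters as a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in KEY_FIELDS:
            stats[parts[0]] = int(parts[1])
    return stats


memory = parse_memory_stat(SAMPLE_MEMORY_STAT)
print(memory)
```

`pgfault` and `pgmajfault` are cumulative counts, so in practice one would sample them periodically and look at the difference between samples rather than the raw value.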

Page 17: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

CPU

Name      Why it matters
user      Measures work being done
system    System calls, a necessary evil
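
As a rough illustration, user and system time can be turned into a utilization figure by sampling twice and dividing the difference by the elapsed time. The sketch below assumes a cgroup v1 `cpuacct.stat` style input (`user N` and `system N`, in clock ticks) and assumes 100 ticks per second, which is common on Linux but should be checked on the actual host. The sample values are invented.

```python
# Two invented samples in the "name ticks" layout of cpuacct.stat,
# assumed to be taken 10 seconds apart.
SAMPLE_BEFORE = "user 1000\nsystem 200\n"
SAMPLE_AFTER = "user 1500\nsystem 250\n"


def parse_cpuacct_stat(text):
    """Parse 'name ticks' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def cpu_usage(before, after, interval_s, ticks_per_second=100):
    """CPU-seconds consumed per wall-clock second, split into user and system.

    ticks_per_second=100 is an assumption; the real value is host-specific.
    """
    return {
        name: (after[name] - before[name]) / ticks_per_second / interval_s
        for name in ("user", "system")
    }


usage = cpu_usage(
    parse_cpuacct_stat(SAMPLE_BEFORE),
    parse_cpuacct_stat(SAMPLE_AFTER),
    interval_s=10,
)
print(usage)
```

A value of 1.0 corresponds to one fully busy core over the interval.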

Page 18: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Block I/O

Name                      Why it matters
blkio.io_service_bytes    I/O is (often) bottleneck
blkio.io_queued           Measures saturation
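
A small sketch of parsing these counters, assuming the cgroup v1 blkio file layout of `major:minor operation value` lines followed by a grand `Total` line. The same parser should apply to `blkio.io_queued`, whose values are request counts rather than bytes. The sample content is invented.

```python
# Invented sample in the layout of blkio.io_service_bytes (cgroup v1).
SAMPLE_IO_SERVICE_BYTES = """\
8:0 Read 4096
8:0 Write 8192
8:0 Sync 8192
8:0 Async 4096
8:0 Total 12288
Total 12288
"""


def parse_blkio(text):
    """Map (device, operation) to the counter value, skipping the grand total."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # "major:minor operation value"
            device, operation, value = parts
            counters[(device, operation)] = int(value)
    return counters


io_bytes = parse_blkio(SAMPLE_IO_SERVICE_BYTES)
read_write_bytes = io_bytes[("8:0", "Read")] + io_bytes[("8:0", "Write")]
print(read_write_bytes)
```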

Page 19: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Network

Name             Why it matters
tx/rx_errors     Because…errors are bad
tx/rx_dropped    Measures contention
tx/rx_bytes      Measures traffic
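
One possible way to obtain these counters, sketched under the assumption that they are read from a `/proc/net/dev` style table as seen from inside the container's network namespace. The sample text and the interface name `eth0` are illustrative only.

```python
# Invented sample in the layout of /proc/net/dev: two header lines, then
# one line per interface with 8 receive and 8 transmit counters.
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0:    1000      10    1    2    0     0          0         0     2000      20    3    4    0     0       0          0
"""


def parse_net_dev(text):
    """Return the slide's six counters per interface."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, _, rest = line.partition(":")
        fields = [int(x) for x in rest.split()]
        if len(fields) < 16:
            continue  # not a complete interface line
        counters[name.strip()] = {
            "rx_bytes": fields[0],
            "rx_errors": fields[2],
            "rx_dropped": fields[3],
            "tx_bytes": fields[8],
            "tx_errors": fields[10],
            "tx_dropped": fields[11],
        }
    return counters


net = parse_net_dev(SAMPLE_PROC_NET_DEV)
print(net["eth0"])
```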

Page 20: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

How to collect metrics

• https://github.com/google/cadvisor

Page 21: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Operational complexity

Page 22: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Combinatorial multiplication

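[Diagram: three stacks side by side. Bare metal: Hardware, OS, Off-the-shelf, Your Application. Virtualized: Hardware, Hypervisor, two OSes, each with Off-the-shelf and App. Containerized: Hardware, Hypervisor, two OSes, Containers labeled A A A A and O O O O.]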

Page 23: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Operational complexity

• Average containers per instance: N (N=5, 10/2014)

• N-times as many “hosts” to manage

• Affects

– provisioning: prep’ing & building containers

– configuration: passing config to containers

– orchestration: deciding where/when containers run

– monitoring: making sure containers run properly

Page 24: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Monitoring: metric counts on Amazon EC2

• 1 Amazon EC2 instance

– 10 Amazon CloudWatch metrics

• 1 operating system (e.g., Linux)

– 100 metrics

• 1 container

– 50 metrics

• 1 off-the-shelf application

– ~50 metrics

Page 25: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Combinatorial multiplication

100 instances

500 containers

Assuming only 5 containers per instance

Page 26: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Combinatorial multiplication

160 metrics per instance

410 metrics per instance

Assuming only 5 containers per instance
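
The two figures can be reproduced from the per-layer counts on the earlier slide. The breakdown below is one reading that matches the numbers (10 CloudWatch + 100 OS + 50 application = 160, plus 50 per container); the slides do not spell out the formula, and the fleet-wide total is an extrapolation from the 100-instance example.

```python
# Per-layer metric counts taken from the "metric counts on Amazon EC2" slide.
CLOUDWATCH_METRICS = 10
OS_METRICS = 100
APP_METRICS = 50
METRICS_PER_CONTAINER = 50


def metrics_per_instance(containers):
    """Metrics emitted by one instance running the given number of containers.

    Assumed breakdown: fixed instance, OS, and application metrics plus
    50 metrics for every container.
    """
    fixed = CLOUDWATCH_METRICS + OS_METRICS + APP_METRICS
    return fixed + containers * METRICS_PER_CONTAINER


without_containers = metrics_per_instance(0)
with_five_containers = metrics_per_instance(5)
fleet_total = 100 * with_five_containers  # extrapolated to 100 instances

print(without_containers, with_five_containers, fleet_total)
```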

Page 27: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Velocity

EC2 instance half-life: hours, days, months

Container half-life: minutes, hours, days

Page 28: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Aggravating factors

• Hub-based provisioning

– new images every day

• Autonomic orchestration

– from imperative to declarative

– automated

– individual containers don’t matter

– e.g., Kubernetes, Mesos

Page 29: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

A lot more,

A lot faster.

Page 30: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

If your monitoring is still centered on individual hosts or instances…

Page 31: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Host-centric monitoring

[Diagram: the containerized stack (Hypervisor, two OSes, Containers labeled A A A A and O O O O) with two "Monitor" labels and a "GAP" label.]

Page 32: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

A lot more pain,

A lot faster.

Page 33: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Monitoring containers effectively

Page 34: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

A new approach to container monitoring

Page 35: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Layers + Tags

Page 36: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Layers of monitoring

[Diagram: the containerized stack (Hypervisor, two OSes, Containers labeled A A A A and O O O O) with a single "Monitor" label.]

Page 37: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Layers of monitoring

[Diagram: the containerized stack (Hypervisor, two OSes, Containers labeled A A A A and O O O O) with three monitoring layers: CloudWatch, Infrastructure Monitoring, APM.]

Page 38: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Layers of monitoring

[Diagram: the containerized stack with the three monitoring layers (CloudWatch, Infrastructure Monitoring, APM) and example metrics, e.g. cpu/net/io, filesystem, docker mem, docker cpu, db queries, web requests, app throughput.]

Page 39: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Layers of monitoring

• Access to metrics from all the layers

• Amazon CloudWatch, OS metrics, Docker metrics, app metrics in 1 place

• Shared timeline

Page 40: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

If your monitoring

does not cover all layers,

pain.

Page 41: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Tags

You use them already

Page 42: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Tags

• Monitoring is like Auto-Scaling Groups

• Monitoring is like Docker orchestration

• From imperative to declarative

• Query-based

• Queries operate on tags

Page 43: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Monitoring with tags and queries

“Monitor all Docker containers running image web”

“… in region us-west-2 across all availability zones”

“… and make sure resident set size < 1GB on c3.xl”

Page 45: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Monitoring with tags and queries

“Monitor all Docker containers running image web”

“… in region us-west-2 across all availability zones”

“… that use more than 1.5x the average on c3.xl”
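
The queries above can be sketched as a filter on tags followed by a check on a metric. The code below is an illustration only and not Datadog's query language: the container records, tag names (`image`, `region`, `instance_type`), and the `select` helper are all hypothetical.

```python
from statistics import mean

MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical container records: a name, a set of tags, and one metric.
containers = [
    {"name": "web-1", "rss_bytes": 300 * MB,
     "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}},
    {"name": "web-2", "rss_bytes": 300 * MB,
     "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}},
    {"name": "web-3", "rss_bytes": 900 * MB,
     "tags": {"image": "web", "region": "us-west-2", "instance_type": "c3.xl"}},
    {"name": "web-4", "rss_bytes": 2000 * MB,
     "tags": {"image": "web", "region": "us-east-1", "instance_type": "c3.xl"}},
    {"name": "db-1", "rss_bytes": 4000 * MB,
     "tags": {"image": "db", "region": "us-west-2", "instance_type": "c3.xl"}},
]


def select(items, **wanted):
    """Keep the items whose tags match every requested key/value pair."""
    return [
        item for item in items
        if all(item["tags"].get(key) == value for key, value in wanted.items())
    ]


# "All containers running image web, in us-west-2, on c3.xl"
matched = select(containers, image="web", region="us-west-2", instance_type="c3.xl")

# "... make sure resident set size < 1GB": list the violators.
over_limit = [c["name"] for c in matched if c["rss_bytes"] >= 1 * GB]

# "... that use more than 1.5x the average": compare against the group mean.
average_rss = mean(c["rss_bytes"] for c in matched)
outliers = [c["name"] for c in matched if c["rss_bytes"] > 1.5 * average_rss]

print(over_limit, outliers)
```

Because the selection is expressed on tags rather than on container names, containers that start or stop between evaluations are picked up or dropped without changing the query.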

Page 46: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

“Dude, where’s my server?”

Page 47: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

“Dude, where’s my container?”

Page 48: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

If your monitoring

is not tag-based,

pain.

Page 49: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Demo

Page 50: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

Take-aways

1. Docker increases operational complexity by an order of magnitude unless…

2. You have layered monitoring, from the instance to the container and to the application, and…

3. You monitor using tags and queries

Page 51: (APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014

http://bit.ly/awsevals