(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
DESCRIPTION
If you have tried Docker but are unsure about how to run it at scale, you will benefit from this session. Like virtualization before, containerization (à la Docker) is increasing the elastic nature of cloud infrastructure by an order of magnitude. But maybe you still have questions: How many containers can you run on a given Amazon EC2 instance type? Which metric should you look at to measure contention? How do you manage fleets of containers at scale? Datadog is a monitoring service for IT, operations, and development teams who write and run applications at scale. In this session, the cofounder of Datadog presents the challenges and benefits of running containers at scale and how to use quantitative performance patterns to monitor your infrastructure at this magnitude and complexity. Sponsored by Datadog.
TRANSCRIPT
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 12, 2014 | Las Vegas, NV
APP309
Monitoring and Running Docker Containers at Scale
Alexis Lê-Quôc, Datadog
@alq — CTO at Datadog
Datadog
• Monitoring service
• Made for the cloud
• Aggregates everything
• Support for Docker (since 1.0)
Goals
1. Present key Docker metrics
2. Explain operational complexity
3. Rethink monitoring of Docker containers
Agenda
• A (very) brief history of containers
• Docker containers on AWS
• Key Docker metrics
• Operational complexity
• Monitoring Docker effectively
• Demo
A brief history of containers
Containers in a nutshell
• Been around for a long time
– jails, zones, cgroups
• No full-virtualization overhead
• Used for runtime isolation (e.g., jails)
• Docker: escape from dependency hell
Escape from dependency hell
a.out
shared libs
packages
omnibus
Docker ~
Container ~ single static binary
Process | Container | Host
Source | Dockerfile | Chef/Puppet, Kickstart
.TEXT | /var/lib/docker | Full distro
PID | Name/ID | Hostname
Docker on AWS: some numbers
(Some) Docker use cases
• Continuous integration
– eliminate dependency variance
– same code from dev laptop to production
– Git-like workflow
• Continuous delivery
– (quasi) stateless components
– web workers, video encoders, etc.
– not for data stores (Amazon RDS a better fit)
Instance types
20% 20% 19% 13% 8% 21%
c3.2xl m3.medium m3.large m3.xlarge m1.large the rest
Source: Datadog, October 2014
Containers per instance
• Average: 5 (October 2014)
• Highly dependent on the workload
• This is just the beginning…
• Expect higher container density going forward
Source: Datadog, October 2014
Key Docker metrics
Docker containers consume…
• Memory
• CPU
• I/O
• Network
Memory
Name | Why it matters
pgmajfault | Paging to/from disk is slow
pgfault | Context switches hurt application performance
resident set size (rss) | Too much RSS causes paging and swapping
swap | Swapping in/out is slow
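The slide names the memory counters but not where to read them. As a minimal sketch, assuming the counters come from a cgroup v1 `memory.stat` file (one `key value` pair per line), the four metrics above could be pulled out as follows. The file location varies by setup and is not given in the talk; the sample text and the helper names are hypothetical.

```python
# Sketch only: parses text laid out like a cgroup v1 memory.stat file.
# SAMPLE holds made-up values; a real reader would load the container's
# memory.stat file, whose path depends on the host setup.

KEY_MEMORY_METRICS = ("pgmajfault", "pgfault", "rss", "swap")


def parse_memory_stat(text):
    """Turn 'key value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def key_memory_metrics(stats):
    """Keep only the counters listed on the slide (None when absent)."""
    return {name: stats.get(name) for name in KEY_MEMORY_METRICS}


SAMPLE = """cache 1048576
rss 52428800
swap 0
pgfault 120345
pgmajfault 12
"""

if __name__ == "__main__":
    print(key_memory_metrics(parse_memory_stat(SAMPLE)))
```

Returning `None` for a missing counter keeps "not reported" distinct from a genuine zero, which matters for `swap` on hosts where swap accounting is off.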
CPU
Name | Why it matters
user | Measures work being done
system | System calls, a necessary evil
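The `user` and `system` counters are cumulative, so they only become useful as a difference between two samples. The sketch below assumes the values come from a cgroup v1 `cpuacct.stat` file, which reports both counters in clock ticks, and assumes 100 ticks per second; both the tick rate and the sample values are assumptions, not figures from the talk.

```python
# Sketch only: converts two cumulative tick samples into CPU seconds.
# ticks_per_second=100 is an assumed default; check the host's actual
# clock tick rate before relying on the result.


def parse_cpuacct_stat(text):
    """Parse 'user N' / 'system N' lines into a dict of tick counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def cpu_seconds(before, after, ticks_per_second=100):
    """CPU seconds spent in user and system mode between two samples."""
    return {
        mode: (after[mode] - before[mode]) / ticks_per_second
        for mode in ("user", "system")
    }


# Made-up samples taken some interval apart.
FIRST = "user 1000\nsystem 200\n"
SECOND = "user 1500\nsystem 250\n"

if __name__ == "__main__":
    print(cpu_seconds(parse_cpuacct_stat(FIRST), parse_cpuacct_stat(SECOND)))
```

Dividing the result by the wall-clock interval between the two samples gives utilization; a rising `system` share relative to `user` is the signal the slide hints at.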
Block I/O
Name | Why it matters
blkio.io_service_bytes | I/O is (often) bottleneck
blkio.io_queued | Measures saturation
Network
Name | Why it matters
tx/rx_errors | Because…errors are bad
tx/rx_dropped | Measures contention
tx/rx_bytes | Measures traffic
How to collect metrics
• https://github.com/google/cadvisor
Operational complexity
Combinatorial multiplication
[Diagram: three stacks. Bare metal: Hardware, OS, Off-the-shelf, Your Application. Virtualized: Hardware, Hypervisor, two OSes, each with Off-the-shelf and App. Containerized: Hardware, Hypervisor, two OSes, Containers (A, O) on each.]
Operational complexity
• Average containers per instance: N (N=5, 10/2014)
• N times as many “hosts” to manage
• Affects
– provisioning: prep’ing & building containers
– configuration: passing config to containers
– orchestration: deciding where/when containers run
– monitoring: making sure containers run properly
Monitoring: metric counts on Amazon EC2
• 1 Amazon EC2 instance
– 10 Amazon CloudWatch metrics
• 1 operating system (e.g., Linux)
– 100 metrics
• 1 container
– 50 metrics
• 1 off-the-shelf application
– ~50 metrics
Combinatorial multiplication
100 instances
500 containers
Assuming only 5 containers per instance
Combinatorial multiplication
160 metrics per instance
410 metrics per instance
Assuming only 5 containers per instance
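The slides do not spell out how 160 and 410 are derived. One reading that reproduces both numbers from the per-layer counts given earlier is sketched below; it assumes a single off-the-shelf application counted once per instance, and 50 extra metrics for each container. The function name and that breakdown are assumptions.

```python
# Sketch only: one possible breakdown that matches the slide's totals.
CLOUDWATCH_METRICS = 10   # per EC2 instance
OS_METRICS = 100          # per operating system
APP_METRICS = 50          # per off-the-shelf application (slide says ~50)
CONTAINER_METRICS = 50    # per container


def metrics_per_instance(containers):
    """Metrics emitted by one instance running the given number of containers."""
    return (
        CLOUDWATCH_METRICS
        + OS_METRICS
        + APP_METRICS
        + containers * CONTAINER_METRICS
    )


if __name__ == "__main__":
    instances = 100
    containers_per_instance = 5
    print(metrics_per_instance(0))                                    # no containers
    print(metrics_per_instance(containers_per_instance))              # with containers
    print(instances * containers_per_instance)                        # fleet containers
    print(instances * metrics_per_instance(containers_per_instance))  # fleet metrics
```

Under this reading, the per-instance count grows linearly with container density, so a fleet of 100 instances goes from 16,000 to 41,000 metrics at 5 containers each.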
Velocity
EC2 instance half-life: hours, days, months
Container half-life: minutes, hours, days
Aggravating factors
• Hub-based provisioning
– new images every day
• Autonomic orchestration
– from imperative to declarative
– automated
– individual containers don’t matter
– e.g., Kubernetes, Mesos
A lot more,
A lot faster.
If your monitoring is still centered on individual hosts or instances…
Host-centric monitoring
[Diagram: stack of Hypervisor, OS, and Containers (A, O), with two Monitor boxes and a marked GAP]
A lot more pain,
A lot faster.
Monitoring containers effectively
A new approach to container monitoring
Layers + Tags
Layers of monitoring
[Diagram: a single Monitor beside the stack of Hypervisor, OS, and Containers (A, O)]
Layers of monitoring
[Diagram: CloudWatch, Infrastructure Monitoring, and APM beside the stack of Hypervisor, OS, and Containers (A, O)]
Layers of monitoring
[Diagram: the same stack of Hypervisor, OS, and Containers (A, O), with example metrics for the monitoring layers]
Layers: CloudWatch, Infrastructure Monitoring, APM
e.g. cpu/net/io, filesystem, docker mem, docker cpu, db queries, web requests, app throughput
Layers of monitoring
• Access to metrics from all the layers
• Amazon CloudWatch, OS metrics, Docker metrics, app metrics in 1 place
• Shared timeline
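A shared timeline amounts to interleaving each layer's data points by timestamp. As a rough sketch, assuming every layer delivers points as `(timestamp, layer, metric, value)` tuples already sorted by time, the standard library's `heapq.merge` does the interleaving. The tuple layout, layer names, and sample values are hypothetical.

```python
import heapq


def shared_timeline(*layers):
    """Merge per-layer series, each sorted by timestamp, into one ordered list."""
    return list(heapq.merge(*layers, key=lambda point: point[0]))


# Made-up points: (seconds since start, layer, metric, value).
cloudwatch = [(0, "cloudwatch", "cpu_pct", 40.0), (60, "cloudwatch", "cpu_pct", 85.0)]
os_layer = [(10, "os", "filesystem_used_pct", 61.0)]
docker = [(5, "docker", "rss_mb", 400), (65, "docker", "rss_mb", 900)]
app = [(30, "app", "web_requests_per_s", 120)]

if __name__ == "__main__":
    for point in shared_timeline(cloudwatch, os_layer, docker, app):
        print(point)
```

With all layers on one axis, a CPU spike reported by CloudWatch can be read next to the container memory growth that follows it a few seconds later.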
If your monitoring
does not cover all layers,
pain.
Tags
You use them already
Tags
• Monitoring is like Auto Scaling groups
• Monitoring is like Docker orchestration
• From imperative to declarative
• Query-based
• Queries operate on tags
Monitoring with tags and queries
“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… and make sure resident set size < 1GB on c3.xl”
Monitoring with tags and queries
“Monitor all Docker containers running image web”
“… in region us-west-2 across all availability zones”
“… that use more than 1.5x the average on c3.xl”
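The three quoted queries can be read as a tag filter followed by a condition on a metric. The sketch below is not Datadog's query language; it is a plain-Python illustration in which the tag names (`image`, `region`, `az`, `instance_type`), the `rss_mb` field, and all sample data are hypothetical.

```python
from statistics import mean

# Made-up inventory: each container carries tags plus one metric value.
CONTAINERS = [
    {"id": "web-1", "rss_mb": 400,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xl"}},
    {"id": "web-2", "rss_mb": 420,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2b", "instance_type": "c3.xl"}},
    {"id": "web-3", "rss_mb": 380,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2c", "instance_type": "c3.xl"}},
    {"id": "web-4", "rss_mb": 1400,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xl"}},
    {"id": "web-5", "rss_mb": 2000,
     "tags": {"image": "web", "region": "us-east-1", "az": "us-east-1a", "instance_type": "c3.xl"}},
    {"id": "db-1", "rss_mb": 3000,
     "tags": {"image": "db", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xl"}},
]


def select(containers, **wanted):
    """Containers whose tags match every wanted key/value pair."""
    return [
        c for c in containers
        if all(c["tags"].get(key) == value for key, value in wanted.items())
    ]


def over_limit(group, limit_mb=1024):
    """Containers breaking the 'resident set size < 1GB' condition."""
    return [c for c in group if c["rss_mb"] >= limit_mb]


def above_average(group, factor=1.5):
    """Containers using more than factor times the group's average RSS."""
    if not group:
        return []
    average = mean(c["rss_mb"] for c in group)
    return [c for c in group if c["rss_mb"] > factor * average]


if __name__ == "__main__":
    # No "az" tag in the filter, so every availability zone is included.
    group = select(CONTAINERS, image="web", region="us-west-2", instance_type="c3.xl")
    print([c["id"] for c in over_limit(group)])
    print([c["id"] for c in above_average(group)])
```

Nothing in the query names a host or a container ID, so containers that start or stop between evaluations are picked up or dropped automatically; that is the declarative property the slides compare to Auto Scaling groups and orchestration.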
“Dude, where’s my server?”
“Dude, where’s my container?”
If your monitoring
is not tag-based,
pain.
Demo
Take-aways
1. Docker increases operational complexity by an order of magnitude unless…
2. You have layered monitoring, from the instance to the container and to the application, and…
3. You monitor using tags and queries
http://bit.ly/awsevals