Posted on 12-Apr-2017

Host Health Monitoring with `docker run`

Noah Zoschke @nzoschke

noah@convox.com

10/28/2015

Health Monitoring circa 1999

• Nagios Core
  • Event scheduler
  • Event processor
  • Alert manager

• Host groups config
  • Ping
  • HTTP
  • SSH

• Nagios Remote Plugin Executor
  • SNMP
  • load
  • disk

photo credit: https://en.wikipedia.org/wiki/Nagios

Health Monitoring circa 2012

• AMI
  • Chef / Ansible

• ELB / Health Check
  • Protocol: HTTP (or HTTPS, TCP, SSL)
  • Port: 80
  • Path: /index.html
  • Timeout / Interval: 5s / 30s
  • Unhealthy / Healthy Threshold: 2 / 10

• EC2 / Status Checks
  • Loss of network
  • Loss of power
  • Host software problems
  • Host hardware problems

• ASG

photo credit: http://aws.amazon.com/architecture/

photo credit: http://blog.domenech.org/2012/11/aws-ec2-auto-scaling-basic-configuration.html
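The Unhealthy / Healthy Threshold pair above can be read as a small state machine: a healthy instance flips to unhealthy after 2 consecutive failed checks, and flips back only after 10 consecutive passes. The sketch below models that reading with the slide's values. It is an illustration, not ELB's actual implementation; the class name, field names, and the assumption that an instance starts out healthy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Threshold-based health state, using the slide's example values."""

    unhealthy_threshold: int = 2   # consecutive failures before flipping to unhealthy
    healthy_threshold: int = 10    # consecutive passes before flipping back to healthy
    healthy: bool = True           # assumption: the instance starts out healthy
    streak: int = 0                # consecutive results that contradict the current state

    def observe(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so the opposing streak is broken.
            self.streak = 0
            return self.healthy
        self.streak += 1
        limit = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= limit:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy
```

With a 30s interval, this model would take about a minute of failures to mark an instance unhealthy and about five minutes of passes to restore it.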

But you probably still need…

• Nagios for monitoring

• or Zabbix, Ganglia, Sensu…

• or OpsView, SolarWinds…

• or Pingdom, Datadog…

• To provide system feedback

• ASG SetInstanceHealth

photo credit: http://itomibhaa.deviantart.com/art/Who-watches-the-Watchmen-276285938
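One way an external monitor might feed its verdict back to the ASG is the AWS CLI command `aws autoscaling set-instance-health`, which wraps the SetInstanceHealth API. The sketch below only builds the argument list and does not call AWS; the function name is hypothetical.

```python
from typing import List


def set_instance_health_command(instance_id: str, healthy: bool) -> List[str]:
    """Build the AWS CLI argv that reports an instance's health to its ASG."""
    status = "Healthy" if healthy else "Unhealthy"
    return [
        "aws", "autoscaling", "set-instance-health",
        "--instance-id", instance_id,
        "--health-status", status,
    ]
```

Passing the returned list to `subprocess.run` would issue the call, assuming the AWS CLI is installed and has credentials for the account that owns the ASG.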

Health Monitoring circa 2016, the age of containers

• Generic AMI
  • Docker

• ECS
  • Container scheduling and re-scheduling as a service

• ASG / EC2 / Status Checks
  • Simple monitoring container

photo credit: https://github.com/docker/swarm

[Diagram: an ASG of three instances, each running ecs-agent and dockerd, with ECS scheduling containers across them: api (128 MB), registry (256 MB), rails web.1 to web.3 (1024 MB each), data worker.1 and worker.2 (512 MB each), rails worker.1 to worker.4 (256 MB each), plus an api ELB and a rails ELB.]

Failure Scenarios

• web.2 container crashes

• web.2 port unresponsive

• ecs-agent fails

• dockerd fails

• Instance hardware fails

• Instance fails to register with ECS

• Instance userspace gets wacky

Failure Scenarios

• web.2 container crashes

• web.2 port unresponsive

• ecs-agent fails

• dockerd fails

photo credit: http://paper-replika.com/index.php?option=com_content&view=article&id=76&Itemid=207693

> reschedule task

Container Schedulers are the new watchman

• Container process monitoring

• Service health check monitoring

• Automatic re-scheduling

photo credit: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_life_cycle.html

Failure Scenarios

• web.2 container crashes

• web.2 port unresponsive

• ecs-agent fails

• dockerd fails

• Instance hardware fails

• Instance fails to register with ECS

• Instance userspace gets wacky

Still need to configure an ASG to maintain capacity…

Failure Scenarios

• web.2 container crashes

• web.2 port unresponsive

• ecs-agent fails

• dockerd fails

• Instance hardware fails

• Instance fails to register with ECS

• Instance userspace gets wacky

Still need a monitor…

Health Monitoring circa 2016, the age of containers

• Schedule a monitor process in container cluster

• Describe ASG and ECS membership

• Mark all instances unregistered with ECS unhealthy

• `docker run` a user space health check on every instance

• Mark instances that fail to connect to Docker unhealthy

• Mark instances that fail user space health check unhealthy

No Nagios server + plugins!
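The monitor steps on this slide can be sketched as one pure function. This is not the Convox implementation (that is the Go worker linked at the end of the deck); it is a minimal illustration under stated assumptions. The function name, the reason strings, and the convention that `run_check` raises `ConnectionError` when Docker is unreachable are all hypothetical. Fetching the two membership lists from AWS is left to the caller.

```python
from typing import Callable, Dict, Iterable, Set


def find_unhealthy(
    asg_instances: Iterable[str],
    ecs_instances: Iterable[str],
    run_check: Callable[[str], bool],
) -> Dict[str, str]:
    """Return {instance_id: reason} for every instance the monitor would mark unhealthy.

    run_check(instance_id) is assumed to `docker run` the user space health
    check on that instance, return True on a pass, and raise ConnectionError
    when the instance's Docker daemon cannot be reached.
    """
    registered: Set[str] = set(ecs_instances)
    unhealthy: Dict[str, str] = {}
    for instance_id in asg_instances:
        # In the ASG but unknown to ECS: the instance never registered.
        if instance_id not in registered:
            unhealthy[instance_id] = "not registered with ECS"
            continue
        try:
            passed = run_check(instance_id)
        except ConnectionError:
            unhealthy[instance_id] = "cannot connect to Docker"
            continue
        if not passed:
            unhealthy[instance_id] = "failed user space health check"
    return unhealthy
```

Each instance ID in the result would then be reported to the ASG as unhealthy, so the ASG replaces it and capacity is maintained without a separate monitoring server.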

Partial Failure Scenarios (battle scars)

• web.2 container crashes

• web.2 port unresponsive

• ecs-agent fails

• dockerd fails

• Instance hardware fails

• Instance fails to register with ECS

• Instance userspace gets wacky

• Disk full

• Disk partition corrupt / read-only

• Network packet loss

• CPU steal

• Kernel bugs triggered

• Security vulnerabilities

• Security breaches

• …

User Space Health Check

$ docker run busybox sh -c \
    'dmesg | grep "Remounting filesystem read-only"'

# why not: $ docker run health-check

To package, distribute, and run common binaries and scripts: top, netstat, smartmontools, etc.
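A monitor process could drive the check above from code and turn the exit status into a verdict. The sketch below is one hedged way to do that with the standard library. The function names are hypothetical, and two details are additions not shown on the slide: the `--rm` flag (to clean up the finished container) and the optional `-H` flag (to target a remote Docker daemon, whose address format is an assumption). The exit-code mapping relies on grep returning 0 on a match and 1 on no match; any other status, including Docker's own errors, is treated as a failed check run.

```python
import subprocess
from typing import List, Optional

# The check from the slide: look for the kernel's read-only remount message.
READ_ONLY_CHECK = 'dmesg | grep "Remounting filesystem read-only"'


def docker_check_command(
    script: str, image: str = "busybox", host: Optional[str] = None
) -> List[str]:
    """Build the `docker run` argv for a shell-based health check."""
    command = ["docker"]
    if host is not None:
        command += ["-H", host]  # assumption: daemon address such as tcp://10.0.0.5:2376
    command += ["run", "--rm", image, "sh", "-c", script]
    return command


def classify_exit_code(exit_code: int) -> str:
    """Map the check's exit status to a verdict."""
    if exit_code == 0:
        return "unhealthy"  # grep matched: a filesystem was remounted read-only
    if exit_code == 1:
        return "healthy"    # grep found nothing
    return "error"          # grep or docker itself failed; the check did not complete


def run_check(host: Optional[str] = None) -> str:
    """Run the check against a Docker daemon and return the verdict (needs docker installed)."""
    command = docker_check_command(READ_ONLY_CHECK, host=host)
    result = subprocess.run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return classify_exit_code(result.returncode)
```

Keeping the "error" verdict separate from "healthy" avoids treating an unreachable daemon as a passing check.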

Thanks!

Slides available on Medium / SlideShare

https://medium.com/@nzoschke/host-health-monitoring-with-docker-run-46315eb38286

http://www.slideshare.net/nzoschke/host-health-monitoring-with-docker-run

Open source Golang monitor available on GitHub

https://github.com/convox/rack/blob/master/api/workers/cluster.go

Questions / feedback to @nzoschke or noah@convox.com
