techtalks: taking docker to production
TRANSCRIPT
Taking Docker To Production
JOSA TechTalk by Muayyad Saleh Alsadi
http://muayyad-alsadi.github.io/
What is Docker again? (quick review)
Containers
uses Linux kernel features like:
● namespaces
● cgroups (control groups)
● capabilities
Platform
Docker is a key component of many PaaS offerings. Docker provides a way to host images, pull them, run them, pause them, snapshot them into new images, view diffs, etc.
Ecosystem
Like GitHub, Docker Hub provides publicly available community images.
Containers vs. VMs
No kernel in the guest OS (shared with the host).
Containers are more secure and isolated than chroot, and less isolated than VMs.
Why DevOps?
Devs
want change
Ops
wants stability (no change)
DevOps
resolves the conflict.
for devs: the docker image contains the same OS, same libraries, same versions, same config, etc.
for admins: the host is untouched and stable
Blame each other
Fight each other
Devs Heaven (not for production)
docker-compose can bring everything up, connect and link the containers with a single command. It can mount a local dir inside the image (so that the developer can use his/her favorite IDE). The command is

docker-compose up

It will read "docker-compose.yml", which might look like:

mywebapp:
  image: mywebapp
  volumes:
    - .:/code
  links:
    - redis
redis:
  image: redis
Operations Heaven
Having a stable host!
CoreOS does not include any package manager, and does not even have Python or common tools installed. They have a Fedora-based docker image called toolbox.
You can mix and match: some containers run Java 6 or Java 7; some use CentOS 6, others CentOS 7, others Ubuntu 14.04, others Fedora 22, etc., all on the same host.
Linking Containers
docker run -d --name r1 redis
docker run -d --name web --link r1:redis myweb
r1 is the container name, redis is the link alias.
It will update /etc/hosts and set ENVs:
● <alias>_NAME=<THIS>/<THAT> # myweb/r1
● REDIS_PORT=<tcp|udp>://<IP>:<PORT>
● REDIS_PORT_6379_TCP_PROTO=tcp
● REDIS_PORT_6379_TCP_PORT=6379
● REDIS_PORT_6379_TCP_ADDR=172.17.1.15
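An application inside the linked container can read those injected variables to find its backing service. A minimal Python sketch (the helper name `link_address` is mine, not part of docker):

```python
import os

def link_address(alias, port, env=os.environ):
    """Resolve a linked container's address from the env vars docker
    injects, e.g. REDIS_PORT_6379_TCP_ADDR / REDIS_PORT_6379_TCP_PORT."""
    prefix = "%s_PORT_%d_TCP" % (alias.upper(), port)
    addr = env.get(prefix + "_ADDR")
    if addr is None:
        return None  # link not present
    return addr, int(env.get(prefix + "_PORT", port))

# using the values shown on the slide
env = {
    "REDIS_PORT_6379_TCP_ADDR": "172.17.1.15",
    "REDIS_PORT_6379_TCP_PORT": "6379",
}
print(link_address("redis", 6379, env))  # ('172.17.1.15', 6379)
```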
Pets vs. Cattle vs. Ants
Pets (virtualization)
The VM has
● lovely distinct names
● emotions
● many highly coupled roles
● if down, it's a catastrophe
Cattle (cloud)
● no names
● no emotions
● single role
● decoupled (loosely coupled)
● load-balanced
● if down, other VMs take over
● VM failure is planned and part of the process
Ants (docker containers)
Containers are like cloud VMs: no names, no emotions, load-balanced.
A single host (which might be a VM) is highly dense. The host is stable. Large groups of containers are designed to fail as part of the process.
What docker is not
● docker is not a hypervisor
○ docker is for process containers, not system containers
○ examples of system containers: LXD and OpenVZ
● no systemd/upstart/sysvinit in the container
○ docker is for process containers, not system containers
○ just run apache, nginx, solr, whatever
○ TTYs are not needed
○ crons are not needed
● Docker is not for multi-tenant
HINT: LXD is a stupid way of winning a meaningless benchmark
Docker ecosystem
● CoreOS, Atomic Host, Ubuntu Core
● OpenShift (Red Hat PaaS)
● Cloud Foundry
● Mesos / Mesosphere (by Twitter, now Apache)
● Google Kubernetes (schedules containers onto hosts)
● Swarm
● etcd/Fleet
● Drone
● Deis, Flynn, Rancher
Docker golden rules
by @gionn on twitter:
● only one process per image
● no embedded configuration
● no sshd, no syslog, no tty
● no! you don't touch a running container to adjust things
● no! you will not use a community image
Theory vs. Reality
docker imaginary "unicorn" apps
● statically compiled (no dependencies)
● written in golang
● container ~ 10MB

in the real world
● interpreted application (python, php)
● system dependencies, config files, log files
● multiple processes (nginx, php-fpm)
● container image >500MB
12 Factor - http://12factor.net/
1. One codebase (in git), many deploys
2. Explicitly declare and isolate dependencies
3. Get config from environment or service discovery
4. Treat backing services as attached resources (database, SMTP, S3, etc.)
5. Strictly separate build and run stages (no minifying css/js at the run stage)
6. Execute the app as one or more stateless processes (data and state are persisted elsewhere, apart from the app; no need for sticky sessions)
7. Export a port (an endpoint to talk to)
8. Scale out via the process model
9. Disposability: maximize robustness with fast startup and graceful shutdown
10. Keep development, staging, and production as similar as possible
11. Logs: they are a flow of events written to stdout that is captured by the execution env.
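Factor 3 (config from the environment) is trivial to sketch in Python; the variable names below are illustrative, not from the talk:

```python
import os

def get_config(env=os.environ):
    """12-factor style: all config comes from env vars, never from files
    baked into the image. Defaults make local dev easy."""
    return {
        "db_url": env.get("DATABASE_URL", "postgres://localhost/app"),
        "debug": env.get("DEBUG", "0") == "1",
        "port": int(env.get("PORT", "8000")),
    }

cfg = get_config({"PORT": "9000", "DEBUG": "1"})
print(cfg)  # {'db_url': 'postgres://localhost/app', 'debug': True, 'port': 9000}
```

The same image then runs unchanged in dev, staging, and production; only the environment differs.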
12 Factor
The last factor is administrative processes:
● Run admin/management tasks as one-off processes
○ in django: manage.py migrate
● One-off admin processes should be run in an identical environment as the regular long-running processes of the app
● shipped from the same code (same git repo)
Example of 12 Factor: bedrock - a 12-factor wordpress
https://roots.io/bedrock/
12 Factor - Factorish
can be found on https://github.com/factorish/factorish
example: https://github.com/factorish/factorish-elk
Config
● confd
○ written in go (a statically linked binary)
○ input:
■ env variables
■ service discovery (like etcd and consul)
■ redis
○ output:
■ golang templates with {{something}}
● crudini, jq
● http://gliderlabs.com/registrator/latest/user/quickstart/
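The core idea of confd is rendering config files from templates filled with discovered values. Real confd uses full Go templates; this Python sketch only mimics the `{{something}}` substitution to show the mechanism:

```python
import re

def render(template, values):
    """Replace every {{key}} in the template with the corresponding
    value, confd-style. KeyError on a missing key is deliberate:
    better to fail than to ship a half-rendered config."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(values[m.group(1)]), template)

conf = render("server {{host}}:{{port}};",
              {"host": "172.17.1.15", "port": 6379})
print(conf)  # server 172.17.1.15:6379;
```

In a real setup the values would come from etcd/consul or env vars, and the rendered file would be written out before the service starts.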
Config
● container's entry point ("/start.sh") calls a REST API to add itself to haproxy or any other load balancer
● container's entry point uses a discovery service client (ex. etcdctl)
● something listens to docker events and sends each container's ENVs and labels to a discovery service
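Whichever variant you pick, the registration boils down to writing a key/value pair into the discovery service. A sketch of building that pair (the `/services/...` key layout is an assumption, not a standard):

```python
import json

def registration(name, ip, port, labels=None):
    """Build the key/value a container entry point might PUT into
    etcd or consul so a balancer can discover it."""
    key = "/services/%s/%s" % (name, ip)
    value = json.dumps({"ip": ip, "port": port, "labels": labels or {}})
    return key, value

key, value = registration("myweb", "172.17.1.15", 8080)
print(key)    # /services/myweb/172.17.1.15
print(value)
```

The entry point would then PUT this to the discovery service (e.g. via etcdctl or an HTTP call) and refresh it periodically as a TTL heartbeat.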
Multiple Process
● supervisord
● runit
● fake systemd
○ see the free-ipa docker image
○ https://github.com/adelton/docker-freeipa
Logging/Monitoring
● ctop
● cadvisor: https://github.com/google/cadvisor
● logstash
● logspout - https://github.com/gliderlabs/logspout
Logging/Monitoring
For nginx logging, use "error_log /dev/stderr;" and "access_log /dev/stdout;" with daemon off. For example, in supervisord:

[program:nginx]
directory=/var/lib/nginx
command=/usr/sbin/nginx -g 'daemon off;'
user=root
autostart=true
autorestart=true
redirect_stderr=false
stdout_logfile=/dev/stdout
stderr_logfile=/dev/stderr
stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0
Logging/Monitoring
Web UI
● tumtum
● cockpit-project.org
● Shipyard
● FleetUI
● CoreGI
● SUSE/Portus
Web UI - cockpit-project
Web UI - shipyard
Web UI - tumtum
Building Docker Images
● Dockerfile and "docker build -t myrepo/myapp ."
○ I have a proposal using pivot root inside the Dockerfile (docker build builds the build environment, then uses another fresh small container as the target, copies the build result, and pivots). The Docker builder is frozen, but the details are here.
● Dockramp
○ https://github.com/jlhawn/dockramp
○ external builder written in golang
○ uses only the docker api (needs the new "cp" api)
○ can implement my proposal
● Atomic App / Nulecule / OpenShift have their own way
● Use Fabric/Ansible to build
Simple Duct tape launching.
Systemd @ magic. ex: have [email protected]

# systemctl start container@myweb

[Unit]
Description=Docker Container for %I
After=docker.service
Requires=docker.service

[Service]
Type=simple
ExecStartPre=bash -c "/usr/bin/mkdir /var/lib/docker/vfs/dir/%i || :"
ExecStartPre=/usr/bin/docker kill %i
ExecStartPre=/usr/bin/docker rm %i
ExecStart=/usr/bin/docker run -i \
  --name="%i" \
  --env-file=/etc/sysconfig/container/%i.rc \
  --label-file=/etc/sysconfig/container/%i.labels \
  -v /var/lib/docker/vfs/dir/%i:/data \
  myrepo/%i
Seriously? Docker on production!
“Docker is about running random code downloaded from the Internet and running it as root.”[1][2]
-- a redhat engineer
Source 1, source 2
● host a private docker registry (so you don’t download random code from random people on internet)
● use HTTPS and be your own certificate authority and trust it on your docker hosts
● use registry version 2 and apply ACLs on images
○ URLs in v2 look like /v2/<name>/blobs/<digest>
● use HTTP Basic Auth (apache/nginx) with whatever back-end you like (ex. LDAP or just plain files)
● have a read-only user as your "deployer" on servers
● have a build server to push images (not developers)
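Because v2 URLs embed the image name, a front-end proxy can enforce per-image ACLs by matching on `<name>`. A sketch of that check (the regex covers blobs/manifests/tags paths; the ACL table and user names are illustrative):

```python
import re

# v2 registry URLs look like /v2/<name>/blobs/<digest> (or /manifests/, /tags/)
V2_RE = re.compile(r"^/v2/(?P<name>.+?)/(?:blobs|manifests|tags)/")

# hypothetical ACL table: user -> action -> allowed image names
ACL = {"deployer": {"pull": {"myrepo/myapp"}}}

def allowed(user, action, url):
    """Return True only if the user may perform the action on the image
    named in the v2 URL."""
    m = V2_RE.match(url)
    if not m:
        return False
    return m.group("name") in ACL.get(user, {}).get(action, set())

print(allowed("deployer", "pull", "/v2/myrepo/myapp/blobs/sha256:abcd"))  # True
print(allowed("deployer", "push", "/v2/myrepo/myapp/blobs/sha256:abcd"))  # False
```

In practice you'd express the same rules as apache/nginx location blocks in front of the registry rather than in application code.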
Host your own private registry
“Containers do not contain.”
-- Dan Walsh (Redhat / SELinux)

Seriously?
Docker on production!
In May 2015, a catastrophic vulnerability (VENOM) affected kvm/xen in almost every datacenter.
Fedora/RHEL/CentOS had been secure because of SELinux/sVirt (since 2009)
AppArmor was a joke that is not funny.
http://www.zdnet.com/article/venom-security-flaw-millions-of-virtual-machines-datacenters/
https://fedoraproject.org/wiki/Features/SVirt_Mandatory_Access_Control
Docker and The next Venom?
sVirt does support Docker.
What happens in a container stays in the container.
● Drop privileges as quickly as possible● Run your services as non-root whenever possible
○ apache needs root to open port 80, but you are going to proxy the port anyway, so run it as non-root directly
● Treat root within a container as if it is root outside of the container
● do not give CAP_SYS_ADMIN to a container (it’s equivalent to host root)
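"Drop privileges as quickly as possible" looks like this in a container entry point; a minimal sketch (the uid/gid 1000 values are examples, and the function deliberately no-ops when not running as root):

```python
import os

def drop_privileges(uid=1000, gid=1000):
    """Drop root privileges early in the process lifetime.
    Order matters: supplementary groups, then gid, then uid,
    because after setuid() we no longer have permission to setgid()."""
    if os.getuid() != 0:
        return os.getuid()  # already unprivileged, nothing to do
    os.setgroups([])        # drop supplementary groups
    os.setgid(gid)
    os.setuid(uid)
    return os.getuid()

print(drop_privileges())
```

After this, opening port 80 would fail, which is fine: proxy the port from the host and bind the service to an unprivileged port instead.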
Recommendations
Setting proper storage backend
● docker info | grep 'Storage Driver'
● possible drivers/backends:
○ aufs: a union filesystem of such low quality that it was never part of the official linux kernel
○ overlay: a modern union filesystem that was accepted in kernel 4.0 (too young)
○ zfs: linux port of the well-established solaris filesystem; the quality of the port and driver is still questionable
○ btrfs: the most featureful linux filesystem; too early to be on production
○ devicemapper (thin provisioning): well-established redhat technology (already in production, ex. LVM)
● do not use the loopback default config in EL (RHEL/CentOS/Fedora)
○ WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use
● in EL edit /etc/sysconfig/docker-storage
● http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
● http://www.projectatomic.io/blog/2015/06/notes-on-fedora-centos-and-docker-storage-drivers/
● http://www.projectatomic.io/docs/docker-storage-recommendation/
Storage backend (using script)

man docker-storage-setup
vim /etc/sysconfig/docker-storage-setup
docker-storage-setup

● DEVS="/dev/sdb /dev/sdc"
○ list of unpartitioned devices to be used or added
○ if you are adding more, remove old ones
○ required if VG is specified and does not exist
● VG="<my-volume-group>"
○ set to empty to use unallocated space in root's VG
Storage backend (manual)

pvcreate /dev/sdc
vgcreate direct-lvm /dev/sdc
lvcreate --wipesignatures y -n data direct-lvm -l 95%VG
lvcreate --wipesignatures y -n metadata direct-lvm -l 5%VG
dd if=/dev/zero of=/dev/direct-lvm/metadata bs=1M
vim /etc/sysconfig/docker-storage # to add the next line

DOCKER_STORAGE_OPTIONS=--storage-opt dm.metadatadev=/dev/direct-lvm/metadata --storage-opt dm.datadev=/dev/direct-lvm/data

systemctl restart docker
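The `-l 95%VG` / `-l 5%VG` flags above split the volume group's extents between the data and metadata LVs. The arithmetic, just to make the split concrete (extent counts are examples):

```python
def split_vg(total_extents, data_pct=95):
    """Mirror lvcreate's -l 95%VG / -l 5%VG split: given the total
    number of physical extents in the VG, return (data, metadata)
    extent counts, rounding down like LVM does."""
    data = total_extents * data_pct // 100
    meta = total_extents * (100 - data_pct) // 100
    return data, meta

# e.g. a VG with 2560 4MiB extents (~10 GiB)
print(split_vg(2560))  # (2432, 128)
```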
Docker Volumes

Never put data inside the container (logs, database files, etc.). Data should go to mounted volumes.
You can mount folders or files. You can mount RW or RO.
You can have a busybox container with volumes and mount all volumes of that container in another container.
# docker run -d --volumes-from my_vols --name db1 training/postgres
Everything is a child process of a single daemon. Seriously!
Seriously? Docker on production!
Docker process model is flawed

The docker daemon launches containers as attached child processes. If the daemon dies, all of them will collapse in a fatal catastrophe. Moreover, the docker daemon has too many moving parts; for example, fetching images is done inside the daemon. A bad network while fetching an image, or an evil image, might collapse all containers.
https://github.com/docker/docker/issues/15328

An evil client, an evil request, an evil image, an evil container, or an evil "inspect" template might cause the docker daemon to go crazy and risk all containers.
Docker process model is flawed

CoreOS introduced a saner process model in rkt (Rocket), an alternative docker-like container runtime. RedHat contributes to both docker and rkt, as both have high potential. Rkt is just a container runtime where you can run containers as non-root and without being a child of anything (ex. relying on systemd/D-Bus). Rocket is not a platform (no layers, no image registry service, etc.)
https://github.com/coreos/rkt/

Docker might evolve to fix this; dockerlite is a shell script that uses LXC and BTRFS.
https://github.com/docker/dockerlite
For now, just design your cluster to fail and use anti-affinity.
Networking.
Linux bridges, iptables NATing, and exported ports via a young proxy written in golang. Seriously!
Seriously? Docker on production!
Docker Networking now

Docker uses Linux bridges, which only connect containers within the same host. Containers on host A can't talk to containers on host B! And it uses NAT to talk to the outside world:

# iptables -t nat -A POSTROUTING -s 172.17.0.0/16 -j MASQUERADE

Exported ports in docker are done via a docker-proxy process (written in go); check "netstat -tulnp".
The deprecated geard used to connect multiple hosts using NAT and configured each container to talk to localhost for anything (ex. talk to localhost MySQL and NAT will take it to the MySQL container on another host):

# iptables -t nat -A PREROUTING -d ${local_ip}/32 -p tcp -m tcp --dport ${local_port} -j DNAT --to-destination ${remote_ip}:${remote_port}
# iptables -t nat -A OUTPUT -d ${local_ip}/32 -p tcp -m tcp --dport ${local_port} -j DNAT --to-destination ${remote_ip}:${remote_port}
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source ${container_ip}
Docker Networking now

A similar approach is to manually hard-code and divide the docker bridges on each host as 172.16.X.y, where X is the host and y is the container, and use NAT to deliver packets (or 172.X.y.y, depending on the number of hosts and the number of containers on each host).
http://blog.sequenceiq.com/blog/2014/08/12/docker-networking/
Given a remote host with IP 192.168.40.12 and its docker0 bridge on 172.17.52.0/24, and a host with docker0 on 172.17.51.0/24, on the latter host type:

route add -net 172.17.52.0 netmask 255.255.255.0 gw 192.168.40.12
iptables -t nat -F POSTROUTING # or pass "--iptables=false" to docker daemon
iptables -t nat -A POSTROUTING -s 172.17.51.0/24 ! -d 172.17.0.0/16 -j MASQUERADE
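The addressing scheme above carves per-host /24 bridge subnets out of the shared 172.17.0.0/16 range; host X gets 172.17.X.0/24. A quick sketch of that carving with Python's stdlib `ipaddress` module:

```python
import ipaddress

def bridge_subnet(host_index, base="172.17.0.0/16"):
    """Return the /24 bridge subnet for a given host index, carved
    out of the shared base range: host 51 -> 172.17.51.0/24."""
    net = ipaddress.ip_network(base)
    return list(net.subnets(new_prefix=24))[host_index]

print(bridge_subnet(51))  # 172.17.51.0/24
print(bridge_subnet(52))  # 172.17.52.0/24
```

Each host's docker daemon is then started with its own subnet (e.g. `--bip`), and the route/iptables rules above stitch the subnets together.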
Docker Networking Alternatives
● OpenVSwitch (well-established production technology)
● Flannel (young project from CoreOS written in golang)
● Weave (https://github.com/weaveworks/weave)
● Calico (https://github.com/projectcalico/calico)
Docker Networking Alternatives

OpenVSwitch: just like a physical switch, this virtual switch connects different hosts.

One setup would be connecting each container to OVS without a bridge: "docker run --net=none", then use the ovs-docker script.

The other setup just replaces the docker0 bridge with one that is connected to OVS (no change needs to be done to each container).
Docker Networking Alternatives

# ovs-vsctl add-br sw0

or /etc/sysconfig/network-scripts/ifcfg-sw0, then

# ip link add veth_s type veth peer name veth_c
# brctl addif docker0 veth_c
# ovs-vsctl add-port sw0 veth_s

see /etc/sysconfig/network-scripts/ifup-ovs
http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob_plain;f=rhel/README.RHEL;hb=HEAD
Networking: the future

In the future, libnetwork will allow docker to use SDN plugins. Docker acquired SocketPlane to implement this.
https://github.com/docker/libnetwork
https://github.com/docker/libnetwork/blob/master/ROADMAP.md
Introducing Docker Glue
● docker-glue - a modular pluggable daemon that can run handlers and scripts
● docker-balancer - a standalone daemon that just updates haproxy (a special case of glue)
https://github.com/muayyad-alsadi/docker-glue
autoconfigures haproxy to pass traffic to your containers
uses docker labels ("-l") to specify the http host or url prefix
# docker run -d --name wp1 -l glue_http_80_host='wp1.example.com' mywordpress/wordpress
# docker run -d --name wp2 -l glue_http_80_host='wp2.example.com' mywordpress/wordpress
# docker run -d --name panel -l glue_http_80_host=example.com -l glue_http_80_prefix=dashboard/ myrepo/control-panel
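Conceptually, a balancer like docker-glue turns those labels into haproxy routing rules. A sketch of that translation; the output format here is an assumption for illustration, not docker-glue's actual template:

```python
def haproxy_acls(containers):
    """Given {container_name: labels}, emit one haproxy ACL plus a
    use_backend rule per container that carries a glue_http_80_host
    label. Containers without the label are skipped."""
    lines = []
    for name, labels in containers.items():
        host = labels.get("glue_http_80_host")
        if host:
            lines.append("acl is_%s hdr(host) -i %s" % (name, host))
            lines.append("use_backend bk_%s if is_%s" % (name, name))
    return lines

for line in haproxy_acls({"wp1": {"glue_http_80_host": "wp1.example.com"},
                          "wp2": {"glue_http_80_host": "wp2.example.com"}}):
    print(line)
```

The daemon would regenerate this fragment on every docker start/stop event and reload haproxy.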
Introducing Docker Glue

Run anything based on docker events (test.ini):
[handler]
class=DockerGlue.handlers.exec.ScriptHandler
events=all
enabled=1
triggers-none=0

[params]
script=test-handler.sh
demo-option=some value

# it will run
test-handler.sh /path/to/test.ini <EVENT> <CONTAINER_ID>
Introducing Docker Glue

#! /bin/bash
cd `dirname $0`

function error() {
    echo "$@"
    exit 1
}

[ $# -ne 3 ] && error "Usage: `basename $0` config.ini status container_id"
ini="$1"
status="$2"
container_id="$3"
ini_demo_option=$( crudini --inplace --get $ini params demo-option 2>/dev/null || : )
echo "`date +%F` container_id=[$container_id] status=[$status] ini_demo_option=[$ini_demo_option]" >> /tmp/docker-glue-test.log
Resources
● http://opensource.com/business/14/7/docker-security-selinux
● http://opensource.com/business/14/9/security-for-docker
● http://www.projectatomic.io/blog/2014/09/yet-another-reason-containers-don-t-contain-kernel-keyrings/
● http://developerblog.redhat.com/2014/11/03/are-docker-containers-really-secure-opensource-com/
● https://www.youtube.com/watch?v=0u9LqGVK-aI
● https://github.com/muayyad-alsadi/docker-glue
● http://blog.sequenceiq.com/blog/2014/08/12/docker-networking/
● https://docs.docker.com/userguide/dockervolumes/
● https://docs.docker.com/userguide/dockerlinks/
● https://docs.docker.com/articles/networking/
● https://github.com/openvswitch/ovs/blob/master/INSTALL.Docker.md
● http://radar.oreilly.com/2015/10/swarm-v-fleet-v-kubernetes-v-mesos.html
Q & A