DockerCon EU 2015: Persistent, stateful services with docker clusters, namespaces and docker volume magic


Persistent, stateful services with docker clusters, namespaces and docker volume magic
Michael Neale
Co-founder, CloudBees (that Jenkins company)

Agenda

- Background: use-case for stateful services, docker volumes, quick namespaces revision, nsenter
- Mounts and Volumes: it's all files (part 1), the mount namespace, creating bind mounts, docker volume api (use it!)
- Supercontainers and storage: privileges, it's all files (part 2), controlling the host and peer containers, storage engines
- Stateful docker clusters: "off the shelf" cluster scheduling, the solution chosen, other tools out there, credits…

Background: The Need for Stateful Services

Basis of this presentation:

"... was learned while building an elastic and scalable Jenkins based product for multiple cloud environments, on docker"

"No containers were hurt as part of this production." —Michael Neale

My history with docker

- Ex Red Hat, where I heard about "control groups"
- Starting CloudBees, looking at ways to fairly multi-tenant
- Later would discover (and with much help) use LXC
- Saw a video of Solomon demoing docker and didn't believe it
- Still didn't believe it
- For the longest time didn't believe it

CloudBees & Docker

- Actually spoke about this at DockerCon 2014 (the first one!)
- cgroups -> LXC -> LXC + ZFS copy-on-write
- Like dotCloud - ran a PaaS (as well as CI/CD toolchain)
- In 2014 moved to focus on CI/CD (dotCloud focussed on docker)
- In 2014 moved to adopt docker over LXC (and ZFS)
- Using: Docker Hub (private repos), Private Registry
- Many of our customers are commercial users of docker
- Docker Jenkins plugins: docker hub, build and publish, and many more

Put all the things (OSS and commercial) on docker hub

- I started the "official" jenkins image early on
- updated now ~weekly (with LTS images also)

one MEEELION ??

A stateless cluster of apps is the dream

- But the reality is, many apps still need state, a disk
- Databases, for example
- Hands up who would run Oracle on NFS?

Reality: local disk

- Network filesystems are great*
- But sometimes you need the data close to the processing
- EBS, HDFS, GCP, OpenStack block storage…
- BUT: how to balance this need for local state with "ephemeral" servers
- Servers come and go, need to restore the data (fast)
- Need to back up the data (delta/snapshots - fast)
- Alternatives: SANs (reattach volumes to replacement nodes, some clouds also support this)
- Reason for backups: resilience. Volumes can disappear too.

Current product

- Years of experience with containers
- EC2, ZFS, EBS, LXC
- learn from it to build something new and "turn key" installable, powered by docker
- I accidentally created a cluster scheduler (it happens.. please don't)
- An evolved "pre-docker" system

Aim: a new product

- A distributed Jenkins cluster
- 10000s of "masters", 100000s of elastic build workers
- Utilise "Off The Shelf" expertise based around docker: Mesos, Docker Swarm, Kubernetes
- Work within existing constraints of a lively and evolving open source project
- (this means accepting local disk state… for now)

Additional Constraints

- Only want to depend on docker being present on "worker nodes"
- Off the shelf cluster scheduler
- Use local disk*
- Multiple target clouds to be supported
- Multiple storage "engines" to be supported

* Would love to refactor to DB backed

"Storage engines?"

"The thing that backs up and restores local disks"

eg: EBS (snapshots), rsync, NFS, ZFS send …

Same cluster management, same api, different storage tech for different clouds/needs.

Ensures volumes are backed up in a consistent state (using LVM snapshot, xfs_freeze, as needed)

Docker volumes

- Docker helpfully lets you bind mount to the host
- Giving you a choice of ways to get data to the host
- Containers can remain ephemeral
- However, you need to manage those underlying volumes

Note: you shouldn’t need to do what I did. Use something off the shelf if you can. If you must, there is an excellent docker plugin api and volume plugin api.

Solving local disk with docker

[Sequence: client → cluster scheduler → docker host → storage]
- client requests the app; the scheduler finds a free slot
- the docker host asks storage for the data; storage provides it
- the container is fully running with its data

Using “trickery”

[Sequence: client → cluster scheduler → docker host → storage]
- client requests the app; the scheduler finds a free slot
- the container starts, asks for a dynamic bind mount, and waits
- storage is asked for the data; it provides the data and bind mounts it in
- the container continues, now with its data

With docker volume plugin api

[Sequence: client → cluster scheduler → docker host → storage]
- client requests the app; the scheduler finds a free slot
- docker calls the volume plugin (json) BEFORE the container starts; the plugin provides the data
- the container launches with the bind mount already in place

However: Docker plugin api did not exist yet!

- I had to make do with "trickery"
- Other choices like powerstrip existed, but wanted "standard" docker
- And you are here for namespace trickery
- So let's learn from it…

"Hard work pays off eventually, but laziness pays off right now." —Unknown

Namespaces - really quick…

- Along with cgroups, the "foundational tech" for containers
- 6 types: Mount, UTS, IPC, PID, Network and User
- My favourites: Mount - filesystem stuff (that I used)
- PID, Network and the exciting User namespaces!

https://lwn.net/Articles/531114/

How do we access these namespaces?

- nsenter - command line tool
- nsenter allows you to "enter" a namespace and do something in the context of it
- Available out of the box in many linux distros now

https://github.com/karelzak/util-linux/blob/master/sys-utils/nsenter.c

https://blog.docker.com/tag/nsenter/ 

Mounts and Volumes: It's all files in Linux - part 1

Mount namespace

- Containers don't see all mount points, all devices, just their own
- Allows docker's "bind mount" to work
- A "bind mount" in linux is really an "alternative view of an existing directory tree" (a tiny plain-linux example follows below)
- A docker bind mount takes that "alternative view" and makes it visible to the container (via its mount namespace)
- Magic? No. Linux.
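To make that "alternative view" concrete, a minimal plain-linux sketch (independent of docker); /var/foo and /mnt/view are hypothetical paths and you need root:

mkdir -p /mnt/view
mount --bind /var/foo /mnt/view   # /mnt/view now shows the same directory tree as /var/foo
umount /mnt/view                  # detach the view; /var/foo itself is untouched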

It’s all files, part 1

- Start any container
- Access the docker host and run this to get the pid of the whole container:

docker inspect --format {{.State.Pid}} <container id>

You can then see the 6 namespaces in /proc/<PID>/ns:

ls /proc/7865/ns/
ipc  mnt  net  pid  user  uts

/proc virtual filesystem and nsenter

/proc is a virtual filesystem (http://www.tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html)

Run a command inside a given container's namespace:

nsenter --mount=/proc/$PID/ns/mnt -- /usr/bin/command param

RUN A COMMAND FROM HOST AS IF YOU ARE IN THAT CONTAINER

"With great nsenter power comes great responsibility" —Spiderman's Uncle

Creating a bind mount on a running container ( -v /var/foo:/var/bar )

High level steps:
- Get the underlying device from the host, into the container
- mount the device in the container
- bind mount in the container to the "directory you want"
- unmount the device in the container
- remove the initial mount

What you are left with: a bind mount to the volume on the host you wanted in the first place, and only that path. Not the whole device/volume on host.

You don’t need to do all this yourself, ever!

# Using a device's numbers we can create the same device in the container

# use nsenter to create a device file IN the container (using its $PID):
nsenter --mount=/proc/$PID/ns/mnt -- mknod --mode 0600 /dev/sda1 b 8 0

# Now we have the device ALSO in the container!
# We can mount it (normal linux)
# bind mount to the desired directory (also normal linux)!
# all from the host - a fuller sketch of the remaining steps follows below
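Putting the high level steps together, a minimal sketch of the full sequence, run from the host against the container's mount namespace. It assumes the host device is /dev/sda1 (block major 8, minor 0), the data you want sits at var/foo relative to that filesystem's root, and the container should see it at /var/bar; /tmp/stage is a hypothetical scratch mount point:

# every step runs inside the container's mount namespace, driven from the host
nsenter --mount=/proc/$PID/ns/mnt -- mknod --mode 0600 /dev/sda1 b 8 0          # 1. recreate the device node
nsenter --mount=/proc/$PID/ns/mnt -- mkdir -p /tmp/stage /var/bar
nsenter --mount=/proc/$PID/ns/mnt -- mount /dev/sda1 /tmp/stage                 # 2. mount the whole device on a staging path
nsenter --mount=/proc/$PID/ns/mnt -- mount --bind /tmp/stage/var/foo /var/bar   # 3. bind mount just the directory you want
nsenter --mount=/proc/$PID/ns/mnt -- umount /tmp/stage                          # 4. drop the staging mount
nsenter --mount=/proc/$PID/ns/mnt -- rm /dev/sda1                               #    and the temporary device node

The container now sees only /var/bar, not the whole device - exactly the "what you are left with" described above.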

I told you not to panic!

Now we have a dynamic bind mount

- As if we used -v /var/foo:/var/bar on startup
- Remember: DON'T DO THIS!
- Really: you shouldn't need to do this yourself. Use the docker volume plugin api! (if you must)

Docker plugin API

- Out of process JSON based api (but running on the same host)
- plugins are installed by putting a file in a directory, and referred to by name (minus the extension)
- Well defined JSON protocol

https://docs.docker.com/extend/plugin_api/
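For illustration (paths per the plugin docs of that era; "mydriver" is a made-up name): registration is just a spec file whose name, minus the extension, is the name you refer to the plugin by, containing the address it listens on:

# /etc/docker/plugins/mydriver.spec
unix:///run/docker/plugins/mydriver.sock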

Docker volume plugin API

docker run -v volumename:/data --volume-driver=mydriver ..

- "volumename" is passed to the registered volume-driver (which is listening on http)
- the volume-driver then prepares the data somewhere on the host and returns where it lives (via json)
- docker then bind mounts it in as /data
- All happens BEFORE the container starts

https://docs.docker.com/extend/plugins_volume/
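A rough sketch of that exchange, simulated with curl against the driver's socket (endpoint and field names per the volume plugin docs of the time; the socket path and mountpoint are illustrative):

# what docker sends to the driver just before the container starts
curl --unix-socket /run/docker/plugins/mydriver.sock \
     -H "Content-Type: application/json" \
     -d '{"Name": "volumename"}' \
     http://localhost/VolumeDriver.Mount

# the driver answers with where the prepared data lives on the host, e.g.
# {"Mountpoint": "/var/lib/mydriver/volumename", "Err": ""}
# docker then bind mounts that path into the container as /data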

Docker volume plugin API

- Would not require messing with namespaces
- Still allows an out of process "volume service" to take care of messy volume details
- However - DOES require you to register the plugin with docker on the host
- And less terrifying fun than nsenter and namespaces

If you really must

https://github.com/michaelneale/bind-mount-supercontainer

Sample python code that I prototyped this with. Use with care!

Supercontainers and storage engines
Like containers, only more… uh, super…

Supercontainers - concept

- Term came from Red Hat: http://developerblog.redhat.com/2014/11/06/introducing-a-super-privileged-container-concept/
- You have heard of privileged containers?

docker run --privileged ..

- Drops nearly all of the container's restrictions (all capabilities, access to all devices)
- "Super privileged containers" add in even more access to the underlying host…

It’s all files (part 2)

Add in the host root filesystem, docker daemon, and all the rest:

docker run --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /:/media/host \
  my-super-container

- Brings in the docker socket, and the host root as /media/host
- /media/host then contains ALL devices, virtual files, /proc etc

It's all files (part 2)

Why? We can do everything we did with nsenter before, but from WITHIN a "peer container"

- Remember the requirements: vanilla docker, only docker installed on the host
- Use the super-container as an "agent" container, do all the automation you could want
- No need for extra bits on the host box
- Allows using "off the shelf" cluster scheduling (only docker need be installed)

Controlling the host

- The host can be accessed from the super-container via nsenter
- The PID of the host is 1!

eg, from the super-container, get all mounts:

nsenter --mount=/media/host/proc/1/ns/mnt -- cat /proc/mounts

- Runs a command, from the container, on the host (the stuff after "--")
- /media/host lets us get to the host. Even devices.
- Do all the steps as before, but with "nsenter --mount=/media/host/proc/1/ns/mnt" prefixed

Controlling peer containers from supercontainer

- Peers are other "ordinary" containers on the same host as the super container
- Peers can be accessed from the super-container, also via nsenter
- Just like before, we use nsenter with the peer container's $PID
- But prefix it with the host's filesystem:

nsenter --mount=/proc/$PID/ns/mnt -- ..

becomes:

nsenter --mount=/media/host/proc/$PID/ns/mnt -- ..
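For example (a hypothetical check, assuming the docker CLI is present in the super-container image so it can talk to the mounted socket), listing a peer's mounts from inside the super-container:

# the peer's PID as the host sees it, via the bind-mounted docker socket
PID=$(docker inspect --format '{{.State.Pid}}' <peer container id>)

# enter the peer's mount namespace by going through the host's /proc under /media/host
nsenter --mount=/media/host/proc/$PID/ns/mnt -- cat /proc/mounts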

Controlling peer containers

Why?
- Once again, use the super-container as the controlling agent on a host
- Fewer bits to install on the host

Storage engines

- My requirement: multiple implementations for different clouds
- Different clouds have different storage engines
- The super container is a great place to host the volume service
- Different implementations of the service depending on what is on offer
- EBS, NFS, openstack, rsync and more
- This "volume service super-container" is responsible for backup/restore

Storage engines - eg an AWS region

[Diagram: an AWS region with two zones; server A (zone-1) and server B (zone-2) each hold local volumes (vol-1, vol-2); backup requests turn volumes into EBS snapshots, available across both zones.]

Snapshots/backups

- Snapshots are cheap and quick
- Zone resilience
- Volumes (ie: disks) are not as durable as snapshots/backups
- Similar in other platforms: GCP, OpenStack, Azure
- Google compute persistent disks: do allow extra read-only mounts across instances, for redundancy of compute nodes
- In our case: failing over is "restoring from backup" - always test your backups!

Supercontainers - summary

- A useful tool for low level control
- No need to install bits on the host
- Can control peers directly
- Could be a great place to host a docker volume plugin implementation
- (not currently recommended in Docker plugin api docs)

Stateful clusters
Everyone wants to be stateless…

What we built…

.. an elastic and scalable Jenkins based product for multiple cloud environments, on docker

Cluster schedulers/managers

- Remember: I have built schedulers before, would rather not again
- Docker Swarm, Mesos/Marathon, Kubernetes etc
- Some have concepts of volumes
- All can schedule "plain" docker containers
- Super containers can give you a way to get lower level access

What we settled on

- Super containers to implement the volume service
- Support for multiple storage engines for different clouds
- Scheduled via mesos+marathon
- Only docker (+ mesos in this case) required on the hosts
- Why mesos: practical choice for us but not a tight coupling (could mesos be in a super container? probably)
- Using containers for all the things: elastic search nodes, builds, even haproxy
- For us, 5 minute or event based backups/snapshots are fine

Running supercontainers

Eg. marathon: schedule a super container to run on each host.
Constraint on the volume service: one per host; size: the number of servers in the cluster (3 in this case) - see the sketch after the diagram below:

[Diagram: a three-host cluster; each host runs one "vol service" super-container alongside ordinary containers (Jenkins masters, elastic search, haproxy), with one slot left free.]
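A rough sketch of what such a Marathon app definition could look like (all names and values are illustrative, not the product's actual config); the "hostname"/UNIQUE constraint yields at most one volume-service container per host, and instances matches the cluster size:

curl -X POST http://marathon.local:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
    "id": "/vol-service",
    "instances": 3,
    "constraints": [["hostname", "UNIQUE"]],
    "container": {
      "type": "DOCKER",
      "docker": { "image": "example/vol-service", "privileged": true },
      "volumes": [
        { "hostPath": "/var/run/docker.sock", "containerPath": "/var/run/docker.sock", "mode": "RW" },
        { "hostPath": "/", "containerPath": "/media/host", "mode": "RW" }
      ]
    }
  }'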

Working with EBS (an example)

[Sequence: client container → volume service → EBS api]
- client requests a backup
- volume service freezes the filesystem for a consistent snapshot, then initiates the snapshot
- unfreeze; back up the delta, copy to s3

optimisation: use LVM snapshot instead of freeze
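A minimal sketch of that freeze/snapshot step, assuming an XFS volume mounted at /data whose EBS volume id is known (vol-0abc123 is a placeholder); the delta copy to s3 is a separate step not shown:

# quiesce the filesystem so the snapshot is crash-consistent
xfs_freeze -f /data

# kick off the EBS snapshot (EBS snapshots are incremental on the AWS side)
aws ec2 create-snapshot --volume-id vol-0abc123 --description "volume service backup"

# unfreeze straight away; the snapshot completes in the background
xfs_freeze -u /data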

Backups, backups

- Servers are ephemeral
- Servers come and go
- Disks are fallible (even if cloud platforms call them "volumes")
- Workload moves around
- Restore data when the workload is moved to a new location
- Delta backups are used to avoid full copies each time

Cluster schedulers/managers

- Storage awareness is being built in increasingly (Kubernetes volumes, mesos storage awareness)
- Ideal world: your cluster manager will do all this for you. If you live in that world: congrats. Make yourself a cocktail:

My recipe for a no-sugar old fashioned: https://gist.github.com/michaelneale/6034145

“off the shelf” stateful volume tools

- Rexray: use volume plugin api for Amazon EBS, Rackspace and more
- Flocker from ClusterHQ
- Kubernetes volume support
- Apache "Mysos": MySQL service backed up to HDFS on mesos
- Tutum from Docker! has support for persistent volumes
- Watch this space… (changing constantly)

https://docs.clusterhq.com/en/1.4.0/labs/docker-plugin.html

https://github.com/emccode/rexray

Stateful volumes summary

- It is possible with docker
- Avoid doing it yourself if someone else already has
- Using the local filesystem directly does feel a bit like "legacy"
- But it is a reality for some apps (especially database services)
- Lovely to port everything to be stateless, database backed, blobstore backed, but it takes time
- Lean on the capabilities of the underlying platform where you can

Credits

- Jérôme Petazzoni (@jpetazzo) - years of inspirational blog posts and hacks on linux/docker/volumes. And great hair. http://jpetazzo.github.io/2015/01/13/docker-mount-dynamic-volumes/ - BTW Jerome - it works for real!
- Red Hat for Super Container concepts: Daniel Walsh: http://developerblog.redhat.com/2014/11/06/introducing-a-super-privileged-container-concept/
- Trevor Jay from Red Hat for some final namespace tips: https://securityblog.redhat.com/author/tjay/
- I really just mashed up the above concepts: https://michaelneale.blogspot.com.au/2015/02/mounting-devices-host-from-super.html

@jpetazzo’s hair - imminent singularity?


Thank you!
Michael Neale
@michaelneale
mneale@cloudbees.com
