TRANSCRIPT
Toward 10,000 Containers on OpenStack
Ricardo Rocha, Spyros Trigazis (CERN)
Ton Ngo, Winnie Tsang (IBM)
Talk outline
1. Introduction
2. Benchmarks
3. CERN Cloud result
4. CNCF Cloud result
5. Conclusion
• Acknowledgements:
• CERN cloud team
• CNCF Lab
• IBM team: Douglas Davis, Simeon Monov
• Rackspace team: Adrian Otto, Chris Hultin, Drago Rosson
• Many thanks to the Magnum team for all the progress
About OpenStack Magnum
• Mission: management service for container infrastructure
• Create / configure nodes (VM / baremetal), networking, storage
• Deep integration with OpenStack services
• Lifecycle operations on clusters
• Native container API
• Current support:
• Kubernetes
• Swarm
• Mesos
Newton and Upcoming Release
• Newton features:
• Cluster and drivers refactoring
• Documentation: user guide, installation guide
• Baremetal: Kubernetes cluster
• Storage: Cinder volume, Docker storage
• Networking: decoupled LBaaS, floating IPs, Flannel overlay network
• Distro: openSUSE
• Internal: asynchronous operations, certificate DB storage, notifications, rollback
• Upcoming release:
• Heterogeneous clusters
• Cluster upgrades
• Advanced container networking
• Additional drivers: DC/OS, further baremetal support
Benchmarks
Rally: an OpenStack benchmark tool
• Easily extended by plugins
• Test results in HTML reports
• Used by many projects
• Context: set up the environment
• Scenario: run the benchmark
• Recommended for a production service, to verify that the service behaves as expected at all times
[Diagram: Rally drives a Kubernetes cluster, creating pods and containers, and produces an HTML report]
Rally Plugin for Magnum
Scenarios for clusters:
• Create and list clusters (supports k8s, Swarm and Mesos)
• Create and list cluster templates
Scenarios for containers:
• Create and list pods (k8s)
• Create and list replication controllers (k8s)
• Create and list containers (Swarm)
• Create and list apps (Mesos)
Sample Rally input task files
---
MagnumClusters.create_and_list_clusters:
  -
    args:
      node_count: 4
    runner:
      type: "constant"
      times: 10
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"
---
K8sPods.create_and_list_pods:
  -
    args:
      manifest: "artifacts/nginx.yaml.k8s"
    runner:
      type: "constant"
      times: 20
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"
      clusters:
        node_count: 2
      ca_certs:
        directory: "/home/stack"
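Rally also accepts JSON task files, so for parameter sweeps it can be convenient to generate them programmatically. A minimal stdlib-only sketch that emits the cluster task above as JSON (the output file name is arbitrary):

```python
import json

# Same Magnum benchmark task as the YAML above, built as a Python dict.
# Rally accepts JSON task files as well as YAML ones.
task = {
    "MagnumClusters.create_and_list_clusters": [{
        "args": {"node_count": 4},
        "runner": {"type": "constant", "times": 10, "concurrency": 2},
        "context": {
            "users": {"tenants": 1, "users_per_tenant": 1},
            "cluster_templates": {
                "image_id": "fedora-atomic-latest",
                "external_network_id": "public",
                "dns_nameserver": "8.8.8.8",
                "flavor_id": "m1.small",
                "docker_volume_size": 5,
                "network_driver": "flannel",
                "coe": "kubernetes",
            },
        },
    }]
}

with open("magnum_task.json", "w") as f:
    json.dump(task, f, indent=2)
```

It can then be run with `rally task start magnum_task.json`, and the HTML report rendered with `rally task report --out report.html`.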
Google/Kubernetes benchmark
Steady-state performance in a large Kubernetes cluster
• Create a Kubernetes cluster with 800 vcpus (e.g. 200 nodes x 4 cpus)
• Requires a DNS service: SkyDNS for k8s <= 1.2, embedded in newer releases
• Launch nginx pods serving millions of HTTP requests per second
• The load bots and the service pods can be scaled as needed
• Google has published the configuration and result data, so we can compare with their results
[Diagram: load bots driving nginx pods in the Kubernetes cluster at millions of requests/sec]
CERN Cloud result
CERN OpenStack Infrastructure
Production since 2013
~190,000 cores, ~4 million VMs created, ~200 VMs created / hour
CERN Container Use Cases
• Batch processing
• End user analysis / Jupyter Notebooks
• Machine Learning / TensorFlow / Keras
• Infrastructure Services
• Data Movement, Web Servers, PaaS, ...
• Continuous Integration / Deployment
• And many others...
CERN Magnum Deployment
• Integrate containers in the CERN cloud
• Shared identity, networking integration, storage access, …
• Agnostic to container orchestration engines• Docker Swarm, Kubernetes, Mesos
• Fast, easy to use
Timeline:
• 11/2015: Container investigations, Magnum tests
• 02/2016: Pilot service deployed
• 10/2016: Production service, Mesos support
• Ongoing: CERN / HEP service integration (networking, CVMFS, EOS), upstream development
CERN Magnum Deployment• Clusters are described by cluster templates• Shared/public templates for most common setups, customizable by users
$ magnum cluster-template-list
+------+---------------+
| uuid | name          |
+------+---------------+
| .... | swarm         |
| .... | swarm-ha      |
| .... | kubernetes    |
| .... | kubernetes-ha |
| .... | mesos         |
| .... | mesos-ha      |
+------+---------------+
CERN Magnum Deployment
$ magnum cluster-create --name myswarmcluster --cluster-template swarm --node-count 100
$ magnum cluster-list
+------+----------------+------------+--------------+-----------------+
| uuid | name           | node_count | master_count | status          |
+------+----------------+------------+--------------+-----------------+
| .... | myswarmcluster | 100        | 1            | CREATE_COMPLETE |
+------+----------------+------------+--------------+-----------------+
$ $(magnum cluster-config myswarmcluster --dir magnum/myswarmcluster)
$ docker info / ps / ...
$ docker run --volume-driver cvmfs -v atlas.cern.ch:/cvmfs/atlas -it centos /bin/bash
[root@32f4cf39128d /]#
CERN Benchmark Setup
• Setup in one dedicated cell
• 240 hypervisors
• Each with 32 cores, 64 GB RAM, 10Gb links
• Container images stored in Cinder volumes, in our Ceph cluster
• Default today in Magnum
• Deployed / configured using Puppet (as with all our production setup)
• Magnum / Heat setup
• Dedicated controller(s), in VMs
• Dedicated RabbitMQ, clustered, in VMs
• Dropped explicit Neutron resource creation
• Floating IPs, ports, private networks, LBaaS
CERN Results
• Several iterations before arriving at a reliable setup
• First run: 2 million requests / sec
• Bay of 200 nodes (400 cores, 800 GB RAM)
[Graphs: first tests with ~100/200 node bays; large tests with up to 1000 node bays]
CERN Results
• Services coped with the request increase
• x4 in Nova, x8 in Cinder, unchanged in Keystone
• Almost business as usual... though:
• Keystone stores a revocation tree (in memcache)
• Populated on every project/user/trustee creation
• And checked on every token validation
• -> Network traffic concentrated in one cache node (shard)
• -> >12 second average request time vs the usual 3 ms
CERN Results
• Second run: Rally and 7 million requests / sec
• Kubernetes: 7 million requests / sec
• 1000 node clusters (4000 cores, 8000 GB RAM)
• Lots of iterations! Examples: scale the Magnum conductor, deploy Barbican
Cluster Size (Nodes)   Concurrency   Deployment Time (min)
2                      50            2.5
16                     10            4
32                     10            4
128                    5             5.5
512                    1             14
1000                   1             23
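The per-node cost implied by these numbers can be read off directly; a small sketch (data taken from the table above) showing the roughly linear regime beyond 128 nodes:

```python
# Deployment times from the table above: cluster size (nodes) -> minutes.
times = {2: 2.5, 16: 4, 32: 4, 128: 5.5, 512: 14, 1000: 23}

# Incremental minutes per extra node between the larger runs, where
# Heat stack deployment scales roughly linearly.
slope_128_512 = (times[512] - times[128]) / (512 - 128)
slope_512_1000 = (times[1000] - times[512]) / (1000 - 512)

print(round(slope_128_512, 3))    # ~0.022 min per extra node
print(round(slope_512_1000, 3))   # ~0.018 min per extra node
```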
CERN Tuning
• Heat
• Timeouts when contacting RabbitMQ
• Large stack deletion sometimes needs multiple tries
• Magnum
• 'Too many open files'
• 503s: scale the conductor
• RabbitMQ instabilities
• Flannel network config
• Keystone
• Revocation tree can cause some scalability issues
ulimit -n 4096
max_stacks_per_tenant: 10000 was 100
max_template_size: 5242880 (*10 previous)
max_nested_stack_depth: 10 (was 5)
engine_life_check_timeout: 10 (was 2)
rpc_poll_timeout: 600 (was 1)
rpc_response_timeout: 600 (was 60)
rpc_queue_expiration: 600 (was 60)
disabled memcache
Deployed Barbican
Downgrade to 3.3.5
--labels flannel_network_cidr=10.0.0.0/8,\
  flannel_network_subnetlen=22,\
  flannel_backend=vxlan
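The labels above tell Flannel to carve per-node /22 subnets out of 10.0.0.0/8; a quick stdlib check of what that address plan allows:

```python
import ipaddress

# flannel_network_cidr=10.0.0.0/8, flannel_network_subnetlen=22:
# each cluster node gets its own /22 out of the /8.
network = ipaddress.ip_network("10.0.0.0/8")
subnetlen = 22

node_subnets = 2 ** (subnetlen - network.prefixlen)   # /22s inside the /8
addrs_per_node = 2 ** (32 - subnetlen)                # addresses in each /22

print(node_subnets)     # 16384 possible node subnets
print(addrs_per_node)   # 1024 addresses per node subnet
```

So this configuration comfortably covers even the 1000 node clusters used in the tests.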
CERN Tuning (continued)
• Cinder
• Slow deletion triggering Heat stack deletion timeouts
• Heat engine issues (too many retries, timeouts)
• Make Cinder optional? Lots of traffic with high-load apps!
• Heat stack deployment scaling linearly
• For large stacks, >128 nodes
• Summary of a 1000 node cluster: 1003 stacks, 22000 resources, 47000 events
• That's ~70000 records in the Heat DB for one stack
• Heat: Performance Scalability Improvements - Thu 27th 11:50 am
• Flannel backend tests
• udp: ~450 Mbit/s, vxlan: ~920 Mbit/s, host-gw: ~950 Mbit/s
• Change the default? We set vxlan at CERN right now
CNCF Cloud Result
CNCF Benchmark Setup
• Granted access 1 month ago; built with OpenStack-Ansible on the Newton release
• Ongoing scalability study for Magnum, Heat and COEs
• Hardware configuration:
• 2x Intel E5-2680v3 12-core
• 128 GB RAM
• 2x Intel S3610 400GB SSD
• 10x Intel 2TB NL-SAS HDD
• 1x QP Intel X710
• Cinder configured with the LVM driver, disabled later
• Neutron configured with Linux bridge
[Diagram: ha-proxy in front of 5 controllers, 3 Neutron controllers and 90 computes]
CNCF results
Two rounds of tests:
• 35 node cluster with one master, 24 cores and 120 GB of RAM per node (840 cores total)
• 80 node cluster with one master, 24 cores and 120 GB of RAM per node (1920 cores total)
• Flannel backend configuration (host-gw or udp) vs vxlan at CERN
Nodes   Containers   Reqs/sec   Latency   Flannel
35      1100         1M         83.2 ms   udp
80      1100         1M         1.33 ms   host-gw
80      3100         3M         26.1 ms   host-gw
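Dividing the totals in the table gives the per-container request rate, which stays roughly constant across the runs (a quick check on the numbers above):

```python
# (nodes, containers, total reqs/sec) from the CNCF results table.
runs = [(35, 1100, 1_000_000), (80, 1100, 1_000_000), (80, 3100, 3_000_000)]

per_container = [round(rps / containers) for _, containers, rps in runs]
print(per_container)   # each container serves roughly 900-970 reqs/sec
```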
Rally data at CNCF
Cluster creation
Cluster Size (Nodes)   Concurrency   Number of Clusters   Deployment Time (min)
2                      10            100                  3.02
2                      10            1000                 Able to create 219 clusters
32                     5             100                  Able to create 28 clusters
512                    1             1                    *
4000                   1             1                    *
Container creation

COE     Cluster Size (Nodes)   Concurrency   Number of Containers   Deployment Time (sec)
K8S     2                      4             8                      2.3
Swarm   2                      4             8                      6.2
Mesos   2                      4             8                      122.0
Tuning at CNCF
• Apply the same improvements discovered at CERN
• Heat tuning
• Cinder decoupling
• Disabled floating IPs to create many large clusters concurrently
• But we need floating IPs for the master node or the load balancer
• Still working on tuning RabbitMQ, adding separate clusters for each service (as at CERN)
• Consider this option in OpenStack-Ansible for large deployments
• Using the database for certificates didn't impact the overall performance
• A reasonable alternative to Barbican
Conclusion
Conclusions
• Scalability:
• Deploy clusters
• Deploy containers
• Steady state: app
• Good:
• Nova and Neutron were solid
• Once the infrastructure is in place, we can match the performance published by Google
• Magnum itself is not a bottleneck: many tuning knobs for building complex clusters
• Needs work:
• Really an OpenStack scaling and stability problem
• Linear scaling in Heat and Keystone (when creating a large number of clusters with UUID tokens, token validation in Keystone becomes too slow)
• Did we hit 10,000 containers? • YES
Best practices: how to avoid the bottlenecks for now
• Tune your OpenStack
• RabbitMQ, Heat
• Consider trade-offs in deploying clusters:
• Local storage or Cinder volume
• Fewer larger nodes or more smaller nodes
• Floating IP per node or not
• Load balancer
• Networking: udp, host-gw
Next steps
• Rerun tests focusing on cluster lifecycle operations
• Rolling upgrades, node retirement / replacement, ...
• Summarize best practices in the Magnum documentation
• Run similar application scaling tests for other COEs
• Swarm 3K, Mesos 50,000 containers in real time
• Decouple Cinder for container storage
• Bugs:
• Floating IP handling, client, state synchronization with Heat
• Long term issue:
• Developers use devstack
• How can we discover bottlenecks and scaling problems in a systematic way?
Thank You
Ricardo [email protected]
Spyros [email protected]@strigazi
Ton Ngo [email protected]@tango245
Winnie [email protected]