TRANSCRIPT
Toward 10,000 Containers on OpenStack
Ricardo Rocha, Spyros Trigazis (CERN)
Ton Ngo, Winnie Tsang (IBM)
Talk outline
1. Introduction
2. Benchmarks
3. CERN Cloud result
4. CNCF Cloud result
5. Conclusion
• Acknowledgements:
• CERN cloud team
• CNCF Lab
• IBM team: Douglas Davis, Simeon Monov
• Rackspace team: Adrian Otto, Chris Hultin, Drago Rosson
• Many thanks to the Magnum team for all the progress
About OpenStack Magnum
• Mission: management service for container infrastructure
• Create / configure nodes (VM / baremetal), networking, storage
• Deep integration with OpenStack services
• Lifecycle operations on clusters
• Native container API
• Current support:
• Kubernetes
• Swarm
• Mesos
Newton and Upcoming Release
• Newton features:
• Cluster and drivers refactoring
• Documentation: user guide, installation guide
• Baremetal: Kubernetes cluster
• Storage: Cinder volume, Docker storage
• Networking: decoupled LBaaS, floating IPs, Flannel overlay network
• Distro: openSUSE
• Internal: asynchronous operations, certificate DB storage, notifications, rollback
• Upcoming release:
• Heterogeneous clusters
• Cluster upgrades
• Advanced container networking
• Additional drivers: DC/OS, further baremetal support
Benchmarks
Rally: an OpenStack benchmark tool
• Easily extended by plugins
• Test results in HTML reports
• Used by many projects
• Context: set up the environment
• Scenario: run the benchmark
• Recommended for a production service, to verify that the service behaves as expected at all times
[Diagram: Rally drives a Kubernetes cluster, creating pods and containers, and produces an HTML report]
Rally Plugin for Magnum
Scenarios for clusters:
• Create and list clusters (supports k8s, Swarm and Mesos)
• Create and list cluster templates
Scenarios for containers:
• Create and list pods (k8s)
• Create and list replication controllers (k8s)
• Create and list containers (Swarm)
• Create and list apps (Mesos)
Sample Rally input task files
---
MagnumClusters.create_and_list_clusters:
  -
    args:
      node_count: 4
    runner:
      type: "constant"
      times: 10
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"
---
K8sPods.create_and_list_pods:
  -
    args:
      manifest: "artifacts/nginx.yaml.k8s"
    runner:
      type: "constant"
      times: 20
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"
      clusters:
        node_count: 2
      ca_certs:
        directory: "/home/stack"
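Rally also accepts JSON task files, so for parameter sweeps it can be convenient to generate them programmatically. A minimal stdlib-only sketch that emits the cluster task above as JSON (the output file name is arbitrary):

```python
import json

# Same Magnum benchmark task as the YAML above, built as a Python dict.
# Rally accepts JSON task files as well as YAML ones.
task = {
    "MagnumClusters.create_and_list_clusters": [{
        "args": {"node_count": 4},
        "runner": {"type": "constant", "times": 10, "concurrency": 2},
        "context": {
            "users": {"tenants": 1, "users_per_tenant": 1},
            "cluster_templates": {
                "image_id": "fedora-atomic-latest",
                "external_network_id": "public",
                "dns_nameserver": "8.8.8.8",
                "flavor_id": "m1.small",
                "docker_volume_size": 5,
                "network_driver": "flannel",
                "coe": "kubernetes",
            },
        },
    }]
}

with open("magnum_task.json", "w") as f:
    json.dump(task, f, indent=2)
```

It can then be run with `rally task start magnum_task.json`, and the HTML report rendered with `rally task report --out report.html`.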
Google/Kubernetes benchmark
Steady-state performance in a large Kubernetes cluster
• Create a Kubernetes cluster with 800 vcpus (e.g. 200 nodes x 4 cpus)
• Requires a DNS service: SkyDNS for k8s <= 1.2, embedded in newer releases
• Launch nginx pods serving millions of HTTP requests per second
• The load bots and the service pods can be scaled as needed
• Google has published the configuration and result data, so we can compare with their results
[Diagram: load bots driving nginx pods in the Kubernetes cluster at millions of requests/sec]
CERN Cloud result
CERN OpenStack Infrastructure
Production since 2013
~190,000 cores, ~4 million VMs created, ~200 VMs created / hour
CERN Container Use Cases
• Batch processing
• End user analysis / Jupyter Notebooks
• Machine Learning / TensorFlow / Keras
• Infrastructure Services
• Data Movement, Web Servers, PaaS, ...
• Continuous Integration / Deployment
• And many others...
CERN Magnum Deployment
• Integrate containers in the CERN cloud
• Shared identity, networking integration, storage access, …
• Agnostic to container orchestration engines• Docker Swarm, Kubernetes, Mesos
• Fast, easy to use
Timeline:
• 11/2015: Container investigations, Magnum tests
• 02/2016: Pilot service deployed
• 10/2016: Production service, Mesos support
• Ongoing: CERN / HEP service integration (networking, CVMFS, EOS), upstream development
CERN Magnum Deployment• Clusters are described by cluster templates• Shared/public templates for most common setups, customizable by users
$ magnum cluster-template-list
+------+---------------+
| uuid | name          |
+------+---------------+
| .... | swarm         |
| .... | swarm-ha      |
| .... | kubernetes    |
| .... | kubernetes-ha |
| .... | mesos         |
| .... | mesos-ha      |
+------+---------------+
CERN Magnum Deployment
$ magnum cluster-create --name myswarmcluster --cluster-template swarm --node-count 100
$ magnum cluster-list
+------+----------------+------------+--------------+-----------------+
| uuid | name           | node_count | master_count | status          |
+------+----------------+------------+--------------+-----------------+
| .... | myswarmcluster | 100        | 1            | CREATE_COMPLETE |
+------+----------------+------------+--------------+-----------------+
$ $(magnum cluster-config myswarmcluster --dir magnum/myswarmcluster)
$ docker info / ps / ...
$ docker run --volume-driver cvmfs -v atlas.cern.ch:/cvmfs/atlas -it centos /bin/bash
[root@32f4cf39128d /]#
CERN Benchmark Setup
• Setup in one dedicated cell
• 240 hypervisors
• Each with 32 cores, 64 GB RAM, 10Gb links
• Container images stored in Cinder volumes, in our Ceph cluster
• Default today in Magnum
• Deployed / configured using Puppet (as with all our production setup)
• Magnum / Heat setup
• Dedicated controller(s), in VMs
• Dedicated RabbitMQ, clustered, in VMs
• Dropped explicit Neutron resource creation
• Floating IPs, ports, private networks, LBaaS
CERN Results
• Several iterations before arriving at a reliable setup
• First run: 2 million requests / sec
• Bay of 200 nodes (400 cores, 800 GB RAM)
[Graphs: first tests with ~100/200 node bays; large tests with up to 1000 node bays]
CERN Results
• Services coped with the request increase
• x4 in Nova, x8 in Cinder, unchanged in Keystone
• Almost business as usual... though:
• Keystone stores a revocation tree (in memcache)
• Populated on every project/user/trustee creation
• And checked on every token validation
• -> Network traffic concentrated in one cache node (shard)
• -> >12 second average request time vs the usual 3 ms
CERN Results
• Second run: Rally and 7 million requests / sec
• Kubernetes: 7 million requests / sec
• 1000 node clusters (4000 cores, 8000 GB RAM)
• Lots of iterations! Examples: scale the Magnum conductor, deploy Barbican
Cluster Size (Nodes)   Concurrency   Deployment Time (min)
2                      50            2.5
16                     10            4
32                     10            4
128                    5             5.5
512                    1             14
1000                   1             23
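The per-node cost implied by these numbers can be read off directly; a small sketch (data taken from the table above) showing the roughly linear regime beyond 128 nodes:

```python
# Deployment times from the table above: cluster size (nodes) -> minutes.
times = {2: 2.5, 16: 4, 32: 4, 128: 5.5, 512: 14, 1000: 23}

# Incremental minutes per extra node between the larger runs, where
# Heat stack deployment scales roughly linearly.
slope_128_512 = (times[512] - times[128]) / (512 - 128)
slope_512_1000 = (times[1000] - times[512]) / (1000 - 512)

print(round(slope_128_512, 3))    # ~0.022 min per extra node
print(round(slope_512_1000, 3))   # ~0.018 min per extra node
```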
CERN Tuning
• Heat
• Timeouts when contacting RabbitMQ
• Large stack deletion sometimes needs multiple tries
• Magnum
• 'Too many open files'
• 503s: scale the conductor
• RabbitMQ instabilities
• Flannel network config
• Keystone
• Revocation tree can cause some scalability issues
ulimit -n 4096
max_stacks_per_tenant: 10000 was 100
max_template_size: 5242880 (*10 previous)
max_nested_stack_depth: 10 (was 5)
engine_life_check_timeout: 10 (was 2)
rpc_poll_timeout: 600 (was 1)
rpc_response_timeout: 600 (was 60)
rpc_queue_expiration: 600 (was 60)
disabled memcache
Deployed Barbican
Downgrade to 3.3.5
--labels flannel_network_cidr=10.0.0.0/8,\
  flannel_network_subnetlen=22,\
  flannel_backend=vxlan
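The labels above tell Flannel to carve per-node /22 subnets out of 10.0.0.0/8; a quick stdlib check of what that address plan allows:

```python
import ipaddress

# flannel_network_cidr=10.0.0.0/8, flannel_network_subnetlen=22:
# each cluster node gets its own /22 out of the /8.
network = ipaddress.ip_network("10.0.0.0/8")
subnetlen = 22

node_subnets = 2 ** (subnetlen - network.prefixlen)   # /22s inside the /8
addrs_per_node = 2 ** (32 - subnetlen)                # addresses in each /22

print(node_subnets)     # 16384 possible node subnets
print(addrs_per_node)   # 1024 addresses per node subnet
```

So this configuration comfortably covers even the 1000 node clusters used in the tests.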
CERN Tuning (continued)
• Cinder
• Slow deletion triggering Heat stack deletion timeouts
• Heat engine issues (too many retries, timeouts)
• Make Cinder optional? Lots of traffic with high-load apps!
• Heat stack deployment scaling linearly
• For large stacks, >128 nodes
• Summary of a 1000 node cluster: 1003 stacks, 22000 resources, 47000 events
• That's ~70000 records in the Heat DB for one stack
• Heat: Performance Scalability Improvements - Thu 27th 11:50 am
• Flannel backend tests
• udp: ~450 Mbit/s, vxlan: ~920 Mbit/s, host-gw: ~950 Mbit/s
• Change the default? We set vxlan at CERN right now
CNCF Cloud Result
CNCF Benchmark Setup
• Granted access 1 month ago; built with OpenStack-Ansible on the Newton release
• Ongoing scalability study for Magnum, Heat and COEs
• Hardware configuration:
• 2x Intel E5-2680v3 12-core
• 128 GB RAM
• 2x Intel S3610 400GB SSD
• 10x Intel 2TB NL-SAS HDD
• 1x QP Intel X710
• Cinder configured with the LVM driver, disabled later
• Neutron configured with Linux bridge
[Diagram: ha-proxy in front of 5 controllers, 3 Neutron controllers and 90 computes]
CNCF results
Two rounds of tests:
• 35 node cluster with one master, 24 cores and 120 GB of RAM per node (840 cores total)
• 80 node cluster with one master, 24 cores and 120 GB of RAM per node (1920 cores total)
• Flannel backend configuration (host-gw or udp) vs vxlan at CERN
Nodes   Containers   Reqs/sec   Latency   Flannel
35      1100         1M         83.2 ms   udp
80      1100         1M         1.33 ms   host-gw
80      3100         3M         26.1 ms   host-gw
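Dividing the totals in the table gives the per-container request rate, which stays roughly constant across the runs (a quick check on the numbers above):

```python
# (nodes, containers, total reqs/sec) from the CNCF results table.
runs = [(35, 1100, 1_000_000), (80, 1100, 1_000_000), (80, 3100, 3_000_000)]

per_container = [round(rps / containers) for _, containers, rps in runs]
print(per_container)   # each container serves roughly 900-970 reqs/sec
```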
Rally data at CNCF
Cluster creation
Cluster Size (Nodes)   Concurrency   Number of Clusters   Deployment Time (min)
2                      10            100                  3.02
2                      10            1000                 Able to create 219 clusters
32                     5             100                  Able to create 28 clusters
512                    1             1                    *
4000                   1             1                    *
Container creation

COE     Cluster Size (Nodes)   Concurrency   Number of Containers   Deployment Time (sec)
K8S     2                      4             8                      2.3
Swarm   2                      4             8                      6.2
Mesos   2                      4             8                      122.0
Tuning at CNCF
• Apply the same improvements discovered at CERN
• Heat tuning
• Cinder decoupling
• Disabled floating IPs to create many large clusters concurrently
• But we need floating IPs for the master node or the load balancer
• Still working on tuning RabbitMQ, adding separate clusters for each service (as at CERN)
• Consider this option in OpenStack-Ansible for large deployments
• Using the database for certificates didn't impact the overall performance
• A reasonable alternative to Barbican
Conclusion
Conclusions
• Scalability:
• Deploy clusters
• Deploy containers
• Steady state: app
• Good:
• Nova and Neutron were solid
• Once the infrastructure is in place, we can match the performance published by Google
• Magnum itself is not a bottleneck: many tuning knobs for building complex clusters
• Needs work:
• Really an OpenStack scaling and stability problem
• Linear scaling in Heat and Keystone (when creating a large number of clusters with UUID tokens, token validation in Keystone becomes too slow)
• Did we hit 10,000 containers? • YES
Best practices: how to avoid the bottlenecks for now
• Tune your OpenStack
• RabbitMQ, Heat
• Consider trade-offs in deploying clusters:
• Local storage or Cinder volume
• Fewer larger nodes or more smaller nodes
• Floating IP per node or not
• Load balancer
• Networking: udp, host-gw
Next steps
• Rerun tests focusing on cluster lifecycle operations
• Rolling upgrades, node retirement / replacement, ...
• Summarize best practices in the Magnum documentation
• Run similar application scaling tests for other COEs
• Swarm 3K, Mesos 50,000 containers in real time
• Decouple Cinder for container storage
• Bugs:
• Floating IP handling, client, state synchronization with Heat
• Long term issue:
• Developers use devstack
• How can we discover bottlenecks and scaling problems in a systematic way?
Thank You
Ricardo [email protected]
Spyros [email protected]@strigazi
Ton Ngo [email protected]@tango245
Winnie [email protected]