
A Ring to Rule Them All - Revising OpenStack Internals to Operate Massively Distributed Clouds

Adrien Lèbre, Jonathan Pastor, Frédéric Desprez

To cite this version:

Adrien Lèbre, Jonathan Pastor, Frédéric Desprez. A Ring to Rule Them All - Revising OpenStack Internals to Operate Massively Distributed Clouds: The Discovery Initiative - Where Do We Are?. [Technical Report] RT-0480, INRIA. 2016, pp. 1-24. <hal-01320235>

HAL Id: hal-01320235

https://hal.inria.fr/hal-01320235

Submitted on 24 May 2016



ISSN 0249-0803    ISRN INRIA/RT--480--FR+ENG

TECHNICAL REPORT N° 480 — February 2016

Project-Team DISCOVERY IPL



RESEARCH CENTRE RENNES – BRETAGNE ATLANTIQUE
Campus universitaire de Beaulieu, 35042 Rennes Cedex

A Ring to Rule Them All - Revising OpenStack Internals to Operate Massively Distributed Clouds

A. Lèbre∗†, J. Pastor∗†, Frédéric Desprez∗

Project-Team DISCOVERY IPL

Technical Report n° 480 — February 2016 — 21 pages

Abstract: The deployment of micro/nano data centers in network points of presence offers an opportunity to deliver a more sustainable and efficient infrastructure for Cloud Computing. Among the different challenges we need to address to favor the adoption of such a model, the development of a system in charge of turning such a complex and diverse network of resources into a collection of abstracted computing facilities that are convenient to administrate and use is critical. In this report, we introduce the premises of such a system. The novelty of our work is that instead of developing a system from scratch, we revised the OpenStack solution in order to operate such an infrastructure in a distributed manner, leveraging P2P mechanisms. More precisely, we describe how we revised the Nova service by leveraging a distributed key/value store instead of the centralized SQL backend. We present experiments that validated the correct behavior of our prototype, while showing promising performance, using several clusters composed of servers of the Grid'5000 testbed. We believe that such a strategy is promising and paves the way to a first large-scale and WAN-wide IaaS manager.

Key-words: Fog, Edge Computing, Peer To Peer, Self-*, Sustainability, Efficiency, OpenStack, Future Internet.

∗ Inria, France, Email: [email protected]
† Mines Nantes/LINA (UMR 6241), France.


Revisiting OpenStack internals to operate massively distributed cloud infrastructures

Résumé: The current trend for supporting the growing demand for utility computing consists in building ever-larger data centers in a limited number of strategic locations. This approach undoubtedly satisfies the current demand while keeping a centralized management of these resources, but it remains far from being able to provide infrastructures that meet current and future requirements in terms of efficiency, jurisdiction, or sustainability. The objective of the DISCOVERY initiative3 is to design the LUC OS, a distributed resource management system able to leverage any network node of the Internet backbone in order to deliver a new generation of utility computing that better takes into account the geographical dispersion of users and their ever-growing demand.

After recalling the objectives of the DISCOVERY initiative and explaining why federation-like approaches are not suited to operate a utility-computing infrastructure embedded in the network, we present the premises of our system. In particular, we explain why and how we chose to start by revisiting the design of the OpenStack solution. From our point of view, building our work on this solution is a sound strategy given the complexity of IaaS management systems and the velocity of open-source solutions.

Mots-clés: Locality-based utility computing, peer-to-peer systems, self-*, sustainability, OpenStack, Future Internet.

3 http://beyondtheclouds.github.io


1 Introduction

To satisfy the escalating demand for Cloud Computing (CC) resources while realizing an economy of scale, the production of computing resources is concentrated in mega data centers (DCs) of ever-increasing size, where the number of physical resources that one DC can host is limited by the capacity of its energy supply and its cooling system. To meet these critical needs in terms of energy supply and cooling, the current trend is toward building DCs in regions with abundant and affordable electricity supplies or taking advantage of free cooling techniques available in regions close to the polar circle [13].

However, concentrating mega-DCs in only a few attractive places raises several issues. First, a disaster1 in these areas would be dramatic for the IT services the DCs host, as the connectivity to CC resources would not be guaranteed. Second, in addition to jurisdiction concerns, hosting computing resources in a few locations leads to unnecessary network overheads to reach each DC. Such overheads can prevent the adoption of Cloud Computing by several kinds of applications, such as mobile computing or Big Data ones.

The concept of micro/nano DCs at the edge of the backbone [14] is a promising solution for the aforementioned concerns. However, operating multiple small DCs somewhat breaks the idea of mutualization in terms of physical resources and administration simplicity, making this approach questionable. One way to enhance mutualization is to leverage existing network centers, starting from the core nodes of the network backbone down to the different network access points (a.k.a. PoPs – Points of Presence) in charge of interconnecting public and private institutions. By hosting micro/nano DCs in PoPs, it becomes possible to mutualize resources that are mandatory to operate network/data centers while delivering widely distributed CC platforms better suited to cope with disasters and to match the geographical dispersal of users. A preliminary study has established the fundamentals of such an in-network distributed cloud, referred to by the authors as the Locality-Based Utility Computing (LUC) concept [3]. However, the question of how to operate such an infrastructure remains open. Indeed, at this level of distribution, latency and fault tolerance become primary concerns, and collaboration between components hosted in different locations must be organized wisely.

In this report, we discuss some key elements that motivate our choices to design and implement the LUC Operating System (LUC OS), a system in charge of turning a LUC infrastructure into a collection of abstracted computing facilities that are as convenient to administrate and use as existing Infrastructure-as-a-Service (IaaS) managers [9, 22, 23]. We explain, in particular, why federated approaches [4] are not satisfactory enough to operate a LUC infrastructure and why designing a fully distributed system makes sense. Moreover, we describe the fundamental capabilities the LUC OS should deliver. Because they are similar to those provided by existing IaaS managers, and because technically speaking it would make no sense to develop the system from scratch, we chose to instantiate the LUC OS concept on top of the OpenStack solution [23].

The main contribution is a proof of concept of the Nova service (the OpenStack compute element) that has been distributed on top of a decentralized key/value store (KVS). By such a means, it becomes possible to operate several geographical sites efficiently with a single OpenStack system. The correct functioning of this proof of concept has been validated through several experiments performed on top of Grid'5000 [1]. In addition to tackling both the scalability and distribution issues of the SQL database, our KVS proposal delivers promising performance: more than 80% of the API requests are served faster than with the SQL backend, without any modification of the Nova code.

1 In March 2014, a large crack was found in the Wanapum Dam, leading to emergency procedures. This hydroelectric plant supports the utility power supply of major data centers in central Washington.

The remainder of this report is organized as follows. Section 2 explains our design choices. Section 3 describes OpenStack and how we revised it. The validation of our prototype, focusing on the Nova service, is presented in Section 4. In Section 5, we present the ongoing work towards a complete LUC OS leveraging the OpenStack ecosystem. Section 6 discusses the opportunities offered by an OpenStack able to operate massively distributed clouds. Finally, Section 7 concludes and discusses future research and development actions.

2 Design Considerations

The massively distributed cloud we target is an infrastructure composed of up to hundreds of micro DCs, which are themselves composed of up to tens of servers. While brokering and federated approaches are generally the solutions that have been investigated to operate such infrastructures, we explain in this section why revising OpenStack with P2P mechanisms is an interesting opportunity.

2.1 From Centralized to Distributed

Federations of clouds are the first approaches considered when it comes to operating and using distinct clouds. Each micro DC hosts and supervises its own CC infrastructure, and a brokering service is in charge of provisioning resources by picking them from each cloud. While federated approaches with a simple centralized broker can be acceptable for basic use cases, advanced brokering services become mandatory to meet the requirements of production environments (monitoring, scheduling, automated provisioning, SLA enforcement, etc.). In addition to dealing with scalability and single-point-of-failure (SPOF) issues, brokering services become increasingly complex, eventually integrating most of the mechanisms that are already implemented by IaaS managers [5, 16]. Consequently, the development of a brokering solution is as difficult as the development of an IaaS manager, but with the additional complexity of relying only on least-common-denominator APIs. While a few standards such as OCCI [20] are starting to be adopted, they do not allow developers to manipulate the low-level capabilities of each system, which is generally mandatory to finely administrate resources. In other words, building mechanisms on top of existing ones, as is the case in federated systems, prevents them from going beyond the provided APIs (or requires intrusive mechanisms that must be adapted to the different systems). The second way to operate a distributed cloud infrastructure is to design and build a dedicated system, i.e., an Operating System, which will define and leverage its own software interface, thus extending the capabilities of traditional Clouds with its API and a set of dedicated tools. Designing a specific system offers an opportunity to go beyond classical federations of Clouds by addressing all crosscutting concerns


of a software stack as complex as an IaaS manager.

The next question is whether collaborations between the mechanisms of the system should be structured in a hierarchical way or in a flat way via a P2P scheme. During the last few years, some hierarchical solutions have been proposed in industry [6, 7] and academia [11, 12]. Although they may look easier at first sight than peer-to-peer structures, hierarchical approaches require additional maintenance costs and complex operations in case of failure. Moreover, mapping and maintaining a relevant tree architecture on top of a network backbone is not meaningful (static partitioning of resources is usually performed). As a consequence, hierarchical approaches do not seem satisfactory for operating a massively distributed IaaS infrastructure such as the one we target. On the other hand, P2P file-sharing systems are a good example of software that works well at large scale in a context where computing/storage resources are geographically spread. While P2P/decentralized mechanisms have been underused for building operating-system mechanisms, they have shown the potential to handle the intrinsic distribution of LUC infrastructures as well as the scalability required to manage them [10].

To summarize, we advocate the development of a dedicated system, i.e., the LUC Operating System, which will interact with low-level mechanisms on each physical server and leverage advanced P2P mechanisms. In the following section, we describe the expected LUC OS capabilities.

2.2 Cloud Capabilities

The LUC OS should deliver a set of high-level mechanisms whose assembly results in a system capable of operating an IaaS infrastructure.

Recent studies have shown that state-of-the-art IaaS managers [24] are built on the same concepts and that a reference architecture for IaaS managers can be defined [21].

This architecture covers the primary services that are needed for building the LUC OS:

• The Virtual Machine manager is in charge of managing the VM life cycle (configuration, scheduling, deployment, suspend/resume, and shutdown).

• The Image manager is in charge of VM template files (a.k.a. VM images).

• The Network manager provides connectivity to the infrastructure: virtual networks for VMs and external access for users.

• The Storage manager provides persistent storage facilities to VMs.

• The Administrative tools provide user interfaces to operate and use the infrastructure.

• Finally, the Information manager monitors the infrastructure data for auditing/accounting purposes.

Thus the challenge is to guarantee, for each of the aforementioned services, decentralized functioning in a fully distributed way. However, as designing


and developing the LUC OS from scratch would be a herculean task, we propose to minimize both design and implementation efforts by reusing successful mechanisms as much as possible, and more concretely by investigating whether a revised version of OpenStack [23] could fulfill the requirements to operate a LUC infrastructure. In other words, we propose to determine which parts of OpenStack can be used directly and which ones must be revised with P2P approaches. This strategy enables us to focus the effort on key issues such as the distributed functioning and the organization of efficient collaborations between the software components composing our revised version of OpenStack, a.k.a. the LUC OS.

3 Revising OpenStack

OpenStack [23] is an open-source project that aims at developing a complete CC management system. Its architecture is comparable to the reference architecture previously described (see Figure 1). Two kinds of nodes compose an OpenStack infrastructure: compute and controller nodes. The former are dedicated to the hosting of VMs while the latter are in charge of executing the OpenStack services.

Figure 1: Core services of OpenStack: Nova (compute manager); Swift and Glance (storage and image managers); Neutron (network manager); Keystone and Horizon (administrative tools, information manager, accounting/auditing).

The OpenStack services are organized following the Shared Nothing principle. Each instance of a service (i.e., a service worker) is exposed through an API accessible through a Remote Procedure Call (RPC) system implemented on top of a messaging queue or via web services (REST). This enables a weak coupling between services. During their life cycle, services create and manipulate logical objects that are persisted in shared databases, thus enabling service workers to easily collaborate while keeping compatibility with the Shared Nothing principle.
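To make the RPC-over-messaging-bus collaboration concrete, the following minimal sketch (not taken from the OpenStack code base; the transport URL, topic, and method name are illustrative) shows how a service worker would invoke another worker through the oslo.messaging library on which OpenStack builds its RPC layer.

```python
# Illustrative sketch of OpenStack-style RPC over the message bus using
# oslo.messaging. The topic and method below are made up for the example;
# real services (e.g., nova-conductor) define their own RPC APIs.
from oslo_config import cfg
import oslo_messaging

# The transport URL points to the message bus (RabbitMQ by default).
transport = oslo_messaging.get_transport(
    cfg.CONF, url='rabbit://guest:guest@controller:5672/')

# A target identifies the topic (queue) a group of service workers listens on.
target = oslo_messaging.Target(topic='illustrative_worker', version='1.0')
client = oslo_messaging.RPCClient(transport, target)

# 'call' blocks until one of the workers consuming the topic returns a result;
# 'cast' would send the request asynchronously (fire-and-forget).
result = client.call({}, 'build_something', request_id='req-42')
print(result)
```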

However, even if this organization of services respects the Shared Nothing principle, the message bus and the fact that objects are persisted in shared databases limit the scalability of the system, as stated in its documentation:

OpenStack services support massive horizontal scale. Be aware that this is not the case for the entire supporting infrastructure. This is particularly a problem for the database management systems and message queues that OpenStack services use for data storage and remote procedure call communications.


As a consequence, revising OpenStack toward a more decentralized and advanced distributed functioning should be carried out along two lines: the distribution of the messaging queue and the distribution of the shared relational databases.

3.1 Decentralizing the AMQP Bus

As indicated, the services composing OpenStack collaborate mainly through an RPC system built on top of an AMQP bus. The AMQP implementation used by OpenStack is RabbitMQ. While this solution is generally articulated around the concept of a centralized master broker, it provides by default a cluster mode that can be configured to work in a highly available fashion. Several machines, each hosting a RabbitMQ instance, work together in an active/active mode where each queue is mirrored on all nodes. While this has the advantage of being simple, it has the drawback of being very sensitive to network latency, and thus it is not relevant for large-scale multi-site configurations. This limitation is well known in the distributed messaging queue community. A few workarounds have been proposed for solutions such as RabbitMQ and, more recently, P2P-like systems such as ActiveMQ [25] or ZeroMQ [15] have been released. Such broker-less solutions satisfy the LUC requirements in terms of scalability, and because a previous effort already showed that it is feasible to replace RabbitMQ with ZeroMQ2, we chose to focus our efforts on the DB challenge.

2 https://wiki.openstack.org/wiki/ZeroMQ (valid in Dec. 2015)
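As a rough illustration of the broker-less model mentioned above (a generic pyzmq example, not the actual OpenStack ZeroMQ driver), two peers can exchange RPC-like messages directly, with no central broker in between; host names and the message format are placeholders.

```python
# Generic broker-less request/reply exchange with ZeroMQ (pyzmq): peers talk
# directly to each other instead of going through a central RabbitMQ broker.
import zmq

def reply_server(bind_addr="tcp://*:5555"):
    """Peer acting as a service worker: answers incoming requests."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(bind_addr)
    while True:
        request = sock.recv_json()            # e.g., {"method": "ping"}
        sock.send_json({"result": "pong", "echo": request})

def request_client(connect_addr="tcp://controller-2:5555"):
    """Peer acting as a caller: sends a request straight to the other peer."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect(connect_addr)
    sock.send_json({"method": "ping"})
    return sock.recv_json()
```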

3.2 Decentralizing the Databases

From today's perspective, most OpenStack deployments involve few compute nodes and do not require more than a single database (DB) node in terms of scalability. Each instance of every service composing the OpenStack infrastructure can collaborate with remote instances by sharing logical objects (inner states) that are usually persisted in a single DB node. The use of a second DB is generally considered to satisfy the high-availability constraint that is mandatory in production infrastructures. In such a context, the OpenStack community recommends the use of at least an active/passive replication strategy: a second DB acts as a failover of the master instance. However, when the infrastructure becomes larger or includes distinct locations, it becomes mandatory to distribute the existing relational DBs over several servers. Two approaches are proposed by the OpenStack community. The first one consists in partitioning the infrastructure into groups called cells, configured as a tree. The top cell is generally composed of one or two nodes (i.e., the top cell does not include compute nodes) and is in charge of redistributing requests to the child cells. Each child cell can be seen as an independent OpenStack deployment with its own DB server and message queue broker. In addition to facing the issues of hierarchical approaches previously discussed (see Section 2.1), we highlight that additional mechanisms are mandatory on top of the vanilla OpenStack code in order to make collaboration between cells possible. Consequently, this approach looks closer to a brokering solution than to the native collaboration within an IaaS management system that we target. The second approach consists in federating several OpenStack deployments through an active/active replication mechanism [18] provided by the Galera solution. By such a means, when an instance of a service processes a request and performs some actions on one site, changes in the inner states stored in the DB are also propagated to all the other DBs of the infrastructure. From a certain point of view, it gives the illusion that there is only one unique DB shared by all OpenStack deployments. Although this technique has been used in production systems, most of them only involve a limited number of geographical sites. Indeed, active replication mechanisms imply important overheads that limit the size of the infrastructure. To sum up, neither the hierarchical approach nor the active replication solution is suited to deal with a massively distributed infrastructure such as the one we target.

While not yet explored for the main OpenStack components, NoSQL databases seem to have more suitable properties for highly distributed contexts, providing better scalability and built-in replication mechanisms. Distributed Hash Tables (DHTs) and, more recently, key/value stores (KVSes) built on top of the DHT concept, such as Dynamo [10], have demonstrated their efficiency in terms of scalability and fault tolerance. The challenge consists in analyzing how OpenStack internals can be revised to manipulate inner states through a KVS instead of a classical SQL system. In the next section we present how we performed such a change for the Nova component.

3.3 From MySQL to REDIS: The Nova POC

As illustrated in Figure 2, the architecture of Nova has been organized in a way that ensures that each of its sub-services does not directly manipulate the DB. Instead, it calls API functions proposed by a service called "nova-conductor". This service forwards API calls to the "db.api" component, which proposes one implementation per database type. Currently, there is only one implementation, which works with relational DBs. This implementation relies on the SQLAlchemy object-relational mapping (ORM), which enables the manipulation of a relational database via object-oriented code.

Figure 2: Nova - Software Architecture and DB dependencies.

Thanks to this ORM, the implementation is not tightly coupled with the relational model. This feature enabled us to develop ROME, a library that exposes the same functions as SQLAlchemy and performs the same actions, but on non-relational DBs. As ROME and SQLAlchemy are very similar, we have been able to copy the existing implementation of "db.api" and to replace every use of SQLAlchemy with a call to ROME, enabling Nova's services to work with a KVS while limiting the number of changes in the original source code.
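The report does not detail ROME's internals, but the idea can be sketched as follows: an SQLAlchemy-like query interface whose objects are serialized into a KVS such as REDIS, with filtering (and, in the real library, joins) performed on the client side. The class and key layout below are hypothetical and only illustrate the principle; this is not the actual ROME API.

```python
# Hypothetical sketch of an ORM-like layer on top of a key/value store, in the
# spirit of ROME (NOT the actual ROME API). Each object is serialized to JSON
# and stored under a key such as "instances:<uuid>"; query filters are
# evaluated client-side, which explains part of the computation and network
# overhead discussed in Section 4.
import json
import redis

class KVSQuery(object):
    def __init__(self, store, table):
        self.store = store          # redis.Redis connection
        self.table = table          # logical "table" name, e.g. "instances"
        self.criteria = {}

    def filter_by(self, **kwargs):  # mimics SQLAlchemy's Query.filter_by()
        self.criteria.update(kwargs)
        return self

    def all(self):
        # Scan every key of the "table" and filter in Python: no server-side
        # joins or WHERE clauses, unlike a relational backend.
        rows = []
        for key in self.store.keys('%s:*' % self.table):
            obj = json.loads(self.store.get(key))
            if all(obj.get(k) == v for k, v in self.criteria.items()):
                rows.append(obj)
        return rows

store = redis.Redis(host='controller-1', port=6379)
store.set('instances:uuid-1', json.dumps({'uuid': 'uuid-1', 'host': 'cpu-3'}))
print(KVSQuery(store, 'instances').filter_by(host='cpu-3').all())
```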

Thanks to this modification, it is possible to deploy an OpenStack infrastructure that works with a natively distributed database, which gives a first glimpse of large multi-site deployments. Figure 3 depicts such a deployment: each geographical site hosts at least one controller node and at least a part of the NoSQL DB, a.k.a. the KVS. Controller nodes collaborate in a flat way thanks to the shared KVS and the shared AMQP bus. The number of controller nodes on each site can vary according to the expected demand created by end users. Finally, a controller node can be deployed either on a dedicated node or be mutualized with a compute node, as illustrated for Site 3. We highlight that any controller node can provision VMs by orchestrating services on the whole infrastructure and not only on the site where it is deployed. Concerning the choice of the NoSQL database, we chose REDIS for our prototype because of its deployment/usage simplicity. However, we are aware that databases that focus on high-availability and partition-tolerance criteria, such as Cassandra [19], could be a good fit as they have already been deployed in production environments.

Figure 3: Nova controllers (in light red) are connected through a shared key/value backend and the AMQP bus. Each controller runs all Nova services (n-sched, n-cond, n-api, n-net, n-cpu, horizon) and can provision VMs on any compute node (in light blue). The deployment spans Sites 1–4; on Site 3, a controller is mutualized with a compute node.

4 Experimental Validation

The validation of our proof of concept has been done via three sets of experiments. The first one aimed at measuring the impact of the use of the REDIS NoSQL solution instead of the MySQL system in a single-site deployment. The second set focused on multi-site scenarios by comparing the impact of the latency on our distributed Nova service with respect to an active/active Galera deployment. Finally, the last experiment showed that higher-level OpenStack mechanisms are not impacted by the use of the REDIS KVS.

All experiments have been performed on Grid'5000 [1], a large-scale and versatile experimental testbed that enables researchers to get access to a large amount of computing resources with very fine control of the experimental conditions. We deployed and configured each node involved in the experiments with a customized software stack (Ubuntu 14.04, a modified version of the OpenStack "DevStack" installer, and REDIS v3) using Python scripts and the Execo toolbox [17]. We underline that we used the legacy network mechanisms integrated in Nova (i.e., without Neutron) and deployed the other components that were mandatory to perform our experiments (in particular Keystone and Glance) on a dedicated node belonging to the first site (entitled master node in Figure 4).

Figure 4: Investigated deployments on top of Grid'5000 and role of each server node: (a) single centralized DB, (b) REDIS key/value store, (c) Galera.

4.1 Impact of REDIS w.r.t. MySQL

4.1.1 Time penalties

Changes made to Nova's source code to support a NoSQL database such as REDIS are likely to affect its reactivity. The first reason is that a KVS does not support operations such as joins, and thus the code we developed to provide such operations creates a computational overhead. The second reason is related to networking: unlike with a single MySQL node, data in a REDIS system is spread over several nodes, so a request can lead to several network exchanges. Finally, REDIS provides a replication strategy to deal with fault tolerance, which can also lead to overheads.

Table 1: Average response time to API requests for a mono-site deployment (in ms).

  Backend configuration   REDIS   MySQL
  1 node                     83      37
  4 nodes                    82       -
  4 nodes + repl             91       -

Table 2: Time used to create 500 VMs on a single-cluster configuration (in sec.).

  Backend configuration   REDIS   MySQL
  1 node                    322     298
  4 nodes                   327       -
  4 nodes + repl            413       -

Table 1 compares the average response times to API requests made during the creation of 500 VMs on an infrastructure deployed over one cluster (containing 1 controller node and 6 compute nodes), using either the REDIS or the MySQL backend under the three aforementioned scenarios. While distributing REDIS over several nodes and enabling the replication feature do not significantly increase the response time (first column), the difference between the average API response time of our KVS approach and the vanilla MySQL code may look critical at first sight (124% higher). However, it must be put in perspective with Figure 5 and Table 2. Figure 5 depicts the statistical distribution of the response time of each API call made during the creation of the 500 VMs. It is noticeable that for a large portion of them (around 80%), our ROME/REDIS solution delivers better performance than the SQLAlchemy/MySQL backend. On the other hand, the 10% slowest API calls are above 222 ms with our proposal whereas they are around 86 ms when using MySQL. Such a difference explains the averages recorded in Table 1, and we need to conduct deeper investigations to identify the kinds of requests involved and how they can be handled more effectively. Overall, we can see that even with these slow requests, the completion time for the creation of 500 VMs remains competitive, as illustrated by Table 2. In other words, some API functions have a more significant impact than others on the VM creation time.

Figure 5: Statistical distribution of Nova API response times (in ms).

Figure 6: Amount of data exchanged per type of node, varying the DB configuration: (a) single centralized DB, (b) 4-node REDIS cluster without replication, (c) 4-node REDIS cluster with replication (1 replica).

4.1.2 Networking penalties

As we target the deployment of an OpenStack infrastructure over a large number of geographical sites linked together through the Internet backbone, the quantity of data exchanged is an important criterion for the evaluation of our solution. In particular, as data is stored in the KVS with an object structure, a serialization/deserialization phase is required when objects are stored/queried. To enable this serialization, some metadata must be added, which leads to a larger data footprint and thus a larger amount of data exchanged between database nodes and OpenStack nodes.
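As a purely illustrative example of this footprint increase (the actual ROME wire format is not described in this report, and the field names below are invented), compare a raw row as an SQL driver would ship it with a self-describing serialized object:

```python
# Toy comparison of payload sizes: a bare row versus a self-describing
# serialized object carrying type/table metadata, as a KVS-based backend
# typically stores. All names and values are invented for illustration.
import json

raw_row = "uuid-1|cpu-3|active|m1.small"            # roughly what SQL ships
serialized = json.dumps({
    "_type": "Instance",                            # extra metadata needed to
    "_table": "instances",                          # deserialize the object
    "uuid": "uuid-1",
    "host": "cpu-3",
    "state": "active",
    "flavor": "m1.small",
})

print(len(raw_row), len(serialized))                # e.g., 28 vs. ~120 bytes
```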

To determine whether the level of network overhead is acceptable or not, networking data has been collected during the previous experiments. Table 3 compares the total amount of data exchanged over the network depending on the database configuration used. As MySQL does not store serialized objects (objects are serialized on the client side by the ORM, so only raw data is exchanged over the network), we consider the single-node MySQL deployment as the optimal solution; it has been measured at 1794 MB. Our solution deployed over a single REDIS node consumes 2190 MB, which means that the networking overhead related to the combination of ROME and REDIS is estimated to be around 22%. Doing the same experiment with a 4-node REDIS cluster without data replication leads to a 33% networking overhead compared to a single MySQL node. Finally, when data replication is enabled with one replica, the amount of data exchanged over the network is 76% higher than without replication, which is expected.

However, the deployed infrastructures contain different kinds of nodes (controllers, compute nodes, database nodes, ...), and the aforementioned measurements take everything into account; they therefore do not give a precise view of the origin of the networking overhead.

Table 3: Amount of data exchanged over the network (in MBytes).

  Backend configuration        REDIS   MySQL
  1 node                        2190    1794
  4 nodes                       2382       -
  4 nodes + repl (1 replica)    4186       -

By using the iftop 3 tool, it has been possible to gather information about the origin and destination of the TCP/IP messages exchanged during the creation of VMs, and thus to get a precise idea of the origin of any networking overhead. Figure 6 depicts the data exchanges as a function of the origin (horizontal axis) and the destination (vertical bars), varying the database configuration. It is noticeable that there is no significant difference in terms of exchange pattern for OpenStack nodes, regardless of the kind of node and regardless of the DB configuration used (either MySQL or REDIS). We can notice slightly different patterns for DB nodes: a comparison of Figures 6(a) and 6(b) confirms that the networking overhead reported in Table 3 comes from the DB nodes. Finally, Figure 6(c) confirms that most of the overhead observed when data replication is enabled is also caused by additional data exchanged between database nodes.

3 http://www.ex-parrot.com/pdw/iftop/

4.2 Multi-site Scenarios

The second experiment we performed consisted in evaluating a single OpenStack deployment over several locations. Our goal was to compare the behaviour of a single-MySQL OpenStack with the advised Galera solution and with our Rome+REDIS proposal. Figure 4 depicts how the nodes have been configured for each scenario. While the deployment of a single MySQL node makes no sense in a production infrastructure, as discussed before, evaluating such a scenario gave us an indication of the maximum performance we can expect. Indeed, in such a scenario the DB is deployed on a single server located in one of the locations, without any synchronization mechanism and consequently without the overhead related to communications with remote DB nodes, contrary to a clustered REDIS or an infrastructure composed of several MySQL DBs synchronized with the Galera mechanism. Moreover, conducting such an experiment at large scale enabled us to see the limits of such a centralized approach.

Regarding the experimental methodology, all executions have been conducted on servers of the same site (Rennes) in order to ensure reproducibility: distinct locations (i.e., clusters) have been emulated by adding latency between groups of servers thanks to the tc Unix tool. Each cluster contained 1 controller node, 6 compute nodes, and one DB node when needed. Scenarios including 2, 4, 6, and 8 clusters have been evaluated, leading to infrastructures composed of up to 8 controllers and 48 compute nodes overall. The latency between clusters has been set to 10 ms and then 50 ms. Finally, in order to evaluate the capability of such infrastructures to distribute the workload over several controllers, and to detect concurrency problems inherent in using a non-relational DB backend, the creation of the 500 VMs has been fairly distributed among the available controllers in parallel.
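The exact commands are not given in the report; a typical way to emulate such inter-cluster latencies with tc/netem from a Python deployment script would look like the following sketch (the interface name and delay value are placeholders, and the real experiments were driven by Execo scripts).

```python
# Sketch of WAN-latency emulation with tc/netem, as used to emulate distinct
# clusters on a single Grid'5000 site. Interface name and delay are
# placeholders; applying the delay only toward specific peer groups would
# additionally require tc filters.
import subprocess

def add_latency(interface="eth0", delay_ms=10):
    """Add a constant egress delay on the given interface (requires root)."""
    subprocess.check_call([
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem", "delay", "%dms" % delay_ms,
    ])

def clear_latency(interface="eth0"):
    """Remove the netem qdisc, restoring normal latency."""
    subprocess.check_call(["tc", "qdisc", "del", "dev", interface, "root"])

if __name__ == "__main__":
    add_latency("eth0", 10)   # emulate a 10 ms one-way delay toward peers
```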

Table 4: Time to create 500 VMs with a 10 ms inter-site latency (in sec.).

  Nb of locations   REDIS   MySQL   Galera
  2 clusters          271     209     2199
  4 clusters          263     139     2011
  6 clusters          229     123     1811
  8 clusters          223     422     1988

Table 4 and Table 5 present the time to create the 500 VMs. As expected, increasing the number of clusters leads to a decrease of the completion time. This is explained by the fact that a larger number of clusters means a larger number of controllers and compute nodes to handle the workload.

The results measured for a 10 ms latency show that our approach takes a rather constant time to create 500 VMs, which stabilizes around 220 seconds. While a single MySQL node gives better results up to 6 clusters, one can see the limitations of a single server with 8 clusters. In such a case, the single MySQL node performs 89% slower than our approach, while the advised Galera solution is 891% slower than our approach.

Table 5: Time to create 500 VMs with a 50 ms inter-site latency (in sec.).

  Nb of locations   REDIS   MySQL   Galera
  2 clusters          723     268        *
  4 clusters          427     203        *
  6 clusters          341     184        -
  8 clusters          302     759        -

With a 50 ms inter-cluster latency, the difference between REDIS and MySQL is accentuated in the 8-cluster configuration, as MySQL is 151% slower than our REDIS approach.

Regarding Galera, it is noteworthy that important issues related to concurrent modifications of the databases appear with a 50 ms latency, preventing many of the 500 VMs from being created (i.e., several bugs occur, leading Nova to consider many VMs as crashed). Such pathological behaviours are due both to the significant latency between clusters and to the burst mode we used to create the 500 VMs (for information, we did succeed in creating the 500 VMs for 2 and 4 clusters, but in a sequential manner).

To summarize, in addition to tackling the distribution issue, the Rome+REDIS couple enables OpenStack to be scalable: the more controllers take part in the deployment, the better the performance.

4.3 Compatibility with Advanced Features

The third experiment aimed at validating the correct behaviour of existing OpenStack mechanisms when using our Rome+REDIS solution. Indeed, in order to minimize the intrusion into the OpenStack source code, modifications have been limited to the nova.db.api component. This component can be considered as the part of Nova that has the most direct interaction with the DB. Limiting the modifications to the source code of this component should enable us to preserve compatibility with existing mechanisms at higher levels of the stack. To empirically validate this assumption, we conducted experiments involving multiple sites and the use of the host-aggregate/availability-zone mechanism (an advanced mechanism of OpenStack that enables the segregation of the infrastructure). As with the aforementioned experiments, our scenario involved the creation of 500 VMs in parallel on a multi-site OpenStack infrastructure deployed on top of Rome+REDIS with data replication activated (the replica factor was set to 1). Two sets of experiments were conducted: one where each node of a given geographical site was a member of the same host aggregate (that is, the same availability zone), and a second set involving a flat multi-site OpenStack (i.e., without defining any availability zone).
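For reference, the host-aggregate/availability-zone mechanism used in this experiment can be driven programmatically. The sketch below assumes `nova` is an already-authenticated python-novaclient client; host names and image/flavor identifiers are placeholders, and the exact way the experiments were scripted is not described in the report.

```python
# Sketch of the host-aggregate/availability-zone segregation used in the
# experiment, via python-novaclient. Host names, image and flavor IDs are
# placeholders; `nova` is assumed to be an authenticated
# novaclient.v2.client.Client instance.

def segregate_site(nova, site_name, compute_hosts):
    """Create one aggregate/AZ per geographical site and register its hosts."""
    aggregate = nova.aggregates.create(site_name, site_name)  # name, AZ
    for host in compute_hosts:
        nova.aggregates.add_host(aggregate, host)
    return aggregate

def boot_on_site(nova, site_name, name, image_id, flavor_id):
    """Ask the scheduler to place the VM on the given site only."""
    return nova.servers.create(
        name=name,
        image=image_id,
        flavor=flavor_id,
        availability_zone=site_name,
    )

# Example usage (placeholders):
# segregate_site(nova, "site-1", ["compute-1.site-1", "compute-2.site-1"])
# boot_on_site(nova, "site-1", "vm-0", IMAGE_ID, FLAVOR_ID)
```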

Experimental results show that the host-aggregate/availability-zone mechanism behaves correctly on top of our proposal. While VMs are correctly balanced according to the locations where they have been started, the flat multi-site deployment led to a non-uniform distribution, with respectively 26%, 20%, 22%, and 32% of the created VMs in a 4-cluster experiment.

5 On-going Work

While the Nova revision we presented in the previous sections is a promising proof of concept toward widely distributed OpenStack infrastructures, remaining challenges still need to be tackled.

In this section, we present two ongoing actions that aim at deepening the relevance of our approach. First, we show that the existing segregation mechanisms provided by OpenStack are not satisfactory when it comes to reducing inter-site communications. In response, two actions could be taken: on the one hand, introducing network locality in the shared databases and the shared messaging; on the other hand, distributing the remaining services of OpenStack.

Second, we discuss the preliminary study we made on the Glance image service to investigate whether it is possible to apply modifications similar to the ones we performed on Nova. Such a validation is critical as Glance is a key element for operating a production-ready IaaS infrastructure.

These two actions clearly demonstrate that our approach is promising enough to favor the adoption of the distributed cloud model supervised by a single system.

5.1 Locality Challenges / Micro DCs Segregation

Deploying a massively multi-site Cloud Computing infrastructure operated by OpenStack is challenging, as communication between nodes of different geographical clusters can be subject to significant network latency, which can be a source of disturbances for OpenStack. The experimental results presented in Table 5 of Section 4.2 clearly showed that an OpenStack distributed on top of our Rome+REDIS solution can already operate over an ISP network with a high inter-site latency (50 ms). While this result is positive and indicates that such a configuration is appropriate for operating a distributed CC infrastructure involving tens of geographical sites, it is important to understand the nature of the network traffic. Table 6 shows the total traffic vs. the traffic between remote sites for a 4-site OpenStack deployment leveraging our Rome+REDIS proposal with the host-aggregate feature. The first line clearly shows that even with the host-aggregate feature enabled, a dramatic proportion of the communications (87.7%) takes place between nodes located in distinct geographical sites.

Table 6: Quantity of data exchanged over the network (in MBytes).

                Total   Inter-site   Proportion
  4 clusters     5326         4672        87.7%

A quarter of these inter-site communications are caused by the isolation of Nova from the other OpenStack services (i.e., Keystone and Glance), which were deployed on a dedicated master node in our experiments. Indeed, operations like serving VM images were naturally a source of artificial inter-site communications. This situation clearly advocates in favor of massively distributing the remaining services, as we did with Nova. Finally, as instances of OpenStack services collaborate via a shared messaging bus and via a shared database, unless these two elements are able to avoid broadcasting information by taking advantage of network locality, the level of inter-site communication will remain large. We are investigating two directions: first, whether the use of a P2P bus such as ZeroMQ [15] can reduce such network overhead; and second, whether the service catalog of Keystone can become locality-aware in order to "hide" redundant services that are located remotely.

5.2 Revising Glance: The OpenStack Image Manager

Similarly to the Nova component (see Section 3.3), only the inner states of Glance are stored in a MySQL DB; the VM images are already stored in a fully distributed way (leveraging either the Swift or the Ceph/Rados solution [26]). Therefore, our preliminary study aimed at determining whether it was possible or not to reuse the ROME library to switch between the SQL and NoSQL backends. As depicted by Figure 7, the Glance code is, from the software engineering point of view, rather close to the Nova one. As a consequence, replacing the MySQL DB by a KVS system did not lead to specific issues. We underline that the replacement of MySQL with REDIS was even more straightforward than for Nova, as Glance allows the configuration of a specific API for accessing persistent data (the data_api option in the Glance configuration file). We are currently validating that each request is correctly handled by Rome. Preliminary performance experiments are planned for the beginning of 2016.

Figure 7: Glance - Software Architecture and DB dependencies.

6 OpenStack for Operating Massively Distributed Clouds

While the advantages and opportunities of massively distributed clouds were emphasized several years ago [8, 14], delivering an OpenStack that can natively be extended to distinct sites will create new opportunities. In this section we present the most important ones and also raise the associated challenges (beyond the technical issues we should resolve to finalize our LUC OS) that our community should tackle to fully exploit the technical capabilities offered by LUC infrastructures.

From the infrastructure point of view, a positive side effect of our revised version of OpenStack is that it natively allows the extension of a private cloud deployment with remote physical resources. Such a scenario is a strong advantage in comparison to current hybrid offers, as it does not break the notion of a single deployment operated by the same tenant. Each time a company faces a peak of activity, it will be possible to provision dedicated servers and attach them to the initial deployment. Those servers can either be provided by dedicated hosting services that have DCs close to the institution or obtained by directly deploying transportable and containerized server rooms close to the private resources. This notion of WAN-wide elasticity can be generalized, as it will be possible to deploy such containerized server rooms whenever and wherever they are needed. As examples, we can envision temporarily deploying IT resources for sporting events such as the Olympic Games or for public safety purposes in case of disasters. Network/telecom operators will also be able to deploy IT resources on the radio base stations they operate in order to deliver Fog/Edge computing solutions. The common thread in these use cases is the possibility of extending an infrastructure wherever needed with additional resources, the only constraint being the ability to connect the different locations with a backbone that offers enough bandwidth and quality of service to satisfy network requirements. The major advantage is that such an extension is completely transparent for the administrators/users of the IaaS solution, because they continue to supervise/use the infrastructure as they are used to. The associated challenge our community should shortly address to deliver such WAN-wide elasticity is related to the automatic installation/upgrade of the LUC OS (i.e., our revised OpenStack software) throughout different locations. In such a context, scalability and geo-distribution make the installation/upgrade process more difficult, as it would probably require relocating computations/data between locations in order to update physical servers without impacting the execution of hosted applications. Another issue to investigate is whether it makes sense to extend a deployment across several network operators and how such extensions can be handled. We underline that even for such an extension (the term federation is probably more appropriate in this situation), the two OpenStack systems will join each other to form a single system. Of course, security issues should also be addressed, but they are beyond the scope of this paper.

From the software point of view, developers will be able to design new applications but also to revise major cloud services in order to deliver more locality-aware management of data, computation, and network resources. For instance, it will be possible to deploy on-demand Content Delivery Network solutions according to specific requirements. Cloud storage services could be revised to mitigate the overheads of transferring data from sources to the different locations where it is needed. New strategies can favor, for instance, a pulling mode instead of a pushing one. Nowadays, data is mostly uploaded to remote clouds without considering whether such data movements are actually solicited or not. We expect that LUC infrastructures will enable data to stay as close as possible to the source that generates it and to be transferred to the other side only when solicited. Such strategies will mitigate the cost of transferring data in social networks, for instance. Similarly, developers will be able to deliver Hadoop-like strategies where computations are launched close to data sources. Such mechanisms will shortly become mandatory to handle the huge amount of data that the Internet of Things will generate. However, delivering the LUC OS will not be sufficient to allow developers to implement such new services. Our community should start as soon as possible to revise and extend current interfaces (a.k.a. Application Programming Interfaces). In particular, new abstractions should allow applications to deal with geo-distribution opportunities and contextual information by using them to specify deployment/reconfiguration constraints or to develop advanced adaptation scenarios in order to satisfy, for instance, SLAs.

Finally, the last opportunity we envision is related to the use of renewable energies to partially power each PoP of a LUC infrastructure. Similarly to the follow-the-moon/follow-the-sun approach, the use of several sites spread across a large territory will offer opportunities to optimize the use of distinct energy sources (solar panels, wind turbines). While such an opportunity has already been underlined [2], the main advantage is once again related to the native capability of our revised OpenStack to federate distinct sites, allowing users to use such a widely distributed infrastructure in a transparent way while enabling administrators to balance resources in order to benefit from green energy sources when available (since the infrastructure is supervised by a single system, we expect that the development of advanced load-balancing strategies across the different servers composing the infrastructure would be simplified).

7 Conclusion

Distributing the management of Clouds is a solution to favor the adoption of the distributed cloud model. In this paper, we presented our view of how such a distribution can be achieved by presenting the premises of the LUC Operating System. We chose to develop it by leveraging the OpenStack solution. This choice presents two advantages: it minimizes the development effort and maximizes the chance of being reused by a large community. As a proof of concept, we presented a revised version of the Nova service that uses a NoSQL backend. We discussed a few experiments validating its correct behavior and showed promising performance over 8 clusters.

Our ongoing activities focus on two aspects. First, we expect to finalize the same modifications on the Glance image service soon and also to start investigating whether such a DB replacement can be achieved for Neutron. We highlight that we chose to concentrate our effort on Glance as it is a key element to operate an OpenStack IaaS platform. Indeed, while Neutron is becoming more and more important, the historical network mechanisms integrated in Nova are still available and intensively used. The second activity studies how the visibility of some objects manipulated by the different controllers deployed throughout the LUC infrastructure can be restrained: our POC manipulates objects that might be used by any instance of a service, no matter where it is deployed. If a user has built an OpenStack project (tenant) that spans only a few sites, then, apart from data replication, there is no need to store objects related to this project on external sites. Restraining the storage of such objects according to visibility rules would save network bandwidth and reduce overheads.


Although delivering an efficient distributed version of OpenStack is a challenging task, we believe that addressing it is the key to go beyond classical brokering/federated approaches and to promote a new generation of cloud computing that is more sustainable and efficient. We are in touch with large groups such as Orange Labs and are currently discussing with the OpenStack Foundation the proposal of the Rome/REDIS library as an alternative to the SQLAlchemy/MySQL couple. Most of the materials presented in this article, such as our prototype, are available on the Discovery initiative website.

Acknowledgments

Since July 2015, the Discovery initiative has mainly been supported through the Inria Project Labs program and the I/O labs, a joint lab between Inria and Orange Labs. Further information at http://www.inria.fr/en/research/research-fields/inria-project-labs

References

[1] D. Balouek, A. Carpen Amarie, G. Charrier, F. Desprez, E. Jeannot, E. Jeanvoine, A. Lebre, D. Margery, N. Niclausse, L. Nussbaum, O. Richard, C. Perez, F. Quesnel, C. Rohr, and L. Sarzyniec. Adding Virtualization Capabilities to the Grid'5000 Testbed. In I. Ivanov, M. Sinderen, F. Leymann, and T. Shan, editors, Cloud Computing and Services Science, volume 367 of Communications in Computer and Information Science, pages 3–20. Springer International Publishing, 2013.

[2] J. L. Berral, I. Goiri, T. D. Nguyen, R. Gavalda, J. Torres, and R. Bianchini. Building green cloud services at low cost. In Proceedings of the 2014 IEEE 34th International Conference on Distributed Computing Systems, ICDCS '14, pages 449–460, Washington, DC, USA, 2014. IEEE Computer Society.

[3] M. Bertier, F. Desprez, G. Fedak, A. Lebre, A.-C. Orgerie, J. Pastor, F. Quesnel, J. Rouzaud-Cornabas, and C. Tedeschi. Beyond the Clouds: How Should Next Generation Utility Computing Infrastructures Be Designed? In Z. Mahmood, editor, Cloud Computing, Computer Communications and Networks, pages 325–345. Springer International Publishing, 2014.

[4] R. Buyya, R. Ranjan, and R. N. Calheiros. InterCloud: Utility-oriented Federation of Cloud Computing Environments for Scaling of Application Services. In 10th Int. Conf. on Algorithms and Architectures for Parallel Processing - Vol. Part I, ICA3PP'10, pages 13–31, 2010.

[5] R. Buyya, R. Ranjan, and R. N. Calheiros. Intercloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services. In Algorithms and Architectures for Parallel Processing, pages 13–31. Springer, 2010.

[6] Cascading OpenStack. https://wiki.openstack.org/wiki/OpenStack_cascading_solution.


[7] Scaling solutions for OpenStack. http://docs.openstack.org/openstack-ops/content/scaling.html.

[8] K. Church, A. G. Greenberg, and J. R. Hamilton. On delivering embarrassingly distributed cloud services. In HotNets, Usenix, pages 55–60. Citeseer, 2008.

[9] CloudStack, Open Source Cloud Computing. http://cloudstack.apache.org.

[10] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.

[11] F. Farahnakian, P. Liljeberg, T. Pahikkala, J. Plosila, and H. Tenhunen. Hierarchical VM Management Architecture for Cloud Data Centers. In 6th International Conf. on Cloud Computing Technology and Science (CloudCom), pages 306–311, Dec 2014.

[12] E. Feller, L. Rilling, and C. Morin. Snooze: A Scalable and Autonomic Virtual Machine Management Framework for Private Clouds. In 12th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid 2012), pages 482–489, 2012.

[13] G. Cook and J. Van Horn. How Dirty is Your Data? Greenpeace International Report, 2013.

[14] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel. The Cost of a Cloud: Research Problems in Data Center Networks. ACM SIGCOMM Computer Communication Review, 39(1):68–73, 2008.

[15] P. Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.

[16] I. Houidi, M. Mechtri, W. Louati, and D. Zeghlache. Cloud Service Delivery Across Multiple Cloud Platforms. In Services Computing (SCC), 2011 IEEE International Conference on, pages 741–742. IEEE, 2011.

[17] M. Imbert, L. Pouilloux, J. Rouzaud-Cornabas, A. Lebre, and T. Hirofuchi. Using the EXECO Toolbox to Perform Automatic and Reproducible Cloud Experiments. In 1st Int. Workshop on UsiNg and building ClOud Testbeds (UNICO), collocated with IEEE CloudCom, Dec. 2013.

[18] B. Kemme and G. Alonso. Database Replication: A Tale of Research Across Communities. Proc. VLDB Endow., 3(1-2):5–12, Sept. 2010.

[19] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[20] N. Loutas, V. Peristeras, T. Bouras, E. Kamateri, D. Zeginis, and K. Tarabanis. Towards a Reference Architecture for Semantically Interoperable Clouds. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, pages 143–150. IEEE, 2010.


[21] R. Moreno-Vozmediano, R. S. Montero, and I. M. Llorente. IaaS Cloud Architecture: From Virtualized Datacenters to Federated Cloud Infrastructures. Computer, 45(12):65–72, 2012.

[22] Open Source Data Center Virtualization. http://www.opennebula.org.

[23] The Open Source, Open Standards Cloud. http://www.openstack.org.

[24] J. Peng, X. Zhang, Z. Lei, B. Zhang, W. Zhang, and Q. Li. Comparison of Several Cloud Computing Platforms. In 2nd Int. Symp. on Information Science and Engineering (ISISE), pages 23–27, 2009.

[25] B. Snyder, D. Bosanac, and R. Davies. ActiveMQ in Action. Manning, 2011.

[26] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 307–320. USENIX Association, 2006.
