

Applying RINA as an overlay virtual networking solution to support highly available distributed clouds

Bernat Gastón, Eduard Grasa Distributed Applications and Networks Area

Fundació i2CAT Barcelona, Spain

{bernat.gaston, eduard.grasa}@i2cat.net

Gabriel Monnerat Nexedi

Lille, France [email protected]

Jordi Perelló Department of Computer Architecture (DAC) Universitat Politècnica de Catalunya (UPC)

Barcelona, Spain [email protected]

Marc Suñé, Víctor Álvarez Berlin Institute of Software Defined Networks (BISDN)

Berlin, Germany {marc.sune, victor.alvarez}@bisdn.de

Fatma Hrizi Telecom Sud Paris

Évry, France [email protected]

Abstract—Distributed cloud systems offer important advantages over centralized ones. Intrinsic resilience to network and electricity cuts, green nature, scalability and a more jurisdictionally secure environment for trade secrets are some of these advantages. However, because of its open and exposed nature, networking in highly available distributed clouds has to address a number of challenges. In this paper, a deep analysis of the networking aspects of a real distributed cloud use case is performed, identifying its most important issues and limitations. The core of the paper discusses a detailed case study of the application of the Recursive InterNetwork Architecture (RINA) as a (virtual) networking solution for the distributed cloud, tailoring routing, resource allocation and security policies to the specific needs of the supported cloud service. The paper concludes by highlighting the advantages of the RINA approach over current network virtualization solutions in terms of simplicity, programmability, scalability, manageability and network-application integration.

Keywords—RINA; distributed cloud; network virtualization; cloud networking; routing and addressing; security; resource allocation; network management

I. INTRODUCTION: DECENTRALIZED INFRASTRUCTURE FOR RESILIENT CLOUD SERVICES

Cloud computing is based on assigning computing needs to a shared pool of distributed computing resources. These clouds of computing resources are usually located in data centers, which are dedicated computing facilities. Cloud companies take advantage of the centralized physical location of the cloud resources to control key aspects like security, resource allocation or network management.

In contrast, there exist decentralized cloud architectures where resources are located not only in data centers but also in homes or offices. Decentralized clouds claim to achieve better resilience to electricity cuts or natural disasters, higher energy efficiency (less replication, less cooling, less power to supply optical fibers, etc.), a reduction in network use (resources are physically closer to the users), and also to be a better option for hosting trade secrets, since jurisdiction protects the cloud from PRISM or other surveillance programs. In light of this, distributed cloud systems seem to be the next natural step in the evolution of computing, since they scale much better than their centralized counterparts.

VIFIB [1] is an open source decentralized cloud system. It consists of computers located in data centers, in people's homes, in offices, etc. By hosting computers in many different locations and copying each associated database to at least three distant sites, the probability of mass destruction of the whole infrastructure becomes extremely low.

SlapOS [2] is a commercial cloud computing solution that empowers the VIFIB decentralized cloud system. SlapOS includes a complete PaaS (Platform as a Service) solution and orchestrator. SlapOS defines two types of servers: SlapOS Nodes and the SlapOS Master. SlapOS Nodes can be installed inside data centers or at home. Their role is to install software and run processes. The SlapOS Master acts as a central directory of all SlapOS Nodes, knowing where each SlapOS Node is located and which software can be installed on each node. The role of the SlapOS Master is to allocate processes to SlapOS Nodes.

This work is partly funded by the European Commission through the PRISTINE project (Grant 619305), part of the Future Networks objective of the Seventh Framework Programme (FP7).

On each SlapOS node, VIFIB allocates 100 IPv6 addresses and 100 IPv4 addresses. Each service running in the computer is attached to a dedicated IPv4 address, as well as a globally routable IPv6 address. All services are interconnected across VIFIB nodes using "tunnels" (called stunnels [3]) that redirect local IPv4 to global IPv6, encrypt flows and redirect IPv6 to IPv4. In this way, two services running in different locations, compatible or not with IPv6, can be interconnected through a secure link that also provides mutual authentication through TLS X509 certificates. Even insecure services such as 'memcached' can be deployed over insecure networks through this approach, as if they were deployed in a local area network.
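
For illustration, the following sketch captures the essence of this mechanism: a service's dedicated local IPv4 endpoint is forwarded to the peer service's global IPv6 address over a mutually authenticated TLS connection. This is not the actual stunnel or re6stnet code; addresses, ports and file names are hypothetical.

```python
# Illustrative sketch only (hypothetical addresses/paths), not VIFIB/stunnel code:
# forward a service's dedicated local IPv4 endpoint to the peer's global IPv6
# address over TLS, with mutual authentication via X.509 certificates.
import socket
import ssl
import threading

LOCAL_IPV4 = ("10.0.0.5", 11211)        # dedicated IPv4 address of the local service
REMOTE_IPV6 = ("2001:db8::42", 11211)   # globally routable IPv6 address of the peer

ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="ca.crt")
ctx.load_cert_chain("node.crt", "node.key")  # present our certificate (mutual auth)
ctx.check_hostname = False                   # peers are identified by certificate only

def pump(src, dst):
    """Copy bytes from one socket to the other until EOF."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)

listener = socket.socket(socket.AF_INET)
listener.bind(LOCAL_IPV4)
listener.listen()
while True:
    local, _ = listener.accept()
    remote = ctx.wrap_socket(socket.socket(socket.AF_INET6))  # encrypt + authenticate
    remote.connect(REMOTE_IPV6)
    threading.Thread(target=pump, args=(local, remote), daemon=True).start()
    threading.Thread(target=pump, args=(remote, local), daemon=True).start()
```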

Using a tunnel replacement and a routing strategy, an overlay network is created. This secure, resilient and low latency overlay network is called re6stnet.

In this paper we provide a complete analysis of VIFIB's decentralized cloud networking environment. Using the current VIFIB network as an instantiation example, we describe this decentralized cloud architecture and analyze its underlying network to extract its main characteristics, advantages and limitations. From this analysis, we propose a new design of the VIFIB network built on a clean-slate networking architecture called the Recursive InterNetwork Architecture (RINA) [4]. We show that by using RINA, the VIFIB system can be improved in many aspects such as addressing and routing, security and resource allocation.

This paper is organized as follows. In Section II, the current re6stnet is analyzed and other virtual networking solutions are reviewed. In Section III, RINA is introduced. In Section IV, RINA is analyzed as a possible base architecture for re6stnet. Finally, conclusions and further work are presented in Section V.

II. NETWORKING SOLUTIONS FOR DISTRIBUTED CLOUDS

In this section, the re6stnet is explained and analyzed. We show how a typical re6stnet overlay works, explaining its internals and analyzing its strong and weak points. Moreover, we introduce other virtual networking solutions that could be used as alternatives to re6stnet.

A. The re6stnet

The current re6stnet is an overlay spanning approximately 100 VIFIB nodes running customer services, spread across 35 locations around the world. Moreover, it contains one gateway to the IPv6 Internet, four reverse proxies that allow IPv4 users to access the services, and one registry node, which provides the initial point of contact for nodes joining the overlay.

Fig. 1 shows an overview of the re6stnet overlay structure. IPv6 connectivity between nodes can be provided: i) through Ethernet if two nodes share the same LAN; ii) through a VPN tunnel (OpenVPN [5]) that simulates an Ethernet LAN between nodes; and iii) through a sequence of both previous cases thanks to the Babel routing protocol. To join the network, a node has to contact the registry in order to get an SSL certificate (required to create the OpenVPN tunnels), the static IPv4 address where the OpenVPN server of the node has to listen, and a random list of peers with their contact information (the IPv4 address and UDP port of the OpenVPN server on those peers). In order to keep the list of peers updated, each network node contacts the registry twice a day to renew it. The registry keeps a cache of the network nodes that have contacted it recently, so that inactive nodes can be removed from the cache (they are assumed to be down). Except for the registry, the network is decentralized.

Fig. 1. Layer structure of the re6stnet overlay

With this information, a node can open tunnels to other nodes. In order to minimize the probability of network partitions [6], the re6stnet overlay organizes nodes in a flat random graph, using its own algorithm. The goal of this algorithm is to construct a robust network structure with a small diameter, in order to minimize the latency between nodes. Since every node can host a limited number of tunnels, VIFIB considers a modification of algorithms based on k-regular undirected graphs [7], where k is the number of tunnels that each peer creates. To avoid the complexity of a k-regular graph generator while maintaining the resilience [7] and small diameter [8], each node opens k tunnels with other peers, but can additionally allocate up to 2*k tunnels requested by other nodes. Using only local information, each node decides which are the best tunnels and uses them. The least used tunnels (i.e., those that reach the least number of nodes) are replaced by other random tunnels. This tunnel replacement strategy tends to sort the nodes into two categories: those that have a number of tunnels close to k, and those that have a number of tunnels close to 3k. Nodes with 3k tunnels are in the center of the connectivity graph, while nodes with k tunnels are at the borders. This scheme allows re6stnet to achieve the desired compromise between low latency and high resilience.
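
A minimal sketch of this strategy is given below. It is simplified with respect to the actual re6stnet algorithm: the reachable_nodes callback, which is assumed to count the nodes reached through a tunnel via the routing table, and the class and method names are illustrative only.

```python
import random

class TunnelManager:
    """Simplified sketch of re6stnet-style tunnel creation and replacement."""

    def __init__(self, k, peers):
        self.k = k
        self.peers = peers          # candidate peers learned from the registry
        self.outgoing = set()       # tunnels this node opened itself
        self.incoming = set()       # tunnels opened towards us (bounded by 2*k)

    def open_initial_tunnels(self):
        # Each node actively opens k tunnels to randomly chosen peers.
        for peer in random.sample(self.peers, self.k):
            self.outgoing.add(peer)

    def accept(self, peer):
        # Accept tunnels requested by other nodes, up to 2*k of them.
        if len(self.incoming) < 2 * self.k:
            self.incoming.add(peer)
            return True
        return False

    def replace_least_used(self, reachable_nodes):
        # Drop the outgoing tunnel that reaches the fewest nodes (decision based
        # on local information only) and replace it with a random new peer.
        worst = min(self.outgoing, key=reachable_nodes)
        self.outgoing.remove(worst)
        candidates = [p for p in self.peers if p not in self.outgoing]
        if candidates:
            self.outgoing.add(random.choice(candidates))
```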

The routing between nodes of the re6stnet is done by the Babel protocol, a distance-vector routing protocol based on the Bellman-Ford algorithm [9]. Neighbors are detected by means of Hello and IHU (I Heard You) message exchanges, and routing information is exchanged only between neighbors by means of update messages.

When a customer uses the cloud, they can either connect to a reverse proxy for web services (HTTP and HTTPS), which allows IPv4 users to access the services deployed in the overlay, or use direct IPv6 connectivity to the services via the VIFIB gateway.

Although the VIFIB/SlapOS solution is very robust and meets most of the expected requirements, there are still some problems to solve:

Routing: a node must be "identified" in case it changes its IP address, something very common given the dynamic IP assignment provided by most ISPs. Moreover, hierarchical routing is not implemented in re6stnet, so the routing table size and the routing information exchanged between neighbors grow as O(n).

Security: the Babel protocol can be attacked by a compromised node flooding the network with bad routes.

Isolation of service trees: services belonging to the same tree have to be able to connect to each other, but it would be desirable to isolate these services from the other trees and offer them a customized networking environment. Moreover, there are potential congestion issues. For example, if one customer floods the network with data, this impacts other customers.

Tunnels: they consume resources on the node even if the tunnel is not being used. Moreover, determining the best tunnels to use is a complex decision that would require a lot of knowledge about the network. Making this decision based only on local information leads to locally optimal choices that are not globally optimal.

An improvement to the VIFIB/SlapOS solution will need to address some of these problems, while maintaining the main re6stnet characteristics: distributed cloud schema, high resiliency, low latency, fast routing convergence and at least the same level of security.

B. Alternatives to re6stnet in the state of the art

There exist other techniques to create a distributed network overlay (like re6stnet) across the Internet. In this subsection, we discuss virtual network overlays and distributed hash tables.

1) Virtual network overlays

Virtual network overlays comprise a number of similar technologies that share the same assumptions: the network provides IP connectivity between hypervisor hosts, which tunnel the traffic of the virtual networks used by the virtual machines they host using different tunneling protocols (mainly VXLAN [10], STT [3] and NVGRE [11]). All the complex processing and tunneling tasks are performed in the hypervisor machines.

Fig. 2 shows an example of how these technologies work. App 1 in Host A can send a communication to App 2 in Host B as if both of them were in the same network. Hypervisor A is responsible for locating the destination Hypervisor B and tunneling the communication through the underlying IP network via an Ethernet-over-IP encapsulation protocol (such as VXLAN). Then, Hypervisor B can deliver the communication to Host B, which is completely agnostic to the tunneling process.

Existing virtual overlay network solutions mainly differ in how they acquire and manage the distributed state needed to make the system work; in particular, how MAC addresses in the virtual Ethernet segment are mapped to UDP/IP endpoints in the "physical" IP layer. A simple yet poorly scalable choice is to emulate a MAC learning switch and use IP multicast to disseminate the mappings. More elaborate alternatives are based on a separate "control plane" that maintains the shared state and populates the virtual MAC to IP resolution tables at the Hypervisors.
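
The core of such a control plane is a table that resolves a (virtual network, MAC address) pair to the IP endpoint of the hypervisor currently hosting that MAC. The sketch below illustrates this state; field and function names are illustrative and do not correspond to any specific product.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Vtep:
    ip: str      # IP address of the tunnel endpoint (hypervisor)
    port: int    # UDP port, e.g. 4789 for VXLAN

# (VNI, virtual MAC) -> tunnel endpoint where that MAC currently lives.
forwarding_table: Dict[Tuple[int, str], Vtep] = {}

def learn(vni: int, mac: str, vtep: Vtep) -> None:
    """Populate the mapping, either from data-plane learning or a control plane."""
    forwarding_table[(vni, mac)] = vtep

def resolve(vni: int, mac: str) -> Optional[Vtep]:
    """Where should a frame for this MAC be tunneled to? None -> flood/multicast."""
    return forwarding_table.get((vni, mac))

learn(5001, "52:54:00:aa:bb:cc", Vtep("192.0.2.10", 4789))
print(resolve(5001, "52:54:00:aa:bb:cc"))
```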

Virtual overlay networking technology was initially designed for application in datacenters, which are usually controlled environments belonging to a single administrative entity. Therefore, these solutions neither provide authentication of the tunnel endpoints nor encrypt the traffic that goes through the tunnels. This is an issue in VIFIB's environment, since many tunnels operate over the public Internet. Apart from this, virtual network overlays provide similar functionality to the OpenVPN tunnels, and suffer from similar limitations (multi-homing and mobility are cumbersome since they rely on IP, poor isolation of flows, and limited scalability of the routing system).

2) Distributed Hash Tables

Distributed Hash Tables (DHT) can be used to create a virtual overlay. Each node in the network has an ID, which is a key of the hash table. Moreover, it has a set of keys for which the node’s ID is the closest ID measured according to some distance function d(key1, key2). For any key, each node is the owner of the key or has a link to a neighbor whose ID is closer to the key (according to the distance function d). Using this key-based routing, an overlay network is created. This kind of network is designed to be very flexible and scalable because it has to resist nodes joining and leaving the network frequently, and a number of popular DHT algorithms such as Kademlia or Chord have been used in a plethora of p2p applications.
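
As an illustration of key-based routing, the sketch below uses a simplified, Kademlia-like XOR distance: each node forwards a lookup to the known neighbor whose identifier is closest to the key. This is not the code of any particular DHT; the identifier derivation also shows why locality is lost (IDs are hashes of names).

```python
import hashlib
from typing import List, Optional

def node_id(name: str) -> int:
    """Derive a flat 160-bit identifier from a node name (hence no locality)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def distance(a: int, b: int) -> int:
    """Kademlia-style XOR distance between two identifiers."""
    return a ^ b

def next_hop(my_id: int, neighbors: List[int], key: int) -> Optional[int]:
    """Return a neighbor closer to the key than we are, or None if we own it."""
    best = min(neighbors, key=lambda n: distance(n, key), default=my_id)
    return best if distance(best, key) < distance(my_id, key) else None
```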

There are two main issues with the typical use of DHTs: exploiting locality (since node identifiers are hashes) and the assumption that each node is one hop away from any other (the underlay, usually IP, provides connectivity between nodes). There are proposals in the literature to embed locality in node identifiers to minimize the overlay path extension [12], but they are not the optimal solution when latency needs to be minimized, as in the decentralized cloud scenario. DHTs have also been proposed as a means to construct the routing table of an internet (such as [13], although examples of the routes generated by the DHT are not provided there). The main benefit of applying DHTs at the heart of a routing system is to minimize the number of entries in routing tables, but it may also impose rules that are too restrictive on how nodes have to be connected (which may restrict the application of this approach to very specific cases).

Fig. 2. Example of a virtual network overlay using VXLAN


The use of DHTs just provides a way to construct the addressing and routing components of an overlay, which may be interesting for particular network scenarios but does not provide a perfect fit for the decentralized cloud scenario being addressed in this paper, due to its requirements in terms of minimizing latency and the flexibility in the overlay connectivity graph.

III. INTRODUCTION TO THE RECURSIVE INTERNETWORK ARCHITECTURE

RINA, the Recursive InterNetwork Architecture, is the result of an effort to work out the general principles in networking that apply to everything. RINA is the specific architecture, implementation, testing platform and, ultimately, deployment of this theory. The theory is informally known as the "Inter-Process Communication (IPC) model" [4], although it also deals with concepts and results that are generic for any distributed application and not just for networking.

The IPC model captures the common elements of distributed applications, called DAFs (Distributed Application Facilities), as illustrated in Fig. 3. A DAF is composed of two or more Distributed Application Processes, or DAPs, which collaborate to perform a task. These DAPs communicate using a single application protocol called CDAP (Common Distributed Application Protocol), which enables two DAPs to exchange structured data in the form of objects. All of a DAP's externally visible information is represented by objects and structured in a Resource Information Base (RIB), which provides a naming schema and a logical organization for the objects known by the DAP (for example, a naming tree). CDAP allows the DAPs to perform six remote operations on the peer's objects (create, delete, read, write, start and stop).
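
A toy sketch of this object model is given below: a RIB maps object names to values, and a peer manipulates them through the six CDAP operations. This is illustrative only; real CDAP messages carry many more fields (scope, filter, invoke-id, etc.) and the RIB is a tree rather than a flat dictionary.

```python
class RIB:
    """Toy Resource Information Base: named objects manipulated via CDAP verbs."""

    def __init__(self):
        self.objects = {}            # object name -> object value

    # The six CDAP operations a peer DAP can invoke on our objects.
    def create(self, name, value):
        self.objects[name] = value

    def delete(self, name):
        self.objects.pop(name, None)

    def read(self, name):
        return self.objects.get(name)

    def write(self, name, value):
        self.objects[name] = value

    def start(self, name):
        self.objects[name + "/state"] = "running"

    def stop(self, name):
        self.objects[name + "/state"] = "stopped"

rib = RIB()
rib.create("/dif/naming/address", 27)
print(rib.read("/dif/naming/address"))
```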

In order to exchange information, DAPs need an underlying facility that provides communication services to them. This facility is another DAF whose task is to provide and manage Inter-Process Communication services over a certain scope; hence this DAF is called a DIF: Distributed IPC Facility (the DIF can be thought of as a layer). A DIF enables a DAP to allocate flows to one or more DAPs, by just providing the names of the targeted DAPs and the characteristics required for the flow (bounds on data loss and delay, in-order delivery of data, reliability, etc.). DAPs may not trust the DIF they are using; therefore they may decide to protect their data before writing it to the flow, for example using encryption, via the SDU (Service Data Unit) Protection module.
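
In essence, what a DIF exposes to its users is a flow-allocation service keyed by application names and flow characteristics. The interface sketched below conveys the idea; method and field names are illustrative and are not the actual IRATI API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowSpec:
    """Characteristics requested for a flow (a small illustrative subset)."""
    max_loss: Optional[float] = None     # bound on loss probability
    max_delay_ms: Optional[int] = None   # bound on delay
    in_order: bool = False               # in-order delivery of SDUs
    reliable: bool = False               # retransmission of lost SDUs

class DIF:
    """Illustrative view of the IPC service a DIF offers to its users."""

    def allocate_flow(self, src_app: str, dst_app: str, spec: FlowSpec) -> int:
        """Return a port-id for a flow to dst_app meeting spec (or raise)."""
        raise NotImplementedError

    def write_sdu(self, port_id: int, sdu: bytes) -> None: ...
    def read_sdu(self, port_id: int) -> bytes: ...
    def deallocate_flow(self, port_id: int) -> None: ...
```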

DIFs can also be users of other underlying DIFs, called N-1 DIFs, creating in this way the recursive structure of the RINA architecture. The DAPs that are members of a DIF are called IPC Processes or IPCPs. They have the same generic DAP structure shown in Fig. 3, plus some specific tasks to provide and manage IPC. These tasks, as shown in Fig. 4, can be divided into three categories: data transfer, data transfer control and layer management. The elements are ordered by increasing complexity and decreasing frequency of use: elements at the far left are the most used but the least complex ones (per-packet processing), while elements towards the right are the most complex ones, but not so often invoked. All layers provide the same functions and have the same structure and components. However, these components are configured via policies in order to adapt each layer to different operating environments.

Fig. 3. Distributed application processes and their components

Fig. 4. Example of different RINA networks and components of an IPC Process

As depicted in Fig. 4, RINA networks are usually structured in DIFs of increasing scope, starting from the so-called lower layers and going up closer to the applications. A provider network can be formed by a hierarchy of DIFs, multiplexing and aggregating traffic from upper layers into the provider's backbone. None of the provider's internal layers needs to be externally visible. Multi-provider DIFs (such as the one supporting the public Internet or others) float on top of the ISP layers. Only three types of systems are required: hosts (which contain applications), interior routers (systems that are internal to a layer) and border routers (systems at the edges of a layer, which go one layer up or down).

IV. APPLYING RINA AS THE NETWORKING SOLUTION FOR THE DISTRIBUTED CLOUD: A CASE STUDY

In order to apply the RINA architecture to the VIFIB decentralized cloud use case, the network architect first has to identify the different DIFs (layers) that the network will require. Then, once the requirements of each DIF are understood, the network architect can design the policies for the different components of the IPC Processes that are best suited to meet these requirements. This is an important distinction with respect to the current Internet: there is no need to design full new protocols, since all DIFs already provide the basic protocols and infrastructure. What needs to be done is to program the different IPC Process components (such as authentication, access control, routing, addressing, data transfer, etc.) via policies in order to tailor the DIF to its operational environment. Finally, since RINA needs to be interoperable with existing technologies, the network architect needs to think about adaptation layers (called shim DIFs) and gateways that allow the deployment of RINA in current networks. We follow this approach while discussing the distributed cloud case study.

A. DIFs in the distributed cloud

The first immediate application of RINA to the VIFIB system comes with the replacement of the re6stnet overlay with a RINA DIF, as seen in Fig. 5. We have called this DIF the SlapOS base DIF (SOS-DIF), since it provides connectivity to all the VIFIB nodes, the registry(ies) and the gateway(s). The characteristics of this DIF should be similar to those of the re6stnet overlay: i) a random connectivity graph following the underlying flow generation and replacement strategy (i.e., for the flows between IPCPs); ii) IPCPs that have an underlying flow in common should authenticate each other; and iii) all traffic exchanged between IPCPs via the underlying flows should be encrypted.

There are some structural characteristics of RINA that already mitigate some of the issues with the re6stnet overlay.

First of all, application names are independent of the application's location, thus decoupling applications from their network point of attachment. Moreover, IPCP addresses are location-dependent but route-independent, facilitating multi-homing and mobility [14]. Besides, IPCP addresses can be dynamically changed during the operation of the DIF without impacting the existing flows provided to applications. Policies in the SOS-DIF can be designed to further overcome the limitations of the re6stnet overlay. Instead of the Babel routing protocol, a dynamic routing policy that exploits topological addresses could minimize the size of routing tables and allow for scalability. The SOS-DIF can also provide multiple QoS classes, supported by resource allocation policies capable of enforcing strong flow isolation. We will elaborate more on these policies in the next subsection.

Last but not least, if the users of the services supported by the SOS-DIF are also RINA-enabled, in some cases it can make sense to exploit the recursive nature of the RINA architecture by deploying dedicated DIFs per customer on top of the SOS-DIF. These "VPN-DIFs", depicted in Fig. 6, would enable a seamless integration of applications deployed on the VIFIB infrastructure with others running at the customer's premises. Moreover, policies in these DIFs could be tailored to the customer's needs, for example providing an enhanced level of security (authentication, encryption) or data transfer policies customized to the applications deployed on the "VPN-DIF".

B. Policies of the SOS-DIF

We have divided the discussion about the policies of the SOS-DIF into three areas: security, addressing and routing, and resource allocation. As explained in subsection IV.A, the connectivity graph of the SOS-DIF is generated and maintained using the same strategy employed by re6stnet, resulting in a pseudo k-regular undirected graph where each IPCP establishes k underlying flows to other randomly chosen IPCPs, and can accept up to 2k underlying flows from other IPCPs.

Fig. 5. SlapOS base DIF as a replacement of re6stnet

Fig. 6. VPN-DIFs spanning to the customers, floating on top of the SOS-DIF


1) Security policies: authentication, access control and confidentiality

Due to the fairly decentralized and open environment in which VIFIB's cloud service operates, security is a challenging goal to achieve. The first threat to be mitigated is the possibility that a rogue VIFIB node joins the decentralized cloud. IPCPs in the SOS-DIF have to authenticate each other after successfully establishing an underlying flow to a previously unknown IPCP. This authentication is done in the Common Application Connection Establishment Phase (CACEP), in which authentication messages can be exchanged between IPC processes.

When two IPCPs want to establish a communication (provided that they have an underlying flow in common), the first thing they need to do is to establish an application connection. The application connection enables the two IPCPs to exchange enough information to be able to understand each other (they have to agree on the concrete syntax of the application protocol, as well as on the version of the objects that the protocol will carry) and also to mutually authenticate. The most common situation in which this happens is when an IPCP joins a DIF. The joining IPCP requests the allocation of a flow to a DIF member, then the DIF member authenticates the joining IPCP and decides whether it is allowed to join.

A typical method suitable for the decentralized cloud environment is the use of digital certificates, and hence asymmetric cryptography, to authenticate the counterparty and to exchange a symmetric key that is later used to encrypt the data exchanged between both parties. These methods use public-key cryptography, where a key pair is composed of a public key and a private (secret) key. Any message encrypted with the public key can only be decrypted with the corresponding private key, and vice versa. Authenticity is then proved by being able to sign messages, i.e., to encrypt messages with a private key. In this case, no secret needs to be shared in advance, but it is still necessary to trust the source of the public key to avoid man-in-the-middle attacks.

Fig. 7 shows an authentication policy based on this approach. The IPCP that wants to join the SOS-DIF (AP-B) needs to previously obtain a certificate signed by a Certificate Authority (CA) trusted by the IPCPs in the SOS-DIF. There are several ways in which these certificates could be obtained, distributed and installed into the nodes hosting IPCPs wanting to join the SOS-DIF (e.g., pre-installed by VIFIB personnel before shipping the node to its destination, via USB keys, via SIM cards, etc.). Assuming the joining party has installed such a certificate, the flow of messages during the authentication exchange could be the following:

Member IPCP negotiates the version of the policy and the cipher settings with the joining IPCP.

Joining IPCP provides its certificate to member IPCP, and optionally requests member IPCP’s certificate (in case the joining IPCP wants to authenticate the member as well).

Member IPCP performs the authentication calculations in order to check if the certificate provided by the joining IPCP has really been signed by the trusted CA.

If successful, authentication is over and the CACEP phase can conclude.
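
As an illustration, the certificate check performed by the member IPCP in the third step could be realized along the following lines. This is a sketch using the Python cryptography package and assuming RSA-signed certificates; it is not the actual policy code of any RINA implementation.

```python
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

def peer_certificate_is_valid(peer_cert_pem: bytes, ca_cert_pem: bytes) -> bool:
    """Check that the joining IPCP's certificate was signed by the trusted CA."""
    peer_cert = x509.load_pem_x509_certificate(peer_cert_pem)
    ca_cert = x509.load_pem_x509_certificate(ca_cert_pem)
    try:
        ca_cert.public_key().verify(
            peer_cert.signature,              # signature placed by the CA
            peer_cert.tbs_certificate_bytes,  # the signed part of the certificate
            padding.PKCS1v15(),               # assumes an RSA-signed certificate
            peer_cert.signature_hash_algorithm,
        )
        return True
    except Exception:
        return False
```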

After both IPCPs have successfully established an application connection, and therefore authenticated, they are ready to start exchanging data between them. However, since they may be operating over the public Internet, the SOS-DIF security policies should take into account the fact that data exchanged by the IPCPs may be subject to tampering or eavesdropping.

Confidentiality and integrity in the RINA architecture are achieved by configuring the SDU protection module of each process with proper policies. SDU Protection includes all checks necessary to determine whether or not an SDU should be processed further or to protect the contents of the SDU while in transit to another IPCP that is a member of the DIF. It may include but is not limited to checksums, CRCs, Hop Count/TTL, encryption, etc. SDU protection is performed on an underlying flow basis, meaning that SDU Protection policies can vary for each N-1 flow the DIF is using (since the level of trust and the characteristics of the underlying DIFs may be different). This is the case of the SOS-DIF, in which data exchange between IPCPs has to be encrypted if it travels through an underlying flow over the public Internet. However, this may not be necessary if the underlying flow is provided by a LAN.
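
The per-flow nature of SDU Protection can be pictured as a simple policy selection, sketched below with illustrative names: the protection profile applied to PDUs depends on the underlying DIF that carries the N-1 flow.

```python
# Illustrative only: choose an SDU Protection profile per N-1 flow, based on
# how much the underlying DIF is trusted (names are hypothetical).
TRUSTED_UNDERLAYS = {"shim-dif-lan-1"}          # e.g. a shim DIF over a local VLAN

def sdu_protection_profile(underlying_dif: str) -> dict:
    if underlying_dif in TRUSTED_UNDERLAYS:
        # LAN: integrity checks only, no encryption needed.
        return {"crc": True, "ttl": True, "encrypt": False}
    # Public Internet (e.g. a shim DIF over TCP/UDP): encrypt everything.
    return {"crc": True, "ttl": True, "encrypt": True}

print(sdu_protection_profile("shim-dif-tcp-udp-internet"))
```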

Fig. 7 also shows an overview of how the keys used for encryption could be generated after authentication is over (green rectangle). Once the keys have been agreed, the proper instance of the SDU protection module would be configured and activated, encrypting the traffic exchanged by the two IPCPs in the SOS-DIF.

Fig. 7. Authentication policy based on asymmetric cryptography.

2) Addressing and routing policies

Addresses are location-dependent but route-independent; that is, they provide an idea of where something is without indicating how to get there. In other words, given two addresses, one can detect whether they are near each other, for some definition of near. When applying this address definition to a network layer (a DIF), the address space should reflect an abstraction of the DIF connectivity graph that remains invariant to changes in it; in other words, an abstraction of the topology of the layer connectivity graph. The topological address space must be metrizable through a distance function or define an orientation, so that it can be decided whether two addresses are near each other or, given a destination address, which address in a certain set is closer to the destination. The granularity property of a topological space is also interesting: it denotes its resolution, i.e., which elements of the topological space can be considered to be at distance 0 from each other (they are "in the same place").

In general, the effectiveness of the routing and resource management for the layer can be greatly enhanced if the topological address space is metrizable and has an orientation. Moreover, better routing scalability can be achieved with coarser granularity, but routing will probably tend to be less optimal, so there is a clear trade-off in this regard. Applying and maintaining a topological addressing scheme for VIFIB’s decentralized cloud environment can simplify routing and increase the scalability of the SOS-DIF.

Addresses in a DIF are assigned and maintained by an IPCP component called the Name Space Manager (NSM, Fig. 4). When an IPCP joins a DIF, it establishes an application connection to one of its members, as explained in the previous section, and gets an address assigned. Policies for address assignment within a DIF can range from fairly centralized (where one or a few IPCPs maintain a fully replicated map of the address assignments), to more decentralized (the DIF address space can be divided into several subsets, with different IPCPs maintaining the different subsets of the address space), to fully distributed (in the case where a DIF has no organization, such as in dynamic ad-hoc networks). Another interesting property of naming and addressing in RINA is that IPCPs are identified by their names; addresses are just synonyms of the IPCP name, structured to facilitate routing within a DIF. This property implies that: 1) the same IPCP can have multiple addresses, and 2) the address of an IPCP can be dynamically changed during its lifetime (renumbering is no evil in RINA). As we will see in the next paragraphs, this property makes topological addressing feasible in a dynamic environment such as VIFIB's distributed cloud.

A dynamic topological routing policy could work in the following way. The IPCPs in the SOS-DIF are divided into different regions, as shown in Fig. 8 (disjoint subsets of the DIF). All IPC processes in the DIF are equal in terms of routing, and the underlying flows connecting them are divided into in-region and out-region links, depending on whether they provide connectivity to IPC processes belonging to the same or to other regions, respectively. The address space is structured to reflect the DIF's topological organization in regions, encoding the IPC Process address as <region, id>, where region identifies the region and id the IPCP within the region.

When an IPCP joins the SOS-DIF, the NSM assigns it an address, which involves taking the decision of assigning the IPC Process to a region. Maintaining regions that are well-balanced is a complicated problem taking into account that the underlying flows between IPCPs are constantly updated.

While regions could be randomly assigned, it is much preferable to assign the same region to IPC processes near each other, for some definition of "nearness". In this way, we avoid communications between nearby IPC processes traversing unnecessarily long paths. To achieve this, information about the "location" of the new IPC processes is required (e.g., underlying DIFs with which they are registered, round-trip time to the IPCP, etc.). This information should be made available to the NSM by some means.

Each IPCP needs to maintain reachability information to all the remaining IPC processes inside its region, as well as to the other regions (the granularity of the inter-region routing is the region). For that, it maintains two routing tables, namely an intra-region one and an inter-region one. Updates of the inter-region routing table are distributed to all neighbors. Conversely, intra-region routing table updates are only distributed through in-region links, leaving IPCPs outside the region unaware of the region's internal topology. Note that if all region gateways were connected in a full-mesh fashion, the inter-region routing table would not be necessary. However, the strategy used to generate the connectivity graph of the SOS-DIF is not consistent with this assumption.

The routing policy can use either a link-state or a distance-vector approach to populate the routing tables. Given the network graph described by the underlying flows, different routing policies (e.g., to be assigned to different QoS classes) could be realized by employing different measures of distance, such as available bandwidth, hop count, delay, and so on. It should be mentioned, though, that IPC processes would need to maintain one entry per destination and measure of distance in both the intra-region and inter-region routing tables. Moreover, in order to avoid loops, the split-horizon technique can be applied to routing updates (as typically employed in Routing Information Protocol, RIP, configurations), informing of the cost of the secondary option when sending routing updates for the primary option.

Each IPCP is assigned a hierarchical address, defined as the sequence of the region identifier and the node identifier inside the region. This hierarchical address is enough to route to any destination IPCP. Routing is performed to the node identifier of the destination IPC process if within the same region, or to the region identifier if within another region.
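
The resulting forwarding decision is simple, as the sketch below illustrates (assuming the two routing tables described above are already populated; type and variable names are illustrative): route on the node identifier inside the region, and on the region identifier otherwise.

```python
from typing import Dict, NamedTuple

class Address(NamedTuple):
    region: int
    node_id: int

def next_hop(my_addr: Address,
             dest: Address,
             intra_region: Dict[int, Address],   # node_id -> next-hop IPCP
             inter_region: Dict[int, Address]    # region  -> next-hop IPCP
             ) -> Address:
    """Forward on <region, id>: intra-region table within the region,
    inter-region table (region granularity) otherwise."""
    if dest.region == my_addr.region:
        return intra_region[dest.node_id]
    return inter_region[dest.region]

# Example: IPCP <2, 7> forwards a PDU destined to <3, 1> towards region 3.
intra = {4: Address(2, 4), 9: Address(2, 9)}
inter = {1: Address(2, 9), 3: Address(2, 4)}
print(next_hop(Address(2, 7), Address(3, 1), intra, inter))
```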

Fig. 8. Connectivity graph of the SOS-DIF, divided into 4 regions.


An aspect worth considering is the issue of maintaining the coherence of the network topology under the dynamic constraints of distributed cloud systems. An IPC process may at any time disconnect from its region. Accordingly, it can join a new region and leave its old one. We are investigating the work in [15], where a set of balancing rules is proposed to deal with this problem and ensure that a valid hierarchical topology is maintained.

3) Resource allocation and RMT scheduling policies

The Resource Allocation Task of an IPCP (Fig. 4) is in charge of managing the allocation of the set of resources needed to provide QoS aware delivery of PDUs. This includes the size and number of QoS classes available in each DIF, as well as the assignment of the flows provided by the DIF to QoS classes. In addition, it holds and maintains the PDU Forwarding table and is aware of the number of RMT input/output queues per underlying flow and the policies servicing these queues. A DIF’s resource allocation strategy depends on the stochastic properties of the incoming traffic, as well as on the different characteristics of the classes of flows it wants to support.

The DIF resource allocation strategy influences the number of queues in the RMT, as well as the processing of those queues. In one extreme, there can be a single queue per input or output N-1 flow. This queue is shared by all the PDUs of all the N-flows provided by the DIF, therefore treating all the flows in the same way and providing no isolation between them (best effort, equivalent to the current re6stnet). In the other extreme, each flow can be isolated from the others and treated individually, by having a dedicated input/output queue and a proper scheduling algorithm (virtual-circuit approach). In between these two approaches, separate input/output queues per QoS class provide isolation and differential treatment to flows belonging to different QoS classes, limiting the resource sharing to flows belonging to the same QoS class.

We are exploring a resource allocation policy capable of supporting multiple levels of QoS based on the urgent/cherishing multiplexing [16] approach. This work originated from trying to understand how to deliver predictable quality in networks working close to or at saturation (when the offered traffic load is close to or even higher than the network's traffic processing capacity). It gave rise to the delta-Q model for reasoning about network quality [17], in which quality attenuation is defined as the quality degradation (in terms of loss and delay) that each packet flow perceives in traversing the network. Each network element in the path of a packet introduces a certain quality attenuation. Part of this quality attenuation is due to the fact that packets contend for shared resources with other packets (queues and scheduling capacity). Quality attenuation cannot be 'undone', but it can be differentially allocated to the different flows traversing a network element. Therefore, for a fixed throughput, loss and delay can be differentially allocated to different traffic classes. This differential allocation is achieved by the urgent/cherishing multiplexing model, which defines two explicit orderings: loss and delay. Both are combined to provide an overall partial order of quality.

Fig. 9 shows an example of the urgent/cherishing multiplexing with 9 categories of quality traffic plus Best Effort traffic. The different columns define an ordering for delay requirements at the same level of loss: traffic of category A1 will be served with strict priority over traffic of category A2, A2 over A3, etc. In contrast, the different rows define different levels of loss at a common delay: traffic of category A1 will experience less loss than traffic of category B1, B1 less loss than C1, etc. This generalizes to N*M categories (plus best effort). The full details on how to mathematically model the system and calculate the quality attenuation that traffic flows will experience are provided in [18]. This multiplexing model can be decomposed into policies for the RMT task of the IPCP (Fig. 4), allowing the SOS-DIF to define a set of categories of loss and delay to better support the applications deployed over the decentralized cloud.

Fig. 9. Example of U/C multiplexing
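
A minimal sketch of how such an urgent/cherishing policy could be decomposed into RMT enqueue and dequeue behaviour is given below. It is simplified with respect to [16][18]: a single shared buffer limit is used instead of per-class thresholds, and class identifiers are illustrative.

```python
from collections import deque

# Each class is a (cherish, urgency) pair; lower numbers mean more cherished /
# more urgent (A1 would be (0, 0) in Fig. 9). Best effort would be the largest pair.
CLASSES = [(c, u) for c in range(3) for u in range(3)]
queues = {cls: deque() for cls in CLASSES}
CAPACITY = 100                      # total PDUs the multiplexer may hold

def enqueue(pdu, cls):
    """Cherish dimension: when full, discard at the expense of the least cherished."""
    if sum(len(q) for q in queues.values()) >= CAPACITY:
        victim = max((c for c in CLASSES if queues[c]), key=lambda c: c[0])
        if victim[0] <= cls[0]:
            return False            # arriving PDU is the least cherished: drop it
        queues[victim].pop()        # make room by dropping a less cherished PDU
    queues[cls].append(pdu)
    return True

def dequeue():
    """Urgency dimension: always serve the most urgent non-empty class first."""
    for cls in sorted(CLASSES, key=lambda c: c[1]):
        if queues[cls]:
            return queues[cls].popleft()
    return None
```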

C. Tooling required for RINA deployment in VIFIB’s operational environment

After designing the different DIFs required in the decentralized cloud environment and exploring potential policies for them, it is time to look at all the tools that are needed in order to deploy RINA seamlessly over the VIFIB infrastructure, supporting legacy applications and interoperating with Ethernet and IP technologies.

1) Shim DIFs (adaptation layers)

Shim DIFs are used to transition from DIFs to a legacy technology or a physical medium. Shim DIFs wrap the legacy technology or physical medium with the RINA API, in order to allow the DIFs on top to operate seamlessly. The scenario under study requires at least two shim DIFs: i) the shim DIF over 802.1Q, which allows the deployment of RINA over a VLAN in a LAN environment; and ii) the shim DIF over TCP/UDP, which allows the deployment of RINA over an IP network using TCP or UDP underneath. A third one, the shim DIF for Hypervisors, is not strictly required but can be an optimization for VIFIB nodes that provide Virtual Machines, since it enables the communication between guest VMs and their hosts using shared memory.

2) Gateways

The gateways of the VIFIB infrastructure allow customer traffic from the IPv6 Internet to access the services deployed in VIFIB's distributed cloud. After applying RINA, the gateway needs to maintain a mapping between the IPv6 addresses of the services deployed in the VIFIB infrastructure and the DIF names through which these services are available. Using this information, the gateway will: i) terminate incoming TCP or UDP flows; ii) check the destination IPv6 address and find out which is the DIF through which the service with this IPv6 address is available (the IPv6 address of the service is used as the application process name); iii) allocate a flow to the service over the service-tree DIF identified in the previous step; iv) write the data from the TCP or UDP flow to the RINA flow, and vice versa; and v) when the TCP or UDP flow is terminated, deallocate the flow in the service-tree DIF.
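
In essence, the gateway is a small proxy driven by a single lookup table. The sketch below captures steps ii) to v); the helper names and the flow-allocation calls are hypothetical (in the spirit of the interface sketched in Section III), not actual IRATI code, and only one relaying direction is shown.

```python
# Illustrative gateway logic (hypothetical helper names): map the destination
# IPv6 address of an incoming TCP/UDP flow to the DIF where the service lives,
# then relay the data between the two worlds.
SERVICE_DIRECTORY = {
    # destination IPv6 address -> DIF through which the service is reachable
    "2001:db8:1::10": "customer-A-vpn-dif",
    "2001:db8:1::20": "customer-B-vpn-dif",
}

def handle_incoming(tcp_flow, dst_ipv6: str, rina_stack):
    dif_name = SERVICE_DIRECTORY[dst_ipv6]                    # step ii)
    # The IPv6 address of the service is used as its application process name.
    port_id = rina_stack.allocate_flow(dif_name, dst_ipv6)    # step iii)
    try:
        relay(tcp_flow, rina_stack, port_id)                  # step iv)
    finally:
        rina_stack.deallocate_flow(port_id)                   # step v)

def relay(tcp_flow, rina_stack, port_id):
    """Copy data from the TCP/UDP side into the RINA flow until it closes."""
    while True:
        data = tcp_flow.recv(4096)
        if not data:
            break
        rina_stack.write_sdu(port_id, data)
```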

3) Application APIs

Since the applications deployed in this use case cannot be modified, the VIFIB nodes running the RINA stack need to support the faux sockets API, which converts the calls to sockets into invocations to the native RINA API. Applications can be used on top of DIFs untouched, at the price of keeping the limitations of the sockets API.

V. CONCLUSIONS AND FUTURE WORK

In this paper we have explained in detail the VIFIB decentralized cloud architecture. We have also reviewed other solutions to the problem of creating a virtual overlay over the Internet. Then, we have presented a fresh, clean-slate network architecture called RINA. Finally, we have studied how to apply RINA to provide a better service to distributed cloud users and an easier management and operation of the distributed cloud nodes.

There are several benefits in adopting RINA as the network architecture for the distributed cloud scenario, which also apply to the use of RINA for designing and building overlays in general: i) the ability to use as many layers as required, using the same concepts and infrastructure; ii) the ability to customize each layer to the specific requirements of its operational environment; iii) the freedom of programming the policies of each layer without being tied to any assumptions, such as particular address formats, data transfer protocols or resource allocation schemes; iv) the transition from designing full protocols to designing policies that re-use the common infrastructure provided by the DIF; v) the ability to exploit some of the structural advantages of RINA to facilitate routing; and vi) the clean security model of RINA, which facilitates the placement of the security functions (authentication, access control, confidentiality, etc.).

Our current work is focused on fully specifying the policies introduced in this paper and simulating them using a RINA simulator based on OMNeT++ [19], which is being developed in the FP7 PRISTINE project [20]. Once the simulation results are well understood and the policies have been validated, some of them will be implemented in the RINA prototype for Linux initially developed by the FP7 IRATI project [21] and currently being enhanced by PRISTINE with the development of a Software Development Kit (SDK). The SDK will facilitate the programmability of the IRATI RINA implementation [22].

Finally, the implemented solution will be deployed over a subset of the VIFIB infrastructure for its experimental evaluation and comparison with the re6stnet overlay.

REFERENCES

[1] "VIFIB web page," available online at http://www.vifib.com/.

[2] SlapOS Community, "SlapOS documentation," available online at http://community.slapos.org/wiki/osoe-Lecture.SlapOS.Extended.

[3] B. Davie, J. Gross, "A stateless transport tunneling protocol for Network Virtualization," March 2012, available online at http://tools.ietf.org/html/draft-davie-stt-01.

[4] J. Day, "Patterns in Network Architecture: A Return to Fundamentals," Prentice Hall, 2007. ISBN 978-0-13-225242-3.

[5] OpenVPN website, available online at http://openvpn.net/.

[6] P. Mahlmann, C. Schindelhauer, "Random graphs for peer to peer overlays," available online at http://archive.cone.informatik.uni-freiburg.de/pubs/delis-barcelona.pdf.

[7] B. Bollobas, Random Graphs, Cambridge University Press, October 2001. ISBN 9780521797221.

[8] W. Fernandez de la Vega and B. Bollobas, "The diameter of random regular graphs," Combinatorica, 1981.

[9] A. S. Tanenbaum, Computer Networks, 4th ed., Prentice Hall Professional Technical Reference, 2002, Section 5.2. ISBN 0130661023.

[10] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, C. Wright, "VXLAN: A framework for overlaying virtualized L2 networks over L3 networks," August 2011, available online at http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-00.

[11] M. Sridharan, A. Greenberg, Y. Wang, P. Garg, N. Venkataramiah, K. Duda, I. Ganga, G. Lin, M. Pearson, P. Thaler, C. Tumuluri, "NVGRE: Network Virtualization Using Generic Routing Encapsulation," August 2013, available online at http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-03.

[12] S. Zhou, G. Ranger, P. Alfons, "Location-based Node IDs: enabling explicit locality in DHTs," 2003, available online at http://repository.cmu.edu/cgi/viewcontent.cgi?article=3206&context=compsci.

[13] O. Hanka, C. Spleiss, G. Kuzmann, J. Eberspacher, "A Novel DHT-Based Network Architecture for the Next Generation Internet," Eighth International Conference on Networks (ICN), 2009.

[14] J. Saltzer, "On the Naming and Binding of Network Destinations," RFC 1498 (Informational), August 1993.

[15] A. L. Pumo, "Scalable Mesh Networks and The Address Space Balancing Problem," Master's Thesis, 2010, available online at http://www.cl.cam.ac.uk/~ey204/pubs/ACS/andrea_master_thesis.pdf.

[16] N. Davies, J. Holyer and P. Thompson, "An operational model to control loss and delay of traffic in a network switch," in Third IFIP Workshop on Traffic Management and Design of ATM Networks, 1999.

[17] N. Davies, "Delivering predictable quality in saturated networks," Technical Report, September 2003, available online at http://www.pnsol.com/public/TP-PNS-2003-09.pdf.

[18] D. C. Reeve, "A New Blueprint for Network QoS," Ph.D. Thesis, August 2003, available online at http://www.cs.kent.ac.uk/pubs/2003/1892.

[19] "OMNeT++ web page," available online at http://www.omnetpp.org/.

[20] FP7 PRISTINE project website, available online at http://ict-pristine.eu.

[21] FP7 IRATI project website, available online at http://irati.eu.

[22] IRATI RINA implementation, available online at http://irati.github.io/stack.