
Flexible Large-Scale Data Storage

António José Freitas Pinheiro

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisor(s): Prof. Rodrigo Seromenho Miragaia Rodrigues, Dr. Marcelo Lebre

Examination Committee

Chairperson: Prof. Miguel Pupo Correia
Supervisor: Prof. Rodrigo Seromenho Miragaia Rodrigues

Member of the Committee: Prof. João Coelho Garcia

November 2018


Acknowledgments

First and foremost, I would like to convey a special appreciation to my thesis advisor Rodrigo Rodrigues,

for the dedication and endless support. My sincere gratitude also goes to my co-advisor Marcelo Lebre,

for all the enthusiasm, motivation and wise suggestions throughout the last year.

A very special gratitude goes out to all down at Unbabel, for the support and friendship. With a

special note to Eduardo Pereira and Emanuel Velho, for the insightful comments during this journey.

A special thanks to Sergio Mendes, Steven Silva, Rui Venancio and Paulo Figueira for the constructive criticism on the manuscript of this dissertation.

Finally, I must express my profound gratitude to my family and friends, for their wise counsel and guidance. It was their support and encouragement that got me through these hard-working years.


Resumo

With the increase in the amount of information being processed, companies are forced to grow their infrastructures in order to keep providing a quality service to their clients. Furthermore, different classes of requests have different performance and consistency requirements, which creates a tension between the simplicity of using the same system to handle all requests and the ability to differentiate how these requests are processed. However, companies that lack the dimension of the major players find it difficult to guarantee that the quality of the service provided is not affected.

In this work, a thorough analysis was made of the different types of solutions capable of dealing with the increase in the amount of information, in order to design a solution as capable as those but at a reduced cost. After this analysis, this dissertation proposes a hybrid approach inspired by the Command Query Responsibility Segregation software pattern. In its purest form, this pattern segregates requests according to their purpose, reading or writing data. However, we go further and also take into account the different consistency requirements of each request.

To test our solution, we collaborated with Unbabel, a startup that provides translation as a service, by implementing it in a specific module of their system. The results of this evaluation show that our solution is able to improve the response time of read requests with more relaxed consistency requirements, achieving, on average, a speedup of 3.30.

Keywords: CQRS, Large-scale storage, Availability, Consistency


Abstract

The ever-increasing amount of data that companies have to process forces their infrastructure to grow in order to continue serving their clients’ demands. Furthermore, different classes of storage requests have different requirements in terms of performance and consistency, and this leads to a tension between the simplicity of using the same system for handling all requests and the ability to differentiate how these requests are handled. However, companies that lack the dimension of major Internet players find it difficult to address this problem without reducing the quality of the service provided to the client.

In this work, we make a thorough analysis of the state of the art regarding the different types of solutions capable of dealing with request differentiation in large-scale storage, in order to design a solution as capable as those but at a severely reduced cost. After analyzing the state of the art, this thesis proposes a hybrid approach inspired by the software pattern called Command Query Responsibility Segregation. At its core, this pattern segregates requests according to their functionality, that is, whether they read or update data. However, in our solution, we go further and also include the different levels of consistency in this differentiation.

In order to test our solution, we collaborated with Unbabel, a startup that offers translation as a service, by implementing it on a specific module of their system. This solution managed to improve the read response time of requests with more relaxed consistency requirements, achieving an average speedup of 3.30 while maintaining the same performance for the remaining classes of requests.

Keywords: CQRS, Large scale storage, Availability, Consistency


Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature

1 Introduction
1.1 Current Solutions
1.2 Challenges
1.3 Objectives
1.4 Thesis outline

2 Related Work
2.1 Large-scale storage
2.2 Combining different storage systems
2.3 Quality of service in storage systems

3 Proposed Solution
3.1 Requirements: the Unbabel use case
3.2 Adapting the CQRS pattern
3.3 Architecture
3.3.1 Layers
3.3.2 Data flow
3.4 Data structures
3.4.1 Databases
3.4.2 Replicator queue
3.5 Algorithms
3.5.1 Routing
3.5.2 Replication


4 Implementation
4.1 System interoperability
4.2 Components
4.2.1 API Layer
4.2.2 Write Layer
4.2.3 Read Layer
4.2.4 Replication Layer
4.3 Software Architecture

5 Evaluation
5.1 Setup
5.1.1 Workload
5.1.2 Metrics
5.1.3 Description
5.2 Results
5.2.1 Read Layer Latency
5.2.2 Write Layer Latency
5.2.3 Throughput

6 Conclusions
6.1 Future Work

Bibliography


List of Tables

2.1 Storage systems comparison
2.2 Cloud storage systems comparison
3.1 Data flow
5.1 Available tasks count response times summary
5.2 Available tasks count under pressure summary
5.3 Write layer response times summary
5.4 Available tasks count throughput summary
5.5 Throughput summary under high load pressure


List of Figures

2.1 Conductor’s system overview [27]
3.1 Unbabel’s Internal Workflow
3.2 Use case
3.3 Architecture
4.1 API endpoints
4.2 Available tasks count endpoint class
4.3 Stored procedure: notify trigger
4.4 Stored procedure: get available tasks count
4.5 Function: pg listen
4.6 Function: replicate available tasks count
4.7 Function: setup periodic tasks
4.8 Function: get editor id
4.9 Redis database schema
4.10 PostgreSQL database schema
5.1 Available tasks count response time over time
5.2 Available tasks count response time over time under high load pressure
5.3 Create tasks response time over time
5.4 Get task response time over time
5.5 Submit task response time over time
5.6 Available tasks count transactions per second


Nomenclature

API Application Programming Interface

CQRS Command Query Responsibility Segregation


Chapter 1

Introduction

Nowadays, startups that process huge amounts of data, but lack the tremendous scale and economic power that companies like Google1 or Facebook2 have, find it difficult to store and process that amount of information in a way that does not affect the quality of service provided to the user. Typically, these startups use open-source storage systems, for example MongoDB [1] or Cassandra [2], which make them capable of coping with the continuous growth of data through the usage of sharding techniques, for instance. So far this methodology has worked relatively well. However, for simplicity, the same storage system is used for dealing with different data types as well as for different kinds of requests.

Although MongoDB and Cassandra scale well in terms of the amount of data stored, they do not have a way of differentiating the data types and request types that the system has to deal with. Without this differentiation, the load pressure related to the read requests will affect the response time of the write requests and vice-versa.

In some cases, such as Unbabel3, the read load is higher than the write load. Hence, differentiating request types would allow these companies to tackle that problem by, for instance, scaling only the side under the most load pressure instead of their entire infrastructure, which naturally reduces the cost associated with this procedure.

We aim to bring together the best of both worlds by using open-source storage systems, benefiting from the work done by the open-source community, and by introducing the concept of request and data type differentiation, where requests are segregated into two different flows according to their requirements and type.

1.1 Current Solutions

Currently, there are open-source scalable storage systems capable of dealing with the large amounts of data that some enterprises have to handle, such as MongoDB [1], Cassandra [2] or Riak KV [3].

1 https://www.google.com/
2 https://www.facebook.com/
3 https://unbabel.com/


However, they do not offer data and request differentiation, which is useful to improve the quality of

service for different types of workloads.

On the other hand, there are systems that combine storage systems with different characteristics

to build a global storage system better than each one of them individually, for instance RACS [4] or

HyRD [5]. However, these systems have some objectives that differ from ours, for example, avoiding

vendor lock-in.

Lastly, there are storage systems that focus on quality of service requirements. However, such systems, like Hippodrome [6] or Ergastulum [7], are often old prototypes that lack the widespread adoption or constant enhancement of a solution like MongoDB or Cassandra.

1.2 Challenges

Having the ability to deal with large amounts of data, as well as coping with sudden workload spikes, inherent to almost every huge enterprise on the Internet today, has become more difficult over time by virtue of the pace at which information is growing.

A possible solution here would be to change the storage system itself to incorporate a quality of

service concept. This would be within the reach of big companies such as Google or Facebook, but for

a startup, which needs to focus on the rapid growth of its business, it does not make sense to develop

in-house large-scale storage.

Even adopting an existing storage system capable of coping with the workload demand, instead of developing in-house storage, would be troublesome due to the amount of effort and time that refactoring the existing code would consume, consequently hindering the productivity and growth of the startup, which is crucial in its earlier stages.

Another important factor is the ability to scale over time. Although current solutions manage to effectively deal with the increase of processed data, it is of great importance to be able to easily adapt the system in case the problem emerges again in the future.

1.3 Objectives

The goal is to have the capacity to evolve along with the data growth without compromising the response

times of the system. Moreover, we should be able to employ this solution with the least amount of

refactoring possible.

The main idea behind the solution is to combine different storage systems with different characteristics to create an enhanced global storage system capable of handling each request better than if we were to use a single storage system for everything. With this in mind, we will be exploring the CQRS [8] pattern. This pattern states that write and read responsibilities should be segregated and individually addressed by different models. However, we will not only segregate requests based on their purpose, read-only or write, but also take into account their different levels of consistency.
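As a rough illustration of this idea, the Python sketch below routes each request either to a write model or to one of two read models, depending on its purpose and on the consistency level it tolerates. The class names, backends and consistency labels are hypothetical and serve only to make the segregation concrete; they are not the interfaces of the actual implementation described in Chapter 4.

from enum import Enum

class Consistency(Enum):
    STRONG = "strong"
    EVENTUAL = "eventual"

class Request:
    def __init__(self, is_write, consistency=Consistency.STRONG, payload=None):
        self.is_write = is_write
        self.consistency = consistency
        self.payload = payload

class Router:
    """Illustrative CQRS-style router: commands go to the write model, queries go to a
    read model chosen by the consistency level they tolerate."""

    def __init__(self, write_store, strong_read_store, eventual_read_store):
        self.write_store = write_store                  # e.g. the primary relational database
        self.strong_read_store = strong_read_store      # reads that must see the latest write
        self.eventual_read_store = eventual_read_store  # e.g. a cache fed asynchronously

    def handle(self, request):
        if request.is_write:
            return self.write_store(request.payload)
        if request.consistency is Consistency.STRONG:
            return self.strong_read_store(request.payload)
        return self.eventual_read_store(request.payload)

# Toy backends: any callable can stand in for a storage client here.
router = Router(
    write_store=lambda p: f"written {p}",
    strong_read_store=lambda p: f"fresh read of {p}",
    eventual_read_store=lambda p: f"possibly stale read of {p}",
)
print(router.handle(Request(is_write=False, consistency=Consistency.EVENTUAL, payload="tasks")))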


1.4 Thesis outline

We start this thesis by, in Chapter 2, presenting our research on the related work, followed by the proposed solution in Chapter 3. Next, in Chapter 4, we describe how our solution is implemented. Chapter 5 demonstrates the results of the evaluation performed on our solution. We finalize the document, in Chapter 6, by presenting our conclusions and addressing future work.


Chapter 2

Related Work

In this chapter we present our survey on the most relevant topics related to our work. We start, in

Section 2.1, by addressing systems that were built to store large quantities of data. Next, in Section 2.2,

we present a set of systems that combine different storage systems to build a global system that takes

advantage of the strengths of each storage system. We finalize, in Section 2.3, by addressing studies

and proposals for adding quality of service to storage systems.

2.1 Large-scale storage

We can broadly divide storage systems into two classes: SQL or NoSQL. NoSQL storage systems can

be further differentiated according to how they choose to store data. In particular, within the NoSQL

category, we can find key-value stores, wide-column stores, document databases or graph databases.

In this section, we will focus on storage systems that are designed to store large quantities of data while

providing high performance and availability guarantees.

SQL Storage Systems

SQL storage systems are those based on the language supported by relational databases, which is by and large a variation of the Structured Query Language (SQL). This corresponds to a rigid, structured way of storing data, like a phone book. Traditionally, relational databases were based on an implementation for a single machine. However, there are several ways to allow relational databases to be more available or store more data and still provide good performance. Below, we describe recent research proposals for scaling this approach.

BlinkDB

BlinkDB [9] is an approximate query engine for running interactive SQL queries on large volumes of

data. It aims to enhance the performance on large data by trading query accuracy for response time.

Storage systems that use approximation techniques depend heavily on the workload for which the samples are built. We can divide workloads into four categories: Predictable Queries, Predictable Query Predicates,


Predictable Query Column Sets and Unpredictable Queries. Predictable Queries assume that all future queries are known in advance, Predictable Query Predicates assume that the frequency of group and filter predicates does not change over time, Predictable Query Column Sets assume that the frequency of the sets of columns used for grouping and filtering does not change over time, and Unpredictable Queries assume that queries are unpredictable. BlinkDB uses the Predictable Query Column Sets model because its authors found this type of query to be the most common one on large-scale analytical clusters. These systems also depend on the type of sampling performed on that workload. In the case of BlinkDB, it resorts to stratified sampling [10] and uniform sampling, depending on the user constraints. For example, uniform sampling works well with queries that do not filter or group subsets of the table, but otherwise stratified sampling is the best choice.

Regarding the data model, BlinkDB is a SQL database that supports a constrained set of SQL-style declarative queries. This storage system can provide approximate results for standard SQL aggregate queries such as COUNT, AVG, SUM and QUANTILE, but it does not support arbitrary joins or nested SQL queries. Queries can be annotated with either an error bound or a time constraint. As an example, consider querying a table Person with three columns, Name, Gender and Email. The query:

SELECT COUNT(*)
FROM Person
WHERE Gender = 'male'
ERROR WITHIN 5% AT CONFIDENCE 95%

will return the result of the count operation with a relative error of at most 5% at a 95% confidence level, while the query:

SELECT COUNT(*)
FROM Person
WHERE Gender = 'male'
WITHIN 2 SECONDS

will return the most accurate result it can compute within 2 seconds.

Architecturally, BlinkDB extends the Apache Hive Framework [11] by adding two major components:

an offline sampling module, responsible for creating and maintaining the samples over time, and a

runtime sample selection module, responsible for creating an Error-Latency Profile that is later used to

select the most suitable sample depending on the type of query the user is requesting, either an error-bound one or a time-constrained one.

NoSQL Key-Value Storage Systems

Key-value stores are storage systems that store data as an association between a key and a value. Each item stored has an attribute name (or "key") coupled with its value. Below we describe a few storage systems that use this type of storage.


Riak KV

Riak KV [3] is a NoSQL key-value database capable of running on commodity hardware while guaranteeing high availability and scalability. It was designed to survive network partitions and failures as well as to scale seamlessly. It is capable of adding and removing capacity on demand without data sharding or manually restructuring the cluster.

Architecturally, Riak uses a masterless, peer-to-peer design under a ring topology instead of a master-slave design. Since all nodes are homogeneously capable of serving reads and writes, the system has no single point of failure. With this kind of design, the system provides high data availability under many failure circumstances. To balance the load of the ring, this storage system uses consistent hashing [12], which allows for simple, local decisions regarding where to store the data, and attempts to minimize the number of keys that need to be remapped when nodes join or leave the system.
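To make the idea of consistent hashing more concrete, the sketch below places nodes on a hash ring and assigns each key to the first node found clockwise from the key’s position, so that when a node joins or leaves only the keys on the neighbouring arc move. This is a generic illustration of the technique, not Riak’s actual implementation, which additionally uses virtual nodes and a fixed number of partitions.

import hashlib
from bisect import bisect_right, insort

def _hash(value: str) -> int:
    # Map any string onto a large integer ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._ring = []                      # sorted list of (position, node) tuples
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        insort(self._ring, (_hash(node), node))

    def remove_node(self, node: str):
        self._ring.remove((_hash(node), node))

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around the ring).
        positions = [pos for pos, _ in self._ring]
        index = bisect_right(positions, _hash(key)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # only keys on the affected arc move on membership changes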

Another important aspect of this system, which relates to many NoSQL systems, is the concept of eventual consistency. This means that a write operation that updates the data is considered complete even if only a subset of the nodes has confirmed receiving it. By default, Riak replicas are eventually consistent, which makes data more available, but sometimes inconsistent.

Regarding the data model, similarly to other key-value stores, data is stored as a collection of key/value pairs. In Riak, the key is a binary value and the value associated with that key is either unstructured data or one of the following data types: Flags, Registers, Counters, Sets, Maps, HyperLogLog. These data types, called Riak Distributed Data Types, were developed to help the developer and are described in [3].

Redis

Redis [13] is a NoSQL Key-Value database that aims to provide high availability and performance.

Regarding the data model, key-value databases are really easy to understand, since the approach used is also known as hash addressing: a value is saved in association with a unique key. Even though Redis does not support complex queries or indexing, it supports data structures as values. These data structures in Redis are called Redis Datatypes, which include strings, lists, sets, sorted sets and hashes. As an example of this data model, "HMSET Cat:2:attributes color black age 20" adds two field-value pairs to the hash stored at the key Cat:2:attributes.
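As a small sketch of how this looks from application code, the snippet below stores and reads the same hash through the redis-py client; the key and fields mirror the example above, and the connection settings are placeholders.

import redis

# Placeholder connection settings; adjust host/port for a real deployment.
client = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Equivalent of: HMSET Cat:2:attributes color black age 20
client.hset("Cat:2:attributes", mapping={"color": "black", "age": 20})

# Read back a single field and then the whole hash.
print(client.hget("Cat:2:attributes", "color"))   # "black"
print(client.hgetall("Cat:2:attributes"))         # {"color": "black", "age": "20"}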

Architecturally, the whole dataset is kept in memory and therefore cannot exceed the amount of physical RAM. Redis writes the entire dataset to disk at configurable intervals. When a replica restarts, it simply reads the file containing the dataset back into memory. This system also uses a master-slave replication design.

NoSQL Wide-Column Storage Systems

Wide-column storage systems store data together as columns instead of rows and are optimized for queries over large datasets. Below we describe a few systems that use this approach.


Bigtable

Bigtable [14] is a NoSQL distributed storage system designed to scale to a very large size: petabytes of

data.

Regarding the data model, Bigtable is a distributed persistent multi-dimensional sorted map. This map is indexed by a row key, a column key and a timestamp. Column keys are grouped into sets called column families. All data stored in a column family is usually of the same type. A column key is named using the following syntax: family:qualifier. In this storage system, each cell can store multiple versions of the same data, which are indexed by timestamps. Timestamps can be represented as real time in microseconds or be explicitly defined by client applications. However, versioned data must be managed: Bigtable supports two per-column-family settings used to garbage-collect cell versions automatically. The client is able to specify either that only the last x versions of a cell be kept, or that only sufficiently recent versions be kept, for instance, the values written in the last 3 days.
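A minimal way to picture this data model is as a nested map keyed by row key, column key (family:qualifier) and timestamp. The sketch below is only a conceptual illustration of that indexing scheme, with made-up row keys and values; it is not Bigtable’s API.

import time

# table[row_key][column_key][timestamp] = value
table = {
    "com.example/index.html": {
        "contents:html": {
            int(time.time() * 1e6): "<html>...</html>",   # versions indexed by timestamp
        },
        "anchor:news.example.com": {
            int(time.time() * 1e6): "Example News",
        },
    }
}

def read_latest(table, row_key, column_key):
    """Return the most recent version of a cell, mimicking a timestamp-indexed lookup."""
    versions = table[row_key][column_key]
    latest_ts = max(versions)
    return versions[latest_ts]

print(read_latest(table, "com.example/index.html", "contents:html"))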

To be able to scale properly, this storage system introduced the concept of tablets. Tables are divided into tablets, which contain a defined range of rows of that particular table. With this notion, tablets can be distributed across replicas, making the system highly scalable. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets.

Architecturally, this storage system uses a master-slave design with Chubby [15] as its lock service and Paxos [16] as the algorithm responsible for keeping replicas consistent when failures occur. The system has three major components: a library linked into every client, one master server and many tablet servers. Tablet servers are responsible for managing a set of tablets. Tablets, in turn, contain the information that clients look for when issuing read and write requests. The master server is responsible for assigning tablets to tablet servers, balancing the tablet-server load and performing garbage collection. Bigtable uses a three-level hierarchy to store the tablet location information. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets of a metadata table, and each metadata tablet contains the location of a set of user tablets.

Cassandra

Cassandra [2] is a highly scalable NoSQL distributed storage system designed to run on cheap hardware

while being able to handle high write throughput without sacrificing read efficiency.

Architecturally, Cassandra uses a peer-to-peer design under a ring topology instead of a master-slave or a sharded design, mainly due to its high availability demands. Even if a node fails, the other nodes in the ring are able to deliver the data, since they all eventually share the same information.

One of the key features of this system is the ability to scale incrementally. To achieve this, Cassandra partitions data using consistent hashing [12], which allows the system to minimize the effects of the departure or arrival of a node to the ring. In this approach, immediate neighbors are affected but the other nodes remain unaffected. This system also strengthens this algorithm by analyzing load information on the ring and moving lightly loaded nodes in order to help heavily loaded nodes. This technique is described in [17].


As we mentioned previously, this is a distributed system, so replication is a very important part of it, since it is the key to achieving high availability. Depending on the type of request, replication is performed differently. For writes, the system routes the requests through the replicas and waits for a quorum of replicas to acknowledge those writes. For reads, the system, depending on the consistency guarantees established by the customers, routes the requests to the closest replica or to all replicas and waits for a certain number of responses. The client chooses the replication policy that best suits the application, with the options of Rack Unaware, Rack Aware or Datacenter Aware. These policies are described in [2].
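The quorum idea mentioned above can be sketched as follows: with N replicas, a write is acknowledged once W replicas confirm it and a read waits for R responses, and choosing W + R > N guarantees that a read overlaps the latest acknowledged write. The in-memory replica below is a stand-in for real storage nodes and only illustrates the counting, not Cassandra’s actual protocol.

import time

class InMemoryReplica:
    """Toy replica holding timestamped values; stands in for a real storage node."""
    def __init__(self):
        self._data = {}

    def store(self, key, value):
        self._data[key] = (time.time(), value)
        return True                            # acknowledge the write

    def fetch(self, key):
        return self._data.get(key)             # (timestamp, value) or None

def quorum_write(replicas, key, value, write_quorum):
    """Send the write to every replica; succeed once write_quorum of them acknowledge."""
    acks = 0
    for replica in replicas:
        if replica.store(key, value):
            acks += 1
        if acks >= write_quorum:
            return True
    return False

def quorum_read(replicas, key, read_quorum):
    """Collect responses and return the value with the newest timestamp, if enough arrive."""
    responses = [r for r in (replica.fetch(key) for replica in replicas) if r is not None]
    if len(responses) < read_quorum:
        return None
    return max(responses)[1]

# With N = 3 replicas, W = 2 and R = 2 satisfy W + R > N, so a read overlaps the latest write.
replicas = [InMemoryReplica() for _ in range(3)]
quorum_write(replicas, "task:42", "available", write_quorum=2)
print(quorum_read(replicas, "task:42", read_quorum=2))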

Regarding the data model of the system, it works as follows. A table is a distributed multi-dimensional map indexed by a key. Every operation performed under a single row key is atomic per replica, no matter how many columns are being accessed. Much like Bigtable, it allows columns to be grouped together as column families. These column families can be of two different types: Simple and Super column families. Super column families can be visualized as a column family within a column family. When serving read requests, Cassandra allows columns to be sorted either by time or by name.

NoSQL Document Storage Systems

Document storage systems pair each key with a complex data structure known as a document. Docu-

ments can contain many different key-value pairs, or key-array pairs, or even nested documents. Below

we describe a system that relies on this approach.

MongoDB

MongoDB [1] is a distributed NoSQL document database with high availability, scalability and performance guarantees. Unlike many NoSQL databases, this system is not limited to key-value queries; it is also possible to use, to name a few, range queries, geospatial queries or MapReduce queries.

Regarding the data model, this storage system stores data as documents in a binary representation

called BSON (Binary JSON). BSON documents contain one or more fields, and each field can be of

different data types, including arrays or even other documents.

Scalability in MongoDB is ensured by using a technique called sharding. Sharding distributes data across multiple partitions called shards. As the data grows or the size of the cluster increases, this storage system is capable of automatically balancing the load between the replicas. Interestingly, MongoDB provides multiple sharding policies that define how the data will be distributed across the cluster. Developers can opt between: Range-based Sharding, Hash-based Sharding and Location-aware Sharding. Range-based Sharding means that documents are partitioned according to the shard key value. Hash-based Sharding means that documents are uniformly distributed according to an MD5 hash of the shard key value. Location-aware Sharding means that documents are partitioned according to a user-specified configuration that associates shard key ranges with specific shards and hardware.
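To illustrate the difference between the first two policies, the sketch below picks a shard for a document either from a range table or from an MD5 hash of the shard key. The ranges and shard names are invented for the example and do not correspond to a real cluster configuration.

import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]

# Range-based: explicit shard key ranges (upper bounds) mapped to shards.
RANGES = [(1000, "shard-0"), (2000, "shard-1"), (float("inf"), "shard-2")]

def range_based_shard(shard_key: int) -> str:
    for upper_bound, shard in RANGES:
        if shard_key < upper_bound:
            return shard

def hash_based_shard(shard_key) -> str:
    # Spread documents uniformly by hashing the shard key with MD5.
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(range_based_shard(1500))   # shard-1: neighbouring keys stay on the same shard
print(hash_based_shard(1500))    # a shard chosen pseudo-uniformly by the hash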

Data availability is achieved using replica sets. A replica set consists of two or more copies of the


data. There is always a primary replica that is responsible for handling write and read requests; however, by default, it is possible to read eventually consistent data from secondary replicas. When the primary replica crashes, a new primary is elected from the ones available in the replica set.

Architecturally, MongoDB uses a Master-slave design.

Other types of NoSQL storage systems

Even though messaging systems are not primarily designed to store data persistently, when properly adapted they can act as storage systems. One instance is Kafka, described below.

Kafka

Kafka [18] is a distributed messaging system for log processing capable of collecting and delivering high

volumes of data with low latency while guaranteeing fault-tolerance and scalability. It can also be used

as a storage system.

Architecturally, Kafka uses a publish-subscribe design. It has three major components: producers, brokers and consumers. Producers are responsible for publishing messages to topics on the brokers, and consumers are responsible for retrieving those messages by subscribing to the topics they are interested in. Since this is a distributed system, a topic is divided into multiple partitions and each broker stores one or more of those partitions. Each partition is also replicated across a configurable number of servers (brokers) for fault tolerance.

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system. This is the main reason why we can use Kafka not only as a distributed messaging system but also as a storage service. Data written to this system is written to disk and replicated for fault tolerance. In addition, Kafka allows producers to wait for acknowledgements of their writes, allowing them to consider a write complete only when it has been fully replicated and is guaranteed to remain persistent no matter what happens. Regarding reads, one might imagine that, since this is a messaging system, consumers are not capable of choosing which particular messages to read. However, with Kafka, consumers are capable of controlling their read position.
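The sketch below illustrates both points with the kafka-python client: a producer that waits for full replication by setting acks to "all", and a consumer that explicitly rewinds its read position with seek. The topic name, broker address and offsets are placeholders.

from kafka import KafkaProducer, KafkaConsumer, TopicPartition

# Producer: acks="all" means the send is acknowledged only after full replication.
producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
future = producer.send("translations", value=b"task 42 created")
future.get(timeout=10)   # block until the broker confirms the durable write

# Consumer: explicitly control the read position instead of just receiving new messages.
partition = TopicPartition("translations", 0)
consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)
consumer.assign([partition])
consumer.seek(partition, 0)               # rewind to the beginning of the partition
records = consumer.poll(timeout_ms=1000)
for batch in records.values():
    for record in batch:
        print(record.offset, record.value)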

Summary of comparison to previous systems

Focus

Like the systems covered above, we focus on providing performance and availability guarantees when

dealing with large quantities of data. Consistency is also addressed and, in some storage systems, the

customer can choose the level of consistency.


Design

To ensure performance and availability, these systems differ both in terms of the replication architecture, for example a peer-to-peer design versus a master-slave one, and in terms of data model, for instance key-value stores versus wide-column stores.

Each one of them uses different approaches. For example, if consistency and accuracy are not a primary concern for the customer, using an approximate query engine can easily improve the performance of the system by trading accuracy for performance. However, most of the time, customers care about consistency, so approaches like key-value stores, wide-column stores or document stores are more suitable for them.

To better understand the differences and similarities between the different storage systems described

above, Table 2.1 provides an overview of the comparison between them.

System      Main goals                                   Architecture
BlinkDB     Performance                                  Unknown
Riak KV     Availability and Scalability                 Peer-to-peer
Redis       Availability and Performance                 Master-slave
Bigtable    Scalability                                  Master-slave
Cassandra   Scalability                                  Peer-to-peer
MongoDB     Availability, Scalability and Performance    Master-slave
Kafka       Scalability, Availability and Performance    Publish-subscribe

Table 2.1: Storage systems comparison

In contrast to these storage systems, where the same system is replicated to achieve availability and performance guarantees when reading or writing data, our proposed solution separates these two models, e.g., by using different storage systems to handle writes and reads. This approach allows us to select the most suitable storage system for each request type, and it is described in more detail in Chapter 3.

2.2 Combining different storage systems

In this section we cover several proposals whose idea is to combine multiple systems with different characteristics in order to obtain a single global storage system with better characteristics than if we were using each one of those systems individually. These storage systems are mainly cloud storage systems, which are known for achieving high availability and performance. Even though combining different storage systems to achieve an enhanced global storage system is the main idea of these proposals, they are also concerned with complying with the customer’s service level agreements regarding, for instance, performance, cost, throughput, consistency or availability.

CosTLO

Cost-effective Tail Latency Optimizer (CosTLO) [19] is a system that aims to reduce the high latency variance associated with cloud storage services while still meeting the clients’ demands, expressed as Service Level Objectives (SLOs).

When an end user needs to fetch or send data to the cloud, the requests that perform this operation always incur some latency, which determines how fast the operation is executed. In order to reduce the latency variance linked to cloud services, we need to identify the main factors that increase the response time of a request. CosTLO identified the following factors as responsible for latency variance:

• Internet latencies

• Latency spikes in a data center’s network

• Latency spikes within the storage service

Since tail latency samples extracted from any of the previous factors are mostly caused by isolated latency spikes, the authors opted to implement redundancy, which is a well-known approach for reducing variance [20, 21]. By augmenting each GET and PUT request with a set of concurrent redundant requests, they increase the probability of receiving at least one fast response. The redundancy approach of CosTLO can be exploited in two ways. It can implicitly take advantage of load balancing in the Internet, due to the fact that a request can take different paths to arrive at the same data center, or it can explicitly issue concurrent requests to copies of the same object stored in different locations. Depending on the latency factor, the approach changes between the two.

The following techniques explain how CosTLO reduces the latency caused by the factors previously identified. To tackle spikes in Internet latencies, CosTLO issues multiple requests to the client’s closest data center. If the client desires to reduce the tail latency even more, then CosTLO concurrently issues requests to the two data centers closest to the client, which, interestingly, can belong to different cloud providers. Even though data centers of the same cloud provider are spread across the globe to achieve more coverage, data centers of different cloud providers are likely to be closer to each other, thus leading to reduced latency. For latency spikes in a data center’s network, this system issues multiple requests to the storage service in that data center. For the last factor, latency spikes within the storage service, CosTLO issues multiple requests either to the same object or to different objects that are copies of the object that the client wants to access.
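The core mechanism, issuing redundant requests and keeping whichever response arrives first, can be sketched with a thread pool as below. The replica names and the fake GET function are placeholders standing in for real storage-service clients, and the timeout that CosTLO applies before sending the redundant copies is omitted for brevity.

import random
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def redundant_get(get_object, replicas, key):
    """Issue the same GET against several replicas and keep the first response."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(get_object, replica, key) for replica in replicas]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for future in pending:
            future.cancel()                   # best effort: abandon the slower requests
        return next(iter(done)).result()

def fake_get(replica, key):
    """Stand-in for a real storage-service GET with variable latency."""
    time.sleep(random.uniform(0.01, 0.2))     # simulate an occasional latency spike
    return f"{key} from {replica}"

print(redundant_get(fake_get, ["us-east", "us-west"], "bucket/object-key"))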

However, while redundant requests are able to considerably reduce the latency variance, they also double the monetary cost of GET operations and the network bandwidth used. This tradeoff is studied by CosTLO in order to find the most suitable configuration that fits the client’s demands: even though the main goal of this system is to reduce the latency variance of cloud storage services, it also has to take into account the users’ demands and how to remain cost-effective.

Architecturally speaking, CosTLO has a component called ConfSelector that is responsible for serving PUTs and GETs between the client and the storage services available in the cloud. ConfSelector divides time into epochs and, at the start of every epoch, it selects a configuration separately for each IP prefix. The advantage of using epochs is that the system can detect, at the start of each epoch, whether the client’s SLOs are being met. This configuration is identified as follows. ConfSelector starts by setting each IP prefix to issue only a single request to the data center closest to the client. It then maintains a pool of candidate configurations, which is gradually iterated in increasing order of cost, until the minimum-cost configuration capable of meeting the client’s demands is found.

The techniques used by CosTLO to minimize the cost to the client while keeping the latency variance low are also expressed by the timeout period imposed before redundant requests start being issued: concurrent redundant requests are only issued if, during a predefined time window, a request issued by a client has received no response.

RACS

Redundant Array of Cloud Storage (RACS) [4] is a proxy that focuses on allowing customers to avoid vendor lock-in, reduce the cost of switching between cloud providers, and better tolerate provider outages or failures.

Cloud storage providers distinguish themselves by allowing data to be accessed all over the globe as well as by integrating storage with other cloud computing services, for instance Amazon’s EC2 with S3. Storage is the most sought-after service of them all, which leads to very competitive pricing schemes among cloud providers. They also compete in other aspects, such as availability and uptime guarantees, ensured by systems capable of handling load balancing, recurring data backup and redundancy, to more efficiently minimize the cost of failure.

Having in mind that clients are concerned about cost and availability, this system identified two types of failures: outages and economic failures. Outages are described as a series of improbable events that lead to data loss in the cloud, and economic failures as the constant change in the pricing schemes of cloud providers. The latter is the main concern of RACS.

To address these failures and avoid vendor lock-in, hence minimizing the cost of failure, RACS proposes to redundantly distribute data across different cloud providers. Fully replicating the data guarantees market mobility but at a very high storage and bandwidth cost. Since cost and availability are the main goals of the system, the authors opted instead to stripe data across multiple providers. They argue that this technique gives clients agility when responding to provider changes or to new providers that emerge in the market with more attractive pricing schemes. The redundancy is ensured by the use of erasure coding, which allows RACS to avoid strict replication because only a subset of the shares is required to reconstruct the original data.

In the end, RACS managed to tolerate outages, tolerate data loss, adapt to price changes and even to implement data recovery. Nonetheless, while it was able to mitigate vendor lock-in, it incurred higher overheads in storage, throughput, bandwidth and latency. This tradeoff can be adjusted by the consumer when setting up RACS.

Architecturally, RACS mimics the interface of Amazon S3 by using the same data model. It stores data in buckets, where each bucket contains keys associated with objects. This option was mostly driven by S3’s popularity, which increases RACS’s compatibility with existing client applications. In terms of design, RACS is a proxy that links clients to a set of n repositories, which are cloud storage locations. Basically, it is responsible for splitting an object into m data shares of equal size when a user issues a PUT, and for recomputing the data coming from those n repositories when a client issues a GET. Since all data must pass through a RACS proxy, it can become a bottleneck. To solve this, the system proposes a distributed architecture where multiple proxies are synced using ZooKeeper [22].
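A very simplified way to picture the PUT path is the sketch below: the object is split into m equal-sized shares, one XOR parity share is added, and each share would then be uploaded to a different repository so that any single one can be lost. Real erasure codes, and RACS itself, are more general than this single-parity example, so it should be read purely as an illustration.

def split_with_parity(data: bytes, m: int):
    """Split data into m equal-sized shares plus one XOR parity share."""
    share_size = -(-len(data) // m)                     # ceiling division
    padded = data.ljust(share_size * m, b"\0")
    shares = [padded[i * share_size:(i + 1) * share_size] for i in range(m)]
    parity = bytearray(share_size)
    for share in shares:
        for i, byte in enumerate(share):
            parity[i] ^= byte
    return shares, bytes(parity)

def recover_missing(shares, parity, missing_index):
    """Rebuild one lost share by XOR-ing the parity with the surviving shares."""
    rebuilt = bytearray(parity)
    for index, share in enumerate(shares):
        if index == missing_index or share is None:
            continue
        for i, byte in enumerate(share):
            rebuilt[i] ^= byte
    return bytes(rebuilt)

shares, parity = split_with_parity(b"object payload to stripe across providers", m=4)
assert recover_missing(shares, parity, missing_index=2) == shares[2]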

SPANStore

Storage Provider Aggregating Networked Store (SPANStore) [23] is a system that focuses on minimizing cloud storage costs while respecting the application’s SLOs, such as data consistency, fault tolerance and latency.

This system aims to overcome several challenges in order to satisfy the client’s latency, consistency and fault tolerance requirements. Firstly, the system needs to figure out how to handle the interdependencies between these requirements. For example, to minimize cost for an application that desires strongly consistent data, the best option is to store all of the application’s data in the cheapest storage service; however, this may violate the latency requirements. Secondly, it has to take into account the workload of the application. Even if different applications have the same latency, fault tolerance and consistency demands, their optimal configurations may differ depending on whether the application is dominated by PUTs or GETs. Finally, it needs to consider the multi-dimensional pricing of cloud storage services, due to the several metrics used, for instance the amount of stored data or the number of PUTs and GETs issued.

Like the previous systems that we analyzed, SPANStore seeks to provide low-latency access to storage. The most direct way to achieve this is to replicate every object in every data center. However, while replicating all objects to all data centers may ensure low-latency access, this approach is costly and sometimes inefficient; for instance, some objects may be requested more often in some regions than in others. SPANStore decided to leverage multiple cloud providers, since different providers have data centers in different locations throughout the globe, which allows clients and other data centers to communicate with the closest data center possible, thus lowering the latency of user requests. While the primary goal of this approach is to reduce latency, it also helps to minimize cost, because SPANStore exploits the discrepancies in the price schemes of cloud providers for the same services. For instance, if an application is PUT-heavy, SPANStore opts to serve that application’s PUTs from the data centers of the cloud provider with the least expensive PUTs. Thus, by judiciously combining resources from multiple providers, SPANStore is able to use a set of inexpensive operations to reduce costs.

To prevent the system from failing to meet the client’s SLOs, due to, for instance, changes in the price schemes of providers, SPANStore keeps track of the application’s workload and latencies. If it notices that the system is not meeting the client’s demands, it modifies the system’s behavior towards that goal.

Architecturally, SPANStore is deployed at every data center in which the application is deployed. The application issues PUT and GET requests for objects to a SPANStore library that the application is linked to. This library then serves these requests by looking up in-memory metadata stored in the SPANStore VMs of that data center, and thereafter issuing the requests to the respective storage services. The manner in which SPANStore VMs should serve these requests for any particular object is defined by a central component called PlacementManager. This component divides time into fixed-duration epochs, as described previously for CosTLO. At the start of every epoch, all SPANStore VMs transmit to this component a summary of the application’s workload and latencies measured in the previous epoch. With this information, it computes the optimal replication policies to be used for the next epoch, in order to keep meeting the client’s SLOs.
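A toy version of the cost reasoning behind PlacementManager is sketched below: given per-provider prices along the dimensions mentioned above (stored data, PUTs, GETs) and a workload summary for an epoch, it picks the cheapest provider. The price figures and provider names are invented, and the real system optimizes full replication policies under latency and fault-tolerance constraints rather than choosing a single provider.

# Hypothetical price sheets: (USD per GB stored, per 1000 PUTs, per 1000 GETs).
PRICES = {
    "provider-a": (0.023, 0.005, 0.0004),
    "provider-b": (0.020, 0.010, 0.0004),
    "provider-c": (0.026, 0.004, 0.0003),
}

def epoch_cost(price, stored_gb, puts, gets):
    storage_price, put_price, get_price = price
    return (stored_gb * storage_price
            + (puts / 1000) * put_price
            + (gets / 1000) * get_price)

def cheapest_provider(stored_gb, puts, gets):
    return min(PRICES, key=lambda p: epoch_cost(PRICES[p], stored_gb, puts, gets))

# A PUT-heavy workload favours the provider with the cheapest PUTs.
print(cheapest_provider(stored_gb=50, puts=2_000_000, gets=100_000))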

HyRD

HyRD [5] is a system that focuses on improving cloud storage availability, avoiding vendor lock-in, reducing user access latency and improving cost-efficiency in order to avoid violating service level agreements.

The reliability of data stored in the cloud is threatened by the outages that cloud storage services face, and this becomes an even more important problem when a single cloud storage service is used to store the data. These outages may lead to data loss in the cloud or to the service becoming unavailable to users. Even when strict Service Level Agreements (SLAs) are set between the cloud provider and the user, service failures and outages are almost unavoidable and may jeopardize such targets. For instance, a series of high-profile cloud outages in 2014 [24] hit major cloud providers, such as Amazon, Microsoft and Google, ranging from a 5-minute failure that cost half a million dollars to a week-long disruption that caused an immeasurable amount of brand damage. These are only some examples of the problems faced when storing data in the cloud; we addressed this subject in more detail when describing SPANStore.

To solve this problem, HyRD proposes, similarly to other systems we surveyed, to redundantly distribute the data across multiple providers by means of data redundancy schemes, such as replication and erasure codes. With this solution, users are able to avoid vendor lock-in, since the cost of switching is now lower than before, while also being protected against outages of a single cloud provider. However, deciding on the best redundancy scheme is not straightforward. On the one hand, while replication increases availability and durability, it introduces a set of challenges. For example, to achieve high availability for large systems, it is necessary to increase the number of replicas, which introduces extra bandwidth and storage overhead, and this is more problematic when dealing with large files. For smaller files, however, replication is still the best approach, providing the best performance with small bandwidth and storage overhead. On the other hand, erasure coding provides redundancy with less space overhead. As explained for RACS, it divides an object into m fragments and recodes them into a larger set of n fragments such that the original fragments can be recovered from a subset of the n fragments.

This implies that storing a file requires extra time to compute and record the redundancy information, especially when dealing with small files. Another problem is the large amount of network traffic required to reconstruct data. In summary, replication-based schemes perform better when storing small files and file metadata, while erasure-code-based schemes perform better and are more cost-efficient when dealing with larger files. We must also take into consideration that the size of files and metadata may vary. Previous studies on workload characteristics have shown that file metadata accesses are more frequent than file accesses [25] [26], thus the performance of file metadata accesses is critical to the overall system performance. HyRD, similarly to the systems previously analyzed, noticed that different cloud providers offer different price schemes as well as different access latencies.


Based on this analysis, HyRD decides to use the redundancy scheme that best fits each file. Large files

are redundantly stored in cost-efficient cloud storage providers using erasure-code-based schemes,

while small files are stored in multiple high-performance cloud storage providers using replication-based

schemes. This way, this system is able to mitigate the disadvantages of both schemes.
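To make the m/n fragment idea concrete, the toy sketch below implements the simplest possible erasure code: an object is split into m = 2 data fragments plus one XOR parity fragment (n = 3), and any single lost fragment can be rebuilt from the remaining two. This is only an illustration of the general idea; RACS and HyRD rely on proper (n, m) erasure codes such as Reed-Solomon, and the function names here are ours, not theirs.

def encode(obj: bytes):
    # Split into two equal-sized data fragments (padding the second one) and
    # compute a third parity fragment as their XOR.
    half = (len(obj) + 1) // 2
    f1, f2 = obj[:half], obj[half:].ljust(half, b"\0")
    parity = bytes(a ^ b for a, b in zip(f1, f2))
    return [f1, f2, parity]

def decode(fragments, original_len):
    # Rebuild a single missing fragment (marked as None) from the other two.
    f1, f2, parity = fragments
    if f1 is None:
        f1 = bytes(a ^ b for a, b in zip(f2, parity))
    if f2 is None:
        f2 = bytes(a ^ b for a, b in zip(f1, parity))
    return (f1 + f2)[:original_len]

obj = b"some object stored across three providers"
fragments = encode(obj)
fragments[1] = None                      # simulate the outage of one provider
assert decode(fragments, len(obj)) == obj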

Architecturally, HyRD is divided in three main modules:

1. Workload Monitor

2. Cost and Performance Evaluator

3. Request Dispatcher

The first module is responsible for classifying the incoming data as file metadata, small file or large

file. This classification is based only on the access latency of the data. The second module evaluates the cloud storage services based on their performance and price schemes. The last module uses the classification produced by the workload monitor, together with the evaluation produced by the cost and performance evaluator, to decide which redundancy scheme is most appropriate for that data.

Finally, the data is then distributed to the corresponding cloud storage providers.
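The following sketch condenses this dispatch decision. The size threshold, provider handles and helper names are illustrative assumptions on our part; HyRD's actual classification and placement logic is more elaborate.

LARGE_FILE_THRESHOLD = 1 << 20  # assumed 1 MiB cut-off between "small" and "large" files

def dispatch(name, payload, is_metadata, fast_providers, cheap_providers, erasure_encode):
    # Route one object to the redundancy scheme that fits it best.
    if is_metadata or len(payload) < LARGE_FILE_THRESHOLD:
        # Small files and metadata: full replicas on high-performance providers.
        for provider in fast_providers:
            provider.put(name, payload)
    else:
        # Large files: erasure-coded fragments spread over cost-efficient providers.
        for provider, fragment in zip(cheap_providers, erasure_encode(payload)):
            provider.put(name, fragment)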

HyRD decided to interact with the cloud providers using their standard interfaces, which facilitates

the inclusion of any cloud storage provider.

When dealing with service outages, small files and file metadata are fetched from the replicated

providers, while large files are reconstructed using the erasure-coding scheme.

Conductor

Conductor [27] is a system that aims to deploy and execute MapReduce computations in the cloud

by choosing the best cloud providers/services that enable meeting the requirements presented by the

clients (the computation to be executed in the cloud, set of cloud services that could be used, set of

goals to optimize the execution of that computation).

The motivation of this system is that, when using cloud services, customers face the challenge of

choosing the cloud provider that best suits their performance and price needs. Many providers offer similar services that differ only in pricing schemes and performance characteristics. And even after picking a provider, there are often different types of resources for the same service. An example of this is Amazon's EC2 service, where one can choose among a variety of virtual machine instance types. The pricing schemes of cloud services are another issue that customers need to take into consideration. These schemes are not static, which adds another layer of complexity.

They may vary over time as providers adjust their pricing models, or when new providers emerge in

the market. Estimating the cost of alternative deployment strategies is also challenging because the

information available about the services used to deploy a system may not correspond to what is actually

observed. This may also happen due to problems in the cloud service, where work throughput may

drop due to congestion. Customers also need to take into account the cost of transferring data between

computation and storage locations. Last but not least, the customer also has to decide whether it prefers to pay less and accept lower reliability (for instance, lower replication factors), or to pay more and obtain stronger guarantees against faults. All of these challenges are problems that Conductor tries to solve.

Even though these problems are Conductor’s main challenges, transparency, efficiency, adaptivity

and flexibility are concerns that are also taken into consideration. The system is able to leverage different types of services without requiring applications to be adapted to the interface provided by

that service; to allow the customer to specify goals; to detect when the system is no longer meeting the

customer’s SLAs and react accordingly; and to easily incorporate new services.

To accomplish all of this, Conductor chooses the services that are able to meet the customer-defined

goals (for instance, minimizing cost or execution time) and deploys computations on the cloud according

to that choice.

After receiving the customer’s requirements needed to plan and deploy jobs in the cloud (namely a

computation to be executed in the cloud, a set of cloud services that could be used for executing the

computation, and a set of goals to optimize the execution), Conductor starts its life cycle.

This system is divided in four phases:

1. Create the initial model that describes the computation and the costs implied by its execution, for instance, how much it will cost or how much time it needs to conclude a MapReduce job;

2. Determine an optimal execution plan by using a solver and the information gathered in the previous

phase;

3. Deploy the planned execution;

4. Detect deviations and upon that detection compute a new plan that enables the system to keep

meeting the client’s requirements. When computing the new plan, Conductor also tries to take

advantage of price drops on the services available from the cloud providers;
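As a rough illustration of phases 1 and 2, the snippet below estimates the cost and runtime of a job on a few candidate deployments and picks the cheapest one that meets a deadline. The cost model, instance names and prices are made up for the example; Conductor itself builds a much richer model and feeds it to an optimization solver.

def estimate(candidate, job_hours, data_gb):
    # Very simple cost model: pay per wall-clock hour plus data transfer.
    runtime = job_hours / candidate["speedup"]
    cost = candidate["price_per_hour"] * runtime + candidate["price_per_gb"] * data_gb
    return cost, runtime

def plan(candidates, job_hours, data_gb, deadline_hours):
    # Keep only the deployments that meet the deadline and return the cheapest one.
    feasible = [(estimate(c, job_hours, data_gb)[0], c["name"]) for c in candidates
                if estimate(c, job_hours, data_gb)[1] <= deadline_hours]
    return min(feasible) if feasible else None

candidates = [
    {"name": "small-vm", "price_per_hour": 0.05, "speedup": 1.0, "price_per_gb": 0.09},
    {"name": "large-vm", "price_per_hour": 0.20, "speedup": 3.5, "price_per_gb": 0.09},
]
print(plan(candidates, job_hours=10, data_gb=50, deadline_hours=4))  # -> (5.07..., 'large-vm')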

In Figure 2.1, we summarize the overall system design. There we can better understand the objective

of each phase and the information shared between them.

Summary of comparison to previous systems

Focus

These systems focus on ensuring that customer service level agreements, for instance, consistency and

performance, are always guaranteed while trying to keep the systems as cost-effective as possible to

the customers. These customer requirements are also concerns applicable to our proposed solution,

described in Chapter 3.

Design

To accomplish their goals, the designers of these systems needed to understand what was the best

strategy to be applied taking into account the specifics of the systems and the end users’ goals.


Figure 2.1: Conductor’s system overview [27]

To address the problems that each system faced, different strategies had to be chosen. Although

these systems share the same concerns, each one of them prioritizes these concerns differently. To this end, they needed to understand which tradeoffs could be easily justified. For instance, to reduce latency, some systems decided to use cloud providers that guaranteed better performance. When availability was their main target, they often opted to introduce redundancy, whether through replication-based schemes or erasure-coding-based schemes. However, they had to explore the implications of this redundancy, which could very easily affect the performance requirements of the customer. They also noticed that depending on the workload characteristics of the data being managed, different replication schemes had to be chosen accordingly in order to improve their results. If monetary cost

was their main concern, they opted to manage multiple cloud providers, which allowed them to choose

the ones that provided the best service at the lowest cost.

To better understand the differences and similarities between the different systems described above,

Table 2.2 summarizes each system.

Similarly to these systems that combine different storage systems available in the cloud to come up

with a better global storage system, we will be focusing on CQRS [8] (Command Query Responsibility Segregation), a pattern that allows us to differentiate the models used to store and retrieve data. However, instead of leveraging multiple cloud providers and their respective storage service offerings, our focus is on combining different storage systems with different characteristics, all of which are locally deployed by the same organization. This gives the system increased flexibility in terms of, e.g., configuring each storage system, or transferring data across systems and reconfiguring system

parameters dynamically.


System | Main goals | Strategy
CosTLO [19] | Reduce the high latency variance associated with cloud storage services | Issue concurrent redundant GET/PUT requests
RACS [4] | Avoid vendor lock-in and better tolerate outages | Redundantly distribute data across different cloud providers through erasure-code-based schemes
SPANStore [23] | Minimize cloud storage monetary costs without violating service level agreements | Combine resources from multiple providers
HyRD [5] | Improve availability, avoid vendor lock-in and reduce user access latency | Redundantly distribute data across multiple providers, combining erasure-code-based and replication-based schemes
Conductor [27] | Deploy MapReduce computations in the cloud with optimal cost and performance | Predict the costs of the computation and choose the best approach according to the output of an optimization tool

Table 2.2: Cloud storage systems comparison

2.3 Quality of service in storage systems

Over the years, designing and maintaining storage systems has become increasingly difficult and complex due to the amount of information that needs to be processed. Data availability as well as performance are crucial for every system that relies on a storage system to handle its data: if the storage system goes down, the whole computer system that relies on it will most likely go down as well. When designing these systems, administrators have to properly map their logical units to the available hardware, otherwise the quality of service is highly affected.

Typically, administrators configure storage manually using rules of thumb. However, taking into ac-

count that storage systems are complex and application workloads rather complicated, the resulting

systems are costly or poorly set up.

To address this problem, a few tools have been proposed, each one of them exploring different

techniques, to come up with the best design possible for each system.

For instance, Minerva [28] divides the problem of designing storage systems in three phases. The

first one is to choose the right set of storage devices for a particular application workload. The second

phase is choosing the values for configuration parameters in the devices. The last phase is to map the

user data to those devices. To design the system it uses heuristic search techniques, and analytical

device models to evaluate the result of the heuristic search.

Ergastulum [7] receives a workload description, a set of potential devices, optional user-specified

constraints, and an externally-specified goal to choose the appropriate type of device to use, the con-

figuration of each device and the mapping of application data onto those devices, to best achieve the

specified goal using a generalization of the best-fit bin-packing algorithm [29], described in [7]. After

designing the first configuration, it monitors the current design and performs modifications if necessary.

Rome [30] isn’t a tool itself but an information model, which is the base of several tools used to design

storage systems configurations. It is also capable of monitoring the result of the designed system and if

necessary, with the help of those tools, automatically adapt to meet the quality of service intended.

Other systems have opted to use different techniques, for example, iterative algorithms like Hippodrome [6], which repeatedly refines an initial storage system design until it reaches a good one.

In contrast to this class of system, we are not trying to control and improve the quality of service of

a storage system by modifying that system, but instead our goal is to introduce a layer that mediates

access to one or more storage systems and superimposes such quality of service guarantees on top of

existing systems.

Summary

This chapter introduced three types of systems capable of dealing with the ever-increasing data usage. The first type relates to large-scale storage systems, which rely on replication to guarantee availability and performance under such circumstances. The second type consists of systems that combine several storage systems with different characteristics to achieve a better overall solution; these are mainly cloud storage systems, which are known for achieving high availability and performance. Lastly, we presented systems that address this problem by instilling quality of service in storage systems. However, taking into account that storage systems are complex and application workloads rather complicated, the resulting systems are often costly or poorly set up.

This study led us to propose a solution, described in the next chapter, that combines a few ideas

gathered from the state-of-the-art.


Chapter 3

Proposed Solution

Taking into consideration the analysis of the solutions presented in Section 2.1 and Section 2.2, our

proposed solution consists of combining storage systems with different characteristics, similarly to the

systems described in Section 2.2, to create an enhanced global storage system capable of storing and retrieving data better than if a single storage system were used. However, instead of leveraging multiple cloud providers and storage systems available on the cloud, we will combine storage systems that can all be managed by the same cloud provider or even locally deployed by the same organization. To achieve our goal, we will explore a hybrid approach inspired by the software pattern called Command Query Responsibility Segregation [8]. This pattern states that we should segregate the models responsible for storing and retrieving data from the system. In other words, requests that fetch information from the system may be handled by a different storage system than the requests that perform modifications on the stored data. This notion will hold true in most cases, except when dealing with requests that require strong consistency. In this case, we will have to resort to a more costly protocol to retrieve information back to the user.

In the following sections, we describe in more detail this hybrid approach. We start in Section 3.1

by outlining the setting in which our solution is to be deployed, which gives a real world motivation and

context for the required solution. Then, in Section 3.2, we briefly explain the high level idea of the

solution, followed, in Section 3.3, by the architecture of our solution. Finally, in Section 3.4 and Section

3.5, we describe the data structures and algorithms employed.

3.1 Requirements: the Unbabel use case

Unbabel1 is a company that provides translation as a service by combining humans and AI. Its ambition is to break the language barrier through the delivery of quality translations to its clients.

Its process is quite simple: clients upload content specifying the desired language pairs, i.e., from which language and into which language they want their content to be translated, and Unbabel returns the content translated into the chosen languages.

1https://unbabel.com/


Internally, the uploaded content goes through a pipeline that, in a very high level, looks as follows:

1. The clients’ content is uploaded to the system

2. The content is translated using AI

3. The content is split into smaller fragments

4. The AI translations are edited by humans

5. The fragments are merged

6. The translated content is delivered to the client

This workflow is represented in Fig. 3.1.

Figure 3.1: Unbabel’s Internal Workflow

As we can see in the workflow depicted in Fig. 3.1, besides the clients, we also have editors,

which are responsible for editing the translated text of the machine translation step. Each one of them

interacts with the system differently, producing different kinds of workloads. Naturally, these workloads, in conjunction, will put the system under pressure, consequently degrading the user experience. For a company that aims to deliver high-quality translations as fast as possible, response times and user experience are highly valued and must be preserved.

To better understand these requirements and how our solution operates, we now describe in more detail a specific aspect of this workflow, which is implemented by a subsystem called Tarkin.

2 https://developers.unbabel.com/v2/docs


Tarkin is a task manager, which means that its responsibility is to store and manage tasks. Tasks are

portions of text, sent by customers, to be translated.

At a high level, the main operations of Tarkin are:

• Assign tasks for translation to editors;

• Receive translated tasks from the editors;

Every time an editor issues a request to receive a task for translation, the task manager sends them

the best suited task according to their profile.

In order to avoid escalating the costs of translation by assigning the same task to several editors, it is necessary to ensure that only one editor is capable of accessing and editing a particular task at a time. However, it is possible that a task that has already been processed returns to the task queue to be edited once again.

The problem comes with the increase in the number of requests performed by editors and in the number of write requests (insertions of new tasks into the system) performed by customers. This not only makes the system extremely concurrent, which may violate the imposed condition that only one editor can access a task at a time, but it also significantly degrades the response time of the system, since the system needs to deal with both read and write workloads simultaneously. Another factor that contributes to the degradation of the response time is the need to calculate the number of available tasks for an editor that is requesting new tasks.

In more detail, our system will process four kinds of requests, with different requirements:

1. Get the number of available tasks;

2. Get a task;

3. Submit a translated task;

4. Create a task;

All of these requests require good performance, in particular low latency so that the user does not

experience a slowdown when accessing the service. However, they differ in their needs of consistency.

While requests 1 and 3 only require eventual consistency, requests 2 and 4 demand strong consistency,

namely that reads return the most recently written value.

The whole process of this use case is illustrated in Fig.3.2.

3.2 Adapting the CQRS pattern

As previously introduced, in order to achieve this enhanced global storage system, we will be exploring

the Command Query Responsibility Segregation [8] pattern. At a high level, this pattern states that

you can differentiate the models that update and read information. A possible way to implement this is

that, depending on whether the request is a read-only request or an update, the system will access two

different storage systems.


Figure 3.2: Use case

However, in contrast to the original philosophy, the distinction will not only be made between read-only

and update requests, but instead we will leverage the different levels of consistency between different

classes of requests. This will enable us to take advantage of the benefits of this pattern in a more generic way, one that applies to a broad range of settings where the original pattern would not be applicable. For instance, two different classes of get requests with different consistency requirements can be segregated in our case, whereas they could not be under the original pattern.

Throughout the document, requests that require strong consistency are handled by the command side

of the pattern (write layer) and read requests with weaker consistency demands are processed by the

query side of the pattern (read layer).

3.3 Architecture

Fig.3.3 portrays the components that constitute our solution as well as the interactions that are performed

between those elements. After the arrival of a request to the API layer, the best suited storage system is

selected according to the requirements of that particular request. The components and the overall flow

of our system are explained in more detail next, in Section 3.3.1 and Section 3.3.2 respectively.

3.3.1 Layers

API

The API is the entry point of our system. This component is responsible for processing requests. It also

decides which storage system is more appropriate to help process the incoming request, taking into

account its characteristics. This routing is explained in detail in Section 3.5.

Read Layer

The Read Layer, or the query side of the CQRS [8] pattern, contains the database that is accessed every

time a request to retrieve data back to the user arrives. Since these requests are read-only, we have to


Figure 3.3: Architecture

take into consideration that the storage system that will be used has to perform well when dealing with this type of operation. This layer processes requests that do not require strong consistency.

Write Layer

The Write Layer, or the command side of the CQRS [8] pattern, contains the database responsible for

dealing with the requests that require the insertion, modification or deletion of data from the system.

When selecting which storage system to use in this layer, we need to consider that these requests

are write intensive. This layer is also responsible for processing get requests with strong consistency

requirements.


Replication Layer

The Replication Layer is the pivotal point of our solution. Its main responsibility is to replicate the data

from the command side to the query side, as fast as possible, thus minimizing the window of time during which the system may return inconsistent data.

3.3.2 Data flow

In order to achieve this enhanced global storage system, the components, presented in the previous

section, need to work together to successfully process requests. To better understand how the data,

depicted on Fig.3.3, flows between the components, we summarize the interactions that occur during

the life-cycle of a request:

Order | From | To | Operation
1 | Outside World | API Layer | Sends a GET, POST, PUT or DELETE request
2 | API Layer | Read Layer | Sends IDs
3 | API Layer | Write Layer | Sends data
4 | Write Layer | Replication Layer | Sends data
5 | Replication Layer | Read Layer | Replicates the data introduced in flow 4
6 | Read Layer | API Layer | Retrieves the value of flow 3
7 | Write Layer | API Layer | Retrieves the result of flow 1
8 | API Layer | Outside World | Retrieves the result of flow 1

Table 3.1: Data flow

3.4 Data structures

Taking into consideration that scalability and performance are a must and that data structures play an

important role in fulfilling these requirements, in the following sections, we describe the data structures

employed.

3.4.1 Databases

Since our system is a hybrid approach to the CQRS [8] pattern, we opted to pick two different databases according to the workloads that they were going to handle, hence leveraging their strengths by routing to each storage system the workloads that suit it best.

The storage system of the write layer will maintain all the information of the system, which in our use case consists of the editors' profiles, the tasks and the relation between editors and tasks. This information is needed to serve the POST, PUT and DELETE requests of the system. Meanwhile, the database of the read layer only stores the number of available tasks of each editor, which is used to serve the GET requests related to the available tasks count.


3.4.2 Replicator queue

The replication process is one of the most crucial moments of the overall flow of the system. With this in

mind, we had to be careful when designing the data structure that will aid this process. To prevent it from becoming a bottleneck, and so that we can easily scale when facing huge workloads, we decided to create a queue whose elements store the information required to process each replication task. For each task, we store the following information (a sketch of one such element is shown after the list):

– The name of the function that will process the replication

– The arguments of the function, which contains the data that is going to be stored in the query side

of the system
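A minimal sketch of what one element of this queue could look like, assuming it is represented as a small Python object; the field values, and the reference to the replicate_available_tasks_count function (Fig. 4.6), are only illustrative.

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ReplicationTask:
    function_name: str         # name of the function that will process the replication
    arguments: Dict[str, Any]  # data to be stored in the query (read) side

queue = []
queue.append(ReplicationTask(
    function_name="replicate_available_tasks_count",
    arguments={"editor_id": 44, "counts": {"map_reduce_paid": {"en_es": 14}}},
))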

3.5 Algorithms

To provide an efficient solution, we rely on three main algorithms. The first algorithm operates similarly to a load balancer, routing each request according to its characteristics. The second ensures that, as soon as modifications on the stored data are detected, they are sent to the replication layer through a publish-subscribe system. Lastly, the third algorithm guarantees that the modifications made to the system are propagated to the read layer.

3.5.1 Routing

The goal of the routing algorithm is to select which of the two storage systems, available in the read or write layer, should process the request. Basically, if the request needs to insert, update or delete information from the system, then the API will access the write layer; otherwise, if the request needs to fetch information, then the read layer will be accessed. However, if a get request requires consistent data, that particular request will be routed to the write layer.
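A condensed sketch of this routing rule is shown below; the layer names and the consistency flag are illustrative, not the actual implementation.

def route(method, needs_strong_consistency):
    # Decide which layer should serve the request.
    if method in ("POST", "PUT", "DELETE"):
        return "write layer"          # insertions, updates and deletions
    if needs_strong_consistency:
        return "write layer"          # consistent reads bypass the read layer
    return "read layer"               # eventually consistent reads

assert route("GET", needs_strong_consistency=False) == "read layer"   # available tasks count
assert route("GET", needs_strong_consistency=True) == "write layer"   # task search
assert route("POST", needs_strong_consistency=True) == "write layer"  # task creation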

3.5.2 Replication

Every time that the system receives requests that perform modifications on the stored data, these mod-

ifications have to be replicated from the write layer to the read layer in order to keep the system as

consistent as possible. For this purpose, we have two algorithms:

• The first algorithm relies on a publish-subscribe system, which notifies the replication layer as soon as modifications on the data stored in the write layer are detected;

• In the second, the replication layer makes use of a scheduler that pulls information from the write layer in order to keep the read layer consistent. This operation happens every x minutes, where x is configured a priori;


Summary

This chapter presented the idea of creating an enhanced global storage system through the usage of different storage systems to serve requests according to their characteristics. This idea was inspired by the CQRS [8] pattern. However, in contrast to the original pattern, we differentiate requests not only according to their type but also to their consistency levels. The presented architecture adds the benefit of being able

to independently scale each layer according to the system’s needs. In the next chapter we continue

describing the solution by presenting the implementation details.


Chapter 4

Implementation

In Chapter 3, we described our solution at a higher level. In this chapter, we will take a closer look at the

particularities of each component of the system. We start, in Section 4.1, by explaining the interaction

between components, followed by the nuances of each component in Section 4.2. We end the chapter

with the description of the software architecture in Section 4.3.

4.1 System interoperability

As previously documented, our architecture is composed of four layers. In order to ensure that these layers are able to work as desired, it is necessary to transfer information from layer to layer. We will describe the interactions between those layers, including the information that they share.

Starting with the API and the outside world, these two elements communicate through Representational State Transfer [31] (REST) endpoints, listed in Fig. 4.1.

Every time new tasks arrive, they do so through the /api/v1/task endpoint, an HTTP POST request that receives, for instance, the language pair of the task. The language pair is used to track the

original language and the desired language of the task’s output.

The /api/v1/editor/{editor_id}/available_tasks_count endpoint is an HTTP GET request that, given an editor_id, returns the number of tasks that are available to be translated by that particular editor.

In order to retrieve a task for translation from the system, the /api/v1/task/search endpoint is used.

This endpoint is an HTTP GET request that contains the editor, the language pair and the type of the task as the body of the request.

Figure 4.1: API endpoints

Finally, the last endpoint, /api/v1/task/submit is an HTTP POST request that succeeds a /api/v1/task/search

request, which returns to the system the translation of the task previously fetched. This request has in

its body the id of the task, the id of the editor that performed the translation and the translated data.

Regarding the interaction between API layer and the write layer, the communication is achieved

through calls to the database that composes the write layer. When the API receives a request to store

information on the write layer, the data of that particular request is forwarded to the write layer. As

explained in Section 3.5.1, get requests with tight consistency requirements are also addressed by the

write layer, which is the case of the /api/v1/task/search endpoint.

The communication between the API layer and the read layer happens in the same manner as the previous one. Every time a get request that does not require strong consistency arrives, for instance,

when an editor wants to know how many tasks are available for him, the API layer sends the id of the

editor to the read layer and the read layer returns the number of available tasks.

To finalize, we need to address how the replication layer interacts with both the write and read layers.

To ensure that the data of our solution is as consistent as possible, the write layer needs to replicate the

new data to the read layer, which is the purpose of the replication layer. This process is accomplished

through the usage of a publish-subscribe system and a pull mechanism, described in Section 4.2.2 and

Section 4.2.4. These interactions are performed by accessing both layer’s databases directly.

4.2 Components

In this section we give more insight on the decision making and implementation of the elements that

constitute our solution.

4.2.1 API Layer

As described in the previous section, the API layer receives requests through REST endpoints. These

requests are processed with the help of uWSGI1, an application server implementing the Web Server Gateway Interface, and Flask2,

which is a microframework for Python with RESTful request dispatching capabilities.

In this layer there are four endpoints. Three of them already existed in the previous version of Tarkin, namely /api/v1/task/search, /api/v1/task/submit and /api/v1/task, while the fourth is an extension made by us.

Fig. 4.2 represents the fourth endpoint, i.e., the class responsible for processing the available_tasks_count endpoint described in the previous chapter. Since this is a request that can cope with eventually consistent results, the API queries the Redis database, integrated in the read layer, for the number of available tasks for the given editor. The endpoint receives as argument the name of the editor and retrieves its corresponding value, as can be seen in line 79.
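The following is a minimal Flask sketch of such an endpoint, assuming the per-editor counts are kept in Redis as a JSON-encoded value; the key layout and variable names are our assumptions and not the exact code shown in Fig. 4.2.

import json

import redis
from flask import Flask, jsonify

app = Flask(__name__)
read_store = redis.Redis(host="localhost", port=6379, db=0)  # read layer

@app.route("/api/v1/editor/<editor>/available_tasks_count", methods=["GET"])
def available_tasks_count(editor):
    # Read-only, eventually consistent request: answered entirely from the read layer.
    raw = read_store.get(f"available_tasks_count:{editor}")
    return jsonify(json.loads(raw) if raw else {})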

The remaining endpoints are processed as follows:

1 https://uwsgi-docs.readthedocs.io/en/latest/
2 http://flask.pocoo.org/


Figure 4.2: Available tasks count endpoint class

• /api/v1/task/search executes a stored procedure, in the write layer database, that makes sure that one task is only assigned to one editor at a given time, with the help of locks and SQL transactions;

• /api/v1/task/submit inserts into the write layer database a translated task;

• /api/v1/task inserts into the write layer database new tasks to be translated;

4.2.2 Write Layer

In the write layer we opted to use PostgreSQL [32] mainly because we are extending Tarkin, which is built

on top of PostgreSQL, and also to take advantage of a couple of features that aid the transformation and replication process described in Section 4.2.4, such as triggers and notifications.

As previously seen, every time a new request with storage intents enters the system, the API layer

accesses this layer’s database and performs the duly operation. After a modification is made, we need

to replicate the new data to the read side in order to keep the data as consistent as possible. This

is accomplished through the usage of two different processes: a publish-subscribe system and a pull

mechanism, thoroughly explained in Section 4.2.4.

The publish-subscribe system is constructed on top of the two stored procedures depicted in Fig. 4.3 and Fig. 4.4. Starting with Fig. 4.3, in line 299, we create a trigger that will execute the stored procedure notify_trigger every time an INSERT, UPDATE or DELETE operation is detected on the table editor_task. This table contains the association of tasks to editors, thus affecting the number of available tasks per editor; hence the importance of sending this data as soon as possible to the replication layer. In line 284, the stored procedure iterates through the result of the get_available_tasks_count function, described below, for the editor associated with the operation that modified the editor_task table. Finally, in line 289, the flow of the trigger ends with the propagation of each iteration's result through the publish-subscribe channel new_editor_task_trigger. The results are later processed by the replicator, which finishes the replication by transforming the received data and forwarding it to the read layer.

Fig. 4.4 contains the code of the get_available_tasks_count stored procedure, responsible for calculating the number of available tasks per language pair and task type for the given editor, according to its profile.

Figure 4.3: Stored procedure: notify_trigger

Figure 4.4: Stored procedure: get_available_tasks_count

These interactions are performed with the help of SQLAlchemy3, an object-relational mapper for Python.

3 https://www.sqlalchemy.org/

4.2.3 Read Layer

To store the information necessary to provide the results when GET requests arrive at the system, we opted to use Redis [13] due to its high performance when dealing with read requests. As an example, the data accessed by the available_tasks_count endpoint is stored in the following format:

data = {'map_reduce_paid': {'en_es': 14,
                            'es_en': 0}}

For each task type, it provides the number of available tasks per language pair of a given editor.
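One plausible way of keeping this structure in Redis is a JSON-encoded value per editor, as sketched below; the actual key layout used in the implementation is not prescribed here.

import json

import redis

read_store = redis.Redis(host="localhost", port=6379, db=0)

data = {"map_reduce_paid": {"en_es": 14, "es_en": 0}}
read_store.set("available_tasks_count:AS", json.dumps(data))        # written by the replicator
counts = json.loads(read_store.get("available_tasks_count:AS"))     # read by the API layer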

4.2.4 Replication Layer

The replication layer is the piece of the puzzle that ensures that the system returns data to the outside world that is as consistent as possible when facing get requests with more relaxed requirements. Its main responsibility is to transform the data provided by the write layer into a more suitable format that will then be replicated to the read layer. To this purpose, it makes use of two processes, as seen in Section 4.2.2.

The publish-subscribe flow, which began in the write layer, ends with the replicator listening to the new_editor_task_trigger channel, in line 21 of Fig. 4.5, and transforming and forwarding the data to the read layer.
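A minimal sketch of this listening loop, using psycopg2's LISTEN/NOTIFY support, is shown below. The connection string, the assumption that the notification payload is JSON, and the hand-off at the end are ours; the actual pg_listen function is the one shown in Fig. 4.5.

import json
import select

import psycopg2

conn = psycopg2.connect("dbname=tarkin")   # assumed connection string
conn.autocommit = True
cur = conn.cursor()
cur.execute("LISTEN new_editor_task_trigger;")

while True:
    # Block until PostgreSQL signals activity on the channel (60 s timeout).
    if select.select([conn], [], [], 60) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notification = conn.notifies.pop(0)
        payload = json.loads(notification.payload)  # assumes a JSON payload
        # Here the data would be transformed and pushed to the read layer
        # (cf. replicate_available_tasks_count in Fig. 4.6).
        print(payload)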

The second process used is a pull mechanism. As Fig. 4.7 portrays, every x minutes, where x is defined in the configuration of the system, the replicator fetches from the write layer the number of available tasks for each editor, with the help of the function illustrated in Fig. 4.8.
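The sketch below shows how such a periodic pull could be registered with Celery beat; the interval, broker URL and task body are illustrative and not the exact code of Fig. 4.7.

from celery import Celery

app = Celery("replicator", broker="amqp://guest@localhost//")  # assumed broker URL

PULL_INTERVAL_MINUTES = 5  # the "x minutes" defined in the system configuration

@app.task
def pull_from_write_layer():
    # Query the write layer for every editor's available-task counts and push
    # the transformed result to the read layer (cf. Figs. 4.6 and 4.8).
    ...

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Re-synchronize the read layer from the write layer at a fixed interval.
    sender.add_periodic_task(PULL_INTERVAL_MINUTES * 60.0, pull_from_write_layer.s())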

In both processes, the transformation and forwarding of the data is achieved using the function depicted in Fig. 4.6. In lines 36 to 51, we transform the received data (the number of available tasks per language pair and task type of one editor) from the following format:

data = [{'editor_id': 44, 'language_pair': 'en_es', 'task_type': 'map_reduce_paid',
         'devices': None, 'editor_name': 'AS', 'count': 14},
        {'editor_id': 44, 'language_pair': 'es_en', 'task_type': 'map_reduce_paid',
         'devices': None, 'editor_name': 'AS', 'count': 0}]

to:

data = {'map_reduce_paid': {'en_es': 14,
                            'es_en': 0}}

After the transformation, in line 53, the data is then forwarded and stored in the read layer.
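The transformation itself boils down to grouping the rows of one editor by task type, as in the sketch below; the helper name is ours and the code is not the exact content of Fig. 4.6.

import json
from collections import defaultdict

def transform(rows):
    # Group the flat rows of one editor into {task_type: {language_pair: count}}.
    result = defaultdict(dict)
    for row in rows:
        result[row["task_type"]][row["language_pair"]] = row["count"]
    return dict(result)

rows = [
    {"editor_id": 44, "language_pair": "en_es", "task_type": "map_reduce_paid",
     "devices": None, "editor_name": "AS", "count": 14},
    {"editor_id": 44, "language_pair": "es_en", "task_type": "map_reduce_paid",
     "devices": None, "editor_name": "AS", "count": 0},
]
assert transform(rows) == {"map_reduce_paid": {"en_es": 14, "es_en": 0}}
payload = json.dumps(transform(rows))  # serialized and stored in the read layer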

The functions that have an impact on the transformation and replication of the data, with the exception of the function pg_listen (represented in Fig. 4.5), are asynchronous tasks processed with the help of Celery4, a distributed task queue, allowing the system to scale more easily.

Figure 4.5: Function: pg_listen

4http://www.celeryproject.org/


Figure 4.6: Function: replicate_available_tasks_count

Figure 4.7: Function: setup_periodic_tasks

Figure 4.8: Function: get_editor_id


4.3 Software Architecture

The software architecture of our solution is illustrated in the figures below. Fig. 4.9 represents the schema of the data stored in the read layer. Fig. 4.10 refers to the schema of the data stored in the write layer, which is also the data structure in use today at Unbabel.

Figure 4.9: Redis database schema

Figure 4.10: PostgreSQL database schema

5Unbabel’s Tarkin ER diagram


Summary

Throughout this chapter we described the implementation details of our architecture. We started by

describing the inter-communication that occurs between components, followed by the nuances of each

component. We finished the chapter by presenting an overall description of the software architecture

of our solution.

The next chapter presents the evaluation results of our solution against the current version of Tarkin.


Chapter 5

Evaluation

In this chapter we present the results of the tests that were executed to evaluate the proposed solution

against the Tarkin system that is currently being used at Unbabel. We start, in Section 5.1, by specifying

which tests were run, followed by the results in Section 5.2.

5.1 Setup

The evaluation was conducted locally, using Docker1 as the host of the components and JMeter2 as

the tool that executes the load tests. We opted to use JMeter due to its wide adoption and its ability to simulate real-user behavior when testing applications under heavy load and multiple concurrent user traffic.

In order to have the system up and running, Docker will launch the following containers: PostgreSQL,

Redis, Tarkin, RabbitMQ3, Scheduler, Replicator, Replicator Worker and Tarkin Worker.

Each container belongs to one of the four layers of our solution’s architecture. Below we map each

container to the respective layer:

• PostgreSQL → Write Layer

• Redis → Read Layer

• Tarkin → API Layer

• RabbitMQ → Replication Layer

• Scheduler → API Layer

• Replicator → Replication Layer

• Replicator Worker → Replication Layer

• Tarkin Worker → API Layer

1 https://www.docker.com/
2 https://jmeter.apache.org/
3 https://www.rabbitmq.com/


The PostgreSQL, Scheduler and the Tarkin Worker are containers that belong to the original Tarkin

and are used by our solution as well.

The containers are powered by an Intel i7-6660U CPU @ 2.4 GHz, 16 GB of 1867 MHz LPDDR3 memory and a 256 GB SSD, running under macOS High Sierra version 10.13.6.

Our goal was to compare both solutions in the most realistic environment possible. Hence, we

populated the PostgreSQL database using Unbabel’s staging environment database dump.

5.1.1 Workload

Tarkin was chosen as our use case not only because it is one of the most important subsystems of Unbabel, due to its responsibility of linking machine and human translation, but also because it is constantly under high load pressure, which makes it a good use case for our solution. Having this in mind, we decided that the best way to test Tarkin was to simulate high load pressure by constantly sending concurrent requests through JMeter.

5.1.2 Metrics

To evaluate our solution, we decided to use the following metrics:

• Read Layer Latency

• Write Layer Latency

• Throughput

These metrics were selected taking into consideration the characteristics of the use case. Since

Tarkin is a system under constant heavy load, whether from the assignment of tasks to editors or from

the constant editor requests regarding their status, we found it would be good to compare the trace of both systems when dealing with this kind of request. In addition, we would like to showcase the benefits of having requests handled by the storage systems that suit them best.

5.1.3 Description

Each one of the tests was performed over one hour. Since our goal was to test our solution under intensive pressure, each test was carried out using six concurrent users.

For the read layer latency test, we executed two tests. The first is meant to compare both systems

regarding the response time of read requests, taking into account that, with our solution, read requests with weaker consistency requirements are now handled by a storage system whose main strength is read performance. The second test compares the benefits of segregating requests when dealing

with them simultaneously, which is the case of a real workload.

Regarding the write layer latency, we only conducted one test. Since the requests handled by the write layer of our solution are the same as in the original system (they are merely routed differently from the read requests that do not require strong consistency), there would be no point in testing how the storage system performs in terms of write response time by itself. Thus, we decided to test this latency when the system is being pressured with both types of requests.

Finally, the throughput section will illustrate the throughput results of the three tests that we just

mentioned above.

5.2 Results

In the previous section we presented the nuances of each test. In this section, we analyze the results of

those tests.

5.2.1 Read Layer Latency

As discussed in Section 5.1.3, the first test refers to how both systems perform when processing get

requests. In order to test this, we will put the /api/v1/editor/{editor_id}/available_tasks_count endpoint under pressure.

Figure 5.1: Available tasks count response time over time

Fig. 5.1 depicts the results of this test. As we can see, our solution, represented by the endpoint named /api/v1/editor/{editor_id}/available_tasks_count (NEW), outperforms the old version of this endpoint. While the old system averaged read response times of approximately 39 milliseconds, our solution was able to keep this value at the 12 ms mark, mostly as a result of being capable of differentiating requests and selecting the most appropriate storage system to process them.


Moreover, we also improved the 90th, 95th and 99th percentile latencies by factors of 3.4, 3.73 and 6.85, respectively.

Table 5.1 summarizes the results of this test.

Endpoint | # Samples | Average (ms) | Min (ms) | Max (ms) | 90th pct (ms) | 95th pct (ms) | 99th pct (ms)
/available_tasks_count (NEW) | 242790 | 11.95 | 3 | 13515 | 17.00 | 19.00 | 24.00
/available_tasks_count (OLD) | 197169 | 39.49 | 4 | 14974 | 58.00 | 71.00 | 165.00

Table 5.1: Available tasks count response times summary

The second test presents the results of processing the available_tasks_count endpoint while under high load pressure from the remaining endpoints. The result of this test is illustrated in Fig. 5.2.

Figure 5.2: Available tasks count response time over time under high load pressure

Table 5.2 details the results of this test.

Endpoints | # Samples | Average (ms) | Min (ms) | Max (ms) | 90th pct (ms) | 95th pct (ms) | 99th pct (ms)
/available_tasks_count (NEW) | 10412 | 200.71 | 3 | 62053 | 644.70 | 890.00 | 1307.09
/available_tasks_count (OLD) | 9716 | 209.09 | 7 | 60620 | 658.30 | 909.00 | 1328.49

Table 5.2: Available tasks count under pressure summary

5.2.2 Write Layer Latency

As previously explained, since we are using the same endpoints from the original Tarkin for the write layer, there would not be any value in testing each request individually. So, we performed a test to figure out if our solution would improve the write layer when the system is under high load pressure from both kinds of requests (the ones handled by the read layer and by the write layer).


Naturally, since we are running the tests locally, the results are not accurate and may be misleading

due to the fact that every container is using the same resources. Nevertheless, we thought it would be

interesting to see how both systems perform in the same environment.

We start with the /api/v1/task endpoint. Fig. 5.3 details the response times of this request.

Figure 5.3: Create tasks response time over time

Fig. 5.4 illustrates the response times over time of the /api/v1/task/search endpoint. In this case, the results were very similar. However, on average, our solution managed to outperform the old Tarkin, as we can see in Table 5.3.

Regarding the last endpoint, /api/v1/task/submit, Fig. 5.5 clearly shows that, most of the time, our solution achieves better response times.

To finalize, in Table 5.3, we have a summary of the endpoints, comparing the old version of Tarkin

against our solution.

As we could see throughout this test, our solution is able to slightly improve response times during

load peaks, mainly by virtue of its architecture design. By routing requests through different storage

systems, we are able to spread the incoming load, which allows each storage system to work under less pressure and thus produce better results. However, we only managed to outperform the old version of Tarkin, under high load pressure from all kinds of requests, by a very fine margin, which we believe is related to the fact that the tests were deployed and run on the same machine.

5.2.3 Throughput

In this section we compare the results of the throughput for each endpoint of both systems.


Figure 5.4: Get task response time over time

Figure 5.5: Submit task response time over time


Endpoints | # Samples | Average (ms) | Min (ms) | Max (ms) | 90th pct (ms) | 95th pct (ms) | 99th pct (ms)
/task (NEW) | 1956 | 1336.82 | 14 | 61475 | 68.00 | 103.00 | 59626.00
/task (OLD) | 2000 | 1318.98 | 14 | 60448 | 59.00 | 119.00 | 59553.97
/task/search (NEW) | 10406 | 1349.63 | 19 | 62669 | 2706.30 | 4920.90 | 10810.67
/task/search (OLD) | 9714 | 1488.47 | 18 | 60622 | 3095.00 | 5341.25 | 11292.10
/task/submit (NEW) | 14 | 677.07 | 199 | 1417 | 1399.50 | 1417.00 | 1417.00
/task/submit (OLD) | 13 | 722.92 | 200 | 1609 | 1589.00 | 1609.00 | 1609.00

Table 5.3: Write layer response times summary

We start with the available_tasks_count endpoint. Fig. 5.6 illustrates the read throughput of this endpoint compared to the corresponding endpoint of the original Tarkin. As we can quickly see, there is a considerable improvement in the number of transactions per second that the new system is capable of processing. On average, our solution can process approximately 67 transactions per second, while the original Tarkin can only deal with 55 transactions per second.

Figure 5.6: Available tasks count transactions per second

Table 5.4 summarizes the throughput results for this test (the first test of the read layer latency

section).

Endpoints | # Samples | Throughput (transactions/s)
/available_tasks_count (NEW) | 242790 | 67.44
/available_tasks_count (OLD) | 197169 | 54.77

Table 5.4: Available tasks count throughput summary

To finalize, since the remaining tests (the second one of the read layer latency section and the one performed in the write layer latency section) were conducted with the endpoints under high load pressure from each other, we decided to group their results in Table 5.5.

Endpoints | # Samples | Throughput (transactions/s)
/available_tasks_count (NEW) | 10412 | 2.89
/available_tasks_count (OLD) | 9716 | 2.70
/task (NEW) | 1956 | 0.54
/task (OLD) | 2000 | 0.55
/task/search (NEW) | 10406 | 2.89
/task/search (OLD) | 9714 | 2.70
/task/submit (NEW) | 14 | 0.01
/task/submit (OLD) | 13 | 0.00

Table 5.5: Throughput summary under high load pressure

Summary

In order to perform a proper comparison between our solution and the old version of Tarkin, we conducted these experiments using the same environment. The chapter started by detailing the evaluation setup, followed by a description of its results. Summing up, we were able to improve the overall read response times while maintaining the same times for the remaining requests.

The next chapter concludes the document by summarizing our work.


Chapter 6

Conclusions

Although there are different solutions to deal with the rapid increase in the amount of data that companies face nowadays, these solutions are sometimes suboptimal for emerging companies that lack resources.

In this document, we started by analyzing the current solutions in order to identify and elaborate a

solution to our problem. Due to the lack of work regarding the CQRS pattern, we decided to explore

this idea. We used the knowledge gained from our research, for instance, which storage systems are

better for certain types of requests as well as the idea of combining different storage systems, and built

a solution inspired by this software pattern.

The results of the evaluation demonstrate that our solution is capable of processing requests and evolving along with data growth, while even averaging better response times. This is mostly due to its ability to differentiate requests and to optimize their handling by selecting the best storage system according to their characteristics. Furthermore, with this architecture, we are able to offload the system by having different storage systems processing requests instead of pressuring only one.

There is also the opportunity to understand how our system could help the infrastructure achieve more stable numbers in terms of CPU and memory utilization; however, these are not tests that we were able to perform, as explained in the previous chapter.

6.1 Future Work

This thesis has focused on improving the performance of Tarkin regarding the throughput and latency of the different requests. However, due to the architecture of our solution, one of its limitations, which was not addressed in Chapter 5, is the possibility of delivering stale data to the user. Therefore, it would be interesting to evaluate the percentage of requests that deliver inconsistent data, as well as the impact on this percentage of configuring different intervals for the pull mechanism of the replication layer.

Unfortunately, as a result of testing our solution locally, we were also not able to evaluate the potential benefit of segregating the load between the different layers in terms of the CPU and memory utilization inherent to each component, leaving it as future work.


In addition, taking into account the distributed architecture of our solution, it would be worth exploring the following ideas:

• Exploit the ability to scale each component independently according to the system’s workload;

• Exploit the benefits of using different storage systems in both the read and write layers.
