Flexible Large-Scale Data Storage
António José Freitas Pinheiro
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisor(s): Prof. Rodrigo Seromenho Miragaia Rodrigues, Dr. Marcelo Lebre
Examination Committee
Chairperson: Prof. Miguel Pupo Correia
Supervisor: Prof. Rodrigo Seromenho Miragaia Rodrigues
Member of the Committee: Prof. João Coelho Garcia
November 2018
Acknowledgments
First and foremost, I would like to convey a special appreciation to my thesis advisor Rodrigo Rodrigues,
for the dedication and endless support. My sincere gratitude also goes to my co-advisor Marcelo Lebre,
for all the enthusiasm, motivation and wise suggestions throughout the last year.
A very special gratitude goes out to all down at Unbabel, for the support and friendship. With a
special note to Eduardo Pereira and Emanuel Velho, for the insightful comments during this journey.
A special thanks to Sergio Mendes, Steven Silva, Rui Venancio and Paulo Figueira for the construc-
tive criticism on the manuscript of this dissertation.
Finally, I must express my profound gratitude to my family and friends, for their wise counsel and
guidance. It was their support and encouragement that got me through these demanding years.
Resumo
Com o aumento da quantidade de informação processada, as empresas são forçadas a aumentar as
suas infraestruturas de modo a continuarem a prestar um serviço de qualidade aos seus clientes. Para
além disso, diferentes classes de pedidos têm diferentes requisitos de performance e consistência, o
que cria tensão entre a simplicidade de utilizar o mesmo sistema para lidar com todos os pedidos e a
capacidade de diferenciar como estes pedidos são processados. No entanto, empresas que não têm a
dimensão de grandes entidades sentem dificuldades em garantir que a qualidade do serviço prestado
não seja afetada.
Neste trabalho, foi feita uma análise meticulosa em relação aos diferentes tipos de soluções capazes
de lidar com o aumento da quantidade de informação, para projetar uma solução tão capaz quanto
aquelas, mas a um custo reduzido. Após esta análise, esta dissertação propõe uma abordagem híbrida,
inspirada no padrão de software Command Query Responsibility Segregation. No seu estado mais puro,
este padrão segrega os pedidos de acordo com o seu propósito, leitura ou escrita de dados. No entanto,
nós iremos mais longe e teremos em conta os diferentes níveis de consistência de cada pedido.
Para testar a nossa solução, colaborámos com a Unbabel, uma startup que fornece tradução como
um serviço, implementando-a num módulo específico do seu sistema. Os resultados desta avaliação
demonstram que a nossa solução é capaz de melhorar o tempo de resposta dos pedidos de leitura que
possuem requisitos de consistência mais relaxados, atingindo, em média, um speedup de 3.30.
Palavras-Chave: CQRS, Armazenamento em larga escala, Disponibilidade, Consistência
Abstract
The ever-increasing amount of data that companies have to process forces their infrastructure to grow to
continue serving their clients' demands. Furthermore, different classes of storage requests have different
requirements in terms of performance and consistency, and this leads to a tension between the simplicity
of using the same system for handling all requests and the ability to differentiate how these requests are
handled. However, companies that lack the scale of major Internet players find it difficult to address
this problem without reducing the quality of the service provided to the client.
In this work, we make a thorough analysis of the state of the art regarding the different types of
solutions capable of dealing with request differentiation in large-scale storage, in order to design a
solution as capable as those but at a severely reduced cost. After this analysis, the thesis proposes
a hybrid approach inspired by the software pattern called Command Query Responsibility Segregation.
At its core, this pattern segregates requests according to their functionality, i.e., whether they read or
update data. However, in our solution, we go further and also include the different levels of consistency
in this differentiation.
In order to test our solution, we collaborated with Unbabel, a startup that offers translation as a
service, by implementing it on a specific module of their system. Our solution managed to improve the
read response time of requests with more relaxed consistency requirements, achieving an average speedup
of 3.30 while maintaining the same performance for the remaining classes of requests.
Keywords: CQRS, Large scale storage, Availability, Consistency
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Current Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
2.1 Large-scale storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Combining different storage systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Quality of service in storage systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Proposed Solution 21
3.1 Requirements: the Unbabel use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Adapting the CQRS pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.1 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.2 Replicator queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.1 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Implementation 29
4.1 System interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 API Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.2 Write Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Read Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.4 Replication Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Evaluation 39
5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.1 Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.3 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.1 Read Layer Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.2 Write Layer Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.3 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Conclusions 47
6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Bibliography 49
List of Tables
2.1 Storage systems comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Cloud storage systems comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1 Available tasks count response times summary . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Available tasks count under pressure summary . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Write layer response times summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4 Available tasks count throughput summary . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.5 Throughput summary under high load pressure . . . . . . . . . . . . . . . . . . . . . . . . 46
List of Figures
2.1 Conductor’s system overview [27] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Unbabel’s Internal Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 API endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Available tasks count endpoint class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Stored procedure: notify trigger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Stored procedure: get available tasks count . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Function: pg listen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Function: replicate available tasks count . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7 Function: setup periodic tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8 Function: get editor id . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.9 Redis database schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.10 PostgreSQL database schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 Available tasks count response time over time . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Available tasks count response time over time under high load pressure . . . . . . . . . . 42
5.3 Create tasks response time over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 Get task response time over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Submit task response time over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 Available tasks count transactions per second . . . . . . . . . . . . . . . . . . . . . . . . . 45
Nomenclature
API Application Programming Interface
CQRS Command Query Responsibility Segregation
Chapter 1
Introduction
Nowadays, startups that process huge amounts of data, but lack the tremendous scale and economic
power of companies like Google1 or Facebook2, find it difficult to store and process that information
without affecting the quality of service provided to the user.
Typically, these startups use open-source storage systems, for example MongoDB [1] or Cassandra
[2], which make them capable of coping with the continuous growth of data, for instance through the
use of sharding techniques. So far this methodology has worked relatively well. However, for
simplicity, the same storage system is often used for dealing with different data types as well as for
different kinds of requests.
Although MongoDB and Cassandra scale well in terms of the amount of data stored, they do not have
a way of differentiating the data types and request types that the system has to deal with. Without this
differentiation, the load pressure of read requests will affect the response time of write requests and
vice versa.
In some cases, such as Unbabel3, the read load is higher than the write load. Hence, differentiating
request types would allow them to tackle that problem by, for instance, scaling only the side under the
most load pressure instead of their entire infrastructure, which naturally reduces the cost associated
with this procedure.
We aim to bring together the best of both worlds: using open-source storage systems, thus benefiting
from the work done by the open-source community, while introducing the concept of request and data
type differentiation, where requests are segregated into two different flows according to their requirements
and type.
1.1 Current Solutions
Currently, there are open-source scalable storage systems capable of dealing with the large amounts
of data that some enterprises face, such as MongoDB [1], Cassandra [2] or Riak KV [3].
1 https://www.google.com/
2 https://www.facebook.com/
3 https://unbabel.com/
However, they do not offer data and request differentiation, which is useful to improve the quality of
service for different types of workloads.
On the other hand, there are systems that combine storage systems with different characteristics
to build a global storage system better than each one of them individually, for instance RACS [4] or
HyRD [5]. However, these systems have objectives that differ from ours, such as avoiding vendor
lock-in.
Lastly, there are storage systems that focus on quality of service requirements. However, such sys-
tems, like Hippodrome [6] or Ergastulum [7], are often old prototypes that lack the widespread adoption
or constant enhancement of a solution like MongoDB or Cassandra.
1.2 Challenges
Having the ability to deal with large amounts of data, as well as coping with the sudden workload spikes
inherent to almost every large enterprise on the Internet today, has become more difficult over time due
to the pace at which information is growing.
A possible solution here would be to change the storage system itself to incorporate a quality of
service concept. This would be within the reach of big companies such as Google or Facebook, but for
a startup, which needs to focus on the rapid growth of its business, it does not make sense to develop
in-house large-scale storage.
Even adopting a storage system capable of coping with the workload demand, instead of developing
in-house storage, would be troublesome due to the amount of effort and time that refactoring the existing
code would consume, hindering the productivity and growth of startups, which is crucial in their early
stages.
Another important factor is the ability to scale over time. Although current solutions manage to deal
effectively with the increase of processed data, it is of great importance to be able to easily adapt the
system in case the problem emerges again in the future.
1.3 Objectives
The goal is to have the capacity to evolve along with the data growth without compromising the response
times of the system. Moreover, we should be able to employ this solution with the least amount of
refactoring possible.
The main idea behind the solution is to combine storage systems with different characteristics
to create an enhanced global storage system capable of handling each request better than a single
storage system would. With this in mind, we explore the CQRS [8] pattern. This pattern states that the
write and read responsibilities should be segregated and individually addressed by different models.
However, we will not only segregate requests based on their purpose, read-only or write, but also take
into account their different levels of consistency.
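As a sketch of this idea, the routing we have in mind can be expressed in a few lines of Python. All names here are illustrative, not part of any concrete system: writes and strongly consistent reads go to a primary store, while reads with relaxed consistency requirements are served by a replica.

```python
from enum import Enum

class Consistency(Enum):
    STRONG = "strong"
    EVENTUAL = "eventual"

class Store:
    """A trivial in-memory key-value store standing in for a real backend."""
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
    def read(self, key):
        return self.data.get(key)

class Router:
    """Segregates requests into two flows, as the CQRS-inspired design does."""
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def handle(self, request):
        # Classic CQRS: segregate by purpose (write vs. read) ...
        if request["type"] == "write":
            return self.primary.write(request["key"], request["value"])
        # ... extended here to also segregate reads by consistency level.
        if request.get("consistency", Consistency.EVENTUAL) is Consistency.STRONG:
            return self.primary.read(request["key"])
        return self.replica.read(request["key"])
```

A relaxed read may therefore observe stale data from the replica, which is exactly the trade-off the thesis accepts in exchange for lower read latency.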
1.4 Thesis outline
We start this thesis by, in Chapter 2, presenting our research on the related work, followed by the pro-
posed solution in Chapter 3. Next, in Chapter 4, we describe how our solution is implemented. Chapter
5 demonstrates the results of the evaluation performed on our solution. We finalize the document, in
Chapter 6, by presenting our conclusions and addressing future work.
Chapter 2
Related Work
In this chapter we present our survey on the most relevant topics related to our work. We start, in
Section 2.1, by addressing systems that were built to store large quantities of data. Next, in Section 2.2,
we present a set of systems that combine different storage systems to build a global system that takes
advantage of the strengths of each storage system. We finalize, in Section 2.3, by addressing studies
and proposals for adding quality of service to storage systems.
2.1 Large-scale storage
We can broadly divide storage systems into two classes: SQL and NoSQL. NoSQL storage systems can
be further differentiated according to how they choose to store data. In particular, within the NoSQL
category, we can find key-value stores, wide-column stores, document databases and graph databases.
In this section, we focus on storage systems that are designed to store large quantities of data while
providing high performance and availability guarantees.
SQL Storage Systems
SQL storage systems are those based on the language supported by relational databases,
which is by and large a variation of the Structured Query Language (SQL). This corresponds to a rigid,
structured way of storing data, like a phone book. Relational databases were originally designed as
single-machine implementations. However, there are several ways to make relational databases
more available or able to store more data while still providing good performance. Below, we describe
recent research proposals for scaling this approach.
BlinkDB
BlinkDB [9] is an approximate query engine for running interactive SQL queries on large volumes of
data. It aims to enhance the performance on large data by trading query accuracy for response time.
Storage systems that use approximation techniques depend heavily on the workload from which the
samples are drawn. We can divide workloads into four categories: Predictable Queries, Predictable Query Predicates,
Predictable Query Column Sets and Unpredictable Queries. Predictable Queries assume that all future
queries are known in advance; Predictable Query Predicates assume that the frequency of group and
filter predicates does not change over time; Predictable Query Column Sets assume that the frequency
of the sets of columns used for grouping and filtering does not change over time; and Unpredictable
Queries assume that queries are unpredictable. BlinkDB uses the Predictable Query Column Sets
model because these types of queries were found to be the most common ones on large-scale analytical
clusters. These systems also depend on the type of sampling performed on that workload. In the case of
BlinkDB, they resort to stratified sampling [10] and uniform sampling, depending on the user constraints.
For example, uniform sampling works well with queries that do not filter or group subsets of the table;
otherwise, stratified sampling is the better choice.
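To illustrate the difference between the two strategies, the following Python sketch (not BlinkDB's actual implementation) contrasts uniform sampling, where rare groups may vanish from the sample, with a simple form of stratified sampling that caps the number of rows kept per group:

```python
import random

def uniform_sample(rows, fraction, seed=0):
    # Every row is kept with the same probability; estimates for rare
    # groups may then be based on very few (or zero) sampled rows.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def stratified_sample(rows, column, cap, seed=0):
    # Keep at most `cap` rows per distinct value of `column`, so rare
    # groups are never lost entirely (the sampling fraction varies per group).
    rng = random.Random(seed)
    by_group = {}
    for r in rows:
        by_group.setdefault(r[column], []).append(r)
    sample = []
    for group_rows in by_group.values():
        if len(group_rows) <= cap:
            sample.extend(group_rows)
        else:
            sample.extend(rng.sample(group_rows, cap))
    return sample
```

An approximate GROUP BY over the stratified sample can then scale each group's partial result by its own sampling fraction, which is why stratified samples behave better for filtered or grouped queries.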
Regarding the data model, BlinkDB is a SQL database that supports a constrained set of SQL-style
declarative queries. It can provide approximate results for standard SQL aggregate queries such as
COUNT, AVG, SUM and QUANTILE, but it does not support arbitrary joins or nested SQL queries.
Queries can be annotated with either an error bound or a time constraint. As an example,
consider querying a table Person with three columns: Name, Genre, Email. The query:
SELECT COUNT(*)
FROM Person
WHERE Genre = ’male’
ERROR WITHIN 5% AT CONFIDENCE 95%
will return the results for the count operation having relative error of at most 5% at a 95% confidence
level. While the query:
SELECT COUNT(*)
FROM Person
WHERE Genre = ’male’
WITHIN 2 SECONDS
will return the most accurate results in 2 seconds.
Architecturally, BlinkDB extends the Apache Hive Framework [11] by adding two major components:
an offline sampling module, responsible for creating and maintaining the samples over time, and a
runtime sample selection module, responsible for creating an Error-Latency Profile that is later used to
select the most suitable sample depending on the type of query the user is requesting, either an error
bound one or a time constraint one.
NoSQL Key-Value Storage Systems
Key-value stores are storage systems that store data as an association between keys and values. Each
stored item has an attribute name (or "key") coupled with its value. Below we describe a few storage
systems that use this type of storage.
Riak KV
Riak KV [3] is a NoSQL key-value database capable of running on commodity hardware while guaranteeing
high availability and scalability. It was designed to survive network partitions and failures as well as
to scale seamlessly. It is capable of adding and removing capacity on demand without manually sharding
data or restructuring the cluster.
Architecturally, Riak uses a masterless, peer-to-peer design under a ring topology instead of a
master-slave design. Since all nodes are homogeneously capable of serving reads and writes, the
system has no single point of failure. With this kind of design, the system provides high data availability
under many failure circumstances. To balance the load of the ring, this storage system uses consistent
hashing [12], which allows for simple, local decisions regarding where to store the data, and attempts to
minimize the number of keys that need to be remapped when nodes join or leave the system.
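The idea behind consistent hashing can be sketched as follows; this is a textbook ring with virtual nodes, not Riak's actual implementation:

```python
import bisect
import hashlib

def _hash(value):
    # Map an arbitrary string onto the ring's integer space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        # Each physical node is placed at several virtual positions so that
        # load spreads evenly; churn only affects a node's neighbors.
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def lookup(self, key):
        # A key is owned by the first node clockwise from its hash position.
        pos = _hash(key)
        idx = bisect.bisect(self.ring, (pos, ""))
        return self.ring[idx % len(self.ring)][1]
```

The key property is the one the text describes: removing a node remaps only the keys that node owned, while every other key keeps its owner.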
Another important aspect of this system, shared with many NoSQL systems, is the concept of
eventual consistency. This means that a write operation is considered complete once a subset of the
nodes has confirmed receiving the data; the remaining replicas converge to the new value over time.
By default, Riak replicas are eventually consistent, which makes data more available, but
sometimes inconsistent.
Regarding the data model, similarly to other key-value stores, data is stored as a collection of
key/value pairs. In Riak, the key is a binary value and the value associated with that key is either
unstructured data or one of the following data types: Flags, Registers, Counters, Sets, Maps, HyperLogLog.
These data types, called Riak Distributed Data Types, were developed to help the developer and
are described in [3].
Redis
Redis [13] is a NoSQL Key-Value database that aims to provide high availability and performance.
Regarding the data model, key-value databases are easy to understand, since the approach
used is also known as hash addressing: a value is stored associated with a unique key.
Even though Redis does not support complex queries or indexing, it supports data structures as values.
These data structures are called Redis data types, which include strings, lists, sets, sorted
sets and hashes. As an example of this data model, "HMSET Cat:2:attributes color black age 20" adds
two field-value pairs to the hash stored at key Cat:2:attributes.
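The hash-addressing model is easy to reproduce; the snippet below is a toy in-memory stand-in for Redis hashes (the `MiniRedis` class is hypothetical, not the redis-py client) mimicking the HMSET example above:

```python
class MiniRedis:
    """In-memory stand-in for Redis hashes: a key maps to a dict of fields."""
    def __init__(self):
        self.store = {}

    def hmset(self, key, mapping):
        # Add (or overwrite) field/value pairs inside the hash at `key`,
        # as "HMSET key field value [field value ...]" does.
        self.store.setdefault(key, {}).update(mapping)

    def hget(self, key, field):
        # Fetch a single field of the hash, as "HGET key field" does.
        return self.store.get(key, {}).get(field)

db = MiniRedis()
db.hmset("Cat:2:attributes", {"color": "black", "age": "20"})
```

Note that Redis stores values as strings unless a specific data type is used, which is why `age` is kept as `"20"` here.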
Architecturally, the whole dataset is kept in-memory and therefore it cannot exceed the amount of
physical RAM. Redis writes the entire dataset to disk at configurable intervals. When a replica restarts, it
just reads into memory the file containing the dataset. This system also uses a Master-Slave replication
design.
NoSQL Wide-Column Storage Systems
Wide-column storage systems store data as columns instead of rows and are optimized for
queries over large datasets. Below we describe a few systems that use this approach.
Bigtable
Bigtable [14] is a NoSQL distributed storage system designed to scale to a very large size: petabytes of
data.
Regarding the data model, Bigtable is a distributed persistent multi-dimensional sorted map. This
map is indexed by a row key, column key and a timestamp. Column keys are grouped into sets called
column families. All data stored in a column family is usually of the same type. A column key is named
using the following syntax: family:qualifier. In this storage system, each cell can store multiple versions
of the same data, which are indexed by timestamp. Timestamps can be represented as real time in
microseconds or be explicitly assigned by client applications. Versioned data must be managed, however:
Bigtable supports two per-column-family settings used to garbage-collect cell versions automatically.
The client can specify either that only the last x versions of a cell be kept, or that only new-enough
versions be kept, for instance the values written in the last three days.
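A minimal sketch of such a versioned cell, with the keep-the-last-x-versions garbage-collection policy, might look like this (illustrative only, not Bigtable's code):

```python
import time

class VersionedCell:
    """A cell keeping multiple timestamped versions of the same data.

    `max_versions` mirrors the per-column-family garbage-collection setting
    that keeps only the last x versions of a cell.
    """
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest last

    def put(self, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.versions.append((ts, value))
        self.versions.sort(key=lambda v: v[0])
        # Garbage-collect: keep only the newest max_versions entries.
        self.versions = self.versions[-self.max_versions:]

    def get(self, timestamp=None):
        # Return the latest version at or before `timestamp`
        # (or the newest version overall when no timestamp is given).
        candidates = [v for v in self.versions
                      if timestamp is None or v[0] <= timestamp]
        return candidates[-1][1] if candidates else None
```

The time-based policy ("values written in the last 3 days") would simply replace the slice in `put` with a filter on `ts`, discarding versions older than the configured age.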
To scale properly, this storage system introduces the concept of tablets. Tables are divided
into tablets, each containing a contiguous range of rows of that particular table. With this notion, tablets
can be distributed across servers accordingly, making the system highly scalable. Initially, each
table consists of just one tablet; as it grows, it is automatically split into multiple tablets.
Architecturally, this storage system uses a master-slave design with Chubby [15] as its lock service
and Paxos [16] as the algorithm responsible for keeping replicas consistent when failures occur. The
system has three major components: a library linked into every client, one master server and many
tablet servers. Tablet servers are responsible for managing a set of tablets, which in turn contain the
information that clients look for when issuing read and write requests. The master server is responsible
for assigning tablets to tablet servers, balancing the tablet-server load and performing garbage collection.
Bigtable uses a three-level hierarchy to store tablet location information: the first level is a file stored
in Chubby that contains the location of the root tablet; the root tablet then holds the locations of the
metadata tablets, which in turn contain the locations of the user tablets.
Cassandra
Cassandra [2] is a highly scalable NoSQL distributed storage system designed to run on cheap hardware
while being able to handle high write throughput without sacrificing read efficiency.
Architecturally, Cassandra uses a peer-to-peer design under a ring topology instead of a master-slave
or sharded design, mainly due to its high availability demands. Even if a node fails, the other
nodes in the ring are able to deliver the data, since they all eventually share the same information.
One of the key features of this system is the ability to scale incrementally. To achieve this, Cassandra
partitions data using consistent hashing [12], which allows the system to minimize the effects of the
departure or arrival of a node: immediate neighbors on the ring are affected, but the other nodes remain
unaffected. The system also strengthens this algorithm by analyzing load information on the ring and
moving lightly loaded nodes in order to help heavily loaded ones. This technique is described in [17].
As mentioned previously, this is a distributed system, so replication is a very important part of it,
since it is the key to achieving high availability. Replication is performed differently depending on
the type of request. For writes, the system routes the request through the replicas and waits for a
quorum of replicas to acknowledge the write. For reads, depending on the consistency guarantees
established by the client, the system routes the request to the closest replica or to all replicas and
waits for a certain number of responses. The client chooses the replication policy that best suits the
application, with the options of Rack Unaware, Rack Aware or Datacenter Aware. These policies are
described in [2].
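The quorum intuition can be sketched as follows: as long as the set of replicas that acknowledged a write and the set queried by a read overlap, the read sees the newest acknowledged value. This toy model is illustrative and far simpler than Cassandra's actual replication:

```python
class QuorumStore:
    """Toy quorum replication over n replicas (majority read/write quorums)."""
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.quorum = n_replicas // 2 + 1
        self.version = 0  # stand-in for per-write timestamps

    def write(self, key, value, available):
        # `available` lists the replica indices reachable for this request.
        if len(available) < self.quorum:
            raise RuntimeError("not enough replicas for a write quorum")
        self.version += 1
        for i in available[: self.quorum]:
            self.replicas[i][key] = (self.version, value)

    def read(self, key, available):
        if len(available) < self.quorum:
            raise RuntimeError("not enough replicas for a read quorum")
        answers = [self.replicas[i].get(key) for i in available[: self.quorum]]
        answers = [a for a in answers if a is not None]
        # Return the value with the newest version among the answers;
        # two majority quorums always share at least one replica.
        return max(answers)[1] if answers else None
```

Because any two majorities of the same replica set intersect, a read quorum always includes at least one replica holding the latest acknowledged write, even when some replicas missed it.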
Regarding the data model, a table is a distributed multi-dimensional map indexed by a key. Every
operation performed under a single row key is atomic per replica, no matter how many columns are
being accessed. Much like Bigtable, Cassandra allows columns to be grouped together as column
families, which can be of two types: simple and super column families. A super column family can be
visualized as a column family within a column family. When serving read requests, Cassandra allows
columns to be sorted either by time or by name.
NoSQL Document Storage Systems
Document storage systems pair each key with a complex data structure known as a document.
Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Below we describe a system that relies on this approach.
MongoDB
MongoDB [1] is a distributed NoSQL document database with high availability, scalability and
performance guarantees. Unlike many NoSQL databases, this system is not limited to key-value
queries: it also supports, to name a few, range queries, geospatial queries and MapReduce queries.
Regarding the data model, this storage system stores data as documents in a binary representation
called BSON (Binary JSON). BSON documents contain one or more fields, and each field can be of
different data types, including arrays or even other documents.
Scalability in MongoDB is ensured through sharding, which distributes data across multiple partitions
called shards. As the data grows or the size of the cluster increases, this storage system is capable of
automatically balancing the load between the replicas. Interestingly, MongoDB provides multiple
sharding policies that define how data is distributed across the cluster. Developers can opt between
range-based sharding, where documents are partitioned according to the shard key value; hash-based
sharding, where documents are uniformly distributed according to an MD5 hash of the shard key value;
and location-aware sharding, where documents are partitioned according to a user-specified configuration
that associates shard key ranges with specific shards and hardware.
Data availability is achieved using replica sets. A replica set consists of two or more copies of the
data. There is always a primary replica responsible for handling write and read requests; however,
by default, it is possible to read eventually consistent data from secondary replicas. When the primary
replica crashes, a new primary is elected from the ones available in the replica set.
Architecturally, MongoDB uses a Master-slave design.
Other types of NoSQL storage systems
Even though messaging systems are not primarily designed to store data persistently, a properly
adapted one can serve as a storage system. One such example is Kafka, described below.
Kafka
Kafka [18] is a distributed messaging system for log processing capable of collecting and delivering high
volumes of data with low latency while guaranteeing fault-tolerance and scalability. It can also be used
as a storage system.
Architecturally, Kafka uses a publish-subscribe design. It has three major components: producers,
brokers and consumers. Producers publish messages to topics held by the brokers, and consumers
retrieve those messages by subscribing to the topics they are interested in. Since this is a distributed
system, a topic is divided into multiple partitions and each broker stores one or more of those
partitions. Each partition is also replicated across a configurable number of servers (brokers) for
fault tolerance.
Any message queue that decouples publishing messages from consuming them is effectively acting as
a storage system. This is the main reason why Kafka can be used not only as a distributed messaging
system but also as a storage service. Data written to this system is persisted to disk and replicated
for fault tolerance. In addition, Kafka allows producers to wait for acknowledgements of their writes,
so that a write is only considered complete when it has been fully replicated and it is guaranteed to
remain durable no matter what happens. Regarding reads, one might expect that, since this is a
messaging system, consumers cannot choose which particular messages to read. However, Kafka
consumers are able to control their read position.
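A minimal in-memory model of this log abstraction (a sketch, not the actual Kafka client API) illustrates both points: messages are appended to a partitioned log, and each consumer tracks and controls its own read offset:

```python
class PartitionedLog:
    """Toy model of a Kafka topic: an append-only log per partition,
    with consumers tracking their own offsets."""

    def __init__(self, partitions: int):
        self.partitions = [[] for _ in range(partitions)]
        self.offsets = {}   # (consumer, partition) -> next offset to read

    def produce(self, partition: int, message: str) -> int:
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1   # offset of the new message

    def consume(self, consumer: str, partition: int):
        off = self.offsets.get((consumer, partition), 0)
        if off >= len(self.partitions[partition]):
            return None
        self.offsets[(consumer, partition)] = off + 1
        return self.partitions[partition][off]

    def seek(self, consumer: str, partition: int, offset: int):
        """Consumers control their read position: rewind or skip ahead."""
        self.offsets[(consumer, partition)] = offset

log = PartitionedLog(partitions=2)
log.produce(0, "m1")
log.produce(0, "m2")
assert log.consume("c1", 0) == "m1"
assert log.consume("c1", 0) == "m2"
log.seek("c1", 0, 0)            # rewind and re-read: the log is durable storage
assert log.consume("c1", 0) == "m1"
```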
Summary of comparison to previous systems
Focus
Like the systems covered above, we focus on providing performance and availability guarantees when
dealing with large quantities of data. Consistency is also addressed and, in some storage systems, the
customer can choose the level of consistency.
Design
To ensure performance and availability, these systems differ both in terms of the replication
architecture, for example, a peer-to-peer design versus a master-slave one, and in terms of the data
model, for instance, key-value stores versus wide-column stores.
Each one of them uses a different approach. For example, if consistency and accuracy are not a
primary concern for the customer, using an approximate query engine can easily improve the performance
of the system by trading accuracy for performance. However, most of the time, customers do care about
consistency, in which case approaches like key-value stores, wide-column stores or document stores are
more suitable.
To better understand the differences and similarities between the different storage systems described
above, Table 2.1 provides an overview of the comparison between them.
System      Main goals                                   Architecture
BlinkDB     Performance                                  Unknown
Riak KV     Availability and Scalability                 Peer-to-Peer
Redis       Availability and Performance                 Master-Slave
Bigtable    Scalability                                  Master-Slave
Cassandra   Scalability                                  Peer-to-Peer
MongoDB     Availability, Scalability and Performance    Master-Slave
Kafka       Scalability, Availability and Performance    Publish-Subscribe

Table 2.1: Storage systems comparison
In contrast to these storage systems, where the same system is replicated to achieve availability and
performance guarantees when reading or writing data, our proposed solution separates these two
models, e.g., by using different storage systems to handle writes and reads. This approach allows us
to select the most suitable storage system for each request type, and is described in more detail in
Chapter 3.
2.2 Combining different storage systems
In this section we cover several proposals whose idea is to combine systems with different
characteristics in order to obtain a single global storage system with better characteristics than any
of those systems used individually. These are mainly cloud storage systems, which are notorious for
achieving high availability and performance. Even though combining different storage systems into an
enhanced global storage system is the main idea behind these proposals, they are also concerned with
complying with the customer's service level agreements regarding, for instance, performance, cost,
throughput, consistency or availability.
CosTLO
Cost-effective Tail Latency Optimizer [19] is a system that aims to reduce the high latency variance
associated with cloud storage services while still meeting the clients' demands, expressed as SLOs
(Service Level Objectives).
When an end user needs to fetch data from or send data to the cloud, the requests that perform
these operations incur a latency that determines how fast each operation is executed. In order to
reduce the latency variance of cloud services, we first need to identify the main factors that increase
the response time of a request. CosTLO identified the following factors as responsible for latency
variance:
• Internet latencies
• Latency spikes in a data center’s network
• Latency spikes within the storage service
Since tail latency samples caused by any of the previous factors are mostly due to isolated latency
spikes, the authors opted for redundancy, a well-known approach for reducing variance [20] [21]. By
augmenting each GET and PUT request with a set of concurrent redundant requests, they increase the
probability of receiving at least one fast response. The redundancy approach of CosTLO can be
exploited in two ways: it can implicitly take advantage of load balancing in the Internet, since a
request can take different paths to the same data center, or it can explicitly issue concurrent
requests to copies of the same object stored in different locations. Depending on the latency factor,
the approach alternates between the two.
The following techniques address the latency caused by each of the factors identified above. To
tackle spikes in Internet latencies, CosTLO issues multiple requests to the data center closest to the
client. If the client desires to reduce the tail latency even further, then CosTLO concurrently issues
requests to the two data centers closest to the client which, interestingly, can belong to different
cloud providers. Even though data centers of the same cloud provider are spread across the globe to
achieve more coverage, data centers of different cloud providers are likely to be closer to each other,
thus leading to reduced latency. For latency spikes in a data center's network, the system issues
multiple requests to the storage service in that data center. For the last factor, latency spikes within
the storage service, CosTLO issues multiple requests either to the same object or to different copies
of the object that the client wants to access.
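The redundant-request idea can be sketched as follows (plain Python threads with simulated latencies and hypothetical data-center names; CosTLO issues real GETs against cloud services):

```python
import concurrent.futures
import time

def fetch(copy: str, latency: float) -> str:
    """Stand-in for a GET against one copy of the object."""
    time.sleep(latency)
    return copy

def redundant_get(copies):
    """Issue concurrent requests to all copies and keep the first response;
    one slow replica no longer dictates the tail latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(copies)) as pool:
        futures = [pool.submit(fetch, name, lat) for name, lat in copies]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

# The fast copy wins even though another copy suffers a latency spike.
winner = redundant_get([("us-east", 0.01), ("eu-west", 0.3)])
assert winner == "us-east"
```

The tradeoff discussed next is visible here: the second request still consumes bandwidth (and, in a real cloud, money) even though its response is discarded.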
However, while redundant requests considerably reduce the latency variance, they also double the
monetary cost of GET operations and network bandwidth. CosTLO studies this tradeoff in order to
find the most suitable configuration for the client's demands. This is necessary because, even though
the main goal of this system is to reduce the latency variance of cloud storage services, it must also
take the users' demands into account and remain cost-effective.
Architecturally speaking, CosTLO has a component called ConfSelector that is responsible for serving
PUTs and GETs between the client and the storage services available in the cloud. ConfSelector divides
time into epochs, and at the start of every epoch it selects a configuration separately for each IP prefix.
The advantage of using epochs is that the system can detect, at the start of each epoch, whether the
client's SLOs are being met. This configuration is identified as follows. ConfSelector starts by setting
each IP prefix to issue only a single request to the data center closest to the client. It then maintains
a pool of candidate configurations, which is iterated in increasing order of cost, until the minimum-cost
configuration capable of meeting the client's demands is found.
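This configuration search can be sketched as follows (the candidate pool, its costs, and the SLO predicate are all hypothetical):

```python
def pick_configuration(candidates, meets_slo):
    """Iterate candidate configurations in increasing order of cost and
    return the cheapest one that meets the client's SLO."""
    for config in sorted(candidates, key=lambda c: c["cost"]):
        if meets_slo(config):
            return config
    return None   # no candidate can meet the SLO

# Hypothetical pool: cost in $/month, predicted 99th-percentile latency in ms.
pool = [
    {"name": "single-request",     "cost": 10, "p99_ms": 180},
    {"name": "2x-redundant",       "cost": 22, "p99_ms": 90},
    {"name": "2x-redundant-2-DCs", "cost": 35, "p99_ms": 60},
]

slo = lambda c: c["p99_ms"] <= 100          # client demands p99 <= 100 ms
chosen = pick_configuration(pool, slo)
assert chosen["name"] == "2x-redundant"     # cheapest config meeting the SLO
```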
The techniques used by CosTLO to minimize cost while keeping the latency variance low are also
reflected in the timeout period imposed before redundant requests are issued: concurrent redundant
requests are only issued if a request issued by a client receives no response within a predefined time.
RACS
Redundant Array of Cloud Storage [4] is a proxy that focuses on allowing customers to avoid vendor
lock-in, reduce the cost of switching between cloud providers, and better tolerate provider outages and
failures.
Cloud storage providers distinguish themselves by allowing data to be accessed from anywhere in the
globe, as well as by integrating storage with other cloud computing services, as with Amazon's EC2
and S3. Storage, however, is the most sought-after service of them all, which leads to very competitive
pricing schemes among cloud providers. Providers also compete on other aspects, such as availability
and uptime guarantees, backed by systems that handle load balancing, recurring data backup and
redundancy, in order to minimize the cost of failure.
Bearing in mind that clients are concerned about cost and availability, this system identified two types
of failures: outages and economic failures. Outages are series of improbable events that lead to data
loss in the cloud, while economic failures are the constant changes in the pricing schemes of cloud
providers. The latter is the main concern of RACS.
To address these failures and avoid vendor lock-in, hence minimizing the cost of failure, RACS
proposes to redundantly distribute its data across different cloud providers. Full replication guarantees
market mobility but at a very high storage and bandwidth cost. Since cost and availability are the main
goals of the system, the authors opted instead to stripe data across multiple providers. They argue
that this technique gives clients agility when responding to provider changes or to new providers that
emerge in the market with more attractive pricing schemes. This redundancy is ensured by the use of
erasure coding, which allows RACS to avoid strict replication because only a subset of the data is
required to reconstruct its original state.
In the end, RACS managed to tolerate outages, tolerate data loss, adapt to price changes and even
implement data recovery. Nonetheless, while it was able to mitigate vendor lock-in, it incurred higher
overheads in storage, throughput, bandwidth and latency. This tradeoff can be adjusted by the
consumer when setting up RACS.
Architecturally, RACS mimics the interface of Amazon S3 by using the same data model. It stores
data in buckets, where each bucket contains keys associated with objects. This choice was mostly
driven by S3's popularity, which increases RACS's compatibility with existing client applications. In
terms of design, RACS is a proxy that links clients to a set of n repositories, which are cloud storage
locations. It is responsible for splitting an object into m data shares of equal size when a user issues
a PUT, and for reconstructing the data coming from those n repositories when a client issues a GET.
Since all data must pass through a RACS proxy, the proxy can become a bottleneck. To solve this,
the system proposes a distributed architecture where multiple proxies are synchronized using
ZooKeeper [22].
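The m-of-n idea can be sketched with the simplest possible erasure code, a single XOR parity share (a toy model: RACS uses a general erasure code, and each share would be stored at a different provider):

```python
def split_with_parity(data: bytes, m: int):
    """PUT path: split an object into m equal data shares plus one XOR
    parity share, each destined for a different repository."""
    padded = data.ljust(-(-len(data) // m) * m, b"\0")   # pad to a multiple of m
    size = len(padded) // m
    shares = [bytearray(padded[i * size:(i + 1) * size]) for i in range(m)]
    parity = bytearray(size)
    for share in shares:
        for i, b in enumerate(share):
            parity[i] ^= b
    return [bytes(s) for s in shares] + [bytes(parity)]

def reconstruct(shares, lost: int, original_len: int) -> bytes:
    """GET path: any m of the m+1 shares suffice; a missing data share
    is recovered by XOR-ing all surviving shares."""
    size = len(next(s for i, s in enumerate(shares) if i != lost))
    recovered = bytearray(size)
    for i, share in enumerate(shares):
        if i == lost:
            continue
        for j, b in enumerate(share):
            recovered[j] ^= b
    data_shares = [recovered if i == lost else bytearray(shares[i])
                   for i in range(len(shares) - 1)]
    return b"".join(bytes(s) for s in data_shares)[:original_len]

obj = b"hello cloud storage"
shares = split_with_parity(obj, m=3)    # 3 data shares + 1 parity share
# One repository is down, yet the object is fully recovered:
assert reconstruct(shares, lost=1, original_len=len(obj)) == obj
```

With one parity share the scheme tolerates a single repository failure; real deployments use codes with more parity shares to tolerate multiple concurrent outages.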
SPANStore
Storage Provider Aggregating Networked Store [23] is a system that focuses on minimizing cloud
storage costs while respecting the application's SLOs, such as data consistency, fault tolerance and
latency.
This system aims to overcome several challenges in order to satisfy the client's latency, consistency
and fault tolerance requirements. Firstly, the system needs to handle the interdependencies between
these requirements. For example, to minimize cost for an application that requires strongly consistent
data, the best option is to store all of the application's data in the cheapest storage service; however,
this may violate the latency requirements. Secondly, it has to take into account the workload of the
application. Even if different applications have the same latency, fault tolerance and consistency
demands, their optimal configurations may differ depending on whether the application is dominated
by PUTs or GETs. Finally, it needs to consider the multi-dimensional pricing of cloud storage services,
which involves several metrics, for instance, the amount of stored data or the number of PUTs and
GETs issued.
Like the previous systems that we analyzed, SPANStore seeks to provide low latency access to
storage. The simplest way to achieve this is to replicate every object in every data center. However,
while replicating all objects to all data centers may ensure low latency access, this approach is costly
and sometimes inefficient; for instance, some objects may be requested more often in some regions
than in others. SPANStore instead leverages multiple cloud providers: since different providers have
data centers in different locations throughout the globe, clients and other data centers can
communicate with the closest possible data center, thus lowering the latency of user requests. While
the primary goal of this approach is to reduce latency, it also helps to minimize cost, by exploiting the
discrepancies in the price schemes of cloud providers for the same services. For instance, if an
application is PUT-dominated, then SPANStore opts to serve that application's PUTs from the data
centers of the cloud provider with the least expensive PUTs. Thus, by judiciously combining resources
from multiple providers, SPANStore is able to use a set of inexpensive operations to reduce costs.
To prevent the system from failing to meet the clients' SLOs, due to, for instance, changes in the price
schemes of providers, SPANStore keeps track of the application's workload and latencies. If it notices
that the system is not meeting the client's demands, it modifies the system's behavior towards that
goal.
Architecturally, SPANStore is deployed at every data center in which the application is deployed. The
application issues PUT and GET requests for objects to a SPANStore library that the application is
linked against. This library serves these requests by looking up in-memory metadata stored in the
SPANStore VMs of that data center, and thereafter issuing the requests to the respective storage
services. The manner in which SPANStore VMs should serve the requests for any particular object is
defined by a central component called PlacementManager. This component divides time into
fixed-duration epochs, as described previously. At the start of every epoch, all SPANStore VMs
transmit to this component a summary of the application's workload and latencies measured in the
previous epoch. With this information, it computes the optimal replication policies to be used for the
next epoch, in order to keep meeting the client's SLOs.
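The cost side of this policy computation can be sketched as follows (per-provider prices and provider names are hypothetical; the real PlacementManager also folds latency and fault-tolerance constraints into the optimization):

```python
# Hypothetical price schemes: $ per 10k PUTs, $ per 10k GETs, $ per GB-month.
PROVIDERS = {
    "provider-A": {"put": 0.05, "get": 0.004, "storage": 0.023},
    "provider-B": {"put": 0.10, "get": 0.001, "storage": 0.020},
}

def monthly_cost(provider, puts_10k, gets_10k, gb):
    p = PROVIDERS[provider]
    return puts_10k * p["put"] + gets_10k * p["get"] + gb * p["storage"]

def cheapest_provider(puts_10k, gets_10k, gb):
    """Pick the provider that minimizes cost for this workload mix."""
    return min(PROVIDERS, key=lambda pr: monthly_cost(pr, puts_10k, gets_10k, gb))

# A PUT-dominated application is cheaper on the provider with cheap PUTs...
assert cheapest_provider(puts_10k=1000, gets_10k=10, gb=5) == "provider-A"
# ...while a GET-dominated one favors the provider with cheap GETs.
assert cheapest_provider(puts_10k=10, gets_10k=1000, gb=5) == "provider-B"
```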
HyRD
HyRD [5] is a system that focuses on improving cloud storage availability, avoiding vendor lock-in,
reducing user access latency and improving cost-efficiency, in order to avoid violating service level
agreements.
The reliability of data stored in the cloud is threatened by the outages that cloud storage services face,
and this becomes an even more pressing problem when a single cloud storage service is used to store
the data. These outages may lead to data loss in the cloud or unavailability of the service to its users.
Even when strict Service Level Agreements (SLAs) are set between the cloud provider and the user,
service failures and outages are almost unavoidable and may jeopardize such targets. For instance,
in 2014 a series of high-profile cloud outages [24] hit major cloud providers, such as Amazon,
Microsoft and Google, ranging from a 5-minute failure that cost half a million dollars to a week-long
disruption that caused an immeasurable amount of brand damage. These are only some examples of
the problems faced when storing data in the cloud, a subject previously addressed in more detail in
the description of SPANStore.
To solve this problem, HyRD proposes, similarly to other systems we surveyed, to redundantly
distribute the data across multiple providers by means of data redundancy schemes, such as
replication and erasure codes. With this solution, users are able to avoid vendor lock-in, since the
cost of switching becomes lower, while also being protected against outages of a single cloud
provider. However, deciding on the best redundancy scheme is not straightforward. On the one hand,
while replication increases availability and durability, it introduces a set of challenges. For example,
to achieve high availability in large systems, it is necessary to increase the number of replicas, which
adds extra bandwidth and storage overhead; this is especially problematic when dealing with large
files. For smaller files, however, replication remains the best approach, providing the best
performance with small bandwidth and storage overhead. On the other hand, erasure coding provides
redundancy with less space overhead. As explained for RACS, it divides an object into m fragments
and recodes them into a larger set of n fragments, such that the original object can be recovered from
a subset of the n fragments. This implies that storing a file requires extra time to compute the
redundancy information, especially when dealing with small files. Another problem is the large amount
of network traffic required to reconstruct data. In summary, replication-based schemes perform better
when storing small files and file metadata, while erasure-code-based schemes perform better and are
more cost-efficient when dealing with larger files. We must also take into consideration that the sizes
of files and metadata may vary. Previous studies on workload characteristics have shown that file
metadata accesses are more frequent than file accesses [25] [26], so the performance of file metadata
accesses is critical to the overall system performance. HyRD, similarly to the systems previously
analyzed, noticed that different cloud providers offer different price schemes as well as different
access latencies.
Based on this analysis, HyRD uses the redundancy scheme that best fits each file. Large files are
redundantly stored in cost-efficient cloud storage providers using erasure-code-based schemes, while
small files are stored in multiple high-performance cloud storage providers using replication-based
schemes. This way, the system is able to mitigate the disadvantages of both schemes.
Architecturally, HyRD is divided in three main modules:
1. Workload Monitor
2. Cost and Performance Evaluator
3. Request Dispatcher
The first module is responsible for classifying the incoming data as file metadata, small files or large
files; this classification is based only on the access latency of the data. The second module evaluates
the cloud storage services based on their performance and price schemes. The last module uses the
classification produced by the workload monitor to decide which redundancy scheme is more
appropriate for that data, a decision supported by the evaluation performed by the cost and
performance evaluator. Finally, the data is distributed to the corresponding cloud storage providers.
HyRD interacts with the cloud providers through their standard interfaces, which facilitates the
inclusion of any cloud storage provider.
When dealing with service outages, small files and file metadata are fetched from the replicated
providers, while large files are reconstructed using the erasure-coding scheme.
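The dispatch logic can be sketched as follows (the size threshold is hypothetical; HyRD's actual classifier also draws on the workload monitor's measurements):

```python
SMALL_FILE_LIMIT = 1 << 20   # 1 MiB, a hypothetical threshold

def choose_scheme(kind: str, size: int) -> str:
    """Metadata and small files -> replication across fast providers;
    large files -> erasure coding across cost-efficient providers."""
    if kind == "metadata" or size <= SMALL_FILE_LIMIT:
        return "replication"
    return "erasure-coding"

assert choose_scheme("metadata", 200) == "replication"
assert choose_scheme("file", 512 * 1024) == "replication"
assert choose_scheme("file", 300 << 20) == "erasure-coding"
```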
Conductor
Conductor [27] is a system that aims to deploy and execute MapReduce computations in the cloud
by choosing the cloud providers and services that best meet the requirements presented by the
clients: the computation to be executed in the cloud, the set of cloud services that could be used,
and the set of goals to optimize during the execution of that computation.
The motivation for this system is that, when using cloud services, customers face the challenge of
choosing the cloud provider that best suits their performance or price needs. There are many
providers offering similar services that differ only in pricing schemes and performance characteristics,
and even after picking a provider there are often different types of resources for the same service. An
example is Amazon's EC2 service, where one can choose among a variety of virtual machine instance
types. The pricing schemes of cloud services are another issue that customers need to take into
consideration. These schemes are not static, which adds another layer of complexity: they may vary
over time as providers adjust their pricing models, or as new providers emerge in the market.
Estimating the cost of alternative deployment strategies is also challenging, because the advertised
characteristics of the services used to deploy a system may not correspond to what is actually
observed; for example, throughput may drop due to congestion in the cloud service. Customers also
need to take into account the cost of transferring data between computation and storage locations.
Last but not least, a customer also has to decide whether to pay less but accept lower reliability
(lower replication factors), or to pay more for stronger protection against faults. All of these
challenges are problems that Conductor tries to solve.
Even though these problems are Conductor's main challenges, transparency, efficiency, adaptivity
and flexibility are also taken into consideration. The system is able to leverage different types of
services without having to adapt applications to the interface provided by each service; to allow the
customer to specify goals; to detect when the system is no longer meeting the customer's SLAs and
react accordingly; and to easily incorporate new services.
To accomplish all of this, Conductor chooses the services that are able to meet the customer-defined
goals (for instance, minimizing cost or execution time) and deploys computations on the cloud
according to that choice.
After receiving the customer’s requirements needed to plan and deploy jobs in the cloud (namely a
computation to be executed in the cloud, a set of cloud services that could be used for executing the
computation, and a set of goals to optimize the execution), Conductor starts its life cycle.
This system's life cycle is divided into four phases:
1. Create the initial model that describes the computation and the costs implied by its execution, for
instance, how much it will cost or how much time it needs to conclude a MapReduce job;
2. Determine an optimal execution plan by using a solver and the information gathered in the previous
phase;
3. Deploy the planned execution;
4. Detect deviations and, upon detection, compute a new plan that enables the system to keep
meeting the client's requirements. When computing the new plan, Conductor also tries to take
advantage of price drops on the services available from the cloud providers.
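This control loop can be sketched as follows (a toy cost model and "solver", with hypothetical instance names and prices; the real system feeds a much richer model to an optimization tool):

```python
def plan(model):
    """Phase 2: the 'solver' picks the option minimizing predicted cost."""
    return min(model, key=model.get)

def run(model, observe):
    """Phases 3-4: deploy the plan, then replan when observed costs deviate."""
    choice = plan(model)
    observed = observe(choice)            # phase 3: execution yields real costs
    if observed > model[choice] * 1.2:    # phase 4: deviation beyond tolerance
        model[choice] = observed          # refresh the model (phase 1 redone)
        choice = plan(model)              # phase 2 again
    return choice

model = {"ec2-small": 10.0, "ec2-large": 14.0}   # hypothetical $ per job
# Congestion makes the small instances costlier than predicted, so the
# system replans onto the larger instances:
assert run(model, observe=lambda c: 16.0) == "ec2-large"
```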
In Figure 2.1, we summarize the overall system design. There we can better understand the objective
of each phase and the information shared between them.
Summary of comparison to previous systems
Focus
These systems focus on ensuring that customer service level agreements, for instance, consistency and
performance, are always guaranteed while trying to keep the systems as cost-effective as possible to
the customers. These customer requirements are also concerns applicable to our proposed solution,
described in Chapter 3.
Design
To accomplish their goals, the designers of these systems needed to understand the best strategy to
apply, taking into account the specifics of each system and the end users' goals.
Figure 2.1: Conductor’s system overview [27]
To address the problems that each system faced, different strategies had to be chosen. Although
these systems share the same concerns, each one prioritizes them differently. To this end, they
needed to understand which tradeoffs could be justified. For instance, to reduce latency, some
systems decided to use the cloud providers that guaranteed the best performance. When availability
was the main target, they often opted for introducing redundancy, whether through replication-based
schemes or erasure-coding-based schemes; however, they had to explore the implications of this
redundancy, which could easily affect the customer's performance requirements. They also noticed
that, depending on the workload characteristics of the data being managed, different replication
schemes had to be chosen in order to improve their results. If monetary cost was the main concern,
they opted to manage multiple cloud providers, which allowed them to choose the ones that provided
the best service at the lowest cost.
To better understand the differences and similarities between the different systems described above,
Table 2.2 summarizes each system.
Similarly to these systems, which combine different storage systems available in the cloud to come up
with a better global storage system, we will focus on CQRS [8] (Command Query Responsibility
Segregation), a pattern that allows us to differentiate the models used to store and retrieve data.
However, instead of leveraging multiple cloud providers and their respective storage services, our
focus is on combining different storage systems with different characteristics, all of them locally
deployed by the same organization. This gives the system increased flexibility in terms, e.g., of
configuring each storage system, transferring data across systems, and reconfiguring system
parameters dynamically.
System          Main goals                                      Strategy
CosTLO [19]     Reduce the high latency variance associated     Issue concurrent redundant GET/PUT
                with cloud storage services                     requests
RACS [4]        Avoid vendor lock-in and better tolerate        Redundantly distribute data across different
                outages                                         cloud providers through erasure-code-based
                                                                schemes
SPANStore [23]  Minimize cloud storage monetary costs           Combine resources from multiple providers
                without violating service level agreements
HyRD [5]        Improve availability, avoid vendor lock-in      Redundantly distribute data across multiple
                and reduce user access latency                  providers, combining erasure-code-based
                                                                and replication-based schemes
Conductor [27]  Deploy MapReduce computations in the            Predict the costs of the computation and
                cloud with optimal cost and performance         choose the best approach according to the
                                                                output of an optimization tool

Table 2.2: Cloud storage systems comparison
2.3 Quality of service in storage systems
Over the years, designing and maintaining storage systems has become a difficult and complex
process due to the amount of information that needs to be handled. Data availability as well as
performance is crucial for every computer system that relies on a storage system to manage its data:
if the storage system goes down, the whole computer system that relies on it will certainly go down as
well. When designing these systems, administrators have to properly map their logical units to the
available hardware, otherwise the quality of service is severely affected.
Typically, administrators configure storage manually using rules of thumb. However, taking into
account that storage systems are complex and application workloads rather complicated, the resulting
systems are costly or poorly set up.
To address this problem, a few tools have been proposed, each one of them exploring different
techniques, to come up with the best design possible for each system.
For instance, Minerva [28] divides the problem of designing storage systems in three phases. The
first one is to choose the right set of storage devices for a particular application workload. The second
phase is choosing the values for configuration parameters in the devices. The last phase is to map the
user data to those devices. To design the system it uses heuristic search techniques, and analytical
device models to evaluate the result of the heuristic search.
Ergastulum [7] receives a workload description, a set of potential devices, optional user-specified
constraints, and an externally-specified goal, and chooses the appropriate type of device to use, the
configuration of each device, and the mapping of application data onto those devices, so as to best
achieve the specified goal using a generalization of the best-fit bin-packing algorithm [29], described
in [7]. After producing the first configuration, it monitors the current design and performs modifications
if necessary.
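The best-fit heuristic at the core of this algorithm can be sketched as follows (a toy one-dimensional version; Ergastulum generalizes it to multi-dimensional device constraints):

```python
def best_fit(items, capacity):
    """Place each item into the open bin whose remaining space is the
    smallest that still fits it; open a new bin when none fits."""
    bins = []   # remaining capacity per open bin
    for item in items:
        candidates = [i for i, free in enumerate(bins) if free >= item]
        if candidates:
            tightest = min(candidates, key=lambda i: bins[i])
            bins[tightest] -= item
        else:
            bins.append(capacity - item)
    return len(bins)

# Workload demands (e.g., required IO bandwidth) packed onto identical devices:
assert best_fit([5, 4, 3, 2, 2], capacity=8) == 2
```

In the storage-design setting, "items" are the application's workload components and "bins" are candidate devices, so minimizing the number of bins minimizes the hardware needed to satisfy the workload.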
Rome [30] isn’t a tool itself but an information model, which is the base of several tools used to design
storage systems configurations. It is also capable of monitoring the result of the designed system and if
necessary, with the help of those tools, automatically adapt to meet the quality of service intended.
Other systems have opted to use other techniques, for example, using iterative algorithms like Hip-
19
podrome [6] to try and evolve the storage system design to a good storage system design.
In contrast to this class of systems, we are not trying to control and improve the quality of service of
a storage system by modifying that system; instead, our goal is to introduce a layer that mediates
access to one or more storage systems and superimposes such quality of service guarantees on top
of existing systems.
Summary
This chapter introduced three types of systems capable of dealing with ever increasing data usage.
The first type comprises large-scale storage systems that rely on replication to guarantee availability
and performance under such circumstances. The second type comprises systems that combine
several systems with different characteristics to achieve a better overall solution; these are mainly
cloud storage systems, which are notorious for achieving high availability and performance. Lastly,
we presented systems that address this problem by instilling quality of service in storage systems,
motivated by the observation that manually configured storage systems, given the complexity of the
systems and of application workloads, tend to be costly or poorly set up.
This study led us to propose a solution, described in the next chapter, that combines a few ideas
gathered from the state-of-the-art.
Chapter 3
Proposed Solution
Taking into consideration the analysis of the solutions presented in Sections 2.1 and 2.2, our
proposed solution consists of combining storage systems with different characteristics, similarly to the
systems described in Section 2.2, to create an enhanced global storage system capable of storing
and retrieving data better than if a single storage system were used. However, instead of leveraging
multiple cloud providers and storage systems available on the cloud, we will combine storage systems
that can all be managed by the same cloud provider or even locally deployed by the same
organization. To achieve our goal, we will explore a hybrid approach inspired by the software pattern
called Command Query Responsibility Segregation [8]. This pattern states that we should segregate
the models responsible for storing and for retrieving data. In other words, requests that fetch
information from the system can be handled by a different storage system than the requests that
modify the stored data. This holds true in most cases, except for requests that require strong
consistency, in which case we have to revert to a more costly protocol to retrieve the information.
In the following sections, we describe this hybrid approach in more detail. We start in Section 3.1
by outlining the setting in which our solution is to be deployed, which gives a real-world motivation and
context for the required solution. Then, in Section 3.2, we briefly explain the high-level idea of the
solution, followed, in Section 3.3, by the architecture of our solution. Finally, in Section 3.4 and Section
3.5, we describe the data structures and algorithms employed.
3.1 Requirements: the Unbabel use case
Unbabel1 is a company that provides translation as a service by combining humans and AI. Its ambition
is to break the language barrier through the delivery of quality translations to its clients.
The process is quite simple. Clients upload content specifying the desired language pairs (from
which language to which language they want their content to be translated), and Unbabel returns
the content translated into the chosen languages.
1 https://unbabel.com/
Internally, the uploaded content goes through a pipeline that, in a very high level, looks as follows:
1. The clients’ content is uploaded to the system
2. The content is translated using AI
3. The content is split into smaller fragments
4. The AI translations are edited by humans
5. The fragments are merged
6. The translated content is delivered to the client
This workflow is represented in Fig. 3.1.
Figure 3.1: Unbabel’s Internal Workflow
As we can see in the workflow depicted in Fig. 3.1, besides the clients, we also have editors,
who are responsible for editing the text produced by the machine translation step. Each of them
interacts with the system differently, producing different kinds of workloads. Naturally, these workloads,
in conjunction, put the system under pressure, consequently degrading the user experience. For
a company that aims to deliver high-quality translations as fast as possible, response times and user
experience are highly valued and must be ensured.
To understand these requirements in more detail, as well as how our solution operates, we now describe
a specific aspect of this workflow, which is implemented by a subsystem called Tarkin.
2 https://developers.unbabel.com/v2/docs
Tarkin is a task manager, which means that its responsibility is to store and manage tasks. Tasks are
portions of text, sent by customers, to be translated.
At a high level, the main operations of Tarkin are:
• Assign tasks for translation to editors;
• Receive translated tasks from the editors;
Every time an editor issues a request to receive a task for translation, the task manager sends them
the best-suited task according to their profile.
In order to avoid escalating the costs of translation by assigning the same task to several editors,
it is necessary to ensure that only one editor is capable of accessing and editing a particular task at a
time. However, it is possible for a task that has already been processed to return to the task queue
to be edited once again.
The problem comes with the increase in the number of read requests, performed by editors, and write
requests (insertions of new tasks into the system), performed by customers. This not only makes the system
extremely concurrent, which may violate the imposed condition that only one editor can access a
task at a time, but it also significantly degrades the response time of the system, since the system
needs to deal with both read and write workloads simultaneously. Another factor that contributes to the
degradation of the response time is the need to calculate the number of available tasks for an editor that
is requesting new tasks.
In more detail, our system will process four kinds of requests, with different requirements:
1. Get the number of available tasks;
2. Get a task;
3. Submit a translated task;
4. Create a task;
All of these requests require good performance, in particular low latency, so that the user does not
experience a slowdown when accessing the service. However, they differ in their consistency needs.
While requests 1 and 3 only require eventual consistency, requests 2 and 4 demand strong consistency,
namely that reads return the most recently written value.
The whole process of this use case is illustrated in Fig. 3.2.
3.2 Adapting the CQRS pattern
As previously introduced, in order to achieve this enhanced global storage system, we explore the
Command Query Responsibility Segregation (CQRS) [8] pattern. At a high level, this pattern states that
you can differentiate the models that update and read information. A possible way to implement this is
that, depending on whether the request is a read-only request or an update, the system accesses two
different storage systems.
Figure 3.2: Use case
However, in contrast to the original philosophy, the distinction is not made only between read-only
and update requests; instead, we leverage the different consistency levels of different
classes of requests. This enables us to take advantage of the benefits of this pattern in a
more generic way that applies to a broad range of settings where it would not otherwise be applicable. For
instance, two different classes of get requests with different consistency requirements can be segregated
in our case, whereas they could not be under the original pattern.
Throughout the document, requests that require strong consistency are handled by the command side
of the pattern (write layer), and read requests with weaker consistency demands are processed by the
query side of the pattern (read layer).
3.3 Architecture
Fig. 3.3 portrays the components that constitute our solution as well as the interactions between
those elements. After a request arrives at the API layer, the best-suited storage system is
selected according to the requirements of that particular request. The components and the overall flow
of our system are explained in more detail in Section 3.3.1 and Section 3.3.2, respectively.
3.3.1 Layers
API
The API is the entry point of our system. This component is responsible for processing requests. It also
decides which storage system is more appropriate to help process each incoming request, taking into
account its characteristics. This routing is explained in detail in Section 3.5.
Read Layer
The Read Layer, or the query side of the CQRS [8] pattern, contains the database that is accessed every
time a request to retrieve data back to the user arrives. Since these requests are read-only, we have to
take into consideration that the storage system used has to perform well when dealing with
this type of operation. This layer processes requests that do not require strong consistency.
Figure 3.3: Architecture
Write Layer
The Write Layer, or the command side of the CQRS [8] pattern, contains the database responsible for
dealing with requests that require the insertion, modification or deletion of data from the system.
When selecting which storage system to use in this layer, we need to consider that these requests
are write-intensive. This layer is also responsible for processing get requests with strong consistency
requirements.
Replication Layer
The Replication Layer is the pivotal point of our solution. Its main responsibility is to replicate the data
from the command side to the query side as fast as possible, thus minimizing the window of time during
which the system could return inconsistent data.
3.3.2 Data flow
In order to achieve this enhanced global storage system, the components presented in the previous
section need to work together to successfully process requests. To better understand how the data
depicted in Fig. 3.3 flows between the components, we summarize the interactions that occur during
the life-cycle of a request:
Order  From               To                 Operation
1      Outside World      API Layer          Sends a GET, POST, PUT or DELETE request
2      API Layer          Read Layer         Sends IDs
3      API Layer          Write Layer        Sends data
4      Write Layer        Replication Layer  Sends data
5      Replication Layer  Read Layer         Replicates the data introduced in flow 4
6      Read Layer         API Layer          Retrieves the value of flow 3
7      Write Layer        API Layer          Retrieves the result of flow 1
8      API Layer          Outside World      Retrieves the result of flow 1
Table 3.1: Data flow
3.4 Data structures
Taking into consideration that scalability and performance are a must and that data structures play an
important role in fulfilling these requirements, in the following sections we describe the data structures
employed.
3.4.1 Databases
Since our system is a hybrid approach to the CQRS [8] pattern, we opted to pick two different databases
according to the workloads they were going to handle, thus leveraging their strengths by routing to each
storage system the workloads that suit it best.
The storage system of the write layer will maintain all the information of the system: in our use case,
the editors' profiles, the tasks and the relation between editors and tasks. This information is needed
to serve the POST, PUT and DELETE requests of the system. Meanwhile, the database of the read layer
only stores the number of available tasks of each editor, which is used to serve the GET requests related
to the available tasks count.
3.4.2 Replicator queue
The replication process is one of the most crucial parts of the overall flow of the system. With this in
mind, we had to be careful when designing the data structure that supports this process. To prevent it
from becoming a bottleneck, and so that we could easily scale when facing huge workloads, we decided to
create a queue that stores the information required to process each task (the elements of the queue). For
each task, we store the following information:
– The name of the function that will process the replication
– The arguments of the function, which contain the data that is going to be stored in the query side
of the system
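As a concrete illustration, a queue element of this kind can be sketched in Python as follows. This is only a sketch: the names ReplicationJob and replicate_available_tasks_count, and the use of a plain in-process deque, are illustrative assumptions rather than the identifiers and queue technology of the actual implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ReplicationJob:
    """One element of the replicator queue."""
    func_name: str  # name of the function that will process the replication
    args: dict      # data that is going to be stored in the query side

# The queue itself: jobs are appended by the write side and consumed
# by the replicator in FIFO order.
replicator_queue: deque = deque()

replicator_queue.append(ReplicationJob(
    func_name="replicate_available_tasks_count",
    args={"editor_id": 44, "counts": {"map_reduce_paid": {"en_es": 14}}},
))
```

Storing the function name alongside its arguments keeps the queue elements self-describing, so independent replicator workers can consume them without shared state.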
3.5 Algorithms
To provide an efficient solution, we rely on three main algorithms. The first algorithm operates similarly
to a load balancer, routing requests according to their characteristics. The second ensures that, as soon as
modifications on the stored data are detected, they are sent to the replication layer through a publish-
subscribe system. Lastly, the third algorithm guarantees that the modifications on the system are
propagated to the read layer.
3.5.1 Routing
The goal of the routing algorithm is to select which of the two storage systems, available in the read and
write layers, should process the request. Basically, if the request demands to insert, update or delete
information from the system, the API accesses the write layer; otherwise, if the request needs to
fetch information, the read layer is accessed. However, if a get request requires consistent
data, that particular request is routed to the write layer.
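The routing rule just described can be condensed into a few lines of Python. This is a hedged sketch: the set of strongly consistent read endpoints shown here is an assumption based on our use case, not an exhaustive configuration.

```python
# Reads that demand strong consistency are routed to the write layer.
# Assumption: in our use case this is the task search endpoint.
STRONGLY_CONSISTENT_READS = {"/api/v1/task/search"}


def route(method: str, path: str) -> str:
    """Return the layer that should process the request."""
    if method in {"POST", "PUT", "DELETE"}:
        return "write"   # inserts, updates and deletions
    if path in STRONGLY_CONSISTENT_READS:
        return "write"   # consistent reads revert to the write layer
    return "read"        # eventually consistent reads
```

For example, a GET on the available tasks count would be served by the read layer, while a GET on the search endpoint, which must observe the latest assignments, goes to the write layer.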
3.5.2 Replication
Every time the system receives requests that perform modifications on the stored data, these modifications
have to be replicated from the write layer to the read layer in order to keep the system as
consistent as possible. For this purpose, we have two algorithms:
• The first relies on a publish-subscribe system, which notifies the replication layer as soon
as modifications on the data stored in the write layer are detected;
• In the second, the replication layer makes use of a scheduler that pulls information from the write
layer in order to keep the read layer consistent. This operation happens every x minutes, where x
is configured a priori.
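The two algorithms can be sketched together as follows. Plain dictionaries stand in for the write-layer and read-layer databases, and all names are illustrative assumptions rather than code from the actual implementation.

```python
import threading


class Replicator:
    """Hedged sketch of the replication layer: a push path fed by
    write-side notifications, plus a periodic pull path as a safety net."""

    def __init__(self, write_db: dict, read_db: dict, interval_s: float = 60.0):
        self.write_db = write_db    # stands in for the write-layer database
        self.read_db = read_db      # stands in for the read-layer database
        self.interval_s = interval_s

    def on_notification(self, editor: str, counts: dict) -> None:
        # Push path: invoked when the write layer publishes a change.
        self.read_db[editor] = counts

    def pull_once(self) -> None:
        # Pull path: copies everything over, catching missed notifications.
        for editor, counts in self.write_db.items():
            self.read_db[editor] = counts

    def run_scheduler(self, stop: threading.Event) -> None:
        # Scheduler loop: pull every `interval_s` seconds until stopped.
        while not stop.wait(self.interval_s):
            self.pull_once()
```

The pull path makes the design self-healing: even if a notification is lost, the read layer converges to the write layer's state within one pull interval.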
Summary
This chapter presented the idea of creating an enhanced global storage system through the usage of
different storage systems that serve requests according to their characteristics. This idea was inspired by
the CQRS [8] pattern. However, in contrast to the original pattern, we differentiate requests not only according
to their type but also their consistency levels. The presented architecture adds the benefit of being able
to independently scale each layer according to the system's needs. In the next chapter we continue
describing the solution by presenting the implementation details.
Chapter 4
Implementation
In Chapter 3, we described our solution at a high level. In this chapter, we take a closer look at the
particularities of each component of the system. We start, in Section 4.1, by explaining the interaction
between components, followed by the nuances of each component in Section 4.2. We end the chapter
with a description of the software architecture in Section 4.3.
4.1 System interoperability
As previously documented, our architecture is composed of four layers. In order to ensure that these
layers work as desired, it is necessary to transfer information from layer to layer. We now
describe the interactions between those layers, including the information that they share.
Starting with the API and the outside world, these two elements communicate through Representational
State Transfer [31] (REST) endpoints, listed in Fig. 4.1.
Every time new tasks arrive, they do so through the /api/v1/task endpoint, an HTTP POST
request that receives, for instance, the language pair of the task. The language pair is used to track the
original language and the desired language of the task's output.
The /api/v1/editor/{editor_id}/available_tasks_count endpoint is an HTTP GET request that, given
an editor id, returns the number of tasks that are available to be translated by that particular editor.
In order to retrieve a task for translation from the system, the /api/v1/task/search endpoint is used.
This endpoint is an HTTP GET request that contains the editor, the language pair and the type of the
task in the body of the request.
Figure 4.1: API endpoints
Finally, the last endpoint, /api/v1/task/submit, is an HTTP POST request that follows a /api/v1/task/search
request and returns to the system the translation of the task previously fetched. This request has in
its body the id of the task, the id of the editor that performed the translation and the translated data.
Regarding the interaction between the API layer and the write layer, the communication is achieved
through calls to the database that composes the write layer. When the API receives a request to store
information in the write layer, the data of that particular request is forwarded to the write layer. As
explained in Section 3.5.1, get requests with tight consistency requirements are also addressed by the
write layer, which is the case of the /api/v1/task/search endpoint.
The communication between the API layer and the read layer happens in the same manner. Every
time a get request that does not require strong consistency arrives, for instance,
when an editor wants to know how many tasks are available, the API layer sends the id of the
editor to the read layer, and the read layer returns the number of available tasks.
Finally, we need to address how the replication layer interacts with both the write and read layers.
To ensure that the data in our solution is as consistent as possible, the write layer needs to replicate
new data to the read layer, which is the purpose of the replication layer. This process is accomplished
through the usage of a publish-subscribe system and a pull mechanism, described in Section 4.2.2 and
Section 4.2.4. These interactions are performed by accessing both layers' databases directly.
4.2 Components
In this section we give more insight on the decision making and implementation of the elements that
constitute our solution.
4.2.1 API Layer
As described in the previous section, the API layer receives requests through REST endpoints. These
requests are processed with the help of uWSGI1, a Web Server Gateway Interface server, and Flask2,
a microframework for Python with RESTful request dispatching capabilities.
In this layer there are four endpoints. Three of them, /api/v1/search, /api/v1/submit and /api/v1/task,
already existed in the previous version of Tarkin, while the fourth is an extension made by us.
Fig. 4.2 represents the fourth endpoint, i.e., the class responsible for processing the
available_tasks_count endpoint described in the previous chapter. Since this is a request that can cope
with eventually consistent results, the API asks the Redis database, integrated in the read layer, for
the number of available tasks for the given editor. The endpoint receives the name of the
editor as an argument and retrieves its corresponding value, as can be seen in line 79.
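The essence of this endpoint can be sketched as follows. This is an illustrative reimplementation, not the class of Fig. 4.2: a plain dictionary stands in for the Redis client of the read layer, and the function name is an assumption.

```python
# A dict stands in for the read-layer Redis database, keyed by editor name.
read_layer = {
    "AS": {"map_reduce_paid": {"en_es": 14, "es_en": 0}},
}


def available_tasks_count(editor_name: str) -> dict:
    """Serve the eventually consistent task counts for an editor straight
    from the read layer, without ever touching the write layer."""
    return read_layer.get(editor_name, {})
```

Because the handler performs a single key lookup against a store optimized for reads, its latency is independent of the write-side load, which is precisely the property measured in Chapter 5.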
The remaining endpoints are processed as follows:
1 https://uwsgi-docs.readthedocs.io/en/latest/
2 http://flask.pocoo.org/
Figure 4.2: Available tasks count endpoint class
• /api/v1/search executes a stored procedure, in the write layer database, that ensures that a
task is only assigned to one editor at a given time, with the help of locks and SQL transactions;
• /api/v1/submit inserts a translated task into the write layer database;
• /api/v1/task inserts new tasks to be translated into the write layer database.
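The guarantee enforced by the search stored procedure (one task assigned to at most one editor at a time) can be mimicked in Python as follows. The actual implementation uses SQL locks and transactions inside PostgreSQL, so this is only an illustrative sketch with assumed names.

```python
import threading

_assignments: dict = {}       # task_id -> editor_id
_lock = threading.Lock()      # plays the role of the SQL row lock


def try_assign(task_id: int, editor_id: int) -> bool:
    """Assign a task to an editor; fail if it is already taken."""
    with _lock:
        if task_id in _assignments:
            return False      # another editor already holds the task
        _assignments[task_id] = editor_id
        return True


def release(task_id: int) -> None:
    """Return a processed task to the queue so it can be edited again."""
    with _lock:
        _assignments.pop(task_id, None)
```

The check and the insertion happen under the same lock, mirroring how the stored procedure performs both inside a single transaction so that two concurrent editors cannot both claim the same task.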
4.2.2 Write Layer
In the write layer we opted to use PostgreSQL [32], mainly because we are extending Tarkin, which is built
on top of PostgreSQL, and also to take advantage of a couple of features that aid the transformation
and replication process described in Section 4.2.4, such as triggers and notifiers.
As previously seen, every time a new request with storage intents enters the system, the API layer
accesses this layer's database and performs the due operation. After a modification is made, we need
to replicate the new data to the read side in order to keep the data as consistent as possible. This
is accomplished through the usage of two different processes: a publish-subscribe system and a pull
mechanism, thoroughly explained in Section 4.2.4.
The publish-subscribe system is constructed on top of the two stored procedures depicted in Fig. 4.3
and Fig. 4.4. Starting with Fig. 4.3, in line 299, we create a trigger that executes the stored procedure
notify_trigger every time an INSERT, UPDATE or DELETE operation is detected on the table editor_task.
This table contains the association of tasks to editors, thus affecting the result of the number of available
tasks per editor, hence the importance of sending this data as soon as possible to the replication
layer. In line 284, the stored procedure iterates through the result of the get_available_tasks_count
function, described below, for the editor associated with the operation that modified the editor_task
table. Finally, in line 289, the flow of the trigger ends with the propagation of each iteration's result
through the publish-subscribe channel new_editor_task_trigger. The results are later processed by the
replicator, which finishes the replication by transforming the received data and forwarding it to the read
layer.
Fig. 4.4 contains the code of the get_available_tasks_count stored procedure, accountable for calculating
the number of available tasks per language pair and task type for the given editor according to its
profile.
Figure 4.3: Stored procedure: notify_trigger
These interactions are performed with the help of SQLAlchemy3, an object-relational mapper
for Python.
4.2.3 Read Layer
To store the information necessary to provide the results when GET requests arrive at the system, we
opted to use Redis [13] due to its high performance when dealing with read requests. As
an example, the data accessed by the available_tasks_count endpoint is stored in the following format:
data = {'map_reduce_paid': {'en_es': 14,
                            'es_en': 0}}
For each task type, it provides the number of available tasks per language pair of a given editor.
4.2.4 Replication Layer
The replication layer is the piece of the puzzle that ensures that the system returns data to the outside
world that is as consistent as possible when facing get requests with more relaxed requirements.
Its main responsibility is to transform the data provided by the write layer into a more suitable format that
3 https://www.sqlalchemy.org/
Figure 4.4: Stored procedure: get_available_tasks_count
will then be replicated to the read layer. For this purpose, it makes use of two processes, as seen in
Section 4.2.2.
The publish-subscribe flow, which began in the write layer, ends with the replicator listening to the
new_editor_task_trigger channel, in line 21 of Fig. 4.5, and transforming and forwarding the data to the
read layer.
The second process used is a pull mechanism. As Fig. 4.7 portrays, every x minutes, where x is
defined in the configuration of the system, the replicator fetches from the write layer the number of
available tasks for each editor, with the help of the function illustrated in Fig. 4.8.
In both processes, the transformation and forwarding of the data is achieved using the function
depicted in Fig. 4.6. In lines 36 to 51, we transform the received data (the number of available tasks per
language pair and task type of one editor) from the following format:
data = [{'editor_id': 44, 'language_pair': 'en_es', 'task_type': 'map_reduce_paid',
         'devices': None, 'editor_name': 'AS', 'count': 14},
        {'editor_id': 44, 'language_pair': 'es_en', 'task_type': 'map_reduce_paid',
         'devices': None, 'editor_name': 'AS', 'count': 0}]
to:
data = {'map_reduce_paid': {'en_es': 14,
                            'es_en': 0}}
After the transformation, in line 53, the data is then forwarded and stored in the read layer.
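The transformation between the two formats above can be sketched as a small pure function. This is an illustrative reimplementation with an assumed name, not the code of Fig. 4.6.

```python
def transform_counts(rows: list) -> dict:
    """Group flat write-layer rows into the nested read-layer format:
    task type -> language pair -> available task count."""
    nested: dict = {}
    for row in rows:
        nested.setdefault(row["task_type"], {})[row["language_pair"]] = row["count"]
    return nested


rows = [
    {"editor_id": 44, "language_pair": "en_es", "task_type": "map_reduce_paid",
     "devices": None, "editor_name": "AS", "count": 14},
    {"editor_id": 44, "language_pair": "es_en", "task_type": "map_reduce_paid",
     "devices": None, "editor_name": "AS", "count": 0},
]
# transform_counts(rows) yields {'map_reduce_paid': {'en_es': 14, 'es_en': 0}}
```

Grouping by task type first matches how the endpoint consumes the data: a single read retrieves all language-pair counts of one task type at once.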
The functions that impact the transformation and replication of the data, with the exception
of the function pg_listen (represented in Fig. 4.5), are asynchronous tasks processed with
the help of Celery4, a distributed task queue, allowing the system to scale more easily.
Figure 4.5: Function: pg_listen
4 http://www.celeryproject.org/
Figure 4.6: Function: replicate_available_tasks_count
Figure 4.7: Function: setup_periodic_tasks
Figure 4.8: Function: get_editor_id
4.3 Software Architecture
The software architecture of our solution is illustrated in the figures below. Fig. 4.9 represents the schema
of the data stored in the read layer. Fig. 4.10 refers to the schema of the data stored in the write layer,
which is also the data structure in use today at Unbabel.
Figure 4.9: Redis database schema
Figure 4.10: PostgreSQL database schema
5 Unbabel's Tarkin ER diagram
Summary
Throughout this chapter we described the implementation details of our architecture. We started by
describing the inter-communication that occurs between components, followed by the nuances of each
component. We finished the chapter by presenting an overall description of the software architecture
of our solution.
The next chapter presents the evaluation results of our solution against the current version of Tarkin.
Chapter 5
Evaluation
In this chapter we present the results of the tests that were executed to evaluate the proposed solution
against the Tarkin system currently in use at Unbabel. We start, in Section 5.1, by specifying
which tests were run, followed by the results in Section 5.2.
5.1 Setup
The evaluation was conducted locally, using Docker1 as the host of the components and JMeter2 as
the tool that executes the load tests. We opted for JMeter due to its wide adoption and its ability to
simulate real-user behavior when testing applications under heavy load and multiple concurrent
users.
In order to have the system up and running, Docker launches the following containers: PostgreSQL,
Redis, Tarkin, RabbitMQ3, Scheduler, Replicator, Replicator Worker and Tarkin Worker.
Each container belongs to one of the four layers of our solution’s architecture. Below we map each
container to the respective layer:
• PostgreSQL → Write Layer
• Redis → Read Layer
• Tarkin → API Layer
• RabbitMQ → Replication Layer
• Scheduler → API Layer
• Replicator → Replication Layer
• Replicator Worker → Replication Layer
• Tarkin Worker → API Layer
1 https://www.docker.com/
2 https://jmeter.apache.org/
3 https://www.rabbitmq.com/
The PostgreSQL, Scheduler and the Tarkin Worker are containers that belong to the original Tarkin
and are used by our solution as well.
The containers run on an Intel i7-6660U CPU @ 2.4 GHz, with 16 GB of 1867 MHz LPDDR3 RAM and a
256 GB SSD, under macOS High Sierra version 10.13.6.
Our goal was to compare both solutions in the most realistic environment possible. Hence, we
populated the PostgreSQL database using a dump of Unbabel's staging environment database.
5.1.1 Workload
Tarkin was chosen as our use case not only because it is one of the most important subsystems at
Unbabel, due to its responsibility of linking machine and human translation, but also because it is
constantly under high load, which makes it a good use case for our solution. With this in mind,
we decided that the best way to test Tarkin was to simulate high load by constantly sending
concurrent requests through JMeter.
5.1.2 Metrics
To evaluate our solution, we decided to use the following metrics:
• Read Layer Latency
• Write Layer Latency
• Throughput
These metrics were selected taking into consideration the characteristics of the use case. Since
Tarkin is a system under constant heavy load, whether from the assignment of tasks to editors or from
the constant editor requests regarding their status, we found it valuable to compare the behavior of
both systems when dealing with this kind of request. In addition, we would like to showcase the benefits
of having requests handled by the storage systems that suit them best.
5.1.3 Description
Each test was performed over one hour. Since our goal was to test our solution under
intensive pressure, each test was carried out using six concurrent users.
For the read layer latency, we executed two tests. The first is meant to compare both systems
regarding the response time of read requests, taking into account that, with our solution, read requests
with weaker consistency requirements are now handled by a storage system whose strengths include
read performance. The second test measures the benefits of segregating requests when dealing
with them simultaneously, which is the case of a real workload.
Regarding the write layer latency, we only conducted one test. Since the requests handled
by the write layer of our solution are the same as in the original system (they are only routed
differently from the read requests that do not require strong consistency), there would be no point in testing
how the storage system performs in terms of write response time itself. Thus, we decided to
test this latency when the system is under pressure from both types of requests.
Finally, the throughput section illustrates the throughput results of the three tests
mentioned above.
5.2 Results
In the previous section we presented the nuances of each test. In this section, we analyze the results of
those tests.
5.2.1 Read Layer Latency
As discussed in Section 5.1.3, the first test refers to how both systems perform when processing get
requests. In order to test this, we put the /api/v1/editor/{editor_id}/available_tasks_count endpoint
under pressure.
Figure 5.1: Available tasks count response time over time
Fig. 5.1 depicts the results of this test. As we can see, our solution, represented by the endpoint
named /api/v1/editor/{editor_id}/available_tasks_count (NEW), outperforms the old version
of this endpoint. While the old system averaged read response times of approximately 39 milliseconds,
our solution was able to keep this value at the 12 ms mark, mostly as a result of being capable of
differentiating requests and selecting the most appropriate storage system to process them.
Moreover, we also achieved an improvement in the 90th, 95th and 99th percentiles, by factors of 3.4,
3.73 and 6.85, respectively.
Table 5.1 summarizes the results of this test.
Endpoint                      # Samples  Average (ms)  Min (ms)  Max (ms)  90th pct (ms)  95th pct (ms)  99th pct (ms)
/available_tasks_count (NEW)  242790     11.95         3         13515     17.00          19.00          24.00
/available_tasks_count (OLD)  197169     39.49         4         14974     58.00          71.00          165.00
Table 5.1: Available tasks count response times summary
The second test presents the results of processing the available_tasks_count endpoint while under
high load pressure from the remaining endpoints. The result of this test is illustrated in Fig. 5.2.
Figure 5.2: Available tasks count response time over time under high load pressure
Table 5.2 details the results of this test.
Endpoint                      # Samples  Average (ms)  Min (ms)  Max (ms)  90th pct (ms)  95th pct (ms)  99th pct (ms)
/available_tasks_count (NEW)  10412      200.71        3         62053     644.70         890.00         1307.09
/available_tasks_count (OLD)  9716       209.09        7         60620     658.30         909.00         1328.49
Table 5.2: Available tasks count under pressure summary
5.2.2 Write Layer Latency
As previously explained, since we are using the same endpoints from the original Tarkin for the write
layer, there would be no value in testing each request individually. So, we performed a test to figure
out whether our solution would improve the write layer when the system is under high load from both
kinds of requests (the ones handled by the read layer and by the write layer).
Naturally, since we are running the tests locally, the results are not fully accurate and may be misleading,
given that every container shares the same resources. Nevertheless, we thought it would be
interesting to see how both systems perform in the same environment.
We start with the /api/v1/task endpoint. Fig. 5.3 details the response times of this request.
Figure 5.3: Create tasks response time over time
Fig. 5.4 illustrates the response times over time of the /api/v1/task/search endpoint. In this
case, the results were very similar. However, on average, our solution managed to outperform the old
Tarkin, as we can see in Table 5.3.
Regarding the last endpoint, /api/v1/task/submit, as Fig. 5.5 clearly shows, most of the
time our solution achieves better response times.
Finally, Table 5.3 summarizes the endpoints, comparing the old version of Tarkin
against our solution.
As we can see throughout this test, our solution is able to slightly improve response times during
load peaks, mainly by virtue of its architectural design. By routing requests through different storage
systems, we are able to offload the ongoing load, which allows each storage system to
work under less pressure and thus produce better results. However, we only managed to outperform the old
version of Tarkin, under high load from all kinds of requests, by a very fine margin, which we
believe correlates with the fact that the tests were deployed and run on the same machine.
5.2.3 Throughput
In this section we compare the results of the throughput for each endpoint of both systems.
Figure 5.4: Get task response time over time
Figure 5.5: Submit task response time over time
Endpoint            # Samples  Average (ms)  Min (ms)  Max (ms)  90th pct (ms)  95th pct (ms)  99th pct (ms)
/task (NEW)         1956       1336.82       14        61475     68.00          103.00         59626.00
/task (OLD)         2000       1318.98       14        60448     59.00          119.00         59553.97
/task/search (NEW)  10406      1349.63       19        62669     2706.30        4920.90        10810.67
/task/search (OLD)  9714       1488.47       18        60622     3095.00        5341.25        11292.10
/task/submit (NEW)  14         677.07        199       1417      1399.50        1417.00        1417.00
/task/submit (OLD)  13         722.92        200       1609      1589.00        1609.00        1609.00
Table 5.3: Write layer response times summary
We start with the available_tasks_count endpoint. Fig. 5.6 illustrates the read throughput of this
endpoint compared to the endpoint of the original Tarkin. As we can quickly see, there is a considerable
improvement in the number of transactions per second that the new system is capable of processing. On
average, our solution can process approximately 67 transactions per second, while the original Tarkin
can only deal with 55 transactions per second.
Figure 5.6: Available tasks count transactions per second
Table 5.4 summarizes the throughput results for this test (the first test of the read layer latency
section).
Endpoints                      # Samples  Throughput (req/s)
/available tasks count (NEW)      242790               67.44
/available tasks count (OLD)      197169               54.77
Table 5.4: Available tasks count throughput summary
Finally, since the remaining tests (the second test of the read layer latency section and the test performed in the write layer latency section) were conducted with the endpoints under high load pressure from each other, we grouped their results in Table 5.5.
Endpoints                      # Samples  Throughput (req/s)
/available tasks count (NEW)       10412                2.89
/available tasks count (OLD)        9716                2.70
/task (NEW)                         1956                0.54
/task (OLD)                         2000                0.55
/task/search (NEW)                 10406                2.89
/task/search (OLD)                  9714                2.70
/task/submit (NEW)                    14                0.01
/task/submit (OLD)                    13                0.00
Table 5.5: Throughput summary under high load pressure
Summary
In order to perform a proper comparison between our solution and the old version of Tarkin, we conducted these experiments in the same environment. The chapter started by detailing the evaluation setup, followed by a description of its results. In summary, we were able to improve the overall read response times while maintaining the same times for the remaining requests.
The next chapter concludes the document by summarizing our work.
Chapter 6
Conclusions
Although there are different solutions to deal with the rapid increase in the amount of data that companies face nowadays, these solutions are sometimes suboptimal for emerging companies that lack resources.
In this document, we started by analyzing current solutions in order to identify and elaborate a solution to our problem. Given the lack of prior work around the CQRS pattern, we decided to explore this idea. We used the knowledge gained from our research, for instance, which storage systems are better suited for certain types of requests, as well as the idea of combining different storage systems, and built a solution inspired by this software pattern.
The results of the evaluation demonstrate that our solution is capable of processing and evolving along with data growth while averaging better response times, mostly due to its ability to differentiate requests and select the best storage system according to each request's characteristics. Furthermore, with this architecture, we are able to offload the system by having different storage systems processing requests instead of pressuring only one.
There is also the opportunity to understand how our system could help the infrastructure achieve more stable CPU and memory utilization; however, these are tests that we were not able to perform, as explained in the previous chapter.
6.1 Future Work
This thesis focused on improving the performance of Tarkin with regard to the throughput and latency of the different requests. However, due to the architecture of our solution, one of its limitations, which was not addressed in Chapter 5, is the possibility of delivering stale data to the user. Therefore, it would be interesting to evaluate the percentage of requests that deliver inconsistent data, as well as the impact on this percentage of configuring different intervals for the pull mechanism of the replication layer.
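The staleness window created by a pull-based replication step can be sketched as follows. This is an illustrative toy, not Tarkin's actual replication code; the store and task names are made up. Any write that lands between two pulls is served stale on the read side until the next pull runs:

```python
import time


def pull_replicate(write_store, read_store, interval_s, rounds):
    """Periodically copy the write-side state into the read store.
    A longer interval_s widens the window in which reads return
    stale data (illustrative sketch)."""
    for _ in range(rounds):
        read_store.clear()
        read_store.update(write_store)  # snapshot the write-side state
        time.sleep(interval_s)


write_side = {"task-1": "pending"}
read_side = {}

pull_replicate(write_side, read_side, interval_s=0, rounds=1)

# A write that arrives after the last pull is not yet visible to readers.
write_side["task-1"] = "completed"
stale = read_side["task-1"]  # still "pending" until the next pull
```

Measuring the fraction of reads that fall inside this window, for several values of the pull interval, is exactly the experiment proposed above.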
Unfortunately, as a result of testing our solution locally, we were also not able to evaluate the potential benefit of segregating the load between the different layers in terms of the CPU and memory utilization inherent to each component, leaving it as future work.
In addition, taking into account the distributed architecture of this thesis, it would be worth exploring
the following ideas:
• Exploit the ability to scale each component independently according to the system's workload;
• Exploit the benefits of using different storage systems in both the read and write layers.