Analysis of the NoSQL databases: BigTable, Neo4j, Voldemort
By: Thomas Tingey
Contents

1. Introduction
2. BigTable
   2.1 Data Model
   2.2 Physical Storage
   2.3 Transactions
   2.4 Scalability
3. Neo4j
   3.1 Data Model
   3.2 Physical Storage
   3.3 Transactions
   3.4 Scalability
4. Voldemort
   4.1 Data Model
   4.2 Physical Storage
   4.3 Transactions
   4.4 Scalability
5. Comparisons
6. Conclusion
References
1. Introduction
NoSQL has grown rapidly over the years as an alternative to tried-and-tested
SQL databases. With this rise in popularity, a large number of NoSQL
databases have been created to support different needs in industry. Each of
the popular NoSQL databases has specific pros, cons, and use cases; often a
new NoSQL database is created because no existing database meets a
particular need. NoSQL databases are also frequently built on top of existing
databases, simply extending the functionality of the base system. Three
NoSQL databases that stand out are BigTable, Neo4j, and Voldemort. These
account for only a small percentage of all the NoSQL databases out there, but
they give a sense of how different such databases can be and how each
fulfills a different need. Between them, the three cover multiple types of data
store as well as every possible pairing of guarantees in the CAP theorem.
2. BigTable
BigTable is an example of a database built for a specific need. Google began
development of BigTable in 2004; the goal was a database that could quickly
return results via random-access reads. There were databases already on the
market that handled similar amounts of data, but none were fast and reliable
enough for Google's needs. Google also wanted its own in-house NoSQL
database for financial reasons: scaling a database it owned required no
licensing contracts and let it grow without paying a third party for licenses.
BigTable is used across the majority of Google's services, such as web
indexing, MapReduce, Google Maps, Google Book Search, My Search History,
Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail. Paul
Newson, a developer advocate at Google, stated that there is not a single
byte of Google's data that hasn't at some point gone through a BigTable
database. Clearly, BigTable has become a crucial element of Google's
services. In May 2015, Google announced that its BigTable database would
be released in the cloud, giving anyone who desired it access to the power of
Google's BigTable database with a few clicks of the mouse.
2.1 Data Model
The data in BigTable is stored in a modified map. Each row in the database
has a row key that allows indexing by a primary key. This allows for quicker
reads, since a given row can be pulled up via its key without searching all of
the data stored in the database. Another intriguing aspect of BigTable is that
the rows are three-dimensional: there is the obvious row-column
architecture, plus a third dimension of time. This allows for versioning within
a cell, as changes to the cell are stacked with their associated timestamps.
Selecting data in BigTable therefore follows the format (row:string,
column:string, time:int64) -> string. Retention rules can be applied to the
timestamps so that everything older than a certain age is dumped, keeping
the database from overflowing with stale data. The data in BigTable can be
described as sparse: there are many columns for every row, and a majority of
those columns can be null. BigTable supports point operations such as read
and write, but it also has the power to do a full scan for MapReduce jobs that
require the entire data set.
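To make this model concrete, here is a minimal Python sketch of the
three-dimensional (row, column, timestamp) -> value map. The class, the
reversed-URL row key, and the column name are illustrative stand-ins, not
Google's code or API:

import time

class ToyBigTableRow:
    """One row of a sparse map: column -> list of (timestamp, value)."""

    def __init__(self):
        self.cells = {}  # newest version kept first in each list

    def write(self, column, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1e6)
        self.cells.setdefault(column, []).insert(0, (ts, value))

    def read(self, column, ts=None):
        """Newest value at or before ts (newest overall if ts is None)."""
        for cell_ts, value in self.cells.get(column, []):
            if ts is None or cell_ts <= ts:
                return value
        return None

# Row keys index the table; real BigTable keeps them lexicographically sorted.
table = {"com.cnn.www": ToyBigTableRow()}
table["com.cnn.www"].write("contents:html", "<html>v1</html>")
table["com.cnn.www"].write("contents:html", "<html>v2</html>")
print(table["com.cnn.www"].read("contents:html"))  # newest version: <html>v2</html>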
Write operations are a two-step process. BigTable first writes the operation
to a tablet log on the proprietary Google File System (GFS); the operation is
then written to a memtable inside the tablet's memory. During a read
operation, the data is pulled from what Google calls an SSTable, stored in
GFS within the tablet, and the pending write operations in the memtable are
applied to produce the final value. SSTables are immutable and cannot be
written to during normal write operations. When the memtable grows too
large, the tablet compacts the data: the compaction writes the memtable's
operations out to the SSTables, effectively emptying the memtable and
freeing up memory. The tablet log is kept as a way to recover recent changes
if the memtable's contents are lost, for example in a crash. BigTable allows
deletions as well, but a deletion does not actually remove data at the time it
is issued. Instead, the deletion is just a marker in the memtable; during a
major compaction of the SSTables, the deletes are applied and the deleted
data is finally removed.
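The write path, read path, and compaction cycle just described can be
sketched in a few lines of Python. Dictionaries stand in for the tablet log, the
memtable, and the SSTable; this is a toy model of the mechanism, not
BigTable's implementation:

TOMBSTONE = object()  # stand-in for the deletion marker

class ToyTablet:
    def __init__(self):
        self.log = []        # stand-in for the tablet log on GFS
        self.memtable = {}   # recent writes held in memory
        self.sstable = {}    # stand-in for the immutable on-disk SSTable

    def write(self, key, value):
        self.log.append((key, value))   # step 1: durable log record
        self.memtable[key] = value      # step 2: in-memory write

    def delete(self, key):
        self.write(key, TOMBSTONE)      # deletes are only markers until compaction

    def read(self, key):
        # the memtable shadows the SSTable, so pending writes win
        value = self.memtable.get(key, self.sstable.get(key))
        return None if value is TOMBSTONE else value

    def compact(self):
        merged = {**self.sstable, **self.memtable}
        # a major compaction is where deleted data is actually removed
        self.sstable = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.memtable.clear()  # memory freed; the log could now be truncated
        self.log.clear()

tablet = ToyTablet()
tablet.write("row1", "hello")
tablet.delete("row1")
print(tablet.read("row1"))  # None: the marker shadows the old value
tablet.compact()            # now the data is gone for real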
BigTable utilizes Bloom filters: compact filters that quickly determine
whether a key might exist. This yields a speed increase because the file
system never has to be touched when the filter says the key does not exist.
Note that BigTable was not written using HBase; rather, HBase is an
open-source implementation of BigTable's design, and Google's cloud
offering is accessed through an HBase-compatible API. This eases adoption,
as those who already know HBase can switch to BigTable with very little
effort.
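A Bloom filter itself fits in a few lines of Python. The salted-SHA-256 hashing
below is one arbitrary choice among many; what matters is the guarantee
that "no" answers are always correct, so a negative lets BigTable skip the
file system entirely:

import hashlib

class ToyBloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # derive several bit positions from salted hashes of the key
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # false negatives are impossible; false positives are merely rare
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("com.cnn.www")
print(bf.might_contain("com.cnn.www"))   # True
print(bf.might_contain("missing-key"))   # almost certainly False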
2.2 Physical Storage
As stated in section 2.1, BigTable is a map with an indexable row key. To
facilitate quick reads, the data is sorted lexicographically by key and then
broken into tablets. These tablets are partitions of adjacent keys and are
stored separately on server nodes; the underlying data store is Google's
proprietary file system (GFS). A tablet can be split at any time if it becomes
too large. Within a tablet, the data is further divided into SSTables, as
mentioned in section 2.1. To facilitate this partitioning, BigTable has a root
tablet that stores METADATA describing where the other METADATA tablets
are stored; those METADATA tablets in turn store the locations of the tablets
where a searched key resides. BigTable has also made some innovations in
allowing users to set an affinity for SSD or HDD storage based on the "heat"
of a tablet: data that is accessed often can be stored on SSDs for quicker
read/write operations, whereas data accessed less often can be stored on
cheaper HDDs. Finally, each tablet is in charge of its own read and write
operations. Tablets function as standalone entities, allowing better load
balancing: a master assigns key ranges to tablets, but once the data is on a
tablet, that tablet takes over and handles it. Because of this design, a single
cluster can hold hundreds of petabytes of data, serve tens of millions of
operations per second, and stream hundreds of gigabytes of data per second.
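Because row keys are sorted and each tablet owns a contiguous key range,
finding the tablet for a key reduces to a binary search over the tablets' end
keys. A sketch follows; the end keys and server names are hypothetical, and
the real lookup walks the root and METADATA tablets described above:

import bisect

tablet_end_keys = ["g", "n", "t", "\xff"]              # four tablets spanning the key space
tablet_servers = ["srv-a", "srv-b", "srv-c", "srv-d"]  # hypothetical tablet servers

def locate_tablet(row_key):
    """Return the server for the tablet whose key range covers row_key."""
    idx = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_servers[idx]

print(locate_tablet("com.cnn.www"))  # srv-a: keys up to "g"
print(locate_tablet("org.example"))  # srv-c: keys in ("n", "t"]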
2.3 Transactions
By the nature of its structure, BigTable does not support atomic transactions
across rows, but it does support atomic transactions on a single row. In the
CAP theorem, BigTable provides consistency and partition tolerance (CP);
availability takes a hit to accomplish this. BigTable is very partition tolerant
and boasts recovery in under one second after a failure. This is because
BigTable supports replication and handles it well. BigTable also stores
operations in a log on the file system, so losses of in-memory state are
recovered from quickly by replaying the log back into the memtable. If a
tablet fails completely, a redundant tablet simply takes its place. The
availability sacrifice shows up during replication: reads cannot proceed until
all copies are updated.
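The single-row guarantee can be modeled with a per-row lock: a
read-modify-write on one row is atomic, while nothing coordinates two rows.
This is a conceptual sketch, not BigTable's internal mechanism:

import threading

class ToyRowStore:
    def __init__(self):
        self.rows = {}                     # row key -> dict of column values
        self.locks = {}                    # row key -> that row's lock
        self._registry = threading.Lock()

    def _lock_for(self, row_key):
        with self._registry:
            return self.locks.setdefault(row_key, threading.Lock())

    def atomic_update(self, row_key, update_fn):
        """Apply update_fn to one row atomically; multi-row updates get no such help."""
        with self._lock_for(row_key):
            row = self.rows.setdefault(row_key, {})
            update_fn(row)

store = ToyRowStore()
store.atomic_update("user:42", lambda row: row.update(visits=row.get("visits", 0) + 1))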
2.4 Scalability
One of the main reasons BigTable came to be was the need for a highly
scalable NoSQL database, and Google has succeeded: with its server
architecture, the database can scale across hundreds of nodes. BigTable is
also designed for quick, efficient replication, which lets it reduce latency by
replicating data onto geographically closer nodes.
3. Neo4j
Neo4j was implemented because its founder was frustrated with the
limitations of the relational database model he was using. Neo4j is
considered the first graph database ever built. The founder began working
on the first version in 2002, and by 2003 Neo4j was running in 24x7
production use. After four years of development, the founder started a
Swedish company behind Neo4j and open-sourced the project. Finally, eight
years later, in 2010, Neo4j version 1.0 was released to the public. Now
thousands of organizations use Neo4j; most notably, both eBay and Walmart
rely on Neo4j for their NoSQL needs.
3.1 Data Model
Neo4j is a graph database. It is unique in that it breaks away from the
standard row-column structure and utilizes a graph architecture. The idea
behind the design is that data is represented by nodes, while the edges of
the graph define the relationships between those nodes. A silly example of
this architecture from the Neo4j website:

[Figure: Matrix-themed example graph from the Neo4j website, with
characters such as Morpheus, Trinity, Neo, and The Architect as nodes and
directed, property-bearing edges for their relationships.]

As the figure shows, both the nodes and the edges carry properties, stored
as key-value pairs. The edges represent relationships and must be
directional; non-directional relationships throw errors and are not allowed in
Neo4j. Neo4j uses the Cypher Query Language (CQL) for its read and write
operations and offers CRUD operations similar to SQL's, though the
conventions differ slightly: Neo4j, for example, calls SELECT statements
MATCH statements. That is not the purpose of Neo4j, though; a simple query
like "give me all nodes with age < 30" would be terribly slow in Neo4j, and
that kind of data retrieval is better done in an RDBMS. Where Neo4j really
shines, on the other hand, is retrieving relationships between data. Figuring
out the path from (using the figure above) Morpheus to The Architect is
extremely efficient in Neo4j, and that is the true reason Neo4j was designed
and implemented. The graph functions like a linked list: every node knows
what it is connected to and what is connected to it. This linked structure
makes Neo4j extremely fast, and unlike an RDBMS, its queries are not
affected by the size of the database. The query specifies the first node, and
from there it is just a matter of pointer chasing.
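A short Python sketch shows why the cost tracks path length rather than
database size: the traversal only follows direct references outward from the
start node, so adding unrelated nodes changes nothing. The adjacency data
below is invented for illustration, loosely following the Matrix example:

from collections import deque

graph = {
    "Morpheus": ["Neo", "Trinity"],
    "Trinity": ["Neo"],
    "Neo": ["The Architect"],
    "The Architect": [],
}

def shortest_path(start, goal):
    """Breadth-first pointer chasing from the start node."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("Morpheus", "The Architect"))
# ['Morpheus', 'Neo', 'The Architect']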
3.2 Physical Storage
Neo4j stores its data in a variety of store files on each server node, with
separate stores for nodes, relationships, labels, and properties. The main
reason for this breakup is so that traversals through the graph are extremely
efficient: Neo4j only has to parse the data it needs instead of parsing all of
the information every time it traverses the graph. Neo4j also uses fixed
record sizes and pointers to keep speeds up. Neo4j can be scaled
horizontally, but each server must contain the entire database, so clusters
help more with load balancing and redundancy than with true distribution.
This is discussed in more detail in section 3.4.
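The payoff of fixed record sizes is that locating a record is arithmetic rather
than an index lookup. A sketch, with an invented record width that is not
Neo4j's actual on-disk layout:

NODE_RECORD_SIZE = 15  # hypothetical fixed width, in bytes

def node_record_offset(node_id: int) -> int:
    """Byte offset of a node's record in the node store file: O(1) math."""
    return node_id * NODE_RECORD_SIZE

# Reading node 1,000,000 means seeking straight to one computed byte
# offset, no matter how many nodes the store holds.
print(node_record_offset(1_000_000))  # 15000000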
3.3 Transactions
Neo4j is somewhat unique in the NoSQL world in that it fully supports
transactions, which behave much like SQL transactions. The book Graph
Databases explains them as follows:
The transaction implementation in Neo4j is fairly straightforward. Each
transaction is represented as an in-memory object whose state represents writes
to the database. This object is supported by a lock manager, which applies write
locks to nodes and relationships as they are created, updated, and deleted. On a
transaction rollback, the transaction object is discarded and the write locks
released, otherwise on successful completion the transaction is committed to
disk. Committing data to disk in Neo4j uses a Write Ahead Log, whereby
changes are appended as actionable entries in the active transaction log. On
transaction commit a commit entry will be written to the log. This causes the log
to be flushed to disk, thereby making the changes durable. Once the disk flush
has occurred, the changes are applied to the graph itself. After all the changes
have been applied to the graph, any write locks associated with the transaction
are released.
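The write-ahead-log protocol in that quote can be sketched as follows. The
class below is a toy model of the ordering only (append entries, write a
commit entry, flush, then touch the store), not Neo4j's implementation:

class ToyWriteAheadLog:
    def __init__(self):
        self.log = []    # stand-in for the on-disk transaction log
        self.store = {}  # stand-in for the graph store itself

    def commit(self, tx_writes):
        for entry in tx_writes:           # append actionable entries to the log
            self.log.append(("write", entry))
        self.log.append(("commit",))      # the commit entry...
        self.flush()                      # ...forces the log to disk, making it durable
        for key, value in tx_writes:      # only now apply the changes to the store
            self.store[key] = value

    def flush(self):
        pass  # a real implementation would fsync the log file here

wal = ToyWriteAheadLog()
wal.commit([("node:1", {"name": "Morpheus"})])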
3.4 Scalability
Neo4j does scale across multiple servers in a cluster, but with the limitation
that it does not shard: in a multi-server architecture, each server holds a
complete copy of the entire database. The reason for this design choice is
that if nodes were placed on other servers, graph traversal could cross the
network, which would be slow. Take the worst case of a path with N edges
and put every alternate node on a different server; the traversal would then
have to cross the network on nearly every one of its N hops, which would be
extremely slow. For this reason, Neo4j puts the entire graph on each server
and simply replicates the data. In the CAP theorem, Neo4j favors availability
and consistency over partition tolerance.
4. Voldemort
Voldemort was originally built by LinkedIn and released in 2009. The main
reason for its construction was a need to handle large volumes of writes.
LinkedIn has a feature on its site called "Who's viewed my profile," and this
feature often writes data to the database more often than data is read from
it. The databases LinkedIn was using just didn't cut it. The team originally
wanted to use Google's BigTable (discussed in section 2 of this paper), but
without the proprietary Google File System, BigTable just didn't make sense.
So they set out to create a store with low-latency, high-availability access
for their data set. Voldemort is built in Java and is in use today across most
of LinkedIn's website.
4.1 Data Model
Voldemort is a key-value store; it can also be described as a large unordered
map or distributed hash table. To ensure high performance, Voldemort
supports only very simple key-value operations. The keys and values
themselves can be quite complex, supporting serializable objects, arrays,
and a whole variety of other things, but the only operations allowed are get,
put, and delete. Voldemort in no way supports joins or complex queries. This
compromise allows Voldemort to be blazing fast and to focus on writing data
rather than reading it.
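The entire query surface just described fits in a sketch: get, put, delete, and
nothing else. The class and key format below are illustrative, not
Voldemort's actual API:

class ToyKeyValueStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKeyValueStore()
store.put("member:123:profile-views", {"week": 42, "count": 17})
print(store.get("member:123:profile-views"))
# Any join or cross-key filter has to happen in application code instead.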
4.2 Physical Storage
Voldemort uses a technique known as consistent hashing, which has multiple
advantages for speed and redundancy. The first main advantage is that the
nodes in the cluster form a sort of ad-hoc network that does not rely on any
master relay node to direct traffic: any request can arrive at any node and
be directed onward from there. This is accomplished by forming what is
known as a hash ring.

[Figure: a hash ring, drawn as a circle with servers A, B, C, and D placed at
hashed positions around it.]

The hash ring works by giving each server an arbitrary hash that determines
its position on the ring, which in turn decides where a given key will be
written and read. To locate key K, the key is hashed to a position a = K mod
S, where S is the number of partitions in the ring. The request enters the
ring at that position and works clockwise until it reaches the first server at
or past it, and that server handles the read/write operation. The servers are
placed in a random pattern around the ring to help distribute the data
without overburdening any one node in the network. If no live server is
found at the expected position because of failures, the walk simply continues
around the circle and the first server it encounters is used. In this way, the
data can be distributed across a large number of servers without relying on
a routing node. As far as file systems go, Voldemort is quite customizable
and allows several different storage engines to be plugged in based on the
needs of the system.
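A consistent hash ring can be sketched in Python: servers sit at hashed
positions on a circle, and a key walks clockwise (wrapping around) to the
first servers it meets. The hash choice and replica count here are arbitrary
illustrations:

import bisect
import hashlib

RING_SIZE = 2**32

def ring_position(name):
    """Map a server name or key to a position on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

class ToyHashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_position(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.ring]

    def servers_for(self, key, replicas=2):
        """First `replicas` distinct servers clockwise from the key's position."""
        start = bisect.bisect(self.positions, ring_position(key))
        owners = []
        for step in range(len(self.ring)):
            _, server = self.ring[(start + step) % len(self.ring)]  # wraps around
            if server not in owners:
                owners.append(server)
            if len(owners) == replicas:
                break
        return owners

ring = ToyHashRing(["A", "B", "C", "D"])
print(ring.servers_for("member:123"))  # two owners; which two depends on the hashes

Storing each key on multiple owners is what lets the ring absorb a failed
node: the request simply continues clockwise to the next live server.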
4.3 Transactions
Nope. That's really all that needs to be said about transactions: Voldemort is
not designed for them, and its queries are very simple. Voldemort falls in the
AP category of the CAP theorem. It does offer eventual consistency, but it
was really designed to be a highly available and scalable database, and as
such it sacrifices consistency in a very big way. LinkedIn didn't need the
data to be consistent; they needed it to be fast. Consistency is configurable
in Voldemort's setup and can be tuned upward, but that consistency comes
at a cost to availability: the more consistent Voldemort is required to be, the
greater the loss of availability.
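The usual way to express this tunable trade-off is an N/R/W quorum: with N
replicas, requiring R successful reads and W successful writes gives strong
consistency exactly when R + W > N. The sketch below illustrates the rule;
the parameter names follow the general quorum convention rather than a
specific Voldemort configuration file:

def is_strongly_consistent(n_replicas, required_reads, required_writes):
    """R + W > N guarantees every read quorum overlaps every write quorum."""
    return required_reads + required_writes > n_replicas

print(is_strongly_consistent(3, 1, 1))  # False: fast and available, eventually consistent
print(is_strongly_consistent(3, 2, 2))  # True: consistent, but less available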
4.4 Scalability
Unlike transactions, scalability is exactly what Voldemort was designed for.
As discussed in section 4.2, the hash ring allows Voldemort to be very
scalable and to support large node clusters holding large amounts of data.
Also thanks to the hash ring, Voldemort handles failures in the background:
unless there is catastrophic damage to the cluster, failures go unnoticed and
are handled silently, since the ring is designed so that data is always stored
in multiple places.
5. Comparisons
The biggest mistake an aspiring database administrator can make is
assuming that one size fits all for databases. It is easy to get wrapped up in
what Amazon or Facebook is doing and think that because Facebook uses
something, it must be the best possible option. The three databases
reviewed here span the CAP theorem. BigTable is an excellent database
designed by Google, and now that it is offered in the cloud, it may well be
the easiest of the three to set up. BigTable offers low latency and reliable
partitioning, but what is gained in consistency and partition tolerance is lost
in availability. BigTable does have some ways of adding availability, but they
sacrifice partition tolerance or consistency to do so. BigTable is a great
database for analytical workloads, crunching through data to reach
conclusions; with tweaks to availability it can even be used on live data
streams, though possibly with slower load times. Neo4j is an excellent graph
database. It is unique in this list for its focus on node relationships above all
else, and, as another unique factor, it is a fully transactional database, which
is rare in the NoSQL world. There is simply no better way to model a graph
than with a graph database. Sitting in the CA sphere of the CAP theorem, it
is an excellent database for live applications: its high availability and quick
traversal allow relationships to be found at unparalleled speeds. It does,
however, suffer from a lack of partition tolerance. As mentioned in section
3.4, the database cannot be split into shards, so all data has to live on each
node of the cluster, which can mean costlier servers. Neo4j is also rather
limited when retrieving anything other than relationships: it can run
complicated queries, but it computes them far more slowly than the simplest
SQL-based databases could. Neo4j is an excellent database for
recommendations, such as suggesting friends based on connections, which
is why it is used by some major Fortune 500 companies. Finally, there is
Voldemort. From the looks of it, Voldemort was designed for a very specific
use case, and it falls in the AP sphere of the CAP theorem. Its scalability and
partition tolerance are impressive, and its restrictive rules on data retrieval
and storage allow it to retain its speed by forcing joins and complex data
manipulation into application code. Voldemort excels at handling server
failures: with the hash ring in place, failures go unnoticed by the end user,
and Voldemort recovers without pause when nodes in the ring are knocked
offline. Voldemort's strengths are also its weaknesses. The strong emphasis
on availability sacrifices consistency. Still, this database works well for
systems that do not need consistent data but do need to handle more writes
than reads, and that are really focused on writing data to the table.
6. Conclusion
Each database mentioned here was born of necessity: no database on the
market did exactly what the company or individual needed, so they created
their own to fit their specific needs. This is a model to follow whenever a
database is needed for a project. There simply is not a "one size fits all"
database. Before a database is chosen, its strengths and limitations should
be studied and understood; that is what allows the best possible database to
be selected.
References
"Example BigTable." YouTube. YouTube. Web. 21 Oct. 2015.
"Project Voldemort: Scaling Simple Storage At LinkedIn." Official LinkedIn Blog Project
Voldemort Scaling Simple Storage AtLinkedIn Comments. Web. 21 Oct. 2015.
Robinson, Ian, and James Webber. Graph Databases. Sebastopol, Calif.: O'Reilly Media,
2013. Print.
"Scalability Meetup @ Whitepages - Google Cloud BigTable." YouTube. YouTube. Web. 21
Oct. 2015.
"Voldemort." Voldemort. Web. 21 Oct. 2015.