Analysis of the NoSQL databases: BigTable, Neo4j, Voldemort
By: Thomas Tingey
Contents

1. Introduction
2. BigTable
   2.1 Data Model
   2.2 Physical Storage
   2.3 Transactions
   2.4 Scalability
3. Neo4j
   3.1 Data Model
   3.2 Physical Storage
   3.3 Transactions
   3.4 Scalability
4. Voldemort
   4.1 Data Model
   4.2 Physical Storage
   4.3 Transactions
   4.4 Scalability
5. Comparisons
6. Conclusion
References
1. Introduction
NoSQL has grown rapidly over the years as an alternative to tried-and-tested
SQL databases. With this rise in popularity, a large number of NoSQL
databases have been created to support different needs in industry. Each of
the popular NoSQL databases has specific pros, cons, and use cases; often a
new NoSQL database is created because no existing database meets a
particular need. NoSQL databases are also frequently built on top of existing
databases, simply extending the functionality of the base system. Three
NoSQL databases that stand out are BigTable, Neo4j, and Voldemort. These
account for only a small percentage of all the NoSQL databases out there, but
they give a sense of how different such databases can be and how each
fulfills a different need. Between them, the three cover multiple types of data
store as well as every possible pairing of guarantees in the CAP theorem.
2. BigTable
BigTable is an example of a database built for a specific need. Google began
development of BigTable in 2004; the goal was a database that could quickly
return results via random-access reads. There were databases already on the
market that handled similar amounts of data, but none were fast and reliable
enough for Google's needs. Google also wanted its own in-house NoSQL
database for financial reasons: scaling a database it owned required no
licensing contracts and let it grow without paying a third party for licenses.
BigTable is used across the majority of Google's services, such as web
indexing, MapReduce, Google Maps, Google Book Search, My Search History,
Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail. Paul
Newson, a developer advocate at Google, stated that there is not a single
byte of Google's data that hasn't at some point gone through a BigTable
database. Clearly, BigTable has become a crucial element of Google's
services. In May 2015, Google announced that its BigTable database would
be released in the cloud, giving anyone who desired it access to the power of
Google's BigTable database with a few clicks of the mouse.
2.1 Data Model
The data in BigTable is stored in a modified map. Each row in the database
has a row key that allows indexing by a primary key. This allows for quicker
reads, since a given row can be pulled up via its key without searching all of
the data stored in the database. Another intriguing aspect of BigTable is that
the rows are three-dimensional: there is the obvious row-column
architecture, plus a third dimension of time. This allows for versioning within
a cell, as changes to the cell are stacked with their associated timestamps.
Selecting data in BigTable therefore follows the format (row:string,
column:string, time:int64) -> string. Retention rules can be applied to the
timestamps so that everything older than a certain age is dumped, keeping
the database from overflowing with stale data. The data in BigTable can be
described as sparse: there are many columns for every row, and a majority of
those columns can be null. BigTable supports point operations such as read
and write, but it also has the power to do a full scan for MapReduce jobs that
require the entire data set.
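To make this model concrete, here is a minimal Python sketch of the
three-dimensional (row, column, timestamp) -> value map. The class, the
reversed-URL row key, and the column name are illustrative stand-ins, not
Google's code or API:

import time

class ToyBigTableRow:
    """One row of a sparse map: column -> list of (timestamp, value)."""

    def __init__(self):
        self.cells = {}  # newest version kept first in each list

    def write(self, column, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1e6)
        self.cells.setdefault(column, []).insert(0, (ts, value))

    def read(self, column, ts=None):
        """Newest value at or before ts (newest overall if ts is None)."""
        for cell_ts, value in self.cells.get(column, []):
            if ts is None or cell_ts <= ts:
                return value
        return None

# Row keys index the table; real BigTable keeps them lexicographically sorted.
table = {"com.cnn.www": ToyBigTableRow()}
table["com.cnn.www"].write("contents:html", "<html>v1</html>")
table["com.cnn.www"].write("contents:html", "<html>v2</html>")
print(table["com.cnn.www"].read("contents:html"))  # newest version: <html>v2</html>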
Write operations are a two-step process. BigTable first writes the operation
to a tablet log on the proprietary Google File System (GFS); the operation is
then written to a memtable inside the tablet's memory. During a read
operation, the data is pulled from what Google calls an SSTable, stored in
GFS within the tablet, and the pending write operations in the memtable are
applied to produce the final value. SSTables are immutable and cannot be
written to during normal write operations. When the memtable grows too
large, the tablet compacts the data: the compaction writes the memtable's
operations out to the SSTables, effectively emptying the memtable and
freeing up memory. The tablet log is kept as a way to recover recent changes
if the memtable's contents are lost, for example in a crash. BigTable allows
deletions as well, but a deletion does not actually remove data at the time it
is issued. Instead, the deletion is just a marker in the memtable; during a
major compaction of the SSTables, the deletes are applied and the deleted
data is finally removed.
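The write path, read path, and compaction cycle just described can be
sketched in a few lines of Python. Dictionaries stand in for the tablet log, the
memtable, and the SSTable; this is a toy model of the mechanism, not
BigTable's implementation:

TOMBSTONE = object()  # stand-in for the deletion marker

class ToyTablet:
    def __init__(self):
        self.log = []        # stand-in for the tablet log on GFS
        self.memtable = {}   # recent writes held in memory
        self.sstable = {}    # stand-in for the immutable on-disk SSTable

    def write(self, key, value):
        self.log.append((key, value))   # step 1: durable log record
        self.memtable[key] = value      # step 2: in-memory write

    def delete(self, key):
        self.write(key, TOMBSTONE)      # deletes are only markers until compaction

    def read(self, key):
        # the memtable shadows the SSTable, so pending writes win
        value = self.memtable.get(key, self.sstable.get(key))
        return None if value is TOMBSTONE else value

    def compact(self):
        merged = {**self.sstable, **self.memtable}
        # a major compaction is where deleted data is actually removed
        self.sstable = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.memtable.clear()  # memory freed; the log could now be truncated
        self.log.clear()

tablet = ToyTablet()
tablet.write("row1", "hello")
tablet.delete("row1")
print(tablet.read("row1"))  # None: the marker shadows the old value
tablet.compact()            # now the data is gone for real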
BigTable utilizes Bloom filters: compact filters that quickly determine
whether a key might exist. This yields a speed increase because the file
system never has to be touched when the filter says the key does not exist.
Note that BigTable was not written using HBase; rather, HBase is an
open-source implementation of BigTable's design, and Google's cloud
offering is accessed through an HBase-compatible API. This eases adoption,
as those who already know HBase can switch to BigTable with very little
effort.
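A Bloom filter itself fits in a few lines of Python. The salted-SHA-256 hashing
below is one arbitrary choice among many; what matters is the guarantee
that "no" answers are always correct, so a negative lets BigTable skip the
file system entirely:

import hashlib

class ToyBloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # derive several bit positions from salted hashes of the key
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # false negatives are impossible; false positives are merely rare
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("com.cnn.www")
print(bf.might_contain("com.cnn.www"))   # True
print(bf.might_contain("missing-key"))   # almost certainly False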
2.2 Physical Storage
As stated in section 2.1, BigTable is a map with an indexable row key. To
facilitate quick reads, the data is sorted lexicographically by key and then
broken into tablets. These tablets are partitions of adjacent keys and are
stored separately on server nodes; the underlying data store is Google's
proprietary file system (GFS). A tablet can be split at any time if it becomes
too large. Within a tablet, the data is further divided into SSTables, as
mentioned in section 2.1. To facilitate this partitioning, BigTable has a root
tablet that stores METADATA describing where the other METADATA tablets
are stored; those METADATA tablets in turn store the locations of the tablets
where a searched key resides. BigTable has also made some innovations in
allowing users to set an affinity for SSD or HDD storage based on the "heat"
of a tablet: data that is accessed often can be stored on SSDs for quicker
read/write operations, whereas data accessed less often can be stored on
cheaper HDDs. Finally, each tablet is in charge of its own read and write
operations. Tablets function as standalone entities, allowing better load
balancing: a master assigns key ranges to tablets, but once the data is on a
tablet, that tablet takes over and handles it. Because of this design, a single
cluster can hold hundreds of petabytes of data, serve tens of millions of
operations per second, and stream hundreds of gigabytes of data per second.
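Because row keys are sorted and each tablet owns a contiguous key range,
finding the tablet for a key reduces to a binary search over the tablets' end
keys. A sketch follows; the end keys and server names are hypothetical, and
the real lookup walks the root and METADATA tablets described above:

import bisect

tablet_end_keys = ["g", "n", "t", "\xff"]              # four tablets spanning the key space
tablet_servers = ["srv-a", "srv-b", "srv-c", "srv-d"]  # hypothetical tablet servers

def locate_tablet(row_key):
    """Return the server for the tablet whose key range covers row_key."""
    idx = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_servers[idx]

print(locate_tablet("com.cnn.www"))  # srv-a: keys up to "g"
print(locate_tablet("org.example"))  # srv-c: keys in ("n", "t"]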
2.3 Transactions
By the nature of its structure, BigTable does not support atomic transactions
across rows, but it does support atomic transactions on a single row. In the
CAP theorem, BigTable provides consistency and partition tolerance (CP);
availability takes a hit to accomplish this. BigTable is very partition tolerant
and boasts recovery in under one second after a failure. This is because
BigTable supports replication and handles it well. BigTable also stores
operations in a log on the file system, so losses of in-memory state are
recovered from quickly by replaying the log back into the memtable. If a
tablet fails completely, a redundant tablet simply takes its place. The
availability sacrifice shows up during replication: reads cannot proceed until
all copies are updated.
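The single-row guarantee can be modeled with a per-row lock: a
read-modify-write on one row is atomic, while nothing coordinates two rows.
This is a conceptual sketch, not BigTable's internal mechanism:

import threading

class ToyRowStore:
    def __init__(self):
        self.rows = {}                     # row key -> dict of column values
        self.locks = {}                    # row key -> that row's lock
        self._registry = threading.Lock()

    def _lock_for(self, row_key):
        with self._registry:
            return self.locks.setdefault(row_key, threading.Lock())

    def atomic_update(self, row_key, update_fn):
        """Apply update_fn to one row atomically; multi-row updates get no such help."""
        with self._lock_for(row_key):
            row = self.rows.setdefault(row_key, {})
            update_fn(row)

store = ToyRowStore()
store.atomic_update("user:42", lambda row: row.update(visits=row.get("visits", 0) + 1))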
2.4 Scalability
One of the main reasons BigTable came to be was the need for a highly
scalable NoSQL database, and Google has succeeded: with its server
architecture, the database can scale across hundreds of nodes. BigTable is
also designed for quick, efficient replication, which lets it reduce latency by
replicating data onto geographically closer nodes.
3. Neo4j
Neo4j was implemented because its founder was frustrated with the
limitations of the relational database model he was using. Neo4j is
considered the first graph database ever built. The founder began working
on the first version in 2002, and by 2003 Neo4j was running in 24x7
production use. After four years of development, the founder started a
Swedish company behind Neo4j and open-sourced the project. Finally, eight
years later, in 2010, Neo4j version 1.0 was released to the public. Now
thousands of organizations use Neo4j; most notably, both eBay and Walmart
rely on Neo4j for their NoSQL needs.
3.1 Data Model
Neo4j is a graph database. It is unique in that it breaks away from the
standard row-column structure and utilizes a graph architecture. The idea
behind the design is that data is represented by nodes, while the edges of
the graph define the relationships between those nodes. A silly example of
this architecture from the Neo4j website:

[Figure: Matrix-themed example graph from the Neo4j website, with
characters such as Morpheus, Trinity, Neo, and The Architect as nodes and
directed, property-bearing edges for their relationships.]

As the figure shows, both the nodes and the edges carry properties, stored
as key-value pairs. The edges represent relationships and must be
directional; non-directional relationships throw errors and are not allowed in
Neo4j. Neo4j uses the Cypher Query Language (CQL) for its read and write
operations and offers CRUD operations similar to SQL's, though the
conventions differ slightly: Neo4j, for example, calls SELECT statements
MATCH statements. That is not the purpose of Neo4j, though; a simple query
like "give me all nodes with age < 30" would be terribly slow in Neo4j, and
that kind of data retrieval is better done in an RDBMS. Where Neo4j really
shines, on the other hand, is retrieving relationships between data. Figuring
out the path from (using the figure above) Morpheus to The Architect is
extremely efficient in Neo4j, and that is the true reason Neo4j was designed
and implemented. The graph functions like a linked list: every node knows
what it is connected to and what is connected to it. This linked structure
makes Neo4j extremely fast, and unlike an RDBMS, its queries are not
affected by the size of the database. The query specifies the first node, and
from there it is just a matter of pointer chasing.
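A short Python sketch shows why the cost tracks path length rather than
database size: the traversal only follows direct references outward from the
start node, so adding unrelated nodes changes nothing. The adjacency data
below is invented for illustration, loosely following the Matrix example:

from collections import deque

graph = {
    "Morpheus": ["Neo", "Trinity"],
    "Trinity": ["Neo"],
    "Neo": ["The Architect"],
    "The Architect": [],
}

def shortest_path(start, goal):
    """Breadth-first pointer chasing from the start node."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("Morpheus", "The Architect"))
# ['Morpheus', 'Neo', 'The Architect']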
3.2 Physical Storage
Neo4j stores its data in a variety of store files on each server node, with
separate stores for nodes, relationships, labels, and properties. The main
reason for this breakup is so that traversals through the graph are extremely
efficient: Neo4j only has to parse the data it needs instead of parsing all of
the information every time it traverses the graph. Neo4j also uses fixed
record sizes and pointers to keep speeds up. Neo4j can be scaled
horizontally, but each server must contain the entire database, so clusters
help more with load balancing and redundancy than with true distribution.
This is discussed in more detail in section 3.4.
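The payoff of fixed record sizes is that locating a record is arithmetic rather
than an index lookup. A sketch, with an invented record width that is not
Neo4j's actual on-disk layout:

NODE_RECORD_SIZE = 15  # hypothetical fixed width, in bytes

def node_record_offset(node_id: int) -> int:
    """Byte offset of a node's record in the node store file: O(1) math."""
    return node_id * NODE_RECORD_SIZE

# Reading node 1,000,000 means seeking straight to one computed byte
# offset, no matter how many nodes the store holds.
print(node_record_offset(1_000_000))  # 15000000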
3.3 Transactions
Neo4j is somewhat unique in the NoSQL world in that it fully supports
transactions, which behave much like SQL transactions. The book Graph
Databases explains them as follows:
The transaction implementation in Neo4j is fairly straightforward. Each
transaction is represented as an in-memory object whose state represents writes
to the database. This object is supported by a lock manager, which applies write
locks to nodes and relationships as they are created, updated, and deleted. On a
transaction rollback, the transaction object is discarded and the write locks
released, otherwise on successful completion the transaction is committed to
disk. Committing data to disk in Neo4j uses a Write Ahead Log, whereby
changes are appended as actionable entries in the active transaction log. On
transaction commit a commit entry will be written to the log. This causes the log
to be flushed to disk, thereby making the changes durable. Once the disk flush
has occurred, the changes are applied to the graph itself. After all the changes
have been applied to the graph, any write locks associated with the transaction
are released.
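The write-ahead-log protocol in that quote can be sketched as follows. The
class below is a toy model of the ordering only (append entries, write a
commit entry, flush, then touch the store), not Neo4j's implementation:

class ToyWriteAheadLog:
    def __init__(self):
        self.log = []    # stand-in for the on-disk transaction log
        self.store = {}  # stand-in for the graph store itself

    def commit(self, tx_writes):
        for entry in tx_writes:           # append actionable entries to the log
            self.log.append(("write", entry))
        self.log.append(("commit",))      # the commit entry...
        self.flush()                      # ...forces the log to disk, making it durable
        for key, value in tx_writes:      # only now apply the changes to the store
            self.store[key] = value

    def flush(self):
        pass  # a real implementation would fsync the log file here

wal = ToyWriteAheadLog()
wal.commit([("node:1", {"name": "Morpheus"})])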
3.4 Scalability
Neo4j does scale across multiple servers in a cluster, but with the limitation
that it does not shard: in a multi-server architecture, each server holds a
complete copy of the entire database. The reason for this design choice is
that if nodes were placed on other servers, graph traversal could cross the
network, which would be slow. Take the worst case of a path with N edges
and put every alternate node on a different server; the traversal would then
have to cross the network on nearly every one of its N hops, which would be
extremely slow. For this reason, Neo4j puts the entire graph on each server
and simply replicates the data. In the CAP theorem, Neo4j favors availability
and consistency over partition tolerance.
4. Voldemort
Voldemort was originally built by LinkedIn and released in 2009. The main
reason for its construction was a need to handle large volumes of writes.
LinkedIn has a feature on its site called "Who's viewed my profile," and this
feature often writes data to the database more often than data is read from
it. The databases LinkedIn was using just didn't cut it. The team originally
wanted to use Google's BigTable (discussed in section 2 of this paper), but
without the proprietary Google File System, BigTable just didn't make sense.
So they set out to create a store with low-latency, high-availability access
for their data set. Voldemort is built in Java and is in use today across most
of LinkedIn's website.
4.1 Data Model
Voldemort is a key-value store; it can also be described as a large unordered
map or distributed hash table. To ensure high performance, Voldemort
supports only very simple key-value operations. The keys and values
themselves can be quite complex, supporting serializable objects, arrays,
and a whole variety of other things, but the only operations allowed are get,
put, and delete. Voldemort in no way supports joins or complex queries. This
compromise allows Voldemort to be blazing fast and to focus on writing data
rather than reading it.
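The entire query surface just described fits in a sketch: get, put, delete, and
nothing else. The class and key format below are illustrative, not
Voldemort's actual API:

class ToyKeyValueStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKeyValueStore()
store.put("member:123:profile-views", {"week": 42, "count": 17})
print(store.get("member:123:profile-views"))
# Any join or cross-key filter has to happen in application code instead.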
4.2 Physical Storage
Voldemort uses a technique known as consistent hashing, which has multiple
advantages for speed and redundancy. The first main advantage is that the
nodes in the cluster form a sort of ad-hoc network that does not rely on any
master relay node to direct traffic: any request can arrive at any node and
be directed onward from there. This is accomplished by forming what is
known as a hash ring.

[Figure: a hash ring, drawn as a circle with servers A, B, C, and D placed at
hashed positions around it.]

The hash ring works by giving each server an arbitrary hash that determines
its position on the ring, which in turn decides where a given key will be
written and read. To locate key K, the key is hashed to a position a = K mod
S, where S is the number of partitions in the ring. The request enters the
ring at that position and works clockwise until it reaches the first server at
or past it, and that server handles the read/write operation. The servers are
placed in a random pattern around the ring to help distribute the data
without overburdening any one node in the network. If no live server is
found at the expected position because of failures, the walk simply continues
around the circle and the first server it encounters is used. In this way, the
data can be distributed across a large number of servers without relying on
a routing node. As far as file systems go, Voldemort is quite customizable
and allows several different storage engines to be plugged in based on the
needs of the system.
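A consistent hash ring can be sketched in Python: servers sit at hashed
positions on a circle, and a key walks clockwise (wrapping around) to the
first servers it meets. The hash choice and replica count here are arbitrary
illustrations:

import bisect
import hashlib

RING_SIZE = 2**32

def ring_position(name):
    """Map a server name or key to a position on the ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

class ToyHashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_position(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.ring]

    def servers_for(self, key, replicas=2):
        """First `replicas` distinct servers clockwise from the key's position."""
        start = bisect.bisect(self.positions, ring_position(key))
        owners = []
        for step in range(len(self.ring)):
            _, server = self.ring[(start + step) % len(self.ring)]  # wraps around
            if server not in owners:
                owners.append(server)
            if len(owners) == replicas:
                break
        return owners

ring = ToyHashRing(["A", "B", "C", "D"])
print(ring.servers_for("member:123"))  # two owners; which two depends on the hashes

Storing each key on multiple owners is what lets the ring absorb a failed
node: the request simply continues clockwise to the next live server.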
4.3 Transactions
Nope. That's really all that needs to be said about transactions: Voldemort is
not designed for them, and its queries are very simple. Voldemort falls in the
AP category of the CAP theorem. It does offer eventual consistency, but it
was really designed to be a highly available and scalable database, and as
such it sacrifices consistency in a very big way. LinkedIn didn't need the
data to be consistent; they needed it to be fast. Consistency is configurable
in Voldemort's setup and can be tuned upward, but that consistency comes
at a cost to availability: the more consistent Voldemort is required to be, the
greater the loss of availability.
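The usual way to express this tunable trade-off is an N/R/W quorum: with N
replicas, requiring R successful reads and W successful writes gives strong
consistency exactly when R + W > N. The sketch below illustrates the rule;
the parameter names follow the general quorum convention rather than a
specific Voldemort configuration file:

def is_strongly_consistent(n_replicas, required_reads, required_writes):
    """R + W > N guarantees every read quorum overlaps every write quorum."""
    return required_reads + required_writes > n_replicas

print(is_strongly_consistent(3, 1, 1))  # False: fast and available, eventually consistent
print(is_strongly_consistent(3, 2, 2))  # True: consistent, but less available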
4.4 Scalability
Unlike transactions, scalability is exactly what Voldemort was designed for.
As discussed in section 4.2, the hash ring allows Voldemort to be very
scalable and to support large node clusters holding large amounts of data.
Also thanks to the hash ring, Voldemort handles failures in the background:
unless there is catastrophic damage to the cluster, failures go unnoticed and
are handled silently, since the ring is designed so that data is always stored
in multiple places.
5. Comparisons
The biggest mistake an aspiring database administrator can make is
assuming that one size fits all for databases. It is easy to get wrapped up in
what Amazon or Facebook is doing and think that because Facebook uses
something, it must be the best possible option. The three databases
reviewed here span the CAP theorem. BigTable is an excellent database
designed by Google, and now that it is offered in the cloud, it may well be
the easiest of the three to set up. BigTable offers low latency and reliable
partitioning, but what is gained in consistency and partition tolerance is lost
in availability. BigTable does have some ways of adding availability, but they
sacrifice partition tolerance or consistency to do so. BigTable is a great
database for analytical workloads, crunching through data to reach
conclusions; with tweaks to availability it can even be used on live data
streams, though possibly with slower load times. Neo4j is an excellent graph
database. It is unique in this list for its focus on node relationships above all
else, and, as another unique factor, it is a fully transactional database, which
is rare in the NoSQL world. There is simply no better way to model a graph
than with a graph database. Sitting in the CA sphere of the CAP theorem, it
is an excellent database for live applications: its high availability and quick
traversal allow relationships to be found at unparalleled speeds. It does,
however, suffer from a lack of partition tolerance. As mentioned in section
3.4, the database cannot be split into shards, so all data has to live on each
node of the cluster, which can mean costlier servers. Neo4j is also rather
limited when retrieving anything other than relationships: it can run
complicated queries, but it computes them far more slowly than the simplest
SQL-based databases could. Neo4j is an excellent database for
recommendations, such as suggesting friends based on connections, which
is why it is used by some major Fortune 500 companies. Finally, there is
Voldemort. From the looks of it, Voldemort was designed for a very specific
use case, and it falls in the AP sphere of the CAP theorem. Its scalability and
partition tolerance are impressive, and its restrictive rules on data retrieval
and storage allow it to retain its speed by forcing joins and complex data
manipulation into application code. Voldemort excels at handling server
failures: with the hash ring in place, failures go unnoticed by the end user,
and Voldemort recovers without pause when nodes in the ring are knocked
offline. Voldemort's strengths are also its weaknesses. The strong emphasis
on availability sacrifices consistency. Still, this database works well for
systems that do not need consistent data but do need to handle more writes
than reads, and that are really focused on writing data to the table.
6. Conclusion
Each database mentioned here was born of necessity: no database on the
market did exactly what the company or individual needed, so they created
their own to fit their specific needs. This is a model to follow whenever a
database is needed for a project. There simply is not a "one size fits all"
database. Before a database is chosen, its strengths and limitations should
be studied and understood; that is what allows the best possible database to
be selected.
References
"Example BigTable." YouTube. YouTube. Web. 21 Oct. 2015.
"Project Voldemort: Scaling Simple Storage At LinkedIn." Official LinkedIn Blog Project
Voldemort Scaling Simple Storage AtLinkedIn Comments. Web. 21 Oct. 2015.
Robinson, Ian, and James Webber. Graph Databases. Sebastopol, Calif.: O'Reilly Media,
2013. Print.
"Scalability Meetup @ Whitepages - Google Cloud BigTable." YouTube. YouTube. Web. 21
Oct. 2015.
"Voldemort." Voldemort. Web. 21 Oct. 2015.