databases and how to choose them


Databases and how to choose them - January 2017

Index

1 Key concepts

2 Database types

3 Use cases

4 Best and bad practices


Key concepts


ACID vs BASE

● ACID:

● Atomicity. A transaction is a group of tasks that must be performed against a database as a single unit. If one element of a transaction fails, the entire transaction fails.

● Consistency. This is usually defined as the property guaranteeing that a transaction brings the database from one valid state (in a formal sense, not a functional one) to another. In ACID, consistency only implies compliance with the defined rules, such as constraints, triggers, etc.

● Isolation. Each transaction must be independent of the others, meaning that it should not “see” the effects of other concurrent operations.

● Durability. Once a transaction is complete, its effects will survive system failure, power loss and other types of system breakdowns.
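As an illustration of atomicity and consistency, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the amounts are hypothetical and only serve the example; the CHECK constraint plays the role of a "defined rule".

```python
import sqlite3

# Hypothetical accounts table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This statement violates the CHECK constraint when funds are short.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling transfer(conn, "alice", "bob", 500) fails on the second statement, so the credit already applied to bob is rolled back as well and both balances stay as they were.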

● BASE:

● Basically Available. The system guarantees availability of the data in a limited sense: there will be a response to any request, although it could be inconsistent data or even an error.

● Soft state. While the system converges from eventual to actual consistency, its state can change over time, even when no input operation is being performed on the database. Thus, the state of the system is called “soft”.

● Eventual consistency. Once the system stops receiving input and the data has been propagated to every node, it will eventually become consistent.
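The three BASE properties can be seen in a toy simulation. This is only a sketch under simplifying assumptions: the replicas live in one process, propagation is a queue drained by hand, and the class and method names are invented for the example.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes land on one replica and reach the others later."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = deque()  # (target replica index, key, value)

    def write(self, key, value, at=0):
        # Apply locally, then queue the update for every other replica.
        self.replicas[at].data[key] = value
        for i in range(len(self.replicas)):
            if i != at:
                self.pending.append((i, key, value))

    def read(self, key, at=0):
        # Always answers (basically available), possibly with stale data.
        return self.replicas[at].data.get(key)

    def propagate_one(self):
        """Deliver one queued update: state changes without new input (soft state)."""
        if self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i].data[key] = value

    def propagate_all(self):
        """Drain the queue; afterwards all replicas agree (eventual consistency)."""
        while self.pending:
            self.propagate_one()
```

Right after a write on replica 0, a read on replica 2 still returns the old value; once propagate_all() has run, every replica returns the new one.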


CAP THEOREM

● CAP:

● Consistency. C in CAP actually means “linearizability”, which is a very specific and strong notion of consistency that has nothing to do with the C in ACID (it has more to do with Atomicity and Isolation, indeed). A typical way to define it is: “if operation B started after operation A successfully completed, B must see the system in the same state as it was on completion of operation A, or a newer state”. Thus, a system is consistent if every update appears to be applied to all nodes at the same time.

Example: once a client has written name=Alice, any later read of name (“name?”) must return Alice.

http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf


CAP THEOREM

● CAP:

● Availability. A in CAP is defined as “every request received by a non-failing database node must result in a non-error response”. This is both a strong and a weak requirement: 100% of the requests must return a response, but the response can take an unbounded (but finite) amount of time. Since people tend to care more about latency, a very slow response usually makes a system “not available” in the eyes of its users.


CAP THEOREM

● CAP:

● Partition Tolerance. P in CAP means… well, it is not clear. Some definitions of the concept state that the system keeps on working even if some nodes, or the connection between two of them, fail. This kind of definition is what leads people to apply the CAP theorem to monolithic, single-node relational databases (they would qualify as CA). However, a multi-node system that does not require partition tolerance would have to run on a network that never drops messages and whose nodes cannot fail. Since no such system exists, P in CAP cannot be given up by choice.




Isolation

● Isolation.

In database systems, isolation determines how transaction integrity is visible to other users and systems. Though the term is often used loosely, this ACID property of a DBMS (Database Management System) is an important part of any transactional system. It specifies when and how the changes made by one operation become visible to other parallel operations.

Acquiring locks on data is the usual way to achieve a good isolation level, so the more locks a running transaction takes, the higher the isolation level. On the other hand, locks have an impact on performance.


Isolation

● Isolation levels.

ISOLATION LEVELS

READ UNCOMMITTED

READ COMMITTED

REPEATABLE READS

SERIALIZABLE

CONCURRENCY PHENOMENA

DIRTY READS

UNREPEATABLE READS

PHANTOM READS
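The relation between the two lists can be written down as a small lookup. The cells of the original table were not preserved here, so the mapping below is the one from the ANSI SQL standard, stated as an assumption; real engines may be stricter than the standard requires. The helper name is made up for the example.

```python
# Phenomena that each isolation level still permits, per the ANSI SQL standard.
# Levels are listed from weakest to strongest.
PERMITTED = {
    "READ UNCOMMITTED": {"DIRTY READ", "UNREPEATABLE READ", "PHANTOM READ"},
    "READ COMMITTED": {"UNREPEATABLE READ", "PHANTOM READ"},
    "REPEATABLE READ": {"PHANTOM READ"},
    "SERIALIZABLE": set(),
}


def weakest_level_preventing(phenomena):
    """Return the weakest level that permits none of the given phenomena."""
    unwanted = set(phenomena)
    for level, permitted in PERMITTED.items():
        if not (permitted & unwanted):
            return level
```

For instance, an application that only needs to avoid dirty reads can settle for READ COMMITTED, while avoiding phantom reads requires SERIALIZABLE under this mapping.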


Indexes

● Indexes

A database index is a data structure that speeds up searches on a database table, at the cost of slower writes and extra storage space, both needed to maintain the index itself. Indexes are used to quickly locate data without having to search every row in a database table.

https://en.wikipedia.org/wiki/Data_structure
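The trade-off can be sketched in a few lines. This is a toy hash-style index over an in-memory list, with hypothetical rows and names; real databases typically use B-trees and persist the index on disk.

```python
from collections import defaultdict

# Hypothetical table: a list of rows.
users = [
    {"id": 1, "name": "Alice", "age": 23},
    {"id": 2, "name": "Bob", "age": 42},
    {"id": 3, "name": "Charles", "age": 34},
]


def find_by_age_scan(rows, age):
    """Without an index: inspect every row."""
    return [row for row in rows if row["age"] == age]


class AgeIndex:
    """A secondary index: age -> positions of the matching rows."""

    def __init__(self, rows):
        self.rows = rows
        self.positions = defaultdict(list)
        for pos, row in enumerate(rows):
            self.positions[row["age"]].append(pos)

    def insert(self, row):
        # Every write now has to update the index as well as the table.
        self.rows.append(row)
        self.positions[row["age"]].append(len(self.rows) - 1)

    def find(self, age):
        # Jump straight to the matching rows instead of scanning all of them.
        return [self.rows[pos] for pos in self.positions.get(age, [])]
```

Both lookups return the same rows; the indexed one avoids the scan, but each insert does extra work and the index takes extra memory.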


Indexes

● Inverted indexes

An inverted index is a data structure that maps content to its locations in a database file (in contrast to a Forward Index, which maps from documents to content). The purpose of an inverted index is to allow fast full-text searches, at the cost of more processing and intensive use of resources when documents are added.

Example documents:

1: “This document can be stored in ElasticSearch.”

2: “ElasticSearch is a document oriented database”

https://en.wikipedia.org/wiki/Forward_Index
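A minimal sketch of an inverted index built over the two example documents. The tokenizer (lowercasing and splitting on non-word characters) is an assumption made for the example; real search engines apply far richer analysis and relevance scoring.

```python
import re
from collections import defaultdict

documents = {
    1: "This document can be stored in ElasticSearch.",
    2: "ElasticSearch is a document oriented database",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Here the term "elasticsearch" points to documents 1 and 2, while "database" points only to document 2, so a search never needs to read the documents themselves.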


Sharding

● Sharding

Shards are partitions of data within a database. Since each partition is smaller than the whole database, a query using the shard key (the field that determines the partition) avoids a full scan, which can improve search performance dramatically. On the other hand, sharding implies a strong dependency on the network, with higher latency when querying several shards, as well as consistency concerns when data is replicated among several shards (as it should be, for high-availability needs). It also introduces additional complexity in design (the partition key must be carefully chosen) and development (load balancing, replication, failover, etc.).
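The routing idea behind a shard key can be sketched as follows. This is a single-process toy with hash-based partitioning and hypothetical names; it leaves out replication, rebalancing and the network entirely.

```python
import hashlib


class ShardedTable:
    """Toy table split into shards by hashing the shard key."""

    def __init__(self, n_shards, shard_key):
        self.shard_key = shard_key
        self.shards = [[] for _ in range(n_shards)]

    def _shard_for(self, key_value):
        # Stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(str(key_value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def insert(self, row):
        self.shards[self._shard_for(row[self.shard_key])].append(row)

    def find_by_key(self, key_value):
        """Query on the shard key: only one shard is scanned."""
        shard = self.shards[self._shard_for(key_value)]
        return [row for row in shard if row[self.shard_key] == key_value]

    def find_by_field(self, field, value):
        """Query on any other field: every shard has to be visited."""
        return [
            row
            for shard in self.shards
            for row in shard
            if row.get(field) == value
        ]
```

A lookup by the shard key touches a single partition, whereas a lookup by any other field fans out to all of them, which is why the key has to match the dominant access pattern.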


Database types



Database types

● Database types

As a first approximation, we consider the following kinds of databases:

● Relational

● Key-value, column-oriented

● Document-oriented

● Graph

We deliberately exclude the popular pure key-value type, both because its main products are too simplistic for several production use cases and because some of its features overlap with those of the types listed above.


Database types

● Relational columnar storage.

The concept of relational databases is widely known and involves some of the topics already covered in this document, especially ACID. Recently, RDBMSs have also covered the need for schema-less data, so their strengths are consistency under heavy read and write loads and the widespread knowledge of both their design and their query language.

Columnar storage can be seen as a transposition of the common row storage, meaning that the rows (1, Alice, Adams, 23), (2, Bob, Brown, 42) and (3, Charles, Cooper, 34) are stored as:

1, 2, 3; Alice, Bob, Charles; Adams, Brown, Cooper; 23, 42, 34

Columnar models are very useful for some use cases. A common example is selecting a single field, or calculating an average. Instead of going through every row and accessing the field age, a columnar model allows accessing exactly the area where age is stored.

These models are still relational (thus, ACID), and they suit use cases that need very good read performance up to a certain data volume (say, under one terabyte).
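The transposition can be sketched with the same three records. The field names are taken from the examples in this deck; the in-memory tuples only stand in for how data is laid out on disk.

```python
# Row storage: one tuple per record.
rows = [
    (1, "Alice", "Adams", 23),
    (2, "Bob", "Brown", 42),
    (3, "Charles", "Cooper", 34),
]
fields = ("id", "name", "surname", "age")

# Columnar storage: transpose the rows into one contiguous sequence per field.
columns = dict(zip(fields, zip(*rows)))


def average_age_row_store(rows):
    """Row layout: every record is visited just to pick out one field."""
    age_pos = fields.index("age")
    return sum(row[age_pos] for row in rows) / len(rows)


def average_age_column_store(columns):
    """Columnar layout: only the age column is read."""
    ages = columns["age"]
    return sum(ages) / len(ages)
```

Both functions return the same average; the difference is how much data has to be touched to compute it.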


Database types

● Column-oriented databases.

● Columnar storage in relational databases is commonly confused with column-oriented databases, such as Cassandra.

● Column-oriented databases store data in column families, as rows that have many columns associated with a row key. Column families are groups of related data that are often accessed together.

● Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns (and this is where the key-value concept appears).

● These databases depend strongly on their design, since they are meant to be accessed by key. Secondary indexes are allowed, but they do not perform well enough for operational needs.

users

1 “Name”: “Alice” “Surname”: “Adams” “age”: “23”

2 “Name”: “Bob” “Surname”: “Brown” “age”: “42”

3 “Name”: “Charles” “Surname”: “Cooper” “age”: “34”
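The users column family above can be modelled as a map from row key to a map of columns. This is only an in-memory sketch with made-up helper names, not Cassandra's API.

```python
# Column family "users": row key -> {column name: value}.
users = {
    1: {"Name": "Alice", "Surname": "Adams", "age": "23"},
    2: {"Name": "Bob", "Surname": "Brown", "age": "42"},
    3: {"Name": "Charles", "Surname": "Cooper", "age": "34"},
}


def get_row(column_family, row_key):
    """Access by row key: a direct lookup."""
    return column_family.get(row_key)


def get_column(column_family, row_key, column):
    """Read a single column of a single row."""
    return column_family.get(row_key, {}).get(column)


def find_by_column(column_family, column, value):
    """Access by anything other than the row key: a full scan in this sketch."""
    return [
        key for key, cols in column_family.items() if cols.get(column) == value
    ]
```

Reading by row key is a single lookup, while searching by a column value has to walk every row, which mirrors why the data model must be designed around the key.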


Database types

● Document-oriented databases.

● Just as the name suggests, document-oriented databases store documents, typically in JSON format. They are a particular kind of key-value store, with the nuance that the values have an internal structure that the engine uses to query the data.

● The data model looks similar to the relational one, except that no schema or relational constraints are required.

● The main difference between the two worlds lies in the ACID vs BASE distinction, which translates into horizontal scaling capabilities.

● Thus, these systems can offer good performance when operating on several terabytes of data.

● ElasticSearch is an unusual example of a document-oriented database. It is very well suited to Full Text Search, and its capabilities (making use of the aforementioned inverted indexes) allow it to answer searches that were not defined in advance within the response times that operational use cases require.


Database types

● Graph databases.

● Graph databases use the mathematical concept of a graph to store data. Graphs consist of nodes, edges and properties, which are used to query for the desired information.

● The main advantage of these systems is their high performance in certain use cases that would involve a lot of SQL joins, since those cases are about following relations between nodes.

● Write performance (and read performance without joins) is lower than that offered by other systems, so these databases are quite polarized regarding the use case.
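Following relations between nodes can be sketched as a breadth-first traversal over an adjacency map. The graph, the names and the "knows" relation are hypothetical; in a relational model, each extra hop would typically be one more self-join.

```python
from collections import deque

# Hypothetical graph: node -> set of neighbours ("knows" edges).
edges = {
    "Alice": {"Bob"},
    "Bob": {"Alice", "Charles"},
    "Charles": {"Bob", "Dana"},
    "Dana": {"Charles"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # Do not expand beyond the hop limit.
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance
```

Asking for everything within two hops of Alice returns Bob at distance 1 and Charles at distance 2, the kind of "friend of a friend" query behind a simple recommendation.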


Database types

● High-level comparison.

● Relational (row-based)

○ Basic description: Data structured in rows

○ Strengths: ACID, good performance, low complexity

○ Weaknesses: Scalability

○ Typical use cases: Online operational with ACID needs

○ Key players: PostgreSQL

● Relational (columnar)

○ Basic description: Data stored in columns

○ Strengths: ACID, good read performance

○ Weaknesses: Scalability, counter-intuitive

○ Typical use cases: Read-only without scaling out

○ Key players: PostgreSQL

● Document-oriented

○ Basic description: Data stored in (potentially) unstructured documents

○ Strengths: Scalability, good read performance

○ Weaknesses: Consistency, complexity

○ Typical use cases: Heavy reads with a high volume of records

○ Key players: ElasticSearch, MongoDB

● Key-value column-oriented

○ Basic description: Data structured as key-value maps

○ Strengths: Scalability, good write performance

○ Weaknesses: Strong design dependency, use-case polarization

○ Typical use cases: Heavy writes with a high volume of records and reads by key

○ Key players: Cassandra

● Graph

○ Basic description: Data structured as nodes and edges (graphs) with relations

○ Strengths: ACID, good read performance

○ Weaknesses: Scalability, complexity, counter-intuitive

○ Typical use cases: SQL joins (relations)

○ Key players: Neo4J


Database types

● Radar graph.


Use cases


Use cases

● CRUD over an entity

● For typical CRUD operations (and, maybe, listing) over a certain entity, in a RESTful way, the very first option should be an RDBMS. They provide:

○ good write and read performance

○ (typically) lots of features

○ (typically) the advantage of SQL modeling and the SQL language, which makes them straightforward to use.

● Note that CRUD over an entity usually implies accessing data by a unique key, which would be the entity id. Accessing one or several (listing) entities by other fields would require creating indexes.

● Both scenarios fit well in an RDBMS as long as the fields used in WHERE clauses are known in advance, but the possibility of scaling out has to be considered. If the volume of data may grow too much, a document-oriented database could be the logical alternative.

● In particular, MongoDB covers essentially the same use cases as PostgreSQL, with the former being the choice when volume is (or could be) high, and the latter when ACID capabilities are more important.


Use cases

● FTS or searching by any field

● Searching by any field requires creating lots of indexes, given the way PostgreSQL or MongoDB handle them.

● Using ElasticSearch instead would be much more effective. The same logic applies to Full Text Search (FTS), with the inverted indexes of Elastic being the solution.

● The intensive use of resources made by ElasticSearch makes it a poor fit for other use cases, like the aforementioned CRUD over an entity or much more specific accesses (by id or by known fields).


Use cases

● High-volume loads

● Cassandra is the system that provides the best write performance and scalability of those covered here.

● A typical use case could be a log system, provided it is only accessed by date or by component name.

● If there is a high volume of online writes but data cannot be accessed by a unique field, we can choose among the other products, taking the previous considerations into account.

● It is important to know that reindexing operations on the database have a big impact on performance. If the indexes cannot be switched off while writing (as in a typical online operational system), MongoDB and PostgreSQL could be worse options than ElasticSearch.

● On the other hand, in scenarios with heavy writes and heavy reads, consistency becomes relevant, so PostgreSQL may have the edge.


Use cases

● Relations

● Fraud detection and recommendation engines are typical cases in which a lot of SQL joins are needed, since they are all about relating several entities of the same type by a variety of fields, and maybe entities of a different type as well.

● In a graph, that amounts to following a path among several nodes, so a graph database handles it more efficiently by design.

● Scalability or consistency could be concerns in those cases.


Use cases

● Analytics

● Analytics use cases usually involve:

○ a huge volume of data

○ much more relaxed processing-time requirements

○ a much lower level of concurrency.

● For those cases, jobs accessing a DFS (distributed file system) can be enough.



● Best practices:

● Choose the right database for each use case.

● Creating a new “materialized view” is better than fighting with problems. There is no silver bullet.

● Avoid BLOBs

● Schemas are good: they keep order and are intuitive.

● Mind the CAP



● Bad practices:

● Over…

○ indexing

○ normalization

○ provisioning of resources

● Relational mindset

● Split brain

● Fashion victim


Questions


Thanks!
