cs-580k/480k advanced topics in cloud computing › ~huilu › slides580ksp20 › nosql.pdf · x is...

CS-580K/480K Advanced Topics in Cloud Computing

Chapter V

15. NoSQL Database

1

Where are we?Cloud Applications

New databases technologies (e.g., Key-value store, and Object store)

New Programming Models (e.g., severless, microservices, Hadoop, and Spark)…

Cloud Platforms

Storage Virtualization

Virtualization Layer

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

VM1 VM2 VM3

Network Virtualization


Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

VM1 VM2 VM3


Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

Operating System

APP1

APP2

APP3

APP4

VM1 VM2 VM3

A simple example – web-based applications

3

Storage

SQL Databases

▪ A relational database is a collection of data items organized as a set of formally-described tables – tables and their relations.

▪ Structured Query Language is the standard means of manipulating and querying data in relational databases.

Tables and their relations

Scalability Problem

4

Storage

Stateful SQL Databases

▪As data volume keeps increasing, it is cheaper and more reasonable to scale out horizontally, by adding smaller, less expensive servers rather than investing in a single large server.

Scale horizontally

Adding a cache layer

5

Front-end1

Middle-Tier

Storage

Front-end2

Middle-Tier

Front-end3

Middle-TierStateless

Stateless

Stateful SQL Databases

Cache Cache CacheStateless

As datasets grows, the simple memcache model starts to become problematic.

Scaling RDBMS – Master/Slave

• Master-Slave▪ All writes are written to the master.

▪ All reads performed against the replicated slave databases

▪ Critical reads may be incorrect as writes may not have been propagated down

▪ Large data sets can pose problems as master needs to duplicate data to slaves

6

Master

Slave Slave Slave

Scaling RDBMS - Partitioning

▪ “Sharding”▪ Divide data amongst many cheap databases

▪ Manage parallel access in the application

▪ Scales well for both reads and writes

▪ Not transparent, application needs to be partition-aware

7

Other ways to scale RDBMS

▪ Multi-Master replication

▪ INSERT only, not UPDATES/DELETES

▪ In-memory databases

8

In Summary

▪ Traditional relational databases do not scale well, when dataset grows

▪ Adding a cache layer scales well for reads, but not for writes

▪ Master/slave also scales well for reads, but not for writes

▪ Sharding scales well for both reads and writes, but not transparent

9

What is NoSQL

• Stands for No-SQL or Not Only SQL??• Class of non-relational data storage systems

• E.g. BigTable, and Dynamo

• Do not require a fixed table schema nor do they use the concept of joins (no relationships)• Distributed data storage systems

• All NoSQL offerings relax one or more of the ACID properties (i.e., CAP theorem)

10

Why NoSQL?

▪ Relational databases offer a very good general purpose solution to many different data storage needs.

▪ But it cannot fit all

▪ Just as there are different programming languages, need to have other data storage tools in the toolbox

11

Why NoSQL?

▪ Explosion of social media sites (Facebook, Twitter) with large data needs

▪ Explosion of storage needs in large web sites such as Google, Yahoo

▪ Much of the data is not files

▪ Shift to dynamically-typed data with frequent schema changes

12

Dynamo and BigTable

▪Three major papers were the seeds of the NoSQL movement▪BigTable (Google)▪Dynamo (Amazon)▪CAP Theorem (counterpart of ACID)

13

CAP Theorem (I)

• Three properties of a distributed system• Consistency

• All copies have same value

• Availability• Every request received by a non-failing node the system

must result in a response

• Partition tolerance • Network can break into two or more independent parts (due

to separate optimization and/or failures)• Partition tolerance means that even after the network is

partitioned into multiple sub-systems, it still works correctly.14

CAP Theorem (II)

• Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system

• Proof:https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

• Very large systems will partition at some point• Choose one of consistency or availability• Traditional database choose consistency• Most Web applications choose availability

• Except for specific parts such as order processing

15

https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

Availability

▪ For a large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes.

▪Availability refers to the percentage of time that the infrastructure, system or a solution remains operational under normal circumstances in order to serve its intended purpose.▪Percentage of availability = (total elapsed time – sum of

downtime)/total elapsed time

▪Traditionally, thought of as the server/process available five 9’s (99.999 %).▪The yearly service downtime could be as much as 5.256

minutes.16

Availability

17

Consistency Model

▪ A consistency model determines rules for visibility and order of updates.

▪ For example:▪ X is replicated on nodes M and N

▪ Client A writes X to node N

▪ Some period of time t elapses.

▪ Client B reads X from node M

▪ Does client B see the write from client A?

▪ For NoSQL, the answer would be:

▪ Yes, if the NoSQL adopts a strict consistency model

18

M N

Client AClient B

X X

Strict Consistency

• All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to.

• It implies nodes employ some kind of distributed transaction protocol to ensure all data copies have the same value

• CAP Theorem: Strict Consistency can’t be achieved at the same time as availability and partition-tolerance.

19

Eventual Consistency

▪ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent

▪ The type of large systems built based on CAP are known as BASE (Basically Available, Soft state, Eventual consistency)

▪ Who builds large-scale distributed systems based on CAP?▪ Google, Yahoo, Facebook, Amazon, eBay, etc…

20

Types of NoSQL (1)

▪Key/Value data model▪Use of a hash table▪Store data as a key/value pair ▪Access data (values) by strings

called keys ▪Value has no require format▪Basics operations include

insert(key, value), fetch(key),update(key), delete(key)

▪E.g., Amazon S3 (Dynamo) DeCandia, Giuseppe, et al. "Dynamo: amazon's highly

available key-value store." ACM SIGOPS operating

systems review 41.6 (2007): 205-220.

Key/Value

▪Pros:▪ Simple model

▪ Very fast

▪ Very scalable

▪ Able to scale horizontally

▪Cons: ▪ Many data structures (objects) can't be easily modeled as key

value pairs

Types of NoSQL (2)

▪Other schema-less data models which come in multiple flavors such as column-based, document-based or graph-based.▪Cassandra (column-based)▪MongoDB (document-based)▪Neo4J (graph-based)▪Redis (key/value-based)

23

Document-based

▪Can model more complex objects

▪Data model: collection of documents

▪Document: JSON ▪ JavaScript Object Notation is a data model, which supports

objects, records, structs, list, array, maps, dates, with nesting.

24

Column family data model

▪ Column based databases use a concept called a keyspace.

▪ The keyspace contains all the column families

▪ A column family consists of multiple rows.

▪ Each row can contain a different number of columns.

▪ Each column contains a name/value pair, along with a timestamp.

25

Graph data model▪ Based on Graph Theory: A graph is

composed of two elements: a node and aedge (i.e., the relationship).

▪ Each node represents an entity (e.g., a person, place, thing, category or other piece of data), and each relationship represents how two nodes are associated.▪ Twitter is a perfect example of a graph database

connecting 330 million monthly active users.

▪ Graph databases, by design, allow simple and fast retrieval of complex hierarchical structures that are difficult to model in relational systems.

26

https://neo4j.com/why-graph-databases/?ref=blog

https://www.omnicoreagency.com/twitter-statistics/

Typical NoSQL API

▪ Basic API access:

▪ get(key) -- Extract the value given a key

▪ put(key, value) -- Create or update the value given its key

▪ delete(key) -- Remove the key and its associated value

▪ execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map .... etc).

27

Advantages of NoSQL Systems

▪ Easy to use

▪ Easy to scale horizontally on commodity hardware

▪ Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned▪ No single point of failure

▪ Breadth of functionality

▪ Executing code next to the data (e.g., Hadoop)

▪ …

28

What does NoSQL not Provide?

▪ Joins

▪ Group by

▪ ACID transactions

▪ SQL

▪ Integration with applications that are based on SQL

29

Which One to use?

▪ NoSQL Data storage systems make sense for applications that need to deal with very very large semi-structured data

▪ Log Analysis

▪ Social Networking Feeds

▪ Most of our work is on organizational databases, which are not that large and have low update/query rates

▪ regular relational databases are the correct solution for such applications

30

In Summary -- NoSQL

• “Not Only SQL”

• 1. Scale horizontally “simple operations”

• 2. Replicate/distribute data over many servers

• 3. Simple call level interface (contrast w/ SQL)

• 4. Weaker concurrency/consistency model than ACID

• 5. Flexible schema (i.e., schema-less)

31

Demos

• Google Big Tables: https://www.youtube.com/watch?v=ChoXxlddGis

32

https://www.youtube.com/watch?v=ChoXxlddGis

Sources

• Reliability vs Availability: What’s the Difference? https://www.bmc.com/blogs/reliability-vs-availability/

• An Illustrated Proof of the CAP Theorem: https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

• What is a Column Store Database: https://database.guide/what-is-a-column-store-database/

• 10 Advantages of NoSQL over RDBMS: https://www.dummies.com/programming/big-data/10-advantages-of-nosql-over-rdbms/

33

https://www.bmc.com/blogs/reliability-vs-availability/

https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

https://database.guide/what-is-a-column-store-database/

https://database.guide/what-is-a-column-store-database/

https://www.dummies.com/programming/big-data/10-advantages-of-nosql-over-rdbms/

cs-580k/480k advanced topics in cloud computing › ~huilu › slides580ksp20 › nosql.pdf · x is...

Documents