CS-580K/480K Advanced Topics in Cloud Computing
Chapter V
15. NoSQL Database
Where are we?
Cloud Applications
New database technologies (e.g., key-value store and object store)
New programming models (e.g., serverless, microservices, Hadoop, and Spark)…
Cloud Platforms
Storage Virtualization
Network Virtualization
[Figure: three virtualization layers, each hosting VM1, VM2, and VM3; every VM runs its own operating system with APP1 through APP4]
A simple example – web-based applications
[Figure: web application storage tier backed by SQL databases]
▪ A relational database is a collection of data items organized as a set of formally-described tables – tables and their relations.
▪ Structured Query Language is the standard means of manipulating and querying data in relational databases.
Tables and their relations
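As a minimal sketch of tables and their relations, the following uses Python's built-in sqlite3 module with an in-memory database. The `users` and `orders` tables, their columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the schema below is a made-up example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id),
                         item TEXT NOT NULL);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, "book"), (2, 1, "pen"), (3, 2, "lamp")])

# SQL expresses the relation between the two tables as a join.
rows = conn.execute("""
    SELECT users.name, orders.item
    FROM users JOIN orders ON orders.user_id = users.id
    ORDER BY orders.id
""").fetchall()
print(rows)
```

The join is the part that becomes expensive once the two tables no longer live on the same server.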
Scalability Problem
[Figure: storage tier of stateful SQL databases]
▪ As data volume keeps increasing, it is cheaper and more practical to scale out horizontally by adding smaller, less expensive servers than to invest in a single large server.
Scale horizontally
Adding a cache layer
[Figure: three stateless front-ends, each with a stateless middle tier and a cache, in front of storage (stateful SQL databases)]
As datasets grow, the simple memcache model starts to become problematic.
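A minimal sketch of the cache layer, assuming a cache-aside policy (read from the cache, fall back to the database, invalidate on write). The `Database` and `CachedStore` classes are hypothetical stand-ins, not memcached's API; the counters only make visible which requests reach the database.

```python
class Database:
    """Stand-in for the stateful SQL tier; counts how often it is hit."""
    def __init__(self):
        self.rows = {}
        self.reads = 0
        self.writes = 0

    def read(self, key):
        self.reads += 1
        return self.rows.get(key)

    def write(self, key, value):
        self.writes += 1
        self.rows[key] = value


class CachedStore:
    """Cache-aside: serve reads from memory, send every write to the database."""
    def __init__(self, db):
        self.db = db
        self.cache = {}

    def get(self, key):
        if key in self.cache:          # cache hit: database not touched
            return self.cache[key]
        value = self.db.read(key)      # cache miss: one database read
        self.cache[key] = value
        return value

    def put(self, key, value):
        self.db.write(key, value)      # writes always reach the database
        self.cache.pop(key, None)      # invalidate the stale cached copy


db = Database()
store = CachedStore(db)
store.put("user:1", "alice")
for _ in range(100):
    store.get("user:1")
print(db.reads, db.writes)
```

In this sketch a hundred reads cost a single database read, while every write still lands on the database, which is why the cache helps reads far more than writes.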
Scaling RDBMS – Master/Slave
• Master-Slave
▪ All writes go to the master.
▪ All reads are performed against the replicated slave databases.
▪ Critical reads may return stale data because writes may not yet have propagated to the slaves.
▪ Large data sets can pose problems because the master needs to duplicate data to every slave.
[Figure: one master replicating to three slaves]
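The stale-read problem can be sketched with a toy model. This assumes asynchronous replication triggered by an explicit `replicate()` call and round-robin reads; the `ReplicatedDB` class and its methods are hypothetical and do not mirror any particular RDBMS.

```python
class ReplicatedDB:
    """One master plus N slaves with explicit (asynchronous) replication."""
    def __init__(self, num_slaves=3):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []              # writes not yet shipped to the slaves
        self.next_slave = 0

    def write(self, key, value):
        self.master[key] = value       # all writes go to the master
        self.pending.append((key, value))

    def read(self, key):
        slave = self.slaves[self.next_slave]   # round-robin over the slaves
        self.next_slave = (self.next_slave + 1) % len(self.slaves)
        return slave.get(key)

    def replicate(self):
        """Ship every pending write to every slave."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()


db = ReplicatedDB()
db.write("balance", 100)
stale = db.read("balance")   # replication has not run yet
db.replicate()
fresh = db.read("balance")
print(stale, fresh)
```

Adding slaves adds read capacity, but every write is still applied once on the master and once per slave.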
Scaling RDBMS - Partitioning
▪ “Sharding”
▪ Divide data amongst many cheap databases
▪ Manage parallel access in the application
▪ Scales well for both reads and writes
▪ Not transparent: the application needs to be partition-aware
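A minimal sketch of application-side sharding, assuming hash-based placement (range- or directory-based placement are equally valid choices). The `ShardedStore` class is hypothetical, and plain dictionaries stand in for the individual databases.

```python
import hashlib

class ShardedStore:
    """Application-side router over several independent 'databases'."""
    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def shard_for(self, key):
        # Stable hash so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", i)
print([len(s) for s in store.shards])
```

The routing logic lives in the application, which is exactly the lack of transparency noted above. With modulo placement, changing the number of shards also remaps most keys.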
Other ways to scale RDBMS
▪ Multi-Master replication
▪ INSERT only, not UPDATES/DELETES
▪ In-memory databases
In Summary
▪ Traditional relational databases do not scale well as the dataset grows
▪ Adding a cache layer scales well for reads, but not for writes
▪ Master/slave replication also scales well for reads, but not for writes
▪ Sharding scales well for both reads and writes, but is not transparent to the application
What is NoSQL
• Stands for No-SQL or Not Only SQL?
• Class of non-relational data storage systems
• E.g., BigTable and Dynamo
• Do not require a fixed table schema, nor do they use the concept of joins (no relationships)
• Distributed data storage systems
• All NoSQL offerings relax one or more of the ACID properties (see the CAP theorem)
Why NoSQL?
▪ Relational databases offer a very good general-purpose solution to many different data storage needs.
▪ But one solution cannot fit all needs
▪ Just as there are different programming languages, we need other data storage tools in the toolbox
Why NoSQL?
▪ Explosion of social media sites (Facebook, Twitter) with large data needs
▪ Explosion of storage needs in large web sites such as Google, Yahoo
▪ Much of the data is not files
▪ Shift to dynamically-typed data with frequent schema changes
Dynamo and BigTable
▪ Three major papers were the seeds of the NoSQL movement
▪ BigTable (Google)
▪ Dynamo (Amazon)
▪ CAP Theorem (counterpart of ACID)
CAP Theorem (I)
• Three properties of a distributed system
• Consistency
▪ All copies have the same value
• Availability
▪ Every request received by a non-failing node in the system must result in a response
• Partition tolerance
▪ The network can break into two or more independent parts (due to separate optimization and/or failures)
▪ Partition tolerance means that the system still works correctly even after the network is partitioned into multiple sub-systems.
CAP Theorem (II)
• Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system
• Proof: https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/
• Very large systems will partition at some point
▪ Choose one of consistency or availability
▪ Traditional databases choose consistency
▪ Most Web applications choose availability
▪ Except for specific parts such as order processing
Availability
▪ For a system with many nodes, at almost any point in time there’s a good chance that a node is down or that there is a network disruption among the nodes.
▪ Availability refers to the percentage of time that the infrastructure, system, or solution remains operational under normal circumstances in order to serve its intended purpose.
▪ Percentage of availability = (total elapsed time – sum of downtime) / total elapsed time
▪ Traditionally thought of as the server/process being available five 9’s (99.999%) of the time.
▪ At five 9’s, the yearly service downtime can be as much as 5.256 minutes.
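The availability formula and the five 9's figure can be checked with a few lines of arithmetic. This is a small sketch assuming a 365-day year; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600, ignoring leap years

def availability(total_elapsed, downtime):
    """Fraction of elapsed time the system was operational."""
    return (total_elapsed - downtime) / total_elapsed

def yearly_downtime_minutes(availability_fraction):
    """Downtime budget per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_fraction)

for label, target in [("three 9's", 0.999), ("four 9's", 0.9999), ("five 9's", 0.99999)]:
    print(f"{label}: {yearly_downtime_minutes(target):.3f} minutes/year")
```

Each additional 9 cuts the allowed downtime by a factor of ten.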
Consistency Model
▪ A consistency model determines rules for visibility and order of updates.
▪ For example:
▪ X is replicated on nodes M and N
▪ Client A writes X to node N
▪ Some period of time t elapses
▪ Client B reads X from node M
▪ Does client B see the write from client A?
▪ For NoSQL, the answer would be:
▪ Yes, if the NoSQL system adopts a strict consistency model
[Figure: client A and client B accessing copies of X on nodes M and N]
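The M/N scenario can be sketched as a toy model. It assumes that a strict model copies the write to M before acknowledging it, while a weaker model defers the copy; the `Cluster` class and its `synchronous` flag are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class Cluster:
    """Two replicas of X; `synchronous` picks the consistency model."""
    def __init__(self, synchronous):
        self.m = Replica("M")
        self.n = Replica("N")
        self.synchronous = synchronous
        self.backlog = []                 # updates N still owes to M

    def write_to_n(self, key, value):
        self.n.data[key] = value
        if self.synchronous:
            self.m.data[key] = value      # copy before acknowledging the write
        else:
            self.backlog.append((key, value))

    def read_from_m(self, key):
        return self.m.data.get(key)

    def propagate(self):
        for key, value in self.backlog:
            self.m.data[key] = value
        self.backlog.clear()

strict = Cluster(synchronous=True)
strict.write_to_n("X", 1)
print(strict.read_from_m("X"))   # client B sees client A's write

lazy = Cluster(synchronous=False)
lazy.write_to_n("X", 1)
print(lazy.read_from_m("X"))     # not yet visible on M
lazy.propagate()
print(lazy.read_from_m("X"))     # visible once the update has propagated
```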
Strict Consistency
• All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to.
• It implies nodes employ some kind of distributed transaction protocol to ensure all data copies have the same value
• CAP Theorem: Strict Consistency can’t be achieved at the same time as availability and partition-tolerance.
Eventual Consistency
▪ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
▪ Large systems built around the CAP trade-off are known as BASE (Basically Available, Soft state, Eventual consistency)
▪ Who builds large-scale distributed systems based on CAP?
▪ Google, Yahoo, Facebook, Amazon, eBay, etc.
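A minimal sketch of how replicas can converge once updates stop. The version numbers, the last-writer-wins merge rule, and the pairwise exchange are illustrative assumptions and are not claimed to be the mechanism of any of the systems named above.

```python
import itertools

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}        # key -> (version, value)

    def write(self, key, value, version):
        self.store[key] = (version, value)

    def merge_from(self, other):
        """Adopt any entry for which the other node holds a newer version."""
        for key, (version, value) in other.store.items():
            if key not in self.store or self.store[key][0] < version:
                self.store[key] = (version, value)

def anti_entropy_round(nodes):
    """Every pair of nodes exchanges state in both directions."""
    for a, b in itertools.combinations(nodes, 2):
        a.merge_from(b)
        b.merge_from(a)

nodes = [Node("n1"), Node("n2"), Node("n3")]
nodes[0].write("X", "old", version=1)
nodes[2].write("X", "new", version=2)
nodes[1].write("Y", "only-on-n2", version=1)

print([n.store.get("X") for n in nodes])   # replicas disagree
anti_entropy_round(nodes)                  # no new updates arrive
print([n.store.get("X") for n in nodes])   # replicas agree
```

Between the write and the exchange, a reader may see either value of X depending on which node it contacts, which is the "soft state" in BASE.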
Types of NoSQL (1)
▪ Key/Value data model
▪ Use of a hash table
▪ Store data as a key/value pair
▪ Access data (values) by strings called keys
▪ Value has no required format
▪ Basic operations include insert(key, value), fetch(key), update(key), delete(key)
▪ E.g., Amazon S3 (Dynamo)
DeCandia, Giuseppe, et al. "Dynamo: Amazon's highly available key-value store." ACM SIGOPS Operating Systems Review 41.6 (2007): 205-220.
Key/Value
▪ Pros:
▪ Simple model
▪ Very fast
▪ Very scalable
▪ Able to scale horizontally
▪ Cons:
▪ Many data structures (objects) can't be easily modeled as key/value pairs
Types of NoSQL (2)
▪ Other schema-less data models, which come in multiple flavors such as column-based, document-based, or graph-based.
▪ Cassandra (column-based)
▪ MongoDB (document-based)
▪ Neo4J (graph-based)
▪ Redis (key/value-based)
Document-based
▪ Can model more complex objects
▪ Data model: collection of documents
▪ Document: JSON
▪ JavaScript Object Notation is a data model that supports objects, records, structs, lists, arrays, maps, and dates, with nesting.
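A minimal sketch of a document and a collection, using Python's built-in json module. The order document and its field names are hypothetical; a real document store adds indexing and querying on top of this shape.

```python
import json

# A single "document": nested objects and arrays, no fixed schema.
order = {
    "_id": "order-1001",
    "customer": {"name": "alice", "email": "alice@example.com"},
    "items": [
        {"sku": "book-42", "qty": 1, "price": 12.5},
        {"sku": "pen-7", "qty": 3, "price": 1.2},
    ],
    "tags": ["gift", "priority"],
}

# A collection is just a set of documents; they need not share fields.
collection = [order, {"_id": "order-1002", "customer": {"name": "bob"}, "items": []}]

text = json.dumps(order, indent=2)      # serialize to JSON text
restored = json.loads(text)             # parse back into Python objects
total = sum(item["qty"] * item["price"] for item in restored["items"])
print(round(total, 2))
```

Everything about one order lives in one document, so reading it back needs no join.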
Column family data model
▪ Column-based databases use a concept called a keyspace.
▪ The keyspace contains all the column families
▪ A column family consists of multiple rows.
▪ Each row can contain a different number of columns.
▪ Each column contains a name/value pair, along with a timestamp.
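The nesting described above (keyspace, column family, row, column) can be sketched with plain dictionaries. The `Column` dataclass, the `insert` helper, and the sample data are hypothetical and do not follow any specific database's API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    value: object
    timestamp: float = field(default_factory=time.time)

# keyspace -> column family -> row key -> {column name: Column}
keyspace = {"users": {}}

def insert(column_family, row_key, name, value):
    row = keyspace[column_family].setdefault(row_key, {})
    row[name] = Column(name, value)

insert("users", "alice", "email", "alice@example.com")
insert("users", "alice", "city", "Paris")
insert("users", "bob", "email", "bob@example.com")   # bob has fewer columns

for row_key, row in keyspace["users"].items():
    print(row_key, sorted(row))
```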
Graph data model
▪ Based on graph theory: a graph is composed of two elements, a node and an edge (i.e., the relationship).
▪ Each node represents an entity (e.g., a person, place, thing, category or other piece of data), and each relationship represents how two nodes are associated.
▪ Twitter is a perfect example of graph-structured data, connecting 330 million monthly active users.
▪ Graph databases, by design, allow simple and fast retrieval of complex hierarchical structures that are difficult to model in relational systems.
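A minimal sketch of the node/relationship model, using an in-memory adjacency list and a breadth-first traversal. The `Graph` class, the `FOLLOWS` relationship, and the sample people are hypothetical; a real graph database would expose a query language instead.

```python
from collections import defaultdict, deque

class Graph:
    def __init__(self):
        self.nodes = {}                         # node id -> properties
        self.edges = defaultdict(list)          # node id -> [(relationship, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def reachable(self, start, relationship):
        """Breadth-first walk along edges of one relationship type."""
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for rel, target in self.edges[current]:
                if rel == relationship and target not in seen:
                    seen.add(target)
                    queue.append(target)
        seen.discard(start)
        return seen

g = Graph()
for name in ["alice", "bob", "carol", "dave"]:
    g.add_node(name, kind="person")
g.add_edge("alice", "FOLLOWS", "bob")
g.add_edge("bob", "FOLLOWS", "carol")
g.add_edge("dave", "FOLLOWS", "alice")
print(sorted(g.reachable("alice", "FOLLOWS")))
```

The traversal follows relationships directly from node to node, where a relational schema would need one self-join per hop.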
Typical NoSQL API
▪ Basic API access:
▪ get(key) -- Extract the value given a key
▪ put(key, value) -- Create or update the value given its key
▪ delete(key) -- Remove the key and its associated value
▪ execute(key, operation, parameters) -- Invoke an operation on the value (given its key), where the value is a special data structure (e.g., List, Set, Map, etc.).
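The four calls above can be sketched as a small in-memory class. This is an illustration only: `KeyValueStore` is hypothetical, and dispatching `execute` to a method of the stored Python object is an assumption about what "operation" means.

```python
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        """Extract the value given a key (None if the key is absent)."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or update the value given its key."""
        self._data[key] = value

    def delete(self, key):
        """Remove the key and its associated value."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Invoke a method of the stored value, e.g. list.append or set.add."""
        return getattr(self._data[key], operation)(*parameters)


store = KeyValueStore()
store.put("cart:alice", [])
store.execute("cart:alice", "append", "book")
store.execute("cart:alice", "append", "pen")
print(store.get("cart:alice"))
store.delete("cart:alice")
print(store.get("cart:alice"))
```

Running the operation next to the value avoids fetching the whole list, changing it on the client, and writing it back.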
Advantages of NoSQL Systems
▪ Easy to use
▪ Easy to scale horizontally on commodity hardware
▪ Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
▪ No single point of failure
▪ Breadth of functionality
▪ Executing code next to the data (e.g., Hadoop)
▪ …
What does NoSQL not Provide?
▪ Joins
▪ Group by
▪ ACID transactions
▪ SQL
▪ Integration with applications that are based on SQL
Which One to use?
▪ NoSQL data storage systems make sense for applications that need to deal with very large semi-structured data
▪ Log Analysis
▪ Social Networking Feeds
▪ Most of our work is on organizational databases, which are not that large and have low update/query rates
▪ Regular relational databases are the correct solution for such applications
In Summary -- NoSQL
• “Not Only SQL”
• 1. Scale “simple operations” horizontally
• 2. Replicate/distribute data over many servers
• 3. Simple call level interface (contrast w/ SQL)
• 4. Weaker concurrency/consistency model than ACID
• 5. Flexible schema (i.e., schema-less)
Demos
• Google BigTable: https://www.youtube.com/watch?v=ChoXxlddGis
Sources
• Reliability vs Availability: What’s the Difference? https://www.bmc.com/blogs/reliability-vs-availability/
• An Illustrated Proof of the CAP Theorem: https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/
• What is a Column Store Database: https://database.guide/what-is-a-column-store-database/
• 10 Advantages of NoSQL over RDBMS: https://www.dummies.com/programming/big-data/10-advantages-of-nosql-over-rdbms/