NoSQL Distilled (Distilled)
Demystifying that which should have never entered “mystified” status
rICh Morrow, quicloud LLC
NoSQL Distilled
This talk is essentially the first couple chapters of “NoSQL Distilled” (Sadalage, Fowler)
Highly recommend this book!
2 reasons to like NoSQL
App development productivity
  Fixes “impedance mismatch”
Large scale
  Happily handles the “three Vs” of “big data”
    ▪ Volume
    ▪ Velocity
    ▪ Variety
A brief history of Storage
You’ve always needed a “backing store”
…could be files
  great for a single user or application
…could be databases
  great for multiple users/applications
…and on the DB side, could be:
  Application Database (used by single app)
  Integration Database (used by several apps)
Multi-user adds complexity
Concurrency
  Simple problem, very tough to solve
Application Datastores
  One app, many users
Integration Datastores
  One set of data, many apps, lots of potential for headbanging
Impedance mismatch?
{
  "id": "1001",
  "firstName": "Ann",
  "lastName": "Williams",
  "age": 55,
  "purchasedItems": {
    0321290533 {qty, price… }
    0321601912 {qty, price… }
    0131495054 {qty, price… }
  },
  "paymentDetails": { cc info… },
  "address": {
    "street": "1234 Park",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94102"
  }
}
1 object = 10, 20, 100? Tables. Ugh…
Your code has one structure, but your RDBMS stores in another…
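To make the mismatch concrete, here is a minimal sketch of what happens when one in-memory object has to be split across normalized tables. The table names, the column names, and the qty/price values are assumptions made up for illustration; only the customer fields come from the slide.

```python
# One aggregate, the way application code naturally holds it.
customer = {
    "id": "1001",
    "firstName": "Ann",
    "lastName": "Williams",
    "age": 55,
    "purchasedItems": {
        "0321290533": {"qty": 1, "price": 30.0},  # hypothetical qty/price
        "0321601912": {"qty": 2, "price": 45.0},  # hypothetical qty/price
    },
    "address": {"street": "1234 Park", "city": "San Francisco",
                "state": "CA", "zip": "94102"},
}


def to_relational_rows(agg):
    """Split one aggregate into per-table rows, as a normalized schema would."""
    tables = {"customer": [], "address": [], "purchased_item": []}
    tables["customer"].append({
        "id": agg["id"],
        "first_name": agg["firstName"],
        "last_name": agg["lastName"],
        "age": agg["age"],
    })
    # Child rows carry a foreign key back to the customer row.
    tables["address"].append({"customer_id": agg["id"], **agg["address"]})
    for isbn, item in agg["purchasedItems"].items():
        tables["purchased_item"].append(
            {"customer_id": agg["id"], "isbn": isbn, **item})
    return tables


rows = to_relational_rows(customer)
```

Even this small object lands in three tables, and reading it back means joining them again. An aggregate-oriented store would keep the whole `customer` structure as one unit.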
The long reign of RDBMS
A great "all purpose" storage + query tool
ACID compliant
Supports many users
Supports many apps
3NF stores data efficiently
  Disk wasn't always cheap
Fast and tunable
Introduced a common interface (SQL)
  Which every vendor quickly then “broke”
...but RDBMS != all unicorns and rainbows
Impedance mismatch
  Many teams build (then have to maintain) custom ORM or SOA proxies
Weren't built to be distributed
  Google, Amazon, et al. hit hard walls on RDBMS capabilities
  Often required expensive, proprietary hardware
Ooops, I sharded myself!
  Additional complexity
  Cross-shard joins now extremely expensive
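A rough sketch of why sharding adds complexity, assuming a toy setup where each shard is just a Python dict standing in for a separate database server. The shard count, the function names, and the sample rows are all hypothetical; this is not any particular database's sharding scheme.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

# Each dict stands in for one database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    """Route a key to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used to keep routing repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(key, row):
    shards[shard_for(key)][key] = row


def get(key):
    # Lookup by shard key: exactly one shard is contacted.
    return shards[shard_for(key)].get(key)


def find_by_city(city):
    # Query on a non-key column: every shard must be scanned (scatter-gather).
    contacted = 0
    matches = []
    for shard in shards:
        contacted += 1
        matches.extend(row for row in shard.values() if row["city"] == city)
    return matches, contacted


put("1001", {"city": "San Francisco"})
put("1002", {"city": "Denver"})
put("1003", {"city": "San Francisco"})
```

`get("1001")` touches one shard, while `find_by_city("San Francisco")` has to visit all of them. A join across shards pays that cost on both sides, plus the work of merging results in the application.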
“Web Scale” brings on the three Vs
Velocity
  Faster responses required
Volume
  100s of TB, PB now common
  “Web Scale” can mean 100s of thousands of concurrent transactions
Both of those increasing rapidly
Variety
  Mixed structure, semi-structured, unstructured
Google and Amazon drive the space
Bigtable paper (by Google)
  Heavily influenced the “Columnar” branch of NoSQL
Dynamo paper (by Amazon)
  Heavily influenced the “Key Value” branch of NoSQL
  This is NOT DynamoDB!!!
Design considerations:
  Distributed from the start
  Clusters of inexpensive commodity hardware are cheaper & more fault tolerant at scale
  Relaxed and/or tunable C&A (from CAP theorem)
  Deal with unheard-of volume & velocity
  Schemaless (bye bye impedance mismatch)
CAP (Theorem)?
Consistency
  How consistent the data looks to 2 or more viewers
  “Eventual” consistency possible (and common)!
Availability
  Responsiveness of the system
Partition Tolerance
  How well does the system respond to partition failures?
  This is normally “untunable”, unlike the C&A
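One common way the C&A trade-off gets "tuned" is with Dynamo-style quorums: N replicas, W acknowledgements per write, R replicas consulted per read. The sketch below is a toy model under that assumption. The `Replica` class and the function names are made up for illustration, and replication lag is simulated by simply never updating the remaining replicas.

```python
class Replica:
    """One copy of a single value, tagged with a version number."""

    def __init__(self):
        self.version = 0
        self.value = None


def write(replicas, w, value, version):
    """Acknowledge once the first `w` replicas are updated; the rest lag behind."""
    for replica in replicas[:w]:
        replica.version = version
        replica.value = value


def read(replicas, r):
    """Ask the last `r` replicas and return the value with the highest version.

    Reading from the opposite end of the list is the worst case:
    the read only sees the write if the two sets overlap.
    """
    contacted = replicas[-r:]
    newest = max(contacted, key=lambda rep: rep.version)
    return newest.value


def overlaps(n, r, w):
    """Read and write sets are guaranteed to intersect when R + W > N."""
    return r + w > n


# N = 3, W = 1, R = 1: fast, but a read can miss the latest write.
fast = [Replica() for _ in range(3)]
write(fast, w=1, value="v1", version=1)
stale_read = read(fast, r=1)

# N = 3, W = 2, R = 2: slower, but every read overlaps the last write.
safe = [Replica() for _ in range(3)]
write(safe, w=2, value="v1", version=1)
fresh_read = read(safe, r=2)
```

With W=1 and R=1 the read comes back stale (eventual consistency). With W=2 and R=2 the read sees the write, at the price of waiting on more replicas for each operation.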
NoSQL is Born
Because “Cloud” and “Big Data” were just not confusing enough people in IT
"Not ONLY SQL" - incredibly unfortunate "little o"
Name born out of a Bay Area meetup in 2009
…and regretted / derided ever since
“Polyglot Persistence”
Fancy term for “multiple datastores”
...you're already doing it
  Browser side cache
  Memcache
  Query cache
  OLAP systems
...just add NoSQL
Tell your RDBMS not to worry – it will (probably) still live a long, happy life
"NoSQL” Datastores share
Generally Open Source
Schemaless
  Easily change schema or do 'schema on read'
Cluster-oriented
  With the exception of Graph DBs
Generally favor "Web Scale" over ACID
Generally better for APPLICATION Databases
Aggregate data models
  Let you treat a group of data as a unit
  Again, graph DBs are an exception here…
The 4 Flavors of NoSQL
Key Value
  Fast lookup on a single “hashed” key
Document
  Each “Document” self-defines its own structure
Columnar (or Column-Family)
  Great for “sparse” data (millions of columns)
Graph [bit of a black sheep in the NoSQL family]
  Specialized to crawl graph relations like social networks, resource flows, etc.
  Less popular at the moment, but gaining steam fast
Key Value
Can only look up by (normally a single) Key
Extremely fast for that key
Value can be anything
Example: DynamoDB, Riak
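The key-value model is small enough to sketch in a few lines. This is a toy in-memory stand-in, not the DynamoDB or Riak API. The class name, the key naming scheme, and the sample values are assumptions for illustration.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes, lookup is by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if not isinstance(value, bytes):
            raise TypeError("the store only holds opaque bytes")
        self._data[key] = value

    def get(self, key):
        # Hash lookup on the key; the store never inspects the value.
        return self._data.get(key)


store = KeyValueStore()
# The value can be anything once serialized: JSON, an image, a session blob...
store.put("customer:1001", json.dumps({"firstName": "Ann"}).encode("utf-8"))
store.put("avatar:1001", b"\x89PNG...")  # placeholder bytes, not a real image

customer = json.loads(store.get("customer:1001"))
```

Because the store treats values as opaque, there is no way to ask it for "all customers named Ann". The application either knows the key or has to maintain its own lookup structure.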
Document
Document can contain anything
  JSON extremely popular
  But can also be XML, CSV, semi-structured, unstructured, custom… literally anything
Can query on aggregates inside of document
Can even index on aggregates
Can retrieve part of the document
Extremely memory intensive
Example: MongoDB, CouchDB
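As a rough illustration of querying inside a document and retrieving only part of it, here is a pure-Python sketch. It is not the MongoDB or CouchDB query API. The helper names, the dotted-path convention, and the second sample document are assumptions.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'address.city' into nested dicts; None if absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, path, value, fields=None):
    """Return documents whose nested field equals `value`.

    If `fields` is given, return only those fields (a projection),
    i.e. part of each matching document rather than the whole thing.
    """
    results = []
    for doc in docs:
        if get_path(doc, path) == value:
            if fields is None:
                results.append(doc)
            else:
                results.append({f: get_path(doc, f) for f in fields})
    return results


docs = [
    {"id": "1001", "firstName": "Ann",
     "address": {"city": "San Francisco", "state": "CA"}},
    # Hypothetical second document with a different shape: no schema to violate.
    {"id": "1002", "firstName": "Bob", "nickname": "Bobby"},
]

found = find(docs, "address.city", "San Francisco", fields=["id", "firstName"])
```

A real document store would answer the same query from an index on `address.city` instead of scanning every document, which is what "can even index on aggregates" refers to.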
Column (or Column-Family)
Great for “sparse” data (populated columns vary greatly between rows)
Group columns into families
Think of it as a “two level” aggregate
  First level “key” is rowID or aggregate of interest
  2nd level values are the columns
You can visualize the data as row or column-oriented
Example: HBase, Cassandra
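The "two level" aggregate can be pictured as nested maps: row key, then column family, then column. The sketch below is a toy in-memory model under that assumption, not the HBase or Cassandra API. The family names and sample values are made up.

```python
# row key -> column family -> column name -> value
table = {}


def put(row_key, family, column, value):
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get_family(row_key, family):
    """Row-oriented view: first level is the row key, second level the columns."""
    return dict(table.get(row_key, {}).get(family, {}))


def column_view(family, column):
    """Column-oriented view: one column's values across the rows that populate it."""
    return {
        row_key: families[family][column]
        for row_key, families in table.items()
        if column in families.get(family, {})
    }


put("1001", "profile", "firstName", "Ann")
put("1001", "profile", "age", 55)
put("1001", "orders", "0321290533", 1)       # hypothetical quantity
put("1002", "profile", "firstName", "Bob")   # sparse row: no age, no orders
```

Row "1002" simply omits the columns it does not have, so sparse data costs nothing to store. `get_family("1001", "profile")` reads the data by row, and `column_view("profile", "age")` reads the same data by column.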
Graph Databases
Built to efficiently crawl & search graph trees
  Social Networks
  Resource flows
  “people of interest”
Don’t run well on clusters
Example: Neo4j (and not much else right now)
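The kind of "crawl" a graph database is built for can be sketched as a breadth-first search over an adjacency list. This is plain Python, not Neo4j's query language. The social graph and the function name are made up for illustration.

```python
from collections import deque

# Hypothetical social graph: person -> list of friends.
friends = {
    "ann": ["bob", "cara"],
    "bob": ["ann", "dev"],
    "cara": ["ann"],
    "dev": ["bob", "eli"],
    "eli": ["dev"],
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: everyone reachable from `start` in at most `max_hops` edges.

    Returns a dict mapping each person found to their distance from `start`.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


friends_of_friends = within_hops(friends, "ann", 2)
```

In a relational schema each extra hop is typically another self-join on a friendship table. A graph store follows the edges directly, which is also part of why it is hard to split across a cluster: any edge may point to a node on another machine.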
Takeaways
RDBMS were not designed with many of today’s problems in mind
NoSQL DBs were built from the ground up to deal with these “Three V” issues
NoSQL can either replace or (more commonly) supplement existing RDBMS functions
  Move hot tables out to DynamoDB
  Write a greenfield app from ground up with only a NoSQL datastore
Consistency & Availability are often tunable
Many flavors exist & each has its own best use cases
Research heavily before deciding upon a platform
Questions? Answers?
Thanks!