no sql distilled-distilled

NoSQL Distilled (Distilled)

demystifying that which should have never entered “mystified” status

rICh Morrow, quicloud LLC

NoSQL Distilled

This talk is essentially the first couple chapters of “NoSQL Distilled” (Sadalage, Fowler)

Highly recommend this book!

2 reasons to like NoSQL

App development productivity
  Fixes “impedance mismatch”
Large scale
  Happily handles the “three Vs” of “big data”
    ▪ Volume
    ▪ Velocity
    ▪ Variety

A brief history of Storage

You’ve always needed a “backing store”
…could be files
  great for a single user or application
…could be databases
  great for multiple users/applications
…and on the DB side, could be:
  Application Database (used by single app)
  Integration Database (used by several apps)

Multi-user adds complexity

Concurrency
  Simple problem, very tough to solve
Application Datastores
  One app, many users
Integration Datastores
  One set of data, many apps, lots of potential for headbanging
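As a rough illustration of why concurrency is simple to state but hard to solve, the sketch below replays a classic "lost update" deterministically, then repeats it with an optimistic version check. The `VersionedStore` class and the stock numbers are hypothetical, invented for this example; real datastores use a range of other mechanisms (locks, transactions, vector clocks).

```python
class VersionedStore:
    """Toy store where each record carries a version number."""

    def __init__(self) -> None:
        self._records: dict[str, tuple[int, int]] = {}  # key -> (version, value)

    def read(self, key: str) -> tuple[int, int]:
        return self._records.get(key, (0, 0))

    def blind_write(self, key: str, value: int) -> None:
        version, _ = self.read(key)
        self._records[key] = (version + 1, value)

    def write_if_unchanged(self, key: str, expected_version: int, value: int) -> bool:
        """Optimistic check: reject the write if someone else got there first."""
        version, _ = self.read(key)
        if version != expected_version:
            return False
        self._records[key] = (version + 1, value)
        return True


# Lost update: both users read stock = 10, then each sells one unit.
store = VersionedStore()
store.blind_write("stock", 10)
_, seen_by_a = store.read("stock")
_, seen_by_b = store.read("stock")
store.blind_write("stock", seen_by_a - 1)
store.blind_write("stock", seen_by_b - 1)
lost_update_result = store.read("stock")[1]  # one of the two sales has vanished

# Same interleaving with a version check: the second writer is told to retry.
store = VersionedStore()
store.blind_write("stock", 10)
ver_a, val_a = store.read("stock")
ver_b, val_b = store.read("stock")
a_ok = store.write_if_unchanged("stock", ver_a, val_a - 1)
b_ok = store.write_if_unchanged("stock", ver_b, val_b - 1)

print(lost_update_result, a_ok, b_ok)
```

With blind writes the stock ends at 9 even though two units were sold; with the version check the second writer is rejected and has to re-read before trying again.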

Impedance mismatch?

{
  "id": "1001",
  "firstName": "Ann",
  "lastName": "Williams",
  "age": 55,
  "purchasedItems": {
    0321290533 {qty, price… }
    0321601912 {qty, price… }
    0131495054 {qty, price… }
  },
  "paymentDetails": { cc info… },
  "address": {
    "street": "1234 Park",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94102"
  }
}

1 object = 10, 20, 100? Tables. Ugh…

Your code has one structure, but your RDBMS stores in another…
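To make the mismatch concrete, here is a minimal sketch using Python's built-in `sqlite3` and `json` modules. The customer fields come from the slide; the table names, quantities and prices are made-up placeholders, and the payment details are left out because the slide elides them.

```python
import json
import sqlite3

# One in-memory aggregate, shaped the way application code wants it.
# Quantities and prices are made-up placeholders.
customer = {
    "id": "1001",
    "firstName": "Ann",
    "lastName": "Williams",
    "age": 55,
    "purchasedItems": {
        "0321290533": {"qty": 1, "price": 40.0},
        "0321601912": {"qty": 2, "price": 35.0},
    },
    "address": {"street": "1234 Park", "city": "San Francisco",
                "state": "CA", "zip": "94102"},
}

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id TEXT PRIMARY KEY, first_name TEXT,
                            last_name TEXT, age INTEGER);
    CREATE TABLE addresses (customer_id TEXT, street TEXT, city TEXT,
                            state TEXT, zip TEXT);
    CREATE TABLE purchased_items (customer_id TEXT, isbn TEXT,
                                  qty INTEGER, price REAL);
""")

# Relational style: the single object is split across three tables.
db.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
           (customer["id"], customer["firstName"],
            customer["lastName"], customer["age"]))
addr = customer["address"]
db.execute("INSERT INTO addresses VALUES (?, ?, ?, ?, ?)",
           (customer["id"], addr["street"], addr["city"],
            addr["state"], addr["zip"]))
for isbn, item in customer["purchasedItems"].items():
    db.execute("INSERT INTO purchased_items VALUES (?, ?, ?, ?)",
               (customer["id"], isbn, item["qty"], item["price"]))

rows_written = sum(
    db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("customers", "addresses", "purchased_items")
)

# Aggregate style: the same object is stored as one value under one key.
document_store = {customer["id"]: json.dumps(customer)}
restored = json.loads(document_store["1001"])

print(rows_written)          # rows spread over three tables
print(restored == customer)  # whether the document round-trips unchanged
```

Even this trimmed-down customer needs three tables and several inserts on the relational side, while the aggregate side stores and restores it as one unit.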

The long reign of RDBMS

A great "all purpose" storage + query tool
ACID compliant
Supports many users
Supports many apps
3NF stores data efficiently
  Disk wasn't always cheap
Fast and tunable
Introduced a common interface (SQL)
  Which every vendor quickly then “broke”

...but RDBMS != all unicorns and rainbows

Impedance mismatch
  Many teams build (then have to maintain) custom ORM or SOA proxies
Weren't built to be distributed
  Google, Amazon, et al hit hard walls on RDBMS capabilities
  Often required expensive, proprietary hardware
Ooops, I sharded myself!
  Additional complexity
  Cross shard joins now extremely expensive
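The cost of cross shard joins can be sketched in a few lines. This is a toy model under stated assumptions: four shards represented by plain dicts, hash-based routing on the key, and hypothetical `user:` / `order:` key prefixes. It is not how any particular RDBMS shards data.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Route a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Each shard is modelled as a plain dict standing in for a separate server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key: str, row: dict) -> None:
    shards[shard_for(key)][key] = row


def get(key: str) -> dict:
    # A single-key lookup touches exactly one shard.
    return shards[shard_for(key)][key]


def orders_for_user(user_id: str) -> tuple[list[dict], int]:
    """A 'join' on a column that is not the shard key has to visit every shard."""
    matches, shards_visited = [], 0
    for shard in shards:
        shards_visited += 1
        matches.extend(row for key, row in shard.items()
                       if key.startswith("order:") and row["user_id"] == user_id)
    return matches, shards_visited


put("user:ann", {"name": "Ann"})
for n in range(5):
    put(f"order:{n}", {"user_id": "user:ann", "total": 10 * n})

orders, visited = orders_for_user("user:ann")
print(len(orders), visited)
```

Fetching one row by key costs one shard visit, while collecting a user's orders fans out to all of them. On real hardware each of those visits is a network round trip.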

“Web Scale” brings on the three Vs

Velocity
  Faster responses required
Volume
  100s of TB, PB now common
  “Web Scale” can mean 100s of thousands of concurrent transactions
  Both of those increasing rapidly
Variety
  Mixed structure, semi-structured, unstructured

Google and Amazon drive the space

Bigtable paper (by Google)
  Heavily influenced the “Columnar” branch of NoSQL
Dynamo paper (by Amazon)
  Heavily influenced the “Key Value” branch of NoSQL
  This is NOT DynamoDB!!!
Design considerations:
  Distributed from the start
  Clusters of inexpensive commodity hardware are cheaper & more fault tolerant at scale
  Relaxed and/or tunable C&A (from CAP theorem)
  Deal with unheard of volume & velocity
  Schemaless (bye bye impedance mismatch)

CAP (Theorem)?

Consistency
  How consistent the data looks to 2 or more viewers
  “Eventual” consistency possible (and common)!
Availability
  Responsiveness of the system
Partition Tolerance
  How well does the system respond to partition failures?
  This is normally “untunable”, unlike the C&A
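The slides do not say how C&A get tuned. One common mechanism in Dynamo-style stores, assumed here purely for illustration, is to pick a replica count N, a write quorum W and a read quorum R: a read is guaranteed to overlap the latest acknowledged write when R + W > N. The function name and the three settings below are hypothetical.

```python
def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("quorums must be between 1 and the replica count")
    return r + w > n


# Three replicas, three hypothetical settings.
settings = {
    "fast, eventually consistent": (3, 1, 1),
    "balanced quorum": (3, 2, 2),
    "write to all, read from one": (3, 3, 1),
}
for label, (n, w, r) in settings.items():
    print(f"{label}: overlapping quorums = {read_sees_latest_write(n, w, r)}")
```

Smaller W and R mean fewer replicas must respond, which favors availability and latency. Larger values favor consistency, at the price of failing more requests when replicas are unreachable.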

NoSQL is Born

Because “Cloud” and “Big Data” were just not confusing enough people in IT

"Not ONLY SQL" - incredibly unfortunate "little o"

Name born out of a Bay Area meetup in 2009
  …and regretted / derided ever since

“Polyglot Persistence”

Fancy term for “multiple datastores”
...you're already doing it
  Browser side cache
  Memcache
  Query cache
  OLAP systems
...just add NoSQL

Tell your RDBMS not to worry – it will (probably) still live a long, happy life

“NoSQL” Datastores share

Generally Open Source
Schemaless
  Easily change schema or do 'schema on read'
Cluster-oriented
  With the exception of Graph DBs
Generally favor "Web Scale" over ACID
Generally better for APPLICATION Databases
Aggregate data models
  Let you treat a group of data as a unit
  Again, graph DBs are an exception here…
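A small sketch of what 'schema on read' can look like in practice: records are stored exactly as they were written, and the reader decides how to interpret them. The record contents and the `read_customer` helper are hypothetical, chosen only to show renamed, mistyped and missing fields.

```python
import json

# Raw records written over time by different versions of a hypothetical app.
raw_records = [
    '{"id": "1001", "firstName": "Ann", "age": 55}',
    '{"id": "1002", "firstName": "Bob", "age": "41", "twitter": "@bob"}',
    '{"id": "1003", "first_name": "Cy"}',
]


def read_customer(raw: str) -> dict:
    """Apply the schema at read time: rename, convert and default fields."""
    doc = json.loads(raw)
    age = doc.get("age")
    return {
        "id": doc["id"],
        "firstName": doc.get("firstName") or doc.get("first_name"),
        "age": int(age) if age is not None else None,
    }


customers = [read_customer(r) for r in raw_records]
print(customers)
```

Nothing had to be migrated when the field names drifted. The trade-off is that every reader now carries the schema knowledge the database would otherwise enforce.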

The 4 Flavors of NoSQL

Key Value
  Fast lookup on a single “hashed” key
Document
  Each “Document” self-defines its own structure
Columnar (or Column-Family)
  Great for “sparse” data (millions of columns)
Graph [bit of a black sheep in the NoSQL family]
  Specialized to crawl graph relations like social networks, resource flows, etc.
  Less popular at the moment, but gaining steam fast

Key Value

Can only look up by (normally a single) Key
  Extremely fast for that key
Value can be anything

Example: DynamoDB, Riak
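The key value model is small enough to sketch with a dict. The `KeyValueStore` class below is a hypothetical toy, not the API of DynamoDB or Riak; it only illustrates that the key is the single access path and the value is opaque to the store.

```python
class KeyValueStore:
    """Toy key-value store: the store never looks inside the value."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Average O(1) hash lookup; this is the only access path.
        return self._data[key]


store = KeyValueStore()
store.put("session:42", b'{"user": "ann", "cart": ["0321290533"]}')
store.put("avatar:ann", bytes([0x89, 0x50, 0x4E, 0x47]))  # opaque binary blob

print(store.get("session:42"))
# Finding "all sessions for ann" is not possible without scanning every key,
# because the store has no idea what the bytes mean.
```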

Document

Document can contain anything
  JSON extremely popular
  But can also be XML, CSV, semi-structured, unstructured, custom… literally anything
Can query on aggregates inside of document
Can even index on aggregates
Can retrieve part of the document
Extremely memory intensive

Example: MongoDB, CouchDB
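The three document store capabilities listed above (query inside the document, index on a nested field, retrieve only part of a document) can be mimicked in memory. This is a sketch with hypothetical helper names (`lookup`, `find`) and sample data; it is not the query API of MongoDB or CouchDB.

```python
from typing import Any

documents = [
    {"_id": 1, "name": "Ann", "address": {"city": "San Francisco", "state": "CA"}},
    {"_id": 2, "name": "Bob", "address": {"city": "Austin", "state": "TX"},
     "nickname": "bobby"},  # documents need not share the same fields
    {"_id": 3, "name": "Cy", "address": {"city": "Oakland", "state": "CA"}},
]


def lookup(doc: dict, path: str) -> Any:
    """Follow a dotted path such as 'address.state'; None if it is absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs: list[dict], path: str, value: Any, fields: list[str]) -> list[dict]:
    """Match on a nested field and return only the requested fields."""
    return [{f: lookup(d, f) for f in fields}
            for d in docs if lookup(d, path) == value]


# A secondary index on a nested field: value -> list of document ids.
state_index: dict[str, list[int]] = {}
for d in documents:
    state_index.setdefault(lookup(d, "address.state"), []).append(d["_id"])

print(find(documents, "address.state", "CA", ["name", "address.city"]))
print(state_index)
```

Unlike the key value case, the store here understands the structure of the value, which is what makes the nested query and the index possible.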

Column (or Column-Family)

Great for “sparse” data (populated columns vary greatly between rows)
Group columns into families
Think of it as a “two level” aggregate
  First level “key” is rowID or aggregate of interest
  2nd level values are the columns
You can visualize the data as row or column-oriented

Example: HBase, Cassandra
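The "two level" aggregate can be pictured as nested maps: row key, then column family, then column. The sketch below uses made-up rows and family names (`profile`, `orders`) and says nothing about how HBase or Cassandra lay data out on disk.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table: dict = defaultdict(lambda: defaultdict(dict))


def put(row_key: str, family: str, column: str, value) -> None:
    table[row_key][family][column] = value


# Hypothetical rows: each one populates a different set of columns.
put("1001", "profile", "firstName", "Ann")
put("1001", "profile", "city", "San Francisco")
put("1001", "orders", "0321290533", 1)
put("1002", "profile", "firstName", "Bob")
put("1002", "orders", "0131495054", 3)
put("1002", "orders", "0321601912", 2)

# Row-oriented view: everything stored for one row key.
row_view = {family: dict(cols) for family, cols in table["1001"].items()}

# Column-oriented view: one column family read across all rows.
orders_view = {row: dict(fams["orders"]) for row, fams in table.items()
               if "orders" in fams}

print(row_view)
print(orders_view)
```

Only the columns that were actually written take up space, which is why the model suits sparse data.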

Graph Databases

Built to efficiently crawl & search graph trees
  Social Networks
  Resource flows
  “people of interest”

Don’t run well on clusters

Example: Neo4J (and not much else right now)
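The kind of crawl a graph database is built for can be sketched as a breadth-first search over adjacency lists. The social graph and the `within_hops` function are hypothetical; a real graph database such as Neo4j has its own storage format and query language, which this does not attempt to reproduce.

```python
from collections import deque

# Hypothetical social graph stored as adjacency lists.
friends = {
    "ann": ["bob", "cy"],
    "bob": ["ann", "dee"],
    "cy": ["ann", "dee"],
    "dee": ["bob", "cy", "eli"],
    "eli": ["dee"],
}


def within_hops(graph: dict[str, list[str]], start: str, max_hops: int) -> dict[str, int]:
    """Breadth-first search: everyone reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(person, []):
            if neighbour not in distance:
                distance[neighbour] = distance[person] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


print(within_hops(friends, "ann", 2))
```

Every hop follows a direct reference to a neighbour. In a relational schema the same question would typically need one self-join per hop.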

Takeaways

RDBMS were not designed with many of today’s problems in mind

NoSQL DBs were built from the ground up to deal with these “Three V” issues

NoSQL can either replace or (more commonly) supplement existing RDBMS functions
  Move hot tables out to DynamoDB
  Write a greenfield app from ground up with only a NoSQL datastore
Consistency & Availability are often tunable
Many flavors exist & each has its own best use cases
Research heavily before deciding upon a platform

Questions? Answers?

Thanks!
