nosql - life beyond the outer join

NoSQL - Life beyond the Outer Join

Glen Smith

Survey the landscape of NoSQL offerings

Learn some of the terminology

Look at some of the Java offerings in the space

Take away source to play with

Be able to ask questions (but you may not get answers)

Objectives

(N)ot (O)nly SQL, not “Anti SQL”

A movement more than “one” technology

Distributed Storage System

Much weaker queries

Scale across many machines

Much larger data, much faster queries

What is NoSQL?

Inspired by Distributed Data Storage problems

Scale easily by adding servers

Not suited to all problem types, but super-suited to certain large problem types

High-write situations (eg activity tracking or timeline rendering for millions of users)

A lot of relational uses are really dumbed down (eg fetch by PK with update)

Why NoSQL?

Nothing ;-)

To scale RDBMS, your approach is typically:

Shard your datasource

Put in a bunch of read replicas

Put memcached in front of those

What could possibly go wrong?

Complex. Custom caching. Partitioning. Migrating of shards. Tons of moving parts.

What’s wrong with RDBMS?

Atomic (it happens or not, no partial completes)

Consistent (DB internals, referential integrity, field validation)

Isolated (Can’t modify uncommitted data)

Durable (written to disk/transaction log)

But in a distributed db, life is not so simple...

How can I live w/o ACID?

In a distributed system, when you have state on more than one machine, pick any two:

Consistency (easy in read-only states – copy!)

Availability (can you get at your data? Is it up?)

Partition Tolerance (3 machines on one net, 3 on the other, with a broken link. How do you take updates when you can’t keep everyone up to date? What if you don’t agree on what’s up?)

The CAP theorem

Basically big distributed hashtables

Push all logic into the write (update two lists – one for userId, one for email)

Things don’t happen transactionally. These are two writes.

There is no free lunch. The programmer is now handling consistency problems.

You were thinking about query optimisation before, and now even more so.

How do these NoSQL things work?
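The "update two lists" idea can be sketched in plain Java. This is a toy under stated assumptions: the class `UserDirectory`, its method names and the two in-memory maps are hypothetical stand-ins for two keyspaces in a real store, not any product's API. The point is that `save` performs two independent writes, so a failure between them leaves the email lookup out of step with the userId lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy stand-in for a key-value store: one map per lookup path, no transactions. */
final class UserDirectory {
    private final Map<String, String> nameByUserId = new ConcurrentHashMap<>();
    private final Map<String, String> userIdByEmail = new ConcurrentHashMap<>();

    /** All the "query logic" lives in the write: maintain both lookups by hand. */
    void save(String userId, String email, String name) {
        nameByUserId.put(userId, name);    // write 1
        userIdByEmail.put(email, userId);  // write 2 (not atomic with write 1)
    }

    String nameForUserId(String userId) {
        return nameByUserId.get(userId);
    }

    /** Reads are plain key fetches: email -> userId -> name. */
    String nameForEmail(String email) {
        String userId = userIdByEmail.get(email);
        return userId == null ? null : nameByUserId.get(userId);
    }

    public static void main(String[] args) {
        UserDirectory dir = new UserDirectory();
        dir.save("u1", "glen@example.com", "Glen");
        System.out.println(dir.nameForEmail("glen@example.com"));
    }
}
```

Handling the gap between write 1 and write 2 (retry, repair on read, or tolerate it) is the consistency work that now falls to the programmer.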

Digg – 3 TB

Facebook Inbox – 50 TB

eBay – 2 PB

Think about Twitter’s issues: billions of queries a second over terabytes of data.

How big are we talking?

Key-Value In-Memory stores (Memcached, Redis)

Key-Value “Eventually Consistent” stores (“Dynamo Clones” like Cassandra, Voldemort, Riak)

Document stores (CouchDB, MongoDB, JCR)

Graph Databases (Neo4j)

Tabular (“BigTable clones” like Hadoop/HBase)

The NoSQL Taxonomy

Developed for the original LiveJournal site

LRU, distributed hashtable

Logic is in both client and server

Used in Google App Engine, Facebook, Twitter

Ehcache now has a similar service

Good for things that outlive an app server

Memcached
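For the "LRU" part, a minimal sketch of least-recently-used eviction using only the JDK. It assumes a single JVM and a fixed entry count; the class name `LruCache` is made up for illustration, and Memcached's own memory management is more involved than this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Least-recently-used cache: reading an entry moves it to the "young" end. */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each put; returning true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");       // touch "a", so "b" is now the eldest
        cache.put("c", "3");  // evicts "b"
        System.out.println(cache.keySet());
    }
}
```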

Clients know how to: Send items to servers (consistent hashing)

What to do when a server fails

How to fetch keys from servers

Can “weight” servers according to their capacity

Servers know how to: Store items they receive

Expire them from the cache

No inter-server comms – everything is unaware

How does it work?
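The client-side routing described above can be sketched as a consistent hash ring. This is an illustration under assumptions, not the algorithm of any particular Memcached client: the class `ConsistentHashRing`, the MD5-based hash and the "points per weight" scheme are choices made here for the example. Each server gets a number of points on the ring proportional to its weight, and a key belongs to the first server point at or after the key's hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int pointsPerWeight;

    ConsistentHashRing(int pointsPerWeight) {
        this.pointsPerWeight = pointsPerWeight;
    }

    /** A server with weight 2 gets twice the points, so roughly twice the keys. */
    void addServer(String server, int weight) {
        for (int i = 0; i < pointsPerWeight * weight; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    /** On failure, only the keys owned by the dead server move elsewhere. */
    void removeServer(String server) {
        ring.values().removeIf(server::equals);
    }

    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue(); // wrap around
    }

    /** First 8 bytes of an MD5 digest, packed into a long. */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the servers never talk to each other, every client must build the same ring from the same server list to agree on where a key lives.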

Sample Code

Less than Memcached, but also more!

Not a cache, but a distributed key/value store

Developed by LinkedIn

Works as a distributed hashmap with failover

Logic can be in client/server or just server

Pluggable storage (MySQL, BDB, mock)

Pluggable serialization (JSON, Google Protocol Buffers, etc.)

Voldemort

Eventual consistency – data will come into sync but not immediately on the write. In practice “pretty soon” is milliseconds later

We are actually used to this – eg Google indexes update every so often.

Guarantees to read your own writes (eg your profile on LinkedIn)

Tuneable to better performance/weaker consistency

“Relaxed” Consistency
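One common way Dynamo-style stores expose the performance/consistency dial is with three numbers: N replicas, W replicas that must acknowledge a write, and R replicas consulted on a read. The sketch below is a deliberately pessimistic toy, assuming writes land on the first W replicas and reads ask the last R; the class `QuorumStore` and its methods are hypothetical. When R + W > N the two sets must overlap, so the read sees the newest version.

```java
final class QuorumStore {
    private final long[] versions;
    private final String[] values;

    QuorumStore(int replicas) {
        versions = new long[replicas];
        values = new String[replicas];
    }

    /** The write reaches only the first w replicas; the rest catch up "eventually". */
    void write(String value, long version, int w) {
        for (int i = 0; i < w; i++) {
            versions[i] = version;
            values[i] = value;
        }
    }

    /** Worst case: ask the last r replicas and keep the newest version seen. */
    String read(int r) {
        long newest = Long.MIN_VALUE;
        String result = null;
        for (int i = versions.length - r; i < versions.length; i++) {
            if (values[i] != null && versions[i] > newest) {
                newest = versions[i];
                result = values[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        QuorumStore store = new QuorumStore(3);  // N = 3
        store.write("profile v1", 1, 3);         // fully replicated
        store.write("profile v2", 2, 2);         // W = 2, third replica lags
        System.out.println(store.read(2));       // R = 2, R + W > N: prints "profile v2"
        System.out.println(store.read(1));       // R = 1, R + W = N: prints "profile v1" (stale)
    }
}
```

Lowering R or W makes each operation cheaper and faster, at the price of reads like the second one.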

Data is automatically replicated

Partitioning ensures each server holds only a subset of the data

Server failure is handled transparently

Data is rebalanced when servers are added/removed

Serialization is pluggable

Apache License

What’s attractive?

“We were able to move applications that needed to handle hundreds of millions of reads and writes per day from over 400ms to under 10ms while simultaneously increasing the amount of data we store.”

Impressive Performance

Performance Info

http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010








Starting the server (or deploy as a .war):

bin\voldemort-server.bat config\single_node_cluster

Starting the console:

bin\voldemort-shell.bat test tcp://localhost:6666

Run some queries

put "hello" "world"

get "hello"

put "hello" "world 2.0"

delete "hello"

Sample Script

Sample Code

Document-Oriented DB – No Schema

Written in Erlang (!) by a Notes Dev (!!!)

Everything is stored as JSON, with a RESTful API

Clever replication concepts – works in disconnected settings

Every write creates a new version of the document

Map/Reduce baked in

Apache License

CouchDB

Schemaless operation – ad hoc data

Incremental replication (great for disconnected settings)

Great fault-tolerance (with versioned conflicts)

Fast query with flexibility (MapReduce)

What’s attractive?

Popularized by Google, whose MapReduce framework runs alongside BigTable

Map functions collect documents matching criteria and create a B-Tree

Reduce functions operate on the B-Tree

Everything happens in parallel on many machines

Example: distributed grep

So what is this Map/Reduce thing?
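The distributed grep example can be approximated on one machine with Java parallel streams. This is a sketch, assuming each inner list stands in for the lines held by one machine; the class `GrepMapReduce` and its `grep` method are names invented here. The filter plays the map step (emit matching lines), and the grouping/counting collector plays the reduce step (combine the emitted values per key).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class GrepMapReduce {
    /** Map: keep lines containing the needle. Reduce: count occurrences per distinct line. */
    static Map<String, Long> grep(List<List<String>> shards, String needle) {
        return shards.parallelStream()                   // shards are processed in parallel
                .flatMap(List::stream)
                .filter(line -> line.contains(needle))   // the "map" step
                .collect(Collectors.groupingBy(          // the "reduce" step
                        line -> line, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(
                List.of("error: disk full", "ok"),
                List.of("error: disk full", "error: timeout"));
        System.out.println(grep(shards, "error"));
    }
}
```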

http://127.0.0.1:5984/

http://127.0.0.1:5984/_all_dbs

http://127.0.0.1:5984/mydb (PUT)

http://127.0.0.1:5984/_utils/ (Futon)

The Naked Couch
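The same requests can be issued from Java with the JDK's built-in HTTP client (Java 11+). A minimal sketch, assuming a CouchDB instance listening on 127.0.0.1:5984 that accepts unauthenticated requests; newer CouchDB releases may require admin credentials for the PUT. The class and helper names are illustrative only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class NakedCouch {
    static final String BASE = "http://127.0.0.1:5984";

    /** GET a path such as "/" (server info) or "/_all_dbs" (list of databases). */
    static HttpRequest get(String path) {
        return HttpRequest.newBuilder(URI.create(BASE + path)).GET().build();
    }

    /** PUT with an empty body creates a database with the given name. */
    static HttpRequest createDb(String name) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + name))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest[] requests = { get("/"), get("/_all_dbs"), createDb("mydb") };
        for (HttpRequest request : requests) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(request.method() + " " + request.uri()
                    + " -> " + response.statusCode() + " " + response.body());
        }
    }
}
```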

You lose some of the joy of schema-less

But you do get lots of boilerplate ;-)

Oh, and strong typing.

Mapping Couch with Ektorp

You write a map function to extract data

You always return a key/value pair

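function(doc) {
  // Emit a view row only for documents whose title contains "Hi!"
  if (doc.title && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc); // key = title, value = the whole document
  }
}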

Writing a Couch MapReduce
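To see what the emitted key/value pairs buy you, here is a rough Java rendering of the same map step that collects its output into a sorted map, standing in for the B-Tree index. It is a simplification under assumptions: documents are modelled as `Map<String, String>`, the class `TitleView` is hypothetical, and a `TreeMap` keeps only one row per key whereas a real view can hold several.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TitleView {
    /** Equivalent of emit(doc.title, doc) for every doc whose title contains "Hi!". */
    static TreeMap<String, Map<String, String>> build(List<Map<String, String>> docs) {
        TreeMap<String, Map<String, String>> view = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            String title = doc.get("title");
            if (title != null && title.contains("Hi!")) {
                view.put(title, doc); // rows end up sorted by key
            }
        }
        return view;
    }

    public static void main(String[] args) {
        TreeMap<String, Map<String, String>> view = build(List.of(
                Map.of("title", "Oh Hi!"),
                Map.of("title", "Bye"),
                Map.of("title", "Hi! there"),
                Map.of("body", "no title at all")));
        // Sorted keys make range queries cheap, e.g. all titles from "H" up to "I".
        System.out.println(view.subMap("H", true, "I", false).keySet());
    }
}
```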

Stores data in a graph of nodes and relationships

Can handle billions of nodes per machine

Means you can query on relationships!

Supports ACID transactions

One 500 KB jar (!)

Dual-licensed GPL/Commercial

Neo4j
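"Query on relationships" means walking edges instead of joining tables. The sketch below shows the idea with a plain in-memory adjacency map and a breadth-first walk; it does not use the Neo4j API, and the class `FriendGraph`, the method names and the sample people are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FriendGraph {
    private final Map<String, Set<String>> knows = new HashMap<>();

    /** Adds an undirected "knows" relationship between two nodes. */
    void relate(String a, String b) {
        knows.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        knows.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    /** Everyone reachable from start in at most maxDepth hops, excluding start. */
    Set<String> within(String start, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        Deque<String> frontier = new ArrayDeque<>(List.of(start));
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String neighbour : knows.getOrDefault(node, Set.of())) {
                    if (seen.add(neighbour)) { // true only the first time we meet this node
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start);
        return seen;
    }

    public static void main(String[] args) {
        FriendGraph graph = new FriendGraph();
        graph.relate("glen", "ann");
        graph.relate("ann", "bob");
        graph.relate("bob", "cat");
        System.out.println(graph.within("glen", 2)); // friends and friends-of-friends
    }
}
```

In a relational schema the same question needs one self-join per hop; in a graph store the cost follows the number of relationships actually walked.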

Sample Code

http://blogs.bytecode.com.au/glen

http://twitter.com/glen_a_smith

http://grailspodcast.com/

Download all the source from today:

http://bitbucket.org/glen_a_smith/cjug-nosql-examples

Blogvertising

Looking for a good book?

Q & A
