nosql - life beyond the outer join

NoSQL - Life beyond the Outer Join Glen Smith ([email protected])

Upload: glenasmith

Post on 29-Nov-2014

DESCRIPTION

This talk was from the Canberra Java User Group in July 2010. You can download the source to go with these slides from http://bitbucket.org/glen_a_smith/cjug-nosql-examples. The NoSQL (Not Only SQL) movement has been gaining a lot of press over the last year as a means of scaling massive data storage, complex relationships and lightning-fast retrieval for the Web's biggest sites. This month we're taking a trip to the big end of town and looking at some of the backend technologies that are powering sites like Twitter, Facebook, LinkedIn, Reddit, Digg and Google. We'll be looking at popular Java clients and servers that play in the NoSQL space and have a brief survey of the following popular NoSQL platforms: Document Databases (CouchDB), Sophisticated Key/Value Stores (Voldemort), Graph Databases (Neo4j), and simple Key/Value stores (Memcached). It'll be a lightning tour of what each technology offers, some source code on how it works, and lots of headshifts about how to store data such that you don't ever need another Left Outer Join!

TRANSCRIPT

Page 1: NoSQL - Life Beyond the Outer Join

NoSQL - Life beyond the Outer Join

Glen Smith

([email protected])

Page 2: NoSQL - Life Beyond the Outer Join

Survey the landscape of NoSQL offerings

Learn some of the terminology

Look at some of the Java offerings in the space

Take away source to play with

Be able to ask questions (but you may not get answers)

Objectives

Page 3: NoSQL - Life Beyond the Outer Join

(N)ot (O)nly SQL not “Anti SQL”

Movement more than “one” technology

Distributed Storage System

Much weaker queries

Scale across many machines

Much larger data, much faster queries

What is NoSQL?

Page 4: NoSQL - Life Beyond the Outer Join

Inspired by Distributed Data Storage problems

Scale easily by adding servers

Not suited to all problem types, but super-suited to certain large problem types

High-write situations (eg activity tracking or timeline rendering for millions of users)

A lot of relational uses are really dumbed down (eg fetch by PK with update)

Why NoSQL?

Page 5: NoSQL - Life Beyond the Outer Join

Nothing ;-)

To scale RDBMS, your approach is typically:

Shard your datasource

Put in a bunch of read replicas

Put memcached in front of those

What could possibly go wrong?

Complex. Custom caching. Partitioning. Migrating of shards. Tons of moving parts.

What’s wrong with RDBMS?

Page 6: NoSQL - Life Beyond the Outer Join

Atomic (it happens or not, no partial completes)

Consistent (DB internals, ref integ, field validate)

Isolated (Can’t modify uncommitted data)

Durable (written to disk/transaction log)

But in a distributed db, life is not so simple...

How can I live w/o ACID?

Page 7: NoSQL - Life Beyond the Outer Join

In a distributed system, when you have state on more than one machine, pick any two:

Consistency (easy in read-only states – copy!)

Availability (can you get at your data? Is it up?)

Partition Tolerance (3 machines on one net, 3 on the other, with a broken link. How do you take updates when you can’t keep everyone up to date? What if you don’t agree on what’s up?)

The CAP theorem

Page 8: NoSQL - Life Beyond the Outer Join

Basically big distributed hashtables

Push all logic into the write (update two lists – one for userId, one for email)

Things don’t happen transactionally. These are two writes.

There is no free lunch. The programmer is now handling consistency problems.

You were thinking about query optimisation before, and now even more so.

How do these NoSQL things work?
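The "two lists" idea above can be sketched in plain Java. This is only an illustration: `UserDirectory`, its method names and the two in-memory maps are hypothetical stand-ins for a distributed key/value store, not code from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a distributed key/value store.
class UserDirectory {
    private final Map<String, String> usersById = new HashMap<>();
    private final Map<String, String> userIdByEmail = new HashMap<>();

    // All the "query logic" is paid for at write time: two separate writes,
    // with no transaction around them. If the second write failed, the
    // application itself would have to notice and repair the index.
    void save(String userId, String email, String profileJson) {
        usersById.put(userId, profileJson);   // write 1: the primary record
        userIdByEmail.put(email, userId);     // write 2: the email "index"
    }

    String findById(String userId) {
        return usersById.get(userId);
    }

    // Two key lookups replace a "WHERE email = ?" query.
    String findByEmail(String email) {
        String userId = userIdByEmail.get(email);
        return userId == null ? null : usersById.get(userId);
    }

    public static void main(String[] args) {
        UserDirectory dir = new UserDirectory();
        dir.save("u1", "a@example.com", "{\"name\":\"A\"}");
        System.out.println(dir.findByEmail("a@example.com"));
    }
}
```

The reads stay trivial (fetch by key), which is the trade the slide describes: cheap reads in exchange for more work, and more consistency risk, on every write.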

Page 9: NoSQL - Life Beyond the Outer Join

Digg – 3 TB

Facebook Inbox – 50 TB

eBay – 2 PB

Think about Twitter’s issues: billions of queries a second over terabytes of data.

How big are we talking?

Page 10: NoSQL - Life Beyond the Outer Join

Key-Value In-Memory stores (Memcached, Redis)

Key-Value “Eventually Consistent” stores (“Dynamo Clones” like Cassandra, Voldemort, Riak)

Document stores (CouchDB, MongoDB, JCR)

Graph Databases (Neo4j)

Tabular (“BigTable clones” like Hadoop/HBase)

The NoSQL Taxonomy

Page 11: NoSQL - Life Beyond the Outer Join

Developed for the original LiveJournal site

LRU, distributed hashtable

Logic is in both client and server

Used in Google App Engine, Facebook, Twitter

Ehcache now has similar service

Good for things that outlive an app server

Memcached
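The LRU behaviour mentioned above can be illustrated with the JDK alone. This is a single-process sketch, not how Memcached is implemented; the class name `LruCache` and the capacity of 2 are assumptions chosen for the demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny least-recently-used cache built on LinkedHashMap's access ordering.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // true = order entries by most recent access
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insert; returning true evicts the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");        // touch "a", so "b" is now least recently used
        cache.put("c", "3");   // over capacity: "b" is evicted
        System.out.println(cache.keySet());   // prints [a, c]
    }
}
```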

Page 12: NoSQL - Life Beyond the Outer Join

Clients know how to: Send items to servers (consistent hashing)

What to do when a server fails

How to fetch keys from servers

Can “weight” servers according to their capacity

Servers know how to: Store items they receive

Expire them from the cache

No inter-server comms – everything is unaware

How does it work?
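A rough Java sketch of the client-side logic described above: a consistent hash ring that picks a server for each key, supports weights, and keeps working when a server is removed. `HashRing`, the server names and the choice of 100 virtual points per unit of weight are all assumptions for illustration; real Memcached clients differ in the details.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class HashRing {
    // Position on the ring -> server that owns that position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // A higher weight places more virtual points on the ring, so that
    // server receives a proportionally larger share of the keys.
    void addServer(String server, int weight) {
        for (int i = 0; i < weight * 100; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // On failure, drop the server's points; only its keys move elsewhere.
    void removeServer(String server) {
        ring.values().removeIf(server::equals);
    }

    // Walk clockwise from the key's position to the next server point,
    // wrapping around to the start of the ring if necessary.
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers available");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of an MD5 digest, used as the ring position.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the servers never talk to each other, every client must build the same ring from the same server list; that shared configuration is what keeps clients agreeing on where a key lives.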

Page 13: NoSQL - Life Beyond the Outer Join

Sample Code

Page 14: NoSQL - Life Beyond the Outer Join

Less than Memcached, but also more!

Not a cache, but a distributed key/value store

Developed by LinkedIn

Works on distributed hashmap w/failover

Logic can be in client/server or just server

Pluggable storage (mysql,bdb,mock)

Pluggable serialization (JSON, Google PB, etc)

Voldemort

Page 15: NoSQL - Life Beyond the Outer Join

Eventual consistency – data will come into sync but not immediately on the write. In practice “pretty soon” is milliseconds later

We are actually used to this – eg Google indexes update every so often.

Guarantees to read your own writes (eg your profile on LinkedIn)

Tuneable to better performance/weaker consistency

“Relaxed” Consistency
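The tuning trade-off above can be simulated in a few lines of Java. This sketch assumes a Dynamo-style scheme with N replicas, R replicas consulted per read and W replicas written before a put is acknowledged; it uses a single version counter where Voldemort itself uses vector clocks. `QuorumStore` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QuorumStore {
    private static final class Versioned {
        final long version;
        final String value;
        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final List<Map<String, Versioned>> replicas = new ArrayList<>();
    private final int readQuorum;
    private final int writeQuorum;
    private long clock = 0;   // simplification: one global version counter

    QuorumStore(int n, int r, int w) {
        for (int i = 0; i < n; i++) {
            replicas.add(new HashMap<>());
        }
        this.readQuorum = r;
        this.writeQuorum = w;
    }

    // Acknowledge once the first W replicas have the value; the others lag.
    void put(String key, String value) {
        Versioned v = new Versioned(++clock, value);
        for (int i = 0; i < writeQuorum; i++) {
            replicas.get(i).put(key, v);
        }
    }

    // Worst case for the reader: ask the LAST R replicas, keep the newest.
    String get(String key) {
        Versioned newest = newestIn(key, replicas.size() - readQuorum, replicas.size());
        return newest == null ? null : newest.value;
    }

    // Stands in for background replication catching up ("pretty soon").
    void sync(String key) {
        Versioned newest = newestIn(key, 0, replicas.size());
        if (newest != null) {
            for (Map<String, Versioned> replica : replicas) {
                replica.put(key, newest);
            }
        }
    }

    private Versioned newestIn(String key, int from, int to) {
        Versioned newest = null;
        for (int i = from; i < to; i++) {
            Versioned v = replicas.get(i).get(key);
            if (v != null && (newest == null || v.version > newest.version)) {
                newest = v;
            }
        }
        return newest;
    }
}
```

With N=3, R=2, W=2 the read and write sets always overlap (R + W > N), so a read sees the latest write. With N=3, R=1, W=1 reads and writes are cheaper, but a read can return stale or missing data until `sync` runs.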

Page 16: NoSQL - Life Beyond the Outer Join

Data is automatically replicated

Partitioning ensures all servers have subset

Server failure is handled transparently

Data is rebalanced when servers added/removed

Serialization is pluggable

Apache License

What’s attractive?

Page 17: NoSQL - Life Beyond the Outer Join

“We were able to move applications that needed to handle hundreds of millions of reads and writes per day from over 400ms to under 10ms while simultaneously increasing the amount of data we store.”

Impressive Performance

Page 19: NoSQL - Life Beyond the Outer Join

Starting the server (or deploy as a .war):

bin\voldemort-server.bat config\single_node_cluster

Starting the console:

bin\voldemort-shell.bat test tcp://localhost:6666

Run some queries:

put "hello" "world"

get "hello"

put "hello" "world 2.0"

delete "hello"

Sample Script

Page 20: NoSQL - Life Beyond the Outer Join

Sample Code

Page 21: NoSQL - Life Beyond the Outer Join

Document-Oriented Db – No Schema

Written in Erlang (!) by a Notes Dev (!!!)

Everything is stored as JSON, accessed through a RESTful API

Clever replication concepts – works in disconnected settings

Every write creates a new version of the document

Map/Reduce baked in

Apache License

CouchDB
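The "every write is a new version" rule can be sketched as an optimistic revision check. CouchDB revisions are strings in the `_rev` field; this illustration simplifies them to integers, and `RevisionedStore` is a hypothetical in-memory class, not a CouchDB client.

```java
import java.util.HashMap;
import java.util.Map;

class RevisionedStore {
    static final class Doc {
        final int rev;
        final String json;
        Doc(int rev, String json) {
            this.rev = rev;
            this.json = json;
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();

    // The writer must name the revision it last read (0 for a new document).
    // A stale revision is rejected instead of silently overwriting newer data.
    int put(String id, int expectedRev, String json) {
        Doc current = docs.get(id);
        int currentRev = current == null ? 0 : current.rev;
        if (expectedRev != currentRev) {
            throw new IllegalStateException(
                    "conflict on " + id + ": expected rev " + expectedRev
                            + " but current rev is " + currentRev);
        }
        docs.put(id, new Doc(currentRev + 1, json));
        return currentRev + 1;
    }

    Doc get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        RevisionedStore store = new RevisionedStore();
        int rev = store.put("talk", 0, "{\"title\":\"NoSQL\"}");
        rev = store.put("talk", rev, "{\"title\":\"NoSQL 2.0\"}");
        System.out.println("current rev: " + rev);
    }
}
```

No locks are needed: two disconnected writers can both edit, and the conflict only surfaces when the second one tries to save against a revision that has moved on.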

Page 22: NoSQL - Life Beyond the Outer Join

Schemaless operation – ad hoc data

Incremental replication (great for disconnected settings)

Great fault-tolerance (with versioned conflicts)

Fast query with flexibility (MapReduce)

What’s attractive?

Page 23: NoSQL - Life Beyond the Outer Join

Popularized by Google’s MapReduce framework

Map functions collect documents matching criteria and create a B-Tree

Reduce functions operate on the B-Tree

Everything happens in parallel on many machines

Example: distributed grep

So what is this Map/Reduce thing?
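The distributed grep example can be approximated on one machine with Java parallel streams. This is a sketch of the shape of the computation only: the threads of a parallel stream stand in for the many machines, and `GrepMapReduce` and the sample log lines are made up for the demo.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GrepMapReduce {
    // Map phase: each worker keeps only the lines that match.
    // Reduce phase: matching lines are grouped by key and counted,
    // with partial results from each worker merged into one sorted map.
    static Map<String, Long> grep(List<String> lines, String needle) {
        return lines.parallelStream()
                .filter(line -> line.contains(needle))
                .collect(Collectors.groupingBy(
                        line -> line, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "GET /index", "POST /login", "GET /index", "GET /about");
        System.out.println(grep(log, "GET"));   // prints {GET /about=1, GET /index=2}
    }
}
```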

Page 24: NoSQL - Life Beyond the Outer Join

http://127.0.0.1:5984/

http://127.0.0.1:5984/_all_dbs

http://127.0.0.1:5984/mydb (PUT)

http://127.0.0.1:5984/_utils/ (Futon)

The Naked Couch

Page 25: NoSQL - Life Beyond the Outer Join

You lose some of the joy of schema-less

But you do get lots of boilerplate ;-)

Oh, and strong typing.

Mapping Couch with Ektorp

Page 26: NoSQL - Life Beyond the Outer Join

You write a map function to extract data

You always return a key/value pair

function(doc) {
  // Skip documents without a title, then emit a key/value pair for matches
  if (doc.title && doc.title.indexOf("Hi!") > -1) {
    emit(doc.title, doc);
  }
}

Writing a Couch MapReduce
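For readers who think in Java, here is a rough counterpart of the map function above. The emitted pairs go into a sorted index (a `TreeMap` stands in for CouchDB's B-tree), which is what makes key-range queries cheap. `TitleView` and `keysBetween` are hypothetical names; this is not the CouchDB or Ektorp API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TitleView {
    // Sorted index: emitted key -> documents emitted under that key.
    private final TreeMap<String, List<Map<String, String>>> index = new TreeMap<>();

    // Counterpart of the map function: called once per document.
    void map(Map<String, String> doc) {
        String title = doc.get("title");
        if (title != null && title.contains("Hi!")) {
            // Equivalent of emit(doc.title, doc)
            index.computeIfAbsent(title, k -> new ArrayList<>()).add(doc);
        }
    }

    // Range query over the sorted keys, inclusive at both ends.
    List<String> keysBetween(String startKey, String endKey) {
        return new ArrayList<>(index.subMap(startKey, true, endKey, true).keySet());
    }
}
```

The index is built when documents are written or changed, not when the query runs, so a lookup is just a walk over already-sorted keys.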

Page 27: NoSQL - Life Beyond the Outer Join

Stores data in a graph of nodes and relationships

Can handle billions of nodes per machine

Means you can query on relationships!

Supports ACID transactions

One 500kb jar (!)

Dual-licensed GPL/Commercial

Neo4j
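To show what "query on relationships" means, here is a small in-memory graph in plain Java with a breadth-first traversal along one relationship type. It does not use the Neo4j API; `SocialGraph`, the relationship names and the people are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class SocialGraph {
    // node -> (relationship type -> nodes reached by that relationship)
    private final Map<String, Map<String, Set<String>>> edges = new HashMap<>();

    // Adds a directed, typed relationship between two nodes.
    void relate(String from, String type, String to) {
        edges.computeIfAbsent(from, k -> new HashMap<>())
                .computeIfAbsent(type, k -> new LinkedHashSet<>())
                .add(to);
    }

    Set<String> neighbours(String node, String type) {
        return edges.getOrDefault(node, Collections.<String, Set<String>>emptyMap())
                .getOrDefault(type, Collections.<String>emptySet());
    }

    // Everyone reachable from start within maxDepth hops of the given type.
    Set<String> traverse(String start, String type, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String node : frontier) {
                for (String n : neighbours(node, type)) {
                    if (!n.equals(start) && seen.add(n)) {
                        next.add(n);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        SocialGraph g = new SocialGraph();
        g.relate("alice", "KNOWS", "bob");
        g.relate("bob", "KNOWS", "carol");
        g.relate("carol", "KNOWS", "dave");
        g.relate("alice", "WORKS_WITH", "erin");
        System.out.println(g.traverse("alice", "KNOWS", 2));   // prints [bob, carol]
    }
}
```

In a relational schema the same "friends of friends" question needs a self-join per hop; in a graph it is a walk whose cost depends on the neighbourhood visited, not on the total table size.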

Page 28: NoSQL - Life Beyond the Outer Join

Sample Code

Page 30: NoSQL - Life Beyond the Outer Join

Looking for a good book?

Q & A