nosql for great good [hanoi.rb talk]
NoSQL database for great good
@huydx hanoi.rb
$> whoami
Huy: software developer, Tokyo-based, Ruby/Scala user
nickname: @huydx
Disclaimer
This talk will not go into detail about any particular NoSQL database
I'm going to talk about: when we need to choose a NoSQL database, how should we think?
What do people often think about NoSQL?
• As a cache
• As magic that can make "any" web system faster
Your system is slow?
Just use NoSQL
RDBMS is shit
RDBMS is not slow
NoSQL is not the cure for everything
RDBMS is awesome
• Can scan 7M rows / sec with an index!
• Can handle very big data (Facebook)
• Has a very flexible query language (SQL)
• Has some awesome analytics features (window functions in PostgreSQL)
• Has ACID properties
https://www.percona.com/blog/2008/04/09/how-fast-can-mysql-process-data/
Why ACID is important
• Atomicity: protects transactions (all or nothing)
• Consistency: protects data correctness
• Isolation: protects data from concurrency
• Durability: protects data from failure
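To make "all or nothing" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the names in it, and the `transfer` helper are hypothetical and only serve the illustration; the talk itself does not show any code. The credit runs first on purpose, so that a failing debit has something to roll back.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

print(transfer(conn, "alice", "bob", 30))   # fits within alice's balance
print(transfer(conn, "alice", "bob", 500))  # would overdraw alice, so nothing changes
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

The second transfer leaves both balances untouched, even though the credit to bob had already been applied inside the transaction.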
ACID makes a database something you can
TRUST
So
• RDBMS is way better than you thought
• You should learn to do RDBMS the right way
• Learn how to get the best performance from an RDBMS (index tuning, query optimization, data modeling, master-slave replication, monitoring, sharding the right way...)
https://www.percona.com/blog/2014/03/27/a-conversation-with-5-facebook-mysql-gurus/
But this talk is about NoSQL!!!!!!
Where is an RDBMS not a good fit?
• Nature of data: when data is not row-column style (multidimensional data)
• How your data scales: data sharding (you don't want to shard)
• ACID is great, but it degrades performance
• Single point of failure: single master
• Data usage: when you need realtime, fast data
https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/
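One reason manual sharding hurts can be sketched in a few lines. This assumes the simplest possible scheme, "hash the key, take it modulo the number of shards"; the talk does not say which scheme it has in mind, and the function names and `user:N` keys below are made up for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key (naive modulo scheme, assumed for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_keys(keys, old_shards: int, new_shards: int):
    """Return the keys that land on a different shard after changing the shard count."""
    return [
        k for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    ]


keys = [f"user:{i}" for i in range(10_000)]
moved = moved_keys(keys, 4, 5)
print(f"{len(moved) / len(keys):.0%} of keys change shard when going from 4 to 5 shards")
```

With this naive scheme, adding a single shard relocates most of the data, which is the kind of operational pain the linked article warns about.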
Let's talk about NoSQL
We have plenty of players
But when, and how to use them?
We have a decent answer: It depends!!
What do you want to store?
• Geospatial data?
• Important user data? (passwords, payment information...)
• Cache data?
• Realtime analytics data (write/read intensive)
Where do you want to store?
• In memory?
• On disk?
  • On slow disk (HDD)
  • On fast disk (SSD)
How big is your data?
• Able to fit into memory?
• Able to fit into single machine?
• Not able to fit even into hundreds of machines?
Are there any factors for categorizing NoSQL databases?
Data Model + Query Model
NoSQL categorized by data model
Document
Pairs each key with a complex data structure known as a document (JSON, BSON).
MongoDB, CouchDB, RethinkDB
Column Family
One row key pairs with many columns (compare rows in an RDBMS), which makes block partitioning easy
Cassandra, HBase, Hypertable
Graph
Stores data as nodes + links between nodes
Neo4j, FlockDB
KVS
Just a key + a value (the value can be complex, but it cannot be as wide as a column family)
Riak, Memcached, Redis, Couchbase
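As a rough sketch, the four data models can be pictured with plain Python dictionaries. This only shows the shape of the data, not how any of the listed databases store it internally; the record contents and the `followers_of` helper are invented for illustration.

```python
import json

# Document: one key -> one nested, self-describing document the database can look inside
document_store = {
    "user:1": {"name": "huy", "langs": ["ruby", "scala"], "address": {"city": "Tokyo"}},
}

# Column family: one row key -> many named columns; each row may have different columns
column_family = {
    "user:1": {"name": "huy", "lang:ruby": "1", "lang:scala": "1", "city": "Tokyo"},
    "user:2": {"name": "an"},
}

# Graph: nodes plus explicit links between nodes
graph = {
    "nodes": {"user:1": {"name": "huy"}, "user:2": {"name": "an"}},
    "edges": [("user:2", "follows", "user:1")],
}

# KVS: one key -> one opaque value; the store does not look inside it
kvs = {"user:1": json.dumps({"name": "huy", "city": "Tokyo"})}


def followers_of(g, node_id):
    """Walk the edge list to find who follows `node_id`."""
    return [src for (src, rel, dst) in g["edges"] if rel == "follows" and dst == node_id]


print(document_store["user:1"]["address"]["city"])  # query inside the document
print(sorted(column_family["user:1"]))              # list one row's column names
print(followers_of(graph, "user:1"))                # relationship query
print(json.loads(kvs["user:1"])["city"])            # client must decode the value itself
```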
What about the merits / demerits of each data model?
The data model affects how we query data
Users always want the query method to be as flexible as possible
But sometimes, we have to face the trade-off between
flexibility and scalability
• Document: queries can be very flexible because documents are examinable (MongoDB has a very rich query language). The data model can also change very flexibly
• Column Family: just a key-value store with very wide fields, which makes it very fast to look up a bunch of values
• Graph: for very special cases where you need to store and query relationships (followers on Twitter)
• KVS: when you really need high performance and just need to look up simple values
So it really depends, right?
Data model for NoSQL is hard!
So be careful with your selection
Sometimes the borderline between data models is blurred
We need other factors to consider
Scalability
First we need to know about the CAP theorem
http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf
We can only have two of them (Consistency, Availability, Partition tolerance)!!!!!!!!
NO MORE!!!!
http://blog.flux7.com/blogs/nosql/cap-theorem-why-does-it-matter
Just ask yourself: what do you care about?
• You need very fast writes and reads, and data can be a little bit stale -> A + P
• You need transactions and everyone must see the same view, but sometimes something must be locked -> C + P
• You don't need a distributed system that is fault-tolerant to network problems -> C + A
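The A + P versus C + P choice can be sketched with a toy two-replica store. This is a deliberately simplified model, not how any real database implements replication; `TwoNodeStore`, `Unavailable`, and the `partitioned` flag are hypothetical names introduced only for this example.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class TwoNodeStore:
    """Toy store with two replicas, a and b: writes arrive at a, reads are served by b."""

    def __init__(self, mode):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.a = {}
        self.b = {}
        self.partitioned = False  # True means a and b cannot talk to each other

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than let the replicas diverge
            raise Unavailable("cannot replicate write during partition")
        self.a[key] = value
        if not self.partitioned:
            self.b[key] = value  # replicate while the network is healthy

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # CP: b cannot confirm it holds the latest value, so it refuses
            raise Unavailable("cannot confirm latest value during partition")
        return self.b.get(key)  # AP: always answers, possibly with stale data


ap = TwoNodeStore("AP")
ap.write("x", 1)
ap.partitioned = True
ap.write("x", 2)     # accepted by a, but never reaches b
print(ap.read("x"))  # b still answers, with the stale value

cp = TwoNodeStore("CP")
cp.write("x", 1)
cp.partitioned = True
try:
    cp.write("x", 2)
except Unavailable as err:
    print("CP store refused:", err)
```

During the partition the AP store stays responsive but serves old data, while the CP store gives up availability to keep a single view.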
So we have two factors to think about. What else?
Operation
Programmers may not care, but
infrastructure engineers do
What factors affect operation?
• What is your database's distribution model: how does it shard and replicate (master-slave or p2p)?
• Does your database run on the JVM? (operating a JVM system is waaaayyy more bothersome than a system written in C or C++)
• Does your database have a single point of failure?
• Is your database optimized for SSD only?
Operation is hard
When you fail at operation, you lose your data
So choose what you know very well
Conclusion
• It really depends!!!!
• Ask yourself: do you really need to use NoSQL?
• First know your requirement, know your data
• Investigate carefully before choosing any solution (when you choose badly, you lose your data)