intro to nosql
TRANSCRIPT
© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
Introduction to NoSQL
Jim Driscoll, MarkLogic
Agenda
  History of NoSQL
  NoSQL Terminology
  Types of NoSQL Databases (with examples of each)
HISTORY OF NOSQL
A Short History of Data
  Application Specific Databases
    Size is paramount
  Relational Databases
    Size matters…
    …but break from the application silo
    …and provide data integrity
  NoSQL Databases
    Agility
    Scalability
    Speed
RELATIONAL DOESN’T MEAN WHAT YOU THINK IT DOES
What's wrong with Relational?
  Nothing, it's perfect for square data
    ...where you know the relationships in advance
    ...where the schema doesn't change often
    ...where all the data can fit on one machine
    ...where a separate disk seek for every join isn't an issue (or can be cached)
SEEK AND YOU WILL FIND…IN ABOUT 10MS
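
A rough back-of-the-envelope for the 10 ms figure, assuming every join costs one uncached disk seek and a single disk serves one seek at a time. The join count used in the usage note is made up for illustration.

```python
SEEK_MS = 10  # approximate seek time quoted on the slide


def query_latency_ms(uncached_joins, seek_ms=SEEK_MS):
    """Lower bound on latency if every join costs one disk seek."""
    return uncached_joins * seek_ms


def max_queries_per_second(uncached_joins, seek_ms=SEEK_MS):
    """Upper bound for one disk that handles a single seek at a time."""
    return 1000 // query_latency_ms(uncached_joins, seek_ms)
```

Under these assumptions, a query with 5 uncached joins spends at least 50 ms seeking, which caps one disk at about 20 such queries per second.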
The Rise of NoSQL
  1998  Carlo Strozzi coins term
  2001  MarkLogic founded
  2003  Google File System paper
  2004  Google MapReduce paper
  2006  Google Bigtable paper
  2007  Amazon Dynamo paper
  2009  Eric Evans popularizes term
MEMCACHED
Memcached
  Developed at LiveJournal as a frontend cache for websites
  First released in 2003
  Keep disk access at a minimum, pool memory on many machines
  So useful it found wide popularity, still under active development
Memcached
  High Performance, Distributed Memory Object Caching System
    Distributed – runs across many computers
    Memory – runs without touching disk
    Object cache – designed to hold small lumps of data
    High performance – because it never touches disk, and the objects are small, it’s optimized for speed
Memcached
  Client server system
  Servers are unaware of each other
  Clients determine server to use via hashing
  Servers keep content as an LRU cache
    So all data is transitory
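
The client-side hashing and LRU behaviour described above can be sketched in a few lines. This is an illustrative model only, not the memcached protocol or any real client library: `LRUServer`, `Client`, and the MD5-modulo server selection are assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class LRUServer:
    """One cache server: a bounded in-memory LRU map (hypothetical stand-in)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)          # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry

    def get(self, key):
        if key not in self.items:
            return None                      # miss: the caller falls back to the database
        self.items.move_to_end(key)
        return self.items[key]


class Client:
    """The client, not the servers, decides where a key lives."""

    def __init__(self, servers):
        self.servers = servers

    def _pick(self, key):
        # Assumed scheme: hash the key and take it modulo the server count.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return self.servers[int.from_bytes(digest[:4], "big") % len(self.servers)]

    def set(self, key, value):
        self._pick(key).set(key, value)

    def get(self, key):
        return self._pick(key).get(key)


servers = [LRUServer(capacity=2) for _ in range(3)]
client = Client(servers)
client.set("user:1", "alice")
```

Because each server only evicts and never persists, a later `client.get("user:1")` may return `None` once enough other keys have landed on the same server, which is why the data has to be treated as transitory.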
SHARDING
Sharding to Scale Out
BIGTABLE
Bigtable
  Created by Google in 2004
    …to store massive amounts of data
  Made public in famous 2006 paper
  Used throughout Google
    Gmail, Google Maps, YouTube, Web Indexing, etc.
    Reportedly over 100 internal projects
  Never shipped externally as a product
    …but available for public use as part of the App Engine hosting API
Bigtable
  Rows are composed of columns, which in turn belong to column families
  Column families are essentially typing, validation and expiration info
    It’s helpful to think of them as the “tables”
  Lookups are done via a Row Key
  Every cell is versioned via timestamp, and sparsely stored
  System is robust and crash resistant
    Can survive the crash of any machine, including the master
  Scale out architecture
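
One way to picture the data model above is as a nested map: row key, then "family:qualifier" column, then timestamped versions. The sketch below is an in-memory toy, not Bigtable's storage format or API; `put`, `get`, and the sample row keys and column names are hypothetical.

```python
from collections import defaultdict

# table[row_key]["family:qualifier"] -> list of (timestamp, value), newest first.
# Cells that were never written simply do not exist (sparse storage).
table = defaultdict(lambda: defaultdict(list))


def put(row_key, column, value, timestamp):
    versions = table[row_key][column]
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0], reverse=True)


def get(row_key, column, as_of=None):
    """Return the newest value, or the newest one at or before `as_of`."""
    for timestamp, value in table.get(row_key, {}).get(column, []):
        if as_of is None or timestamp <= as_of:
            return value
    return None


put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
put("com.example/about", "anchor:home", "About us", timestamp=1)
```

Every lookup starts from the row key, and the two rows here hold different columns without any placeholder values for the ones they lack.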
MAP / REDUCE
Map / Reduce
  Massively Distributed Processes
  Map - sort, filter, transform data
  Reduce - summarize data (iteratively)
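
The classic word count shows the two phases in miniature. This single-process sketch only illustrates the shape of the computation; in a real system the map and reduce calls would run on many machines, and the `shuffle` helper here stands in for the framework's grouping step. The sample documents are invented.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: transform one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as the framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key by folding them pairwise."""
    return key, reduce(lambda a, b: a + b, values)


documents = ["NoSQL scales out", "SQL scales up", "NoSQL is not only SQL"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Each `map_phase` call needs only its own document, and each `reduce_phase` call needs only the values for its own key, which is what makes the work easy to distribute.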
HADOOP
Hadoop
  First envisioned as “Nutch” at the Internet Archive in 2002
    There were hundreds of millions of webpages to index
  Early versions heavily influenced by the Google File System and MapReduce papers
  Goal: Perform work on large datasets using commodity machines
  Development moved to Yahoo in 2006
  Open Sourced to Apache, as Hadoop
    A File system (HDFS)
    A Task Runner (MapReduce)
    A Task Manager (YARN)
  Note: Not a database
Hadoop
  Really good at…
    Batch Processing
    …on incredibly large data sets
  Not so good parts
    Latency
    Updates
    Usability
DYNAMO
Amazon Dynamo
  Created to power Amazon’s Web store
  Writing with low latency more important than consistency
  Techniques first made public in 2007 paper
  Never externally shipped…
    …but huge influence on market
  Used for a variety of critical portions of Amazon’s site
    Shopping cart
    User Session
  Succeeded by DynamoDB
    Similar name, but whole new architecture (with better consistency)
Amazon Dynamo
  Distributed Key Value store
  “always writable”
  low latency reads and writes, at the expense of consistency
    asynchronous replication on put() operations
    …means that get() may return a stale value
    updates during a network partition can result in conflicts
    …and the application must handle them
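
The stale read and the application-level conflict handling can be shown with two toy replicas. This is a deliberately simplified sketch: it uses a per-key counter where the Dynamo paper uses vector clocks, and `Replica`, `replicate`, and the shopping-cart merge are hypothetical names chosen for the example.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)  # always accepted: "always writable"

    def get(self, key):
        return self.data.get(key, (0, None))[1]


def replicate(source, target, merge):
    """Asynchronous sync step, run some time after the writes were acknowledged."""
    for key, (version, value) in list(source.data.items()):
        t_version, t_value = target.data.get(key, (0, None))
        if version > t_version:
            target.data[key] = (version, value)
        elif version == t_version and value != t_value:
            # Concurrent writes: neither side wins, so the application merges.
            merged = merge(value, t_value)
            source.data[key] = target.data[key] = (version + 1, merged)


a, b = Replica(), Replica()
a.put("cart", frozenset({"book"}))
stale = b.get("cart")  # replication has not run yet, so b has nothing
replicate(a, b, merge=lambda x, y: x | y)

# Network partition: each side accepts a different update to the same cart.
a.put("cart", frozenset({"book", "pen"}))
b.put("cart", frozenset({"book", "lamp"}))
replicate(a, b, merge=lambda x, y: x | y)  # the application chooses set union
```

Set union is one plausible merge policy for a cart; the point is that the store cannot pick it for you.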
TERMINOLOGY
What is(n't) NoSQL
  No SQL
  Schema-less
  Open Source
  BASE (Eventually Consistent)
ACID
  Atomicity
    Everything in a transaction either succeeds or fails together
  Consistency
    Nothing is saved unless it passes consistency rules
  Isolation
    No two processes can interfere with each other
  Durability
    Once saved, data cannot be lost due to system failure
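
Atomicity and consistency rules are easy to see with Python's built-in `sqlite3` module. The sketch below is an illustration with an invented `accounts` table and a hypothetical `transfer` helper; the `CHECK` constraint plays the role of the consistency rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # the consistency rule
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the rejected transfer the credit to bob has already been applied when the debit violates the rule, and the rollback undoes it, so no half-finished transfer is ever visible.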
BASE
  Basically Available
  Soft state
  Eventually consistent
What happens without consistency?
  Absolute fastest performance at lowest hardware cost
  Highest global data availability at lowest hardware cost
  Working with one document or row at a time
  Writing advanced code to create your own consistency model
  Eventually consistent data
  Some inconsistent data that can’t be reconciled
  Some missing data that can’t be recovered
  Some inconsistent query results
What is NoSQL?
  Database
  Non-relational
  Schema on read
  Scale out architecture
    Cluster friendly / Cloud Ready
TYPES OF NOSQL DATABASES
Types of NoSQL Databases
  Graph Databases
  Wide Column Databases
  Key Value Databases
  Document Databases
KEY VALUE STORES
MemcacheDB
  Very early KV implementation (2008)
  KV Store based on Memcached source, with Berkeley DB persistent store
  Speaks the memcached protocol
  Development stopped (2009), but still quite popular
  For when you like Memcached, but want persistence
REDIS
Redis
  First released in 2009
    Sponsored by VMware, then Pivotal
  Name means Remote Dictionary Server
  Fully in memory key value store
    Whole db must reside in memory of one machine
    Limits scalability, at the benefit of performance
  Often used as a front end cache for other NoSQL databases
Redis
  Not just strings as values:
    Lists of strings
    Sets of strings (collections of non-repeating unsorted elements)
    Sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score)
    Hashes where keys and values are strings
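
The value types above map closely onto ordinary Python containers, which makes their semantics easy to model. The sketch below uses plain dictionaries rather than a Redis client or server; the key names and the `members_by_score` helper are hypothetical, and breaking score ties by member name is an assumption of this model.

```python
# Each top-level dict stands in for keys of one value type.
strings = {"page:home:title": "Welcome"}
lists = {"queue:jobs": ["resize", "email"]}             # ordered, duplicates allowed
sets = {"tags:post:1": {"nosql", "redis"}}              # unordered, no duplicates
hashes = {"user:1": {"name": "alice", "plan": "free"}}  # string fields -> string values
sorted_sets = {"leaderboard": {"alice": 42.0, "bob": 17.5}}  # member -> score


def members_by_score(key):
    """Members of a sorted set, lowest score first (ties broken by member name)."""
    ranked = sorted(sorted_sets[key].items(), key=lambda ms: (ms[1], ms[0]))
    return [member for member, _score in ranked]


sets["tags:post:1"].add("redis")          # already present: the set does not grow
sorted_sets["leaderboard"]["bob"] = 99.0  # re-adding a member only updates its score
ranking = members_by_score("leaderboard")
```

After the score update, bob moves from first to last in the ascending ranking while the sorted set still holds exactly two members.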
Redis
  Master / slave replication - slave may be master to another slave
    allowing tree replication
    also publish/subscribe API
    slaves may be updated separately from master, allows inconsistencies (!)
  Persistent store
    Append only journal
    Flushed to disk once per second under the default fsync policy
DOCUMENT STORES
Document vs Key Value Stores
  Extension of Key Value - the value is a document
  but also Structurally aware
    Indexed searches
  Self-describing document formats
    CouchDB – JSON
    MongoDB – BSON
    MarkLogic – JSON, XML
MARKLOGIC
MarkLogic
  Founded in 2001, founders were search engine experts
  Document centric database with search engine features
  Stores and indexes XML, JSON, text and binaries
  Enterprise NoSQL
    ACID transactions (including XA)
    HA/DR
    Government grade security
MarkLogic
  Universal Index – index all the things
    Index words, elements, the relationships of words and elements
    Many indexes (automatically) used at once, resolving queries without touching disk
  Search on ranges, free text, field values, more…
  Shared nothing architecture, transactions via MVCC
  Automatic partitioning and balancing
  Hadoop support (works on HDFS, and with Map/Reduce jobs)
  Includes a webserver for building RESTful applications
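
The idea of resolving a query by combining several indexes can be illustrated with two small inverted indexes. This is a generic textbook sketch, not MarkLogic's implementation or API: the index layout, the `search` function, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

documents = {
    "doc1": {"title": "intro to nosql", "body": "document stores scale out"},
    "doc2": {"title": "sql basics", "body": "relational stores use joins"},
}

word_index = defaultdict(set)          # word -> ids of documents containing it
element_word_index = defaultdict(set)  # (element, word) -> ids of documents

for doc_id, elements in documents.items():
    for element, text in elements.items():
        for word in text.split():
            word_index[word].add(doc_id)
            element_word_index[(element, word)].add(doc_id)


def search(words=(), element_words=()):
    """Intersect posting sets from both indexes; the documents are never read."""
    postings = [word_index.get(w, set()) for w in words]
    postings += [element_word_index.get(ew, set()) for ew in element_words]
    return sorted(set.intersection(*postings)) if postings else []
```

For instance, `search(words=["stores"], element_words=[("title", "nosql")])` narrows the two documents that mention "stores" down to the one whose title contains "nosql", using only the index entries.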
MarkLogic
  More than a document store
    Range indexes allow in-memory column operations
    Triple store, supporting RDF Triples and SPARQL
  High Availability – multiple copies of updates saved transactionally
  Disaster Recovery – copies sent to remote site with a window
  Free to download and try out with a developer license
MONGODB
MongoDB
  Development began at 10gen in 2007
  Name from “humongous”
  10gen originally set out to build a hosting platform like Google App Engine
  1.4 considered first “production ready” release, 2010
  Stores and retrieves BSON documents
  Horizontally scaling
MongoDB
  Stores data in proprietary format
    BSON, similar to JSON with more data types
  Search on field, on range, or on regex
  Single index per query (secondary index optional)
  Replication of databases as master/slave, with (tunable) eventual consistency
  Sharding handled via a shard key, splitting by range
    Be sure the key is evenly distributed
  Client APIs in many languages
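
Why an evenly distributed shard key matters is easy to demonstrate with a toy range router. This sketch does not use MongoDB or its drivers; the chunk `boundaries`, `shard_for`, and `load` are hypothetical, and hashing the ids is just one assumed way to spread keys across ranges.

```python
import bisect
import hashlib
from collections import Counter

# Exclusive upper bounds of each key range; the last shard takes everything above.
boundaries = [1000, 2000, 3000]  # three boundaries -> four shards (0..3)


def shard_for(key_value):
    """Route a shard-key value to the shard that owns its range."""
    return bisect.bisect_right(boundaries, key_value)


def load(keys):
    """Count how many keys land on each shard."""
    return Counter(shard_for(k) for k in keys)


# Monotonically increasing ids: every new insert falls into the top range.
sequential_ids = range(3000, 3100)

# The same ids, hashed first so that they scatter across the key space.
spread_keys = [
    int(hashlib.md5(str(i).encode("utf-8")).hexdigest(), 16) % 4000
    for i in sequential_ids
]

hot = load(sequential_ids)
balanced = load(spread_keys)
```

With the sequential ids, `hot` shows all 100 inserts on shard 3, while the hashed keys are spread over several shards.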
WIDE COLUMN STORES
Column Stores
  Descended from Bigtable approach
  Excellent for sparse data
  Column families need to be specified up front
    But still stored sparsely
  No way to list all the columns in the database
  Append only
    Updates via timestamp
    Deletes via tombstone marker
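
The append-only behaviour can be modelled with a log that is never modified in place: an update is a newer record, and a delete is a special marker record. This is a minimal sketch with hypothetical names (`log`, `write`, `delete`, `read`, `TOMBSTONE`) and a simple counter standing in for timestamps; real systems also compact the log, which is omitted here.

```python
import itertools

TOMBSTONE = object()         # marker record meaning "this cell was deleted"
_clock = itertools.count(1)  # stand-in for write timestamps
log = []                     # append-only: (timestamp, row_key, column, value)


def write(row_key, column, value):
    log.append((next(_clock), row_key, column, value))


def delete(row_key, column):
    write(row_key, column, TOMBSTONE)  # a delete is just another append


def read(row_key, column):
    """Return the value with the newest timestamp, unless it is a tombstone."""
    latest = None
    for timestamp, r, c, value in log:
        if (r, c) == (row_key, column) and (latest is None or timestamp > latest[0]):
            latest = (timestamp, value)
    if latest is None or latest[1] is TOMBSTONE:
        return None
    return latest[1]


write("user1", "profile:email", "old@example.com")
write("user1", "profile:email", "new@example.com")  # update: newer timestamp wins
current = read("user1", "profile:email")
delete("user1", "profile:email")
after_delete = read("user1", "profile:email")
```

The log ends up with three records even though the cell reads as empty, which is why such stores need a background step to reclaim space.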
CASSANDRA
Cassandra
  Developed at Facebook, 2008, donated to Apache
  Descended from Bigtable and Dynamo
    One of the primary Dynamo developers helped create Cassandra
  Focused on maximum throughput
    Write lots of data, fast
    But at the expense of consistency (tunable)
  Used by Twitter, Reddit, Netflix
    …but not Facebook
Cassandra
  Partitioned via hash (multiple strategies)
    Be careful choosing your Row Key!
  Async masterless replication
  Tunable Consistency
    from “writes never fail” to “wait until persisted on all replicas”
  Query with range queries, column family, CQL
  Hadoop support (replaces HDFS)
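
Tunable consistency comes down to how many replicas must acknowledge each read and write. The sketch below models only three of Cassandra's consistency levels (ONE, QUORUM, ALL) and the usual overlap rule, reads plus writes greater than the replication factor; the function names are hypothetical and no Cassandra driver is involved.

```python
def required_acks(level, replication_factor):
    """Replicas that must respond before the operation is reported successful."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[level]


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when every read set must overlap every write set: R + W > N."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor
```

With three replicas, QUORUM writes and QUORUM reads overlap on at least one replica (2 + 2 > 3), while ONE and ONE do not (1 + 1 is not greater than 3), which is the fast but eventually consistent end of the dial.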
GRAPH DATABASES
Nodes and Edges
NEO4J
Neo4J
  Released in 2010
  Written in Java, APIs are Java centric
  Most popular Graph Database
  Powers the recommendation engines of Glassdoor, Walmart
Neo4J
  Whole graph in memory – scales to millions of relationships
    But does persist to disk
  Transactional
  Replicated for performance and robustness, master/slave
  Proprietary Graph query language (Cypher)
  Enterprise version adds clustering, sharding
SEMANTIC WEB
Semantics: A New Way to Organize Data
  Data is stored in Triples, expressed as: Subject : Predicate : Object
    John Smith : livesIn : London
    London : isIn : England
  Query with SPARQL, gives us simple lookup .. and more
    Find people who live in (a place that's in) England
  [Diagram: nodes "John Smith", "London", "England"; edges livesIn, isIn, livesIn]
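
The "people who live in a place that's in England" query is a two-step pattern match over triples. The sketch below models it in plain Python rather than in SPARQL or any real triple store; `match` and `people_living_in` are hypothetical helpers, and the Jane Doe, Paris, and France triples are invented to give the query something to exclude.

```python
triples = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Jane Doe", "livesIn", "Paris"),  # invented data for contrast
    ("Paris", "isIn", "France"),       # invented data for contrast
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in sorted(triples)
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def people_living_in(region):
    """?person livesIn ?place, where ?place isIn <region>."""
    places = {s for (s, _p, _o) in match(predicate="isIn", obj=region)}
    return sorted(s for (s, _p, o) in match(predicate="livesIn") if o in places)
```

Calling `people_living_in("England")` joins the two patterns on the shared place and returns John Smith, even though no triple states directly that he lives in England.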
Context from the World at Large
  “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
  Linked Open Data
    Facts that are freely available
    In a form that’s easily consumed
  DBpedia (Wikipedia as structured information)
    Einstein was born in Germany
    Ireland’s currency is the Euro
  GeoNames
    Doha is the capital of Qatar
    Doha has these lat/long coordinates
IN CONCLUSION…
Don't Design Your System Like It's 1979
ANY QUESTIONS?
@MARKLOGIC