Intro to NoSQL

© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. Introduction to NoSQL Jim Driscoll, MarkLogic

Upload: jim-driscoll

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Intro to NoSQL

Introduction to NoSQL
Jim Driscoll, MarkLogic

Page 2: Intro to NoSQL

Agenda
- History of NoSQL
- NoSQL Terminology
- Types of NoSQL Databases (with examples of each)

Page 3: Intro to NoSQL

HISTORY OF NOSQL

Page 4: Intro to NoSQL

A Short History of Data
- Application Specific Databases
  - Size is paramount
- Relational Databases
  - Size matters…
  - …but break from the application silo
  - …and provide data integrity
- NoSQL Databases
  - Agility
  - Scalability
  - Speed

Page 5: Intro to NoSQL

RELATIONAL DOESN’T MEAN WHAT YOU THINK IT DOES

Page 6: Intro to NoSQL

What's wrong with Relational?
- Nothing, it's perfect for square data
  - ...where you know the relationships in advance
  - ...where the schema doesn't change often
  - ...where all the data can fit on one machine
  - ...where a separate disk seek for every join isn't an issue (or can be cached)

Page 7: Intro to NoSQL

SEEK AND YOU WILL FIND…IN ABOUT 10MS

Page 8: Intro to NoSQL

The Rise of NoSQL

1998  Carlo Strozzi coins term
2001  MarkLogic founded
2003  Google FileSystem paper
2004  Google MapReduce paper
2006  Google BigTable paper
2007  Amazon Dynamo paper
2009  Eric Evans popularizes term

Page 9: Intro to NoSQL

MEMCACHED

Page 10: Intro to NoSQL

Memcached
- Developed at LiveJournal as a frontend cache for websites
- First released in 2003
- Keeps disk access to a minimum, pools memory across many machines
- So useful it found wide popularity, still under active development

Page 11: Intro to NoSQL

Memcached
High Performance, Distributed Memory Object Caching System
- Distributed – runs across many computers
- Memory – runs without touching disk
- Object cache – designed to hold small lumps of data
- High performance – because it never touches disk, and the objects are small, it’s optimized for speed

Page 12: Intro to NoSQL

Memcached
- Client server system
- Servers are unaware of each other
- Clients determine server to use via hashing
- Servers keep content as an LRU cache
  - So all data is transitory
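
The client-side hashing and LRU eviction described on this slide can be sketched roughly as follows. This is a minimal illustration, not memcached's actual code: the server addresses, `pick_server` and `LRUCache` are hypothetical names, and real clients usually use consistent hashing rather than a plain modulo.

```python
import hashlib
from collections import OrderedDict

# Hypothetical server addresses; the servers never talk to each other.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]


def pick_server(key, servers=SERVERS):
    """Client-side routing: hash the key and map it onto one server."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


class LRUCache:
    """Toy stand-in for one server: evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None  # a miss: the caller falls back to the database
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry


cache = LRUCache(capacity=2)
cache.set("user:1", "alice")
cache.set("user:2", "bob")
cache.get("user:1")           # touch user:1 so user:2 becomes the oldest
cache.set("user:3", "carol")  # over capacity: user:2 is evicted
```

Because any entry can be evicted at any time, callers have to treat every `get` as something that may miss.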

Page 13: Intro to NoSQL

SHARDING

Page 14: Intro to NoSQL

Sharding to Scale Out

Page 15: Intro to NoSQL

BIGTABLE

Page 16: Intro to NoSQL

Bigtable
- Created by Google in 2004
  - …to store massive amounts of data
- Made public in famous 2006 paper
- Used throughout Google
  - Gmail, Google Maps, YouTube, Web Indexing, etc.
  - Reportedly over 100 internal projects
- Never shipped externally as a product
  - …but available for public use as part of the AppEngine hosting API

Page 17: Intro to NoSQL

Bigtable
- Rows are composed of columns, which in turn belong to column families
- Column families are essentially typing, validation and expiration info
  - It’s helpful to think of them as the “tables”
- Lookups are done via a Row Key
- Every cell is versioned via timestamp, and sparsely stored
- System is robust and crash resistant
  - Can survive the crash of any machine, including the master
- Scale out architecture
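
As a rough sketch of the data model on this slide (row key, then column within a family, then timestamped versions), assuming a toy in-memory class. `SparseTable`, the family names and the row key are hypothetical, and nothing here reflects how Bigtable stores data on disk.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row key -> "family:qualifier" -> [(timestamp, value)]."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = defaultdict(dict)  # cells that were never written take no space

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        versions = self.rows[row_key].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row_key, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self.rows.get(row_key, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = SparseTable(families=["contents", "anchor"])
table.put("com.example.www", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example.www", "contents:html", "<html>v2</html>", timestamp=2)
table.put("com.example.www", "anchor:news.example.org", "Example", timestamp=1)
latest = table.get("com.example.www", "contents:html")
older = table.get("com.example.www", "contents:html", timestamp=1)
```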

Page 18: Intro to NoSQL

MAP / REDUCE

Page 19: Intro to NoSQL

Map / Reduce
- Massively Distributed Processes
- Map - sort, filter, transform data
- Reduce - summarize data (iteratively)
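
A single-process word count is the usual way to illustrate the two phases. The sketch below assumes everything fits in memory; the function names are illustrative, and a real framework would run the map and reduce calls on many machines and do the grouping step over the network.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: transform one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key."""
    return key, reduce(lambda a, b: a + b, values)


documents = ["the quick brown fox", "the lazy dog", "The end"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```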

Page 20: Intro to NoSQL

HADOOP

Page 21: Intro to NoSQL

Hadoop
- First envisioned as “Nutch” at the Internet Archive in 2002
  - There were hundreds of millions of webpages to index
- Early versions heavily influenced by Google File System, Map Reduce papers
- Goal: Perform work on large datasets using commodity machines
- Development moved to Yahoo in 2006
- Open Sourced to Apache, as Hadoop
  - A File system (HDFS)
  - A Task Runner (MapReduce)
  - A Task Manager (YARN)
- Note: Not a database

Page 22: Intro to NoSQL

Hadoop
- Really good at…
  - Batch Processing
  - …on incredibly large data sets
- Not so good parts
  - Latency
  - Updates
  - Usability

Page 23: Intro to NoSQL

DYNAMO

Page 24: Intro to NoSQL

Amazon Dynamo
- Created to power Amazon’s Web store
- Writing with low latency more important than consistency
- Techniques first made public in 2007 paper
- Never externally shipped…
  - …but huge influence on market
- Used for a variety of critical portions of Amazon’s site
  - Shopping cart
  - User Session
- Succeeded by DynamoDB
  - Similar name, but whole new architecture (with better consistency)

Page 25: Intro to NoSQL

Amazon Dynamo
- Distributed Key Value store
- “always writable”
- low latency reads and writes, at the expense of consistency
  - asynchronous replication on put() operations
  - …means that get() may return a stale value
- updates during a network partition can result in conflicts
  - …and the application must handle them
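
The stale read that asynchronous replication allows can be shown with a small simulation. This is a sketch under heavy simplification: `EventuallyConsistentStore` and its methods are hypothetical, replication is an explicit method call instead of a background process, and conflict detection is not modelled at all.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """put() acknowledges after one replica; the rest catch up later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = deque()  # replication messages not yet delivered

    def put(self, key, value):
        self.replicas[0].data[key] = value  # acknowledge immediately
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))

    def get(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def replicate(self):
        """Deliver all queued updates (runs in the background in practice)."""
        while self.pending:
            replica, key, value = self.pending.popleft()
            replica.data[key] = value


store = EventuallyConsistentStore()
store.put("cart:42", ["book"])
stale = store.get("cart:42", replica_index=1)  # replication has not run yet
store.replicate()
fresh = store.get("cart:42", replica_index=1)
```

Between `put` and `replicate`, a reader that happens to hit replica 1 sees no cart at all, which is the window the slide warns about.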

Page 26: Intro to NoSQL

TERMINOLOGY

Page 27: Intro to NoSQL

What is(n't) NoSQL
- No SQL
- Schema-less
- Open Source
- BASE (Eventually Consistent)

Page 28: Intro to NoSQL

ACID
- Atomicity
  - Everything either succeeds or fails
- Consistency
  - Nothing is saved unless it passes consistency rules
- Isolation
  - No two processes can interfere with each other
- Durability
  - Once saved, data cannot be lost due to system failure
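
Atomicity and consistency can be seen in a few lines with Python's built-in `sqlite3` module. The table, the account names and the `transfer` helper below are made up for the illustration; the point is only that a failed consistency rule undoes the whole transaction, not just the failing statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )


def transfer(conn, source, target, amount):
    """Move money atomically: both updates are saved, or neither is."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected the overdraft


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the rejected transfer the credit to `bob` runs first and succeeds, yet it is rolled back along with the failed debit.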

Page 29: Intro to NoSQL

BASE
- Basically Available
- Soft state
- Eventually consistent

Page 30: Intro to NoSQL

What happens without consistency?
- Absolute fastest performance at lowest hardware cost
- Highest global data availability at lowest hardware cost
- Working with one document or row at a time
- Writing advanced code to create your own consistency model
- Eventually consistent data
- Some inconsistent data that can’t be reconciled
- Some missing data that can’t be recovered
- Some inconsistent query results

Page 31: Intro to NoSQL

What is NoSQL?
- Database
- Non-relational
- Schema on read
- Scale out architecture
  - Cluster friendly / Cloud Ready

Page 32: Intro to NoSQL

TYPES OF NOSQL DATABASES

Page 33: Intro to NoSQL

Types of NoSQL Databases

Graph Databases

Wide Column Databases

Key Value Databases

Document Databases

Page 34: Intro to NoSQL

KEY VALUE STORES

Page 35: Intro to NoSQL

MemcacheDB
- Very early KV implementation (2008)
- KV Store based on Memcached source, with Berkeley DB persistent store
- Speaks the memcached protocol
- Development stopped (2009), but still quite popular
- For when you like Memcached, but want persistence

Page 36: Intro to NoSQL

REDIS

Page 37: Intro to NoSQL

Redis
- First released in 2009
  - Sponsored by VMware, then Pivotal
- Name means Remote Dictionary Server
- Fully in memory key value store
  - Whole db must reside in memory of one machine
  - Limits scalability, at the benefit of performance
- Often used as a front end cache for other NoSQL databases

Page 38: Intro to NoSQL

Redis
- Not just strings as values:
  - Lists of strings
  - Sets of strings (collections of non-repeating unsorted elements)
  - Sorted sets of strings (collections of non-repeating elements ordered by a floating-point number called score)
  - Hashes where keys and values are strings
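
For readers who think in Python, the value types listed here map loosely onto built-in structures. These are analogies only, with made-up sample data; nothing below calls a Redis client or reproduces Redis semantics exactly.

```python
# Pure-Python stand-ins for the value types listed above.

# List of strings: ordered, duplicates allowed.
recent_pages = ["/home", "/cart", "/home"]

# Set of strings: unordered, no duplicates.
tags = {"nosql", "cache", "nosql"}

# Sorted set: unique members, each ordered by a floating-point score.
scores = {"alice": 31.5, "bob": 12.0, "carol": 27.25}
leaderboard = sorted(scores, key=lambda member: (scores[member], member))

# Hash: a small map whose field names and values are both strings.
user = {"name": "alice", "email": "alice@example.com"}
```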

Page 39: Intro to NoSQL

Redis
- Master / slave replication
  - slave may be master to another slave, allowing tree replication
  - also publish/subscribe API
  - slaves may be updated separately from master, allows inconsistencies (!)
- Persistent store
  - Append only journal
  - Flushed every 2 seconds by default

Page 40: Intro to NoSQL

DOCUMENT STORES

Page 41: Intro to NoSQL

Document vs Key Value Stores
- Extension of Key Value - the value is a document
- but also Structurally aware
  - Indexed searches
- Self-describing document formats
  - CouchDB – JSON
  - MongoDB – BSON
  - MarkLogic - JSON, XML
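
The difference from a plain key value store is that the database looks inside the value. A minimal sketch, assuming JSON documents and a toy in-memory index; `DocumentStore` and its methods are hypothetical and only index top-level scalar fields.

```python
import json
from collections import defaultdict


class DocumentStore:
    """Toy key-value store that also indexes fields inside each document."""

    def __init__(self):
        self.documents = {}            # key -> parsed document
        self.index = defaultdict(set)  # (field, value) -> keys

    def _unindex(self, key):
        for field, value in self.documents.pop(key, {}).items():
            if isinstance(value, (str, int, float)):
                self.index[(field, value)].discard(key)

    def put(self, key, json_text):
        self._unindex(key)  # keep the index in step when a key is overwritten
        document = json.loads(json_text)  # self-describing: no schema needed
        self.documents[key] = document
        for field, value in document.items():
            if isinstance(value, (str, int, float)):
                self.index[(field, value)].add(key)

    def get(self, key):
        """Plain key lookup, as in any key value store."""
        return self.documents.get(key)

    def find(self, field, value):
        """Indexed search on document structure."""
        return sorted(self.index.get((field, value), set()))


store = DocumentStore()
store.put("p1", '{"name": "John Smith", "city": "London"}')
store.put("p2", '{"name": "Jane Doe", "city": "London", "age": 41}')
store.put("p3", '{"name": "Ann Lee", "city": "Doha"}')
londoners = store.find("city", "London")
```

Note that `p2` carries an `age` field the others lack; no schema change was needed to store it.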

Page 42: Intro to NoSQL

MARKLOGIC

Page 43: Intro to NoSQL

MarkLogic
- Founded in 2001, founders were search engine experts
- Document centric database with search engine features
- Stores and indexes XML, JSON, text and binaries
- Enterprise NoSQL
  - ACID transactions (including XA)
  - HA/DR
  - Government grade security

Page 44: Intro to NoSQL

MarkLogic
- Universal Index – index all the things
  - Index words, elements, the relationships of words and elements
  - Many indexes (automatically) used at once, resolving queries without touching disk
- Search on ranges, free text, field values, more…
- Shared nothing architecture, transactions via MVCC
- Automatic partitioning and balancing
- Hadoop support (works on HDFS, and with Map/Reduce jobs)
- Includes a webserver for building RESTful applications

Page 45: Intro to NoSQL

MarkLogic
- More than a document store
  - Range indexes allow in-memory column operations
  - Triple store, supporting RDF Triples and SPARQL
- High Availability – multiple copies of updates saved transactionally
- Disaster Recovery – copies sent to remote site with a window
- Free to download and try out with a developer license

Page 46: Intro to NoSQL

MONGODB

Page 47: Intro to NoSQL

MongoDB
- Development began in 2007 by 10gen
- Name from “humongous”
- Originally wanted to create a Google App Engine system
- 1.4 considered first “production ready” release, 2010
- Stores and retrieves BSON documents
- Horizontally scaling

Page 48: Intro to NoSQL

MongoDB
- Stores data in proprietary format
  - BSON, similar to JSON with more data types
- Search on field, on range, or on regex
- Single index per query (secondary index optional)
- Replication of databases as master/slave, with (tunable) eventual consistency
- Sharding handled via a shard key, splitting by range
  - Be sure the key is evenly distributed
- Client APIs in many languages
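
Why an evenly distributed shard key matters can be seen with a toy range partitioner. The split points and sample keys below are invented for the illustration, and this is a generic range-sharding sketch, not MongoDB's balancer.

```python
import bisect
from collections import Counter

# Hypothetical split points: shard 0 holds keys < "h", shard 1 holds
# "h" <= key < "p", shard 2 holds everything from "p" upward.
SPLIT_POINTS = ["h", "p"]


def shard_for(shard_key):
    """Range sharding: find which key range the shard key falls into."""
    return bisect.bisect_right(SPLIT_POINTS, shard_key)


def distribution(keys):
    """Count how many documents land on each shard."""
    return Counter(shard_for(key) for key in keys)


even_keys = ["adams", "baker", "hughes", "jones", "price", "young"]
skewed_keys = ["smith", "stone", "sanders", "silva", "scott", "adams"]

even = distribution(even_keys)      # two documents on each shard
skewed = distribution(skewed_keys)  # five of six documents on shard 2
```

With the skewed keys, one machine does almost all the work while the others sit idle, which defeats the purpose of scaling out.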

Page 49: Intro to NoSQL

WIDE COLUMN STORES

Page 50: Intro to NoSQL

Column Stores
- Descended from Big Table approach
- Excellent for sparse data
- Column families need to be specified up front
  - But still stored sparsely
- No way to list all the columns in the database
- Append only
  - Updates via timestamp
  - Deletes via tombstone marker
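
The append-only behaviour, where an update is a newer timestamp and a delete is a tombstone, can be sketched as below. `AppendOnlyColumnStore` is a hypothetical in-memory toy; real systems also compact the log to reclaim space, which is not modelled here.

```python
TOMBSTONE = object()  # sentinel marking a deleted cell


class AppendOnlyColumnStore:
    """Toy store: every change is appended; nothing is edited in place."""

    def __init__(self):
        self.log = []  # (timestamp, row_key, column, value)

    def write(self, row_key, column, value, timestamp):
        self.log.append((timestamp, row_key, column, value))

    def delete(self, row_key, column, timestamp):
        self.write(row_key, column, TOMBSTONE, timestamp)

    def read(self, row_key, column):
        """Return the value with the highest timestamp, or None if deleted."""
        newest = None
        for timestamp, key, col, value in self.log:
            if key == row_key and col == column:
                if newest is None or timestamp >= newest[0]:
                    newest = (timestamp, value)
        if newest is None or newest[1] is TOMBSTONE:
            return None
        return newest[1]


store = AppendOnlyColumnStore()
store.write("user:1", "profile:email", "old@example.com", timestamp=1)
store.write("user:1", "profile:email", "new@example.com", timestamp=2)
updated = store.read("user:1", "profile:email")
store.delete("user:1", "profile:email", timestamp=3)
deleted = store.read("user:1", "profile:email")
```

After the delete, all three entries are still in the log; the tombstone simply wins because it is the newest.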

Page 51: Intro to NoSQL

CASSANDRA

Page 52: Intro to NoSQL

Cassandra
- Developed at Facebook, 2008, donated to Apache
- Descended from Bigtable and Dynamo
  - One of the primary Dynamo developers helped create Cassandra
- Focused on maximum throughput
  - Write lots of data, fast
  - But at the expense of consistency (tunable)
- Used by Twitter, Reddit, Netflix
  - …but not Facebook

Page 53: Intro to NoSQL

Cassandra
- Partitioned via hash (multiple strategies)
  - Be careful choosing your Row Key!
- Async masterless replication
- Tunable Consistency
  - from “writes never fail” to “wait until persisted on all replicas”
- Query with range queries, column family, CQL
- Hadoop support (replaces HDFS)
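
Tunable consistency comes down to how many replica acknowledgements a write waits for. The sketch below is a simplified model, not Cassandra's implementation: the `Replica` class and both functions are hypothetical, and treating `ANY` as "zero acknowledgements needed" is an assumption made to keep the example short.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False  # no acknowledgement from a down node
        self.data[key] = value
        return True


def required_acks(level, replica_count):
    """Map a consistency level to the number of acknowledgements needed."""
    return {
        "ANY": 0,                            # simplification: never fails
        "ONE": 1,
        "QUORUM": replica_count // 2 + 1,    # a strict majority
        "ALL": replica_count,
    }[level]


def write(replicas, key, value, level):
    """Send the write to every replica; succeed if enough acknowledge it."""
    acks = sum(replica.write(key, value) for replica in replicas)
    return acks >= required_acks(level, len(replicas))


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
results = {
    level: write(replicas, "k", "v", level)
    for level in ("ANY", "ONE", "QUORUM", "ALL")
}
```

With one of three replicas down, every level except `ALL` still accepts the write, which is the availability-versus-consistency dial the slide describes.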

Page 54: Intro to NoSQL

GRAPH DATABASES

Page 55: Intro to NoSQL

Nodes and Edges

Page 56: Intro to NoSQL

NEO4J

Page 57: Intro to NoSQL

Neo4J
- Released in 2010
- Written in Java, APIs are Java centric
- Most popular Graph Database
- Powers the recommendation engines of Glassdoor, Walmart

Page 58: Intro to NoSQL

Neo4J
- Whole graph in memory – scales to millions of relationships
  - But does persist to disk
- Transactional
- Replicated for performance and robustness, master/slave
- Proprietary Graph query language (Cypher)
- Enterprise version adds clustering, sharding

Page 59: Intro to NoSQL

SEMANTIC WEB

Page 60: Intro to NoSQL

Semantics: A New Way to Organize Data
- Data is stored in Triples, expressed as: Subject : Predicate : Object
  - John Smith : livesIn : London
  - London : isIn : England
- Query with SPARQL, gives us simple lookup .. and more
  - Find people who live in (a place that's in) England

[Diagram: "John Smith" livesIn "London"; "London" isIn "England"; a second livesIn edge is also drawn]
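
The "people who live in a place that's in England" query is a two-step pattern match over triples. The sketch below uses plain Python tuples in place of a triple store and SPARQL; the `match` and `people_living_in` helpers and the two French facts are hypothetical additions for the example.

```python
# Each fact is a (subject, predicate, object) triple, as on the slide.
TRIPLES = {
    ("John Smith", "livesIn", "London"),
    ("London", "isIn", "England"),
    ("Marie Dupont", "livesIn", "Paris"),  # hypothetical extra facts
    ("Paris", "isIn", "France"),
}


def match(subject=None, predicate=None, obj=None, triples=TRIPLES):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def people_living_in(region):
    """Find ?person where ?person livesIn ?place and ?place isIn region."""
    places = {s for (s, _, _) in match(predicate="isIn", obj=region)}
    return sorted(
        s for (s, _, o) in match(predicate="livesIn") if o in places
    )


residents = people_living_in("England")
```

No triple says that John Smith lives in England; the answer falls out of joining the two facts on the shared `?place`.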

Page 61: Intro to NoSQL

Context from the World at Large

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

- Linked Open Data
  - Facts that are freely available
  - In a form that’s easily consumed
- DBpedia (Wikipedia as structured information)
  - Einstein was born in Germany
  - Ireland’s currency is the Euro
- GeoNames
  - Doha is the capital of Qatar
  - Doha has these lat/long coordinates

Page 62: Intro to NoSQL

IN CONCLUSION…

Page 63: Intro to NoSQL

Don't Design Your System Like It's 1979

Page 64: Intro to NoSQL

ANY QUESTIONS?

@MARKLOGIC