MongoDB for Coder Training (Coding Serbia 2013)


DESCRIPTION

Slides of my MongoDB training given at the Coding Serbia conference on 18.10.2013. Agenda: 1. Introduction to NoSQL & MongoDB 2. Data manipulation: Learn how to CRUD with MongoDB 3. Indexing: Speed up your queries with MongoDB 4. MapReduce: Data aggregation with MongoDB 5. Aggregation Framework: Data aggregation done the MongoDB way 6. Replication: High Availability with MongoDB 7. Sharding: Scaling with MongoDB

TRANSCRIPT

Uwe Seiler (uweseiler)

Training: MongoDB for Coder

About me

Big Data Nerd

Travel Pirate, Photography Enthusiast

Hadoop Trainer, MongoDB Author

About us: codecentric is a bunch of…

Big Data Nerds, Agile Ninjas, Continuous Delivery Gurus,

Enterprise Java Specialists & Performance Geeks

Join us!

Agenda I

1. Introduction to NoSQL & MongoDB

2. Data manipulation: Learn how to CRUD with MongoDB

3. Indexing: Speed up your queries with MongoDB

4. MapReduce: Data aggregation with MongoDB

Agenda II

5. Aggregation Framework: Data aggregation done the MongoDB way

6. Replication: High Availability with MongoDB

7. Sharding: Scaling with MongoDB

Ingredients

• Slides

• Live Coding

• Discussion

• Labs on your own computer

And please…

If you have questions, please share them with us!

And now start your downloads…

Lab files: http://bit.ly/1aT8RXY

Buzzword Bingo

NoSQL

Classification of NoSQL

Key-Value Stores

Column Stores

Graph Databases

Document Stores

Big Data

My favorite definition

The classic definition

• The 3 V’s of Big Data

• Volume, Velocity, Variety

«Big Data» != Hadoop

Vertical Scaling: a single node that grows by adding more RAM, CPU, and storage.

Horizontal Scaling: many nodes, each with its own RAM, CPU, and storage; capacity grows by adding more nodes.

The problem with distributed data

Partition Tolerance: the failure of single nodes doesn't affect the overall system

The CAP Theorem

Consistency: all nodes see the same data at the same time

Availability: a guarantee that every request receives a response

Overview of NoSQL systems

Partition Tolerance: the failure of single nodes doesn't affect the overall system

The problem with consistency

ACID

vs.

BASE

ACID vs. BASE

Atomicity

Consistency

Isolation

Durability

(1983: RDBMS)

ACID is a good concept, but it is not a written law!

ACID vs. BASE

Basically Available

Soft State

Eventually Consistent

(2008: NoSQL)

ACID vs. BASE

ACID
- Strong consistency
- Isolation & transactions
- Two-phase commit
- Complex development
- More reliable

BASE
- Eventual consistency
- Highly available
- "Fire-and-forget"
- Eases development
- Faster

ACID vs. BASE

Overview of MongoDB

MongoDB is a…

• document

• open source

• highly performant

• flexible

• scalable

• highly available

• feature-rich…database

Document Database

• Not PDF, Word, etc. … JSON!

Open Source Database

• MongoDB is an open source project

• Available on GitHub– https://github.com/mongodb/mongo

• Uses the AGPL license

• Started and sponsored by MongoDB Inc. (prior: 10gen)

• Commercial version and support available

• Join the crowd!– https://jira.mongodb.org

Data locality

Performance

In-Memory Caching

In-Place Updates

Flexible Schema

RDBMS MongoDB

{

_id :

ObjectId("4c4ba5e5e8aabf3"),

employee_name: "Dunham, Justin",

department : "Marketing",

title : "Product Manager, Web",

report_up: "Neray, Graham",

pay_band: “C",

benefits : [

{ type : "Health",

plan : "PPO Plus" },

{ type : "Dental",

plan : "Standard" }

]

}

Scalability

Auto Sharding

• Increase capacity as you go

• Commodity and cloud architectures

• Improved operational simplicity and cost visibility

High Availability

• Automated replication and failover

• Multi-data center support

• Improved operational simplicity (e.g., HW swaps)

• Data durability and consistency

MongoDB Architecture

Rich Query Language

Aggregation Framework

Map/Reduce

[Diagram: Map/Reduce across Shard 1…n — Map() emits (k, v); Group(k) and Sort(k) organize the pairs; Reduce(k, values) and Finalize(k, v) produce the result]

Geo Information

Driver & Shell

Shell to interact with the database

Drivers are available for almost all popular programming languages and frameworks

> db.collection.insert({ product : "MongoDB", type : "Document Database" })
> db.collection.findOne()
{
  "_id" : ObjectId("5106c1c2fc629bfe52792e86"),
  "product" : "MongoDB",
  "type" : "Document Database"
}

Java

Python

Perl

Ruby

Haskell

JavaScript

Indeed.com Trends

Top Job Trends

1. HTML 5

2. MongoDB

3. iOS

4. Android

5. Mobile Apps

6. Puppet

7. Hadoop

8. jQuery

9. PaaS

10. Social Media

NoSQL Trends: LinkedIn Job Skills

MongoDB

Competitor 1

Competitor 2

Competitor 3

Competitor 4

Competitor 5

All Others

Google Search

MongoDB

Competitor 1

Competitor 2

Competitor 3

Competitor 4

Jaspersoft Big Data Index

Direct Real-Time Downloads

MongoDB

Competitor 1

Competitor 2

Competitor 3

Data manipulation

RDBMS ➜ MongoDB

Table / View ➜ Collection
Row ➜ Document
Index ➜ Index
Join ➜ Embedded document
Foreign Key ➜ Referenced document
Partition ➜ Shard

Terminology

Example: Simple blog model

MongoDB Collections

• User

• Article

• Tag

• Category

Schema design for the blog

Let’s have a look…

// Show all databases
> show dbs
digg   0.078125GB
enron  1.49951171875GB

// Switch to a database
> use blog

// Show all databases again
> show dbs
digg   0.078125GB
enron  1.49951171875GB

Create a database

// Show all collections
> show collections

// Insert a user
> db.user.insert(
    { name : "Sheldon", mail : "sheldon@bigbang.com" }
  )

Create a collection I

No feedback about the result of the insert, use:

db.runCommand( { getLastError: 1} )

// Show all collections
> show collections
system.indexes
user

// Show all databases
> show dbs
blog   0.0625GB
digg   0.078125GB
enron  1.49951171875GB

Create a collection II

Databases and collections are automatically created during the first insert operation!
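Collections can also be created explicitly, e.g. to set options; a minimal sketch (the collection names are just examples):

// Explicitly create a collection
> db.createCollection("user")

// Create a capped collection with a fixed size of 1 MB
> db.createCollection("log", { capped : true, size : 1048576 })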

// Show the first document
> db.user.findOne()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "name" : "Sheldon",
  "mail" : "sheldon@bigbang.com"
}

// Show all documents of a collection
> db.user.find()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "name" : "Sheldon",
  "mail" : "sheldon@bigbang.com"
}

Read from a collection

// Find a specific document
> db.user.find( { name : "Penny" } )
{
  "_id" : ObjectId("5166a9dc2f391f3c2fcb80f1"),
  "name" : "Penny",
  "mail" : "penny@bigbang.com"
}

// Show only certain fields of the document
> db.user.find( { name : "Penny" },
    { _id : 0, mail : 1 } )

{ "mail" : "penny@bigbang.com" }

Find documents

_id

• _id is the primary key in MongoDB

• _id is created automatically

• If not specified differently, its type is ObjectId

• _id can be specified by the user during the insert of documents, but it needs to be unique (and cannot be edited afterwards) — see the sketch below
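A quick sketch of the uniqueness constraint (hypothetical documents):

// Insert a document with a user-defined _id
> db.user.insert( { _id : "sheldon", name : "Sheldon" } )

// A second insert with the same _id fails with a duplicate key error (E11000)
> db.user.insert( { _id : "sheldon", name : "Impostor" } )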

ObjectId

• An ObjectId is a special 12-byte value

• Its uniqueness in the whole cluster is guaranteed as follows:

ObjectId("50804d0b d94cca b2da 652599")
             ts      mac    pid   inc
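Since the first 4 bytes are a timestamp, the shell can recover the creation time; a minimal sketch:

// Extract the timestamp part of an ObjectId
> ObjectId("50804d0bd94ccab2da652599").getTimestamp()
ISODate("2012-10-18T18:40:11Z")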

// Use a cursor with find()
> var myCursor = db.user.find()

// Get the next document
> var myDocument = myCursor.hasNext() ? myCursor.next() : null;
> if (myDocument) { printjson(myDocument.mail); }

// Show all other documents
> myCursor.forEach(printjson);

Cursor

By default the shell displays 20 documents
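The batch size is a shell setting; a minimal sketch:

// Print 50 documents per batch instead of 20
> DBQuery.shellBatchSize = 50

// Type "it" to iterate to the next batch of the last cursor
> it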

// Find documents using OR
> db.user.find(
    { $or : [
        { name : "Sheldon" },
        { mail : "amy@bigbang.com" }
    ] } )

// Find documents using AND
> db.user.find(
    { $and : [
        { name : "Sheldon" },
        { mail : "amy@bigbang.com" }
    ] } )

Logical operators

// Sort documents
> db.user.find().sort( { name : 1 } )   // Ascending
> db.user.find().sort( { name : -1 } )  // Descending

// Limit the number of documents
> db.user.find().limit(3)

// Skip documents
> db.user.find().skip(2)

// Combination of both methods
> db.user.find().skip(2).limit(3)

Manipulating results

// Updating only the mail address (How not to do it…)
> db.user.update( { name : "Sheldon" },
    { mail : "sheldon@howimetyourmother.com" } )

// Result of the update operation
> db.user.findOne()
{
  "_id" : ObjectId("516684a32f391f3c2fcb80ed"),
  "mail" : "sheldon@howimetyourmother.com"
}

Updating documents I

Be careful when updating documents!

// Deleting a document
> db.user.remove(
    { mail : "sheldon@howimetyourmother.com" } )

// Deleting all documents in a collection
> db.user.remove()

// Use a condition to delete documents
> db.user.remove( { mail : /.*mother.com$/ } )

// Delete only the first document matching a condition
> db.user.remove( { mail : /.*.com$/ }, true )

Deleting documents

// Updating only the mail address (This time for real)
> db.user.update( { name : "Sheldon" },
    { $set : { mail : "sheldon@howimetyourmother.com" } } )

// Show the result of the update operation
> db.user.find( { name : "Sheldon" } )
{
  "_id" : ObjectId("5166ba122f391f3c2fcb80f5"),
  "mail" : "sheldon@howimetyourmother.com",
  "name" : "Sheldon"
}

Updating documents II

// Adding an array
> db.user.update( { name : "Sheldon" },
    { $set : { enemies : [
        { name : "Wil Wheaton" },
        { name : "Barry Kripke" }
    ] } } )

// Adding a value to the array
> db.user.update( { name : "Sheldon" },
    { $push : { enemies : { name : "Leslie Winkle" } } } )

Adding to arrays

// Deleting a value from an array
> db.user.update( { name : "Sheldon" },
    { $pull : { enemies : { name : "Barry Kripke" } } } )

// Deleting a complete array
> db.user.update( { name : "Sheldon" },
    { $unset : { enemies : 1 } } )

Deleting from arrays

// Adding a subdocument to an existing document
> db.user.update( { name : "Sheldon" },
    { $set : { mother : {
        name : "Mary Cooper",
        residence : "Galveston, Texas",
        religion : "Evangelical Christian"
    } } } )

{
  "_id" : ObjectId("5166cf162f391f3c2fcb80f7"),
  "mail" : "sheldon@bigbang.com",
  "mother" : {
    "name" : "Mary Cooper",
    "residence" : "Galveston, Texas",
    "religion" : "Evangelical Christian"
  },
  "name" : "Sheldon"
}

Adding a subdocument

// Finding out the name of the mother
> db.user.find( { name : "Sheldon" },
    { "mother.name" : 1 } )

{
  "_id" : ObjectId("5166cf162f391f3c2fcb80f7"),
  "mother" : {
    "name" : "Mary Cooper"
  }
}

Querying subdocuments

Compound field names need to be quoted ("…")!

For fields:
$inc, $rename, $set, $unset

For arrays:
$addToSet, $pop, $pull, $pullAll, $push, $pushAll,
$each (modifier), $slice (modifier), $sort (modifier)

Bitwise operations:
$bit

Isolation:
$isolated

Overview of all update operators
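A few of these operators in action; a minimal sketch (the nerd_points field and the sample enemy are made up for illustration):

// $inc increments a numeric field
> db.user.update( { name : "Sheldon" }, { $inc : { nerd_points : 1 } } )

// $rename renames a field
> db.user.update( { name : "Sheldon" }, { $rename : { mail : "email" } } )

// $push with the $each, $sort and $slice modifiers (2.4+):
// add entries, sort the array by name, and keep only the last 5
> db.user.update(
    { name : "Sheldon" },
    { $push : { enemies : {
        $each : [ { name : "Leonard Hofstadter" } ],
        $sort : { name : 1 },
        $slice : -5
    } } }
  )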

Create: http://docs.mongodb.org/manual/core/create/
Read: http://docs.mongodb.org/manual/core/read/
Update: http://docs.mongodb.org/manual/core/update/
Delete: http://docs.mongodb.org/manual/core/delete/

Documentation

Lab time!

Lab Nr. 02

Time box: 20 min

Indexing

What is an index?

Chained lists

1 → 2 → 3 → 4 → 5 → 6 → 7

Find no. 7 in the chained list!

Find no. 7 in a tree!

[Diagram: a balanced tree over the values 1–7]

Indices in MongoDB are B-Trees

Find, Insert and Delete Operations:

O(log(n))

Missing or non-optimal indices are the single most avoidable performance issue

// Create a non-existing index for a field> db.recipes.createIndex({ main_ingredient: 1 })

// Make sure there is an index on the field> db.recipes.ensureIndex({ main_ingredient: 1 })

* 1 for ascending, -1 for descending

How do I create an index?

// Multiple fields (Compound Key Indexes)
> db.recipes.ensureIndex({
    main_ingredient : 1,
    calories : -1
  })

// Arrays of values (Multikey Indexes)
{
  name : 'Chicken Noodle Soup',
  ingredients : [ 'chicken', 'noodles' ]
}

> db.recipes.ensureIndex({ ingredients : 1 })

What can be indexed?

// Subdocuments
{
  name : 'Apple Pie',
  contributor : {
    name : 'Joe American',
    id : 'joea123'
  }
}

> db.recipes.ensureIndex({ 'contributor.id' : 1 })

> db.recipes.ensureIndex({ 'contributor' : 1 })

What can be indexed?

// List all indices of a collection
> db.recipes.getIndexes()
> db.recipes.getIndexKeys()

// Drop an index
> db.recipes.dropIndex({ ingredients : 1 })

// Drop and recreate all indices of a collection
> db.recipes.reIndex()

How to maintain indices?

More options

• Unique Index– Allows only unique values in the indexed field(s)

• Sparse Index– For fields that are not available in all documents

• Geospatial Index– For modelling 2D and 3D geospatial indices

• TTL Collections – Documents are automatically deleted after x seconds

// Make sure the name of a recipe is unique
> db.recipes.ensureIndex( { name : 1 }, { unique : true } )

// Force a unique index on a collection with non-unique values.
// Duplicates will be deleted more or less randomly!
> db.recipes.ensureIndex(
    { name : 1 },
    { unique : true, dropDups : true }
  )

* dropDups should be used with caution!

Unique Index

// Only documents with the field calories will be indexed
> db.recipes.ensureIndex(
    { calories : -1 },
    { sparse : true }
  )

// Combination with a unique index is possible
> db.recipes.ensureIndex(
    { name : 1, calories : -1 },
    { unique : true, sparse : true }
  )

* Without sparse, missing fields are saved as null in the index!

Sparse Index

// Add longitude and latitude
{
  name : 'codecentric Frankfurt',
  loc : [ 50.11678, 8.67206 ]
}

// Index the 2D coordinates
> db.locations.ensureIndex( { loc : '2d' } )

// Find locations near codecentric Frankfurt
> db.locations.find({
    loc : { $near : [ 50.1, 8.7 ] }
  })

Geospatial Index

// Documents need a field of type BSON UTC datetime
{ 'submitted_date' : ISODate('2012-10-12T05:24:07.211Z'), … }

// Documents will be deleted automatically by a daemon process
// after 'expireAfterSeconds'
> db.recipes.ensureIndex(
    { submitted_date : 1 },
    { expireAfterSeconds : 3600 }
  )

TTL Collections

Limitations of indices

• Collections can't have more than 64 indices

• Index keys are not allowed to be larger than 1024 bytes

• The name of an index (including namespace) must be less than 128 characters

• A query can only make use of one index
  – Exception: queries using $or (see the sketch below)

• MongoDB tries to keep indices in memory

• Indices slow down the writing of data
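The $or exception sketched (generic collection and field names):

// Each $or clause can use its own index
> db.collection.ensureIndex({ a : 1 })
> db.collection.ensureIndex({ b : 1 })

// This query can use both indices, one per clause
> db.collection.find({ $or : [ { a : 1 }, { b : 2 } ] })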

Optimizing indices

Best practice

1. Identify slow queries

2. Find out more about the slow queries using explain()

3. Create appropriate indices on the fields being queried

4. Optimize the query taking the available indices into account

> db.setProfilingLevel( n, slowms )

n = 0: Profiler off

n = 1: Log all operations slower than slowms (default: 100 ms)

n = 2: Log all operations

> db.system.profile.find()

* system.profile is a capped collection with a limited number of entries
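Some useful queries against system.profile; a minimal sketch:

// The five slowest operations, most recent first
> db.system.profile.find().sort({ millis : -1 }).limit(5)

// Operations against a specific collection (namespace is an example)
> db.system.profile.find({ ns : "blog.user" })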

1. Identify slow queries

> db.recipes.find(
    { calories : { $lt : 40 } }
  ).explain()

{
  "cursor" : "BasicCursor",
  "n" : 42,
  "nscannedObjects" : 53641,
  "nscanned" : 53641,
  ...
  "millis" : 252,
  ...
}

2. Usage of explain()

2. Metrics of the execution plan I

• cursor – The type of the cursor: BasicCursor means no index has been used

• n – The number of matched documents

• nscannedObjects – The number of scanned documents

• nscanned – The number of scanned entries (index entries or documents)

2. Metrics of the execution plan II

• millis – Execution time of the query

• The complete reference can be found here – http://docs.mongodb.org/manual/reference/explain

Optimize for:

nscanned / n = 1

(every returned document costs exactly one scanned index entry)

3. Create appropriate indices on the fields being queried

// Using the following index…
> db.collection.ensureIndex({ a : 1, b : 1, c : 1, d : 1 })

// … these queries and sorts can make use of the index
> db.collection.find().sort({ a : 1 })
> db.collection.find().sort({ a : 1, b : 1 })
> db.collection.find({ a : 4 }).sort({ a : 1, b : 1 })
> db.collection.find({ b : 5 }).sort({ a : 1, b : 1 })

4. Optimize queries taking theavailable indices into account

// Using the following index…
> db.collection.ensureIndex({ a : 1, b : 1, c : 1, d : 1 })

// … these queries can not make use of it
> db.collection.find().sort({ b : 1 })
> db.collection.find({ b : 5 }).sort({ b : 1 })

4. Optimize queries taking theavailable indices into account

// Using the following index…
> db.recipes.ensureIndex({ main_ingredient : 1, name : 1 })

// … this query can be completely satisfied using the index!
> db.recipes.find(
    { main_ingredient : 'chicken' },
    { _id : 0, name : 1 }
  )

// The metric indexOnly in explain() verifies this:
> db.recipes.find(
    { main_ingredient : 'chicken' },
    { _id : 0, name : 1 }
  ).explain()
{
  "indexOnly" : true,
  ...
}

4. Optimize queries taking theavailable indices into account

// Tell MongoDB explicitly which index to use
> db.recipes.find(
    { calories : { $lt : 1000 } }
  ).hint({ _id : 1 })

// Switch the usage of indices completely off
// (e.g. for performance measurements)
> db.recipes.find(
    { calories : { $lt : 1000 } }
  ).hint({ $natural : 1 })

Use specific indices

Caveats using indices

// MongoDB can only use one index per query!

> db.collection.ensureIndex({ a: 1 })

> db.collection.ensureIndex({ b: 1 })

// For this query only one of those two indices can be used

> db.collection.find({ a: 3, b: 4 })

Using multiple indices

// Compound indices are often very efficient!
> db.collection.ensureIndex({ a : 1, b : 1, c : 1 })

// But only if the query is a prefix of the index…

// This query can not make use of the index
> db.collection.find({ c : 2 })

// …but this query can
> db.collection.find({ a : 3, b : 5 })

Compound indices

// The following field has only a few distinct values
> db.collection.distinct('status')
[ 'new', 'processed' ]

// An index on this field is not the best idea…
> db.collection.ensureIndex({ status : 1 })
> db.collection.find({ status : 'new' })

// Better: use an adequate compound index with other fields
> db.collection.ensureIndex({ status : 1, created_at : -1 })
> db.collection.find(
    { status : 'new' }
  ).sort({ created_at : -1 })

Indices with low selectivity

> db.users.ensureIndex({ username : 1 })

// Left-bound regular expressions can make use of this index
> db.users.find({ username : /^joe smith/ })

// But not queries with regular expressions in general…
> db.users.find({ username : /smith/ })

// …and also not case-insensitive queries
> db.users.find({ username : /^Joe/i })

Regular expressions & Indices

// Negations can not make use of indices
> db.things.ensureIndex({ x : 1 })

// e.g. queries using not equal
> db.things.find({ x : { $ne : 3 } })

// …or queries with not in
> db.things.find({ x : { $nin : [ 2, 3, 4 ] } })

// …or queries with the $not operator ($not takes an operator
// expression or a regular expression, not a plain value)
> db.people.find({ name : { $not : /^John Doe$/ } })

Negations & Indices

Lab time!

Lab Nr. 03

Time box: 20 min

Map/Reduce

What is Map/Reduce?

• Programming model coming from functional languages

• Framework for
  – parallel processing
  – of big data volumes
  – using distributed systems

• Made popular by Google
  – Invented to calculate the inverted search index from web sites to keywords (PageRank)
  – http://research.google.com/archive/mapreduce.html

Basics

• Not something special about MongoDB
  – Hadoop
  – Disco
  – Amazon Elastic MapReduce
  – …

• Based on key-value pairs

• Prior to version 2.4 and the introduction of the V8 JavaScript engine, only one thread per shard

The "Hello World" of Map/Reduce: Word Count

Word Count: Problem

Problem: How often does each word appear across all documents?

Input documents:
{ MongoDB uses MapReduce }
{ There is a map phase }
{ There is a reduce phase }

Expected output:
a: 2, is: 2, map: 1, mapreduce: 1, mongodb: 1, phase: 2, reduce: 1, there: 2, uses: 1

Word Count: Mapping

The mapper turns each input document into (word, 1) pairs:

(doc1, "…") → (mongodb, 1) (uses, 1) (mapreduce, 1)
(doc2, "…") → (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
(doc3, "…") → (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)

Word Count: Group/Sort

The emitted pairs are grouped and sorted by key and partitioned into key ranges (e.g. a–l, m–q, r–z), so that, for instance, (map, 1) and (phase, 1) land in m–q while (reduce, 1) and (there, 1) land in r–z.

Word Count: Reduce

The reducer receives each key together with the list of its values:

(a, [1, 1]) (is, [1, 1]) (map, [1]) (mapreduce, [1]) (mongodb, [1]) (phase, [1, 1]) (reduce, [1]) (there, [1, 1]) (uses, [1])

Word Count: Result

Reducing the value lists yields the final counts:

a: 2, is: 2, map: 1, mapreduce: 1, mongodb: 1, phase: 2, reduce: 1, there: 2, uses: 1

Word Count: In a nutshell

map() transforms one key-value pair into 0–N key-value pairs

reduce() reduces 0–N key-value pairs into one key-value pair

Map/Reduce: Overview

[Diagram: map() iterates all documents on Shard 1…n and emits (k, v); the pairs are grouped and sorted by key; reduce(k, values) can run multiple times (input = output); finalize(k, v) produces the result]

// Example: Twitter database with tweets
> db.tweets.findOne()
{
  "_id" : ObjectId("4fb9fb91d066d657de8d6f38"),
  "text" : "RT @RevRunWisdom: The bravest thing that men do is love women #love",
  "created_at" : "Thu Sep 02 18:11:24 +0000 2010",
  "user" : {
    "friends_count" : 0,
    "profile_sidebar_fill_color" : "252429",
    "screen_name" : "RevRunWisdom",
    "name" : "Rev Run",
    …
  },
  …

// Map function with simple data cleansing
map = function() {
  this.text.split(' ').forEach(function(word) {
    // Remove whitespace
    word = word.replace(/\s/g, "");
    // Remove all non-word characters
    word = word.replace(/\W/gm, "");
    // Finally emit the cleaned-up word
    if (word != "") {
      emit(word, 1);
    }
  });
};

Word Count: map()

// Reduce function
// Note: reduce() can run multiple times per key, so it must be
// re-reducible; summing the values is correct, counting them is not
reduce = function(key, values) {
  return Array.sum(values);
};

Word Count: reduce()

// Show the results on the console
> db.tweets.mapReduce(map, reduce, { out : { inline : 1 } });

// Save the results to a collection
> db.tweets.mapReduce(map, reduce, { out : "tweets_word_count" });
{
  "result" : "tweets_word_count",
  "timeMillis" : 19026,
  "counts" : {
    "input" : 53641,
    "emit" : 559217,
    "reduce" : 102057,
    "output" : 131003
  },
  "ok" : 1,
}

Word Count: Call
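mapReduce() also accepts further options, e.g. a query filter; a hedged sketch based on the tweets example:

// Only count words of English tweets and merge the result
// incrementally into the existing output collection
> db.tweets.mapReduce(map, reduce, {
    query : { "user.lang" : "en" },
    out : { reduce : "tweets_word_count" }
  })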

// Top 10 most common words in tweets
> db.tweets_word_count.find().sort({ "value" : -1 }).limit(10)
{ "_id" : "Miley", "value" : 31 }
{ "_id" : "mil", "value" : 31 }
{ "_id" : "andthenihitmydougie", "value" : 30 }
{ "_id" : "programa", "value" : 30 }
{ "_id" : "Live", "value" : 29 }
{ "_id" : "Super", "value" : 29 }
{ "_id" : "cabelo", "value" : 29 }
{ "_id" : "listen", "value" : 29 }
{ "_id" : "Call", "value" : 28 }
{ "_id" : "DA", "value" : 28 }

Word Count: Result

Recommendation

Typical use cases

• Counting, aggregating & summing up
  – Analyzing log entries & generating log reports
  – Generating an inverted index
  – Substituting existing ETL processes

• Counting unique values
  – Counting the number of unique visitors of a website

• Filtering, parsing & validation
  – Filtering of user data
  – Consolidation of user-generated data

• Sorting
  – Data analysis using complex sorting

Summary

• The Map/Reduce framework is very versatile & powerful

• Is implemented in JavaScript
  – You need to write your own map() and reduce() functions in JavaScript
  – Difficult to debug
  – Performance is highly influenced by the JavaScript engine

• Can be used for complex data analytics

• Lots of overhead for simple aggregation tasks
  – Summing up of data
  – Averaging of data
  – Grouping of data

Map/Reduce should be used as a last resort (ultima ratio)!

Lab time!

Lab Nr. 04

Time box: 20 min

Aggregation Framework

Why?

SELECT customer_id, SUM(price)
FROM orders
WHERE active = true
GROUP BY customer_id

That's why!

[The SQL statement combines calculation of fields (SUM) with grouping of data (GROUP BY)]
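For comparison, a sketch of the same statement expressed with the Aggregation Framework (field names taken from the SQL example):

> db.orders.aggregate(
    { $match : { active : true } },
    { $group : { _id : "$customer_id", total : { $sum : "$price" } } }
  );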

The Aggregation Framework

• Has been introduced to cover 90% of real-world aggregation use cases without using the "big hammer" Map/Reduce

• Framework of methods & operators
  – Declarative
  – No own JavaScript code needed
  – Fixed set of methods and operators (but constantly under development by MongoDB Inc.)

• Implemented in C++
  – Limitations of the JavaScript engine are avoided
  – Better performance

The Aggregation Pipeline

{document} → Pipeline Operator → Pipeline Operator → Pipeline Operator → Result

Result:
{
  sum : 337
  avg : 24.53
  min : 2
  max : 99
}

The Aggregation Pipeline

• Processes a stream of documents
  – Input is a complete collection
  – Output is a document containing the results

• Succession of pipeline operators
  – Each tier filters or transforms the documents
  – Input documents of a tier are the output documents of the previous tier

db.tweets.aggregate(

{ $pipeline_operator_1 },

{ $pipeline_operator_2 },

{ $pipeline_operator_3 },

{ $pipeline_operator_4 },

...

);

Call

// Old friends*

$match

$sort

$limit

$skip

Pipeline Operators

// New friends

$project

$group

$unwind

* from the query functionality

// Example: Twitter database with tweets
> db.tweets.findOne()
{
  "_id" : ObjectId("4fb9fb91d066d657de8d6f38"),
  "text" : "RT @RevRunWisdom: The bravest thing that men do is love women #love",
  "created_at" : "Thu Sep 02 18:11:24 +0000 2010",
  "user" : {
    "friends_count" : 0,
    "profile_sidebar_fill_color" : "252429",
    "screen_name" : "RevRunWisdom",
    "name" : "Rev Run",
    …
  },
  …

Example: Tweets

// Show all German users
> db.tweets.aggregate(
    { $match : { "user.lang" : "de" } }
  );

// Show all users with 0 to 10 followers
> db.tweets.aggregate(
    { $match : { "user.followers_count" : { $gte : 0, $lt : 10 } } }
  );

$match

> Filters documents
> Equivalent to .find()

// Sorting using one field

> db.tweets.aggregate(

{ $sort : {"user.friends_count" : -1} },

);

// Sorting using multiple fields

> db.tweets.aggregate(

{ $sort : {"user.lang" : 1, "user.time_zone" : 1, "user.friends_count" : -1} },

);

$sort

> Sorts documents
> Equivalent to .sort()

// Limit the number of resulting documents to 3

> db.tweets.aggregate(

{ $sort : {"user.friends_count" : -1} },

{ $limit : 3 }

);

$limit

> Limits resulting documents
> Equivalent to .limit()

// Get the no. 4 Twitterer according to number of friends

> db.tweets.aggregate(

{ $sort : {"user.friends_count" : -1} },

{ $skip : 3 },

{ $limit : 1 }

);

$skip

> Skips documents
> Equivalent to .skip()

// Limit the result document to only one field

> db.tweets.aggregate(

{ $project : {text : 1} },

);

// Remove _id

> db.tweets.aggregate(

{ $project : {_id: 0, text : 1} },

);

$project I

> Limits the fields in resulting documents

// Rename a field

> db.tweets.aggregate(

{ $project : {_id: 0, content_of_tweet : "$text"} },

);

// Add a calculated field

> db.tweets.aggregate(

{ $project : {_id: 0, content_of_tweet : "$text", number_of_friends : {$add: ["$user.friends_count", 10]} } },

);

$project II

// Add a subdocument
> db.tweets.aggregate(
    { $project : {
        _id : 0,
        content_of_tweet : "$text",
        user : {
          name : "$user.name",
          number_of_friends : { $add : [ "$user.friends_count", 10 ] }
        }
    } }
  );

$project III

// Grouping using a single field
> db.tweets.aggregate(
    { $group : {
        _id : "$user.lang",
        anzahl_tweets : { $sum : 1 }
    } }
  );

$group I

> Groups documents
> Equivalent to GROUP BY in SQL

// Grouping using multiple fields
> db.tweets.aggregate(
    { $group : {
        _id : {
          background_image : "$user.profile_use_background_image",
          language : "$user.lang"
        },
        number_of_tweets : { $sum : 1 }
    } }
  );

$group II

// Grouping with multiple calculated fields
> db.tweets.aggregate(
    { $group : {
        _id : "$user.lang",
        number_of_tweets : { $sum : 1 },
        average_of_followers : { $avg : "$user.followers_count" },
        minimum_of_followers : { $min : "$user.followers_count" },
        maximum_of_followers : { $max : "$user.followers_count" }
    } }
  );

$group III

$min

$max

$avg

$sum

Group Aggregation Functions

$addToSet

$first

$last

$push
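A sketch of $addToSet and $push in a grouping, based on the tweets example (assuming the tweets carry a user.time_zone field):

// Collect the set of distinct languages and the list of
// screen names per time zone
> db.tweets.aggregate(
    { $group : {
        _id : "$user.time_zone",
        languages : { $addToSet : "$user.lang" },
        users : { $push : "$user.screen_name" }
    } }
  );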

// Unwind an array
> db.tweets.aggregate(
    { $project : {
        _id : 0,
        content_of_tweet : "$text",
        mentioned_users : "$entities.user_mentions.name"
    } },
    { $skip : 18 },
    { $limit : 1 },
    { $unwind : "$mentioned_users" }
  );

$unwind I

> Unwinds arrays and creates one document per value in the array

// Resulting document without $unwind
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : [
    "Philanthropy",
    "Allison Fine"
  ]
}

$unwind II

// Resulting documents with $unwind
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : "Philanthropy"
},
{
  "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS",
  "mentioned_users" : "Allison Fine"
}

$unwind III

Best Practices

Place $match at the beginning of the pipeline to reduce the number of documents as soon as possible!

Best Practice #1

Use $project to remove unneeded fields from the documents as soon as possible!

Best Practice #2

When placed at the beginning of the pipeline, these operators can make use of indices:

$match, $sort, $limit, $skip

The above operators can equally use indices when placed before these operators:

$project, $unwind, $group

Best Practice #3
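Putting the practices together; a sketch based on the tweets example:

// $match first (can use an index), then $project to strip fields,
// then the expensive $group at the end
> db.tweets.aggregate(
    { $match : { "user.lang" : "en" } },
    { $project : { _id : 0, followers : "$user.followers_count" } },
    { $group : { _id : null, avg_followers : { $avg : "$followers" } } }
  );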

Mapping of MongoDB to SQL

Mapping

SQL ➜ MongoDB Aggregation

WHERE ➜ $match
GROUP BY ➜ $group
HAVING ➜ $match
SELECT ➜ $project
ORDER BY ➜ $sort
LIMIT ➜ $limit
SUM() ➜ $sum
COUNT() ➜ $sum
join ➜ No equivalent operator ($unwind has somewhat equivalent functionality for embedded fields)

Example: Online shopping

{
  cust_id : "sheldon1",
  ord_date : ISODate("2013-04-18T19:38:11.102Z"),
  status : 'purchased',
  price : 105.69,
  items : [
    { sku : "nobel_price_replica", qty : 3, price : 29.90 },
    { sku : "wheaton_voodoo_doll", qty : 1, price : 15.99 }
  ]
}

Count all orders

SQL:

SELECT COUNT(*) AS count FROM orders

MongoDB Aggregation:

> db.orders.aggregate( [
    { $group : { _id : null, count : { $sum : 1 } } }
  ] )

Sum of order prices per customer

SQL:

SELECT cust_id, SUM(price) AS total
FROM orders
GROUP BY cust_id
ORDER BY total

MongoDB Aggregation:

> db.orders.aggregate( [
    { $group : { _id : "$cust_id", total : { $sum : "$price" } } },
    { $sort : { total : 1 } }
  ] )

Sum up all orders over 250$

SQL:

SELECT cust_id, SUM(price) AS total
FROM orders
WHERE status = 'purchased'
GROUP BY cust_id
HAVING total > 250

MongoDB Aggregation:

> db.orders.aggregate( [
    { $match : { status : 'purchased' } },
    { $group : { _id : "$cust_id", total : { $sum : "$price" } } },
    { $match : { total : { $gt : 250 } } }
  ] )

http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/

More examples

Lab time!

Lab Nr. 05

Time box: 20 min

Replication: High Availability with MongoDB

Why do we need replication?

• Hardware is unreliable and is doomed to fail!

• Do you want to be the person being called at night to do a manual failover?

• How about network latency?

• Different use cases for your data
  – "Regular" processing
  – Data for analysis
  – Data for backup

Life cycle of a replica set

Replica set – Create

Replica set – Initializing

Replica set – Node down

Replica set – Failover

Replica set – Recovery

Replica set – Back to normal

Roles & Configuration

Replica sets - Roles

> conf = {
    _id : "mySet",
    members : [
      { _id : 0, host : "A", priority : 3 },
      { _id : 1, host : "B", priority : 2 },
      { _id : 2, host : "C" },
      { _id : 3, host : "D", hidden : true },
      { _id : 4, host : "E", hidden : true, slaveDelay : 3600 }
    ]
  }

> rs.initiate(conf)

Configuration I

The same configuration, annotated slide by slide:

Configuration II – Primary data center: hosts A and B (priority 3 and 2)

Configuration III – Secondary data center: host C (default priority = 1)

Configuration IV – Analytical data, e.g. for Hadoop, Storm, BI, …: host D (hidden : true)

Configuration V – Back-up node: host E (hidden : true, slaveDelay : 3600)

Data consistency

Strong consistency

Eventual consistency

Write Concern

• Different levels of data consistency

• Acknowledged by– Network– MongoDB– Journal– Secondaries– Tagging

Acknowledged by network: "Fire and forget"

Acknowledged by MongoDB: Wait for error

Acknowledged by journal: Wait for journal sync

Acknowledged by secondaries: Wait for replication

Tagging while writing data

• Available since 2.0

• Allows for fine granular control

• Each node can have multiple tags
  – tags : { dc : "ny" }
  – tags : { dc : "ny", subnet : "192.168", rack : "row3rk7" }

• Allows for creating Write Concern Rules (per replica set)

• Tags can be adapted without code changes and restarts

{
  _id : "mySet",
  members : [
    { _id : 0, host : "A", tags : { "dc" : "ny" } },
    { _id : 1, host : "B", tags : { "dc" : "ny" } },
    { _id : 2, host : "C", tags : { "dc" : "sf" } },
    { _id : 3, host : "D", tags : { "dc" : "sf" } },
    { _id : 4, host : "E", tags : { "dc" : "cloud" } }
  ],
  settings : {
    getLastErrorModes : {
      allDCs : { "dc" : 3 },
      someDCs : { "dc" : 2 }
    }
  }
}

> db.blogs.insert({...})

> db.runCommand({getLastError : 1, w : "someDCs"})

Tagging - Example

Acknowledged by tagging: Wait for replication (tagging)

// Wait for network acknowledgement
> db.runCommand( { getLastError : 1, w : 0 } )

// Wait for error (default)
> db.runCommand( { getLastError : 1, w : 1 } )

// Wait for journal sync
> db.runCommand( { getLastError : 1, w : 1, j : true } )

// Wait for replication
> db.runCommand( { getLastError : 1, w : "majority" } )
> db.runCommand( { getLastError : 1, w : 3 } ) // number of members that must acknowledge

Configure the Write Concern

Read Preferences

• Only primary (primary)

• Primary preferred (primaryPreferred)

• Only secondaries (secondary)

• Secondaries preferred (secondaryPreferred)

• Nearest node (nearest)

General: if more than one node is eligible, the nearest node will be chosen (all modes except primary)

Only primary (primary): all reads go to the primary

Primary preferred (primaryPreferred): reads go to the primary; if it is unavailable, to a secondary

Only secondaries (secondary): reads go to secondaries only

Secondaries preferred (secondaryPreferred): reads go to secondaries; if none is available, to the primary

Nearest node (nearest): reads go to the nearest node

Tagging while reading data

• Allows for more fine-grained control over where data will be read from
  – e.g. { "disk" : "ssd", "use" : "reporting" }

• Can be combined with other read modes
  – Except for the mode "Only primary"

// Only primary
> cursor.setReadPref( "primary" )

// Primary preferred
> cursor.setReadPref( "primaryPreferred" )

// Only secondaries, with tagging
> cursor.setReadPref( "secondary", [ { "rack" : 2 } ] )

Configure the Read Concern

The read preference must be configured before using the cursor to read data!

MongoDB Operation

Maintenance & Upgrades

• Zero downtime

• Rolling upgrades and maintenance
  – Start with all secondaries
  – Step down the current primary
  – Upgrade the primary as the last one
  – Restore the previous primary (if needed)

• Commands:
  – rs.stepDown(<secs>)
  – db.version()
  – db.serverBuildInfo()

Replica set – 1 data center

• One of each:
  – Data center
  – Switch
  – Power supply

• Possible errors:
  – Failure of 2 nodes
  – Power supply
  – Network
  – Data center

• Automatic recovery

Replica set – 2 data center

• Additional node for data recovery

• No writing to both data centers, since there is only one node in data center no. 2

Replica set – 3 data center

• Can recover from a complete data center failure

• Allows the usage of w = { dc : 2 } to guarantee writing to 2 data centers (via tagging)

Commands

• Administration of the nodes
  – rs.conf()
  – rs.initiate(<conf>) & rs.reconfig(<conf>)
  – rs.add(host:<port>) & rs.addArb(host:<port>)
  – rs.status()
  – rs.stepDown(<secs>)

• Reconfiguration if a minority of the nodes is not available
  – rs.reconfig( cfg, { force : true } )

Best Practices

Best Practices

• Use an uneven number of nodes

• Adapt the write concern to your use case

• Read from the primary, except for
  – Geographical distribution
  – Data analytics

• Use logical names, not IP addresses, in configurations

• Monitor the lag of the secondaries (e.g. with MMS)

Lab time!

Lab Nr. 06

Time box: 20 min

Sharding: Scaling with MongoDB

[Visual representation of vertical scaling]

1970 – 2000: Vertical scaling, "scale up"

[Visual representation of horizontal scaling]

Since 2000: Horizontal scaling, "scale out"

When to use Sharding?

Not enough disk space

The working set doesn't fit into memory

The required read/write throughput is higher than the I/O capabilities

Sharding MongoDB

Partitioning of data

• The user needs to define a shard key

• The shard key defines the distribution of data across the shards

Partitioning of data into chunks

• Initially all data is in one chunk

• Maximum chunk size: 64 MB

• MongoDB splits and distributes chunks automatically once the maximum size is reached

One chunk contains the data of a certain value range

Chunks & Shards

• A shard is one node in the cluster

• A shard can be one single mongod or a replica set

Metadata Management

• Config server
  – Stores the value ranges of the chunks and their location
  – The number of config servers is 1 or 3 (production: 3)
  – Two-phase commit

Balancing & Routing Service

• mongos balances the data in the cluster

• mongos distributes data to new nodes

• mongos routes queries to the correct shard, or collects results if data is spread across multiple shards

• No local data

Automatic Balancing

Balancing is done automatically once the difference in the number of chunks between shards hits a certain threshold
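The current chunk distribution and balancer state can be inspected from the shell; a minimal sketch:

// Show shards, databases and chunk distribution
> sh.status()

// Check whether the balancer is currently enabled
> sh.getBalancerState()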

Splitting of a chunk

• Once a chunk hits the maximum size, it is split

• Splitting is only a logical operation; no data needs to be moved

• If the splitting of a chunk results in an imbalance of data, automatic rebalancing is started

Sharding Infrastructure

MongoDB Auto Sharding

• Minimal effort
  – Usage of the same interfaces for mongod and mongos

• Easy configuration
  – Enable sharding for a database
    • sh.enableSharding("<database>")
  – Shard a collection in a database
    • sh.shardCollection("<database>.<collection>", shard-key-pattern)

Configuration example

Example of a very simple cluster

• Never use this in production!
  – Only one config server (no fault tolerance)
  – The shard is no replica set (no high availability)
  – Only one mongos and one shard (no performance improvement)

// Start the config server (Default port 27019)

> mongod --configsvr

Start the config server

// Start the mongos router (Default port 27017)

> mongos --configdb <hostname>:27019

// When using 3 config servers

> mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>

Start the mongos routing service

// Start a shard with one mongod (Default port 27018)

> mongod --shardsvr

// Shard is not yet added to the cluster!

Start the shard

// Connect to mongos and add the shard

> mongo

> sh.addShard('<host>:27018')

// When adding a replica set, you only need to add one of the nodes!

Add the shard

// Check if the shard has been added

> db.runCommand({ listShards:1 })

{ "shards" : [ { "_id”: "shard0000”, "host”: ”<hostname>:27018” } ],

"ok" : 1}

Check configuration

// Enable sharding for a database
> sh.enableSharding("<dbname>")

// Shard a collection using a shard key
> sh.shardCollection("<dbname>.user", { "name" : 1 })

// Use a compound shard key
> sh.shardCollection("<dbname>.cars", { "year" : 1, "uniqueid" : 1 })

Configure sharding

Shard Key

Shard Key

• The shard key can not be changed

• The values of a shard key can not be changed

• The shard key needs to be indexed

• The uniqueness of the field _id is only guaranteed within a shard

• The size of a shard key is limited to 512 bytes

Considerations for the shard key

• Cardinality of data
  – The value range needs to be rather large. For example, sharding on the field loglevel with the 3 values error, warning, and info doesn't make sense.

• Distribution of data
  – Always strive for an equal distribution of data across all shards!

• Patterns during reading and writing
  – For example, for log data, using the timestamp as a shard key can be useful if chronologically close data needs to be read or written together.

Choices for the shard key

• Single field
  – If the value range is big enough and data is distributed almost equally

• Compound fields
  – Use this if a single field is not enough with respect to value range and equal distribution

• Hash based
  – In general, a random shard key is a good choice for an equal distribution of data
  – For performance, the shard key should be part of the queries
  – Only available since 2.4
    • sh.shardCollection( "user.name", { a : "hashed" } )

{
  _id : 346,
  username : "sheldinator",
  password : "238b8be8bd133b86d1e2ba191a94f549",
  first_name : "Sheldon",
  last_name : "Cooper",
  created_on : "Mon Apr 15 15:30:32 +0000 2013",
  modified_on : "Thu Apr 18 08:11:23 +0000 2013"
}

Example: User

Which shard key would you choose and why?

{
  log_type : "error",   // Possible values: "error", "warn", "info"
  application : "JBoss v. 4.2.3",
  message : "Fatal error. Application will quit.",
  created_on : "Mon Apr 15 15:38:05 +0000 2013"
}

Example: Log data

Which shard key would you choose and why?

Routing of queries

Possible types of queries

• Exact queries– Data is exactly on one shard

• Distributed query– Data is distributed on different shards

• Distributed query with sorting– Data is distributed on different shards and needs to

be sorted

Exact queries

1. mongos receives the query from the client

2. Query is routed to the shard with the data

3. Shard returns the data

4. mongos returns the data to the client

Distributed queries

1. mongos receives the query from the client

2. mongos routes the query to all shards

3. Shards return the data

4. mongos returns the data to the client

Distributed queries with sorting

1. mongos receives the query from the client

2. mongos routes the query to all shards

3. Execute the query and local sorting

4. Shards return sorted data

5. mongos sorts the data globally

6. mongos returns the sorted data to the client

Lab time!

Lab Nr. 07

Time box: 20 min

Still want moar?

https://education.mongodb.com
