The Rough Guide to MongoDB
DESCRIPTION
Simeon Simeonov, Founder & CTO of Swoop, shares how Swoop uses Mongo behind the scenes for their high-performance core data processing and analytics. The presentation goes over tips and tricks such as zero-overhead hierarchical relationships with MongoDB, high-performance MongoDB atomic update buffering, content-addressed storage using cryptographic hashing and more. Presented to the Boston MongoDB User Group.
TRANSCRIPT
The Rough Guide to MongoDB
Simeon Simeonov
@simeons
Founding. Funding.
Growing. Startups.
Why MongoDB?
I am @simeons
recruit amazing people
solve hard problems
ship
make users happy
repeat
Why MongoDB?
Again, please
SQL is slow (for our business)
SQL is slow (for our developer workflow)
SQL is slow (for our analytics system)
So what’s Swoop?
Display Advertising Makes the Web Suck
User-focused optimization
Tens of millions of users
1000+% better than average
200+% better than Google
Swoop Fixes That
Mobile SDKs: iOS & Android
Web SDK: RequireJS & jQuery
Components: AngularJS
NLP, etc.: Python
Targeting: High-Perf Java
Analytics: Ruby 2.0
Internal Apps: Ruby 2.0 / Rails 3
Pub Portal: Ruby 2.0 / Rails 3
Ad Portal: Ruby 2.0 / Rails 4
MongoDB: the Good
Fast
Flexible
JavaScript
MongoDB: the Bad
Not Quite Enterprise-Grade
Not Quite Enterprise-Grade
Not Cheap to Run Well
I will write more robust code
I will design a better map-reduce
RAM + locks == $$$
Five Steps to Happiness
Sharding
Native Relationships
Atomic Update Buffering
Content-Addressed Storage
Shell Tricks
// Google AdWords object model
Account
  Campaign
    AdGroup // this joins ads & keywords
      Ad
      Keyword
// For example
AdGroup has an Account
AdGroup has a Campaign
AdGroup has many Ads
AdGroup has many Keywords
Slam dunk for SQL
// Let’s play a bit
Account Campaign AdGroup Ad Keyword
// Let’s play some more
Account Campaign AdGroup Ad Keyword
// There is just one bit left
Account Campaign AdGroup (1 Ad | 0 Keyword)
// build a hierarchical ID
accountID campaignID adGroupID ((0 keywordID) | (1 adID))
// a binary ID!
10100100001100000000101001100110101010010100
< accountID >< campaignID >< …
// Encode it in base 16, 32 or 64
{"_id" : "a4300a66a94d20f1", … }
// Example
The 5th ad of the 3rd ad group of the 7th campaign of the 255th account
could have the _id 0x00ff000700031005.
The _id for the 10th keyword of the same ad group would be 0x00ff00070003000a.
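The example IDs above can be reproduced with a few shifts. This is a minimal sketch, not the code Swoop runs: the 16/16/16/4/12-bit field layout is inferred from the two example values, `makeId` and the type constants are hypothetical names, and BigInt is used because a plain JavaScript Number cannot hold a 64-bit integer exactly.

```javascript
// Sketch only: field widths are inferred from the example IDs on the slides.
const AD = 1n;      // type flag for ads
const KEYWORD = 0n; // type flag for keywords

function makeId(account, campaign, adGroup, type, seq) {
  const id =
    (BigInt(account) << 48n) |  // bits 48-63: account
    (BigInt(campaign) << 32n) | // bits 32-47: campaign
    (BigInt(adGroup) << 16n) |  // bits 16-31: ad group
    (type << 12n) |             // bit 12: ad vs. keyword
    BigInt(seq);                // bits 0-11: position within the ad group
  // Fixed-width hex keeps string order identical to numeric order.
  return id.toString(16).padStart(16, "0");
}

const adId = makeId(255, 7, 3, AD, 5);
const keywordId = makeId(255, 7, 3, KEYWORD, 10);
console.log(adId, keywordId);
```

Zero-padding matters if the ID is stored as a string: without it, range queries on the _id index would no longer follow the hierarchy.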
// Neat: the ad’s and keyword’s _ids contain the
// IDs of all of their ancestors in the hierarchy.
keywordId = 0x00ff00070003000a
adGroupId = keywordId & 0xffffffffffff0000
campaignId = keywordId & 0xffffffff00000000
accountId = keywordId & 0xffff000000000000
// has-a relationship is a simple lookup
account = db.accounts.findOne({_id: accountId})
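The masks above read as pseudocode: in JavaScript, `&` on plain Numbers truncates its operands to 32 bits, so a 64-bit ID would be mangled. A hedged sketch of the same ancestor extraction using BigInt literals (the `hex` helper is a hypothetical convenience, not from the talk):

```javascript
// Same masks as on the slide, written as BigInt so all 64 bits survive.
const keywordId = 0x00ff00070003000an;

const adGroupId = keywordId & 0xffffffffffff0000n;  // drop type + position
const campaignId = keywordId & 0xffffffff00000000n; // also drop ad group
const accountId = keywordId & 0xffff000000000000n;  // keep account only

// Hypothetical helper: fixed-width hex for printing or for string _ids.
const hex = (id) => id.toString(16).padStart(16, "0");

console.log(hex(adGroupId), hex(campaignId), hex(accountId));
```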
// Neater: has-many relationships are just
// range queries on the _id index.
adGroupId = keywordId & 0xffffffffffff0000
startOfAds = adGroupId + 0x1000
endOfAds = adGroupId + 0x1fff
adsForKeyword = db.ads.find({ _id: {$gte: startOfAds, $lte: endOfAds}})
// Technically, that was a join via the ad group.
// Who said Mongo can’t do joins???
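The range bounds for that has-many lookup can be checked without a database. A small sketch, assuming the bit layout implied by the example IDs (ad flag at 0x1000, 12 bits of position); `adRangeFor` and `isAdOf` are hypothetical names:

```javascript
// Compute the _id range that holds every ad in the same ad group as `id`.
function adRangeFor(id) {
  const adGroupId = id & 0xffffffffffff0000n;
  return { start: adGroupId + 0x1000n, end: adGroupId + 0x1fffn };
}

// In-memory stand-in for the $gte/$lte range query on the _id index.
function isAdOf(adId, id) {
  const { start, end } = adRangeFor(id);
  return adId >= start && adId <= end;
}

const keywordId = 0x00ff00070003000an;
console.log(isAdOf(0x00ff000700031005n, keywordId)); // ad in the same ad group
console.log(isAdOf(0x00ff000700041005n, keywordId)); // ad in another ad group
```

The same comparison works on fixed-width hex strings, since their lexicographic order matches the numeric order.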
> db.reports.findOne()
{
  "_id" : …,
  "period" : "hour",
  "shard" : 0, // 16 MB doc limit protection
  "topic" : "ce-1",
  "ts" : ISODate("2012-06-12T05:00:00Z"),
  "variations" : {
    "2" : { // variationID (dimension set)
      "hint" : {
        "present" : 311, // hint.present is a metric
        "clicks" : 1
      }
    },
    "4" : {
      "hint" : {
        "present" : 331
      }
    }
  }
}
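The transcript names atomic update buffering but does not show how it is done. One plausible reading, sketched here under that assumption, is to accumulate counter increments in memory and flush them as a single `$inc` update per report document. `UpdateBuffer` and its methods are hypothetical names, and the filter fields are taken from the report document above.

```javascript
// Sketch: collect increments in memory, emit one $inc update per document.
class UpdateBuffer {
  constructor() {
    this.pending = new Map(); // JSON-encoded filter -> { dottedPath: delta }
  }

  // Record an increment for one metric of one report document.
  inc(filter, path, by = 1) {
    const key = JSON.stringify(filter);
    const fields = this.pending.get(key) || {};
    fields[path] = (fields[path] || 0) + by;
    this.pending.set(key, fields);
  }

  // Return the buffered updates and empty the buffer.
  flush() {
    const ops = [];
    for (const [key, fields] of this.pending) {
      ops.push({ filter: JSON.parse(key), update: { $inc: fields } });
    }
    this.pending.clear();
    return ops;
  }
}

const buf = new UpdateBuffer();
const hour = { period: "hour", topic: "ce-1", shard: 0 };
buf.inc(hour, "variations.2.hint.present");
buf.inc(hour, "variations.2.hint.present");
buf.inc(hour, "variations.2.hint.clicks");
const ops = buf.flush();
console.log(JSON.stringify(ops));
```

Each flushed op could then be applied with something like `db.reports.updateOne(op.filter, op.update, { upsert: true })`, so many events cost one atomic write per document.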
Content-Addressed Storage
Lazy join abstraction
Very space efficient
Extremely (pre-)cacheable
Join only happens during reporting
// Step 1: take a set of dimensions worth tracking
data = {
  "domain_id" : "SW-28077508-16444",
  "hint" : "Find an organic alternative",
  "theme" : "red"
}
// Step 2: compute a signature (a hash digest), e.g., MD5
sig = "000069569F4835D16E69DF704187AC2F"
// Step 3: if new sig, increment a counter
counter = 264034
// Step 4: create a document in the content-addressed
// store collection for these dimensions
> db.cas.findOne()
{
  "_id" : "000069569F4835D16E69DF704187AC2F", // MD5 hash
  "data" : { // data that was digested to the hash above
    "domain_id" : "SW-28077508-16444",
    "hint" : "Find an organic alternative",
    "theme" : "red"
  },
  "meta_data" : {
    "id" : 264034 // variationID
  },
  "created_at" : ISODate("2013-02-04T12:05:34.752Z")
}
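The four steps can be sketched end to end with Node's built-in `crypto` module and an in-memory Map standing in for the cas collection. The canonical form (sorted key/value pairs for flat data) is an assumption, since the slides do not say how the dimensions are serialized before hashing; `signature` and `ContentStore` are hypothetical names.

```javascript
const crypto = require("crypto");

// Step 2: hash a canonical form so that key order does not change the sig.
// Assumption: flat dimension sets, serialized as sorted [key, value] pairs.
function signature(data) {
  const canonical = JSON.stringify(
    Object.keys(data).sort().map((k) => [k, data[k]])
  );
  return crypto.createHash("md5").update(canonical).digest("hex").toUpperCase();
}

// Steps 3 and 4: assign a counter to each new sig and store the document.
class ContentStore {
  constructor() {
    this.docs = new Map(); // sig -> document, stand-in for the cas collection
    this.counter = 0;
  }

  put(data) {
    const sig = signature(data);
    let doc = this.docs.get(sig);
    if (!doc) {
      this.counter += 1;
      doc = { _id: sig, data, meta_data: { id: this.counter }, created_at: new Date() };
      this.docs.set(sig, doc);
    }
    return doc.meta_data.id; // the short variationID used in reports
  }
}

const store = new ContentStore();
const id1 = store.put({ domain_id: "SW-28077508-16444", hint: "Find an organic alternative", theme: "red" });
const id2 = store.put({ theme: "red", hint: "Find an organic alternative", domain_id: "SW-28077508-16444" });
console.log(id1 === id2); // same dimensions, same variationID
```

Reports then only need the small integer ID, which is what makes the scheme space efficient.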
// Elsewhere, in the reports collection…
"variations" : {
  "264034" : {
    // metrics here
  },
  …
lazy join
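At reporting time the join amounts to looking up each variationID in the content-addressed store. A minimal in-memory sketch, assuming the document shapes shown above; `joinReport` is a hypothetical name and the metric value is illustrative, not taken from the talk.

```javascript
// Resolve each variationID in a report to the dimensions stored in cas.
function joinReport(report, casDocs) {
  const byVariationId = new Map(
    casDocs.map((d) => [String(d.meta_data.id), d.data])
  );
  return Object.entries(report.variations).map(([id, metrics]) => ({
    variationId: id,
    dimensions: byVariationId.get(id) || null, // null if the id is unknown
    metrics,
  }));
}

const casDocs = [
  {
    _id: "000069569F4835D16E69DF704187AC2F",
    data: { domain_id: "SW-28077508-16444", hint: "Find an organic alternative", theme: "red" },
    meta_data: { id: 264034 },
  },
];
// Illustrative metric value only.
const report = { variations: { "264034": { hint: { present: 1 } } } };

const rows = joinReport(report, casDocs);
console.log(JSON.stringify(rows, null, 2));
```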
// Use underscore.js in the shell
// See http://underscorejs.org/
function underscore() {
  load("/mongo_hacks/underscore.js");
}
// Loads underscore.js on the MongoDB server
// Note: db.eval() was deprecated in MongoDB 3.0 and removed in 4.2.
function server_underscore(force) {
  force = force || false;
  if (force || typeof(underscoreLoaded) === 'undefined') {
    db.eval(cat("/mongo_hacks/underscore.js"));
    underscoreLoaded = true;
  }
}
// Callstack printing on exception -- wraps a function
function dbg(f) {
  try {
    f();
  } catch (e) {
    print("\n**** Exception: " + e.toString());
    print("\n");
    print(e.stack);
    print("\n");
    if (arguments.length > 1) {
      printjson(arguments);
      print("\n");
    }
    throw e;
  }
}
function minutesAgo(minutes, d) {
  d = d || new Date();
  return new Date(d.valueOf() - minutes * 60 * 1000);
}
function hoursAgo(hours, d) {
  d = d || new Date();
  return minutesAgo(60 * hours, d);
}
function daysAgo(days, d) {
  d = d || new Date();
  return hoursAgo(24 * days, d);
}
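These helpers are handy for building time-window filters. A short usage sketch with a fixed reference date so the result is reproducible; the helpers are repeated in compact form to keep the block self-contained, and the filter fields are assumed from the reports document shown earlier.

```javascript
// Compact copies of the helpers above, so this block runs on its own.
function minutesAgo(minutes, d) {
  d = d || new Date();
  return new Date(d.valueOf() - minutes * 60 * 1000);
}
function hoursAgo(hours, d) { return minutesAgo(60 * hours, d); }
function daysAgo(days, d) { return hoursAgo(24 * days, d); }

// Fixed "now" for a reproducible example; omit it to use the current time.
const now = new Date("2012-06-12T06:00:00Z");

// Filters that could be passed to db.reports.find(...)
const lastHour = { period: "hour", ts: { $gte: hoursAgo(1, now) } };
const lastWeek = { period: "hour", ts: { $gte: daysAgo(7, now) } };

console.log(lastHour.ts.$gte.toISOString(), lastWeek.ts.$gte.toISOString());
```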
// Don’t write in the shell.
// Use your fav editor, save & type t() in mongo
function t() {
  load("/mongo_hacks/bag_of_tricks.js");
}
@simeons