![Page 1: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/1.jpg)
Finding the right NoSQL DB for the job
The path to a non-RDBMS solution at
![Page 2: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/2.jpg)
Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
![Page 3: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/3.jpg)
George Stathis
VP Engineering
14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.
![Page 4: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/4.jpg)
What’s this talk about?
• Why we picked a NoSQL database
• How we picked a NoSQL database
• My NoSQL does not do the job! What now?!
• Nirvana = the right tool for the job
![Page 5: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/5.jpg)
Why did we pick a NoSQL DB?
![Page 6: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/6.jpg)
There are some misconceptions around NoSQL only being appropriate when one needs to achieve
“Web Scale”
![Page 7: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/7.jpg)
I need web scale!
http://www.youtube.com/watch?v=b2F-DItXtZs
![Page 8: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/8.jpg)
Traackr picked NoSQL; are we “Web Scale”?
![Page 9: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/9.jpg)
• In terms of users/traffic?
Do we fit the “Web scale” profile?
![Page 10: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/10.jpg)
Source: compete.com
![Page 11: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/11.jpg)
Source: compete.com
![Page 12: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/12.jpg)
Source: compete.com
![Page 13: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/13.jpg)
Source: compete.com
![Page 14: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/14.jpg)
Source: highscalability.com
![Page 15: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/15.jpg)
![Page 16: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/16.jpg)
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
![Page 17: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/17.jpg)
PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}
That’s a quarter of a terabyte …
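As a quick sanity check, the byte counts reported by `db.stats()` can be converted by hand. The sketch below only reuses the numbers from the output above; the variable and function names are illustrative, and a decimal terabyte (10^12 bytes) is assumed.

```python
# Figures copied from the db.stats() output above (all sizes in bytes).
objects = 68226121
avg_obj_size = 2972.0800625760330
data_size = 202773493971
index_size = 27472394891
file_size = 266623699968

TB = 10**12  # assumption: decimal terabyte


def to_tb(n_bytes):
    """Convert a byte count to decimal terabytes."""
    return n_bytes / TB


print(f"files on disk: {to_tb(file_size):.2f} TB")
print(f"raw documents: {to_tb(data_size):.2f} TB")
print(f"indexes:       {to_tb(index_size):.3f} TB")

# avgObjSize is dataSize / objects, so the product should reproduce dataSize.
reconstructed = objects * avg_obj_size
```

The "quarter of a terabyte" figure corresponds to `fileSize`; the documents themselves (`dataSize`) are somewhat smaller.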
![Page 18: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/18.jpg)
Wait! What? My Synology NAS at home can hold 2TB!
![Page 19: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/19.jpg)
No need for us to track the entire web
Web Content
Influencer Content
Not at scale :-)
![Page 20: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/20.jpg)
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
![Page 21: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/21.jpg)
Alternate view of “Web Scale”
Web data is:
Heterogeneous
Unstructured (text)
![Page 22: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/22.jpg)
Source: http://www.opte.org/
Visualization of the Internet, Nov. 23rd 2003
![Page 23: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/23.jpg)
Data sources are isolated islands of rich data with loose links to one another
![Page 24: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/24.jpg)
How do we build a database that models all possible entities found on the web?
![Page 25: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/25.jpg)
Modeling the web: the RDBMS way
![Page 26: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/26.jpg)
Source: socialbutterflyclt.com
![Page 27: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/27.jpg)
or
![Page 28: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/28.jpg)
![Page 29: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/29.jpg)
{
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    {
      "siteUrl": "http://twitter.com/dchancogne",
      "metrics": [
        { "value": 216, "name": "twitter_followers_count" },
        { "value": 2107, "name": "twitter_statuses_count" }
      ]
    },
    {
      "siteUrl": "http://traackr.com/blog/author/david",
      "metrics": [
        { "value": 21, "name": "google_inbound_links" }
      ]
    }
  ]
}
Influencer data as JSON
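To make the shape of such a document concrete, here is a minimal sketch that walks the nested `siteReferences` and `metrics` arrays. It uses a trimmed copy of the document above; the helper name `flatten_metrics` is made up for illustration and is not part of any Traackr code.

```python
import json

# A trimmed copy of the influencer document above (some fields omitted).
raw = """
{
  "realName": "David Chancogne",
  "primaryAffiliation": "Traackr",
  "siteReferences": [
    {"siteUrl": "http://twitter.com/dchancogne",
     "metrics": [{"value": 216, "name": "twitter_followers_count"},
                 {"value": 2107, "name": "twitter_statuses_count"}]},
    {"siteUrl": "http://traackr.com/blog/author/david",
     "metrics": [{"value": 21, "name": "google_inbound_links"}]}
  ]
}
"""


def flatten_metrics(doc):
    """Return {(siteUrl, metric name): value} for every nested metric.

    Each site may carry different metric names; nothing here assumes a
    fixed set of columns.
    """
    flat = {}
    for site in doc.get("siteReferences", []):
        for metric in site.get("metrics", []):
            flat[(site["siteUrl"], metric["name"])] = metric["value"]
    return flat


influencer = json.loads(raw)
metrics = flatten_metrics(influencer)
for (site_url, name), value in sorted(metrics.items()):
    print(site_url, name, value)
```

A document with no `siteReferences` at all is handled by the same function, which is the kind of variability a fixed relational schema makes awkward.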
![Page 30: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/30.jpg)
“In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.”
Werner Vogels, CTO/VP Amazon.com
![Page 31: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/31.jpg)
NoSQL = schema flexibility
![Page 32: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/32.jpg)
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
![Page 33: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/33.jpg)
• In terms of users/traffic?
• In terms of the amount of data?
• In terms of the variety of the data? ✓
Do we fit the “Web scale” profile?
![Page 34: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/34.jpg)
Traackr’s Datastore Requirements
• Schema flexibility ✓
• Good at storing lots of variable length text
• Batch processing options
![Page 35: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/35.jpg)
Requirement: text storage
Variable text length: big variance between <140 character tweets and multi-page blog posts
![Page 36: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/36.jpg)
Requirement: text storage
RDBMS’ answer to variable text length:
Plan ahead for largest value
CLOB/BLOB
![Page 37: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/37.jpg)
Requirement: text storage
Issues with CLOB/BLOB for us:
No clue what largest value is
CLOB/BLOB for tweets = wasted space
![Page 38: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/38.jpg)
Requirement: text storage
NoSQL solutions are great for text:
No length requirements (automated chunking)
Limited space overhead
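The idea behind "automated chunking" can be illustrated with a small sketch: text of any length is split into bounded pieces, so a tweet takes one small chunk and a long post simply takes more of them. This is not how any particular database implements it; the 256-character chunk size and the function names are arbitrary assumptions.

```python
def chunk_text(text, chunk_size=256):
    """Split text into pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def join_chunks(chunks):
    """Reassemble the original text from its chunks."""
    return "".join(chunks)


tweet = "Short update about NoSQL."
blog_post = "A much longer article about picking a datastore. " * 200

tweet_chunks = chunk_text(tweet)
post_chunks = chunk_text(blog_post)
print(len(tweet_chunks), len(post_chunks))
```

No maximum length has to be declared up front, and a short value does not reserve space sized for the longest one.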
![Page 39: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/39.jpg)
Traackr’s Datastore Requirements
• Schema flexibility ✓
• Good at storing lots of variable length text ✓
• Batch processing options
![Page 40: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/40.jpg)
Requirement: batch processing
Some NoSQL solutions come with MapReduce
Source: http://code.google.com/
![Page 41: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/41.jpg)
Requirement: batch processing
MapReduce + RDBMS:
Possible but proprietary solutions
Usually involves exporting data from RDBMS into a NoSQL system anyway.
Defeats data locality benefit of MR
![Page 42: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/42.jpg)
Traackr’s Datastore Requirements
• Schema flexibility ✓
• Good at storing lots of variable length text ✓
• Batch processing options ✓
A NoSQL option is the right fit
![Page 43: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/43.jpg)
How did we pick a NoSQL DB?
![Page 44: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/44.jpg)
Bewildering number of options
Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent
Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families
Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema
Graph Databases
• Graph Theory G=(V,E)
• Great for modeling networks
• Great for graph-based query algorithms
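To compare the four families side by side, the sketch below models the same toy record in each style using plain Python structures. All keys, names and values are hypothetical, and real products in each family differ in many details this ignores.

```python
# Key/value: one opaque value per key; the store cannot look inside it.
key_value = {
    "influencer:1": '{"name": "Ann", "site": "example.com"}',
}

# Column store: row id -> column family -> column (qualifier) -> value.
column_store = {
    "influencer:1": {
        "attributes": {"name": "Ann"},
        "sites": {"example.com": "writes for"},
    },
}

# Document store: the value is a structured document the store can query.
document_store = {
    "influencer:1": {"name": "Ann", "sites": [{"url": "example.com"}]},
}

# Graph: a set of vertices plus a set of labelled edges between them.
vertices = {"influencer:1", "site:example.com"}
edges = {("influencer:1", "writes for", "site:example.com")}


def neighbours(vertex):
    """Vertices reachable from `vertex` by following one outgoing edge."""
    return {dst for src, _label, dst in edges if src == vertex}


print(neighbours("influencer:1"))
```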
![Page 45: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/45.jpg)
![Page 46: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/46.jpg)
Trimming options
Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
![Page 47: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/47.jpg)
Trimming options
Memcache: memory-based, we need true persistence
![Page 48: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/48.jpg)
Trimming options
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
![Page 49: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/49.jpg)
Trimming options
Not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches
![Page 50: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/50.jpg)
Trimming options
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
![Page 51: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/51.jpg)
Trimming options
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
![Page 52: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/52.jpg)
Trimming options
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
![Page 53: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/53.jpg)
Trimming options
Riak: very close but in early 2010, we had adoption questions.
![Page 54: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/54.jpg)
Trimming options
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib and support for batch processing using Hadoop/MR.
![Page 55: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/55.jpg)
Climbing the learning curve
![Page 56: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/56.jpg)
When Big-Data = Big Architectures
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
Must have a Hadoop HDFS cluster of at least 2x replication factor nodes.
Must have an odd number of ZooKeeper quorum nodes.
Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
Master/slave architecture means a single point of failure, so you need to protect your master.
And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
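The "odd number of quorum nodes" rule follows from majority voting: a ZooKeeper ensemble stays available only while a strict majority of its nodes is reachable. The short sketch below works through the arithmetic; the function names are illustrative.

```python
def quorum_size(n_nodes):
    """Smallest strict majority of n_nodes."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes):
    """Nodes that can fail while a majority is still reachable."""
    return n_nodes - quorum_size(n_nodes)


for n in range(1, 8):
    print(f"nodes={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Going from 3 to 4 nodes does not raise the number of failures the ensemble can survive, which is why the extra even node is usually considered wasted.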
![Page 57: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/57.jpg)
Source: socialbutterflyclt.com
![Page 58: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/58.jpg)
Jokes aside, no one said open source was easy to use
![Page 59: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/59.jpg)
To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment
![Page 60: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/60.jpg)
What it means to a startup
Development capacity before
Development capacity after
Congrats, you are now a sysadmin…
![Page 61: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/61.jpg)
Whatever, we can do it!
Source: http://knowyourmeme.com/memes/honey-badger
![Page 62: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/62.jpg)
Mapping an A-List to a column store
Name
Ranks References to influencer records
![Page 63: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/63.jpg)
Mapping an A-List to a column store
Unique key
“attributes” column family for general attributes
“influencerId” column family for influencer ranks and foreign keys
![Page 64: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/64.jpg)
Mapping an A-List to a column store
Qualifiers (basically attribute names)
![Page 65: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/65.jpg)
Mapping an A-List to a column store
“name” attribute
Influencer ranks can be attribute names as well
![Page 66: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/66.jpg)
Mapping an A-List to a column store
A-List name value
Influencer id values assigned to each rank (basically foreign keys to an influencer table)
![Page 67: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/67.jpg)
Mapping an A-List to a column store
Can get pretty long so needs indexing and pagination
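The mapping walked through on these slides can be sketched as a nested dictionary: row key, then column family, then qualifier, then value. The family names `attributes` and `influencerId` come from the slides; the row key, list name, influencer ids and the zero-padded rank encoding are assumptions made for illustration (zero-padding is used because column stores such as HBase order qualifiers lexicographically).

```python
# row key -> column family -> qualifier -> value
alist_table = {
    "alist_0001": {
        "attributes": {"name": "Top NoSQL voices"},
        "influencerId": {
            # Zero-padded ranks keep lexicographic order equal to numeric order.
            f"{rank:06d}": f"influencer_id_{1000 + rank}"
            for rank in range(1, 26)
        },
    },
}


def page_of_ranks(row, page, page_size=10):
    """Return one page of (rank, influencer id) pairs, ordered by rank."""
    ranked = sorted(row["influencerId"].items())
    start = page * page_size
    return ranked[start:start + page_size]


row = alist_table["alist_0001"]
first_page = page_of_ranks(row, page=0)
last_page = page_of_ranks(row, page=2)
print(first_page[0], last_page[-1])
```

Here pagination is done in application code after reading the whole row, which is exactly the work one would prefer the datastore to do for long lists.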
![Page 68: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/68.jpg)
Problem: no out-of-the-box row-based indexing and pagination
![Page 69: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/69.jpg)
Whatever, it’s open-source!
Source: http://knowyourmeme.com/memes/honey-badger
![Page 70: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/70.jpg)
Jumping right into the code
![Page 71: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/71.jpg)
MapReduce for batch scoring
• Need to re-score our influencer database once a week
• M/R cranked through it in 15 mins
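A batch re-scoring job of this kind fits the map/reduce shape naturally. The following is a single-process sketch of that shape only, not Hadoop code: the records, metric weights and function names are all made up, and the actual scoring formula is not given in the slides.

```python
from collections import defaultdict

# Hypothetical input records: one per (influencer, site metric).
records = [
    {"influencer": "inf_1", "metric": "twitter_followers_count", "value": 216},
    {"influencer": "inf_1", "metric": "google_inbound_links", "value": 21},
    {"influencer": "inf_2", "metric": "twitter_followers_count", "value": 90},
]

# Made-up weights, for illustration only.
WEIGHTS = {"twitter_followers_count": 1.0, "google_inbound_links": 5.0}


def map_record(record):
    """Map step: emit (key, partial score) pairs for one record."""
    weight = WEIGHTS.get(record["metric"], 0.0)
    yield record["influencer"], weight * record["value"]


def reduce_scores(key, values):
    """Reduce step: combine all partial scores for one influencer."""
    return key, sum(values)


def run_job(inputs):
    """Run map, group by key (the 'shuffle'), then reduce."""
    grouped = defaultdict(list)
    for record in inputs:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_scores(k, v) for k, v in grouped.items())


scores = run_job(records)
print(scores)
```

In a real cluster the map calls run in parallel next to the data and the framework performs the grouping, which is where the speed-up comes from.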
![Page 72: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/72.jpg)
Source: http://www.charliesheentshirts.info/
![Page 73: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/73.jpg)
a few months later…
![Page 74: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/74.jpg)
Need to upgrade to HBase 0.90
• Making sure to remain on recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!
![Page 75: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/75.jpg)
Looks like something is missing
![Page 76: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/76.jpg)
![Page 77: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/77.jpg)
Our DB indexes depend on this!
![Page 78: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/78.jpg)
Let’s get this straight
• HBase no longer comes with secondary indexing out-of-the-box
• It’s been moved out of the trunk to GitHub
• Where only one other company besides us seems to care about it
![Page 79: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/79.jpg)
Only one other maintainer besides us
![Page 80: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/80.jpg)
What it means to a startup
Development capacity
Congrats, you are now an HBase maintainer…
![Page 81: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/81.jpg)
Source: socialbutterflyclt.com
![Page 82: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/82.jpg)
Whatever, we’ll roll our own indexing!
Source: http://knowyourmeme.com/memes/honey-badger
![Page 83: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/83.jpg)
Homegrown HBase Indexes
Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters
Row ids for Posts
![Page 84: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/84.jpg)
Homegrown HBase Indexes
Find posts for influencer_id_1234
Row ids for Posts
![Page 85: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/85.jpg)
Homegrown HBase Indexes
Find posts for influencer_id_5678
Row ids for Posts
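The prefix-scan idea can be simulated without HBase: because rows are kept sorted by key, all posts for one influencer sit in a contiguous range bounded by a start row and a stop row. The sketch below uses a sorted list and `bisect` as a stand-in for the table; the row id layout and helper names are assumptions, not the actual Traackr key design.

```python
import bisect

# Hypothetical post row ids: "<influencer id>_<post id>", kept sorted
# the way a row-key-ordered store would keep them.
post_row_ids = sorted([
    b"influencer_id_1234_post_001",
    b"influencer_id_1234_post_002",
    b"influencer_id_5678_post_001",
    b"influencer_id_5678_post_002",
    b"influencer_id_5678_post_003",
    b"influencer_id_9012_post_001",
])


def stop_row_for(prefix):
    """Smallest key that sorts after every key starting with prefix.

    Assumes the prefix is non-empty and does not end in byte 0xFF.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def prefix_scan(sorted_keys, prefix):
    """Return keys in [prefix, stop_row), i.e. all keys with that prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    stop = bisect.bisect_left(sorted_keys, stop_row_for(prefix))
    return sorted_keys[start:stop]


posts_1234 = prefix_scan(post_row_ids, b"influencer_id_1234_")
posts_5678 = prefix_scan(post_row_ids, b"influencer_id_5678_")
print(posts_1234)
print(posts_5678)
```

The scan touches only the rows in the range, so its cost depends on the number of matching posts rather than on the size of the table.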
![Page 86: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/86.jpg)
Homegrown HBase Indexes
• No longer depending on unmaintained code
• Works with an out-of-the-box HBase installation
![Page 87: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/87.jpg)
What it means to a startup
Development capacity
You are back but you still need to maintain indexing logic
![Page 88: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/88.jpg)
Source: http://www.charliesheentshirts.info/
Application layer indexes are slow and brittle. The DB should be doing this, not us.
Sort of…
![Page 89: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/89.jpg)
a few months later…
![Page 90: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/90.jpg)
Cracks in the data model
huffingtonpost.com
huffingtonpost.com
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html
writes for
authored by
published under
writes for
authored by
published under
![Page 91: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/91.jpg)
Cracks in the data model
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
![Page 92: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/92.jpg)
Cracks in the data model
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
![Page 93: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/93.jpg)
Cracks in the data model
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
![Page 94: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/94.jpg)
Fixing the cracks in the data model
huffingtonpost.com
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html
writes for
authored by
published under
writes for
authored by
published under
![Page 95: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/95.jpg)
Fixing the cracks in the data model
Normalize the sites
![Page 96: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/96.jpg)
Fixing the cracks in the data model
• Normalization requires stronger secondary indexing
• Our application layer indexing would need revisiting… again!
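To make the indexing burden concrete: when the datastore only offers lookups by primary key, answering "which influencers write for site X" means the application has to maintain its own reverse mapping on every write. Below is a minimal sketch in plain JavaScript. The in-memory `Map` store and the names `saveInfluencer`, `siteIndex` and `findInfluencersBySite` are hypothetical stand-ins, not Traackr's actual code.

```javascript
// Hypothetical primary-key-only store: influencerId -> document.
const influencersById = new Map();
// Hand-maintained secondary index: siteId -> Set of influencerIds.
const siteIndex = new Map();

function saveInfluencer(doc) {
  const previous = influencersById.get(doc._id);
  // Remove stale index entries left over from the previous version.
  if (previous) {
    for (const siteId of previous.siteIds) {
      const ids = siteIndex.get(siteId);
      if (ids) {
        ids.delete(doc._id);
        if (ids.size === 0) siteIndex.delete(siteId);
      }
    }
  }
  influencersById.set(doc._id, doc);
  // Add index entries for the new version.
  for (const siteId of doc.siteIds) {
    if (!siteIndex.has(siteId)) siteIndex.set(siteId, new Set());
    siteIndex.get(siteId).add(doc._id);
  }
}

function findInfluencersBySite(siteId) {
  const ids = siteIndex.get(siteId) || new Set();
  return [...ids].map((id) => influencersById.get(id));
}

saveInfluencer({ _id: "arianna", siteIds: ["huffingtonpost.com"] });
saveInfluencer({ _id: "shaun", siteIds: ["huffingtonpost.com"] });
// Updating a document forces the index to be rewritten by hand as well.
saveInfluencer({ _id: "shaun", siteIds: [] });
console.log(findInfluencersBySite("huffingtonpost.com").map((d) => d._id));
```

Every write path has to go through code like `saveInfluencer`, and any bug in it silently corrupts lookups. That is the kind of code a datastore with built-in secondary indexes makes unnecessary.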
![Page 97: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/97.jpg)
What it means to a startup
Development capacity
Psych! You are back to writing indexing
code.
![Page 98: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/98.jpg)
Source: socialbutterflyclt.com
![Page 99: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/99.jpg)
Whatever, we’ll change our NoSQL!
Source: http://knowyourmeme.com/memes/honey-badger
![Page 100: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/100.jpg)
Traackr’s Datastore Requirements (Revisited)
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options (maybe)
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
![Page 101: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/101.jpg)
NoSQL picking – Round 2
Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent
Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families
Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema
Graph Databases
• Graph Theory G=(E,V)
• Great for modeling networks
• Great for graph-based query algorithms
![Page 102: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/102.jpg)
NoSQL picking – Round 2
Nope!
![Page 103: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/103.jpg)
NoSQL picking – Round 2
Graph Databases: we looked at Neo4j a bit more closely but passed again for the same reasons as before.
![Page 104: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/104.jpg)
NoSQL picking – Round 2
Memcache: still no
![Page 105: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/105.jpg)
NoSQL picking – Round 2
Amazon SimpleDB: still no.
![Page 106: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/106.jpg)
NoSQL picking – Round 2
Not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: still no
![Page 107: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/107.jpg)
NoSQL picking – Round 2
CouchDB: more mature but still no ad-hoc queries.
![Page 108: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/108.jpg)
NoSQL picking – Round 2
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
![Page 109: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/109.jpg)
NoSQL picking – Round 2
Riak: strong contender still but adoption questions remained.
![Page 110: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/110.jpg)
NoSQL picking – Round 2
MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
![Page 111: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/111.jpg)
Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
![Page 112: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/112.jpg)
What it means to a startup
Development capacity
Yay! I’m back!
![Page 113: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/113.jpg)
Immediate Benefits
• Single binary installation greatly simplifies
administration
![Page 114: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/114.jpg)
What it means to a startup
Development capacity
Honestly, I thought I’d never see you
guys again!
![Page 115: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/115.jpg)
Immediate Benefits
• Our NoSQL could now support our domain
model
![Page 116: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/116.jpg)
many-to-many relationship
![Page 117: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/117.jpg)
Modeling an influencer
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}
![Page 118: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/118.jpg)
Modeling an influencer
siteId indexed for “find influencers connected to site X”
> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
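The query on this slide matches a document when any element of its embedded `siteReferences` array carries the given `siteId`. The sketch below reproduces that matching rule in plain JavaScript over an in-memory array, with no database involved. The first sample document is abbreviated from the slide; the second document and the `findBySite` helper are made up for illustration.

```javascript
const influencers = [
  {
    _id: "770cf5c54492344ad5e45fb791ae5d52",
    realName: "David Chancogne",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
  {
    _id: "hypothetical-influencer", // made-up document for contrast
    realName: "Someone Else",
    siteReferences: [{ siteId: "some-other-site", contribution: 1.0 }],
  },
];

// Same matching rule as find({ "siteReferences.siteId": siteId }):
// a document matches if ANY embedded reference has that siteId.
function findBySite(docs, siteId) {
  return docs.filter((doc) =>
    (doc.siteReferences || []).some((ref) => ref.siteId === siteId)
  );
}

const matches = findBySite(influencers, "602dc370945d3b3480fff4f2a541227c");
console.log(matches.map((doc) => doc.realName));
```

The sketch scans every document. The index on `siteReferences.siteId` is what lets the database answer the same question without a full scan.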
![Page 119: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/119.jpg)
Modeling a site
Embedded list of influencer references augmented with “usernames” (useful for content attribution)
{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url": "http://traackr.com/blog/",
  "metrics": [
    { "name": "google_inbound_links", "value": 5432 }
  ],
  "authors": [
    { "username": "dchancogne", "influencerId": "770cf5c54492344ad5e45fb791ae5d52" },
    { "username": "gstathis", "influencerId": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}
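One way the embedded usernames could support content attribution is sketched below. The slides do not describe the actual attribution logic, so the rule used here (a post belongs to the author whose username appears as a path segment of the post URL) is an assumption. The `attributePost` helper and the post URLs are hypothetical too.

```javascript
const site = {
  _id: "0001e86f73cc3975a29e6a98a41a4280",
  url: "http://traackr.com/blog/",
  authors: [
    { username: "dchancogne", influencerId: "770cf5c54492344ad5e45fb791ae5d52" },
    { username: "gstathis", influencerId: "0001e86f73cc3975a29e6a98a41a4280" },
  ],
};

// Assumed rule: a post belongs to the author whose username is a path segment.
function attributePost(siteDoc, postUrl) {
  const segments = new URL(postUrl).pathname.split("/").filter(Boolean);
  const author = siteDoc.authors.find((a) => segments.includes(a.username));
  return author ? author.influencerId : null;
}

// Hypothetical post URLs, for illustration only.
const known = attributePost(site, "http://traackr.com/blog/gstathis/some-post.html");
const unknown = attributePost(site, "http://traackr.com/blog/unknown/some-post.html");
console.log(known, unknown);
```

Because the site document is stored once and lists all of its authors, a post can be attributed to a single influencer and no longer depends on which duplicated copy of the site was consulted.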
![Page 120: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/120.jpg)
Modeling a site
Indexed for “find sites associated to influencer X”
> db.sites.ensureIndex({"authors.influencerId": 1});
> db.sites.find({"authors.influencerId": "0001e86f73cc3975a29e6a98a41a4280"});
![Page 121: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/121.jpg)
Other index uses
Support for alternate site URLs (a.k.a. URL aliases):
{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url_hash_list": [
    { "url": "http://traackr.com/blog", "hash": "770cf5c54492344ad5e45fb791ae5d52" },
    { "url": "http://blog.traackr.com/", "hash": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}
Indexed for “find sites associated to
influencer X”
Index on MD5 hash of URL
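A sketch of how an alias lookup by MD5 hash might work, written for Node.js and run against an in-memory array in place of a collection. The helper names (`urlHash`, `makeSite`, `findSiteByUrl`), the `site-1`/`site-2` ids and the `example.com` URL are hypothetical. The hashes are computed by the sketch itself and are not the values shown on the slide.

```javascript
const crypto = require("crypto");

// MD5 hex digest of a URL string (32 hex characters).
function urlHash(url) {
  return crypto.createHash("md5").update(url, "utf8").digest("hex");
}

// Build a site document with one url_hash_list entry per alias.
function makeSite(id, urls) {
  return {
    _id: id,
    url_hash_list: urls.map((url) => ({ url, hash: urlHash(url) })),
  };
}

const sites = [
  makeSite("site-1", ["http://traackr.com/blog", "http://blog.traackr.com/"]),
  makeSite("site-2", ["http://example.com/"]),
];

// In-memory stand-in for find({ "url_hash_list.hash": hash }).
function findSiteByUrl(docs, url) {
  const hash = urlHash(url);
  return docs.find((doc) => doc.url_hash_list.some((e) => e.hash === hash)) || null;
}

console.log(findSiteByUrl(sites, "http://blog.traackr.com/")._id);
```

Hashing gives every alias a fixed-length key regardless of URL length, which is presumably why the index is on the hash and not on the raw URL.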
![Page 122: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/122.jpg)
Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
![Page 123: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/123.jpg)
Ad hoc report example
// File Name: retweetTotal.js
// Purpose: report the count of twitter URLs for which we have
// computed the number of total retweets
print( "NUMBER OF TWITTER URLS where retweetTotal IS SET:" );
print( db.sites.find( { platformName: "twitter.com", retweetTotal: { $exists: true } } ).count() );
• Easy to execute JS report script remotely:
> mongo <hostname>:<port>/traackr --quiet retweetTotal.js
• Run as a cron job, pipe the output to a file and email it out
• Also, more complex MR-based reports are easily accessible to someone with some JavaScript knowledge
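To give a feel for the shape of an MR-based report, here is a tiny in-memory emulation of the map/reduce flow in plain JavaScript. It is not the MongoDB API: in the database the map function reads the document from `this` and calls a global `emit`, whereas this sketch passes both in explicitly. The sample documents and the `mapReduce` helper are made up for illustration.

```javascript
// Tiny in-memory emulation of a map/reduce pass (not the MongoDB API).
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map(doc, emit));
  const result = {};
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return result;
}

// Hypothetical sample data.
const sampleSites = [
  { platformName: "twitter.com", retweetTotal: 12 },
  { platformName: "twitter.com" },
  { platformName: "wordpress.com" },
];

// Report: number of sites per platform.
const counts = mapReduce(
  sampleSites,
  (doc, emit) => emit(doc.platformName, 1),
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);
console.log(counts);
```

The map and reduce steps are ordinary JavaScript functions, which is what makes this style of report approachable for someone without Java or Hadoop experience.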
![Page 124: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/124.jpg)
Other Benefits (cont.)
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
![Page 125: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/125.jpg)
Same binary can be deployed several times for replication & backups
![Page 126: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/126.jpg)
Same binary can be deployed several times for replication & backups
Different Availability Zones for better SPOF
tolerance
![Page 127: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/127.jpg)
Same binary can be deployed several times for replication & backups
priority 0 for backup server so that it never
gets elected as primary
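A sketch of what such a replica set layout could look like as a configuration object. The field names (`_id`, `members`, `host`, `priority`) follow MongoDB's replica set configuration document. The set name, the host names and the `electableHosts` helper are placeholders invented for this example.

```javascript
// Hypothetical replica set layout; set name and host names are placeholders.
// In the mongo shell, an object of this shape is what rs.initiate() accepts.
const replicaSetConfig = {
  _id: "rs0",
  members: [
    { _id: 0, host: "db-a.example.internal:27017", priority: 1 },
    { _id: 1, host: "db-b.example.internal:27017", priority: 1 },
    // priority 0: keeps a full copy of the data but is never elected primary.
    { _id: 2, host: "db-backup.example.internal:27017", priority: 0 },
  ],
};

// Members that may become primary (priority defaults to 1 when omitted).
function electableHosts(config) {
  return config.members
    .filter((m) => (m.priority === undefined ? 1 : m.priority) > 0)
    .map((m) => m.host);
}

console.log(electableHosts(replicaSetConfig));
```

Keeping the backup member out of elections means it can be frozen for snapshots without any risk of it being the node that is serving writes.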
![Page 128: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/128.jpg)
Same binary can be deployed several times for replication & backups
Using xfs_freeze before taking backups
![Page 129: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/129.jpg)
Same binary can be deployed several times for replication & backups
EBS snapshots as backups are portable to new instances (e.g.
QA)
![Page 130: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/130.jpg)
Other Benefits (cont.)
• Great documentation
• Great adoption and community
![Page 131: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/131.jpg)
Mongo cursors for batch scoring
• Mongo is fast enough for our data size to
be able to serially score the DB faster
than the MapReduce jobs did in parallel.
• When we grow larger, MapReduce is still
available as an option
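Serial scoring with a cursor boils down to a single loop that pulls one document at a time and scores it. The sketch below uses a JavaScript generator as a stand-in for a database cursor. The scoring rule (sum of metric values), the `cursorOver` and `scoreAll` helpers and the sample documents are hypothetical and only illustrate the control flow.

```javascript
// Stand-in for a database cursor: yields one document at a time.
function* cursorOver(docs) {
  for (const doc of docs) yield doc;
}

// Hypothetical scoring rule, for illustration only.
function score(doc) {
  return doc.metrics.reduce((total, m) => total + m.value, 0);
}

// Walk the cursor serially; only one document is held at a time.
function scoreAll(cursor) {
  const scores = new Map();
  for (const doc of cursor) {
    scores.set(doc._id, score(doc));
  }
  return scores;
}

const sampleSites = [
  { _id: "a", metrics: [{ name: "google_inbound_links", value: 5432 }] },
  {
    _id: "b",
    metrics: [
      { name: "google_inbound_links", value: 10 },
      { name: "retweets", value: 5 },
    ],
  },
];

const scores = scoreAll(cursorOver(sampleSites));
console.log([...scores.entries()]);
```

A loop like this has no job scheduling or cluster coordination overhead, which is one plausible reason a serial pass can beat a parallel MapReduce job at modest data sizes.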
![Page 132: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/132.jpg)
looks like we found the right fit!
![Page 133: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/133.jpg)
We have more of this
Development capacity
![Page 134: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/134.jpg)
And less of this
Source: socialbutterflyclt.com
![Page 135: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/135.jpg)
Source: http://www.charliesheentshirts.info/
for now…
![Page 136: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/136.jpg)
Additional takeaways
• Fearless refactoring
• Ease of use and administration cannot be
overstated for a small startup
![Page 137: Finding the Right NoSQL DB for the Job](https://reader034.vdocuments.site/reader034/viewer/2022051111/554cf84cb4c905ae138b50fb/html5/thumbnails/137.jpg)
Q&A