MongoDB at MapMyFitness from a DevOps Perspective
TRANSCRIPT
MongoDB at MMF From a DevOps Perspective
Jan 24, 2013
Introduction
• MapMyFitness was founded in 2007
• Offices in Denver, CO & Austin, TX (w/ associates in SF, Boston, New York, LA, and Chicago)
• Over 13 million registered users
• ~80 million geo-data routes (runs, rides, walks, hikes, etc)
• Core sites, mobile apps, API, white-label (MapMyRun, MapMyRide, MapMyFitness)
MMF Platform Overview
• Python (django) & PHP (legacy API)
• Although MySQL is the core backing db for Django, the majority of MMF data lives in various MongoDB datastores.
• Routes datastore has ~120 million objects, currently 7TB+ of data (3-member replica set backed by an EMC SAN, 48GB RAM each)
• Django sessions converted to using MongoDB (functional scaling example, 600M sessions stored)
• Live Tracking system utilizes elastic replica set membership to handle load scaling for events
• Granular API access/error logging via JSON to MongoDB
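The "granular API access/error logging via JSON" bullet can be sketched as a small helper that builds the per-request document before it is inserted into a logging collection. The field names here are hypothetical illustrations, not MMF's actual schema:

```python
import json
from datetime import datetime, timezone

def api_log_entry(method, path, status, duration_ms, user_id=None):
    """Build a JSON-serializable access-log document for insertion into MongoDB."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),  # ISO timestamp for the request
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
        "user_id": user_id,
    }

entry = api_log_entry("GET", "/v7/route/123", 200, 42, user_id=987)
print(json.dumps(entry))
```

Inserting one document per request keeps the log queryable with the same tools as the rest of the data (e.g. aggregate error rates per endpoint).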
Route & Elevation data example (Lost on the way to MongoSeattle)
Implementation Patterns
• Standard Datastore - 3-member replica set (small to medium implementations)
• Big Data implementation - sharded cluster (TB+)
• Buffering Layer - high memory (load all data and index files into RAM)
• Write Heavy - utilize sharding to optimize for writes
• Read Heavy - 3+n replica set configuration for rapid read scaling (up to 12 nodes)
Implementation Patterns
• In the cloud, tune the instance type to the mongo implementation
• On iron, plan carefully and dedicate servers completely to mongo to avoid memory-map contention
• For DR, spin up a delayed, hidden replica node (preferably in a different datacenter)
• Aggregation framework can be used in myriad ways, including bridging the gap to SQL data warehousing via ETL.
• Automate install patterns for rapid development, prototyping, and infrastructure scaling.
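The delayed, hidden DR replica mentioned above comes down to a few member-level settings in the replica set config. A sketch as a plain config document (host name and delay value are illustrative; `slaveDelay` was the option name in MongoDB of this era, later renamed `secondaryDelaySecs`):

```python
# Replica-set member settings for a delayed, hidden DR node.
dr_member = {
    "_id": 3,
    "host": "dr-dc2-db01:27017",  # hypothetical host in a second datacenter
    "priority": 0,                # never eligible to become primary
    "hidden": True,               # invisible to client read routing
    "slaveDelay": 3600,           # stays one hour behind the primary
}
```

The one-hour delay gives a window to recover from operator error (e.g. an accidental drop) that instant replication would faithfully propagate.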
Operational Automation (example of automated MongoDB install via Puppet)
Replica Set Expansion
• MongoDB is “replication made elegant”
• Ridiculously simple to add additional members
• Be sure to run the initial sync from a secondary!
rs.add({ host: "livetrack_db09:27017" })
• Both rs.add() and rs.remove() can be scripted and connected to monitoring systems for autoscaling
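Scripting membership changes amounts to manipulating the replica set config document and bumping its version. A minimal sketch, assuming a monitoring trigger hands us the current config and a new host name (the resulting document is what would be passed to a reconfig):

```python
def add_member(config, host):
    """Return a new replica-set config with `host` appended and the
    version bumped, suitable for a replSetReconfig-style call."""
    new = {
        "_id": config["_id"],
        "version": config["version"] + 1,       # reconfig requires a higher version
        "members": list(config["members"]),      # copy so the input stays untouched
    }
    next_id = max(m["_id"] for m in new["members"]) + 1
    new["members"].append({"_id": next_id, "host": host})
    return new

cfg = {"_id": "livetrack", "version": 7,
       "members": [{"_id": 0, "host": "livetrack_db01:27017"},
                   {"_id": 1, "host": "livetrack_db02:27017"}]}
cfg2 = add_member(cfg, "livetrack_db09:27017")
```

Removal is the mirror image: filter the member out and bump the version again.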
Monitoring and Introspection
• MMS, 10gen's cloud-based monitoring service (best available)
• Supported by Zabbix, Nagios, Munin, Server Density, etc
• mongostat, mongotop, REST interface, database profiler
• Monitoring system triggers can initiate node additions, removals, service restarts, etc
• In addition to service-level monitoring, use more advanced tests to check for and alert on query latency spikes
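One simple form of the latency-spike test described above: compare the current measurement against a mean-plus-k-sigma baseline built from recent samples. A sketch (the threshold multiplier is an illustrative choice, not a recommendation from the talk):

```python
import statistics

def latency_spike(samples_ms, current_ms, k=3.0):
    """Flag `current_ms` as a spike if it exceeds the mean of recent
    samples by more than k population standard deviations."""
    mean = statistics.mean(samples_ms)
    stdev = statistics.pstdev(samples_ms)
    return current_ms > mean + k * stdev
```

Wired into a monitoring check, a True result would fire the alert (or trigger one of the automated actions above).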
10gen's MMS (the one-stop shop for MongoDB metrics)
Mongo in Zabbix (Mikoomi Plugins: http://code.google.com/p/mikoomi)
mongostat (very useful for real-time troubleshooting)
Operational Automation (example of automated MongoDB restart action)
Security Considerations
• MongoDB provides authentication support and basic permissions
• Auth is turned off by default to allow for optimal performance
• Always run databases in a trusted network environment
• Lock down host-based firewalls to limit access to required clients
• Automate iptables with Puppet or Chef; in EC2 use security groups
Network Security Automation
## Puppet pattern for MongoDB network security
class iptables::public {
  iptables::add_rule { '001 MongoDB established':
    rule => '-A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT',
  }
  iptables::add_rule { '002 MongoDB':
    rule => '-A RH-Firewall-1-INPUT -i eth1 -p tcp -m tcp --dport 27017 -j ACCEPT',
  }
  iptables::add_rule { '003 MongoDB MMF Phase II Network':
    rule => '-A RH-Firewall-1-INPUT -i eth0 -s 172.16.16.0/20 -p tcp -m tcp --dport 27017 -j ACCEPT',
  }
  iptables::add_rule { '004 MongoDB MMF Cloud Network':
    rule => '-A RH-Firewall-1-INPUT -i eth0 -s 10.178.52.0/24 -p tcp -m tcp --dport 27017 -j ACCEPT',
  }
}
Security Considerations
• Use the rule of least privilege to allow access to environments
• Data sensitivity should determine the extent of security measures
• For non-sensitive data, good network security can be sufficient
• In open environments, be sure experience matches access level
• Lack of granular permissions allows for full admin access, use discretion
Maintenance
• Far less maintenance required than traditional RDBMS systems
• Regularly perform query profile analysis and index auditing
• Rebuild databases to reclaim space lost to fragmentation
• Automate checks of log files for known red flags
• Regularly review data throughput rate, storage growth rate, and overall business growth graphs to inform capacity planning.
• For HA testing, periodically step down the primary to force failover
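The "automate checks of log files for known red flags" item can be sketched as a simple pattern scan over mongod log lines; the patterns below are illustrative examples, not an exhaustive list of what the talk's checks looked for:

```python
import re

# Hypothetical red-flag patterns for mongod log lines.
RED_FLAGS = [
    re.compile(p) for p in (
        r"assertion",            # server assertion messages
        r"too many open files",  # file-descriptor ulimit exhaustion
        r"replSet error",        # replication trouble (e.g. stale member)
    )
]

def scan_log(lines):
    """Return the log lines that match any known red-flag pattern."""
    return [ln for ln in lines if any(p.search(ln) for p in RED_FLAGS)]
```

Run from cron (or a monitoring agent) against the tail of the log, a non-empty result becomes an alert.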
Indexing Patterns or “Know Your App”
• Proper indexing is critical to performance at scale (monitor slow queries to catch non-performant requests)
• MongoDB is ultimately flexible, being schemaless (mongo gives you enough rope to hang yourself, choose wisely)
• Avoid un-indexed queries at all costs (it's the quickest way to crater your app... consider --notablescan)
• Onus on DevOps to match application to indexes (know your query profile, never assume)
• Shoot for 'covered queries' wherever possible (answer can be obtained from indexes only)
Capped Collections
• Use standard capped collections for retaining a fixed amount of data. Uses a FIFO strategy for pruning. (based on data size, not number of rows)
• TTL collections (2.2) age out data based on a retention-time configuration. (great for data retention requirements of all types)
Gotcha!
Explicitly create the capped collection before any data is put into the system; otherwise auto-creation will produce a normal, uncapped collection
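A sketch of that explicit-creation step, expressed as the `create` database command document (collection name and sizes are illustrative):

```python
# The "create" command an explicit capped-collection setup would issue.
# Running it up front avoids lazy auto-creation, which would yield a
# normal, uncapped collection.
create_cmd = {
    "create": "live_positions",      # hypothetical collection name
    "capped": True,
    "size": 512 * 1024 * 1024,       # cap by bytes: 512 MB
    "max": 1_000_000,                # optional: also cap document count
}
```

Once full, the oldest documents are overwritten FIFO, matching the retention behavior described above.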
Lessons Learned
• Mongo 2.2 upgrade containing a capped collection created in 1.8.4. This severely impacted replication (root cause: no "_id" index; fix: add "_id" index)
• Never start mongo when a mount point is missing or incorrectly configured. Mongo may decide to take matters into its own hands and resync itself with the replica set. Make sure your devops and your hosting provider admins are aware of this
• Some drivers that use connection pooling can freak the freaky freak when the primary member changes (older pymongo). Kicking the application can fix it; also: upgrade drivers
• High locked % is a big red flag, and can be caused by a large number of simultaneous DML actions (high insert rate, high update rate). Consider this in the design phase.
• Be wary of automation that can change the state of a node during maintenance mode. Disable automation agents for reduced risk during critical administrative operations (filesystem maint, etc)
Thank you!