Production MongoDB in the Cloud
DESCRIPTION
This talk was given by Bridget Kromhout and Mike Hobbs at MinneBar 8, April 6th 2013. We’ve been using MongoDB on EC2 for about a year now; our production deployment includes a 12-node cluster (4 shards x 3 replica sets) as well as several other non-sharded replica sets. (The sharded cluster used to be bigger, which, as it turns out, isn’t always better.) Join us to benefit from our lessons learned, discuss what works and what doesn’t, and marvel at oddities we’ve encountered.
TRANSCRIPT
Production MongoDB
in the Cloud
From Essentials to Corner Cases
Who are we?
Mike Hobbs & Bridget Kromhout
Social Commerce &
Brand Interest Graph Analytics
Why MongoDB?
● Scalable, high-performance, open source
● Dynamic schemas for unstructured data
● Query language close to SQL in power
● "Eventually consistent" is hard to program right
Our configuration
12-node cluster (4 shards x 3 replica sets)
Several other non-sharded replica sets
Desired webapp response time is < 10ms
Total data size: 110 GB
Total index size: 28 GB
Largest collection: 49 GB
Largest index: 8.1 GB
EC2: EBS, instance size, replication
MongoDB: right for only some data sets
Memory & iowait
Working set needs to fit in memory
● Indexes
● Frequently accessed records
Avoid swapping!!!
EBS latency in EC2 is an issue.
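The sizing rule above can be sketched as back-of-the-envelope arithmetic. This is our own illustration, not anything MongoDB reports: the "hot fraction" of the data set, the headroom factor, and the instance RAM figure are all assumptions.

```python
# Back-of-the-envelope working-set check: indexes plus frequently
# accessed records should fit in RAM with some headroom left over.
# The hot-data fraction and RAM figure below are illustrative guesses.

GB = 1024 ** 3

def working_set_fits(index_bytes, hot_data_bytes, ram_bytes, headroom=0.2):
    """True if indexes + hot data fit in RAM with `headroom` to spare."""
    return (index_bytes + hot_data_bytes) * (1 + headroom) <= ram_bytes

# Slide figures: 28 GB of indexes; guess that ~1/4 of the 110 GB data
# set is hot; assume an instance with 68 GB of RAM (e.g. EC2 m2.4xlarge).
fits = working_set_fits(28 * GB, 27.5 * GB, 68 * GB)  # True, barely
```

If the check fails, the options are the ones this talk covers: bigger instances, sharding, or shrinking indexes.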
Fragmentation
Fragmentation steals from your most precious resource by reserving memory that is not used.
Run a compaction when your storageSize significantly exceeds your data size:
mongos> db.widgets.stats()
...
"size" : 5097988,
"storageSize" : 22507520,
Padding can reduce fragmentation and I/O:
db.widgets.insert({widg_id: "72120", padding: "XXXX...XXX"})
db.widgets.update({widg_id: "72120"},
    { $unset: {padding: ""},
      $set: {desc: "Grout remover", price: "13.39", instock: true} })
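The same pre-padding trick from the application side, with the documents assembled as plain Python dicts (field names and the padding size are illustrative; with pymongo you would pass these to the driver's insert and update calls):

```python
# Pre-pad on insert so the record is allocated at roughly its final
# size, then atomically swap the padding for the real fields. This
# avoids a document move (and the fragmentation it causes) on update.

padding = "X" * 256  # size this to match the expected final document

insert_doc = {"widg_id": "72120", "padding": padding}
update_spec = {
    "$unset": {"padding": ""},
    "$set": {"desc": "Grout remover", "price": "13.39", "instock": True},
}
```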
Replica sets
"optime" : { "t" : 1365165841000 , "i" : 1 }, "optimeDate" : { "$date" : "Fri Apr 5 07:44:01 2013" },
[Diagram: replica set "test" with members test-3-1.yourdomain, test-3-2.yourdomain, test-3-3.yourdomain]
Elections
Primary always determined by an election.
2-member replSet without an arbiter: if the secondary goes offline, the primary will step down:
08:52:06 [rsMgr] can't see a majority of the set, relinquishing primary
08:52:06 [rsMgr] replSet relinquishing primary state
08:52:06 [rsMgr] replSet SECONDARY
08:52:12 [rsMgr] replSet can't see a majority, will not try to elect self
Priorities can rig elections.
Ensure availability of an odd number of voting members.
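The vote math behind the odd-number advice, as a small sketch (our own illustration of the quorum rule, not MongoDB API):

```python
# A primary needs a strict majority of voting members. Adding a second
# (or fourth) member raises the quorum without adding any tolerance for
# failures -- which is exactly the 2-member stepdown scenario above.

def majority(voters):
    return voters // 2 + 1

def tolerable_failures(voters):
    return voters - majority(voters)

# 2 voters: quorum is 2, so losing either member loses the primary.
# 3 voters (e.g. two data nodes + an arbiter): one member may fail.
```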
Manual primary changes
No "become primary now" command. Manual stepdowns with recusal timeout are best option.
test-1:PRIMARY> rs.stepDown(300)
Wed Apr 3 11:45:36 DBClientCursor::init call() failed
Wed Apr 3 11:45:36 query failed : admin.$cmd { replSetStepDown: 300.0 } to: 127.0.0.1:27017
Wed Apr 3 11:45:36 Error: error doing query: failed src/mongo/shell/collection.js:155
Wed Apr 3 11:45:36 trying reconnect to 127.0.0.1:27017
Wed Apr 3 11:45:36 reconnect 127.0.0.1:27017 ok
test-1:SECONDARY>
This triggers an election.
(Obviously, make sure your preferred candidate(s) can win.)
States: down (initializing), startup2, secondary, primary
replSet back to standalone? No.
Test server: replica set of 1, shard of 1. Removed --replSet, but the shard configuration needed a manual update:
db.shards.update({host:"testreplset/test.domain.net"}, {$set:{host:"test.domain.net"}})
updatedExisting
Values no longer returned by mongos, but visible when connected to mongod:
> db.schedule.update({_id:...}, {$set:{lock:true}}, false, true); db.runCommand("getlasterror")
{ "updatedExisting" : true, "n" : 1, "connectionId" : 73, "err" : null, "ok" : 1 }
Solution: re-adding --replSet to the mongod startup line and reverting shard configs. (Bug open with 10gen.)
Sharding
Can increase parallelization of CPU & I/O
Carefully choose a shard key (nontrivial to change)
Must run config servers & mongos
Doesn't ensure high availability
Doesn't help if you're already out of memory
256GB collection max for initial sharding
Rebalancing data across shards
Queries block while servers negotiate final hand-off.
Updating indexes after hand-off can be slow.
Best run off-peak:
mongos> use config
switched to db config
mongos> db.settings.find()
{ "_id" : "balancer", "activeWindow" : { "start" : "23:00", "stop" : "6:00" } }
Mongos & replSet primary changes
Application-level errors talking to mongos after an election:
pymongo.errors.AutoReconnect: could not connect to localhost:27020: [Errno 111] Connection refused
pymongo.errors.OperationFailure: database error: error querying server
Mongos errors talking to mongod on original primary:
Tue Apr 2 09:01:05 [conn3288] Socket say send() errno:110 Connection timed out 10.141.131.214:27017
Tue Apr 2 09:01:05 [conn3288] DBException in process: socket exception [SEND_ERROR] for 10.141.131.214:27017
Connection pool checked lazily; invalid connections can persist for days, depending on load. Can clear manually:
mongos> db.adminCommand({connPoolSync:1});
{ "ok" : 1 }
mongos>
Failure handling
Applications must handle fail-over outages:
AutoReconnect & OperationFailure in pymongo
def auto_reconnect(func, *args, **kwargs):
    """ Executes func, retrying on AutoReconnect or OperationFailure """
    for _ in range(100):
        try:
            return func(*args, **kwargs)
        except pymongo.errors.AutoReconnect:
            pass
        except pymongo.errors.OperationFailure:
            pass
        time.sleep(0.1)
    raise TimeoutError()
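The same retry pattern generalized so it can be exercised without a live mongod; `retry_on` is our name, not pymongo's, and in production you would pass `(pymongo.errors.AutoReconnect, pymongo.errors.OperationFailure)` as the exception tuple:

```python
import time

def retry_on(exceptions, func, *args, retries=100, delay=0.1, **kwargs):
    """Call func, retrying on the given transient exception types."""
    for _ in range(retries):
        try:
            return func(*args, **kwargs)
        except exceptions:
            time.sleep(delay)
    raise TimeoutError("gave up after %d attempts" % retries)

# Usage with a flaky callable that fails twice, then succeeds:
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ValueError("transient")
    return "ok"

result = retry_on((ValueError,), flaky, delay=0)  # "ok" on the 3rd try
```

Parameterizing on the exception tuple keeps the retry policy in one place instead of scattered across every database call.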
MMS (MongoDB Monitoring Service)
● free; hosted by 10gen
● need to run agent locally
● 10gen's commercial support relies on MMS
Profiling queries [1]
Finding bad queries that are actively running:
$ mongo | tee mongo.log
> db.currentOp()
...
bye
$ grep numYields mongo.log
"numYields" : 0,"numYields" : 62247,"numYields" : 0,...
# Use your favorite viewer to find the op with 62247 yields
Helpful to get the server back to a responsive state:
$ mongo
> db.killOp(10883898)
Profiling queries [2]
Using nscanned to find queries that likely aren't using indexes:
$ grep -P 'nscanned:\d\d' /var/log/mongodb.log
... or in real-time:
$ tail -f /var/log/mongodb.log | grep -P 'nscanned:\d\d'
MongoDB also provides the setProfilingLevel() command, which can log all queries to the system.profile collection.
> db.system.profile.find({nscanned:{$gte:10}})
system.profile does incur some performance overhead, though.
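The grep-based nscanned hunt above can also be scripted, e.g. to feed an alerting tool. A minimal sketch, assuming the 2.2/2.4-era mongod slow-query log line format (adjust the regex for other versions):

```python
import re

# Pull high-nscanned query lines out of mongod's log, rather than
# matching on digit count as the grep one-liner does.

NSCANNED = re.compile(r"nscanned:(\d+)")

def slow_scan_lines(lines, threshold=10):
    """Yield (nscanned, line) for log lines that scanned >= threshold docs."""
    for line in lines:
        m = NSCANNED.search(line)
        if m and int(m.group(1)) >= threshold:
            yield int(m.group(1)), line

sample = [
    "Tue Apr  2 09:00:01 [conn12] query test.widgets nscanned:3 nreturned:3 2ms",
    "Tue Apr  2 09:00:02 [conn13] query test.widgets nscanned:62247 nreturned:1 840ms",
]
hits = list(slow_scan_lines(sample))  # only the 62247-scan line
```

Unlike setProfilingLevel(), this reads the log the server is already writing, so it adds no observer overhead.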
Nagios
● plugin uses pymongo
● set up service groups
Ideas for the future
● Better reconnect handling in applications
● Lose the EBS? Ephemeral disk faster; rely on replication to keep data persistent.
● Intelligent use of mongo profiling (reduce observer effect of setProfilingLevel)
● Use more MMS alerts
● Going to 2.4.x (fast counts, hashed sharding)