Webinar: Best Practices for MongoDB on AWS
DESCRIPTION
In this session we will look at best practices for administering large MongoDB deployments in the cloud. We will discuss tips and tools for capacity planning, fully scripted provisioning using chef and knife-ec2, and snapshotting your data safely, as well as using replica sets for high availability across AZs. We will cover the good, the bad, and the ugly of disk performance options on EC2, as well as several filesystem tricks for wringing more performance out of your block devices. And finally we will talk about some ways to prevent Mongo disaster spirals and minimize your downtime. This session is appropriate for anyone who already has experience administering MongoDB. Some experience with AWS or cloud computing is useful, but not required.
TRANSCRIPT
Charity Majors, @mipsytipsy
Topics:
• Replica sets
• Resources and capacity planning
• Provisioning with chef
• Snapshotting
• Scaling tips
• Monitoring
• Disaster mitigation
Replica sets
• Always use replica sets
• Distribute across Availability Zones
• Avoid situations where you have an even number of voters
• 50% is not a majority!
• More votes are better than fewer (max is 7)
• Add an arbiter for more flexibility
• Always explicitly set the priority of your nodes. Surprise elections are terrible.
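The even-voter warning is simple arithmetic: a strict majority is floor(n/2) + 1 votes, so with four voters a clean two-AZ split leaves neither side able to elect a primary. A minimal sketch of the math:

```shell
# A strict majority is floor(n/2) + 1 votes; with an even voter count a
# symmetric network split leaves both halves short of a majority.
voters=4
majority=$(( voters / 2 + 1 ))
echo "need $majority of $voters votes; a 2/2 split elects nobody"
```

With five or seven voters, any split leaves one side with a majority.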
Basic sane replica set config
• Each node has one vote (default)
• Snapshot node does not serve read queries, cannot become master
• This configuration can survive any single node or Availability Zone outage
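One way to express such a layout in the mongo shell; the set name and hostnames are placeholders. Three voting members spread across three AZs: two data nodes with explicit priorities, plus a hidden, priority-0 snapshot node that votes but can never be elected and is invisible to clients.

```shell
# Hypothetical set name and hostnames; run once against the first node.
mongo --host mongo1.example.com <<'EOF'
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.example.com:27017", priority: 2 },
    { _id: 1, host: "mongo2.example.com:27017", priority: 1 },
    // snapshot node: still votes, but priority 0 and hidden
    { _id: 2, host: "snap1.example.com:27017", priority: 0, hidden: true }
  ]
})
EOF
```

Three voters means any single node or AZ outage still leaves a 2-of-3 majority.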
Or manage votes with arbiters
• Three separate arbiter processes on each AZ arbiter node, one per cluster
• Maximum of seven votes per replica set
• Now you can survive all secondaries dying, or an AZ outage
• If you have even one healthy node, you can continue to serve traffic
• Arbiters tend to be more reliable than nodes because they have less to do.
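A sketch of the arbiter-node layout described above, with hypothetical ports, paths, and set names: three arbiter mongods on one AZ's arbiter box, each belonging to a different replica set.

```shell
# One arbiter process per replica set on this box (ports/paths hypothetical).
for i in 0 1 2; do
  port=$(( 27021 + i ))
  mkdir -p "/data/arb$i"
  mongod --replSet "rs$i" --port "$port" --dbpath "/data/arb$i" \
         --nojournal --smallfiles --fork \
         --logpath "/var/log/mongodb/arb$i.log"
done

# Then, from each set's primary:
mongo --eval 'rs.addArb("arbiter1.example.com:27021")'
```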
Provisioning
• Memory is your primary constraint, spend your money there
• Especially for read-heavy workloads
• Your working set should fit into RAM
• Lots of page faults means it doesn’t fit
• 2.4 has a working set estimator in db.serverStatus!
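The 2.4 estimator is `db.serverStatus({workingSet: 1})`, which reports `pagesInMemory` in 4KB pages. A back-of-the-envelope fit check, with made-up numbers:

```shell
# Hypothetical numbers: pagesInMemory from serverStatus().workingSet,
# measured in 4KB pages, compared against the box's RAM.
pages_in_memory=12000000
working_set_gb=$(( pages_in_memory * 4 / 1024 / 1024 ))
ram_gb=68
if [ "$working_set_gb" -lt "$ram_gb" ]; then
  echo "working set ~${working_set_gb}GB fits in ${ram_gb}GB RAM"
else
  echo "working set ~${working_set_gb}GB does NOT fit; expect page faults"
fi
```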
• Your snapshot host can usually be smaller, if cost is a concern
Disk options
• EBS -- just kidding, EBS is not an option
• EBS with Provisioned IOPS
• Ephemeral storage
• SSD
EBS classic
EBS with PIOPS:
• Guaranteed # of IOPS, up to 2000/volume
• Variability of <0.1%
• Raid together multiple volumes for higher performance
• Supports EBS snapshots
• Costs 2x regular EBS
• Can only attach to certain instance types
PIOPS
• When you exceed your PIOPS limit, your disk stops for a few seconds. Avoid this.
Estimating PIOPS
• Estimate how many IOPS to provision with the “tps” column of sar -d 1
• Multiply that by 2-3x depending on your spikiness
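Turning the sar numbers into a provisioned figure is just peak tps times headroom. A sketch, with made-up tps samples:

```shell
# Made-up samples of the "tps" column from `sar -d 1`; provision for the
# observed peak with 2-3x headroom so a spike never hits the PIOPS ceiling.
peak_tps=$(printf '%s\n' 420 610 380 550 | sort -n | tail -1)
piops=$(( peak_tps * 3 ))
echo "peak ${peak_tps} tps -> provision ~${piops} IOPS"
```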
Ephemeral storage
• Cheap
• Fast
• No network latency
• You can snapshot with LVM + S3
• Data is lost forever if you stop or resize the instance
• Can use EBS on your snapshot node to take advantage of EBS tools
• Makes restore a little more complicated
Filesystem
• Use ext4
• Raise file descriptor limits (cat /proc/<mongo pid>/limits to verify)
• If you’re using Ubuntu, use Upstart
• Set your blockdev --setra to something sane, or you won’t use all your RAM
• If you’re using mdadm, make sure your md device and its volumes have a small enough block size
• RAID 10 is the safest and best-performing, RAID 0 is fine if you understand the risks
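The read-ahead and block-size tuning above might look like this; device names are hypothetical. MongoDB does small random reads, so a large read-ahead fills the page cache with data you never touch.

```shell
# Shrink read-ahead on the md device (32 sectors = 16KB); hypothetical device.
blockdev --setra 32 /dev/md0
blockdev --getra /dev/md0       # verify the setting took

# When building the array, keep the chunk size small too (values in KB):
mdadm --create /dev/md0 --level=10 --chunk=64 --raid-devices=4 \
      /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
```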
Chef everything
• Role attributes for backup volumes, cluster names
• Nodes are effectively disposable
• Provision and attach EBS RAID arrays via AWS cookbook
• Delete volumes and AWS attributes, run chef-client to re-provision
• Restore from snapshot automatically with our backup scripts
Our mongo cookbook and backup scripts: https://github.com/ParsePlatform/Ops/
Bringing up a new node from the most recent mongo snapshot is as simple as this:
It’s faster for us to re-provision a node from scratch than to repair a RAID array or fix most problems.
Each replica set has its own role, where it sets the cluster name, the snapshot host name, and the EBS volumes to snapshot.
When you provision a new node for this role, mongodb::raid_data will build it off the most recent completed set of snapshots for the volumes specified in backups => mongo_volumes.
Snapshots
• Snapshot often
• Set snapshot node to priority = 0, hidden = 1
• Lock Mongo OR stop mongod during snapshot
• Snapshot all RAID volumes
• We use ec2-consistent-snapshot: http://eric.lubow.org/2011/databases/mongodb/ec2-consistent-snapshot-with-mongo/, with a wrapper script for chef to generate the backup volume ids
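The lock-then-snapshot sequence, sketched with the plain AWS CLI for illustration (the volume id is a placeholder; ec2-consistent-snapshot can do the locking for you, and RAID arrays need all member volumes snapshotted together):

```shell
# Flush writes to disk and block further writes on the snapshot node.
mongo --eval 'db.fsyncLock()'

# Snapshot the (hypothetical) data volume while writes are blocked.
aws ec2 create-snapshot --volume-id vol-0123abcd \
    --description "mongo snapshot $(date +%F)"

# Unlock as soon as the snapshot has been initiated.
mongo --eval 'db.fsyncUnlock()'
```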
Warming a secondary
• Always warm up a snapshot before promoting
• Warm up both indexes and data
• Use dd or vmtouch to load files from S3
• Scan for most commonly used collections on primary, read those into memory on secondary
• Read collections into memory
• Natural sort
• Full table scan
• Search for something that doesn’t exist
http://blog.parse.com/2013/03/07/techniques-for-warming-up-mongodb/
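A minimal file-level warm-up in the spirit of the dd/vmtouch bullet: sequentially read every file under the dbpath so the OS page cache is hot before promotion. The default path is hypothetical; pass yours as the first argument.

```shell
# Pull the data files into the page cache with sequential reads.
dbpath=${1:-/var/lib/mongodb}    # hypothetical default dbpath
find "$dbpath" -type f 2>/dev/null | while read -r f; do
  cat "$f" > /dev/null 2>&1      # sequential read pulls pages into cache
done
echo "warmed $dbpath"
```

This only warms files; still touch the hot collections and indexes with natural-order scans and searches for nonexistent values, per the bullets above.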
Fragmentation
• Your RAM gets fragmented too!
• Leads to underuse of memory
• Deletes are not the only source of fragmentation
• db.<collection>.stats to find the padding factor (between 1 and 2; the higher, the more fragmentation)
• Repair, compact, or reslave regularly (db.printReplicationInfo() to get the length of your oplog to see if repair is a viable option)
Compaction: before and after
Compaction
• We recommend running a continuous compaction script on your snapshot host
• Every time you provision a new host, it will be freshly compacted.
• Plan to rotate in a compacted primary regularly (quarterly, yearly depending on rate of decay)
• If you also delete a lot of collections, you may need to periodically run db.repairDatabase() on each db
http://blog.parse.com/2013/03/26/always-be-compacting/
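A continuous compaction pass could be sketched like this, iterating every database and collection from the mongo shell. This is a sketch, not the Parse script; compact blocks the member, which is exactly why it belongs on the hidden snapshot node and nowhere else.

```shell
# Run ONLY on the hidden snapshot node: compact blocks the member.
mongo --quiet <<'EOF'
db.getMongo().getDBNames().forEach(function (name) {
  if (name === "local" || name === "admin") return;
  var d = db.getSiblingDB(name);
  d.getCollectionNames().forEach(function (coll) {
    if (coll.indexOf("system.") === 0) return;   // skip system collections
    printjson(d.runCommand({ compact: coll }));
  });
});
EOF
```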
Scaling strategies
• Horizontal scaling
• Query optimization, index optimization
• Throw money at it (hardware)
• Upgrade to > 2.2 to get rid of global lock
• Read from secondaries
• Put the journal on a different volume
• Repair, compact, or reslave
Monitoring
• MMS
• Ganglia + Nagios
• correlate graphs with local metrics like disk i/o
• graph your own index ops
• graph your own aggregate lock percentages
• alert on replication lag, replication error
• alert if the primary changes, connection limit
• Use chef! Generate all your monitoring from roles
fun with MMS
opcounters are color-coded by op type!
big bgflush spike means there was an EBS event
lots of page faults means reading lots of cold data into memory from disk
lock percentage is your single best gauge of fragility.
so ... what can go wrong?
• Your queues are rising and queries are piling up
• Everything seems to be getting vaguely slower
• Your secondaries are in a crash loop
• You run out of available connections
• You can’t elect a primary
• You have an AWS or EBS outage or degradation
• You have terrible latency spikes
• Replication stops
• Know what your healthy cluster looks like
• Don’t switch your primary or restart when overloaded
• Do kill queries before the tipping point
• Write your kill script before you need it
• Read your mongodb.log. Enable profiling!
• Check db.currentOp():
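A kill script in the spirit of the bullet above might look like this; the 20-second threshold and the namespace filter are assumptions to tune for your workload, well before you need it.

```shell
# Kill user queries running longer than 20s (threshold is an assumption).
# Skips local/admin namespaces so replication and admin ops survive.
mongo --quiet --eval '
  db.currentOp().inprog.forEach(function (op) {
    if (op.op === "query" && op.secs_running > 20 &&
        op.ns && op.ns.indexOf("local.") !== 0 &&
        op.ns.indexOf("admin.") !== 0) {
      print("killing op " + op.opid + " on " + op.ns);
      db.killOp(op.opid);
    }
  });
'
```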
... when queries pile up ...
• Check to see if you’re building any indexes
• Check queries with a high numYields
• Check for long-running queries
• Use explain() on them, check for full table scans
• Sort by number of queries/write locks per namespace
... everything getting slower ...
... terrible latency spikes ...
• Is your RAID array degraded?
• Do you need to compact your collections or databases?
• Are you having EBS problems? Check bgflush
• Are you reaching your PIOPS limit?
• Are you snapshotting while serving traffic?
mongodb.log is your friend.
... AWS or EBS outage ...
• Full outages are often less painful than degradation
• Take down the degraded nodes
• Stop mongodb to close all connections
• Hopefully you have balanced across AZs and are coasting
• If you are down and can’t elect a primary, bring up a new node with the same hostname and port as a downed node
Charity Majors, @mipsytipsy
that’s all folks :)