Jervin Real, Michael Coburn, Percona
Preventing and Resolving MySQL Downtime
About Us
•Jervin Real, Technical Services Manager
• Engineer Engineering Engineers
• APAC
•Michael Coburn, Principal Technical Account Manager
• Responsible for managing technical relationships with Percona's highest-revenue customers
What is Downtime?
•When your Application is completely unavailable
•When your Application is in a degraded state
•Whenever your boss says so :)
Why Prevent Downtime?
•Your business loses money when the Application is down
•Your reputation, and your team's, suffers
Agenda
•Real world adventures
• Problems
• Solutions
• Prevention
•Putting them all together
I Had a Crash On You
I Had a Crash On You (1): Page Corruption
I Had a Crash On You (1): Page Corruption > About
•Bad disk sectors, neither monitored nor checked
•Page corruption at the disk level
•Server crashes when reading the page from disk
•Keeps crashing :(
I Had a Crash On You (1): Page Corruption > Solutions
•Percona Server, we tried:
• innodb_corrupt_table_action = salvage
•Worked!
•Dropped table, recreated - application back online
•Worst case:
• innodb_force_recovery > 0
• Data Recovery
I Had a Crash On You (2): Assertion > About
•Running 5.6.11, early adopter, InnoDB FULLTEXT
•Upgraded to 5.6.18, MySQL crashed
•Data was unusable - bug#72079
I Had a Crash On You (2): Assertion > Solutions
•Downgrade and restore from backup
•Re-execute upgrade to avoid the bug
I Had a Crash On You (1): Page Corruption > Prevention
•innodb_corrupt_table_action = salvage / warn
•pt-table-checksum
• Regularly reads through all your data; check the error log for errors afterwards
•RAID card health checks
• Can vary by vendor
•SMART checks
• Be vigilant for disk-level errors
Nobody’s Watching
Nobody’s Watching (1): Nobody Cared > About
•Percona XtraDB Cluster, 3 nodes
•A few months ago, node 3 went down due to a conflict, but nobody noticed
•A few hours ago, node 2 was killed by the OOM killer and the cluster lost quorum
•EVERYBODY NOTICED!
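The quorum loss above comes down to majority arithmetic. A minimal sketch, assuming every node carries the default weight of 1 and that the cluster had already shrunk its membership to two nodes after node 3 dropped out; `has_quorum` is a hypothetical helper for illustration, not part of Galera or Percona XtraDB Cluster.

```python
def has_quorum(reachable_nodes: int, previous_membership: int) -> bool:
    """Return True when the reachable nodes form a strict majority
    of the previously known membership (equal weights assumed)."""
    return 2 * reachable_nodes > previous_membership


# 3-node cluster, node 3 drops out: 2 of 3 is still a majority.
print(has_quorum(2, 3))  # True
# Membership is now 2; node 2 is killed abruptly: 1 of 2 is not a majority.
print(has_quorum(1, 2))  # False
```

Under these assumptions, the unnoticed loss of node 3 is what turned a single later failure into a full outage.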
Nobody’s Watching (1): Nobody Cared > Solutions
•Bootstrap remaining node
• SET GLOBAL wsrep_provider_options='pc.bootstrap=1';
•SST the 2nd and 3rd nodes
•Define wsrep_notify_cmd temporarily
•Implement better alerting
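A wsrep_notify_cmd script can be quite small. The sketch below assumes the server invokes the script with `--status`, `--uuid`, `--primary`, `--members` and `--index` arguments, which is how the Galera notification interface is commonly documented; verify this against your version. `parse_notification` and `should_alert` are hypothetical names, and the alert policy (anything other than Synced in a primary component) is deliberately noisy.

```python
import argparse


def parse_notification(argv):
    """Parse the arguments passed to the notification script.
    Unknown arguments are ignored so the script never fails on them."""
    parser = argparse.ArgumentParser()
    for name in ("--status", "--uuid", "--primary", "--members", "--index"):
        parser.add_argument(name)
    args, _unknown = parser.parse_known_args(argv)
    return args


def should_alert(args):
    """Alert when the node is not Synced or is outside the primary component."""
    if args.status is not None and args.status != "Synced":
        return True
    if args.primary is not None and args.primary.lower() != "yes":
        return True
    return False


# Example: a node reporting a non-primary component should raise an alert.
print(should_alert(parse_notification(["--primary", "no"])))  # True
```

In a real deployment the script would hand the decision to whatever alerting channel is already in place; keep it fast, since it runs on every state change.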
Nobody’s Watching (2): Dropped the Bomb > About
•New sysadmin received disk space alert
•du -hx --max-depth=1 /
•/var has lots of data
•find /var/ -size +5G -exec rm -rf {} \;
•Bam, ibdata1 gone!
•Restart maintenance occurred later in the day ...
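A safer first step than the find/rm one-liner above is to list the large files and look at them before deleting anything. A minimal dry-run sketch; `find_large_files` is a hypothetical helper, and unlike `find -x` style options it does not stop at filesystem boundaries.

```python
import os


def find_large_files(root, min_bytes):
    """Yield (path, size) for regular files under root that are at least
    min_bytes large. Nothing is deleted; this only reports."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # do not follow symlinks
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_bytes:
                yield path, size


# Usage: review the list by hand before removing anything.
# for path, size in find_large_files("/var", 5 * 1024**3):
#     print(size, path)
```

Seeing `/var/lib/mysql/ibdata1` in such a list is the cue to stop and ask the DBA.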
Nobody’s Watching (2): Dropped the Bomb > Solutions
•Restore from backup
•Really, they were lucky!
Nobody’s Watching: Prevention
•Percona Monitoring Plugins
• pmp-check-mysql-deleted-files
• pmp-check-mysql-status
• pmp-check-mysql-innodb
•Define a wsrep_notify_cmd script executable by the mysql user
• Triggered on node state changes
•Take backups, and alert on failure
•If data files were deleted, don't restart the server - the file handles are still open!
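The last point works because, on Linux, a file that was unlinked while a process still holds it open stays listed under `/proc/<pid>/fd` with a ` (deleted)` suffix on the link target. A minimal sketch of that check, assuming a Linux `/proc` filesystem; both function names are hypothetical and this is not how the Percona plugin itself is implemented.

```python
import os

DELETED_SUFFIX = " (deleted)"


def original_path_if_deleted(link_target):
    """Return the original path if the link target marks an unlinked file,
    otherwise None."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[: -len(DELETED_SUFFIX)]
    return None


def deleted_open_files(pid):
    """Map fd number -> original path for files the process still holds
    open although they were removed from the filesystem."""
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed while scanning
        path = original_path_if_deleted(target)
        if path is not None:
            found[int(fd)] = path
    return found
```

As long as mysqld keeps running, the file contents remain reachable through those descriptors; a restart closes them and the data is gone.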
Self Induced Pain
Self Induced Pain (1): Query Cache
•“Waiting for query cache lock”
root# ~> pt-sift /var/lib/pt-stalk/
...
--processlist--
State
226
90 Waiting for query cache lock
4 Sending data
4 Master has sent all binlog to slave; waiting for binlog to be updated
2 init
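The processlist summary above is a count of threads per State value. A minimal sketch of the same aggregation, assuming the State values have already been fetched (for example from SHOW PROCESSLIST); `summarize_states` is a hypothetical helper, not part of pt-sift.

```python
from collections import Counter


def summarize_states(states):
    """Count threads per State, most frequent first.
    NULL/None states are grouped with the empty string."""
    counts = Counter(state or "" for state in states)
    return counts.most_common()


sample = ["Waiting for query cache lock"] * 3 + ["Sending data", "", None]
for state, count in summarize_states(sample):
    print(count, state)
```

A single state dominating the non-empty rows, as "Waiting for query cache lock" does on the slide, points at one shared bottleneck rather than many slow queries.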
Self Induced Pain (1): Query Cache > About
● Global mutex
● Point of contention
● Especially on a hot dataset/table
● More so with a large query cache
Self Induced Pain (1): Query Cache > Solutions
● Set it to a small size to reduce performance overhead
● Disable it completely to avoid contention
● Hint offending queries to skip the query cache, e.g. SELECT SQL_NO_CACHE
Self Induced Pain (2): Buffer Pool Dump/Restore
● Dumps buffer pool page list to disk
● Reloads buffer pool based on this list at startup
● Meant to help speed up buffer pool warmup
Self Induced Pain (2): Buffer Pool Dump/Restore > About
● Maintenance restart, buffer pool dump and restore enabled
● Yay! Expecting everything to go well.
● 30 minutes in, performance still really bad, IO thrashing
● Large buffer pool, busy read/write workload
Self Induced Pain (2): Buffer Pool Dump/Restore > Solutions
● Extend your maintenance period to let the server warm up if possible; otherwise the warmup and production traffic will contend on IO
● RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool
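Some back-of-the-envelope arithmetic shows why the warmup took so long. The sketch assumes the default 16KB InnoDB page size, that the whole 240GB pool is reloaded, and a hypothetical 200 page reads per second for the disk pair; the real rate depends on the hardware and on how sequential the reload turns out to be, so treat the result as a rough worst case.

```python
PAGE_SIZE = 16 * 1024  # default InnoDB page size in bytes


def pages_in_pool(pool_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to fill a buffer pool of pool_bytes."""
    return pool_bytes // page_size


def warmup_seconds(pool_bytes, reads_per_second, page_size=PAGE_SIZE):
    """Time to reload every page at a given sustained read rate."""
    return pages_in_pool(pool_bytes, page_size) / reads_per_second


pool = 240 * 1024**3
print(pages_in_pool(pool))                # 15728640 pages
print(warmup_seconds(pool, 200) / 3600)   # roughly 21.8 hours at 200 reads/s (assumed rate)
```

Even if the real read rate is many times higher than the assumed one, the reload competes with production traffic for the same two disks.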
Self-Induced Pain Prevention
•Percona Toolkit
• pt-stalk
• pt-sift
• pt-kill
•Disable OOM killer
•Configure appropriate disk scheduler
•Check the error log for "Buffer pool load complete"
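The error log check in the last bullet can be automated before traffic is let back in. A minimal sketch; the exact wording of the completion message varies between versions, so the pattern below is an assumption that should be adjusted to what your server actually logs, and `buffer_pool_load_finished` is a hypothetical helper.

```python
import re

# Assumed pattern: matches "Buffer pool load complete" and the
# "Buffer pool(s) load completed ..." variant; adjust for your version.
LOAD_DONE = re.compile(r"Buffer pool(\(s\))? load complete", re.IGNORECASE)


def buffer_pool_load_finished(log_lines):
    """Return True once any log line reports the buffer pool load as done."""
    return any(LOAD_DONE.search(line) for line in log_lines)


# Usage (path is an assumption):
# with open("/var/log/mysql/error.log") as log:
#     print(buffer_pool_load_finished(log))
```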
MySQL, MySQL! What Have Suffereth Ye Thee?
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
•Slow queries
•Connections build up
•Slow response times
•Long running transactions
•Stop the World scenario
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
--innodb--
txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)
0 queries inside InnoDB, 0 queries in queue
Main thread: sleeping, pending reads 0, writes 28, flush 1
Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows
mysql tables in use 1, locked 1
80337 lock struct(s), heap size 8271400, 10979242 row lock(s)
MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data
SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > Solutions
•KILL long-running transactions
•pt-kill for persistent long-running transactions
•Deploy immediate code changes to disable the offending code
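The selection logic behind killing long runners can be sketched as a filter over processlist rows. This is a simplified illustration and not how pt-kill is implemented: it assumes rows shaped like SHOW PROCESSLIST output (`Id`, `User`, `Command`, `Time`), only considers threads that are running a query, and returns the statements instead of executing them. `kill_statements` is a hypothetical helper.

```python
def kill_statements(processlist, max_seconds, user=None):
    """Build KILL statements for queries running longer than max_seconds,
    optionally restricted to a single user."""
    statements = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # ignore sleeping connections, replication threads, etc.
        if row["Time"] is None or row["Time"] <= max_seconds:
            continue
        if user is not None and row["User"] != user:
            continue
        statements.append(f"KILL {int(row['Id'])};")
    return statements


rows = [
    {"Id": 671621, "User": "magento", "Command": "Query", "Time": 13779},
    {"Id": 5, "User": "app", "Command": "Sleep", "Time": 99999},
    {"Id": 6, "User": "app", "Command": "Query", "Time": 2},
]
print(kill_statements(rows, max_seconds=600))  # ['KILL 671621;']
```

Printing the statements first and reviewing them is the cautious way to start; an idle transaction holding locks would need a different filter than this one.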
MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > About
•MySQL is still responding
•Contention on all sorts of mutexes
• trx_sys->mutex
• block->lock
• lock_sys->mutex
• lock_sys->wait_mutex
•… and it is killing latency
•Service impact means lost income
MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > Solutions
•Set innodb_thread_concurrency > 0
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About
● “Opening tables”, “Closing tables”
--processlist--
State
578 Opening tables
32 closing tables
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About
● Contention on LOCK_open mutex
● Risk of negative scalability
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > Solutions
● Tune table_open_cache/table_definition_cache
● table_open_cache_instances (5.6+)
● Shard logically or horizontally, or run multiple MySQL instances, to reduce the number of objects per instance
MySQL, MySQL! What Have Suffereth Ye Thee? (2,3): Prevention
•pt-kill --log
•MySQL Server Configuration
a. Remember to tune innodb_thread_concurrency (default is 0)
b. table_open_cache + table_open_cache_instances
•Application Stack Configuration (Schema Design)
a. Single tenant per schema
b. Multiple tenants per schema (each table has client_id column)
c. All tenants in one schema
Wizard of OS (1): Disk Performance
•Disk performance problems cascade to MySQL and then to the application
Wizard of OS (1): Disk Performance > About
•Slow writes, binlogs, redo logs, syncs
•Transactions stalling on COMMIT, updating, inserting …
•Replication getting delayed if node is a slave
•Translates to latency
Wizard of OS (1): Disk Performance > Solutions
● RAID Controller in Write-Through
● Could also be a bad disk!
Wizard of OS (2): Swapping
● Swapping heavily, with significant amount of RAM free
Wizard of OS (2): Swapping > About
● Swapping induces significant amount of IO
● Swapping in and out of disk is mighty expensive
● Affects MySQL in magnificent ways
● Swap Insanity!
Wizard of OS (2): Swapping > Solutions
● NUMA Interleave
● Percona Server is NUMA configurable
○ numa_interleave
○ flush_caches
● Check numastat - perl check_numa.pl
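Swapping while plenty of RAM is free usually means memory is unevenly spread across NUMA nodes: one node is exhausted while another sits idle. A minimal sketch of that check, assuming the per-node free memory has already been collected (for example from numastat); the function name, the 0.5 threshold and the sample numbers are all hypothetical.

```python
def starved_numa_nodes(free_mb_by_node, threshold=0.5):
    """Return the nodes whose free memory is below threshold times the
    mean free memory across all nodes."""
    if not free_mb_by_node:
        return []
    mean_free = sum(free_mb_by_node.values()) / len(free_mb_by_node)
    return sorted(
        node for node, free in free_mb_by_node.items()
        if free < threshold * mean_free
    )


# Hypothetical figures: node 0 is nearly exhausted while node 1 has plenty free.
print(starved_numa_nodes({0: 100, 1: 30000}))  # [0]
```

Interleaving allocations across nodes aims to keep any single node from running dry in this way.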
Wizard of OS: Prevention
● Tune:
○ vm.swappiness
○ NUMA policy
○ disk scheduler
○ mount options appropriately (ext4, xfs)
■ (nobarrier, noatime)
● pt-heartbeat - monitor replication delay
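Heartbeat-based delay monitoring boils down to comparing a timestamp written on the master with the current time on the replica. A minimal sketch of that arithmetic, assuming both clocks are synchronized and the replicated timestamp is timezone-aware; `replication_delay_seconds` is a hypothetical helper and not pt-heartbeat's implementation.

```python
from datetime import datetime, timezone


def replication_delay_seconds(heartbeat_ts, now=None):
    """Seconds between the last replicated heartbeat and now.
    Negative values (clock skew) are clamped to zero."""
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (now - heartbeat_ts).total_seconds())


last_beat = datetime(2015, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
check_time = datetime(2015, 1, 1, 12, 0, 42, tzinfo=timezone.utc)
print(replication_delay_seconds(last_beat, check_time))  # 42.0
```

Unlike a status counter that only moves while replication is running, a heartbeat also exposes a replica that has stopped applying events altogether.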
Percona Server Features
•Enable InnoDB Buffer Pool warming
•Enable userstat for table & index statistics
•Enable verbose slow log
•Enable Query Response Time plugin
Thank You!
•Jervin Real [email protected]
• Technical Services Manager, APAC
•Michael Coburn [email protected]
• Principal Technical Account Manager, USA