Preventing and Resolving MySQL Downtime

Jervin Real, Michael Coburn (Percona)

Upload: jervin-real

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Preventing and Resolving MySQL Downtime

Preventing and Resolving MySQL Downtime

Jervin Real, Michael Coburn

Percona

Page 2: Preventing and Resolving MySQL Downtime

About Us

• Jervin Real, Technical Services Manager
  • Engineer Engineering Engineers
  • APAC
• Michael Coburn, Principal Technical Account Manager
  • Responsible for managing the technical relationship with Percona's highest revenue customers

Page 3: Preventing and Resolving MySQL Downtime

What is Downtime?

• When your Application is completely unavailable
• When your Application is in a degraded state
• Whenever your boss says so :)

Page 4: Preventing and Resolving MySQL Downtime

Why Prevent Downtime?

• Your business loses money when the Application is down
• Your reputation, and your team's, suffers

Page 5: Preventing and Resolving MySQL Downtime

Agenda

• Real world adventures
  • Problems
  • Solutions
  • Prevention
• Putting them all together

Page 6: Preventing and Resolving MySQL Downtime

I Had a Crash On You


Page 7: Preventing and Resolving MySQL Downtime

I Had a Crash On You (1): Page Corruption


Page 8: Preventing and Resolving MySQL Downtime

I Had a Crash On You (1): Page Corruption > About

• Bad disk sectors, neither monitored nor checked
• Page corruption at the disk level
• Server crashes when reading the corrupted page from disk
• Keeps crashing :(

Page 9: Preventing and Resolving MySQL Downtime

I Had a Crash On You (1): Page Corruption > Solutions

• Percona Server, we tried:
  • innodb_corrupt_table_action = salvage
• Worked!
• Dropped table, recreated - application back online
• Worst case:
  • innodb_force_recovery > 0
  • Data Recovery

Page 10: Preventing and Resolving MySQL Downtime

I Had a Crash On You (2): Assertion > About

• Running 5.6.11, early adopter of InnoDB FULLTEXT
• Upgraded to 5.6.18, MySQL crashed
• Data was unusable - bug#72079

Page 11: Preventing and Resolving MySQL Downtime

I Had a Crash On You (2): Assertion > Solutions

• Downgrade and restore from backup
• Re-execute upgrade to avoid the bug

Page 12: Preventing and Resolving MySQL Downtime

I Had a Crash On You (1): Page Corruption > Prevention

• innodb_corrupt_table_action = salvage / warn
• pt-table-checksum
  • Regularly read through all your data and check the error log for errors
• RAID card health checks
  • Can vary by vendor
• SMART checks
  • Be vigilant for disk level errors
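
The "check the error log" step above can be sketched in Python. This is only an illustration: the patterns are assumptions picked for the example, not an official list of InnoDB messages, and `find_corruption_lines` is a hypothetical helper name.

```python
import re

# Assumed, non-exhaustive markers. Adjust them to the messages your
# server version actually writes to its error log.
CORRUPTION_PATTERNS = [r"corrupt", r"assertion failure"]


def find_corruption_lines(lines, patterns=CORRUPTION_PATTERNS):
    """Return (line_number, text) pairs for log lines matching any pattern."""
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]
```

It accepts any iterable of lines, so an open file handle on the file that log_error points to can be passed in directly from a cron job that alerts when the result is not empty.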

Page 13: Preventing and Resolving MySQL Downtime

Nobody’s Watching


Page 14: Preventing and Resolving MySQL Downtime

Nobody’s Watching (1): Nobody Cared > About

• Percona XtraDB Cluster, 3 nodes
• A few months ago, node 3 went down due to a conflict, but nobody noticed
• A few hours ago, node 2 was killed by the OOM killer and the cluster lost quorum
• EVERYBODY NOTICED!

Page 15: Preventing and Resolving MySQL Downtime

Nobody’s Watching (1): Nobody Cared > Solutions

• Bootstrap the remaining node
  • SET GLOBAL wsrep_provider_options='pc.bootstrap=1';
• SST the second and third nodes
• Define wsrep_notify_cmd temporarily
• Implement better alerting
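
A script plugged into wsrep_notify_cmd could decide when to raise an alert along these lines. This is a sketch under assumptions: the argument names (--status, --uuid, --primary, --members, --index) are what Galera passes as far as I recall, so verify them against your version; `expected_members=3` mirrors the three-node cluster in this story; `parse_args` and `alert_reasons` are hypothetical names, and delivering the alert (mail, pager, chat) is left out.

```python
import argparse


def parse_args(argv):
    """Parse the arguments the cluster passes to the notification script."""
    parser = argparse.ArgumentParser()
    for name in ("--status", "--uuid", "--primary", "--members", "--index"):
        parser.add_argument(name, default="")
    # Tolerate arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args


def alert_reasons(args, expected_members=3):
    """Return human-readable reasons to alert; an empty list means all is well."""
    reasons = []
    if args.primary == "no":
        reasons.append("node is in a non-primary component")
    if args.status and args.status != "Synced":
        reasons.append(f"node status is {args.status}")
    if args.members:
        count = len([m for m in args.members.split(",") if m.strip()])
        if count < expected_members:
            reasons.append(f"cluster has {count} of {expected_members} members")
    return reasons
```

With a check like this in place, the loss of node 3 would have produced a "2 of 3 members" alert months before the second node failed.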

Page 16: Preventing and Resolving MySQL Downtime

Nobody’s Watching (2): Dropped the Bomb > About

• New sysadmin received a disk space alert
• du -hx --max-depth=1 /
• /var has lots of data
• find /var/ -size +5G -exec rm -rf {} \;
• Bam, ibdata1 gone!
• Restart maintenance occurred later in the day ...

Page 17: Preventing and Resolving MySQL Downtime

Nobody’s Watching (2): Dropped the Bomb > Solutions

• Restore from backup
• Really, they were lucky!

Page 18: Preventing and Resolving MySQL Downtime

Nobody’s Watching: Prevention

• Percona Monitoring Plugins
  • pmp-check-mysql-deleted-files
  • pmp-check-mysql-status
  • pmp-check-mysql-innodb
• Define a script executable by the mysql user
  • Triggered on node state changes
• Take backups, and alert on failure
• Don't restart the server - file handles are still open!
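
The "file handles are still open" point can be illustrated with a small Python sketch that lists files a process still holds open after they were unlinked. It relies on the Linux /proc layout, where such descriptors show up as symlinks whose target ends in " (deleted)". This is not how the Percona plugin is implemented; `deleted_open_files` and its `proc_root` parameter are hypothetical names for this example.

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_open_files(pid, proc_root="/proc"):
    """Return (fd_path, original_path) pairs for unlinked files `pid` still holds open."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    found = []
    for name in sorted(os.listdir(fd_dir)):
        fd_path = os.path.join(fd_dir, name)
        try:
            target = os.readlink(fd_path)
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.endswith(DELETED_SUFFIX):
            found.append((fd_path, target[: -len(DELETED_SUFFIX)]))
    return found
```

As long as mysqld keeps running, the data behind each returned `fd_path` is still reachable, which is why a restart is the point of no return.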

Page 19: Preventing and Resolving MySQL Downtime

Self Induced Pain


Page 20: Preventing and Resolving MySQL Downtime

Self Induced Pain (1): Query Cache

• “Waiting for query cache lock”

root# ~> pt-sift /var/lib/pt-stalk/
...
--processlist--
  State
    226
     90 Waiting for query cache lock
      4 Sending data
      4 Master has sent all binlog to slave; waiting for binlog to be updated
      2 init
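
The processlist summary above is a count of threads per State value. A minimal Python sketch of the same tally, assuming the rows come from SHOW PROCESSLIST as dictionaries with a "State" key (`tally_states` is a hypothetical helper, not part of pt-sift):

```python
from collections import Counter


def tally_states(rows):
    """Count processlist rows per State, most frequent first."""
    counts = Counter((row.get("State") or "").strip() for row in rows)
    return counts.most_common()
```

Rows with an empty or NULL State are grouped under the empty string, which corresponds to the count with no label in the output above.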

Page 21: Preventing and Resolving MySQL Downtime

Self Induced Pain (1): Query Cache > About

● Global mutex
● Point of contention
● Especially on hot dataset/table
● More so, with large QC

Page 22: Preventing and Resolving MySQL Downtime

Self Induced Pain (1): Query Cache > Solutions

● Set it to a small size to reduce performance overhead
● Disable it completely to avoid contention
● Hint offending queries to skip the query cache, e.g. SELECT SQL_NO_CACHE

Page 23: Preventing and Resolving MySQL Downtime

Self Induced Pain (2): Buffer Pool Dump/Restore

● Dumps buffer pool page list to disk
● Reloads buffer pool based on this list at startup
● Meant to help speed up buffer pool warmup

Page 24: Preventing and Resolving MySQL Downtime

Self Induced Pain (2): Buffer Pool Dump/Restore > About

● Maintenance restart, buffer pool dump and restore enabled
● Yay! Expecting everything to go well.
● 30 minutes in, performance was still really bad: IO thrashing
● Large buffer pool, busy read/write workload

Page 25: Preventing and Resolving MySQL Downtime

Self Induced Pain (2): Buffer Pool Dump/Restore > Solutions

● If possible, extend your maintenance period to let the server warm up; otherwise the buffer pool load and the application traffic will contend for IO
● RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool

Page 26: Preventing and Resolving MySQL Downtime

Self-Induced Pain Prevention

• Percona Toolkit
  • pt-stalk
  • pt-sift
  • pt-kill
• Disable OOM killer
• Configure appropriate disk scheduler
• Check the error log for "Buffer pool load complete"
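
The last check can be automated before traffic is let back in. A minimal Python sketch, assuming the completion message looks like the one quoted above; the exact wording differs between versions (some write "Buffer pool(s) load completed at ..."), so the pattern accepts both forms. `buffer_pool_load_finished` is a hypothetical helper name.

```python
import re

# Assumed message shapes: "Buffer pool load complete..." and
# "Buffer pool(s) load completed at ...". Verify against your error log.
LOAD_DONE = re.compile(r"Buffer pool(\(s\))? load complete", re.IGNORECASE)


def buffer_pool_load_finished(lines):
    """Return True once any error log line reports the load as finished."""
    return any(LOAD_DONE.search(line) for line in lines)
```

The Innodb_buffer_pool_load_status status variable reports the same progress from inside the server and may be easier to poll than the log file.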

Page 27: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee?


Page 28: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

• Slow queries
• Connections build up
• Slow response times
• Long running transactions
• Stop the World scenario

Page 29: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

--innodb--
txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)
0 queries inside InnoDB, 0 queries in queue
Main thread: sleeping, pending reads 0, writes 28, flush 1
Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191

Page 30: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows
mysql tables in use 1, locked 1
80337 lock struct(s), heap size 8271400, 10979242 row lock(s)
MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data
SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056

Page 31: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > Solutions

• KILL long running transactions
• pt-kill for persistent long running transactions
• Deploy immediate code changes to disable the offending code
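
Picking the transactions to KILL can be sketched in Python. This is not how pt-kill works internally; it assumes rows fetched from information_schema.INNODB_TRX as dictionaries with `trx_started` (a datetime) and `trx_mysql_thread_id`, and the 600 second threshold and the name `kill_statements` are arbitrary choices for this example.

```python
from datetime import datetime, timedelta


def kill_statements(trx_rows, now, max_age_seconds=600):
    """Build KILL statements for transactions open longer than the threshold."""
    limit = timedelta(seconds=max_age_seconds)
    statements = []
    for row in trx_rows:
        thread_id = int(row["trx_mysql_thread_id"])
        if thread_id == 0:
            continue  # no client connection attached, nothing to KILL
        if now - row["trx_started"] > limit:
            statements.append(f"KILL {thread_id};")
    return statements
```

Generating the statements rather than executing them leaves room for a human to review the list, which matters when the long-running transaction turns out to be a legitimate batch job.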

Page 32: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > About

• MySQL is still responding
• All sorts of mutexes
  • trx_sys->mutex
  • block->lock
  • lock_sys->mutex
  • lock_sys->wait_mutex
• … and is killing latency
• Service impact means lost income

Page 33: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > Solutions

• innodb_thread_concurrency > 0

Page 34: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About

● “Opening tables”, “Closing tables”

--processlist--
  State
    578 Opening tables
     32 closing tables

Page 35: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About

● Contention on LOCK_open mutex
● Risk of negative scalability

Page 36: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > Solutions

● Tune table_open_cache/table_definition_cache
● table_open_cache_instances (5.6+)
● Shard logically or horizontally, or run multiple MySQL instances to reduce the number of objects per instance

Page 37: Preventing and Resolving MySQL Downtime

MySQL, MySQL! What Have Suffereth Ye Thee? (2,3): Prevention

• pt-kill --log
• MySQL Server Configuration
  a. Remember to tune innodb_thread_concurrency (default is 0)
  b. table_open_cache + table_open_cache_instances
• Application Stack Configuration (Schema Design)
  a. Single tenant per schema
  b. Multiple tenants per schema (each table has client_id column)
  c. All tenants in one schema
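
Whether the table cache needs tuning can be judged from how often lookups miss it. A Python sketch, assuming two samples of SHOW GLOBAL STATUS taken some time apart and passed in as dictionaries; it uses the Table_open_cache_hits and Table_open_cache_misses counters available in 5.6 and later, and `table_cache_miss_ratio` is a hypothetical helper name.

```python
def table_cache_miss_ratio(before, after):
    """Fraction of table cache lookups that missed between two status samples."""
    hits = int(after["Table_open_cache_hits"]) - int(before["Table_open_cache_hits"])
    misses = int(after["Table_open_cache_misses"]) - int(before["Table_open_cache_misses"])
    total = hits + misses
    return misses / total if total else 0.0
```

Comparing two samples instead of reading the raw counters avoids averaging over the whole uptime, so a ratio that climbs during peak hours stands out.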

Page 38: Preventing and Resolving MySQL Downtime

Wizard of OS (1): Disk Performance

• Disk performance cascading to MySQL, then to the application

Page 39: Preventing and Resolving MySQL Downtime

Wizard of OS (1): Disk Performance > About

• Slow writes, binlogs, redo logs, syncs
• Transactions stalling on COMMIT, updating, inserting …
• Replication getting delayed if the node is a slave
• Translates to latency

Page 40: Preventing and Resolving MySQL Downtime

Wizard of OS (1): Disk Performance > Solutions

● RAID controller in write-through mode
● Could also be a bad disk!

Page 41: Preventing and Resolving MySQL Downtime

Wizard of OS (2): Swapping

● Swapping heavily, with a significant amount of RAM free

Page 42: Preventing and Resolving MySQL Downtime

Wizard of OS (2): Swapping > About

● Swapping induces a significant amount of IO
● Swapping in and out of disk is mighty expensive
● Affects MySQL in magnificent ways
● Swap Insanity!

Page 43: Preventing and Resolving MySQL Downtime

Wizard of OS (2): Swapping > Solutions

● NUMA Interleave
● Percona Server is NUMA configurable
  ○ numa_interleave
  ○ flush_caches
● Check numastat - perl check_numa.pl

Page 44: Preventing and Resolving MySQL Downtime

Wizard of OS: Prevention

● Tune:
  ○ vm.swappiness
  ○ NUMA policy
  ○ disk scheduler
  ○ mount options appropriately (ext4, xfs)
    ■ (nobarrier, noatime)
● pt-heartbeat - monitor replication delay
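
Swapping while plenty of RAM is free usually shows up as one NUMA node running out of memory while another sits idle. A Python sketch that computes the free fraction per node, assuming text in the format of Linux's /sys/devices/system/node/node*/meminfo files ("Node 0 MemFree: ... kB"); `node_free_fraction` is a hypothetical helper and not a port of check_numa.pl.

```python
import re

# Matches lines such as "Node 0 MemTotal:       16777216 kB".
MEMINFO_LINE = re.compile(r"^Node\s+(\d+)\s+(\w+):\s+(\d+)\s*kB", re.MULTILINE)


def node_free_fraction(meminfo_text):
    """Map each NUMA node number to MemFree / MemTotal."""
    values = {}
    for node, key, kilobytes in MEMINFO_LINE.findall(meminfo_text):
        values.setdefault(int(node), {})[key] = int(kilobytes)
    return {
        node: fields["MemFree"] / fields["MemTotal"]
        for node, fields in values.items()
        if fields.get("MemTotal") and "MemFree" in fields
    }
```

A large gap between nodes, for example one node nearly full while another is half empty, is the pattern that interleaving is meant to prevent.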

Page 45: Preventing and Resolving MySQL Downtime

Percona Server Features

• Enable InnoDB Buffer Pool warming
• Enable userstat for table & index statistics
• Enable verbose slow log
• Enable Query Response Time plugin

Page 46: Preventing and Resolving MySQL Downtime

Thank You!

• Jervin Real
  • Technical Services Manager, APAC
• Michael Coburn
  • Principal Technical Account Manager, USA