PERCONA LIVE EUROPE: AMSTERDAM
OCTOBER 5, 2016
Jeremy Tinley, Senior MySQL Operations Engineer
Twitter: @techwolf359
SSDs at Etsy: A War Story
Why Are We Here?
What is this talk about?
• Evolution of sharded databases at Etsy
• What problems we faced along the way
• How SSDs made everything better

What is this talk NOT about?
• Cloud, Serverless, DevOps, Containers
• A deep dive into how SSDs work (hint: it's magic)

What will there definitely be?
• Hardware Specs, Vendors and Models
• Slides Online After Presentation
• Cat Pictures
MySQL Architecture at Etsy
Three Main Databases
• Shards: All User Generated Data
• Tickets: Globally Unique IDs
• Index: ID to Shard Mapping, Convenience Data
Active-Active Reads+Writes
• id % 2: odd goes to A, even goes to B (see the sketch below)
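A minimal sketch of that routing rule, assuming the only input needed is the object's ID (the function name is hypothetical, not Etsy's code):

def shard_side(obj_id):
    # Active-active A/B rule from above: odd IDs go to side A,
    # even IDs go to side B; each side replicates to the other.
    return "A" if obj_id % 2 == 1 else "B"

assert shard_side(12345) == "A"
assert shard_side(12346) == "B"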
MySQL Architecture at Etsy
Data Lifecycle (sketched below)
• Fetch a new unique ID from tickets
• Pick a shard location and write mapping to index
• Write user data to shards
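A toy, in-memory sketch of that lifecycle, using dicts as stand-ins for the three databases (the real systems are MySQL clusters; every name here is hypothetical):

import itertools

tickets = itertools.count(1)   # tickets: globally unique ID sequence
index = {}                     # index: id -> shard mapping
shards = {0: {}, 1: {}}        # shards: all user-generated data

def pick_shard():
    # Simplistic placement: the least-loaded shard wins.
    return min(shards, key=lambda s: len(shards[s]))

def create_listing(data):
    listing_id = next(tickets)        # 1. fetch a new unique ID from tickets
    shard = pick_shard()              # 2. pick a shard location...
    index[listing_id] = shard         #    ...and write the mapping to index
    shards[shard][listing_id] = data  # 3. write user data to shards
    return listing_id

print(create_listing({"title": "hand-knit cat sweater"}))  # -> 1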
All production databases are physical hosts in a data center
• No containers
• No virtualization
Shards v1
Shards v1 - Architecture
Hardware
• (60) HP G8 / 96GB / 160GB x16 RAID-10 (1.1TB)
Logical Layout
• Active-Active / Master-Master Replication
• 1 database on 1 MySQL instance per server
• MySQL 5.1 -> 5.5
Shards v1 - Problems
Problem: Consistently Running Out of Disk Capacity
• User-generated data was growing fairly linearly
• Data generated *about* users grew faster
• Ended with 30 pairs of servers
Problem: Migration of Data Was Painful (sketched below)
• Row-by-row migration of data
• Set a migration lock on index to stop writes
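A toy sketch of that v1 flow, assuming shards are dicts of {user_id: rows} and the index holds both the mapping and a per-user migration lock (all names hypothetical):

index = {"shard": {}, "locked": set()}

def migrate_user(user_id, src, dst, dst_name):
    index["locked"].add(user_id)             # migration lock: stop writes
    try:
        for row in src.pop(user_id, []):     # row-by-row copy (the slow part)
            dst.setdefault(user_id, []).append(row)
        index["shard"][user_id] = dst_name   # flip the id -> shard mapping
    finally:
        index["locked"].discard(user_id)     # re-enable writes

shard_a, shard_b = {42: ["row1", "row2"]}, {}
index["shard"][42] = "a"
migrate_user(42, shard_a, shard_b, "b")
print(index["shard"][42], shard_b)           # -> b {42: ['row1', 'row2']}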
Shards v2
Shards v2 - Architecture
Hardware
• (60) Dell R720 / 128GB / 320GB x16 RAID-10 (2.2TB)
• (60) HP G8 / 96GB / 160GB x16 RAID-10 (1.1TB)
Logical Layout
• Active-Active / Master-Master Replication
• 1.1TB: 10 databases on 1 MySQL instance per server
• 2.2TB: 22 databases on 1 MySQL instance per server
• MySQL 5.5
Shards v2 - Architecture
Problem Solved: Disk Capacity
• Was 60TB, now 180TB
• Triple the capacity for only double the server footprint

Problem Solved: Migration Complexity
• 960 database “buckets” (30 HP pairs x 10 dbs + 30 Dell pairs x 22 dbs)
• Expand by relocating a database onto another host
Shards v2 - Problems
Problem: Data Redundancy
• Starting with 60a+60b physical servers
• Adding 60 4-hour delayed replicas
• Adding 60 offsite replicas
• Faced with 240 servers
Problem: Running on Half Our Servers Every Week
• Schema change process: pull side A, apply on A, put A back in, repeat on B (sketched below)
• A double server failure is unlikely, but why risk it?
• Adding another realtime replica to A+B == 6 copies of data
• Faced with 360 servers!
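A toy sketch of that weekly loop; the ShardPair class is a hypothetical stand-in for the real orchestration tooling:

class ShardPair:
    def __init__(self, name):
        self.name = name
        self.in_rotation = {"A": True, "B": True}

    def pull(self, side):
        self.in_rotation[side] = False       # all traffic hits the other side

    def apply_ddl(self, side, ddl):
        print(f"{self.name}/{side}: {ddl}")  # ALTER runs with no live traffic

    def restore(self, side):
        self.in_rotation[side] = True        # half-capacity window ends

def rolling_schema_change(pairs, ddl):
    for side in ("A", "B"):                  # one full pass per side
        for pair in pairs:
            pair.pull(side)
            pair.apply_ddl(side, ddl)
            pair.restore(side)

rolling_schema_change([ShardPair("shard001")],
                      "ALTER TABLE listings ADD COLUMN note TEXT")

While a side is pulled, its partner carries all traffic, which is why the fleet spent part of every week running on half its servers.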
Shards v2 - Problems
Problem: 360 Servers Is Too Many
• DBA staff of 2
• Automation exists, but is not well evolved
• Cost-inefficient in terms of power, data center space, and time
• Maintenance is very time-consuming (patching, upgrades, firmware)

Problem: Warranty Expiration
• Half of production expiring within 12 months
Shards v3
Shards v3 - Non-Master Replicas First
Hardware
• (13) Dell R630 / 384GB / 960GB 12+12 RAID-6 SSD (two 12-disk arrays, 19.2TB)
• (13) Dell R630 / 384GB / 960GB x10 RAID-6 SSD (7.6TB)

Logical Layout
• 19.2TB: Multi-Instance per Server for realtime, delayed replicas (see the sketch below)
• 7.6TB: Multi-Instance per Server for offsite replicas
• MySQL 5.5
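"Multi-Instance per Server" means several independent mysqld processes on one host. A minimal sketch of one way to wire that up with MySQL's stock mysqld_multi tool; the ports and paths are hypothetical, not Etsy's actual layout:

# /etc/my.cnf (excerpt)
[mysqld_multi]
mysqld     = /usr/bin/mysqld_safe
mysqladmin = /usr/bin/mysqladmin

[mysqld1]
port    = 3306
socket  = /var/lib/mysql1/mysql.sock
datadir = /var/lib/mysql1

[mysqld2]
port    = 3307
socket  = /var/lib/mysql2/mysql.sock
datadir = /var/lib/mysql2

# start both instances: mysqld_multi start 1,2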
Shards v3 - Non-Master Replicas First
Problems Solved: Data Redundancy, Running on Half the Servers, 360 Is Too Many
• 26 servers doing the work of 240 servers
• 1U instead of 2U chassis
• Tested running a master on a consolidated server: it worked!

That confidence made us think: why not start replacing everything with SSDs?
Shards v3 - Hardware Issues
Upgrading Index
• Replaced with hardware similar to the previous servers
• Ran for less than 24 hours before it crashed
• Multiple disk failures: losing 3 drives in RAID-6 kills the array

Time to Go to Dell
• Replaced with Intel 800GB (3610)
• Problem solved!
Shards v3 - Hardware Issues
Consolidated Servers Started Crashing
• SSD vendor was LITEON
• Issue with garbage collection and controller timeouts
• A firmware upgrade was supposed to fix it, but didn't
• Continued to have drives kicked out of the array
• Also hit over-utilization/write-endurance problems on the SSDs
• Replaced with Samsung 960GB (PM863), which have a higher write endurance
• Both problems solved!
Shards v3 - Planning
Slow Down, Re-evaluate
• What is our goal?
• How can we avoid more nightmares?

Goal Was Server Density
• How much can you fit into a single server?
• How do we keep capacity expansion easy?
Shards v3 - Planning
Wrote a Document Detailing the Project
• Start with a Problem Statement: “We Have Too Many Servers”
• Key Wins:
  • Schema Change Speed Faster on SSDs
  • Power Utilization
  • Data Center Space Reduction
• Detailed Technical Implementation
  • “…but will it scale?”
  • How do splits work?
• Deployment Plan
• Risks and Unknowns

Circulated the Document Widely
Shards v3 - Architecture
Hardware
• (30) Dell R630 / 512GB / 800GB 12+12 RAID-6 (two 12-disk arrays, 15TB)

Logical Layout
• Active-Active / Master-Master Replication
• (20) 22 databases x 3 instances per server [66 dbs]
• (10) 10 databases x 6 instances per server [60 dbs]
• MySQL 5.5
Shards v3 - Architecture
Problem Solved: 360 Is Too Many
• Originally 120 servers, projected to grow to 360
• Now we only have 56!
• Started with 60TB, then 180TB, now 450TB
Shards v3 - Graphs - Site Performance
Site Performance During a Schema Change is Bad!
• We Pull Side A
• Side B Receives Side A's Traffic but is Cold
• I/O Wait Jumps
• PHP Response Time Gets Much Slower
• 15-30 Minutes for Warm-Up (see the pre-warm sketch below)
SSDs Solve This!
• Random Reads are Faster
• Swinging A to B Still Incurs Buffer Pool Churn
• But I/O is No Longer a Bottleneck
• Site Performance Stays Steady
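In the spinning-disk era, one common mitigation (not shown on the slide, and unnecessary once SSDs removed the I/O bottleneck) was to pre-warm the buffer pool before returning the cold side to rotation. A minimal sketch assuming the MySQLdb driver and a hypothetical hot-table list:

import MySQLdb

HOT_TABLES = ["listings", "favorites", "orders"]  # hypothetical hot set

def prewarm(host):
    conn = MySQLdb.connect(host=host, user="warmup", db="etsy_shard")
    cur = conn.cursor()
    for table in HOT_TABLES:
        # A full scan drags the table's pages into the InnoDB buffer pool.
        cur.execute("SELECT COUNT(*) FROM " + table)
        print(table, cur.fetchone()[0])
    conn.close()

prewarm("db-shard-001a")  # hypothetical hostname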
Shards v3 - Graphs - CPU Utilization
How do 2 years of CPU evolution stack up?
• Pretty amazing, actually
• A single 10-database instance runs at 10% CPU
• Six 10-database instances run at 15% CPU
• Only a 50% relative CPU increase for 6x the density
Shards v3 - Graphs - Query Performance
At 3-6x density, how will this impact query latency?
• Old hardware averaged 707µs; new hardware is 359µs!
Shards v3 - Graphs - Other Wins
What kind of wins do we see from reducing hardware counts so significantly?
• 24k watts of power down to 8k watts
• Apparently it takes a lot of power to keep disks spinning

Backup Times Improved
• New servers have 10gbit NICs
• Shuffled the backup servers around to eliminate port congestion
• Went from a 150MB throttle to no throttle
• Backups went from 9 hours to 1 hour!

Management of Servers Greatly Improved
• Upgraded to MySQL 5.6 in a week
• The top-level masters took only 2 days
Lessons Learned
1. Planning Gives You Confidence
2. Team Smart vs You Smart
3. Estimating Scaling Can Be Tricky
4. Learn How to Performance Test Disks
5. Don’t Fear Large Change
6. Monitor Write Endurance (see the sketch below)
7. Graph Disk Performance
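For lessons 6 and 7, a minimal sketch of polling SMART wear data so it can be graphed over time. It assumes smartctl is installed; the attribute names vary by vendor (Media_Wearout_Indicator on Intel, Wear_Leveling_Count on Samsung), so verify them against your drive model:

import subprocess

WEAR_ATTRS = {"Media_Wearout_Indicator", "Wear_Leveling_Count"}

def ssd_wear(device):
    # smartctl -A prints the drive's SMART attribute table.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) > 3 and fields[1] in WEAR_ATTRS:
            return int(fields[3])  # normalized value, counts down from ~100
    return None

print(ssd_wear("/dev/sda"))  # emit to your metrics pipeline per host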