Techniques for Managing Huge Data LISA10



DESCRIPTION

Slides from the USENIX LISA10 Tutorial on Techniques for Managing Huge Data

TRANSCRIPT

Page 1: Techniques for Managing Huge Data LISA10

Techniques for Handling Huge Storage

[email protected]

USENIX LISA'10 Conference, November 8, 2010

Page 2: Techniques for Managing Huge Data LISA10

Agenda
- How did we get here?
- When good data goes bad
- Capacity, planning, and design
- What comes next?

Note: this tutorial uses live demos, slides not so much

Page 3: Techniques for Managing Huge Data LISA10

History

Page 4: Techniques for Managing Huge Data LISA10

Milestones in Tape Evolution
- 1951 - magnetic tape for data storage
- 1964 - 9-track tape
- 1972 - Quarter Inch Cartridge (QIC)
- 1977 - Commodore Datasette
- 1984 - IBM 3480
- 1989 - DDS/DAT
- 1995 - IBM 3590
- 2000 - T9940
- 2000 - LTO
- 2006 - T10000
- 2008 - TS1130

Page 5: Techniques for Managing Huge Data LISA10

Milestones in Disk Evolution
- 1954 - hard disk invented
- 1950s - solid state disk invented
- 1981 - Shugart Associates System Interface (SASI)
- 1984 - Personal Computer Advanced Technology (PC/AT) Attachment, later shortened to ATA
- 1986 - "Small" Computer System Interface (SCSI)
- 1986 - Integrated Drive Electronics (IDE)
- 1994 - EIDE
- 1994 - Fibre Channel (FC)
- 1995 - flash-based SSDs
- 2001 - Serial ATA (SATA)
- 2005 - Serial Attached SCSI (SAS)

Page 6: Techniques for Managing Huge Data LISA10

Architectural Changes
- Simple, parallel interfaces
- Serial interfaces
- Aggregated serial interfaces

Page 7: Techniques for Managing Huge Data LISA10

When Good Data Goes Bad

Page 8: Techniques for Managing Huge Data LISA10

Failure Rates
- Mean Time Between Failures (MTBF)
  - Statistical interarrival error rate
  - Often cited in literature and data sheets
  - MTBF = total operating hours / total number of failures
- Annualized Failure Rate (AFR)
  - AFR = operating hours per year / MTBF
  - Expressed as a percent
  - Example:
    - MTBF = 1,200,000 hours
    - Year = 24 x 365 = 8,760 hours
    - AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%
- AFR is easier to grok than MTBF

Note: operating hours per year is a flexible definition
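The AFR arithmetic above is simple enough to check in a few lines. Here is a minimal Python sketch (mine, not from the original slides) that converts a data-sheet MTBF into an AFR, assuming the device operates 24x365:

```python
def afr(mtbf_hours, operating_hours_per_year=24 * 365):
    """Annualized Failure Rate as a fraction, from a data-sheet MTBF.

    AFR = operating hours per year / MTBF. Note the slide's caveat:
    "operating hours per year" is a flexible definition (24x365 assumed here).
    """
    return operating_hours_per_year / mtbf_hours

# Example from the slide: MTBF = 1,200,000 hours
print(f"AFR = {afr(1_200_000):.2%}")   # -> AFR = 0.73%
```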

Page 9: Techniques for Managing Huge Data LISA10

Multiple Systems and Statistics
- Consider 100 systems, each with an MTBF = 1,000 hours
- At time = 1,000 hours, 100 failures have occurred (on average, across the population)
- Not all systems will see exactly one failure

[Figure: histogram of the number of systems (0-40) versus the number of failures each saw (0-4), annotated "Unlucky", "Very Unlucky", and "Very, Very Unlucky"]
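The spread in that histogram is what a Poisson process predicts. As an illustrative sketch (not part of the original deck), assuming failures arrive independently at a constant rate, the expected split of 100 systems by failure count at t = MTBF looks like this:

```python
import math

def poisson_pmf(k, lam):
    """Probability of seeing exactly k failures when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

systems = 100
lam = 1.0  # expected failures per system at t = MTBF (1,000 h / 1,000 h MTBF)

for k in range(5):
    print(f"{k} failures: ~{systems * poisson_pmf(k, lam):.0f} systems")
# ~37 systems see no failure, ~37 see one, ~18 see two, ~6 see three, ...
```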

Page 10: Techniques for Managing Huge Data LISA10

Failure Rates
- MTBF is a summary metric
  - Manufacturers estimate MTBF by stressing many units for short qualification periods
- Summary metrics hide useful information
- Example: mortality study
  - Mortality of children aged 5-14 during 1996-1998 was measured at 20.8 per 100,000
  - That is an "MTBF" of 4,807 years
  - The current world average life expectancy is 67.2 years
- For large populations, such as huge disk farms, the summary MTBF can appear constant
- The better question to answer: "is my failure rate increasing or decreasing?"

Page 11: Techniques for Managing Huge Data LISA10

Why Do We Care?
- Summary statistics, like MTBF or AFR, can be misleading or risky if we do not also distinguish between stable and trending processes
- We need to analyze the ordered times between failures in relation to the system age to describe system reliability

Page 12: Techniques for Managing Huge Data LISA10

Time Dependent Reliability (TDR)
- Useful for repairable systems
  - The system can be restored to satisfactory operation by some repair action
  - Failures occur sequentially in time
- Measure the age of the components of a system
  - Need to distinguish age from interarrival times (time between failures)
  - Doesn't have to be precise; a resolution of weeks works OK
- Some devices report Power On Hours (POH)
  - SMART for disks
  - OSes
- Clerical solutions or inventory/asset systems work fine
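The TDR plots on the next slides are mean cumulative plots: for each system age, the average number of failures seen so far across a set of like systems. A minimal sketch of my own (not from the deck) of computing that curve from pooled per-system failure ages, assuming every system in the population has reached at least `max_age_months`:

```python
def mean_cumulative_failures(failure_ages, population, max_age_months):
    """Mean cumulative failures per system, by age in months.

    failure_ages: ages (in months) at which a failure occurred, pooled
                  across the whole population of like systems.
    population:   number of systems observed (all assumed old enough).
    """
    mcf = []
    cumulative = 0
    for month in range(1, max_age_months + 1):
        cumulative += sum(1 for age in failure_ages if month - 1 < age <= month)
        mcf.append(cumulative / population)
    return mcf

# Hypothetical example: 3 failures across 20 systems in the first 6 months
print(mean_cumulative_failures([2.5, 4.0, 4.2], population=20, max_age_months=6))
# -> [0.0, 0.0, 0.05, 0.1, 0.15, 0.15]
```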

Page 13: Techniques for Managing Huge Data LISA10

TDR Example 1

[Figure: mean cumulative failures (0-20) versus system age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, against a target MTBF line]

Page 14: Techniques for Managing Huge Data LISA10

TDR Example 2

Did a common event occur?

[Figure: mean cumulative failures (0-20) versus system age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, against a target MTBF line]

Page 15: Techniques for Managing Huge Data LISA10

TDR Example 2.5

[Figure: mean cumulative failures (0-20) versus calendar date (Jan 1, 2010 through Feb 3, 2014)]

Page 16: Techniques for Managing Huge Data LISA10

Long Term Storage
- Near-line disk systems for backup
  - Access time and bandwidth advantages over tape
- Enterprise-class tape for backup and archival
  - 15-30 year shelf life
  - Significant ECC
  - Read error rate: 1e-20
  - Enterprise-class HDD read error rate: 1e-15

Page 17: Techniques for Managing Huge Data LISA10

Reliability
- Reliability is time dependent
- TDR analysis reveals trends
  - Use cumulative plots, mean cumulative plots, and recurrence rates
  - Graphs are good
- Track failures and downtime by system, versus both age and calendar date
- Correlate anomalous behavior
- Manage retirement, refresh, and preventative processes using real data

Page 18: Techniques for Managing Huge Data LISA10

Data Sheets

Page 19: Techniques for Managing Huge Data LISA10

Reading Data Sheets
- Manufacturers publish useful data sheets and product guides
- Reliability information
  - MTBF or AFR
  - UER, or equivalent
  - Warranty
- Performance
  - Interface bandwidth
  - Sustained bandwidth (aka internal or media bandwidth)
  - Average rotational delay or rpm (HDD)
  - Average response or seek time
  - Native sector size
- Environmentals
  - Power

Note: the operating hours per year assumed for AFR can be a footnote

Page 20: Techniques for Managing Huge Data LISA10

Availability

Page 21: Techniques for Managing Huge Data LISA10

Nines Matter
- Is the Internet up?

Page 22: Techniques for Managing Huge Data LISA10

Nines Matter
- Is the Internet up?
- Is the Internet down?

Page 23: Techniques for Managing Huge Data LISA10

Nines Matter
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?

Page 24: Techniques for Managing Huge Data LISA10

Nines Don't Matter
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
- Do 5-9's matter?

Page 25: Techniques for Managing Huge Data LISA10

Reliability Matters!
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
- Do 5-9's matter?
- Reliability matters!

Page 26: Techniques for Managing Huge Data LISA10

Designing for Failure
- Change design perspective
- Design for success
  - How to make it work?
  - What you learned in school: solve the equation
  - Can be difficult...
- Design for failure
  - How to make it work when everything breaks?
  - What you learned in the army: win the war
  - Can be difficult... at first...

Page 27: Techniques for Managing Huge Data LISA10

Example: Design for Success

[Diagram: two x86 servers running NexentaStor with the HA-Cluster plugin, both attached to shared storage over FC, SAS, or iSCSI]

Page 28: Techniques for Managing Huge Data LISA10

Designing for Failure
- Application-level replication
  - Hard to implement - coding required
  - Some activity in the open community
  - Hard to apply to general-purpose computing
- Examples
  - DoD, Google, Facebook, Amazon, ...
  - The big guys
- Tends to scale well with size
- Multiple copies of data

Page 29: Techniques for Managing Huge Data LISA10

Reliability - Availability
- Reliability trumps availability
  - If disks didn't break, RAID would not exist
  - If servers didn't break, HA clusters would not exist
- Reliability is measured in probabilities
- Availability is measured in nines

Page 30: Techniques for Managing Huge Data LISA10

Data Retention

Page 31: Techniques for Managing Huge Data LISA10

Evaluating Data Retention
- MTTDL = Mean Time To Data Loss
- Note: MTBF is not constant in the real world, but it keeps the math simple
- MTTDL[1] is a simple MTTDL model
  - No parity (single vdev, striping, RAID-0):
    MTTDL[1] = MTBF / N
  - Single parity (mirror, RAIDZ, RAID-1, RAID-5):
    MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
  - Double parity (3-way mirror, RAIDZ2, RAID-6):
    MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
  - Triple parity (4-way mirror, RAIDZ3):
    MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
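The MTTDL[1] formulas above are easy to wrap in a helper. The following Python sketch is mine, not the author's tool; it just encodes the four cases listed on the slide (MTBF and MTTR in hours, N = number of disks in the vdev):

```python
def mttdl1_hours(mtbf, n, mttr=0, parity=0):
    """MTTDL[1] in hours for a single vdev of n disks.

    parity = 0 (stripe/RAID-0), 1 (mirror/RAIDZ/RAID-5),
             2 (RAIDZ2/RAID-6), 3 (RAIDZ3). mtbf and mttr are in hours.
    """
    if parity == 0:
        return mtbf / n
    denom = 1.0
    for i in range(parity + 1):          # N * (N-1) * ... * (N-parity)
        denom *= (n - i)
    return mtbf ** (parity + 1) / (denom * mttr ** parity)

HOURS_PER_YEAR = 24 * 365
# Hypothetical numbers: 8-disk RAIDZ2, 1.2M-hour MTBF, 24-hour replacement
print(f"{mttdl1_hours(1_200_000, n=8, mttr=24, parity=2) / HOURS_PER_YEAR:.2e} years")
```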

Page 32: Techniques for Managing Huge Data LISA10

Another MTTDL Model
- The MTTDL[1] model doesn't take unrecoverable reads into account
- But unrecoverable reads (UER) are becoming the dominant failure mode
- UER is specified as errors per bits read
- More bits = higher probability of loss per vdev
- The MTTDL[2] model considers UER

Page 33: Techniques for Managing Huge Data LISA10

Why Worry about UER?
- Richard's study
  - 3,684 hosts with 12,204 LUNs
  - 11.5% of all LUNs reported read errors
- Bairavasundaram et al., FAST '08
  - www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
  - 1.53M LUNs over 41 months
  - RAID reconstruction discovers 8% of checksum mismatches
  - "For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined"
- Manufacturers trade UER for space

Page 34: Techniques for Managing Huge Data LISA10

Why Worry about UER?

RAID array study

Page 35: Techniques for Managing Huge Data LISA10

Why Worry about UER?

RAID array study

[Chart annotations: unrecoverable reads; disk disappeared ("disk pull")]

"Disk pull" tests aren't very useful

Page 36: Techniques for Managing Huge Data LISA10

MTTDL[2] Model
- Probability that a reconstruction will fail:
  Precon_fail = (N-1) * size / UER
- The model doesn't work for non-parity schemes (single vdev, striping, RAID-0)
- Single parity (mirror, RAIDZ, RAID-1, RAID-5):
  MTTDL[2] = MTBF / (N * Precon_fail)
- Double parity (3-way mirror, RAIDZ2, RAID-6):
  MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
- Triple parity (4-way mirror, RAIDZ3):
  MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
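Again, a small sketch of my own (not the author's model code) that encodes the MTTDL[2] formulas, treating UER as "bits read per unrecoverable error" (e.g. 1e15 for an enterprise HDD) and disk size in bits:

```python
def precon_fail(n, size_bits, uer_bits_per_error):
    """Probability that a reconstruction hits an unrecoverable read."""
    return (n - 1) * size_bits / uer_bits_per_error

def mttdl2_hours(mtbf, n, mttr, size_bits, uer_bits_per_error, parity=1):
    """MTTDL[2] in hours; parity = 1, 2, or 3 (non-parity schemes not modeled)."""
    p = precon_fail(n, size_bits, uer_bits_per_error)
    denom = 1.0
    for i in range(parity):              # N * (N-1) * ... (parity terms)
        denom *= (n - i)
    return mtbf ** parity / (denom * mttr ** (parity - 1) * p)

HOURS_PER_YEAR = 24 * 365
# Hypothetical: 2-way mirror of 2 TB disks, 1.2M-hour MTBF, UER of 1e15
size_bits = 2e12 * 8
print(f"{mttdl2_hours(1_200_000, n=2, mttr=24, size_bits=size_bits,
                      uer_bits_per_error=1e15, parity=1) / HOURS_PER_YEAR:.2e} years")
```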

Page 37: Techniques for Managing Huge Data LISA10

Practical View of MTTDL[1]

Page 38: Techniques for Managing Huge Data LISA10

MTTDL[1] Comparison

Page 39: Techniques for Managing Huge Data LISA10

MTTDL Models: Mirror

Spares are not always better...

Page 40: Techniques for Managing Huge Data LISA10

MTTDL Models: RAIDZ2

Page 41: Techniques for Managing Huge Data LISA10

Space, Dependability, and Performance

Page 42: Techniques for Managing Huge Data LISA10

Dependability Use Case
- Customer has 15+ TB of read-mostly data
- 16-slot, 3.5" drive chassis
- 2 TB HDDs
- Option 1: one raidz2 set
  - 24 TB available space
  - 12 data + 2 parity
  - 2 hot spares, 48-hour disk replacement time
  - MTTDL[1] = 1,790,000 years
- Option 2: two raidz2 sets
  - 24 TB available space (12 TB per set)
  - 6 data + 2 parity per set
  - No hot spares
  - MTTDL[1] = 7,450,000 years
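Comparing layouts like these is what the mttdl1_hours sketch from the MTTDL[1] slide is for. The snippet below is illustrative only: it assumes a hypothetical 1,200,000-hour drive MTBF and the same 48-hour replacement time for both options, so the absolute numbers will not match the slide's figures (which depend on the author's assumptions), but the relative ranking of the two layouts is the point:

```python
HOURS_PER_YEAR = 24 * 365
MTBF = 1_200_000   # hypothetical drive MTBF in hours (not from the slide)
MTTR = 48          # hours to replace and resilver, assumed equal for both options

# Option 1: one 14-wide raidz2 (12 data + 2 parity)
opt1 = mttdl1_hours(MTBF, n=14, mttr=MTTR, parity=2) / HOURS_PER_YEAR

# Option 2: two 8-wide raidz2 sets (6 data + 2 parity each);
# two independent sets roughly halves the combined MTTDL
opt2 = mttdl1_hours(MTBF, n=8, mttr=MTTR, parity=2) / 2 / HOURS_PER_YEAR

print(f"Option 1: {opt1:,.0f} years   Option 2: {opt2:,.0f} years")
# The narrower sets come out ahead, matching the slide's ranking; the absolute
# values depend on the MTBF/MTTR assumptions above.
```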

Page 43: Techniques for Managing Huge Data LISA10

Planning for Spares
- Number of systems
- Need for spares
- How many spares do you need?
- How often do you plan replacements?
- Replacing devices immediately becomes impractical
- Not replacing devices increases risk, but how much?
- There is no black/white answer; it depends...
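One way to put a number on "how much risk" is the same Poisson arithmetic used earlier: given an AFR and a population, estimate how likely a given spares stock is to run out before the next replenishment. A hedged sketch of my own, with made-up numbers:

```python
import math

def prob_spares_exhausted(population, afr, interval_days, spares_on_hand):
    """P(more failures than spares during the replenishment interval),
    assuming independent failures at a constant rate (Poisson)."""
    lam = population * afr * interval_days / 365.0   # expected failures
    p_ok = sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(spares_on_hand + 1))
    return 1.0 - p_ok

# Hypothetical: 500 disks, 0.73% AFR, quarterly replenishment, 4 spares on hand
print(f"{prob_spares_exhausted(500, 0.0073, 90, 4):.1%}")
```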

Page 44: Techniques for Managing Huge Data LISA10

SparesOptimizer Demo

Page 45: Techniques for Managing Huge Data LISA10

Capacity, Planning, and Design


Page 46: Techniques for Managing Huge Data LISA10

Space
- Space is a poor sizing metric, really!
- Technology marketing heavily pushes space
- Maximizing space can mean compromising performance AND reliability
- As disks and tapes get bigger, they don't get better
- $150 rule
- PHBs get all excited about space
- Most current capacity planning tools manage by space

Page 47: Techniques for Managing Huge Data LISA10

Bandwidth
- Bandwidth constraints in modern systems are rare
- Overprovisioning for bandwidth is relatively simple
- Where to gain bandwidth can be tricky
  - Link aggregation
    - Ethernet
    - SAS
  - MPIO
- Adding parallelism beyond 2 trades off reliability

Page 48: Techniques for Managing Huge Data LISA10

Latency
- Lower latency == better performance
- Latency != IOPS
  - IOPS can also be achieved with parallelism
  - Parallelism only helps latency when latency is constrained by bandwidth
- Latency = access time + transfer time
- HDD
  - Access time limited by seek and rotate
  - Transfer time usually limited by media or internal bandwidth
- SSD
  - Access time limited by architecture more than by the interface
  - Transfer time limited by architecture and interface
- Tape
  - Access time measured in seconds
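The "access time + transfer time" decomposition is easy to make concrete. A back-of-the-envelope sketch with made-up but typical numbers (a 7,200 rpm disk, 8 ms average seek, 150 MB/s media bandwidth), not figures from the tutorial:

```python
def hdd_latency_ms(io_size_bytes, seek_ms=8.0, rpm=7200, media_mb_s=150):
    """Rough single-I/O latency: seek + half a rotation + transfer."""
    half_rotation_ms = 0.5 * 60_000 / rpm            # average rotational delay
    transfer_ms = io_size_bytes / (media_mb_s * 1e6) * 1000
    return seek_ms + half_rotation_ms + transfer_ms

print(f"4 KiB random read: {hdd_latency_ms(4096):.1f} ms")     # dominated by access time
print(f"1 MiB read       : {hdd_latency_ms(2**20):.1f} ms")    # transfer starts to matter
```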

Page 49: Techniques for Managing Huge Data LISA10

Deduplication

Page 50: Techniques for Managing Huge Data LISA10

What is Deduplication?
- A $2.1 billion feature
- 2009 buzzword of the year
- Technique for improving storage space efficiency
  - Trades big I/Os for small I/Os
  - Does not eliminate I/O
- Implementation styles
  - Offline or post-processing
    - data written to nonvolatile storage
    - process comes along later and dedupes data
    - example: tape archive dedup
  - Inline
    - data is deduped as it is being allocated to nonvolatile storage
    - example: ZFS

Page 51: Techniques for Managing Huge Data LISA10

Dedup How-To
- Given a bunch of data
- Find data that is duplicated
- Build a lookup table of references to the data
- Replace duplicate data with a pointer to the entry in the lookup table
- Granularity
  - file
  - block
  - byte
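As a toy illustration of that recipe at block granularity (my sketch, not ZFS's implementation), here is the lookup-table idea in a few lines of Python, keyed by a SHA-256 checksum:

```python
import hashlib

def dedup_blocks(data, block_size=4096):
    """Split data into blocks, store each unique block once, and
    represent the stream as a list of references into the table."""
    table = {}        # checksum -> unique block (the "lookup table")
    refs = []         # the deduplicated representation of the stream
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        key = hashlib.sha256(block).hexdigest()
        table.setdefault(key, block)      # first writer wins
        refs.append(key)                  # duplicates become pointers
    return table, refs

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # 4 blocks, 2 unique
table, refs = dedup_blocks(data)
print(len(refs), "blocks referenced,", len(table), "unique blocks stored")
# -> 4 blocks referenced, 2 unique blocks stored
```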

Page 52: Techniques for Managing Huge Data LISA10

Dedup Constraints
- Size of the deduplication table
- Quality of the checksums
  - Collisions happen
  - All possible permutations of N bits cannot be stored in N/10 bits
  - Checksums can be evaluated by probability of collisions
  - Multiple checksums can be used, but gains are marginal
- Compression algorithms can work against deduplication
  - Dedup before or after compression?
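The "probability of collisions" point is the birthday problem. A quick, hedged calculation of my own, estimating the expected number of colliding block pairs for a hypothetical pool:

```python
def expected_collisions(n_blocks, checksum_bits):
    """Birthday-problem estimate: expected number of colliding block pairs."""
    pairs = n_blocks * (n_blocks - 1) / 2
    return pairs / 2 ** checksum_bits

# Hypothetical pool: 1 PiB of 128 KiB blocks = 2^33 blocks
blocks = 2 ** 33
print(f"SHA-256: {expected_collisions(blocks, 256):.3e}")  # effectively zero
print(f"64-bit : {expected_collisions(blocks, 64):.3f}")   # ~2: collisions expected
```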

Page 53: Techniques for Managing Huge Data LISA10

Verification

[Flowchart of the dedup write path: write() -> compress -> checksum -> DDT entry lookup. If there is no DDT match, create a new entry. If there is a match and verification is enabled, read the stored data and compare; on a data match, add a reference, otherwise create a new entry. With verification disabled, a DDT match goes straight to adding a reference.]
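In code form, that flow looks roughly like the sketch below. This is my reading of the flowchart, not ZFS source; the storage back end is a toy stand-in defined inline for illustration:

```python
import hashlib
import zlib

# Toy stand-ins for the storage back end (illustration only, not ZFS)
_store = []                                   # the "disk": list of stored payloads

def compress(block):       return zlib.compress(block)
def checksum_of(payload):  return hashlib.sha256(payload).digest()
def write_new_block(payload):
    _store.append(payload)
    return len(_store) - 1                    # block reference = index
def read_block(ref):       return _store[ref]

def dedup_write(block, ddt, verify=True):
    """Inline dedup write path, following the flowchart above."""
    payload = compress(block)
    key = checksum_of(payload)
    entry = ddt.get(key)                      # DDT entry lookup
    if entry is not None:                     # DDT match?
        if not verify or read_block(entry[0]) == payload:  # verify? data match?
            entry[1] += 1                     # add reference
            return entry[0]
        # checksum collision: do not share; fall through to a new entry
    ref = write_new_block(payload)            # new entry
    if entry is None:
        ddt[key] = [ref, 1]
    return ref

ddt = {}
a = dedup_write(b"hello" * 1000, ddt)
b = dedup_write(b"hello" * 1000, ddt)         # same data -> same reference
print(a == b, ddt[list(ddt)[0]][1])           # -> True 2
```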

Page 54: Techniques for Managing Huge Data LISA10

Reference Counts

Eggs courtesy of Richard's chickens

Page 55: Techniques for Managing Huge Data LISA10

Replication

Page 56: Techniques for Managing Huge Data LISA10

Replication Services

[Chart: replication options arranged along a Recovery Point Objective axis (seconds to hours to days) and a system I/O performance axis (faster to slower): mirror, application-level replication, block replication (DRBD, SNDR), object-level sync (databases, ZFS), file-level sync (rsync), traditional backup (NDMP, tar)]

Page 57: Techniques for Managing Huge Data LISA10

How Many Copies Do You Need?
- Answer: at least one, more is better...
  - One production, one backup
  - One production, one near-line, one backup
  - One production, one near-line, one backup, one at a DR site
  - One production, one near-line, one backup, one at a DR site, one archived in a vault
- RAID doesn't count
- Consider 3 to 4 copies as a minimum for important data

Page 58: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: a big, honking disk array backed up file-by-file to a big, honking tape library]

Works great, but...

Page 59: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: the same file-based backup path, with 10 million files, 1 million daily changes, and a 12-hour backup window]

... backups never complete

Page 60: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: the disk array replicates hourly at the block level to a near-line backup tier, which feeds the tape library on a weekly backup window; 10 million files, 1 million daily changes]

Backups to near-line storage and tape have different policies

Page 61: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: the same tiered layout - disk array, near-line backup, tape library]

Quick file restoration possible

Page 62: Techniques for Managing Huge Data LISA10

Application-Level Replication Example

[Diagram: an application storing its data at three different sites (Site 1, Site 2, Site 3), with a long-term archive option]

Page 63: Techniques for Managing Huge Data LISA10

Data Sheets

Page 64: Techniques for Managing Huge Data LISA10

Reading Data Sheets Redux
- Manufacturers publish useful data sheets and product guides
- Reliability information
  - MTBF or AFR
  - UER, or equivalent
  - Warranty
- Performance
  - Interface bandwidth
  - Sustained bandwidth (aka internal or media bandwidth)
  - Average rotational delay or rpm (HDD)
  - Average response or seek time
  - Native sector size
- Environmentals
  - Power

Note: the operating hours per year assumed for AFR can be a footnote

Page 65: Techniques for Managing Huge Data LISA10

Summary

Page 66: Techniques for Managing Huge Data LISA10

Key Points
- You will need many copies of your data; get used to it
- The cost per byte decreases faster than old habits can be kicked
- Replication is a good thing; use it often
- Tiering is a good thing; use it often
- Beware of designing only for success; design for failure, too
- Reliability trumps availability
- Space, dependability, performance: pick two

Page 67: Techniques for Managing Huge Data LISA10

Thank You!

Questions?

[email protected]

[email protected]