Techniques for Managing Huge Data LISA10



DESCRIPTION

Slides from the USENIX LISA10 Tutorial on Techniques for Managing Huge Data

TRANSCRIPT

Page 1: Techniques for Managing Huge Data LISA10

Techniques for Handling Huge Storage

[email protected]

USENIX LISA'10 Conference, November 8, 2010

Page 2: Techniques for Managing Huge Data LISA10

Agenda
- How did we get here?
- When good data goes bad
- Capacity, planning, and design
- What comes next?

Note: this tutorial uses live demos, slides not so much

Page 3: Techniques for Managing Huge Data LISA10

History

Page 4: Techniques for Managing Huge Data LISA10

Milestones in Tape Evolution
- 1951 - magnetic tape for data storage
- 1964 - 9-track tape
- 1972 - Quarter Inch Cartridge (QIC)
- 1977 - Commodore Datasette
- 1984 - IBM 3480
- 1989 - DDS/DAT
- 1995 - IBM 3590
- 2000 - T9940
- 2000 - LTO
- 2006 - T10000
- 2008 - TS1130

Page 5: Techniques for Managing Huge Data LISA10

Milestones in Disk Evolution
- 1954 - hard disk invented
- 1950s - solid state disk invented
- 1981 - Shugart Associates System Interface (SASI)
- 1984 - Personal Computer Advanced Technology (PC/AT) Attachment, later shortened to ATA
- 1986 - "Small" Computer System Interface (SCSI)
- 1986 - Integrated Drive Electronics (IDE)
- 1994 - EIDE
- 1994 - Fibre Channel (FC)
- 1995 - flash-based SSDs
- 2001 - Serial ATA (SATA)
- 2005 - Serial Attached SCSI (SAS)

Page 6: Techniques for Managing Huge Data LISA10

Architectural Changes
- Simple, parallel interfaces
- Serial interfaces
- Aggregated serial interfaces

Page 7: Techniques for Managing Huge Data LISA10

When Good Data Goes Bad

Page 8: Techniques for Managing Huge Data LISA10

Failure Rates
- Mean Time Between Failures (MTBF)
  - Statistical interarrival error rate
  - Often cited in literature and data sheets
  - MTBF = total operating hours / total number of failures
- Annualized Failure Rate (AFR)
  - AFR = operating hours per year / MTBF
  - Expressed as a percent
  - Example:
    - MTBF = 1,200,000 hours
    - Year = 24 x 365 = 8,760 hours
    - AFR = 8,760 / 1,200,000 = 0.0073 = 0.73%
- AFR is easier to grok than MTBF

Note: operating hours per year is a flexible definition
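The AFR arithmetic above is simple enough to check in a few lines. Here is a minimal Python sketch (mine, not from the original slides) that converts a data-sheet MTBF into an AFR, assuming the device operates 24x365:

```python
def afr(mtbf_hours, operating_hours_per_year=24 * 365):
    """Annualized Failure Rate as a fraction, from a data-sheet MTBF.

    AFR = operating hours per year / MTBF. Note the slide's caveat:
    "operating hours per year" is a flexible definition (24x365 assumed here).
    """
    return operating_hours_per_year / mtbf_hours

# Example from the slide: MTBF = 1,200,000 hours
print(f"AFR = {afr(1_200_000):.2%}")   # -> AFR = 0.73%
```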

Page 9: Techniques for Managing Huge Data LISA10

Multiple Systems and Statistics
- Consider 100 systems, each with an MTBF = 1,000 hours
- At time = 1,000 hours, 100 failures have occurred (on average, across the population)
- Not all systems will see exactly one failure

[Figure: histogram of the number of systems (0-40) versus the number of failures each saw (0-4), annotated "Unlucky", "Very Unlucky", and "Very, Very Unlucky"]
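The spread in that histogram is what a Poisson process predicts. As an illustrative sketch (not part of the original deck), assuming failures arrive independently at a constant rate, the expected split of 100 systems by failure count at t = MTBF looks like this:

```python
import math

def poisson_pmf(k, lam):
    """Probability of seeing exactly k failures when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

systems = 100
lam = 1.0  # expected failures per system at t = MTBF (1,000 h / 1,000 h MTBF)

for k in range(5):
    print(f"{k} failures: ~{systems * poisson_pmf(k, lam):.0f} systems")
# ~37 systems see no failure, ~37 see one, ~18 see two, ~6 see three, ...
```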

Page 10: Techniques for Managing Huge Data LISA10

Failure Rates
- MTBF is a summary metric
  - Manufacturers estimate MTBF by stressing many units for short qualification periods
- Summary metrics hide useful information
- Example: mortality study
  - Mortality of children aged 5-14 during 1996-1998 was measured at 20.8 per 100,000
  - That is an "MTBF" of 4,807 years
  - The current world average life expectancy is 67.2 years
- For large populations, such as huge disk farms, the summary MTBF can appear constant
- The better question to answer: "is my failure rate increasing or decreasing?"

Page 11: Techniques for Managing Huge Data LISA10

Why Do We Care?
- Summary statistics, like MTBF or AFR, can be misleading or risky if we do not also distinguish between stable and trending processes
- We need to analyze the ordered times between failures in relation to the system age to describe system reliability

Page 12: Techniques for Managing Huge Data LISA10

Time Dependent Reliability (TDR)
- Useful for repairable systems
  - The system can be restored to satisfactory operation by some repair action
  - Failures occur sequentially in time
- Measure the age of the components of a system
  - Need to distinguish age from interarrival times (time between failures)
  - Doesn't have to be precise; a resolution of weeks works OK
- Some devices report Power On Hours (POH)
  - SMART for disks
  - OSes
- Clerical solutions or inventory/asset systems work fine
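The TDR plots on the next slides are mean cumulative plots: for each system age, the average number of failures seen so far across a set of like systems. A minimal sketch of my own (not from the deck) of computing that curve from pooled per-system failure ages, assuming every system in the population has reached at least `max_age_months`:

```python
def mean_cumulative_failures(failure_ages, population, max_age_months):
    """Mean cumulative failures per system, by age in months.

    failure_ages: ages (in months) at which a failure occurred, pooled
                  across the whole population of like systems.
    population:   number of systems observed (all assumed old enough).
    """
    mcf = []
    cumulative = 0
    for month in range(1, max_age_months + 1):
        cumulative += sum(1 for age in failure_ages if month - 1 < age <= month)
        mcf.append(cumulative / population)
    return mcf

# Hypothetical example: 3 failures across 20 systems in the first 6 months
print(mean_cumulative_failures([2.5, 4.0, 4.2], population=20, max_age_months=6))
# -> [0.0, 0.0, 0.05, 0.1, 0.15, 0.15]
```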

Page 13: Techniques for Managing Huge Data LISA10

TDR Example 1

[Figure: mean cumulative failures (0-20) versus system age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, against a target MTBF line]

Page 14: Techniques for Managing Huge Data LISA10

TDR Example 2

Did a common event occur?

[Figure: mean cumulative failures (0-20) versus system age in months (1-50) for Disk Set A, Disk Set B, and Disk Set C, against a target MTBF line]

Page 15: Techniques for Managing Huge Data LISA10

TDR Example 2.5

[Figure: mean cumulative failures (0-20) versus calendar date (Jan 1, 2010 through Feb 3, 2014)]

Page 16: Techniques for Managing Huge Data LISA10

Long Term Storage
- Near-line disk systems for backup
  - Access time and bandwidth advantages over tape
- Enterprise-class tape for backup and archival
  - 15-30 year shelf life
  - Significant ECC
  - Read error rate: 1e-20
  - Enterprise-class HDD read error rate: 1e-15

Page 17: Techniques for Managing Huge Data LISA10

Reliability
- Reliability is time dependent
- TDR analysis reveals trends
  - Use cumulative plots, mean cumulative plots, and recurrence rates
  - Graphs are good
- Track failures and downtime by system, versus both age and calendar date
- Correlate anomalous behavior
- Manage retirement, refresh, and preventative processes using real data

Page 18: Techniques for Managing Huge Data LISA10

Data Sheets

Page 19: Techniques for Managing Huge Data LISA10

Reading Data Sheets
- Manufacturers publish useful data sheets and product guides
- Reliability information
  - MTBF or AFR
  - UER, or equivalent
  - Warranty
- Performance
  - Interface bandwidth
  - Sustained bandwidth (aka internal or media bandwidth)
  - Average rotational delay or rpm (HDD)
  - Average response or seek time
  - Native sector size
- Environmentals
  - Power

Note: the operating hours per year assumed for AFR can be a footnote

Page 20: Techniques for Managing Huge Data LISA10

Availability

Page 21: Techniques for Managing Huge Data LISA10

Nines Matter
- Is the Internet up?

Page 22: Techniques for Managing Huge Data LISA10

Nines Matter
- Is the Internet up?
- Is the Internet down?

Page 23: Techniques for Managing Huge Data LISA10

Nines Matter
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?

Page 24: Techniques for Managing Huge Data LISA10

Nines Don't Matter
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
- Do 5-9's matter?

Page 25: Techniques for Managing Huge Data LISA10

Reliability Matters!
- Is the Internet up?
- Is the Internet down?
- Is the Internet's reliability 5-9's?
- Do 5-9's matter?
- Reliability matters!

Page 26: Techniques for Managing Huge Data LISA10

Designing for Failure
- Change design perspective
- Design for success
  - How to make it work?
  - What you learned in school: solve the equation
  - Can be difficult...
- Design for failure
  - How to make it work when everything breaks?
  - What you learned in the army: win the war
  - Can be difficult... at first...

Page 27: Techniques for Managing Huge Data LISA10

Example: Design for Success

[Diagram: two x86 servers running NexentaStor with the HA-Cluster plugin, both attached to shared storage over FC, SAS, or iSCSI]

Page 28: Techniques for Managing Huge Data LISA10

Designing for Failure
- Application-level replication
  - Hard to implement - coding required
  - Some activity in the open community
  - Hard to apply to general-purpose computing
- Examples
  - DoD, Google, Facebook, Amazon, ...
  - The big guys
- Tends to scale well with size
- Multiple copies of data

Page 29: Techniques for Managing Huge Data LISA10

Reliability - Availability
- Reliability trumps availability
  - If disks didn't break, RAID would not exist
  - If servers didn't break, HA clusters would not exist
- Reliability is measured in probabilities
- Availability is measured in nines

Page 30: Techniques for Managing Huge Data LISA10

Data Retention

Page 31: Techniques for Managing Huge Data LISA10

Evaluating Data Retention
- MTTDL = Mean Time To Data Loss
- Note: MTBF is not constant in the real world, but it keeps the math simple
- MTTDL[1] is a simple MTTDL model
  - No parity (single vdev, striping, RAID-0):
    MTTDL[1] = MTBF / N
  - Single parity (mirror, RAIDZ, RAID-1, RAID-5):
    MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
  - Double parity (3-way mirror, RAIDZ2, RAID-6):
    MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
  - Triple parity (4-way mirror, RAIDZ3):
    MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
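The MTTDL[1] formulas above are easy to wrap in a helper. The following Python sketch is mine, not the author's tool; it just encodes the four cases listed on the slide (MTBF and MTTR in hours, N = number of disks in the vdev):

```python
def mttdl1_hours(mtbf, n, mttr=0, parity=0):
    """MTTDL[1] in hours for a single vdev of n disks.

    parity = 0 (stripe/RAID-0), 1 (mirror/RAIDZ/RAID-5),
             2 (RAIDZ2/RAID-6), 3 (RAIDZ3). mtbf and mttr are in hours.
    """
    if parity == 0:
        return mtbf / n
    denom = 1.0
    for i in range(parity + 1):          # N * (N-1) * ... * (N-parity)
        denom *= (n - i)
    return mtbf ** (parity + 1) / (denom * mttr ** parity)

HOURS_PER_YEAR = 24 * 365
# Hypothetical numbers: 8-disk RAIDZ2, 1.2M-hour MTBF, 24-hour replacement
print(f"{mttdl1_hours(1_200_000, n=8, mttr=24, parity=2) / HOURS_PER_YEAR:.2e} years")
```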

Page 32: Techniques for Managing Huge Data LISA10

Another MTTDL Model
- The MTTDL[1] model doesn't take unrecoverable reads into account
- But unrecoverable reads (UER) are becoming the dominant failure mode
- UER is specified as errors per bits read
- More bits = higher probability of loss per vdev
- The MTTDL[2] model considers UER

Page 33: Techniques for Managing Huge Data LISA10

Why Worry about UER?
- Richard's study
  - 3,684 hosts with 12,204 LUNs
  - 11.5% of all LUNs reported read errors
- Bairavasundaram et al., FAST '08
  - www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
  - 1.53M LUNs over 41 months
  - RAID reconstruction discovers 8% of checksum mismatches
  - "For some drive models as many as 4% of drives develop checksum mismatches during the 17 months examined"
- Manufacturers trade UER for space

Page 34: Techniques for Managing Huge Data LISA10

Why Worry about UER?

RAID array study

Page 35: Techniques for Managing Huge Data LISA10

Why Worry about UER?

RAID array study

[Chart annotations: unrecoverable reads; disk disappeared ("disk pull")]

"Disk pull" tests aren't very useful

Page 36: Techniques for Managing Huge Data LISA10

MTTDL[2] Model
- Probability that a reconstruction will fail:
  Precon_fail = (N-1) * size / UER
- The model doesn't work for non-parity schemes (single vdev, striping, RAID-0)
- Single parity (mirror, RAIDZ, RAID-1, RAID-5):
  MTTDL[2] = MTBF / (N * Precon_fail)
- Double parity (3-way mirror, RAIDZ2, RAID-6):
  MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
- Triple parity (4-way mirror, RAIDZ3):
  MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
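Again, a small sketch of my own (not the author's model code) that encodes the MTTDL[2] formulas, treating UER as "bits read per unrecoverable error" (e.g. 1e15 for an enterprise HDD) and disk size in bits:

```python
def precon_fail(n, size_bits, uer_bits_per_error):
    """Probability that a reconstruction hits an unrecoverable read."""
    return (n - 1) * size_bits / uer_bits_per_error

def mttdl2_hours(mtbf, n, mttr, size_bits, uer_bits_per_error, parity=1):
    """MTTDL[2] in hours; parity = 1, 2, or 3 (non-parity schemes not modeled)."""
    p = precon_fail(n, size_bits, uer_bits_per_error)
    denom = 1.0
    for i in range(parity):              # N * (N-1) * ... (parity terms)
        denom *= (n - i)
    return mtbf ** parity / (denom * mttr ** (parity - 1) * p)

HOURS_PER_YEAR = 24 * 365
# Hypothetical: 2-way mirror of 2 TB disks, 1.2M-hour MTBF, UER of 1e15
size_bits = 2e12 * 8
print(f"{mttdl2_hours(1_200_000, n=2, mttr=24, size_bits=size_bits,
                      uer_bits_per_error=1e15, parity=1) / HOURS_PER_YEAR:.2e} years")
```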

Page 37: Techniques for Managing Huge Data LISA10

Practical View of MTTDL[1]

Page 38: Techniques for Managing Huge Data LISA10

MTTDL[1] Comparison

Page 39: Techniques for Managing Huge Data LISA10

MTTDL Models: Mirror

Spares are not always better...

Page 40: Techniques for Managing Huge Data LISA10

MTTDL Models: RAIDZ2

Page 41: Techniques for Managing Huge Data LISA10

Space, Dependability, and Performance

Page 42: Techniques for Managing Huge Data LISA10

Dependability Use Case
- Customer has 15+ TB of read-mostly data
- 16-slot, 3.5" drive chassis
- 2 TB HDDs
- Option 1: one raidz2 set
  - 24 TB available space
  - 12 data + 2 parity
  - 2 hot spares, 48-hour disk replacement time
  - MTTDL[1] = 1,790,000 years
- Option 2: two raidz2 sets
  - 24 TB available space (12 TB per set)
  - 6 data + 2 parity per set
  - No hot spares
  - MTTDL[1] = 7,450,000 years
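Comparing layouts like these is what the mttdl1_hours sketch from the MTTDL[1] slide is for. The snippet below is illustrative only: it assumes a hypothetical 1,200,000-hour drive MTBF and the same 48-hour replacement time for both options, so the absolute numbers will not match the slide's figures (which depend on the author's assumptions), but the relative ranking of the two layouts is the point:

```python
HOURS_PER_YEAR = 24 * 365
MTBF = 1_200_000   # hypothetical drive MTBF in hours (not from the slide)
MTTR = 48          # hours to replace and resilver, assumed equal for both options

# Option 1: one 14-wide raidz2 (12 data + 2 parity)
opt1 = mttdl1_hours(MTBF, n=14, mttr=MTTR, parity=2) / HOURS_PER_YEAR

# Option 2: two 8-wide raidz2 sets (6 data + 2 parity each);
# two independent sets roughly halves the combined MTTDL
opt2 = mttdl1_hours(MTBF, n=8, mttr=MTTR, parity=2) / 2 / HOURS_PER_YEAR

print(f"Option 1: {opt1:,.0f} years   Option 2: {opt2:,.0f} years")
# The narrower sets come out ahead, matching the slide's ranking; the absolute
# values depend on the MTBF/MTTR assumptions above.
```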

Page 43: Techniques for Managing Huge Data LISA10

Planning for Spares
- Number of systems
- Need for spares
- How many spares do you need?
- How often do you plan replacements?
- Replacing devices immediately becomes impractical
- Not replacing devices increases risk, but how much?
- There is no black/white answer; it depends...
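One way to put a number on "how much risk" is the same Poisson arithmetic used earlier: given an AFR and a population, estimate how likely a given spares stock is to run out before the next replenishment. A hedged sketch of my own, with made-up numbers:

```python
import math

def prob_spares_exhausted(population, afr, interval_days, spares_on_hand):
    """P(more failures than spares during the replenishment interval),
    assuming independent failures at a constant rate (Poisson)."""
    lam = population * afr * interval_days / 365.0   # expected failures
    p_ok = sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(spares_on_hand + 1))
    return 1.0 - p_ok

# Hypothetical: 500 disks, 0.73% AFR, quarterly replenishment, 4 spares on hand
print(f"{prob_spares_exhausted(500, 0.0073, 90, 4):.1%}")
```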

Page 44: Techniques for Managing Huge Data LISA10

SparesOptimizer Demo

Page 45: Techniques for Managing Huge Data LISA10

Capacity, Planning, and Design


Page 46: Techniques for Managing Huge Data LISA10

Space
- Space is a poor sizing metric, really!
- Technology marketing heavily pushes space
- Maximizing space can mean compromising performance AND reliability
- As disks and tapes get bigger, they don't get better
- $150 rule
- PHBs get all excited about space
- Most current capacity planning tools manage by space

Page 47: Techniques for Managing Huge Data LISA10

Bandwidth
- Bandwidth constraints in modern systems are rare
- Overprovisioning for bandwidth is relatively simple
- Where to gain bandwidth can be tricky
  - Link aggregation
    - Ethernet
    - SAS
  - MPIO
- Adding parallelism beyond 2 trades off reliability

Page 48: Techniques for Managing Huge Data LISA10

Latency
- Lower latency == better performance
- Latency != IOPS
  - IOPS can also be achieved with parallelism
  - Parallelism only helps latency when latency is constrained by bandwidth
- Latency = access time + transfer time
- HDD
  - Access time limited by seek and rotate
  - Transfer time usually limited by media or internal bandwidth
- SSD
  - Access time limited by architecture more than by the interface
  - Transfer time limited by architecture and interface
- Tape
  - Access time measured in seconds
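The "access time + transfer time" decomposition is easy to make concrete. A back-of-the-envelope sketch with made-up but typical numbers (a 7,200 rpm disk, 8 ms average seek, 150 MB/s media bandwidth), not figures from the tutorial:

```python
def hdd_latency_ms(io_size_bytes, seek_ms=8.0, rpm=7200, media_mb_s=150):
    """Rough single-I/O latency: seek + half a rotation + transfer."""
    half_rotation_ms = 0.5 * 60_000 / rpm            # average rotational delay
    transfer_ms = io_size_bytes / (media_mb_s * 1e6) * 1000
    return seek_ms + half_rotation_ms + transfer_ms

print(f"4 KiB random read: {hdd_latency_ms(4096):.1f} ms")     # dominated by access time
print(f"1 MiB read       : {hdd_latency_ms(2**20):.1f} ms")    # transfer starts to matter
```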

Page 49: Techniques for Managing Huge Data LISA10

Deduplication

Page 50: Techniques for Managing Huge Data LISA10

What is Deduplication?
- A $2.1 billion feature
- 2009 buzzword of the year
- Technique for improving storage space efficiency
  - Trades big I/Os for small I/Os
  - Does not eliminate I/O
- Implementation styles
  - Offline or post-processing
    - data written to nonvolatile storage
    - process comes along later and dedupes data
    - example: tape archive dedup
  - Inline
    - data is deduped as it is being allocated to nonvolatile storage
    - example: ZFS

Page 51: Techniques for Managing Huge Data LISA10

Dedup How-To
- Given a bunch of data
- Find data that is duplicated
- Build a lookup table of references to the data
- Replace duplicate data with a pointer to the entry in the lookup table
- Granularity
  - file
  - block
  - byte
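As a toy illustration of that recipe at block granularity (my sketch, not ZFS's implementation), here is the lookup-table idea in a few lines of Python, keyed by a SHA-256 checksum:

```python
import hashlib

def dedup_blocks(data, block_size=4096):
    """Split data into blocks, store each unique block once, and
    represent the stream as a list of references into the table."""
    table = {}        # checksum -> unique block (the "lookup table")
    refs = []         # the deduplicated representation of the stream
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        key = hashlib.sha256(block).hexdigest()
        table.setdefault(key, block)      # first writer wins
        refs.append(key)                  # duplicates become pointers
    return table, refs

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # 4 blocks, 2 unique
table, refs = dedup_blocks(data)
print(len(refs), "blocks referenced,", len(table), "unique blocks stored")
# -> 4 blocks referenced, 2 unique blocks stored
```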

Page 52: Techniques for Managing Huge Data LISA10

Dedup Constraints
- Size of the deduplication table
- Quality of the checksums
  - Collisions happen
  - All possible permutations of N bits cannot be stored in N/10 bits
  - Checksums can be evaluated by probability of collisions
  - Multiple checksums can be used, but gains are marginal
- Compression algorithms can work against deduplication
  - Dedup before or after compression?
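The "probability of collisions" point is the birthday problem. A quick, hedged calculation of my own, estimating the expected number of colliding block pairs for a hypothetical pool:

```python
def expected_collisions(n_blocks, checksum_bits):
    """Birthday-problem estimate: expected number of colliding block pairs."""
    pairs = n_blocks * (n_blocks - 1) / 2
    return pairs / 2 ** checksum_bits

# Hypothetical pool: 1 PiB of 128 KiB blocks = 2^33 blocks
blocks = 2 ** 33
print(f"SHA-256: {expected_collisions(blocks, 256):.3e}")  # effectively zero
print(f"64-bit : {expected_collisions(blocks, 64):.3f}")   # ~2: collisions expected
```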

Page 53: Techniques for Managing Huge Data LISA10

Verification

[Flowchart of the dedup write path: write() -> compress -> checksum -> DDT entry lookup. If there is no DDT match, create a new entry. If there is a match and verification is enabled, read the stored data and compare; on a data match, add a reference, otherwise create a new entry. With verification disabled, a DDT match goes straight to adding a reference.]
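In code form, that flow looks roughly like the sketch below. This is my reading of the flowchart, not ZFS source; the storage back end is a toy stand-in defined inline for illustration:

```python
import hashlib
import zlib

# Toy stand-ins for the storage back end (illustration only, not ZFS)
_store = []                                   # the "disk": list of stored payloads

def compress(block):       return zlib.compress(block)
def checksum_of(payload):  return hashlib.sha256(payload).digest()
def write_new_block(payload):
    _store.append(payload)
    return len(_store) - 1                    # block reference = index
def read_block(ref):       return _store[ref]

def dedup_write(block, ddt, verify=True):
    """Inline dedup write path, following the flowchart above."""
    payload = compress(block)
    key = checksum_of(payload)
    entry = ddt.get(key)                      # DDT entry lookup
    if entry is not None:                     # DDT match?
        if not verify or read_block(entry[0]) == payload:  # verify? data match?
            entry[1] += 1                     # add reference
            return entry[0]
        # checksum collision: do not share; fall through to a new entry
    ref = write_new_block(payload)            # new entry
    if entry is None:
        ddt[key] = [ref, 1]
    return ref

ddt = {}
a = dedup_write(b"hello" * 1000, ddt)
b = dedup_write(b"hello" * 1000, ddt)         # same data -> same reference
print(a == b, ddt[list(ddt)[0]][1])           # -> True 2
```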

Page 54: Techniques for Managing Huge Data LISA10

Reference Counts

Eggs courtesy of Richard's chickens

Page 55: Techniques for Managing Huge Data LISA10

Replication

Page 56: Techniques for Managing Huge Data LISA10

Replication Services

[Chart: replication options arranged along a Recovery Point Objective axis (seconds to hours to days) and a system I/O performance axis (faster to slower): mirror, application-level replication, block replication (DRBD, SNDR), object-level sync (databases, ZFS), file-level sync (rsync), traditional backup (NDMP, tar)]

Page 57: Techniques for Managing Huge Data LISA10

How Many Copies Do You Need?
- Answer: at least one, more is better...
  - One production, one backup
  - One production, one near-line, one backup
  - One production, one near-line, one backup, one at a DR site
  - One production, one near-line, one backup, one at a DR site, one archived in a vault
- RAID doesn't count
- Consider 3 to 4 copies as a minimum for important data

Page 58: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: a big, honking disk array backed up file-by-file to a big, honking tape library]

Works great, but...

Page 59: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: the same file-based backup path, with 10 million files, 1 million daily changes, and a 12-hour backup window]

... backups never complete

Page 60: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: the disk array replicates hourly at the block level to a near-line backup tier, which feeds the tape library on a weekly backup window; 10 million files, 1 million daily changes]

Backups to near-line storage and tape have different policies

Page 61: Techniques for Managing Huge Data LISA10

Tiering Example

[Diagram: the same tiered layout - disk array, near-line backup, tape library]

Quick file restoration possible

Page 62: Techniques for Managing Huge Data LISA10

Application-Level Replication Example

[Diagram: an application storing its data at three different sites (Site 1, Site 2, Site 3), with a long-term archive option]

Page 63: Techniques for Managing Huge Data LISA10

Data Sheets

Page 64: Techniques for Managing Huge Data LISA10

Reading Data Sheets Redux
- Manufacturers publish useful data sheets and product guides
- Reliability information
  - MTBF or AFR
  - UER, or equivalent
  - Warranty
- Performance
  - Interface bandwidth
  - Sustained bandwidth (aka internal or media bandwidth)
  - Average rotational delay or rpm (HDD)
  - Average response or seek time
  - Native sector size
- Environmentals
  - Power

Note: the operating hours per year assumed for AFR can be a footnote

Page 65: Techniques for Managing Huge Data LISA10

Summary

Page 66: Techniques for Managing Huge Data LISA10

Key Points
- You will need many copies of your data; get used to it
- The cost per byte decreases faster than old habits can be kicked
- Replication is a good thing; use it often
- Tiering is a good thing; use it often
- Beware of designing only for success; design for failure, too
- Reliability trumps availability
- Space, dependability, performance: pick two

Page 67: Techniques for Managing Huge Data LISA10

Thank You!

Questions?

[email protected]

[email protected]