
Page 1: iDedup - USENIX

Kiran Srinivasan, Tim Bisson, Garth Goodson, Kaladhar Voruganti
Advanced Technology Group, NetApp

iDedup: Latency-aware inline deduplication for primary workloads

Page 2: iDedup - USENIX

iDedup – overview/context


[Diagram: storage clients access primary storage over NFS/CIFS/iSCSI; primary storage is backed up to secondary storage over NDMP or other protocols]

Primary storage:
•  Performance and reliability are key features
•  RPC-based protocols => latency sensitive
•  Only offline dedupe techniques have been developed

Secondary storage:
•  Dedupe is exploited effectively here => 90+% savings

iDedup:
•  Inline/foreground dedupe technique for primary storage
•  Minimal impact on latency-sensitive workloads

Page 3: iDedup - USENIX


Dedupe techniques (offline vs inline)

Offline dedupe:
•  The first copy on stable storage is not deduped
•  Dedupe is a post-processing/background activity

Inline dedupe:
•  Dedupe happens before the first copy reaches stable storage
•  Primary => latency should not be affected!
•  Secondary => must dedupe at the ingest rate (IOPS)!

Page 4: iDedup - USENIX

Why inline dedupe for primary?


Provisioning/planning is easier
•  Dedupe savings are seen right away
•  Planning is simpler because capacity values are accurate

No post-processing activities
•  No scheduling of background processes
•  No interference => front-end workloads are not affected
•  Key for storage system users with limited maintenance windows

Efficient use of resources
•  Efficient use of I/O bandwidth (offline incurs both reads and writes)
•  The file system's buffer cache is more efficient (deduped)

Performance challenges have been the key obstacle
•  Overheads (CPU + I/Os) for both reads and writes hurt latency

Page 5: iDedup - USENIX

iDedup – Key features


Minimizes inline dedupe performance overheads
•  Leverages workload characteristics
•  Eliminates almost all extra I/Os due to dedupe processing
•  CPU overheads are minimal

Tunable tradeoff: dedupe savings vs. performance
•  Selective dedupe => some loss in dedupe capacity savings

iDedup can be combined with offline techniques
•  Maintains the same on-disk data structures as the normal file system
•  Offline dedupe can optionally be run in addition

Page 6: iDedup - USENIX

Related Work


The design space, by workload and method:

•  Primary + offline: NetApp ASIS, EMC Celerra, IBM StorageTank
•  Primary + inline: iDedup, zFS*
•  Secondary + offline: (no motivation for systems in this category)
•  Secondary + inline: EMC DDFS, EMC Cluster DeepStore, NEC HydraStor, Venti, SiLo, Sparse Indexing, ChunkStash, Foundation, Symantec, EMC Centera

Page 7: iDedup - USENIX

Outline

•  Inline dedupe challenges
•  Our approach
•  Design/implementation details
•  Evaluation results
•  Summary


Page 8: iDedup - USENIX

Inline Dedupe – Read Path Challenges

[Figure: dedupe turns a sequential read into random seeks across fragmented blocks]

Inherently, dedupe causes disk-level fragmentation!
•  Sequential reads turn random => more seeks => more latency
•  RPC-based protocols (CIFS/NFS/iSCSI) are latency sensitive
•  Fragmentation is a dataset/workload property

Primary workloads are typically read-intensive
•  Usually the read/write ratio is ~70/30
•  Inline dedupe must not affect read performance!

Page 9: iDedup - USENIX

Inline Dedupe – Write Path Challenges


CPU overheads in the critical write path (see the hashing sketch at the end of this slide)
•  Dedupe requires computing the fingerprint (hash) of each block
•  The dedupe algorithm requires extra cycles

[Diagram: client write -> write logging in NVRAM; de-stage phase: compute hash -> dedupe algorithm -> write allocation -> disk I/O]

Extra random I/Os in the write path due to the dedupe algorithm
•  Dedupe metadata (fingerprint DB) lookups and updates
•  Updating the reference counts of blocks on disk

[Diagram: on-disk dedupe metadata accesses cause random seeks!]
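To make the hashing step concrete, here is a minimal sketch in Python. The 4 KB block size and the choice of SHA-256 are assumptions for illustration; the slides do not specify the fingerprint function.

    import hashlib

    BLOCK_SIZE = 4096  # assumed file-system block size

    def fingerprint(block: bytes) -> bytes:
        # A cryptographic digest serves as the block fingerprint:
        # equal digests imply equal contents (collisions negligible).
        return hashlib.sha256(block).digest()

Every block written incurs one such digest over BLOCK_SIZE bytes; that is the CPU cost this slide calls out.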

Page 10: iDedup - USENIX


Our Approach

Page 11: iDedup - USENIX

iDedup – Intuition


Is there a good tradeoff between capacity savings and latency performance?

Page 12: iDedup - USENIX

iDedup – Solution to Read Path issues


Insight 1: Dedupe only sequences of disk blocks (see the sketch below)
•  Solves fragmentation => seeks are amortized during reads
•  Selective dedupe leverages spatial locality
•  Configurable minimum sequence length

[Figure: fragmentation with random seeks vs. sequences with amortized seeks]
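A minimal sketch of this insight in Python, with illustrative names (not WAFL code): given, for each file block, the disk block number (DBN) of a known duplicate, dedupe only runs whose duplicates are contiguous on disk and at least "threshold" blocks long.

    def dedupable_runs(dup_dbns, threshold):
        # dup_dbns[i] is the DBN of a cached duplicate of file block i,
        # or None if no duplicate is known. Yields (start, length) of
        # runs whose duplicates sit on consecutive disk blocks.
        start = 0
        while start < len(dup_dbns):
            if dup_dbns[start] is None:
                start += 1
                continue
            end = start + 1
            while (end < len(dup_dbns) and
                   dup_dbns[end] == dup_dbns[end - 1] + 1):
                end += 1
            if end - start >= threshold:
                yield start, end - start  # sequential reads stay cheap
            start = end

For example, with dup_dbns = [100, 101, 102, None, 200] and threshold 2, only the three-block run starting at index 0 is deduped; the isolated duplicate at DBN 200 is written normally to avoid an extra seek.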

Page 13: iDedup - USENIX

iDedup – Write Path issues


How can we reduce dedupe metadata I/Os? Flash(?): read I/Os are cheap, but frequent updates are expensive.

Page 14: iDedup - USENIX

iDedup – Solution to Write Path issues


Insight 2: Keep a smaller dedupe metadata table as an in-memory cache (see the sketch below)
•  No extra I/Os
•  Leverages temporal locality in duplication
•  Some loss in dedupe (only a subset of blocks is tracked)

[Figure: over time, only the fingerprints of recently written blocks remain cached]
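A minimal sketch of such a cache in Python. Bounding it with LRU replacement and mapping fingerprints to disk block numbers are assumptions for illustration; the point is that lookups, insertions, and evictions touch only memory, never disk.

    from collections import OrderedDict

    class FingerprintCache:
        # LRU cache mapping fingerprint -> disk block number (DBN).
        # A fixed capacity bounds memory; evicting old entries loses
        # some dedupe opportunities but never adds I/O.
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def lookup(self, fp):
            dbn = self.entries.get(fp)
            if dbn is not None:
                self.entries.move_to_end(fp)  # refresh recency
            return dbn

        def insert(self, fp, dbn):
            self.entries[fp] = dbn
            self.entries.move_to_end(fp)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the LRU entry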

Page 15: iDedup - USENIX

iDedup - Viability


[Chart: dedupe ratio of the original (full) dedupe vs. dedupe restricted by spatial locality vs. by spatial + temporal locality]

Is this loss in dedupe savings OK?

Both spatial and temporal locality are dataset/workload properties!
⇒  Viable for some important primary workloads

Page 16: iDedup - USENIX


Design and Implementation

Page 17: iDedup - USENIX

iDedup Architecture


[Diagram: client writes are logged in NVRAM (write log blocks); during de-stage, the file system (WAFL) runs the iDedup algorithm, consulting the in-memory dedupe metadata (FPDB) before issuing reads and writes to disk]

Page 18: iDedup - USENIX

iDedup - Two key tunable parameters


Minimum sequence length (threshold)
•  Minimum number of sequential duplicate blocks on disk
•  A dataset property => ideally set to the expected fragmentation
•  Different from a larger block size: variable vs. fixed
•  Knob between performance (fragmentation) and dedupe savings

Dedupe metadata (fingerprint DB) cache size
•  A property of the workload's working set
•  An increase in cache size => a decrease in buffer cache
•  Knob between performance (cache hit ratio) and dedupe savings
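To show how these two knobs plug into the earlier sketches, a hypothetical configuration (the per-entry size is an assumption; the evaluation uses caches of 0.25 to 1 GB and thresholds of 1 to 8 blocks):

    ENTRY_BYTES = 32 + 8        # assumed: 32-byte fingerprint + 8-byte DBN
    CACHE_BYTES = 256 * 2**20   # 0.25 GB of dedupe metadata

    cache = FingerprintCache(capacity=CACHE_BYTES // ENTRY_BYTES)
    MIN_SEQ_LEN = 4             # Threshold-4: dedupe only runs of >= 4 blocks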

Page 19: iDedup - USENIX

iDedup Algorithm

The iDedup algorithm works in four phases for every file (a condensed sketch follows the list):

•  Phase 1 (per file): identify blocks for iDedup
   –  Only full, pure data blocks are processed
   –  Metadata blocks and special files are ignored
•  Phase 2 (per file): sequence processing
   –  Uses the dedupe metadata cache
   –  Keeps track of multiple sequences
•  Phase 3 (per sequence): sequence pruning
   –  Eliminates short sequences below the threshold
   –  Picks among overlapping sequences via a heuristic
•  Phase 4 (per sequence): deduplication of the sequence
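Tying the pieces together, a condensed Python sketch of the four phases. The block interface (is_full_data_block, data, share_with, write_to_disk) is hypothetical, and phase 3's overlapping-sequence heuristic is omitted for brevity.

    def idedup_file(file_blocks, cache, threshold):
        # Phase 1: consider only full, pure data blocks.
        candidates = [b for b in file_blocks if b.is_full_data_block]

        # Phase 2: probe the in-memory fingerprint cache per block.
        dup_dbns = [cache.lookup(fingerprint(b.data)) for b in candidates]

        # Phase 3: keep only contiguous duplicate runs >= threshold.
        runs = list(dedupable_runs(dup_dbns, threshold))
        in_run = {i for start, n in runs for i in range(start, start + n)}

        # Phase 4: share blocks in surviving runs; write out the rest
        # and record their fingerprints for future duplicates.
        for i, b in enumerate(candidates):
            if i in in_run:
                b.share_with(dup_dbns[i])  # metadata update, no data write
            else:
                dbn = b.write_to_disk()
                cache.insert(fingerprint(b.data), dbn)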


Page 20: iDedup - USENIX


Evaluation

Page 21: iDedup - USENIX

Evaluation Setup

•  NetApp FAS 3070, 8 GB RAM, 3-disk RAID-0
•  Evaluated by replaying real-world CIFS traces
   –  Corporate filer traces from a NetApp data center (2007)
      Read data: 204 GB (69%), write data: 93 GB
   –  Engineering filer traces from a NetApp data center (2007)
      Read data: 192 GB (67%), write data: 92 GB
•  Comparison points
   –  Baseline: system with no iDedup
   –  Threshold-1: system with full dedupe (1-block sequences)
•  Dedupe metadata cache sizes: 0.25, 0.5, and 1 GB

Page 22: iDedup - USENIX

Results: Deduplication Ratio vs. Threshold

[Chart: deduplication ratio (%) vs. threshold (1, 2, 4, 8) for cache sizes of 0.25, 0.5, and 1 GB (Corp trace)]

Less-than-linear decrease in dedupe savings
⇒  Spatial locality in dedupe

Ideal threshold = the biggest threshold with the least decrease in dedupe savings
⇒  Threshold-4
⇒  ~60% of the maximum savings

Page 23: iDedup - USENIX

Results: Disk Fragmentation (Request Sizes)

[Chart: CDF of block request sizes (Engg trace, 1 GB cache); x-axis: request sequence size in blocks (1-40), y-axis: percentage of total requests. Baseline (mean = 15.8), Threshold-1 (mean = 12.5), Threshold-2 (mean = 14.8), Threshold-4 (mean = 14.9), Threshold-8 (mean = 15.4). Baseline shows the least fragmentation, Threshold-1 the most.]

Fragmentation for the other thresholds falls between Baseline and Threshold-1
⇒  Tunable fragmentation

Page 24: iDedup - USENIX

Results: CPU Utilization

[Chart: CDF of CPU utilization samples (Corp trace, 1 GB cache); x-axis: CPU utilization (%), y-axis: percentage of CPU samples. Baseline (mean = 13.2%), Threshold-1 (mean = 15.0%), Threshold-4 (mean = 16.6%), Threshold-8 (mean = 17.1%).]

Larger variance (longer tail) compared to the baseline
⇒  But the difference in means is less than 4%

Page 25: iDedup - USENIX

Results: Latency Impact

[Chart: CDF of client response time (Corp trace, 1 GB cache); x-axis: response time in ms (0-30), y-axis: percentage of requests (75-100%), for Baseline, Threshold-1, and Threshold-8. Dedupe only affects responses longer than ~2 ms.]

Latency impact is confined to longer response times (> 2 ms)
⇒  Threshold-1 mean latency is affected by ~13% vs. the baseline
⇒  The difference between Threshold-8 and the baseline is < 4%!

Page 26: iDedup - USENIX

Summary

•  Inline dedupe has significant performance challenges
   –  Reads: fragmentation; writes: CPU + extra I/Os
•  iDedup creates a tradeoff between savings and performance
   –  Leverages dedupe locality properties
   –  Avoids fragmentation by deduping only sequences
   –  Avoids extra I/Os by keeping dedupe metadata in memory
•  Experiments with latency-sensitive primary workloads
   –  Low CPU impact: < 5% on average
   –  ~60% of maximum dedupe, ~4% impact on latency
•  Future work: dynamic threshold, more traces
•  Our traces are available for research purposes


Page 27: iDedup - USENIX

Acknowledgements

•  NetApp WAFL team
   –  Blake Lewis, Ling Zheng, Craig Johnston, Subbu PVS, Praveen K, Sriram Venketaraman
•  NetApp ATG team
   –  Scott Dawkins, Jeff Heller
•  Shepherd
   –  John Bent
•  Our intern (from UCSC)
   –  Stephanie Jones


Page 28: iDedup - USENIX
