dutch t. meyer william bolosky - usenix · 2019. 2. 25. · a study of practical deduplication...

A study of practical deduplication Dutch T. Meyer University of British Columbia Microsoft Research Intern William Bolosky Microsoft Research



A study of practical deduplication

Dutch T. Meyer
University of British Columbia
Microsoft Research Intern

William Bolosky
Microsoft Research



Why study deduplication?

$0.046 per GB

9ms per seek


When do we exploit duplicates?

It depends.

• How much can you get back from deduping?

• How does fragmenting files affect performance?

• How often will you access the data?


Outline

• Intro

• Methodology

• “There’s more here than dedup” teaser

(intermission)

• Deduplication Background

• Deduplication Analysis

• Conclusion


Methodology

MD5(name) | Metadata | MD5(data)

Once per week for 4 weeks.

• ~875 file systems

• ~40TB

• ~200M files
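
The per-file record on this slide can be sketched roughly as follows. This is only an illustration, not the scanner used in the study: `FileRecord`, `scan_file`, and the choice of file size as the single metadata field are assumptions, since the slide does not list which metadata was collected.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    name_hash: str  # MD5 of the file name, so the real name is not stored
    size: int       # one example metadata field (assumed; the slide only says "Metadata")
    data_hash: str  # MD5 of the file contents


def scan_file(path: str) -> FileRecord:
    """Build a record for one file, hashing its contents in 1 MiB blocks."""
    content = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            content.update(block)
    name = os.path.basename(path).encode("utf-8")
    return FileRecord(
        name_hash=hashlib.md5(name).hexdigest(),
        size=os.path.getsize(path),
        data_hash=content.hexdigest(),
    )
```

Hashing the name as well as the data keeps the record comparable across machines without exposing the name itself.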


There’s more here than dedup!

• We update and extend filesystem metadata findings from 2000 and 2004

• File system complexity is growing

• Read the paper to answer questions like:

Are my files bigger now than they used to be?


Teaser: Histogram of file size

[Figure: histogram of file size. X axis: File Size (bytes), power-of-two bins, 0 to 128M. Y axis: 0% to 14%. Series: 2000, 2004, 2009. Annotation: 4K, since 1981!]
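
A histogram with power-of-two bins like this one could be computed along the following lines. This is a sketch under assumptions: the slide does not state the exact bin edges, so bins of the form [2^k, 2^(k+1)) with a separate bin for empty files are assumed, and `power_of_two_bin` and `size_histogram` are hypothetical names.

```python
from collections import Counter


def power_of_two_bin(size: int) -> int:
    """Return the lower edge of the power-of-two bin holding size (empty files go in bin 0)."""
    if size <= 0:
        return 0
    return 1 << (size.bit_length() - 1)


def size_histogram(sizes):
    """Map each bin's lower edge to the fraction of files that fall in it."""
    counts = Counter(power_of_two_bin(s) for s in sizes)
    total = sum(counts.values())
    return {b: counts[b] / total for b in sorted(counts)}


print(size_histogram([0, 1, 3, 4096, 5000]))
# {0: 0.2, 1: 0.2, 2: 0.2, 4096: 0.4}
```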


There’s more here than dedup!

How fragmented are my files?


Teaser: Layout and Organization

• High linearity: only 4% of files fragmented in practice

– Most Windows machines defrag weekly

• One quarter of fragmented files have at least 170 fragments


Intermission

• Intro

• Methodology

• “There’s more here than dedup” teaser

(intermission)

• Deduplication Background

• Deduplication Analysis

• Conclusion


Dedup Background

foo: 01101010….. ….110010101

bar: 01101010….. ….110010101

Whole file Deduplication
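
Whole-file deduplication, as pictured with foo and bar above, can be sketched in a few lines. This is an illustrative sketch rather than the study's code: `whole_file_dedup` is a hypothetical name, the file contents are passed in memory for brevity, and MD5 is used only because it is the hash named on the Methodology slide.

```python
import hashlib


def whole_file_dedup(files: dict[str, bytes]) -> tuple[int, int]:
    """Return (total_bytes, stored_bytes) when byte-identical files are stored once."""
    seen = set()
    total = stored = 0
    for data in files.values():
        total += len(data)
        digest = hashlib.md5(data).digest()
        if digest not in seen:  # first time this exact content appears
            seen.add(digest)
            stored += len(data)
    return total, stored


files = {"foo": b"01101010" * 4, "bar": b"01101010" * 4, "baz": b"11111111"}
print(whole_file_dedup(files))  # (72, 40)
```

Here foo and bar have identical contents, so only one copy of their 32 bytes is counted as stored.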


Dedup Background

foo: 01101010….. ….110010101

bar: 01101010….. ….110010101

Fixed Chunk Deduplication

1

01101010…..

01101010…..

….110010101

….1100101011
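
Fixed-chunk deduplication can be sketched as below. Again this is only an illustration under assumptions: `fixed_chunks` and `chunk_dedup` are hypothetical names, and the tiny 8-byte chunk size is chosen so the example is readable, not because the study used it.

```python
import hashlib


def fixed_chunks(data: bytes, chunk_size: int):
    """Yield consecutive chunk_size-byte pieces; the last piece may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def chunk_dedup(files, chunk_size: int):
    """Return (total_bytes, stored_bytes) when identical chunks are stored once."""
    seen = set()
    total = stored = 0
    for data in files:
        for chunk in fixed_chunks(data, chunk_size):
            total += len(chunk)
            digest = hashlib.md5(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total, stored


foo = b"abcdefgh" + b"ijklmnop"
bar = b"abcdefgh" + b"IJKLMNOP"   # same first chunk, different second chunk
print(chunk_dedup([foo, bar], 8))          # (32, 24)
print(chunk_dedup([foo, b"X" + foo], 8))   # (33, 33)
```

The second call shows the known weakness of fixed chunking: one byte inserted at the front shifts every chunk boundary, so no chunk of the shifted copy matches the original.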


Dedup Background

foo: 01101010….. ….110010101

bar: 01101010….. ….110010101

Rabin Fingerprinting

1

110101101010010100

101101010…..
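
The idea behind Rabin-fingerprint chunking is that chunk boundaries are chosen from the content itself, so an insertion only disturbs nearby chunks. The sketch below illustrates that idea with a simple polynomial rolling hash, which is a stand-in and not a true Rabin fingerprint; `content_defined_chunks`, `WINDOW`, `BASE`, `MOD`, and the default `mask`, `min_size`, and `max_size` values are all assumptions made for the example.

```python
WINDOW = 16            # bytes covered by the rolling hash (assumed value)
BASE = 257
MOD = (1 << 61) - 1


def content_defined_chunks(data: bytes, mask: int = 0x3F,
                           min_size: int = 16, max_size: int = 1024):
    """Cut a chunk wherever the hash of the last WINDOW bytes has its masked bits all zero."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because the hash depends only on the most recent WINDOW bytes, the same content tends to produce the same boundaries even after a byte is inserted earlier in the file. A smaller `mask` gives smaller average chunks, which corresponds to the "Average Chunk Size" parameter in the comparison that follows.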


The Deduplication Space

Algorithm | Parameters | Cost | Deduplication effectiveness

Whole-file | (none) | Low | Lowest

Fixed Chunk | Chunk Size | Seeks, CPU, Complexity | Middle

Rabin fingerprints | Average Chunk Size | Seeks, More CPU, More Complexity | Highest


What is the relative deduplication rate of the algorithms?


Dedup by method and chunk size

[Figure: Space Deduplicated (0% to 100%) by Chunk Size (64K, 32K, 16K, 8K). Series: Whole File, Fixed-Chunk, Rabin]


What if I was doing full weekly backups?


Backup dedup over 4 weeks

[Figure: Deduplicated Space (0% to 90%) for Whole File, Whole File + Sparse, and 8K Rabin]


How does the number of filesystems influence deduplication?


Dedup by filesystem count

[Figure: Space Deduplicated (0% to 100%) by Deduplication Domain Size in file systems (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, Whole Set). Series: Whole File, 64KB Fixed, 8KB Fixed, 64KB Rabin, 8KB Rabin]


So what is filling up all this space?


Bytes by containing file size

[Figure: Percentage of Total Bytes (0% to 12%) by Containing File Size (Bytes), power-of-2 bins from 1K to 256G. Series: 2000, 2004, 2009]


What types of files take up disk space?


Disk consumption by file type

[Figure: disk consumption by file type (0% to 60%) for 2000, 2004, and 2009. Extensions labeled: dll, pdb, vhd, exe, lib, pst, pch, wma, mp3, cab, chm, iso, ø]



Which of these types deduplicate well?


Whole-file duplicates

Extension | % of Duplicate Space | Mean File Size (bytes) | % of Total Space

dll | 20% | 521K | 10%

lib | 11% | 1080K | 7%

pdb | 11% | 2M | 7%

<none> | 7% | 277K | 13%

exe | 6% | 572K | 4%

cab | 4% | 4M | 2%

msp | 3% | 15M | 2%

msi | 3% | 5M | 1%

iso | 2% | 436M | 2%

<a guid> | 1% | 604K | <1%


What files make up the 20% difference between whole-file dedup with sparse-file support and more aggressive deduplication?


Where does fine granularity help?

[Figure: Percentage of difference vs. whole file + sparse (0% to 70%), by file type, for 8K Fixed and 8K Rabin. Extensions labeled: vhd, pch, lib, dll, obj, pdb, wma, iso, pst, ø, avhd, mo3, wim]


Last plea to read the whole paper

• ~4x more results in paper!

• Real world filesystem analysis is hard

– Eight machine-months of query processing

– Requires careful simplifying assumptions

– Requires heavy optimization


Conclusion

• The benefit of fine-grained dedup is < 20%

– Potentially just a fraction of that.

• Fragmentation is a manageable problem

• Read the paper for more metadata results

We’re releasing this dataset