ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching

Zhichao Cao, Hao Wen, Fenggang Wu and David H.C. Du

University of Minnesota, Twin Cities

02/15/2018

Center for Research in Intelligent Storage

Agenda

• Deduplication Process
• Restore Process with Different Caching Schemes
– Container/chunk based caching
– Forward Assembly
• Objective and Challenges
• Proposed Approach
– Look-ahead window assisted chunk based caching (all fixed)
– Adaptive Look-ahead Chunk-based Caching (ALACC)
• Evaluations
• Conclusions and Future Work


Deduplication Process [1]

[Figure: deduplication data flow. Labels: Byte Stream, Sliding Window, Indexing Table, Container Buffer, Container Storage, Recipe. In this step the lookup of chunk 14 in the indexing table returns Not Found.]

Recipe entry:
1. Chunk ID
2. Chunk Size
3. Container Address
4. Offset in the container
5. Other meta information

[1] Zhu B, Li K, Patterson R H. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System[C]//FAST. 2008, 8: 1-14.

Deduplication Process [1]

[Figure: the same data flow for chunk 5, whose lookup in the indexing table returns Exists. Labels: Byte Stream, Sliding Window, Indexing Table, Container Buffer, Container Storage, Recipe.]
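The write path on the deduplication slides can be illustrated with a small sketch. It is not the system described in [1]: fixed-size chunking stands in for the sliding-window chunking, a SHA-1 digest serves as the chunk ID, and the names `RecipeEntry`, `deduplicate` and `restore`, the chunk size and the container capacity are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RecipeEntry:
    chunk_id: str   # fingerprint of the chunk
    size: int       # chunk size in bytes
    container: int  # container address
    offset: int     # offset inside the container


def deduplicate(data, chunk_size=4, container_capacity=8):
    """Split data into chunks, store each unique chunk once, build the recipe."""
    index = {}                    # indexing table: chunk_id -> (container, offset)
    containers = [bytearray()]    # the last element plays the container buffer
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        chunk_id = hashlib.sha1(chunk).hexdigest()
        if chunk_id not in index:  # "Not Found": store the new chunk
            if len(containers[-1]) + len(chunk) > container_capacity:
                containers.append(bytearray())  # buffer full, open a new container
            index[chunk_id] = (len(containers) - 1, len(containers[-1]))
            containers[-1] += chunk
        container, offset = index[chunk_id]     # "Exists": only reference it
        recipe.append(RecipeEntry(chunk_id, len(chunk), container, offset))
    return recipe, containers


def restore(recipe, containers):
    """Rebuild the original byte stream by following the recipe."""
    return b"".join(
        bytes(containers[e.container][e.offset:e.offset + e.size]) for e in recipe
    )


data = b"aaaabbbbaaaaccccbbbb"
recipe, containers = deduplicate(data)
```

In this toy run the 20-byte stream yields five recipe entries but only 12 stored bytes in two containers, and `restore(recipe, containers)` returns the original bytes.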


Why Is Improving Restore Performance Important?

• Because of serious data fragmentation and the size mismatch between the requested data and the I/O unit, restore performance is much lower than directly reading out data that has not been deduplicated.
• CPU and memory resources are limited.

[Figure: the chunks requested by the recipe are scattered across containers in Container Storage.]

Restore Process with Container-based Caching

[Figure: restore with a container cache. Labels: Recipe, Restore Direction, Container Storage, Container Cache, Assembling Buffer, Restored Data Storage.]

Restore Process with Chunk-based Caching

[Figure: restore with a chunk cache. Labels: Recipe, Container Storage, Container Read Buffer, Chunk Cache, Assembling Buffer, Restored Data Storage.]

Container-based Caching vs. Chunk-based Caching

• Container-based Caching
– Advantage: less operating and management overhead
– Disadvantage: relatively higher cache miss ratio, especially when the caching space is limited
• Chunk-based Caching
– Advantages: (1) higher cache hit ratio; (2) even higher if a look-ahead window is applied
– Disadvantage: higher operating and management overhead
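A toy simulation can make the hit-ratio difference concrete. The sketch below counts container reads for an LRU cache of whole containers and for an LRU cache of individual chunks with the same total space. The function names, the example layout and the rule that every chunk of a read container enters the chunk cache are assumptions for illustration, not details taken from the slides.

```python
from collections import OrderedDict


def container_lru_reads(recipe, location, capacity):
    """Container reads with an LRU cache holding `capacity` whole containers."""
    cache = OrderedDict()                # container id -> None, oldest first
    reads = 0
    for chunk in recipe:
        cid = location[chunk]
        if cid in cache:
            cache.move_to_end(cid)       # hit: refresh recency
            continue
        reads += 1                       # miss: read the container from storage
        cache[cid] = None
        if len(cache) > capacity:
            cache.popitem(last=False)    # evict the least recently used container
    return reads


def chunk_lru_reads(recipe, location, contents, capacity):
    """Container reads with an LRU cache holding `capacity` individual chunks.

    Assumption: on a miss, every chunk of the container just read is cached.
    """
    cache = OrderedDict()                # chunk id -> None, oldest first
    reads = 0
    for chunk in recipe:
        if chunk in cache:
            cache.move_to_end(chunk)
            continue
        reads += 1
        for c in contents[location[chunk]]:
            cache[c] = None
            cache.move_to_end(c)
        cache.move_to_end(chunk)         # the requested chunk is the most recent
        while len(cache) > capacity:
            cache.popitem(last=False)    # evict the least recently used chunk
    return reads


# Hypothetical layout: three containers with two chunks each.
contents = {"A": [1, 2], "B": [3, 4], "C": [5, 6]}
location = {chunk: cid for cid, chunks in contents.items() for chunk in chunks}
recipe = [1, 3, 1, 3, 5, 1, 3]

container_reads = container_lru_reads(recipe, location, 2)   # 2 containers
chunk_reads = chunk_lru_reads(recipe, location, contents, 4) # 4 chunks, same space
```

With this recipe the container cache needs 5 container reads and the chunk cache needs 3, because the unused chunks 2 and 4 age out of the chunk cache while the reused chunks 1 and 3 stay.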

Container-based Caching vs. Chunk-based Caching

[Figure, two panels comparing Container_LRU and Chunk_LRU. Left: container-reads per 100MB restored against total cache size. Right: computing time (seconds/GB) against total cache size.]

Forward Assembly Scheme [1]

[Figure: forward assembly. Labels: Recipe, Look-Ahead Window, Container Storage, Container Read Buffer, Forward Assembling Area (FAA), Restored Data Storage.]

[1] Lillibridge M, Eshghi K, Bhagwat D. Improving restore speed for backup systems that use inline chunk-based deduplication[C]//FAST. 2013: 183-198.

Chunk-based Caching vs. Forward Assembly

• Forward Assembly
– Advantages: (1) highly efficient when chunks from the same container are used most in the FAA range; (2) low operating and management overhead
– Disadvantage: workload sensitive, requires good workload locality
• Chunk-based Caching
– Advantage: when chunks are re-used over a relatively long distance (larger than the FAA), caching is more effective
– Disadvantage: higher operating and management overhead
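The sensitivity of forward assembly to the FAA size can be sketched in a few lines. This is a simplified model, not the scheme from [1]: the FAA is measured in chunk slots rather than bytes, and `forward_assembly_reads` and the example layout are hypothetical.

```python
def forward_assembly_reads(recipe, location, faa_chunks):
    """Count container reads when the recipe is restored one FAA window at a time.

    Within a window each needed container is read once and every slot it can
    serve is filled from that single read.
    """
    reads = 0
    for start in range(0, len(recipe), faa_chunks):
        window = recipe[start:start + faa_chunks]
        filled = [False] * len(window)
        for i, chunk in enumerate(window):
            if filled[i]:
                continue
            cid = location[chunk]
            reads += 1                            # read the container once ...
            for j in range(i, len(window)):       # ... and fill all matching slots
                if not filled[j] and location[window[j]] == cid:
                    filled[j] = True
    return reads


# Hypothetical layout: chunks 1, 2 live in container A; chunks 3, 4 in container B.
location = {1: "A", 2: "A", 3: "B", 4: "B"}
recipe = [1, 3, 2, 4, 1, 3]

reads_small = forward_assembly_reads(recipe, location, 2)  # 6 container reads
reads_mid = forward_assembly_reads(recipe, location, 4)    # 4 container reads
reads_large = forward_assembly_reads(recipe, location, 6)  # 2 container reads
```

The read count drops as the FAA grows because more chunks of the same container fall inside one window; chunks reused beyond the window always cost another read, which is the case where a chunk cache helps.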


Objective and Challenges

• Objective:
– Forward assembly + chunk-based caching + LAW (limited memory space)
• Challenges:
– When the total memory available for restore is limited and fixed, it is unclear how to use these schemes efficiently.
– How to make better trade-offs that achieve fewer container-reads while limiting the computing overhead, including the LAW, caching and forward assembly overhead.
– Making the design adapt to a changing workload is very challenging.


Look-ahead Window Assisted Chunk Cache

[Figure: restore with an FAA and a chunk cache. Labels on the recipe: Restored, FAA Covered Range, Information for Chunk Cache, Look-Ahead Window (LAW) Covered Range, Unknown. Other labels: FAA (FAB1, FAB2), Chunk Cache, Container Read Buffer, Container Storage, Restored Data Storage.]

What’s the caching policy?

Chunks in the Read-in Container

• P-chunk: probably used chunk (chunk 10 in the example)
• U-chunk: unused chunk (chunk 17 in the example)
• F-chunk: future used chunk (chunks 12 and 13 in the example)

Chunk Cache: F-cache and P-cache

[Figure: the container holding chunks 17, 12, 13 and 10 is read into the Container Read Buffer. Labels on the recipe: Restored, FAA Covered Range, Information for Chunk Cache, Look-Ahead Window Covered Range, Unknown. Other labels: FAA (FAB1, FAB2), Container Storage, Restored Data Storage.]
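The three chunk types can be sketched as a small classifier. The slide gives only the names, so the rules below are assumptions: a chunk is treated as an F-chunk if it appears in the look-ahead window, as a P-chunk if it does not but has already appeared in the processed part of the recipe, and as a U-chunk otherwise. The function name and the example lists are hypothetical.

```python
def classify_chunks(container_chunks, law_chunks, seen_chunks):
    """Label each chunk of a read-in container as "F", "P" or "U".

    Assumed rules (not stated on the slide):
      F: the chunk appears in the look-ahead window
      P: not in the window, but already seen earlier in the recipe
      U: neither
    """
    law = set(law_chunks)
    seen = set(seen_chunks)
    labels = {}
    for chunk in container_chunks:
        if chunk in law:
            labels[chunk] = "F"
        elif chunk in seen:
            labels[chunk] = "P"
        else:
            labels[chunk] = "U"
    return labels


# Illustrative values loosely following the slide's example container.
labels = classify_chunks(
    container_chunks=[17, 12, 13, 10],
    law_chunks=[23, 12, 13, 32, 23, 28],
    seen_chunks=[2, 5, 10, 14, 10],
)
```

With these inputs chunks 12 and 13 are labeled F, chunk 10 is labeled P and chunk 17 is labeled U, matching the labels shown on the slide.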

Caching Priority of F-cache

F-chunks that will be used in the near future have higher priority than F-chunks that will be used in the far future.

[Figure: the F-cache drawn as a queue with a high priority end and a low priority end, alongside the recipe (Restored, FAA Covered Range, Information for Chunk Cache, Look-Ahead Window Covered Range, Unknown), the FAA (FAB1, FAB2), Container Read Buffer, Container Storage and Restored Data Storage.]

Caching Priority of P-cache

The P-cache uses an LRU-based caching policy.

[Figure: the P-cache drawn as a queue with a high priority end and a low priority end, next to the F-cache, alongside the recipe (Restored, FAA Covered Range, Information for Chunk Cache, Look-Ahead Window Covered Range, Unknown), the FAA (FAB1, FAB2), Container Read Buffer, Container Storage and Restored Data Storage.]
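The two priority rules can be combined in a minimal sketch of a two-part chunk cache. The F-cache keeps the chunk with the nearest next use at highest priority and the P-cache is plain LRU, as on the slides. Evicting P-chunks before any F-chunk is an assumption that the slides do not state, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class TwoPartChunkCache:
    """Chunk cache split into an F-cache and a P-cache (capacity in chunks)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.f_cache = {}              # chunk -> position of its next use
        self.p_cache = OrderedDict()   # chunk -> None, least recently used first

    def __len__(self):
        return len(self.f_cache) + len(self.p_cache)

    def add_f(self, chunk, next_use):
        """Cache an F-chunk together with its next-use position in the recipe."""
        self.p_cache.pop(chunk, None)  # a P-chunk seen in the LAW becomes an F-chunk
        self.f_cache[chunk] = next_use
        self._evict()

    def add_p(self, chunk):
        """Cache or refresh a P-chunk."""
        if chunk in self.f_cache:
            return
        self.p_cache[chunk] = None
        self.p_cache.move_to_end(chunk)
        self._evict()

    def _evict(self):
        while len(self) > self.capacity:
            if self.p_cache:
                # Assumption: P-chunks are evicted first, in LRU order.
                self.p_cache.popitem(last=False)
            else:
                # Low priority end of the F-cache: the farthest next use.
                victim = max(self.f_cache, key=self.f_cache.get)
                del self.f_cache[victim]


cache = TwoPartChunkCache(capacity=3)
cache.add_f(12, next_use=5)
cache.add_f(13, next_use=9)
cache.add_p(10)
cache.add_p(21)               # cache full: the LRU P-chunk 10 is evicted
cache.add_f(22, next_use=7)   # the remaining P-chunk 21 is evicted
cache.add_f(18, next_use=6)   # only F-chunks left: chunk 13 (next use 9) goes
```

After these calls the cache holds the F-chunks 12, 18 and 22 and no P-chunks.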

Good Enough?

• What is the memory space ratio between forward assembly and the chunk cache?
• What is the size of the LAW?
– Too large: the computing overhead is large, and the extra information in the LAW is wasted
– Too small: the scheme becomes forward assembly + LRU cache
• What if the workload locality changes?

Large FAA Size (Small Cache)

If most of the chunks (e.g., duplicated chunks) from the same container can be covered by the FAA, a large FAA size is preferred.

[Figure: a recipe with a large FAA (FAB1, FAB2, FAB3) and a small cache range. Labels: Restored, FAA Covered Range, Cache Range, Look-Ahead Window Covered Range, Unknown, Chunk Cache, Container Read Buffer, Container Storage, Restored Data Storage.]

Large Chunk Cache Size (Small FAA)

If the reuse distance of most chunks is relatively large, caching chunks is preferred.

[Figure: a recipe with a small FAA (FAB3) and a large chunk cache. Labels: FAA Range, Information for Chunk Cache, Look-Ahead Window Covered Range, Unknown, Chunk Cache, Container Read Buffer, Container Storage, Restored Data Storage.]
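The notion of reuse distance behind the two cases can be made concrete with a short sketch. It measures, for every repeated chunk, how many recipe positions separate it from its previous occurrence, and then reports the share of reuses shorter than a given FAA size. Measuring both in chunk positions is a simplification, and the function names and the example recipe are hypothetical.

```python
def reuse_distances(recipe):
    """Distance (in recipe positions) from each repeated chunk to its previous use."""
    last_seen = {}
    distances = []
    for pos, chunk in enumerate(recipe):
        if chunk in last_seen:
            distances.append(pos - last_seen[chunk])
        last_seen[chunk] = pos
    return distances


def fraction_within(distances, faa_chunks):
    """Share of reuses whose distance is smaller than the FAA size (in chunks)."""
    if not distances:
        return 0.0
    return sum(d < faa_chunks for d in distances) / len(distances)


recipe = [2, 5, 10, 14, 5, 2, 18, 14]
distances = reuse_distances(recipe)            # [3, 5, 4]
share_large_faa = fraction_within(distances, 8)
share_small_faa = fraction_within(distances, 4)
```

Here every reuse falls inside an 8-chunk FAA, which favors a large FAA, while only one of the three reuses falls inside a 4-chunk FAA, which favors giving more memory to the chunk cache.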

LAW Size

[Figure: the same recipe shown with two different LAW sizes. Labels: Restored, FAA Covered Range, Look-Ahead Window Covered Range, Unknown.]

• With a larger LAW, more F-chunks can potentially be identified and cached. However, once the cache is full of F-chunks, further increasing the LAW size cannot improve the cache efficiency; it only brings in more unnecessary overhead.
• With a smaller LAW, lower computing overhead can be achieved, but the cache will store more P-chunks, which potentially reduces the caching efficiency.

The LAW Size Influence

ALACC

[Flowchart, flattened in the transcript. Recoverable decisions:]

• Increase FAA and shrink cache if either condition holds:
– The re-use distances of most duplicated data chunks are within the FAA range, OR
– The data chunks in the first FAB are mostly identified as unique data chunks, and these chunks are stored in the same or close containers
• Not satisfied: enlarge cache and shrink FAA
• Adjust LAW; shrink LAW if either condition holds:
– The number of P-chunks is very small, OR
– The number of F-chunks added during the restore cycle is very large
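One adjustment step of this kind of adaptive scheme might look like the sketch below. It is an illustration under stated assumptions, not the ALACC algorithm itself: the two conditions are passed in as booleans because the slide gives no thresholds, memory moves between FAA and cache in fixed steps of one container, and growing the LAW when the shrink condition fails is an assumption. `Sizes` and `adjust` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Sizes:
    faa: int    # FAA size in containers
    cache: int  # chunk cache size in containers
    law: int    # look-ahead window size in containers


def adjust(sizes, faa_condition_met, law_shrink_condition_met, step=1, min_size=1):
    """Return the sizes for the next restore cycle; faa + cache stays constant."""
    faa, cache, law = sizes.faa, sizes.cache, sizes.law

    if faa_condition_met:
        if cache - step >= min_size:
            faa, cache = faa + step, cache - step   # increase FAA, shrink cache
    elif faa - step >= min_size:
        faa, cache = faa - step, cache + step       # enlarge cache, shrink FAA

    if law_shrink_condition_met:
        law = max(min_size, law - step)             # shrink LAW
    else:
        law = law + step                            # assumption: otherwise grow LAW
    return Sizes(faa, cache, law)


start = Sizes(faa=4, cache=12, law=56)
grown_faa = adjust(start, faa_condition_met=True, law_shrink_condition_met=True)
grown_cache = adjust(start, faa_condition_met=False, law_shrink_condition_met=False)
```

Starting from 4/12/56, the first call yields 5/11/55 and the second yields 3/13/57; in both cases the FAA and cache still add up to 16 containers.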


Experiment Setup

• Five Caching Designs:
– LRU-based container caching (Container_LRU)
– LRU-based chunk caching (Chunk_LRU)
– Forward assembly (FAA)
– Optimal configuration with fixed forward assembly and chunk-based caching (Fix_Opt)
– ALACC
• Four Traces:
– 2 FSL traces from FSL /home directory snapshots of the year 2014 [1]
– 2 EMC weekly full-backup traces [2]

[1] http://tracer.filesystems.org/.
[2] Nohhyun Park and David J Lilja. Characterizing datasets for data deduplication in backup applications. In Workload Characterization (IISWC), 2010 IEEE International Symposium on, pages 1–10. IEEE, 2010.

How We Get the Fix_Opt

The Fix_Opt configuration of each trace (FAA/chunk cache/LAW size in container units)

[Figure: restore throughput (MB/S) as a function of FAA size and LAW size, with throughput bands 12-13, 13-14, 14-15, 15-16, 16-17 and 17-18.]

We run all possible configurations for each trace to discover the optimal throughput. Note that finding the optimal Fix_Opt configuration takes tens of experiments, which is almost impossible to carry out in a real-world production scenario.

Chunk cache size = total memory size – FAA size
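The exhaustive search for a fixed configuration can be sketched as a grid search over the FAA and LAW sizes, with the chunk cache size derived from the total memory as on the slide. The sketch assumes a caller-supplied `measure_throughput` function that runs one restore experiment; `find_fix_opt` is a hypothetical name and `toy_throughput` below is a made-up stand-in, not measured data.

```python
def find_fix_opt(total_memory, law_sizes, measure_throughput):
    """Try every FAA/LAW combination and return the best (throughput, faa, cache, law).

    Sizes are in container units; chunk cache size = total memory size - FAA size.
    """
    best = None
    for faa in range(1, total_memory):
        cache = total_memory - faa
        for law in law_sizes:
            throughput = measure_throughput(faa, cache, law)  # one experiment
            if best is None or throughput > best[0]:
                best = (throughput, faa, cache, law)
    return best


def toy_throughput(faa, cache, law):
    """Made-up response surface that peaks at FAA=4, LAW=56 (illustration only)."""
    return -abs(faa - 4) - abs(law - 56) / 8


best = find_fix_opt(16, range(16, 97, 8), toy_throughput)
```

Even this small grid (15 FAA sizes times 11 LAW sizes) calls the measurement function 165 times, which illustrates why a fixed optimal configuration is impractical to find in production.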

Restore Throughput

Restore Throughput (MB/S): original data stream size divided by the total restore time.

[Figure, two panels (FSL_1, FSL_2): restore throughput (MB/S) against version number for Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC. Configurations shown: (4/12/56) and (6/10/72). Chart annotations: >100%; >10%, <15%.]

Notes:
1. ACS stands for Average Chunk Size.
2. DR stands for the Deduplication Ratio.
3. CFL stands for the Chunk Fragmentation Level.

Speed Factor

Speed Factor (MB/container-read): the mean data size restored per container read.

[Figure, two panels (FSL_1, FSL_2): speed factor against version number for Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC. Configurations shown: (4/12/56) and (6/10/72).]

Computing Cost Factor

Computing Cost Factor (second/GB): the time spent on computing operations (the restore time minus the storage I/O time) per GB of data restored.

[Figure, two panels (FSL_1, FSL_2): computing cost factor against version number for Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC. Configurations shown: (4/12/56) and (6/10/72).]
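The three evaluation metrics follow directly from their definitions on the slides and can be written as small helper functions. The function and parameter names are hypothetical, and treating 1 MB as 2^20 bytes and 1 GB as 2^30 bytes is an assumption.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def restore_throughput(stream_bytes, restore_seconds):
    """Restore throughput in MB/s: original stream size / total restore time."""
    return stream_bytes / MB / restore_seconds


def speed_factor(stream_bytes, container_reads):
    """Speed factor in MB per container read: mean data restored per read."""
    return stream_bytes / MB / container_reads


def computing_cost_factor(restore_seconds, io_seconds, stream_bytes):
    """Computing cost in seconds/GB: (restore time - storage I/O time) per GB restored."""
    return (restore_seconds - io_seconds) / (stream_bytes / GB)


# Illustrative numbers only, not results from the evaluation.
throughput = restore_throughput(100 * MB, 4)      # 25.0 MB/s
factor = speed_factor(100 * MB, 50)               # 2.0 MB per container read
cost = computing_cost_factor(10, 6, 2 * GB)       # 2.0 seconds per GB
```

A higher speed factor means fewer container reads for the same data, while the computing cost factor isolates the CPU-side overhead of the caching scheme from storage I/O time.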


Conclusions and Future Work

• Studied the effectiveness and the efficiency of different caching mechanisms.
• Designed an adaptive algorithm called ALACC, which adaptively adjusts the sizes of the FAA, chunk cache and LAW as the workload changes.
• In our future work, duplicated data chunk rewriting and a multi-threaded implementation will be investigated and integrated with ALACC to further improve the restore performance.


[Figure, two panels (EMC_1, EMC_2): speed factor against version number for Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC.]

[Figure, two panels (EMC_1, EMC_2): computing cost factor against version number for Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC.]

[Figure, two panels (EMC_1, EMC_2): restore throughput (MB/S) against version number for Container_LRU, Chunk_LRU, FAA, Fix_Opt and ALACC.]