Lazy Exact Deduplication
Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie Ren, Gang Wang, Xiaoguang Liu
College of Computer and Control Engineering, Nankai University, China.
5 May 2016
Lazy exact deduplication
Lead author: Jingwei Ma, PhD student at Nankai University (supervisor: Prof. Gang Wang). Couldn’t get USA visa in time ⇒ I will present this work.
Credit where credit is due: Jingwei Ma did the lion’s share of this work (development, implementation, experimentation, etc.).
Lazy deduplication: ‘Lazy’ in the sense that we postpone disk lookups until we can do them as a batch. (Lazy is exact.)
Deduplication: What usually happens...
We have a large amount of data, with lots of duplicate data (e.g. weekly backups).
We read through the data, and if we see something we’ve seen before, we replace it with an index entry (saving disk space).
The data is broken up into chunks (Rabin Hash).
The chunks are fingerprinted (SHA1): same fingerprint ⇒ duplicate chunk.
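The chunk-then-fingerprint pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rolling-sum boundary test is a simple stand-in for Rabin hashing, and the names `split_chunks`, `deduplicate` and `restore`, along with the `window`, `mask` and `min_size` parameters, are assumptions made for the example. Only the use of SHA1 fingerprints comes from the slides.

```python
import hashlib


def split_chunks(data, window=16, mask=0x3F, min_size=32):
    """Split bytes into content-defined chunks.

    A boundary is declared where the sum of the last `window` bytes has
    its low bits (selected by `mask`) all zero. This rolling sum is a
    simplified stand-in for a Rabin hash.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        size = i - start + 1
        if size >= min_size and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk
    return chunks


def deduplicate(data):
    """Store each distinct chunk once; return (store, recipe)."""
    index = {}    # SHA1 fingerprint -> position in `store`
    store = []    # unique chunks only
    recipe = []   # one index entry per incoming chunk
    for chunk in split_chunks(data):
        fingerprint = hashlib.sha1(chunk).digest()
        if fingerprint not in index:      # not seen before: keep the chunk
            index[fingerprint] = len(store)
            store.append(chunk)
        recipe.append(index[fingerprint])  # seen before: index entry only
    return store, recipe


def restore(store, recipe):
    """Rebuild the original data from the unique chunks and the recipe."""
    return b"".join(store[i] for i in recipe)
```

Here the whole fingerprint index lives in a Python dictionary in memory; the rest of the talk concerns what happens when it is too large for that and must sit on disk.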
Disk bottleneck: Most fingerprints are stored on disk ⇒ lots of disk reads (“have I seen this before?”) ⇒ slow.
Caching and prefetching reduce the disk bottleneck problem:
[Figure: incoming fingerprints · · · fA fB fC fD · · · are checked against the cache, then the disk. The first time we see fingerprints fA, fB, ...: cache miss. The second time we see fingerprints fA, fB, ...: a cache miss triggers prefetching of fA fB fC fD from disk into the cache, followed by cache hits.]
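The caching and prefetching behaviour described above can be sketched with a small simulation. It is an illustration under stated assumptions rather than the system from the talk: the class name `EagerIndex`, the LRU eviction policy, the `cache_size` and `prefetch` parameters, and the in-memory `position` dictionary standing in for the on-disk index are all hypothetical.

```python
from collections import OrderedDict


class EagerIndex:
    """Eager lookup: every cache miss goes to disk straight away."""

    def __init__(self, on_disk, cache_size=4, prefetch=3):
        self.on_disk = list(on_disk)  # fingerprints in stored order
        # Stand-in for the on-disk index (assumption: a dict in memory).
        self.position = {fp: i for i, fp in enumerate(self.on_disk)}
        self.cache = OrderedDict()    # LRU cache of fingerprints
        self.cache_size = cache_size
        self.prefetch = prefetch      # how many following entries to fetch
        self.disk_reads = 0

    def _cache_add(self, fp):
        self.cache[fp] = True
        self.cache.move_to_end(fp)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used

    def is_duplicate(self, fp):
        if fp in self.cache:
            self.cache.move_to_end(fp)
            return True               # cache hit: no disk access
        self.disk_reads += 1          # cache miss: one on-disk lookup
        pos = self.position.get(fp)
        if pos is None:               # unique: record it on disk
            self.position[fp] = len(self.on_disk)
            self.on_disk.append(fp)
            return False
        # Duplicate found on disk: prefetch it and the entries after it,
        # since these will probably be the next incoming fingerprints.
        for nxt in self.on_disk[pos:pos + 1 + self.prefetch]:
            self._cache_add(nxt)
        return True
```

Feeding the stream fA, fB, fC, fD through this sketch twice gives four disk lookups on the first pass and only one on the second: the miss on fA prefetches the other three, which then hit the cache.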
Lazy deduplication...
Bloom filter: identifies many uniques (not all). [Commonly used.]
buffer: stores fingerprints in hash buckets; searched later on disk (“lazy”). When full, whole buckets are searched in one go (fingerprints are also stored on disk in hash buckets).
post-lookup: searching the cache after buffering (maybe multiple times)
pre-lookup: searching the cache before buffering [not shown]
prefetching: bidirectional; triggers post-lookup
[Figure: incoming fingerprints · · · fA fB fC fD · · · flow through the Bloom filter, cache and buffer to the disk, with post-lookup and prefetching marked. Two further rows of fingerprints are shown: “fC fA fB, fD” and “fA fB fC fD”.]
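The Bloom filter and bucketed buffer can be sketched as below. This is a simplified illustration, not the authors' code: it leaves out the cache, pre-lookup, post-lookup and prefetching, and the names `BloomFilter`, `LazyIndex`, `num_buckets` and `bucket_capacity`, the Bloom filter sizing, and the use of Python sets as "on-disk" buckets are all assumptions. It shows only the core idea that lookups are postponed and then answered one whole bucket at a time, while the verdicts remain exact.

```python
import hashlib


class BloomFilter:
    """Answers "definitely new" or "maybe seen" for a fingerprint."""

    def __init__(self, bits=1 << 16, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, fp):
        for i in range(self.hashes):
            digest = hashlib.sha1(bytes([i]) + fp).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, fp):
        for p in self._positions(fp):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, fp):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(fp))


class LazyIndex:
    """Buffers fingerprints in hash buckets; searches disk per bucket."""

    def __init__(self, num_buckets=4, bucket_capacity=3):
        self.bloom = BloomFilter()
        self.disk = [set() for _ in range(num_buckets)]   # on-disk buckets
        self.buffer = [[] for _ in range(num_buckets)]    # in-memory buckets
        self.bucket_capacity = bucket_capacity
        self.disk_reads = 0
        self.verdicts = {}   # arrival number -> True if duplicate
        self.next_seq = 0

    def _bucket(self, fp):
        return int.from_bytes(fp[:4], "big") % len(self.disk)

    def add(self, fp):
        seq = self.next_seq
        self.next_seq += 1
        b = self._bucket(fp)
        if not self.bloom.might_contain(fp):
            # Definitely unique: no disk search is needed.
            self.bloom.add(fp)
            self.disk[b].add(fp)
            self.verdicts[seq] = False
            return
        self.buffer[b].append((seq, fp))  # postpone the disk lookup
        if len(self.buffer[b]) >= self.bucket_capacity:
            self._flush(b)

    def _flush(self, b):
        if not self.buffer[b]:
            return
        self.disk_reads += 1  # one read covers the whole bucket
        for seq, fp in self.buffer[b]:
            self.verdicts[seq] = fp in self.disk[b]
            self.disk[b].add(fp)
        self.buffer[b].clear()

    def finish(self):
        """Flush every bucket and return verdicts in arrival order."""
        for b in range(len(self.disk)):
            self._flush(b)
        return [self.verdicts[i] for i in range(self.next_seq)]
```

A Bloom filter false positive only costs a buffered lookup in this sketch: the bucket search still finds that the fingerprint is absent from disk, so the verdict stays correct.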
Prefetching...
Ordinarily, we prefetch the subsequent on-disk fingerprints after a duplicate is found on disk; these will probably be the next incoming fingerprints. But this doesn’t work with the lazy method (where fingerprints are buffered).
To overcome this obstacle, each buffered fingerprint is given a rank, used to determine the on-disk search range, and a buffer cycle, indicating where duplicates might be on disk.
It looks like this:
[Figure: incoming fingerprints with ranks r = 0, 1, ..., 8, drawn above the fingerprints stored on disk; the on-disk lookup spans 2048 fingerprints, with r marked on that span. Legend: incoming unique; on-disk unique; buffered / on-disk match.]
Experimental results...
(See our paper for the details and further experiments.)
The time it takes to deduplicate a dataset (on SSD):
| | Vm (220GB) | Src (343GB) | FSLHomes (3.58TB) |
|---|---|---|---|
| eager way | 282 sec. | 476 sec. | 5824 sec. |
| lazy way | 151 sec. | 226 sec. | 3939 sec. |

(eager = non-lazy [exact] way, i.e., no buffering before accessing the disk)
Conclusion: Lazy is faster.
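As a quick worked check of that conclusion, the speedup implied by the reported times can be computed directly. The figures are the ones quoted on the slide; the names `DEDUP_SECONDS` and `speedup` are only illustrative.

```python
# Total deduplication time on SSD in seconds, as (eager, lazy).
DEDUP_SECONDS = {
    "Vm": (282, 151),
    "Src": (476, 226),
    "FSLHomes": (5824, 3939),
}


def speedup(eager, lazy):
    """How many times faster the lazy way is than the eager way."""
    return eager / lazy


for name, (eager, lazy) in DEDUP_SECONDS.items():
    print(f"{name}: {speedup(eager, lazy):.2f}x")
```

This gives roughly 1.87x for Vm, 2.11x for Src and 1.48x for FSLHomes.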
On-disk lookups...
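Disk access time (sec.) on SSD:

| | Vm eager | Vm lazy | Src eager | Src lazy | FSLHomes eager | FSLHomes lazy |
|---|---|---|---|---|---|---|
| on-disk lookup | 176 | 20 | 325 | 45 | 4598 | 1639 |
| prefetching | 46 | 60 | 52 | 68 | 298 | 655 |
| other | 59 | 71 | 99 | 113 | 928 | 1645 |
| total disk access | 222 | 80 | 377 | 113 | 4896 | 2294 |
| total dedup. | 282 | 151 | 476 | 226 | 5824 | 3939 |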
Conclusion: Lazy reduces the disk bottleneck.
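One way to read the figures behind this conclusion is to compute the share of total deduplication time spent on disk access. The sketch below uses only the "total disk access" and "total dedup." values reported on the slide; the names `TIMES` and `disk_share` are illustrative.

```python
# (total disk access, total dedup.) in seconds, on SSD.
TIMES = {
    ("Vm", "eager"): (222, 282),
    ("Vm", "lazy"): (80, 151),
    ("Src", "eager"): (377, 476),
    ("Src", "lazy"): (113, 226),
    ("FSLHomes", "eager"): (4896, 5824),
    ("FSLHomes", "lazy"): (2294, 3939),
}


def disk_share(dataset, method):
    """Fraction of deduplication time spent accessing the disk."""
    disk, total = TIMES[(dataset, method)]
    return disk / total


for dataset in ("Vm", "Src", "FSLHomes"):
    eager = disk_share(dataset, "eager")
    lazy = disk_share(dataset, "lazy")
    print(f"{dataset}: eager {eager:.0%}, lazy {lazy:.0%}")
```

With the eager way, disk access accounts for about 79% to 84% of the total time; with the lazy way it drops to about 50% to 58%.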
Throughput...
[Figure, two panels: “Src on SSD” and “Src on HDD”. Each plots throughput (MB/sec.) against data size (GB), from 20 to 420 GB, for the eager and lazy methods. The SSD panel carries the data labels 656 and 397; the HDD panel carries the data labels 69 and 151.]
Conclusion: Lazy has better throughput on both SSD and HDD, but more so on the slower HDD.