chunking algorithm fast simd-based

42
Fast SIMD-Based Chunking Algorithm PSC 2019-Aug-27 Johnny Dude Fast SIMD-Based Fast SIMD-Based Chunking Algorithm Chunking Algorithm Yehonatan Dude, Michael Hirsch, Yair Toaff PSC 2019 PSC 2019

Upload: others

Post on 11-Dec-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-BasedFast SIMD-BasedChunking AlgorithmChunking Algorithm

Yehonatan Dude, Michael Hirsch, Yair Toaff

PSC 2019PSC 2019

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

OutlineOutline

1. Background2. Chunking Problem3. Traditional Solutions4. Our Solution5. Future Work

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Background - DeduplicationBackground - Deduplication

Deduplication is a technique for eliminatingDeduplication is a technique for eliminatingduplicate copies of repeating data.duplicate copies of repeating data.

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Deduplication process in a nutshellDeduplication process in a nutshell1. Divide into chunks2. Calculate the chunks' hashes3. Store chunks uniquely

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Object 1

Object 2

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

A BC

D BC

D DObject 1

Object 2

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

A BC

D BC

D DObject 1

Object 2

Object 1Object 2

AB

CD

hA

hB

hC

hD

hA hBhC hDhD

hBhD hC

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Background - Chunking MethodsBackground - Chunking Methods

How to chunk the input dataHow to chunk the input data1. Simple - fixed size.2. Content aware - files, objects, applications.3. Content sensitive - rolling hash.

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

The Opera ghost really existed._ He was not, as was long believed

, a creature of the imagination_ of the artists, the superstition

of the managers, or a product o f the joy and impressionable bra

ins of the young ladies of the b allet, their mothers, the box-ke

epers, the cloak-room attendants or the concierge. Yes, he exist

ed in flesh and blood, although_ he assumed the complete appearan

ce of a real phantom; that is to say, of a spectral shade.

The Opera ghost really existed._ He was not, as was long believed

, a creature of the imagination_ of the artists, the superstition

of the managers, or a product o f the absurd and impressionable

brains of the young ladies of th e ballet, their mothers, the box

-keepers, the cloak-room attenda nts or the concierge. Yes, he ex

isted in flesh and blood, althou gh he assumed the complete appea

rance of a real phantom; that is to say, of a spectral shade.

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

The Opera ghost really existed.

He was not, as was long believed, a creature of the imagination of the

artists, the superstition of the managers, or a product of the joy and

impressionable brains of the young ladies of the ballet, their

mothers, the box-keepers, the cloak-room attendants or the concierge.

Yes, he existed in flesh and blood, although he assumed the complete

appearance of a real phantom; that is to say, of a spectral shade.

The Opera ghost really existed.

He was not, as was long believed, a creature of the imagination of the

artists, the superstition of the managers, or a product of the absurd

and impressionable brains of the young ladies of the ballet, their

mothers, the box-keepers, the cloak-room attendants or the concierge.

Yes, he existed in flesh and blood, although he assumed the complete

appearance of a real phantom; that is to say, of a spectral shade.

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

The Opera ghost really existed. He was not, as was long believed

, a creature of the imagination of the artists, the superstition

of the managers, or a product of the joy and impressionable bra

ins of the young ladies of the ballet, their mothers, the box-ke

epers, the cloak-room attendants or the concierge. Yes, he exist

ed in flesh and blood, although he assumed the complete appearan

ce of a real phantom; that is to say, of a spectral shade.

The Opera ghost really existed. He was not, as was long believed

, a creature of the imagination of the artists, the superstition

of the managers, or a product of the absurd and impressionable

brains of the young ladies of the ballet, their mothers, the box

-keepers, the cloak-room attendants or the concierge. Yes, he ex

isted in flesh and blood, although he assumed the complete appea

rance of a real phantom; that is to say, of a spectral shade.

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Background - Deduplication PerformanceBackground - Deduplication Performance

In 2017 we worked on a deduplication engine,and we tried to improve its performance.

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Seconds per GB

LZ4

Karp-Rabin

SHA1

Other1 2

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Chunking ProblemChunking Problem

Given a stream of bytes, divide it into chunks for deduplication.1. Output identical chunks for identical data2. Good chunk size distribution.3. Good performance.4. Works for any input (photo, DB, text, random, etc...)

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Traditional SolutionsTraditional Solutions

Karp-RabinCyclic Polynomial

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

●●● Xk-2 Xk-1 Xk=Hash( )Hk

Criteria(If Hk ) holds then mark a boundary after Xk

Xk-63 Xk-62 Xk-61

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Xk-62 Xk-61 Xk-60 ●●● Xk-1 Xk Xk+1=Hash( )HK+1

Xk-63 Xk-62 Xk-61 ●●● Xk-2 Xk-1 Xk=Hash( )Hk

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Xk-62 Xk-61 Xk-60 ●●● Xk-1 Xk Xk+1=Hash( )HK+1

Xk-63 Xk-62 Xk-61 ●●● Xk-2 Xk-1 Xk=Hash( )Hk

=RollHash( Xk-63 Xk+1Hk ), ,

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Karp-Rabin Cyclic Polynomial

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Proposed SolutionProposed Solution

How does it work:How does it work:1. Work with rolling vectors2. Calculate a hash of byte size3. Calculate the criteria, in a way that:

Number of calculations are constantunrelated to the vector sizeCan find a cutting point at a byte offset

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Xk-62

Xk-61

Xk-60

●●● Xk-10 Xk-6 Xk-2=Hash( )hk-2

Xk-63 Xk-59

Xk-58

●●● Xk-11 Xk-7 Xk-3=Hash( )hk-3

Xk-57

Xk-56

Xk-55

Xk-54

Xk-53

Xk-52

●●● Xk-9 Xk-5 Xk-1 )

●●● Xk-8 Xk-4 Xk )=Hash(hk

=Hash(hk-1

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

●●● Xk+2=Hash( )hk+2

Xk-59

Xk-58

●●● Xk+1=Hash( )hk+1

Xk-57

Xk-56

Xk-55

Xk-54

Xk-53

Xk-52

●●● Xk+3 )

●●● Xk+4 )=Hash(hk+4

=Hash(hk+3

=RollHash( ), ,

Xk-2

Xk-3

Xk-1

Xk

=RollHash( ), ,

=RollHash( ), ,

=RollHash( ), ,

hk-2

hk-3

hk

hk-1

Xk-62

Xk-61

Xk-60

Xk-63

Xk+2

Xk+1

Xk+3

Xk+4

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

= Criteria( Hk )ck

= Criteria( Hk+1 )ck+1

= Criteria( Hk+2 )ck+2

= Criteria( Hk+3 )ck+3

Pass

Fail

Pass

Pass

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

= Criteria( Hk )ck

= Criteria( Hk+1 )ck+1

= Criteria( Hk+2 )ck+2

= Criteria( Hk+3 )ck+3

0 1 0 0

ck ...ck+3

Pass

Fail

Pass

Pass

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

1 1 0 0

ck-4 ...ck-1

0 1 0 0

ck ...ck+3

0 0 0 1

ck+4 ...ck+7

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

1 1 0 0

ck-4 ...ck-1

0 1 0 0

ck ...ck+3

0 0 0 1

ck+4 ...ck+7

Trailing Zeroes

Leading Zeroes

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

1 1 0 0

ck-4 ...ck-1

0 1 0 0

ck ...ck+3

0 0 0 1

ck+4 ...ck+7

Trailing Zeroes

Leading Zeroes

Boundary

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Measured ResultsMeasured Results

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Algorithm Random Data Corpus Data

Karp-Rabin 975 MB/s 927 MB/s

Cyclic-Polynomial 1675 MB/s 1676 MB/s

Ours 6715 MB/s 7136 MB/s

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

ChunkingAlg.

DedupPerf.

LZ4 SHA1 Other Chunking

Karp-Rabin 262 MB/s 63.9% 4.1% 3.8% 28.1%

Ours 345 MB/s 84.5% 5.4% 5.1% 4.7%

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Same DistributionFaster Chunking PerformanceFaster Overall Performance

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Future WorkFuture Work

A chunking algorithm that is* Past Presented Future

Backward Compatible N/A f(k, x) ≠ f(l, x) f(k, x) = f(l, x)

Work cn cn cn

Speed cn c(n log k) / k c(n log k) / k

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

Fast SIMD-Based Chunking AlgorithmPSC 2019-Aug-27 Johnny Dude

ThanksThankshttps://github.com/dudejohnny/PSC2019