new hashing algorithms for data storage · hashing in distributed systems distributed storage if...
TRANSCRIPT
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
New Hashing Algorithms for Data Storage
Jason Resch Cleversafe
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Applications of Hashing
Hashing is useful generally: Provides O(1) lookup Key Value storage/retrieval
Could use hashing to decide… Storage node in storage system or database Proxy server that has a cache Task assignment in distributed computing
2
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Hashing in Distributed Systems
Distributed Storage If buckets are “storage nodes”, we can use
hashing so readers and writers select the same storage locations for the same names
Distributed Caching If buckets are “caching servers”, we can use
hashing to maximize reuse of the same caching servers for the same URLs
3
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Conventional Hash Table Resize
4
0:
1:
2:
3:
4:
5:
6:
7:
8:
16
374
602
993
700
618
612
975
825
68
580
461
233
353
bucket = hash(key) % num_buckets
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Conventional Hash Table Resize
5
0:
1:
2:
3:
4:
5:
6:
7:
8:
16
374
602
993
700
618
612
975
825
68
580
461
233
353
0:
1:
2:
3:
4:
5:
6:
7:
8:
9:
612
461
353 993
975
580
374
68
825
618
16
700
602
233
bucket = hash(key) % num_buckets
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Stable Hashing Defined
When a conventional Hash Table is resized, most keys are remapped to different buckets bucket = hash(key) % num_buckets Almost all keys move if num_bucket changes
Stable Hashing Enables Hash Tables with greater stability Minimizes disruption when resizing/scaling
6
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Stable Hashing Resize
7
0:
1:
2:
3:
4:
5:
6:
7:
8:
16
374
602
993
700
618 612
975
825
68
580
461
233
353
bucket = stable_hash(buckets, key)
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Stable Hashing Resize
8
0:
1:
2:
3:
4:
5:
6:
7:
8:
0:
1:
2:
3:
4:
5:
6:
7:
8:
9:
16
374
602
993
700
618 612
975
825
68
580
461
233
353
374
993 233
975
618
602
68
580 16
461
700
825
612 353
bucket = stable_hash(buckets, key)
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Who uses Stable Hashing?
Caching/Routing: DHT/Storage:
9
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
When is Stable Hashing Preferable?
When the system is stateful And recreating or transferring state is expensive
For in-memory Hash Tables remapping is cheap Requires copying a pointer in RAM
For Distributed Hash Tables remapping is costly Moving a key requires transfer over a network
10
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Stable Hashing with Global Namespaces
Last year we presented about unlimited scale: Main lesson: it requires eliminating points of
contention, including metadata systems We achieved this with a “Global Namespace”
Namespace is fixed, but system is dynamic… We needed an algorithm that could adapt to
changes in the system, and do so efficiently! 11
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Our motivations for using Stable Hashing
Helps balance: Storage (read/write) load across nodes Storage utilization across nodes
Minimizes disruption for: Addition of new nodes Resizing of existing nodes (disk addition) Removing or repurposing nodes Replacing obsolete nodes with new ones
12
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
But what algorithm to use…
Perfect Stable Hashing: Rendezvous Hashing (‘96) Consistent Hashing (‘97)
Weighted Stable Hashing: CARP (‘98) RUSH/CRUSH (‘04/’06)
13
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Classes of Stable Hashing Algorithms
14
Stable Hashing
Perfectly Stable Precisely Weighted
Consistent Hashing CARP
RUSH
CRUSH Rendezvous Hashing
???
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Consistent Hashing Works
15
0 - 605 606 - 999
Buckets inserted in random positions Keys map to the next node greater than that key
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Consistent Hashing Works
16
0 - 316 317 - 605 606 - 999
Buckets inserted in random positions Keys map to the next node greater than that key
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Consistent Hashing Works
17
0 - 316 317 - 605 606 - 782 783 - 999
Buckets inserted in random positions Keys map to the next node greater than that key
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Consistent Hashing Works
18
0 - 278 279 - 316 317 - 605 606 - 782 783 - 999
Buckets inserted in random positions Keys map to the next node greater than that key
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Rendezvous Hashing Works
Hash(Bucket ID || Key) Score Bucket with the highest score wins
19
0 1 2 3 4
H(“0” || ) = 759 612 H(“1” || ) = 481 612
H(“2” || ) = 830 612 H(“3” || ) = 879 612 H(“4” || ) = 484 612
612
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Rendezvous Hashing Works
Hash(Bucket ID || Key) Score Bucket with the highest score wins
20
0 1 2 3 4
H(“0” || ) = 707 461 H(“1” || ) = 854 461
H(“2” || ) = 370 461 H(“3” || ) = 065 461 H(“4” || ) = 804 461
612 461
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How Rendezvous Hashing Works
Hash(Bucket ID || Key) Score Bucket with the highest score wins
21
0 1 2 3 4
H(“0” || ) = 746 353 H(“1” || ) = 207 353
H(“2” || ) = 515 353 H(“3” || ) = 668 353 H(“4” || ) = 252 353
353 461 612
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How CARP Works
CARP is Rendezvous Hashing, with one change Scores are multiplied with a “Load Factor”
22
0 1 2 3 4
H(“0” || ) × 0.197 = 149.24 612 H(“1” || ) × 0.240 = 115.60 612
H(“2” || ) × 0.197 = 163.21 612 H(“3” || ) × 0.170 = 149.22 612 H(“4” || ) × 0.197 = 095.17 612
612
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Why CARP isn’t Perfectly Stable
Load factors in CARP must be relatively scaled If any node’s weighting changes, or if any
node is added or removed, then all load factors must be recomputed
23
0 1
0.6000 0.4000
1.50:1
Load Factor: 200 100 Weight:
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Why CARP isn’t Perfectly Stable
Load factors in CARP must be relatively scaled If any node’s weighting changes, or if any
node is added or removed, then all load factors must be recomputed
24
0 1 2
0.3604 0.2792 0.3604
1.29:1
Load Factor: 200 100 Weight: 200
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How RUSH/CRUSH work
25
200 300
250
500
750
0 1
2
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Evolution of Stable Hashing
Perfect Stable Hashing: Rendezvous Hashing (‘96) Consistent Hashing (‘97)
Weighted Stable Hashing: CARP (‘98) RUSH/CRUSH (‘04/‘06)
Perfect Weighted Stable Hashing: Weighted Rendezvous Hash (‘14)
26
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Classes of Stable Hashing Algorithms
27
Stable Hashing
Perfectly Stable Precisely Weighted
Consistent Hashing CARP
RUSH
CRUSH Rendezvous Hashing
WRH
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
100 / -Log(H(“3” || ) / MAX_HASH) = 775.37 612
How Weighted Rendezvous Hashing Works
WRH adjusts scores before weighting them Unlike CARP, scores aren’t relatively scaled
28
0 1 2 3 4
200 / -Log(H(“0” || ) / MAX_HASH) = 725.29 612 400 / -Log(H(“1” || ) / MAX_HASH) = 546.43 612 200 / -Log(H(“2” || ) / MAX_HASH) = 1073.36 612
200 / -Log(H(“4” || ) / MAX_HASH) = 275.61 612
612
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Why WRH is perfectly stable
When a node is added, removed, or changed: Only the scores for that node change
It may win some keys (if weight increased) It may lose some keys (if weight decreased)
For the unchanged nodes: Scores for all of them remain unchanged
No wasted data transfer occurs between nodes Minimum data moves to recover equilibrium
29
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
100 / -Log(H(“3” || ) / MAX_HASH) = 775.37 612
Weight Change with WRH
WRH adjusts scores before weighting them Unlike CARP, scores aren’t relatively scaled
30
0 1 2 3 4
200 / -Log(H(“0” || ) / MAX_HASH) = 725.29 612 400 / -Log(H(“1” || ) / MAX_HASH) = 546.43 612 200 / -Log(H(“2” || ) / MAX_HASH) = 1073.36 612
200 / -Log(H(“4” || ) / MAX_HASH) = 275.61 612
612
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
100 / -Log(H(“3” || ) / MAX_HASH) = 775.37 612
Weight Change with WRH
WRH adjusts scores before weighting them Unlike CARP, scores aren’t relatively scaled
31
0 1 2 3 4
200 / -Log(H(“0” || ) / MAX_HASH) = 725.29 612 800 / -Log(H(“1” || ) / MAX_HASH) = 1092.86 612 200 / -Log(H(“2” || ) / MAX_HASH) = 1073.36 612
200 / -Log(H(“4” || ) / MAX_HASH) = 275.61 612
612 612
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Keys Transferred under CARP
32
0 1 2
200 100 Weight: 200
47.8% 35.7% 4.3%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Keys Transferred under WRH
33
0 1 2
200 100 Weight: 200
40% 40%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Simplicity of WRH
34
Where the magic happens
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Proof of Correctness
35
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
How we use the WRH
Our system is grown by sets of devices Each set is composed of devices spread
across fault domains (racks, sites, etc.) Devices have a lifecycle: Added, possibly expanded, then retired
The WRH selects which “device set” to write a given object to or read a given object from
36
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Evolution of a Storage Pool
9/22/2015
Storage Pool A Total Pool Size = 10 PB Device Set Size = 10 PB, Fraction = 10 PB/10 PB = 100%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Evolution of a Storage Pool
9/22/2015
Storage Pool A Total Pool Size = 20 PB Device Set Size = 10 PB, Fraction = 10 PB/20 PB = 50%
Device Set Size = 10 PB, Fraction = 10 PB/20 PB = 50%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Evolution of a Storage Pool
9/22/2015
Storage Pool A Total Pool Size = 40 PB Device Set Size = 10 PB, Fraction = 10 PB/40 PB = 25%
Device Set Size = 10 PB, Fraction = 10 PB/40 PB = 25%
Device Set Size = 20 PB, Fraction = 20 PB/20 PB = 50%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Evolution of a Storage Pool
9/22/2015
Storage Pool A Total Pool Size = 55 PB Device Set Size = 10 PB, Fraction = 10 PB/55 PB = 18%
Device Set Size = 10 PB, Fraction = 10 PB/55 PB = 18%
Device Set Size = 20 PB, Fraction = 20 PB/55 PB = 36%
Device Set Size = 15 PB, Fraction = 15 PB/55 PB = 27%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Evolution of a Storage Pool
9/22/2015
Storage Pool A Total Pool Size = 45 PB
Device Set Size = 10 PB, Fraction = 10 PB/45 PB = 22%
Device Set Size = 20 PB, Fraction = 20 PB/45 PB = 44%
Device Set Size = 15 PB, Fraction = 15 PB/45 PB = 33%
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Moving Data according to the WRH
9/22/2015
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Storage Resource Map
Shows relative capacities of device sets each of which is independently reliable storage
43
"storage_pool_map": { "657fe35a-a87a-44cf-b766-8e890aea7b2e": { "weight": "46000000000000000", "hash_seed": 67662243 }, "bfa3a243-c2f4-3a1c-afa9-cee4b56c1da1": { "weight": "22000000000000000" "hash_seed": 27781369 } }
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Efficient Replacement: reuse seed
The Hash Seed enables a clever trick: When retiring a device set with replacement,
we-use the same hash seed for the new set Since it seeds hashes in the same way, it
computes identical scores as the old set When the retired set’s weight is set to 0, all
keys move from the old set to the new one
44
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Other ways to use WRH
We see many potential applications: Performing work
Take on rebuilding tasks from a work queue Assign compute jobs according to CPU capacity
Route access requests to “Access nodes” Reduces contention, maximizes cache hits
Map data to drives within a storage node When a drive fails, remap data to other drives
45
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
Thank you! Any Questions?
46
2015 Storage Developer Conference. Copyright © 2015 Cleversafe, Inc. All Rights Reserved.
References
Rendezvous Hashing: https://en.wikipedia.org/wiki/Rendezvous_hashing
Consistent Hashing: https://en.wikipedia.org/wiki/Consistent_hashing
CARP: https://tools.ietf.org/html/draft-vinod-carp-v1-03
RUSH: http://www.ssrc.ucsc.edu/Papers/honicky-ipdps04.pdf
CRUSH: http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
CEPH: http://ceph.com/docs/master/rados/operations/crush-map/
OpenStack: http://docs.openstack.org/developer/swift/ring.html
GlusterFS: http://blog.gluster.org/2012/03/glusterfs-algorithms-distribution/
Cassandra: http://blog.imaginea.com/consistent-hashing-in-cassandra/
DynamoDB: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
47