Free Recovery: A Step Towards Self-Managing State (Andy Huang and Armando Fox, Stanford University)
DESCRIPTION

Two state management challenges:
- Failure handling: consistency requirements → node recovery is costly and reliable failure detection is needed; relaxing internal consistency → fast, non-intrusive ("free") recovery.
- System evolution: large data sets → repartitioning is costly and good resource provisioning is needed; free recovery → automatic, online repartitioning.

DStore: an easy-to-manage cluster-based persistent hash table for Internet services.

TRANSCRIPT
Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox
Stanford University
Jan 2004 ROC Retreat - Lake Tahoe, CA © 2004 Andy Huang
Persistent hash tables

[Diagram: frontends and app servers connected over a LAN to a DB and to a hash table]

Key             Value
Yahoo! user ID  User profile
ISBN            Amazon catalog metadata
Two state management challenges

Failure handling
• Consistency requirements → node recovery costly, reliable failure detection needed
• Relax internal consistency → fast, non-intrusive recovery ("free")

System evolution
• Large data sets → repartitioning is costly, good resource provisioning needed
• Free recovery → automatic, online repartitioning

DStore: an easy-to-manage cluster-based persistent hash table for Internet services
DStore architecture

[Diagram: app servers, each with a Dlib, connected over a LAN to bricks]
Dlib: exposes hash table API and is the “coordinator” for distributed operations
Brick: stores data by writing synchronously to disk
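The Dlib/brick split can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not DStore's actual API: class and method names are illustrative, and a dict stands in for the brick's synchronous on-disk store.

```python
class Brick:
    """One storage node: holds key-value pairs durably.
    (A real brick writes synchronously to disk; a dict stands in here.)"""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value        # real brick: write + sync to disk

    def get(self, key):
        return self.store.get(key)


class Dlib:
    """Client-side library: exposes the hash table API to the app server
    and acts as coordinator for the distributed operation."""
    def __init__(self, bricks):
        self.bricks = bricks

    def put(self, key, value):
        for b in self.bricks:          # coordinator: send the write to each brick
            b.put(key, value)

    def get(self, key):
        return self.bricks[0].get(key) # simplified: one replica answers here
```

The quorum logic described on the next slide would replace the naive send-to-all and read-from-one shown here.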
Focusing on recovery

Technique 1: Quorums (tolerant of brick inconsistency)
• Write: send to all bricks, wait for a majority
• Read: read from a majority
• OK if some bricks' data differs; a failed brick just misses some writes

Technique 2: Single-phase writes (no request relies on specific bricks)
• 2PC: a failure between phases complicates the protocol, the 2nd phase depends on a particular set of bricks, and it relies on reliable failure detection
• A single-phase quorum write can be completed by any majority of bricks, so any brick can fail at any time

Result: simple, non-intrusive recovery
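The two techniques can be sketched together like this. A minimal Python sketch, assuming timestamped last-writer-wins replicas; the names are illustrative, not DStore's implementation. The write has only one phase: it succeeds as soon as any majority acknowledges, with no dependence on which bricks those are.

```python
class Brick:
    """Replica storing (timestamp, value) per key; last-writer-wins."""
    def __init__(self):
        self.data = {}

    def write(self, key, ts, value):
        cur = self.data.get(key)
        if cur is None or ts > cur[0]:
            self.data[key] = (ts, value)
        return True                    # acknowledge the write

    def read(self, key):
        return self.data.get(key)


def quorum_write(bricks, key, ts, value):
    """Single-phase write: send to all, succeed once any majority acks."""
    acks = 0
    for b in bricks:
        try:
            acks += 1 if b.write(key, ts, value) else 0
        except Exception:
            pass                       # a dead brick simply misses this write
    return acks > len(bricks) // 2


def quorum_read(bricks, key):
    """Read from a majority; the highest-timestamped reply wins."""
    replies = []
    for b in bricks[: len(bricks) // 2 + 1]:
        r = b.read(key)
        if r is not None:
            replies.append(r)
    return max(replies, key=lambda r: r[0])[1] if replies else None
```

Because read and write quorums each contain a majority, they always overlap in at least one brick, which is why it is fine for individual bricks' data to differ.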
Considering consistency

[Diagram: Dl1 issues write(1) for x = 0 to bricks B1-B3 but fails mid-write; Dl2's subsequent reads see differing values]

• A Dlib failure can cause a partial write, violating the quorum property
• If timestamps differ on a read, read-repair restores the majority invariant (delayed commit)
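The read-repair step can be sketched like this (an illustrative Python sketch; the Brick model is hypothetical): when a read sees differing timestamps, for instance after a Dlib died mid-write, the reader pushes the newest version back to the stale bricks, restoring the majority invariant.

```python
class Brick:
    """Minimal replica model: key -> (timestamp, value)."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def write(self, key, ts, value):
        self.data[key] = (ts, value)


def read_with_repair(bricks, key):
    # Collect (timestamp, value) replies from the bricks that answer.
    replies = {}
    for b in bricks:
        r = b.read(key)
        if r is not None:
            replies[b] = r
    if not replies:
        return None
    newest_ts, newest_val = max(replies.values(), key=lambda r: r[0])
    # Read-repair: push the newest version to any brick holding an older one,
    # so the next read again finds a majority with the same value.
    for b, (ts, _) in replies.items():
        if ts < newest_ts:
            b.write(key, newest_ts, newest_val)
    return newest_val
```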
Considering consistency (continued)

[Diagram: Dl1's write(1) reaches only B1; Dl2's read finds the pending version]

• A write-in-progress cookie can be used to detect partial writes and commit/abort on the next read
• An individual client's view of DStore is consistent with that of a single centralized server (as in Bayou)
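One way to realize the write-in-progress cookie is sketched below. This is a hypothetical Python sketch, not DStore's actual format: each brick is modeled as a dict mapping key to (timestamp, value, in_progress). The next reader that sees a pending version decides its fate, committing if a majority holds it and aborting (rolling back) otherwise, so a client never observes a value that could later disappear.

```python
def resolve_on_read(bricks, key):
    """Read `key`; if the newest version carries a write-in-progress cookie,
    commit it (a majority holds it) or abort it (roll back) before returning."""
    versions = [b[key] for b in bricks if key in b]
    if not versions:
        return None
    ts, value, in_progress = max(versions, key=lambda v: v[0])
    if in_progress:
        holders = sum(1 for v in versions if v[0] == ts)
        if holders > len(bricks) // 2:
            # Enough bricks saw the write: commit by clearing the cookie.
            for b in bricks:
                b[key] = (ts, value, False)
        else:
            # Partial write: abort by restoring the last committed version.
            committed = [v for v in versions if not v[2]]
            prev = max(committed, key=lambda v: v[0]) if committed else None
            for b in bricks:
                if prev is not None:
                    b[key] = prev
                else:
                    b.pop(key, None)
            return prev[1] if prev is not None else None
    return value
```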
Benchmark: Free recovery

[Charts: GET req/sec, PUT req/sec, and repairs/sec over 30 minutes, with a brick killed and then recovering; two panels: worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate)]

Recovery: fast and non-intrusive
Benchmark: Automatic failure detection

[Charts: GET req/sec, PUT req/sec, and repairs/sec over 15 minutes, with a fail-stutter fault injected; two panels: modest policy (anomaly threshold = 8) and aggressive policy (anomaly threshold = 5)]

False positives: low cost
Fail-stutter: detected by Pinpoint
Online repartitioning

1. Take brick offline
2. Copy data to new brick
3. Bring both bricks online

[Diagram: replica groups for partitions 0 and 1 before, during, and after repartitioning]

Appears as if the brick just failed and recovered
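The three steps can be sketched as follows, as an illustrative Python sketch with hypothetical names: partitions are identified by a bit-prefix of the key's hash, as in the diagram, and splitting grows the prefix by one bit. Because quorum reads and writes already tolerate a missing brick, taking the brick offline during the copy looks to clients like an ordinary failure followed by recovery.

```python
import hashlib

def partition_of(key, prefix_bits):
    """Map a key to a partition by the top bits of its hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h >> (128 - prefix_bits)

def repartition(brick, prefix_bits):
    """Split one brick's partition in two by growing the prefix one bit."""
    brick["online"] = False                       # 1. take brick offline
    new_brick = {"data": {}, "online": False}
    for key, value in list(brick["data"].items()):
        # 2. copy to the new brick the keys whose extra prefix bit is 1
        if partition_of(key, prefix_bits + 1) % 2 == 1:
            new_brick["data"][key] = value
            del brick["data"][key]
    brick["online"] = True                        # 3. bring both bricks online
    new_brick["online"] = True
    return new_brick
```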
Benchmark: Automatic online repartitioning

[Charts: GET req/sec, PUT req/sec, and repairs/sec as bricks are added; two panels: evenly-distributed load (3 to 6 bricks, over 40 minutes) and a hotspot in the 01 partition (6 to 12 bricks, over 80 minutes), each compared against naive brick selection]

Brick selection: effective
Repartitioning: non-intrusive
Next up for free recovery

• Perform online checkpoints: take the checkpointing brick offline, just like a failure + recovery
• See if free recovery can simplify online data reconstruction after hard failures
• Any other state management challenges you can think of?

[Chart: PUT req/sec over 10 minutes]
Summary

DStore = Decoupled Storage: managed like a stateless Web farm
• Quorums [spatial decoupling]. Cost: extra overprovisioning. Gain: fast, non-intrusive recovery.
• Single-phase ops [temporal decoupling]. Cost: temporarily violates the "majority" invariant. Gain: any brick can fail at any time.

Free recovery
• Failure handling: fast, non-intrusive. Mechanism: simple reboot. Policy: aggressively reboot anomalous bricks.
• System evolution: "plug-and-play". Mechanism: automatic, online repartitioning. Policy: dynamically add and remove nodes based on predicted load.
DStore: an easy-to-manage cluster-based persistent hash table for Internet services

[email protected]