MOAT: A Multi-Object Assignment Toolkit

Haifeng Yu, Intel Research Pittsburgh / CMU

Joint work with:

Phillip B. Gibbons

Intel Research Pittsburgh

Background

Availability has become a principal design goal:

A 0.1% improvement is worth $2M / year for Amazon and eBay [internetweek.com]

One major focus of 8 OSDI’04 papers (out of 27)

Two orthogonal efforts:

Robustness of lower-level system components

Example: disk, individual machine, Internet routing

Higher-level redundancy

Example: data replication

This talk focuses on higher-level redundancy


High Availability via Replication

Large amounts of data accessed by many users:

Distributed file systems

Network monitoring (PIER, SDIMS, IRISLOG)

Index databases for search engines (Google, p2p)

Scientific / medical databases

Data is replicated across multiple machines

Object: the unit of replication

File, file block, database table, database tuple, inverted index for a certain keyword


Multi-object Accesses

Many accesses request multiple objects:

Compiling a project

Writing a paper in LaTeX

Asking for aggregates of network conditions

Searching for web pages containing multiple keywords

Availability of a single object can be misleading:

An access requesting 1,000 objects can observe up to 1,000 times higher unavailability

There is more subtlety...
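
The "up to 1,000 times" figure follows from a union bound. As a sketch, assume each requested object is unavailable with the same probability u (a symbol introduced here, not used in the slides) and that the access needs all n objects:

```latex
\Pr[\text{access unavailable}]
  = \Pr\Big[\bigcup_{i=1}^{n} \{\text{object } i \text{ unavailable}\}\Big]
  \le \sum_{i=1}^{n} \Pr[\text{object } i \text{ unavailable}]
  = n\,u .
```

With n = 1,000 the bound is 1,000u. It is reached only when the individual object failures never coincide; how close a real access comes to it depends on how the replicas are assigned to machines.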


A Simple Example

Compile a small project with four files; each file has two replicas: A, A, B, B, C, C, D, D

Four machines fail independently with the same probability; each holds two files

Which assignment gives better availability?

A B C D

A B C D

or

A B C D

A C B D

Better

Assignment matters because objects are now correlated


A Simple Example - Continued

Suppose the user is happy even if only three objects are available (e.g., when computing an average)

A B C D

A B C D

or

A B C D

A C B D

Better

Assignment makes a difference:

Even if we are using the same machines (same amount of redundancy/resource)

Can easily make a difference of multiple nines
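
The four-file example is small enough to check by exhaustive enumeration. The sketch below is illustrative only: it assumes one particular reading of the two layouts (machines holding {A,B}, {C,D}, {A,B}, {C,D} versus {A,B}, {C,D}, {A,C}, {B,D}), and the names `LAYOUT_1`, `LAYOUT_2`, and `availability` are hypothetical.

```python
from itertools import product

# Assumed reading of the two layouts: each set is the content of one machine.
LAYOUT_1 = [{"A", "B"}, {"C", "D"}, {"A", "B"}, {"C", "D"}]
LAYOUT_2 = [{"A", "B"}, {"C", "D"}, {"A", "C"}, {"B", "D"}]


def availability(layout, p, t):
    """Exact Prob[at least t distinct files are reachable] when every
    machine fails independently with probability p."""
    total = 0.0
    # Enumerate all up/down combinations of the machines.
    for up in product([True, False], repeat=len(layout)):
        prob = 1.0
        reachable = set()
        for machine_up, files in zip(up, layout):
            prob *= (1 - p) if machine_up else p
            if machine_up:
                reachable |= files
        if len(reachable) >= t:
            total += prob
    return total


if __name__ == "__main__":
    p = 0.1  # illustrative failure probability
    for t in (4, 3):
        print(t, availability(LAYOUT_1, p, t), availability(LAYOUT_2, p, t))
```

Under this reading, the first layout comes out ahead when all four files are needed and the second when three suffice, so which assignment is better depends on the success threshold.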


Goal and Contributions

MOAT (Multi-Object Assignment Toolkit):

Goal: High availability for multi-object accesses

Key issue: Replica assignment

Contributions:

First to observe the importance of replica assignment

Strong theoretical results regarding best and worst assignments

Practical designs to approximate optimal assignments

MOAT toolkit implementation for replica assignments


Outline

Motivation and MOAT contributions

System model and case studies of existing systems

Theoretical results

Designs for approximating optimal assignments

Designs for mixed accesses

Conclusions


Assumptions for This Talk

Assume:

Replication (no erasure coding)

Crash failures (no Byzantine failures)

Eventual consistency (no quorum or voting)

Most of our results hold without these assumptions

Assume the same replication degree for all objects

We have results for different replication degrees as well

Talk to me if interested in the more complete story...


MOAT Architecture Overview

[Figure: layered architecture. Apps (file system, network monitoring, p2p DB, search engine) sit above MOAT, which handles replication / repair / load balancing / naming / assignment and sits above the storage system (raw data on distributed machines or disks). Apps use a Data API (obj create / delete / read / write) and a Control API (assignment policy).]


System Model

Basic system model:

N objects, each with k replicas

Load balancing among all machines

Machines fail independently with the same probability

An assignment is a mapping: replica → machine, for all N × k replicas

A B C D

A B C D


Some Simple Assignments

PTN: partition assignment

Used in most Coda deployments in practice [Satyanarayanan et al.’90]

A B C D E F

A B C D E F

for k = 2

RAND: pick a random machine for each replica

Similar to the Google File System [Ghemawat et al.’03]


Assignment in Chord [Stoica et al.’01]

DHTs: Hash the machine IP to get the machine id

Assignment in Chord: Sliding window

Neither PTN nor RAND

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120, showing where the replicas of objects A, B, and C are placed; hash(A) = 95.]
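
A sliding-window placement can be sketched as follows. This is a minimal illustration, not Chord's actual code: it assumes the window consists of the k machines that follow the object's hash clockwise on the ring, and the function name `sliding_window` and the choice k = 3 are hypothetical.

```python
from bisect import bisect_left


def sliding_window(ring, point, k):
    """Return the k machines that follow `point` clockwise on the ring.

    `ring` is a sorted list of machine ids. The first machine whose id is
    >= point starts the window, and the window wraps around the ring.
    """
    if not 0 < k <= len(ring):
        raise ValueError("k must be between 1 and the number of machines")
    start = bisect_left(ring, point) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(k)]


if __name__ == "__main__":
    ring = [80, 90, 98, 101, 104, 120]  # machine ids from the slide
    print(sliding_window(ring, 95, 3))  # window for A, with hash(A) = 95
```

Windows of objects that hash near each other overlap without coinciding, which is why this placement behaves like neither PTN (identical replica sets) nor RAND (independent replica sets).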


Assignment in CAN [Ratnasamy et al.’01]

Hash the object k times

CAN uses a similar approach

Similar to RAND

But machines may have slightly different numbers of objects

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; replicas of A placed using hash1(A) = 95 and hash2(A) = 119; replicas of B placed using hash1(B) = 84 and hash2(B) = 100.]
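
Hashing an object k times can be sketched on the same ring. The code below is a hedged illustration: it takes the hash values as given inputs instead of computing them, assumes each replica goes to the first machine at or after its hash value, and uses the hypothetical names `successor` and `multi_hash_placement`.

```python
from bisect import bisect_left


def successor(ring, point):
    """First machine id >= point on the sorted ring, wrapping around."""
    return ring[bisect_left(ring, point) % len(ring)]


def multi_hash_placement(ring, hash_values):
    """Place one replica per hash value, each on that value's successor."""
    return [successor(ring, h) for h in hash_values]


if __name__ == "__main__":
    ring = [80, 90, 98, 101, 104, 120]            # machine ids from the slides
    print(multi_hash_placement(ring, [95, 119]))  # hash1(A), hash2(A)
    print(multi_hash_placement(ring, [84, 100]))  # hash1(B), hash2(B)
```

Because each hash is independent, two replicas of one object can land on the same machine, and machines owning a longer arc of the ring receive more objects. A real system would need a rule for both cases; none is assumed here.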




Which assignment should we use?

MOAT Goal: Improve availability of multi-object accesses

If an access requests n (n ≤ N) objects, what if only x are available?

Threshold-based success definition:

If x ≥ t, the user is happy → Available

If x < t, confidence is too low → Unavailable

Availability for an access is defined as:

Prob[at least t objects available out of n requested objects]


Examples of t

t = n:

File systems

Search for terrorist images in an image database

t close to n:

Query for the top-10 most-loaded machines on PlanetLab

t not close to n:

Sampling with confidence


Outline



Theoretical results



Conclusions


Formal Results

For an access requesting all N objects

Theorem: Among all assignments, when t = N:

PTN is best (within a constant)

RAND is worst (within a constant)

The difference is about c-fold (c is # obj / machine)

Theorem: Among all assignments, when t = c+1 < N:

PTN is worst

RAND is best (within a constant)

The difference is even larger
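
The direction of these results can be explored with a small Monte Carlo experiment. This is a sketch under stated assumptions, not the simulator behind the talk: the sizes (200 objects, k = 2, 20 machines, p = 0.1) are illustrative, RAND is simplified to ignore load balancing, and all function names are hypothetical. The second run uses a threshold slightly below N, not t = c + 1, because the probabilities in that regime are too small to estimate by sampling at this scale.

```python
import random


def ptn_assignment(num_objects, k, num_machines):
    """PTN: split machines into disjoint groups of k; each object lives on
    every machine of one group. Assumes num_machines is a multiple of k."""
    groups = [tuple(range(g, g + k)) for g in range(0, num_machines, k)]
    return [groups[obj % len(groups)] for obj in range(num_objects)]


def rand_assignment(num_objects, k, num_machines, rng):
    """RAND (simplified): each object picks k distinct machines at random.
    Load balancing is not enforced in this sketch."""
    return [tuple(rng.sample(range(num_machines), k)) for _ in range(num_objects)]


def unavailability(assignment, num_machines, p, t, trials, rng):
    """Estimate Prob[fewer than t objects have at least one live replica]."""
    bad = 0
    for _ in range(trials):
        up = [rng.random() >= p for _ in range(num_machines)]
        live = sum(1 for replicas in assignment if any(up[m] for m in replicas))
        if live < t:
            bad += 1
    return bad / trials


def compare(t, trials=1000, seed=1):
    """Return (PTN, RAND) unavailability estimates for threshold t."""
    rng = random.Random(seed)
    n_obj, k, n_mach, p = 200, 2, 20, 0.1  # illustrative sizes only
    ptn = ptn_assignment(n_obj, k, n_mach)
    rnd = rand_assignment(n_obj, k, n_mach, rng)
    return (unavailability(ptn, n_mach, p, t, trials, rng),
            unavailability(rnd, n_mach, p, t, trials, rng))


if __name__ == "__main__":
    print("t = N:     ", compare(200))
    print("t = N - 19:", compare(181))
```

With these sizes each machine holds c = 20 objects. When every object is required, any failed PTN group or any failed pair of machines sharing an object under RAND makes the access unavailable, and RAND has many more such pairs. With 19 objects of slack, a single failed PTN group still loses 20 objects at once, whereas RAND spreads the loss thinly.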


Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold; curves for RAND (CAN), PTN, and Chord. Annotations: "c times difference if p is small, where c is # obj/machine"; "unavail of single obj".]


A Spectrum of Assignments

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold; curves for RAND (CAN) and PTN.]


More Formal Arguments

The tradeoff is fundamental:

Impossible to achieve the best of both RAND and PTN

Previous results are only for an access requesting all N objects

Similar results hold for accesses requesting n (n ≤ N) objects

But each machine may not be filled to capacity:

For PTN, use as few machines as possible

For RAND, use as many machines as possible

I have more... talk to me if you are interested


Access Requesting 500 Objects

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold; curves for RAND (CAN), PTN, and Chord.]




Design of Replica Assignment

Trivial in a static / centralized environment

Challenging in a dynamic environment:

We may not have global knowledge with many objects and many machines

Basic solution: Consistent hashing

But some re-design is necessary


Approximating RAND

Multi-hash DHT:

Hash the object k times

As in CAN

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120, showing replicas of A and B; hash1(B) = 84, hash2(B) = 100.]


Approximating PTN

Chord does not achieve PTN

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120, showing replicas of A, B, and C under Chord; hash(A) = 95.]

Approximating PTN

Chord does not achieve PTN

Group DHT:

(Arbitrarily) group machines into groups of size k

[Figure: ring of machine groups with ids 090, 101, 120, showing replicas of A, B, and C; hash(A) = 95.]
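
The grouping idea can be sketched as below. This is an illustration under assumptions, not MOAT's implementation: groups are formed by chunking a machine list, each group occupies one ring position, and an object is stored on every member of the first group at or after its hash. The names `form_groups` and `place_object`, and the machine names `m1` to `m6`, are hypothetical.

```python
from bisect import bisect_left


def form_groups(machines, k):
    """Arbitrarily chunk machines into groups of exactly k members.
    Machines left over (fewer than k) are not placed in any group."""
    full = len(machines) - len(machines) % k
    return [tuple(machines[i:i + k]) for i in range(0, full, k)]


def place_object(groups, point):
    """Return (group id, members) of the first group whose id is >= point,
    wrapping around the ring. `groups` maps a group's ring id to its members."""
    ids = sorted(groups)
    owner = ids[bisect_left(ids, point) % len(ids)]
    return owner, groups[owner]


if __name__ == "__main__":
    members = form_groups(["m1", "m2", "m3", "m4", "m5", "m6"], 2)
    groups = dict(zip([90, 101, 120], members))  # group ids from the slide
    print(place_object(groups, 95))              # hash(A) = 95
```

All k replicas of an object share one group, and every object mapped to that group shares the same k machines, which is the partition structure PTN asks for.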


Node Join and Leave in Group DHT

Maintain r rendezvous points in the DHT

Diminished Chord [Karger et al.’04] / ReDiR [Karp et al.’04]

A new node reports to a random rendezvous point

If a group can be formed, it joins the DHT

Two options upon node leave:

Dismiss the group and delete it from the DHT

The group waits to recruit a new node

Groups use the rendezvous points to decide


Complexity Analysis

Metric               Standard DHT   Group DHT
Routing state        log N          log N/k
Routing hops         log N          log N/k
Messages / Join      (log N)^2      log N/k + (log N/k)^2 / k
Messages / Leave     (log N)^2      log N/k + (log N/k)^2 / k
Obj moves / Join     k/N            k/N
Obj moves / Leave    k/N            2k/N




Mixture of Queries

The previous design is only for a single access requesting all N objects:

PTN if t is close to N

RAND if t is far from N

But there are other accesses:

Requesting n (n < N) objects with threshold t

How does t change with n?

Infinite possibilities

We focus on 4 large categories


Four Application Scenarios

Scenario                Small accesses (small n)            Large accesses (large n)
File system             strict                              strict
Computing aggregates    loose                               loose
Network monitoring      strict (pinpoint problems)          loose (overview query)
Image database search   loose (resource retrieval of        strict (non-existence test,
                        frequent objects, e.g., find        e.g., exhaustive search
                        clip art for a slide)               for a terrorist)

Strict accesses: t ≈ n

Loose accesses: t < n


Loose for both small and large n

Goal: Approach RAND for both small and large n

Design: Multi-hash DHT

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120, showing replicas of A and B; hash1(B) = 84, hash2(B) = 100.]


Loose for small n; Strict for large n

Goal: Approach RAND for small n; approach PTN for large n

Design: Group DHT

[Figure: ring of machine groups with ids 090, 101, 120, showing replicas of A, B, and C.]


Strict for both small and large n

Goal: Approach PTN for both small and large n

Assume accesses are tree accesses

Design: Group DHT with item-balancing [Karger et al.’04]

[Figure: ring of machine groups with ids 090, 101, 120, showing replicas of A, B, and C; A = 95.]


Strict for small n; Loose for large n

Goal: Approach PTN for n < R; approach RAND for n >> R

Design: Multi-hash DHT

But cluster objects into clusters of constant size R

[Figure: ring of machines with ids 080, 090, 098, 101, 104, 120; objects A and B form one cluster and are hashed together; hash1(AB) = 84, hash2(AB) = 100.]


Simulation Results for Strict Accesses

[Figure: unavailability vs. number (n) of objects requested by an access.]

Here an access needs all n objects to be successful

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object


Simulation Results for Loose Accesses

[Figure: unavailability vs. number (n) of objects requested by an access.]

Here an access needs only t = n - 150 objects to be successful

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object


Current Status

Waiting for paper deadlines

Finishing implementing MOAT

Evaluation on IrisLog trace and file system traces


Related Work

Multi-object accesses rarely addressed:

CFS [Dabek et al.’01] focuses on individual file blocks

Chain replication [Renesse et al.’04] considers a single data object

A long list .....

Replica assignment largely ignored:

Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments: Effects not understood / studied

Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] well studied:

Typically for machines in different locations in the network

Machines are heterogeneous

Approaches do not apply to replica assignment


Conclusions

Availability is becoming a key design goal

Multi-object access availability is dramatically different from single-object availability

MOAT Contributions:

First to observe the importance of replica assignment

Strong theoretical results regarding the best and worst assignments

Practical designs to approximate optimal assignments

MOAT toolkit implementation


My Other Recent Work

Om [NSDI’04]: Consistent and automatic replica regeneration

Regenerate from any single replica rather than a majority

Signed quorum systems [PODC’04]:

Constant quorum size at the cost of a small probability of inconsistency

Node failure characteristics in WAN [WORLDS’04]:

Answer subtle questions regarding real-world failure properties


Erasure Coding

Encode the object into k fragments such that any m (m < k) out of the k fragments can reconstruct the object

RAID techniques are special cases

Replication is a special case where m = 1
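
The m = 1 special case can be made concrete with a short calculation. As a sketch, assume each of the k fragments sits on a different machine and that machines fail independently with probability p (the same failure model as the rest of the talk; the notation A(m, k) is introduced here for illustration):

```latex
A(m,k) \;=\; \Pr[\text{at least } m \text{ of } k \text{ fragments are live}]
       \;=\; \sum_{i=m}^{k} \binom{k}{i}\,(1-p)^{i}\,p^{\,k-i},
\qquad
A(1,k) \;=\; 1 - p^{k}.
```

Setting m = 1 leaves only the event that all k copies fail, which is the availability of a single object with k replicas.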


Example Revisited

Need four files to compile:

A B C D

A B C D

or

A B C D

A C B D

Better

Erasure coding is hard to apply across a large amount of data:

Updating any portion of the data requires updating k - m + 1 fragments the size of original data

We cannot use erasure coding across 1,000 files

Can we treat A, B, C, D as a single object and use erasure coding?

So that all files can be reconstructed from any 4 out of 8 fragments


Threshold Semantics and Erasure Coding

Threshold Semantics                        Erasure Coding
need t out of n objects to answer query    need m out of k fragments to reconstruct object
t determined by app semantics              m determined at coding time
result dependent on which t objects        same result regardless of which m fragments
may update single object by itself         modification to any portion of the object
                                           needs to update k-m+1 fragments

In short, they are different, orthogonal concepts


Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold; curves for RAND (CAN), CRAND (10), CRAND (100), PTN, and Chord. Annotation: "c times difference if p is small, where c is # obj/machine".]
