MOAT: A Multi-Object Assignment Toolkit



Page 1: MOAT:  A Multi-Object Assignment Toolkit

MOAT: A Multi-Object Assignment Toolkit

Haifeng Yu

Intel Research Pittsburgh / CMU

Joint work with:

Phillip B. Gibbons

Intel Research Pittsburgh

Page 2: MOAT:  A Multi-Object Assignment Toolkit

Background

Availability has become a principal design goal:

0.1% improvement → $2M / year for Amazon and eBay [internetweek.com]

One major focus of 8 OSDI’04 papers (out of 27)

Two orthogonal efforts:

Robustness of lower-level system components

Example: disk, individual machine, Internet routing

Higher-level redundancy

Example: data replication

This talk focuses on higher-level redundancy

Page 3: MOAT:  A Multi-Object Assignment Toolkit

High Availability via Replication

Large amounts of data accessed by many users:

Distributed file systems

Network monitoring (PIER, SDIMS, IRISLOG)

Index databases for search engine (Google, p2p)

Scientific / medical databases

Data replicated across multiple machines

Object: The unit of replication

File, file block, database table, database tuple, inverted index for a certain keyword

Page 4: MOAT:  A Multi-Object Assignment Toolkit

Multi-object Accesses

Many accesses request multiple objects:

Compiling a project

Writing a paper in LaTeX

Asking for aggregates of network conditions

Searching for web pages containing multiple keywords

Availability of a single object can be misleading:

An access requesting 1,000 objects can observe up to 1,000 times higher unavailability

There is more subtlety...
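The "up to 1,000 times" figure can be checked with a union bound. This is only a sketch: it assumes, for the moment, that each object is unavailable independently with the same probability u (a symbol introduced here, not on the slide).

```latex
\[
\Pr[\text{access fails}] \;=\; 1 - (1-u)^{n} \;\le\; n\,u ,
\qquad
n = 1000,\ u = 10^{-6}:\quad 1 - (1-10^{-6})^{1000} \approx 9.995\times 10^{-4} \approx 1000\,u .
\]
```

The bound is nearly tight when n·u is small. Once replicas of different objects share machines, the independence assumption no longer holds.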

Page 5: MOAT:  A Multi-Object Assignment Toolkit

A Simple Example

Compile a small project with four files; each file has two replicas: A, A, B, B, C, C, D, D

Four machines fail independently with the same probability; each holds two files

Which assignment gives better availability?

[Figure: two candidate assignments of the eight replicas to the four machines, drawn as rows "A B C D / A B C D" and "A B C D / A C B D"; one of them is marked "Better"]

Assignment matters because objects are now correlated

Page 6: MOAT:  A Multi-Object Assignment Toolkit

A Simple Example - Continued

Suppose the user is happy even if only three objects are available (e.g., when computing an average)

[Figure: the same two assignments, rows "A B C D / A B C D" and "A B C D / A C B D"; one of them is marked "Better"]

Assignment makes a difference:

Even if we are using the same machines (same amount of redundancy / resources)

The difference can easily be multiple nines
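The four-file example is small enough to enumerate exactly. The sketch below is an illustration, not the talk's own code. The transcript does not preserve which replicas share a machine, so the two layouts `PARTITIONED` and `SPREAD` are assumed readings of the figure, and the failure probability `p = 0.1` is an arbitrary choice.

```python
from itertools import product

def availability(machines, p, t):
    """Exact Prob[at least t distinct objects have a live replica].

    machines: list of sets, the objects whose replicas each machine holds.
    p: independent failure probability of each machine.
    """
    total = 0.0
    # Enumerate every up/down pattern of the machines (2^m patterns).
    for up in product([True, False], repeat=len(machines)):
        prob = 1.0
        alive = set()
        for is_up, held in zip(up, machines):
            prob *= (1 - p) if is_up else p
            if is_up:
                alive |= held
        if len(alive) >= t:
            total += prob
    return total

# Assumed layouts (hypothetical reading of the slide figure).
PARTITIONED = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"C", "D"}]
SPREAD = [{"A", "B"}, {"C", "D"}, {"A", "C"}, {"B", "D"}]

if __name__ == "__main__":
    p = 0.1  # arbitrary illustrative failure probability
    for t in (4, 3):
        for name, layout in (("partitioned", PARTITIONED), ("spread", SPREAD)):
            unavail = 1 - availability(layout, p, t)
            print(f"t={t} {name}: unavailability = {unavail:.4f}")
```

Under these assumed layouts and p = 0.1, the partitioned layout has lower unavailability when all four files are needed (0.0199 versus 0.0361), while the spread layout has lower unavailability when any three suffice (0.0037 versus 0.0199).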

Page 7: MOAT:  A Multi-Object Assignment Toolkit

Goal and Contributions

MOAT (Multi-Object Assignment Toolkit):

Goal: High availability for multi-object accesses

Key issue: Replica assignment

Contributions:

First to observe the importance of replica assignment

Strong theoretical results regarding best and worst assignments

Practical designs to approximate optimal assignments

MOAT toolkit implementation for replica assignments

Page 8: MOAT:  A Multi-Object Assignment Toolkit

Outline

Motivation and MOAT contributions

System model and case studies of existing systems

Theoretical results

Designs for approximating optimal assignments

Designs for mixed accesses

Conclusions

Page 9: MOAT:  A Multi-Object Assignment Toolkit

Assumptions for This Talk

Assume:

Replication (no erasure coding)

Crash failures (no Byzantine failures)

Eventual consistency (no quorum or voting)

Most of our results hold without these assumptions

Assume the same replication degree for all objects:

We have results for different replication degrees as well

Talk to me if interested in the more complete story...

Page 10: MOAT:  A Multi-Object Assignment Toolkit

MOAT Architecture Overview

[Figure: architecture diagram. Applications (file system, network monitoring, p2p DB, search engine) use MOAT, which sits above the storage system (raw data on distributed machines or disks). MOAT is labeled with replication / repair / load balancing / naming / assignment. It exposes a Data API (object create / delete / read / write) and a Control API (assignment policy).]

Page 11: MOAT:  A Multi-Object Assignment Toolkit

System Model

Basic system model:

N objects, each with k replicas

Load balancing among all machines

Machines fail independently with the same probability

An assignment is a mapping: replica → machine, for all N × k replicas

[Figure: example assignment, rows "A B C D / A B C D"]

Page 12: MOAT:  A Multi-Object Assignment Toolkit

Some Simple Assignments

PTN: partition assignment

Used in most practical deployments of Coda [Satyanarayanan et al.’90]

[Figure: PTN example for k = 2, rows "A B C D E F / A B C D E F"]

RAND: pick a random replica each time

Similar to Google File System [Ghemawat et al.’03]
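A minimal sketch of how the two assignments could be generated under the system model (N objects, k replicas each). The function names, the round-robin choice of group, and the reading of RAND as "each object independently picks k distinct machines" are assumptions made for illustration. Note that plain random sampling only approximately balances load.

```python
import random

def ptn_assignment(num_objects, num_machines, k):
    """Partition (PTN): machines form disjoint groups of k, and every object
    places its k replicas on all machines of a single group."""
    if num_machines % k != 0:
        raise ValueError("num_machines must be a multiple of k")
    num_groups = num_machines // k
    assignment = {}
    for obj in range(num_objects):
        group = obj % num_groups  # round-robin keeps the load balanced
        assignment[obj] = tuple(range(group * k, (group + 1) * k))
    return assignment

def rand_assignment(num_objects, num_machines, k, seed=0):
    """Random (RAND): each object independently picks k distinct machines."""
    rng = random.Random(seed)
    return {obj: tuple(sorted(rng.sample(range(num_machines), k)))
            for obj in range(num_objects)}

if __name__ == "__main__":
    print(ptn_assignment(6, 6, 2))
    print(rand_assignment(6, 6, 2))
```

With PTN, objects that share one machine share all of their machines. With RAND, the replica sets of two objects rarely coincide.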

Page 13: MOAT:  A Multi-Object Assignment Toolkit

Assignment in Chord [Stoica et al.’01]

DHTs: Hash the machine IP to get the machine id

Assignment in Chord: Sliding window

Neither PTN nor RAND

[Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of objects A, B, and C shown on the ring]
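The sliding-window behavior can be sketched as follows, assuming the usual successor-list style of replication in Chord-based stores: an object goes to the first machine whose id is at or after its hash, plus the next k - 1 machines clockwise. The machine ids are taken from the slide; the function name is illustrative.

```python
from bisect import bisect_left

def chord_replicas(key, machine_ids, k):
    """Return the k machines that hold `key`: its successor on the ring
    and the next k - 1 machines clockwise (a sliding window)."""
    ring = sorted(machine_ids)
    start = bisect_left(ring, key) % len(ring)  # first id >= key, wrapping
    return [ring[(start + i) % len(ring)] for i in range(k)]

RING = [80, 90, 98, 101, 104, 120]  # machine ids from the slide

if __name__ == "__main__":
    print(chord_replicas(95, RING, 2))   # hash(A) = 95 from the slide
    print(chord_replicas(100, RING, 2))  # a nearby key gets an overlapping window
```

Keys that hash near each other get overlapping but not identical windows, which is why the result is neither a clean partition nor a random spread.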

Page 14: MOAT:  A Multi-Object Assignment Toolkit

Assignment in CAN [Ratnasamy et al.’01]

Hash the object k times:

CAN uses a similar approach

Similar to RAND:

But machines may hold slightly different numbers of objects

[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(A) = 95 places the first replica of A]

Page 15: MOAT:  A Multi-Object Assignment Toolkit

Assignment in CAN [Ratnasamy et al.’01]

[Figure: same ring, next animation step; hash2(A) = 119 places the second replica of A]

Page 16: MOAT:  A Multi-Object Assignment Toolkit

Assignment in CAN [Ratnasamy et al.’01]

[Figure: same ring, next animation step; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]
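A minimal sketch of multi-hash placement on the ring from these slides. The hash values for A and B are the ones quoted in the figures. The salted SHA-256 function and the 128-point id space are hypothetical stand-ins for "hash the object k times", not something the talk specifies.

```python
import hashlib
from bisect import bisect_left

RING = [80, 90, 98, 101, 104, 120]  # machine ids from the slides

def successor(point, ring):
    """First machine id >= point, wrapping around the ring."""
    ring = sorted(ring)
    return ring[bisect_left(ring, point) % len(ring)]

def salted_hash(obj, i, space=128):
    """Hypothetical i-th hash: SHA-256 of "i:obj" reduced to the id space."""
    digest = hashlib.sha256(f"{i}:{obj}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % space

def multi_hash_replicas(obj, ring, k, hash_fn=salted_hash):
    """Place replica i of obj on the successor of hash_i(obj)."""
    return [successor(hash_fn(obj, i), ring) for i in range(1, k + 1)]

# Hash values quoted on the slides, used in place of a real hash function.
SLIDE_HASHES = {("A", 1): 95, ("A", 2): 119, ("B", 1): 84, ("B", 2): 100}

def slide_hash(obj, i):
    return SLIDE_HASHES[(obj, i)]

if __name__ == "__main__":
    print(multi_hash_replicas("A", RING, 2, slide_hash))
    print(multi_hash_replicas("B", RING, 2, slide_hash))
```

With the slide's values, A lands on machines 98 and 120 and B on machines 90 and 101. The sketch does not handle two hashes of one object falling on the same machine; a real design would need a rule for that case.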

Page 17: MOAT:  A Multi-Object Assignment Toolkit

Which assignment should we use?

MOAT Goal: Improve availability of multi-object accesses

If an access requests n (n ≤ N) objects, what if only x are available?

Threshold-based success definition:

If x ≥ t, user happy → Available

If x < t, confidence too low → Unavailable

Availability for an access is defined as: Prob[at least t objects available out of the n requested objects]
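The threshold definition can be estimated by simulation for any assignment. The following is a minimal Monte Carlo sketch under the talk's assumption of independent machine failures. The function name, the dictionary representation of an assignment, and the trial count are illustrative choices.

```python
import random

def estimate_availability(assignment, num_machines, p, requested, t,
                          trials=20000, seed=0):
    """Monte Carlo estimate of Prob[at least t of the requested objects
    have a live replica], with machines failing independently w.p. p.

    assignment: dict mapping object -> iterable of machine indices.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # Sample one failure pattern: up[m] is True when machine m is alive.
        up = [rng.random() >= p for _ in range(num_machines)]
        available = sum(any(up[m] for m in assignment[obj]) for obj in requested)
        if available >= t:
            successes += 1
    return successes / trials

if __name__ == "__main__":
    # Hypothetical four-object, four-machine assignment with k = 2.
    example = {"A": (0, 1), "B": (0, 1), "C": (2, 3), "D": (2, 3)}
    print(estimate_availability(example, 4, 0.5, ["A", "B", "C", "D"], t=4))
```

For the example assignment with p = 0.5 and t = 4, the exact value is (1 - 0.25)^2 = 0.5625, so the estimate should land close to that.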

Page 18: MOAT:  A Multi-Object Assignment Toolkit

Examples of t

t = n:

File systems

Search for terrorist images in image database

t close to n:

Query for top-10 most-loaded machines on PlanetLab

t not close to n:

Sample with confidence

Page 19: MOAT:  A Multi-Object Assignment Toolkit

Outline

Motivation and MOAT contributions

System model and case studies of existing systems

Theoretical results

Designs for approximating optimal assignments

Designs for mixed accesses

Conclusions

Page 20: MOAT:  A Multi-Object Assignment Toolkit

Formal Results

For an access requesting N objects

Theorem: Among all assignments, when t = N:

PTN is best (within a constant)

RAND is worst (within a constant)

Difference is about c-fold (c is # objects / machine)

Theorem: Among all assignments, when t = c+1 < N:

PTN is worst

RAND is best (within a constant)

Difference is even larger
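A rough first-order way to see where the factor c comes from, offered as intuition rather than as the talk's proof. Assume M machines (a symbol not used on the slide), a small per-machine failure probability p, and t = N, so the access fails as soon as any one object loses all k of its replicas. PTN has only M/k distinct replica groups, while RAND can have up to N distinct replica sets; a union bound over those sets gives:

```latex
\[
c = \frac{N k}{M}, \qquad
U_{\mathrm{PTN}} \approx \frac{M}{k}\, p^{k}, \qquad
U_{\mathrm{RAND}} \approx N\, p^{k}
\quad\Longrightarrow\quad
\frac{U_{\mathrm{RAND}}}{U_{\mathrm{PTN}}} \approx \frac{N}{M/k} = \frac{N k}{M} = c .
\]
```

The approximation is only meaningful while N p^k is well below 1. For 40,000 objects, 4 replicas each, and 400 machines, c = 400.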

Page 21: MOAT:  A Multi-Object Assignment Toolkit

Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for RAND (CAN), PTN, and Chord. Annotations: "c times difference if p is small, where c is # obj/machine"; "unavail of single obj"]

Page 22: MOAT:  A Multi-Object Assignment Toolkit

A Spectrum of Assignments

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for RAND (CAN) and PTN]

Page 23: MOAT:  A Multi-Object Assignment Toolkit

More Formal Arguments

Tradeoff is fundamental:

Impossible to achieve the best of both RAND and PTN

Previous results are only for an access requesting all N objects:

Similar results hold for accesses requesting n (n ≤ N) objects

But each machine may not be filled to capacity:

For PTN, use as few machines as possible

For RAND, use as many machines as possible

I have more... talk to me if you are interested

Page 24: MOAT:  A Multi-Object Assignment Toolkit

Access Requesting 500 Objects

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for RAND (CAN), PTN, and Chord]

Page 25: MOAT:  A Multi-Object Assignment Toolkit

Outline

Motivation and MOAT contributions

System model and case studies of existing systems

Theoretical results

Designs for approximating optimal assignments

Designs for mixed accesses

Conclusions

Page 26: MOAT:  A Multi-Object Assignment Toolkit

Design of Replica Assignment

Trivial in a static / centralized environment

Challenging in a dynamic environment:

We may not have global knowledge with many objects and many machines

Basic solution: Consistent hashing

But some re-design is necessary

Page 27: MOAT:  A Multi-Object Assignment Toolkit

Approximating RAND

Multi-hash DHT:

Hash the object k times

As in CAN

[Figure: ring with machine ids 080, 090, 098, 101, 104, 120 holding replicas of A and B; hash1(B) = 84, hash2(B) = 100]

Page 28: MOAT:  A Multi-Object Assignment Toolkit

Approximating PTN

Chord does not achieve PTN

[Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, and C shown on the ring]

Page 29: MOAT:  A Multi-Object Assignment Toolkit

Approximating PTN

Chord does not achieve PTN

Group DHT:

(Arbitrarily) group machines into groups of size k

[Figure: ring of groups with ids 090, 101, 120; hash(A) = 95; replicas of A, B, and C shown on the groups]
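A minimal sketch of the Group DHT idea: machines are chunked into groups of size k, each group occupies one point on the ring, and an object's k replicas go to every member of the successor group. The group ids 90, 101, and 120 come from the slide. The machine names, the chunking order, and the function names are hypothetical.

```python
from bisect import bisect_left

def make_groups(machines, k):
    """Arbitrarily chunk machines into groups of exactly k.
    Leftover machines stay ungrouped until more machines arrive."""
    usable = len(machines) - len(machines) % k
    return [tuple(machines[i:i + k]) for i in range(0, usable, k)]

def group_replicas(key, group_ring):
    """group_ring maps a group id on the ring to its member machines.
    The object goes to the successor group; every member holds a replica."""
    ids = sorted(group_ring)
    gid = ids[bisect_left(ids, key) % len(ids)]
    return gid, group_ring[gid]

# Group ids are from the slide; machine names are hypothetical.
groups = make_groups(["m1", "m2", "m3", "m4", "m5", "m6", "m7"], k=2)
GROUP_RING = dict(zip([90, 101, 120], groups))

if __name__ == "__main__":
    print(group_replicas(95, GROUP_RING))  # hash(A) = 95 from the slide
```

Because all objects mapped to a group share exactly the same k machines, the result is a partition assignment regardless of how the keys hash.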

Page 30: MOAT:  A Multi-Object Assignment Toolkit

Node Join and Leave in Group DHT

Maintain r rendezvous points in the DHT:

Diminishing Chord [Karger et al.’04] / ReDir [Karp et al.’04]

New node reports to a random rendezvous point:

If a group can be formed, join the DHT

Two options upon node leave:

Dismiss the group and delete it from the DHT

The group waits to recruit a new node

Groups use the rendezvous point to decide

Page 31: MOAT:  A Multi-Object Assignment Toolkit

Complexity Analysis

Metric             Standard DHT   Group DHT
Routing state      log N          log N/k
Routing hops       log N          log N/k
Messages / Join    (log N)^2      log N/k + (log N/k)^2 / k
Messages / Leave   (log N)^2      log N/k + (log N/k)^2 / k
Obj moves / Join   k/N            k/N
Obj moves / Leave  k/N            2k/N

Page 32: MOAT:  A Multi-Object Assignment Toolkit

Outline

Motivation and MOAT contributions

System model and case studies of existing systems

Theoretical results

Designs for approximating optimal assignments

Designs for mixed accesses

Conclusions

Page 33: MOAT:  A Multi-Object Assignment Toolkit

Mixture of Queries

Previous design is only for a single access requesting all N objects:

PTN if t close to N

RAND if t far from N

But there are other accesses:

Requests n (n < N) objects with threshold t

How does t change with n?

Infinite possibilities

We focus on 4 large categories

Page 34: MOAT:  A Multi-Object Assignment Toolkit

Four Application Scenarios

Scenario | Small accesses (small n) | Large accesses (large n)

File system | strict | strict

Computing aggregates | loose | loose

Network monitoring | strict (pinpoint problems) | loose (overview query)

Image database search | loose (resource retrieval of frequent objects, e.g., find clip art for slide) | strict (non-existence test, e.g., exhaustive search of terrorist)

Strict accesses: t ≈ n

Loose accesses: t < n
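The four scenarios can be read as a lookup from (small-access semantics, large-access semantics) to a design. The sketch below records the pairing that the following slides give for each case. The dictionary and function names are illustrative, not part of any MOAT API.

```python
# (small accesses, large accesses) -> design named on the later slides
DESIGNS = {
    ("loose", "loose"): "Multi-hash DHT",
    ("loose", "strict"): "Group DHT",
    ("strict", "strict"): "Group DHT with item-balancing",
    ("strict", "loose"): "Multi-hash DHT with clusters of size R",
}

# Scenario -> (small accesses, large accesses), as in the table above
SCENARIOS = {
    "file system": ("strict", "strict"),
    "computing aggregates": ("loose", "loose"),
    "network monitoring": ("strict", "loose"),
    "image database search": ("loose", "strict"),
}

def choose_design(scenario):
    """Return the design the slides pair with the given scenario."""
    return DESIGNS[SCENARIOS[scenario]]

if __name__ == "__main__":
    for name in SCENARIOS:
        print(f"{name}: {choose_design(name)}")
```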

Page 35: MOAT:  A Multi-Object Assignment Toolkit

Loose for both small and large n

Goal:

Approach RAND for both small and large n

Design: Multi-hash DHT

[Figure: ring with machine ids 080, 090, 098, 101, 104, 120 holding replicas of A and B; hash1(B) = 84, hash2(B) = 100]

Page 36: MOAT:  A Multi-Object Assignment Toolkit

Loose for small n; Strict for large n

Goal:

Approach RAND for small n

Approach PTN for large n

Design: Group DHT

[Figure: ring of groups with ids 090, 101, 120; replicas of A, B, and C shown on the groups]

Page 37: MOAT:  A Multi-Object Assignment Toolkit

Strict for both small and large n

Goal:

Approach PTN for both small and large n

Assume accesses are tree accesses

Design: Group DHT with item-balancing [Karger et al.’04]

[Figure: ring of groups with ids 090, 101, 120; object A placed at A = 95; replicas of A, B, and C shown on the groups]

Page 38: MOAT:  A Multi-Object Assignment Toolkit

Strict for small n; Loose for large n

Goal:

Approach PTN for n < R

Approach RAND for n >> R

Design: Multi-hash DHT

But cluster objects into clusters of constant size R

[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; the cluster AB is hashed as a unit, hash1(AB) = 84, hash2(AB) = 100, so A and B share the same machines]
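A minimal sketch of the clustered multi-hash idea: the cluster, not the individual object, is hashed k times, so the R objects of one cluster land on the same k machines while different clusters scatter. How objects are grouped into clusters is not stated in the transcript. Grouping by consecutive index, the salted SHA-256 hash, and the 128-point id space are all assumptions here.

```python
import hashlib
from bisect import bisect_left

RING = [80, 90, 98, 101, 104, 120]  # machine ids from the slide

def cluster_of(obj_index, R):
    """Assumed clustering rule: consecutive indices share a cluster of size R."""
    return obj_index // R

def clustered_replicas(obj_index, ring, k, R, space=128):
    """Hash the cluster id (not the object) k times, so all R objects in a
    cluster are placed on the same k machines."""
    ring = sorted(ring)
    cluster = cluster_of(obj_index, R)
    replicas = []
    for i in range(1, k + 1):
        # Hypothetical i-th hash: SHA-256 of "i:cluster" reduced to the id space.
        digest = hashlib.sha256(f"{i}:{cluster}".encode()).digest()
        point = int.from_bytes(digest[:8], "big") % space
        replicas.append(ring[bisect_left(ring, point) % len(ring)])
    return replicas

if __name__ == "__main__":
    # Objects 0 and 1 form one cluster when R = 2, so they share machines.
    print(clustered_replicas(0, RING, k=2, R=2))
    print(clustered_replicas(1, RING, k=2, R=2))
```

An access that touches fewer than R objects from one cluster sees a partition-like layout, while an access spanning many clusters sees something closer to a random spread.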

Page 39: MOAT:  A Multi-Object Assignment Toolkit

Simulation Results for Strict Accesses

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object

[Figure: unavailability vs. number (n) of objects requested by an access]

Here an access needs all n objects to be successful

Page 40: MOAT:  A Multi-Object Assignment Toolkit

Simulation Results for Loose Accesses

400 machines, fail prob = 0.2, 40,000 objects, 4 replicas / object

[Figure: unavailability vs. number (n) of objects requested by an access]

Here an access needs only t = n - 150 objects to be successful

Page 41: MOAT:  A Multi-Object Assignment Toolkit

Current Status

Waiting for paper deadlines

Finishing implementing MOAT

Evaluation on IrisLog trace and file system traces

Page 42: MOAT:  A Multi-Object Assignment Toolkit

Related Work

Multi-object accesses rarely addressed:

CFS [Dabek et al.’01] focuses on individual file blocks

Chain replication [Renesse et al.’04] considers a single data object

A long list...

Replica assignment largely ignored:

Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments: Effects not understood / studied

Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] well studied:

Typically for machines in different locations in the network

Machines are heterogeneous

Approaches do not apply to replica assignment

Page 43: MOAT:  A Multi-Object Assignment Toolkit

Conclusions

Availability is becoming a key design goal

Multi-object access availability is dramatically different from single-object availability

MOAT Contributions:

First to observe the importance of replica assignment

Strong theoretical results regarding the best and worst assignments

Practical designs to approximate optimal assignments

MOAT toolkit implementation

Page 44: MOAT:  A Multi-Object Assignment Toolkit

My Other Recent Work

Om [NSDI’04]: Consistent and automatic replica regeneration

Regenerate from any single replica rather than a majority

Signed quorum systems [PODC’04]:

Constant quorum size at the cost of a small probability of inconsistency

Node failure characteristics in WAN [WORLDS’04]:

Answer subtle questions regarding real-world failure properties


Page 46: MOAT:  A Multi-Object Assignment Toolkit

Erasure Coding

Encode the object into k fragments such that any m (m < k) out of the k fragments can reconstruct the object

RAID techniques are special cases

Replication is a special case where m = 1
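The availability of a single erasure-coded object follows from a binomial sum. The sketch below assumes the k fragments sit on distinct machines that fail independently with probability p (the talk's failure model); the function name is illustrative. Setting m = 1 recovers plain replication.

```python
from math import comb

def object_availability(k, m, p):
    """Prob[at least m of k fragments survive], with fragments on distinct
    machines that fail independently with probability p."""
    return sum(comb(k, i) * (1 - p) ** i * p ** (k - i)
               for i in range(m, k + 1))

if __name__ == "__main__":
    # Replication with 4 replicas is the m = 1 case: 1 - p^4.
    print(object_availability(k=4, m=1, p=0.2))
    # A 4-out-of-8 code with the same per-machine failure probability.
    print(object_availability(k=8, m=4, p=0.2))
```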

Page 47: MOAT:  A Multi-Object Assignment Toolkit

Example Revisited

Need four files to compile:

[Figure: the two assignments from the simple example, rows "A B C D / A B C D" and "A B C D / A C B D"; one of them is marked "Better"]

Erasure coding is hard to apply across a large amount of data:

Updating any portion of the data requires updating k - m + 1 fragments the size of the original data

We cannot use erasure coding across 1,000 files

Can we treat A, B, C, D as a single object and use erasure coding?

So that all files can be reconstructed from any 4 out of 8 fragments

Page 48: MOAT:  A Multi-Object Assignment Toolkit

Threshold Semantics and Erasure Coding

Threshold Semantics | Erasure Coding

need t out of n objects to answer query | need m out of k fragments to reconstruct object

t determined by app semantics | m determined at coding time

result dependent on which t objects | same result regardless of which m fragments

may update single object by itself | modification to any portion of the object needs to update k-m+1 fragments

In short, they are different, orthogonal concepts

Page 49: MOAT:  A Multi-Object Assignment Toolkit

Numerical Examples (from Simulation)

40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2

[Figure: unavailability vs. threshold for RAND (CAN), CRAND (10), CRAND (100), PTN, and Chord. Annotation: "c times difference if p is small, where c is # obj/machine"]