TRANSCRIPT
1/ 28
Challenges in Distributed Storage and Compute Systems
Parimal Parag
Electrical Communication Engineering, Indian Institute of Science
MVJ College of Engineering, August 29, 2017
2/ 28
Evolving Digital Landscape
[Figure: applications placed by delay tolerance versus rate requirements, spanning the information-theoretic regime (Voice, Browsing, Network Printer, File Transfer) to emerging applications (Cloud Storage, Video Streaming)]
3/ 28
Dominant traffic on Internet
[Figure: peak period traffic composition in North America, upstream/downstream/aggregate shares (0–100%) across categories: Real-Time Entertainment, Web Browsing, Marketplaces, Filesharing, Tunneling, Social Networking, Storage, Communications, Gaming, Outside Top 5]
- Real-Time Entertainment: 64.54% of downstream and 36.56% of mobile access traffic [1]

[1] https://www.sandvine.com/downloads/general/global-internet-phenomena/2015/global-internet-phenomena-report-latin-america-and-north-america.pdf
4/ 28
Centralized Paradigm – Media Vault
[Figure: a single vault storing Files 1–6 and serving all incoming requests]
Potential Issues with Centralized Scheme
- Traffic load: the vault must handle all requests for all files
- Service rate: large storage entails longer access times
- Not robust to hardware failures or malicious attacks
5/ 28
Alternative to Centralized Paradigm
[Figure: Files 1–11 spread across interconnected autonomous nodes]
Distributed Systems
- Autonomous nodes with local memory
- Interaction between the connected nodes
- Nodes with local knowledge of input and network topology
- Heterogeneous and potentially time-varying system topology
6/ 28
Distributed Systems
[Figure: Files 1–11 spread across interconnected autonomous nodes, as on the previous slide]
Desirable Properties
- Scalability: linear or sub-linear growth with the number of nodes
- Resilience: able to withstand local node failures
- Efficiency: minimal interaction between nodes
- Fairness: nearly equal load at all nodes
7/ 28
Examples
Distributed Storage
- Content streaming: Netflix, HotStar, Eros Now, YouTube, Hulu, Amazon Prime Video
- Cloud storage: GitHub, DropBox, iCloud, OneDrive, Ubuntu One
- Cloud services: Facebook, Google Suite, Office 365
Distributed Computation
- Cloud computing: Amazon Web Services, Microsoft Azure, Google Search
- Cluster computing: Hadoop, Spark
- Distributed databases: Aerospike, Cassandra, Couchbase, Druid
8/ 28
Distributed System Architecture
Classification
- Client-server: online banking, web servers, e-commerce
- Peer-to-peer: Bitcoin, OS distribution
- Hybrid: Spotify, content delivery in ISPs
Interaction
- Master-slave: message passing with local memory
- Database-centric: relational database for interaction
9/ 28
Content Delivery Network
[Figure: vault with Files 1–6 mirrored onto four local servers, each storing three full files (e.g., Files 1, 3, 5); requests are routed to the mirrors]
Redundancy for resilience
- Mirroring content with local servers
- Each media file on multiple servers
10/ 28
Load Balancing through File Fragmentation
[Figure: each file split into two fragments (File 1A, File 1B, . . . , File 6B); fragments are spread across four servers, so multiple requests see partial completions and shared coherent access]
- Availability and better content distribution
- File segments on multiple servers
11/ 28
Problem Statement
[Figure: encoded fragments f1(X), f2(X), f3(X), f4(X) stored on four distinct nodes, serving incoming requests]
Quantify mean access time
- with the number of fragments for a single message X,
- with the encoding and storage fi(X) for the fragmented message X = (X1, . . . , Xk) at n distinct nodes.
11/ 28
Problem Statement
[Figure: replication stores copies (A, A, B, B) on four servers, while coded storage keeps (A, B, A+B, A−B); both serve the same stream of requests]
Problem: Quantify the latency gains offered by distributed coding
Solution: Coded storage offers scaling gains over replication
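The coded layout above can be made concrete with a small sketch. Storing A, B, A+B, A−B lets any two of the four servers reconstruct both halves, whereas with copies (A, A, B, B) the pair of A-servers cannot recover B. Integer arithmetic stands in for a finite-field code here, purely for illustration:

```python
# Sketch: (4, 2) MDS-style storage of a file split into halves A and B.
# Servers hold A, B, A+B, A-B; any 2 of the 4 suffice to recover (A, B).
from itertools import combinations

A, B = 7, 3  # hypothetical file halves

# Each server stores (coefficients on A and B, stored value).
servers = [((1, 0), A), ((0, 1), B), ((1, 1), A + B), ((1, -1), A - B)]

def recover(two_servers):
    """Solve the 2x2 linear system given any two stored symbols."""
    (a1, b1), v1 = two_servers[0]
    (a2, b2), v2 = two_servers[1]
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # the pair is not information-complete
    return ((v1 * b2 - v2 * b1) / det, (a1 * v2 - a2 * v1) / det)

# Every pair of the four coded servers recovers the whole file;
# under replication (A, A, B, B) the pair of A-copies could not.
for pair in combinations(servers, 2):
    assert recover(pair) == (A, B)
```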
12/ 28
System Model
File storage
- Each media file is divided into k pieces
- Pieces are encoded and stored on n servers

Arrival of requests
- Each request wants the entire media file
- Poisson arrival of requests with rate λ

Time in the system
- Until the reception of the whole file

Service at each server
- IID exponential service time with rate k/n
13/ 28
Question: Duplication versus MDS Coding
[Figure: duplication stores copies (A, A, B, B) while MDS coding stores distinct coded symbols (A, B, C, D) on the four servers]
Reduction of access time
- How to select the number of fragments for a single message?
- How to encode and store at the distributed storage nodes?
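As a first cut at the question, one can compare expected download times for a single request on an otherwise idle system (a sketch, not the talk's queueing analysis; assumes unit per-server service rate and k dividing n):

```python
# Sketch: expected download time for ONE request on an idle system,
# per-server service times i.i.d. Exp(mu). Assumes k divides n.
from fractions import Fraction

def harmonic(m):
    return sum(Fraction(1, i) for i in range(1, m + 1))

def replication_time(n, k, mu=1):
    # Each of k pieces sits on n/k servers; a piece finishes at the
    # fastest of its n/k copies, i.e. Exp((n/k)*mu); the file waits
    # for the slowest piece: expectation of a max of k such minima.
    return harmonic(k) / (Fraction(n, k) * mu)

def mds_time(n, k, mu=1):
    # Any k of n coded symbols suffice: expectation of the k-th
    # order statistic of n i.i.d. Exp(mu) variables.
    return (harmonic(n) - harmonic(n - k)) / mu

print(replication_time(4, 2))  # 3/4
print(mds_time(4, 2))          # 7/12, i.e. coding downloads faster
```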
14/ 28
Pertinent References (very incomplete)
N. B. Shah, K. Lee, and K. Ramchandran, “When do redundant requests reduce latency?” IEEE Trans. Commun., 2016.
G. Joshi, Y. Liu, and E. Soljanin, “On the delay-storage trade-off in content download from coded distributed storage systems,” IEEE Journ. Spec. Areas Commun., 2014.
Dimakis, Godfrey, Wu, Wainwright, and Ramchandran, “Network coding for distributed storage systems,” IEEE Trans. Info. Theory, 2010.
A. Eryilmaz, A. Ozdaglar, M. Medard, and E. Ahmed, “On the delay and throughput gains of coding in unreliable networks,” IEEE Trans. Info. Theory, 2008.
D. Wang, D. Silva, and F. R. Kschischang, “Robust network coding in the presence of untrusted nodes,” IEEE Trans. Info. Theory, 2010.
A. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A survey on network codes for distributed storage,” Proceedings of the IEEE, 2011.
Karp, Luby, and Meyer auf der Heide, “Efficient PRAM simulation on a distributed memory machine,” ACM Symposium on Theory of Computing, 1992.
Adler, Chakrabarti, Mitzenmacher, and Rasmussen, “Parallel randomized load balancing,” ACM Symposium on Theory of Computing, 1995.
Gardner, Zbarsky, Velednitsky, Harchol-Balter, and Scheller-Wolf, “Understanding response time in the Redundancy-d system,” SIGMETRICS, 2016.
B. Li, A. Ramamoorthy, and R. Srikant, “Mean-field analysis of coding versus replication in cloud storage systems,” INFOCOM, 2016.
15/ 28
Storage Coding – The Centralized MDS Queue
[Figure: a centralized MDS queue serving requests for file X]

exempli gratia: Shah, Lee, Ramchandran (2013); Lee, Shah, Huang, Ramchandran (2017); Vulimiri, Michel, Godfrey, Shenker (2012); Ananthanarayanan, Ghodsi, Shenker, Stoica (2012); Baccelli, Makowski, Shwartz (1989)
16/ 28
State Space Structure
Keeping Track of Partially Fulfilled Requests
- Element YS(t) of the state vector is the number of users holding a given subset S of pieces

Continuous-Time Markov Chain
- Y(t) = {YS(t) : S ⊂ [n], |S| < k} is a Markov process
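A quick count shows why tracking one counter per subset is unwieldy: the uncollapsed state vector has one coordinate for every S ⊂ [n] with |S| < k, which grows combinatorially in n, while the collapsed chain keeps only k counters. A small sketch (`full_dim` is an illustrative helper, not from the talk):

```python
# Sketch: dimension of the state descriptor before the collapse --
# one counter per subset S of [n] with |S| < k -- versus the k
# counters that survive the state space collapse.
from math import comb

def full_dim(n, k):
    return sum(comb(n, i) for i in range(k))

print(full_dim(4, 2))   # 5 subsets: the empty set plus four singletons
print(full_dim(10, 5))  # grows combinatorially, versus k = 5 after collapse
```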
17/ 28
Storage Coding – (n, k) Fork-Join Model
[Figure: an (n, k) fork-join system, with each request for file X forked to all n servers]

exempli gratia: Joshi, Liu, Soljanin (2012, 2014); Joshi, Soljanin, Wornell (2015); Sun, Zheng, Koksal, Kim, Shroff (2015); Kadhe, Soljanin, Sprintson (2016); Li, Ramamoorthy, Srikant (2016)
18/ 28
State Space Collapse
Theorem
For duplication and coding schemes under priority scheduling and the parallel processing model, the collection

S(t) = {S ⊂ [n] : YS(t) > 0, |S| < k}

of information subsets is totally ordered by set inclusion.

Corollary
Let Yi(t) be the number of requests holding i information symbols at time t; then

Y(t) = (Y0(t), Y1(t), . . . , Yk−1(t))

is a Markov process.
19/ 28
State Transitions of Collapsed System
Arrival of Requests
- Unit increase, Y0(t) = Y0(t−) + 1, with rate λ

Getting an Additional Symbol
- Unit increase, Yi(t) = Yi(t−) + 1
- Unit decrease, Yi−1(t) = Yi−1(t−) − 1

Getting the Last Missing Symbol
- Unit decrease, Yk−1(t) = Yk−1(t−) − 1
20/ 28
Tandem Queue Interpretation (No Empty States)
[Figure: tandem queue with arrival rate λ and state-dependent service rates γ0 Y0(t), γ1 Y1(t), . . .]

Duplication
- When all states are non-empty
- Number of servers available at level i is n/k
- Normalized service rate at level i: γi = 1 for i = 0, . . . , k − 1

MDS Coding
- When all states are non-empty
- One server available at each level i ≠ k − 1
- Normalized service rate at level i: γi = k/n for i < k − 1, and γk−1 = (k/n)(n − k + 1)
21/ 28
Tandem Queue Interpretation (General Case)
[Figure: tandem queue with pooled service rates and arrival rate λ]

Tandem Queue with Pooled Resources
- Servers with empty buffers help upstream
- Aggregate service rate at level i becomes Σ_{j=i}^{li(t)−1} γj, where li(t) = k ∧ min{l > i : Yl(t) > 0}
- No explicit description of the stationary distribution for the multi-dimensional Markov process
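With no explicit stationary distribution available, the pooled system can still be estimated by Monte Carlo. The sketch below simulates the collapsed chain for an (n, k) MDS code, assuming the normalized rates from the previous slide (γi = k/n for i < k − 1 and γk−1 = (k/n)(n − k + 1)) and estimating the mean sojourn time with Little's law:

```python
# Sketch: Monte Carlo simulation of the pooled tandem queue for an
# (n, k) MDS code; servers at empty downstream levels help the next
# busy level upstream. Mean sojourn time via Little's law, E[T] = E[N]/lam.
import random

def simulate_mds_tandem(n, k, lam, horizon=50_000, seed=1):
    random.seed(seed)
    gamma = [k / n] * (k - 1) + [(k / n) * (n - k + 1)]
    y = [0] * k          # y[i] = requests currently holding i symbols
    t, area = 0.0, 0.0   # elapsed time and integral of total occupancy
    while t < horizon:
        # Aggregate rate at each non-empty level i: its own gamma plus
        # the gammas of the empty levels between i and the next busy level.
        rates = [("arr", lam)]
        i = 0
        while i < k:
            if y[i] > 0:
                j = i + 1
                while j < k and y[j] == 0:
                    j += 1
                rates.append((i, sum(gamma[i:j])))
                i = j
            else:
                i += 1
        total = sum(r for _, r in rates)
        dt = random.expovariate(total)
        area += sum(y) * dt
        t += dt
        # Pick the next event proportionally to its rate.
        u, ev = random.random() * total, None
        for ev, r in rates:
            u -= r
            if u <= 0:
                break
        if ev == "arr":
            y[0] += 1
        else:
            y[ev] -= 1
            if ev + 1 < k:
                y[ev + 1] += 1   # request gains one more symbol
            # else: last missing symbol received, request departs
    return (area / t) / lam      # Little's law

print(simulate_mds_tandem(4, 2, lam=0.3))
```

The estimate should fall between the closed-form lower and upper bounds of the next slides; the arrival rate 0.3 matches the comparison slide at the end of the talk.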
22/ 28
Bounding and Separating
[Figure: tandem queue with arrival rate λ and fixed service rates µ0, µ1, . . .]

Theorem†
When λ < min_i µi, the tandem queue has the product-form distribution

π(y) = ∏_{i=0}^{k−1} (1 − λ/µi) (λ/µi)^{yi}

Uniform Bounds on Service Rate
Transition rates are uniformly bounded by

γi ≤ Σ_{j=i}^{li(y)−1} γj ≤ Σ_{j=i}^{k−1} γj ≜ Γi
†F. P. Kelly, Reversibility and Stochastic Networks. New York, NY, USA: Cambridge University Press, 2011.
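Combining the product form with Little's law, E[T] = (1/λ) Σi ρi/(1 − ρi) with ρi = λ/µi, gives computable sojourn-time bounds. A sketch in exact rational arithmetic, with Γi = Σ_{j≥i} γj as the pooled-rate upper bound, evaluated at the (4, 2) MDS rates:

```python
# Sketch: closed-form mean sojourn time of a product-form tandem queue,
# evaluated at the two rate vectors that bound the pooled system:
# gamma_i (upper bound on sojourn) and Gamma_i = sum_{j>=i} gamma_j
# (lower bound on sojourn).
from fractions import Fraction as F

def mean_sojourn(lam, mu):
    """E[T] = (1/lam) * sum_i rho_i/(1 - rho_i) with rho_i = lam/mu_i."""
    assert all(lam < m for m in mu), "stability requires lam < min mu_i"
    rho = [F(lam) / m for m in mu]
    return sum(r / (1 - r) for r in rho) / F(lam)

n, k, lam = 4, 2, F(3, 10)
gamma = [F(k, n)] * (k - 1) + [F(k, n) * (n - k + 1)]  # (4, 2) MDS rates
Gamma = [sum(gamma[i:]) for i in range(k)]

print(mean_sojourn(lam, Gamma))  # lower bound on mean sojourn
print(mean_sojourn(lam, gamma))  # upper bound on mean sojourn
```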
23/ 28
Bounds on Tandem Queue
[Figure: the pooled tandem queue compared against two product-form tandem queues, one with the higher rates Γi Yi(t) and one with the lower rates γi Yi(t)]

Lower Bound
Higher values of the service rates yield a lower bound on the queue distribution:

π(y) = ∏_{i=0}^{k−1} (1 − λ/Γi) (λ/Γi)^{yi}

Upper Bound
Lower values of the service rates yield an upper bound on the queue distribution:

π(y) = ∏_{i=0}^{k−1} (1 − λ/γi) (λ/γi)^{yi}
24/ 28
Mean Sojourn Time
[Figure: mean sojourn time versus arrival rate (0.1 to 0.95) under replication coding; curves for upper bound, simulation, approximation, and lower bound]
25/ 28
Mean Sojourn Time
[Figure: mean sojourn time versus arrival rate (0.1 to 0.95) under the (4, 2) MDS code; curves for upper bound, simulation, approximation, and lower bound]
26/ 28
Approximating Pooled Tandem Queue
[Figure: the pooled tandem queue with rates γi Yi(t) approximated by a tandem queue with effective rates µi Yi(t)]

Independence Approximation with Statistical Averaging
The service rate equals the base service rate γi plus a cascade effect, averaged over time:

µk−1 = γk−1
µi = γi + µi+1 πi+1(0)

π(y) = ∏_{i=0}^{k−1} (1 − λ/µi) (λ/µi)^{yi}
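The recursion is easy to evaluate numerically; a sketch for the (4, 2) MDS rates at arrival rate 0.3, using the product-form empty probability πi+1(0) = 1 − λ/µi+1:

```python
# Sketch of the independence approximation: each level's effective rate
# is its own gamma_i plus the downstream rate, weighted by the
# probability pi_{i+1}(0) = 1 - lam/mu_{i+1} that level i+1 is empty.

def approx_rates(lam, gamma):
    k = len(gamma)
    mu = [0.0] * k
    mu[k - 1] = gamma[k - 1]
    for i in range(k - 2, -1, -1):
        p_empty = 1 - lam / mu[i + 1]     # product-form empty probability
        mu[i] = gamma[i] + mu[i + 1] * p_empty
    return mu

def approx_sojourn(lam, gamma):
    # Little's law on the approximating product-form tandem queue.
    mu = approx_rates(lam, gamma)
    return sum((lam / m) / (1 - lam / m) for m in mu) / lam

n, k, lam = 4, 2, 0.3
gamma = [k / n] * (k - 1) + [(k / n) * (n - k + 1)]  # (4, 2) MDS rates
print(approx_sojourn(lam, gamma))
```

The result lands between the closed-form lower and upper bounds, close to the lower bound, consistent with the plots above.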
27/ 28
Comparing Replication versus MDS Coding
[Figure: mean sojourn time versus number of servers (2 to 20) for repetition and MDS coding, simulation and approximation; arrival rate 0.3 units and coding rate n/k = 2]
28/ 28
Summary and Discussion
Main Contributions
- Analytical framework for the study of distributed computation and storage systems
- Upper and lower bounds to analyze replication and MDS codes
- A tight closed-form approximation to study distributed storage codes
- MDS codes are better suited for large distributed systems
- Mean access time is better for MDS codes at all code rates