TRANSCRIPT
1/ 28
Challenges in Distributed Storage and Compute Systems
Parimal Parag
Electrical Communication Engineering, Indian Institute of Science
MVJ College of Engineering, August 29, 2017
2/ 28
Evolving Digital Landscape
[Figure: applications placed by delay tolerance versus rate requirements, spanning the information-theoretic regime (Voice, Browsing, Network Printer, File Transfer) to emerging applications (Cloud Storage, Video Streaming)]
3/ 28
Dominant traffic on Internet
[Figure: peak period traffic composition in North America, upstream/downstream/aggregate shares (0–100%) across categories: Real-Time Entertainment, Web Browsing, Marketplaces, Filesharing, Tunneling, Social Networking, Storage, Communications, Gaming, Outside Top 5]
- Real-Time Entertainment: 64.54% of downstream and 36.56% of mobile access traffic [1]

[1] https://www.sandvine.com/downloads/general/global-internet-phenomena/2015/global-internet-phenomena-report-latin-america-and-north-america.pdf
4/ 28
Centralized Paradigm – Media Vault
[Figure: a single vault storing Files 1–6 and serving all incoming requests]
Potential Issues with Centralized Scheme
- Traffic load: the vault must handle all requests for all files
- Service rate: large storage entails longer access times
- Not robust to hardware failures or malicious attacks
5/ 28
Alternative to Centralized Paradigm
[Figure: Files 1–11 spread across interconnected autonomous nodes]
Distributed Systems
- Autonomous nodes with local memory
- Interaction between the connected nodes
- Nodes with local knowledge of input and network topology
- Heterogeneous and potentially time-varying system topology
6/ 28
Distributed Systems
[Figure: Files 1–11 spread across interconnected autonomous nodes, as on the previous slide]
Desirable Properties
- Scalability: linear or sub-linear growth with the number of nodes
- Resilience: able to withstand local node failures
- Efficiency: minimal interaction between nodes
- Fairness: nearly equal load at all nodes
7/ 28
Examples
Distributed Storage
- Content streaming: Netflix, HotStar, Eros Now, YouTube, Hulu, Amazon Prime Video
- Cloud storage: GitHub, DropBox, iCloud, OneDrive, Ubuntu One
- Cloud services: Facebook, Google Suite, Office 365
Distributed Computation
- Cloud computing: Amazon Web Services, Microsoft Azure, Google Search
- Cluster computing: Hadoop, Spark
- Distributed databases: Aerospike, Cassandra, Couchbase, Druid
8/ 28
Distributed System Architecture
Classification
- Client-server: online banking, web servers, e-commerce
- Peer-to-peer: Bitcoin, OS distribution
- Hybrid: Spotify, content delivery in ISPs
Interaction
- Master-slave: message passing with local memory
- Database-centric: relational database for interaction
9/ 28
Content Delivery Network
[Figure: vault with Files 1–6 mirrored onto four local servers, each storing three full files (e.g., Files 1, 3, 5); requests are routed to the mirrors]
Redundancy for resilience
- Mirroring content with local servers
- Each media file on multiple servers
10/ 28
Load Balancing through File Fragmentation
[Figure: each file split into two fragments (File 1A, File 1B, . . . , File 6B); fragments are spread across four servers, so multiple requests see partial completions and shared coherent access]
- Availability and better content distribution
- File segments on multiple servers
11/ 28
Problem Statement
[Figure: encoded fragments f1(X), f2(X), f3(X), f4(X) stored on four distinct nodes, serving incoming requests]
Quantify mean access time
- with the number of fragments for a single message X,
- with the encoding and storage fi(X) for the fragmented message X = (X1, . . . , Xk) at n distinct nodes.
11/ 28
Problem Statement
[Figure: replication stores copies (A, A, B, B) on four servers, while coded storage keeps (A, B, A+B, A−B); both serve the same stream of requests]
Problem: Quantify the latency gains offered by distributed coding
Solution: Coded storage offers scaling gains over replication
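The coded layout above can be made concrete with a small sketch. Storing A, B, A+B, A−B lets any two of the four servers reconstruct both halves, whereas with copies (A, A, B, B) the pair of A-servers cannot recover B. Integer arithmetic stands in for a finite-field code here, purely for illustration:

```python
# Sketch: (4, 2) MDS-style storage of a file split into halves A and B.
# Servers hold A, B, A+B, A-B; any 2 of the 4 suffice to recover (A, B).
from itertools import combinations

A, B = 7, 3  # hypothetical file halves

# Each server stores (coefficients on A and B, stored value).
servers = [((1, 0), A), ((0, 1), B), ((1, 1), A + B), ((1, -1), A - B)]

def recover(two_servers):
    """Solve the 2x2 linear system given any two stored symbols."""
    (a1, b1), v1 = two_servers[0]
    (a2, b2), v2 = two_servers[1]
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # the pair is not information-complete
    return ((v1 * b2 - v2 * b1) / det, (a1 * v2 - a2 * v1) / det)

# Every pair of the four coded servers recovers the whole file;
# under replication (A, A, B, B) the pair of A-copies could not.
for pair in combinations(servers, 2):
    assert recover(pair) == (A, B)
```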
12/ 28
System Model
File storage
- Each media file is divided into k pieces
- Pieces are encoded and stored on n servers

Arrival of requests
- Each request wants the entire media file
- Poisson arrival of requests with rate λ

Time in the system
- Until the reception of the whole file

Service at each server
- IID exponential service time with rate k/n
13/ 28
Question: Duplication versus MDS Coding
[Figure: duplication stores copies (A, A, B, B) while MDS coding stores distinct coded symbols (A, B, C, D) on the four servers]
Reduction of access time
- How to select the number of fragments for a single message?
- How to encode and store at the distributed storage nodes?
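As a first cut at the question, one can compare expected download times for a single request on an otherwise idle system (a sketch, not the talk's queueing analysis; assumes unit per-server service rate and k dividing n):

```python
# Sketch: expected download time for ONE request on an idle system,
# per-server service times i.i.d. Exp(mu). Assumes k divides n.
from fractions import Fraction

def harmonic(m):
    return sum(Fraction(1, i) for i in range(1, m + 1))

def replication_time(n, k, mu=1):
    # Each of k pieces sits on n/k servers; a piece finishes at the
    # fastest of its n/k copies, i.e. Exp((n/k)*mu); the file waits
    # for the slowest piece: expectation of a max of k such minima.
    return harmonic(k) / (Fraction(n, k) * mu)

def mds_time(n, k, mu=1):
    # Any k of n coded symbols suffice: expectation of the k-th
    # order statistic of n i.i.d. Exp(mu) variables.
    return (harmonic(n) - harmonic(n - k)) / mu

print(replication_time(4, 2))  # 3/4
print(mds_time(4, 2))          # 7/12, i.e. coding downloads faster
```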
14/ 28
Pertinent References (very incomplete)
N. B. Shah, K. Lee, and K. Ramchandran, “When do redundant requests reduce latency?” IEEE Trans. Commun., 2016.
G. Joshi, Y. Liu, and E. Soljanin, “On the delay-storage trade-off in content download from coded distributed storage systems,” IEEE Journ. Spec. Areas Commun., 2014.
Dimakis, Godfrey, Wu, Wainwright, and Ramchandran, “Network coding for distributed storage systems,” IEEE Trans. Info. Theory, 2010.
A. Eryilmaz, A. Ozdaglar, M. Medard, and E. Ahmed, “On the delay and throughput gains of coding in unreliable networks,” IEEE Trans. Info. Theory, 2008.
D. Wang, D. Silva, and F. R. Kschischang, “Robust network coding in the presence of untrusted nodes,” IEEE Trans. Info. Theory, 2010.
A. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A survey on network codes for distributed storage,” Proceedings of the IEEE, 2011.
Karp, Luby, and Meyer auf der Heide, “Efficient PRAM simulation on a distributed memory machine,” ACM Symposium on Theory of Computing, 1992.
Adler, Chakrabarti, Mitzenmacher, and Rasmussen, “Parallel randomized load balancing,” ACM Symposium on Theory of Computing, 1995.
Gardner, Zbarsky, Velednitsky, Harchol-Balter, and Scheller-Wolf, “Understanding response time in the Redundancy-d system,” SIGMETRICS, 2016.
B. Li, A. Ramamoorthy, and R. Srikant, “Mean-field analysis of coding versus replication in cloud storage systems,” INFOCOM, 2016.
15/ 28
Storage Coding – The Centralized MDS Queue
[Figure: a centralized MDS queue serving requests for file X]

exempli gratia: Shah, Lee, Ramchandran (2013); Lee, Shah, Huang, Ramchandran (2017); Vulimiri, Michel, Godfrey, Shenker (2012); Ananthanarayanan, Ghodsi, Shenker, Stoica (2012); Baccelli, Makowski, Shwartz (1989)
16/ 28
State Space Structure
Keeping Track of Partially Fulfilled Requests
- Element YS(t) of the state vector is the number of users holding a given subset S of pieces

Continuous-Time Markov Chain
- Y(t) = {YS(t) : S ⊂ [n], |S| < k} is a Markov process
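A quick count shows why tracking one counter per subset is unwieldy: the uncollapsed state vector has one coordinate for every S ⊂ [n] with |S| < k, which grows combinatorially in n, while the collapsed chain keeps only k counters. A small sketch (`full_dim` is an illustrative helper, not from the talk):

```python
# Sketch: dimension of the state descriptor before the collapse --
# one counter per subset S of [n] with |S| < k -- versus the k
# counters that survive the state space collapse.
from math import comb

def full_dim(n, k):
    return sum(comb(n, i) for i in range(k))

print(full_dim(4, 2))   # 5 subsets: the empty set plus four singletons
print(full_dim(10, 5))  # grows combinatorially, versus k = 5 after collapse
```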
17/ 28
Storage Coding – (n, k) Fork-Join Model
[Figure: an (n, k) fork-join system, with each request for file X forked to all n servers]

exempli gratia: Joshi, Liu, Soljanin (2012, 2014); Joshi, Soljanin, Wornell (2015); Sun, Zheng, Koksal, Kim, Shroff (2015); Kadhe, Soljanin, Sprintson (2016); Li, Ramamoorthy, Srikant (2016)
18/ 28
State Space Collapse
Theorem
For duplication and coding schemes under priority scheduling and the parallel processing model, the collection

S(t) = {S ⊂ [n] : YS(t) > 0, |S| < k}

of information subsets is totally ordered by set inclusion.

Corollary
Let Yi(t) be the number of requests holding i information symbols at time t; then

Y(t) = (Y0(t), Y1(t), . . . , Yk−1(t))

is a Markov process.
19/ 28
State Transitions of Collapsed System
Arrival of Requests
- Unit increase, Y0(t) = Y0(t−) + 1, with rate λ

Getting an Additional Symbol
- Unit increase, Yi(t) = Yi(t−) + 1
- Unit decrease, Yi−1(t) = Yi−1(t−) − 1

Getting the Last Missing Symbol
- Unit decrease, Yk−1(t) = Yk−1(t−) − 1
20/ 28
Tandem Queue Interpretation (No Empty States)
[Figure: tandem queue with arrival rate λ and state-dependent service rates γ0 Y0(t), γ1 Y1(t), . . .]

Duplication
- When all states are non-empty
- Number of servers available at level i is n/k
- Normalized service rate at level i: γi = 1 for i = 0, . . . , k − 1

MDS Coding
- When all states are non-empty
- One server available at each level i ≠ k − 1
- Normalized service rate at level i: γi = k/n for i < k − 1, and γk−1 = (k/n)(n − k + 1)
21/ 28
Tandem Queue Interpretation (General Case)
[Figure: tandem queue with pooled service rates and arrival rate λ]

Tandem Queue with Pooled Resources
- Servers with empty buffers help upstream
- Aggregate service rate at level i becomes Σ_{j=i}^{li(t)−1} γj, where li(t) = k ∧ min{l > i : Yl(t) > 0}
- No explicit description of the stationary distribution for the multi-dimensional Markov process
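With no explicit stationary distribution available, the pooled system can still be estimated by Monte Carlo. The sketch below simulates the collapsed chain for an (n, k) MDS code, assuming the normalized rates from the previous slide (γi = k/n for i < k − 1 and γk−1 = (k/n)(n − k + 1)) and estimating the mean sojourn time with Little's law:

```python
# Sketch: Monte Carlo simulation of the pooled tandem queue for an
# (n, k) MDS code; servers at empty downstream levels help the next
# busy level upstream. Mean sojourn time via Little's law, E[T] = E[N]/lam.
import random

def simulate_mds_tandem(n, k, lam, horizon=50_000, seed=1):
    random.seed(seed)
    gamma = [k / n] * (k - 1) + [(k / n) * (n - k + 1)]
    y = [0] * k          # y[i] = requests currently holding i symbols
    t, area = 0.0, 0.0   # elapsed time and integral of total occupancy
    while t < horizon:
        # Aggregate rate at each non-empty level i: its own gamma plus
        # the gammas of the empty levels between i and the next busy level.
        rates = [("arr", lam)]
        i = 0
        while i < k:
            if y[i] > 0:
                j = i + 1
                while j < k and y[j] == 0:
                    j += 1
                rates.append((i, sum(gamma[i:j])))
                i = j
            else:
                i += 1
        total = sum(r for _, r in rates)
        dt = random.expovariate(total)
        area += sum(y) * dt
        t += dt
        # Pick the next event proportionally to its rate.
        u, ev = random.random() * total, None
        for ev, r in rates:
            u -= r
            if u <= 0:
                break
        if ev == "arr":
            y[0] += 1
        else:
            y[ev] -= 1
            if ev + 1 < k:
                y[ev + 1] += 1   # request gains one more symbol
            # else: last missing symbol received, request departs
    return (area / t) / lam      # Little's law

print(simulate_mds_tandem(4, 2, lam=0.3))
```

The estimate should fall between the closed-form lower and upper bounds of the next slides; the arrival rate 0.3 matches the comparison slide at the end of the talk.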
22/ 28
Bounding and Separating
[Figure: tandem queue with arrival rate λ and fixed service rates µ0, µ1, . . .]

Theorem†
When λ < min_i µi, the tandem queue has the product-form distribution

π(y) = ∏_{i=0}^{k−1} (1 − λ/µi) (λ/µi)^{yi}

Uniform Bounds on Service Rate
Transition rates are uniformly bounded by

γi ≤ Σ_{j=i}^{li(y)−1} γj ≤ Σ_{j=i}^{k−1} γj ≜ Γi
†F. P. Kelly, Reversibility and Stochastic Networks. New York, NY, USA: Cambridge University Press, 2011.
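Combining the product form with Little's law, E[T] = (1/λ) Σi ρi/(1 − ρi) with ρi = λ/µi, gives computable sojourn-time bounds. A sketch in exact rational arithmetic, with Γi = Σ_{j≥i} γj as the pooled-rate upper bound, evaluated at the (4, 2) MDS rates:

```python
# Sketch: closed-form mean sojourn time of a product-form tandem queue,
# evaluated at the two rate vectors that bound the pooled system:
# gamma_i (upper bound on sojourn) and Gamma_i = sum_{j>=i} gamma_j
# (lower bound on sojourn).
from fractions import Fraction as F

def mean_sojourn(lam, mu):
    """E[T] = (1/lam) * sum_i rho_i/(1 - rho_i) with rho_i = lam/mu_i."""
    assert all(lam < m for m in mu), "stability requires lam < min mu_i"
    rho = [F(lam) / m for m in mu]
    return sum(r / (1 - r) for r in rho) / F(lam)

n, k, lam = 4, 2, F(3, 10)
gamma = [F(k, n)] * (k - 1) + [F(k, n) * (n - k + 1)]  # (4, 2) MDS rates
Gamma = [sum(gamma[i:]) for i in range(k)]

print(mean_sojourn(lam, Gamma))  # lower bound on mean sojourn
print(mean_sojourn(lam, gamma))  # upper bound on mean sojourn
```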
23/ 28
Bounds on Tandem Queue
[Figure: the pooled tandem queue compared against two product-form tandem queues, one with the higher rates Γi Yi(t) and one with the lower rates γi Yi(t)]

Lower Bound
Higher values of the service rates yield a lower bound on the queue distribution:

π(y) = ∏_{i=0}^{k−1} (1 − λ/Γi) (λ/Γi)^{yi}

Upper Bound
Lower values of the service rates yield an upper bound on the queue distribution:

π(y) = ∏_{i=0}^{k−1} (1 − λ/γi) (λ/γi)^{yi}
24/ 28
Mean Sojourn Time
[Figure: mean sojourn time versus arrival rate (0.1 to 0.95) under replication coding; curves for upper bound, simulation, approximation, and lower bound]
25/ 28
Mean Sojourn Time
[Figure: mean sojourn time versus arrival rate (0.1 to 0.95) under the (4, 2) MDS code; curves for upper bound, simulation, approximation, and lower bound]
26/ 28
Approximating Pooled Tandem Queue
[Figure: the pooled tandem queue with rates γi Yi(t) approximated by a tandem queue with effective rates µi Yi(t)]

Independence Approximation with Statistical Averaging
The service rate equals the base service rate γi plus a cascade effect, averaged over time:

µk−1 = γk−1
µi = γi + µi+1 πi+1(0)

π(y) = ∏_{i=0}^{k−1} (1 − λ/µi) (λ/µi)^{yi}
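The recursion is easy to evaluate numerically; a sketch for the (4, 2) MDS rates at arrival rate 0.3, using the product-form empty probability πi+1(0) = 1 − λ/µi+1:

```python
# Sketch of the independence approximation: each level's effective rate
# is its own gamma_i plus the downstream rate, weighted by the
# probability pi_{i+1}(0) = 1 - lam/mu_{i+1} that level i+1 is empty.

def approx_rates(lam, gamma):
    k = len(gamma)
    mu = [0.0] * k
    mu[k - 1] = gamma[k - 1]
    for i in range(k - 2, -1, -1):
        p_empty = 1 - lam / mu[i + 1]     # product-form empty probability
        mu[i] = gamma[i] + mu[i + 1] * p_empty
    return mu

def approx_sojourn(lam, gamma):
    # Little's law on the approximating product-form tandem queue.
    mu = approx_rates(lam, gamma)
    return sum((lam / m) / (1 - lam / m) for m in mu) / lam

n, k, lam = 4, 2, 0.3
gamma = [k / n] * (k - 1) + [(k / n) * (n - k + 1)]  # (4, 2) MDS rates
print(approx_sojourn(lam, gamma))
```

The result lands between the closed-form lower and upper bounds, close to the lower bound, consistent with the plots above.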
27/ 28
Comparing Replication versus MDS Coding
[Figure: mean sojourn time versus number of servers (2 to 20) for repetition and MDS coding, simulation and approximation; arrival rate 0.3 units and coding rate n/k = 2]
28/ 28
Summary and Discussion
Main Contributions
- Analytical framework for the study of distributed computation and storage systems
- Upper and lower bounds to analyze replication and MDS codes
- A tight closed-form approximation to study distributed storage codes
- MDS codes are better suited for large distributed systems
- Mean access time is better for MDS codes at all code rates