


A VoD System for Massively Scaled, Heterogeneous Environments: Design and Implementation

Kangwook Lee, Lisa Yan, Abhay Parekh and Kannan Ramchandran
Department of EECS, University of California at Berkeley, Berkeley, CA
[email protected], [email protected], {parekh, kannanr}@eecs.berkeley.edu

Abstract—We propose, analyze and implement a general architecture for massively parallel VoD content distribution. We allow for devices that have a wide range of reliability, storage and bandwidth constraints. Each device can act as a cache for other devices and can also communicate with a central server. Some devices may be dedicated caches with no co-located users. Our goal is to allow each user device to be able to stream any movie from a large catalog, while minimizing the load of the central server.

First, we architect and formulate a static optimization problem that accounts for various network bandwidth and storage capacity constraints, as well as the maximum number of network connections for each device. Not surprisingly, this formulation is NP-hard. We then use a Markov approximation technique in a primal-dual framework to devise a highly distributed algorithm which is provably close to the optimal. Next we test the practical effectiveness of the distributed algorithm in several ways. We demonstrate remarkable robustness to system scale and changes in demand, user churn, network failures and node failures via a packet-level simulation of the system. Finally, we describe our results from numerous experiments on a full implementation of the system with 60 caches and 120 users on 20 Amazon EC2 instances.

In addition to corroborating our analytical and simulation-based findings, the implementation allows us to examine various system-level tradeoffs. Examples include: (i) the split between server-to-cache and cache-to-device traffic, (ii) the tradeoff between cache update intervals and the time taken for the system to adjust to changes in demand, and (iii) the tradeoff between the rate of virtual topology updates and convergence. These insights give us the confidence to claim that a much larger system on the scale of hundreds of thousands of highly heterogeneous nodes would perform as well as our current implementation.

I. INTRODUCTION

The shape of the internet is changing. On the one hand, large internet exchanges and datacenters have made it possible to centralize a lot of compute and storage resources. It is clear that established players such as Netflix, Google and Amazon are more likely to exploit the reliability and economies of scale that come from such architectures, and that is of course what they are doing. On the other hand, the edge of the internet is growing exponentially. Phones, tablets, laptops, and ebook readers are growing in power and sophistication to the point that they resemble the desktop computers of a few years ago. Yet fully 50% of internet traffic does not originate from data centers and hierarchical CDNs [1]. This traffic consists of applications such as file sharing and P2P, and relies more heavily on edge devices, which take on the role of potentially unreliable servers. As we play out the evolution of content and the internet, it seems clear that both kinds of content distribution will co-exist. Videos of police action in an oppressive state are easier to detect and throttle in a centralized architecture than in a distributed one, and there will always be situations in which small groups of individuals will want to share content without the “prying eyes” of a media giant.

We have been working on a highly distributed edge-based scheme where the content is streamed video with quality-of-service constraints. The demand for these videos is not specified to the system. Each edge device has limited storage but can store content from any movie (even ones that the owner of that device is not watching). The devices are assumed to have limited connectivity and the network is allowed to be unreliable. As demand and network connectivity fluctuate, so does what is stored at each node. In other words, our goal is to test the limits of how unreliable and distributed we can make the infrastructure while still meeting the stringent quality-of-service constraints of streamed video. Central to our architecture is the existence of a single reliable server, or Seedbox [2], that can fill in the gaps of service when our distributed algorithms cannot meet QoS constraints; the objective of our distributed algorithms is to ensure that, for any set of adverse network conditions, the load on the Seedbox is minimized.

Our approach is the following. First, we formulate the problem as a static convex optimization problem that is analytically tractable. In [3] we explained the theoretical justification of our approach, and “solved” this NP-hard problem through a novel relaxation based on a Markov approximation technique in a primal-dual formulation, which results in a highly distributed algorithm that converges to a near-optimal solution. The algorithm approximately specifies the optimal allocation of rate and storage resources at each node, as well as the best network topology that respects the network connectivity constraints. In this paper we focus on the systems-level issues involved in taking an algorithm derived from theory to a full implementation. As an intermediate step, we have designed and implemented an extensive packet-level simulator of the system. This proved to be very useful in studying the dynamic and robustness properties of the algorithm, and helped in the implementation phase. We present a number of results to show that our approach would work well if massively deployed in highly heterogeneous environments.

2013 IEEE 21st International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems
1526-7539/13 $26.00 © 2013 IEEE
DOI 10.1109/MASCOTS.2013.8

A. Problem formulation

Our goal is to build a system that jointly solves the following problems:

1) Content Placement: What content should be stored at each device/node given the storage constraints, network capacity and current demand?

2) Overlay Topology: Given that each node can support only a bounded number of end devices, how should end devices be matched to the nodes?

3) Minimal Server Load: When there is no available node from which an end device can watch a specific piece of content, the Seedbox (or central server) “fills in the gap” by streaming directly to it. We wish to minimize the load on this server.

We illustrate these problems further in the example depicted in Figure 1. The system has two 1 GB videos, 𝐴 and 𝐵, which must be delivered at a streaming rate of 1 Mbps. There are 4 users: two request video 𝐴 and two request video 𝐵. The three cache nodes are constrained by bandwidth, maximum degree (a bound on the number of simultaneously supported streaming connections) and storage. Figure 1b shows that under a certain “bad” topology and a “bad” content allocation scheme, demand cannot be satisfied. In Figure 1c, a “good” content placement strategy is chosen. In Figure 1d, a “good” topology is also chosen. As we can see, the three problems enumerated above are closely related. Further, each problem is hard in isolation: there are an exponential number of possible topologies (from which we must select in a distributed manner) and, as we will see (although it may be clear to some readers even at this point), the content selection problem for a fixed topology is also NP-hard.
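The combinatorial blow-up in the topology choice can be made concrete with a brute-force count on a toy instance in the spirit of Figure 1. The cache names and degree bounds below are illustrative, not the figure's actual values:

```python
from itertools import combinations, product

def powerset(items):
    """All subsets of `items`, as sets."""
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def count_topologies(caches, users, degree_bound):
    """Count bipartite overlay graphs in which each cache serves at most
    degree_bound[h] users. Brute force: viable only for toy sizes, which
    is exactly the point."""
    options = powerset(caches)  # each user picks some set of caches
    count = 0
    for choice in product(options, repeat=len(users)):
        deg = {h: 0 for h in caches}
        for picked in choice:
            for h in picked:
                deg[h] += 1
        if all(deg[h] <= degree_bound[h] for h in caches):
            count += 1
    return count
```

Even with 3 caches and 4 users there are (2³)⁴ = 4096 candidate overlays before degree bounds prune the set; the count grows exponentially in the number of users.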

II. RELATED WORK

Distributed video-on-demand systems such as those offered by Netflix rely on a network of reliable, well-connected servers. As we have explained, our work is not attempting to improve on these systems; rather, it proposes a solution under much more unreliable settings. Pure peer-to-peer networks such as BitTorrent are built for sharing files, whereas our system includes a Seedbox server and is designed to accommodate quality-of-service constraints. Our system also goes beyond the traditional torrent architecture in that it accommodates inter-torrent caching and cross-torrent content sharing in a VoD setting.

The optimization of VoD systems has received wide attention in the academic literature [4]–[10]. Almeida et al. [4] studied the delivery-cost minimization problem under a fixed topology by optimizing over content replication and routing.

Boufkhad et al. [5] investigated the problem of maximizing the number of videos that can be served by a collection of peers. Zhou et al. [6] focused on minimizing the load imbalance of video servers while maximizing system throughput. Tan and Massoulie [9] studied the problem of optimal content placement in P2P networks, with the goal of maximizing the utilization of peer uplink bandwidth resources. Optimal content placement strategies are identified in a particular scenario of a limited content catalog under the framework of loss networks. Their work assumes that the peers' storage capacity grows unboundedly with system size. In contrast, our work makes no assumption on storage capacities and also takes the overlay topology into account. Applegate et al. [10] formulated the content placement problem as a mixed integer program (MIP) that takes into account constraints such as disk space and link bandwidth. However, they assume knowledge of the popularity of content under a fixed topology, with a video being stored either in full or not at all. In our work, we use a class of network codes that enables fractional storage, and we further do not assume any prior knowledge of the demand distribution. We also optimize over the choice of all feasible topology graphs.

With regard to network resource utilization, Borst et al. [11] solved a link-bandwidth utilization problem assuming a tree structure with limited depth. A linear program (LP) is formulated, and under the assumption of symmetric link bandwidth, demand, and cache size, a simple local greedy algorithm is designed to find a close-to-optimal solution. Valancius et al. [12] propose an LP-based heuristic to calculate the number of video copies placed at customer home gateways. The network topology in our work is not constrained to be a tree, and the video request patterns can be arbitrary in different network areas. Zhou and Xu [13] aimed to minimize the load imbalance among servers subject to disk space and egress link capacity from servers. In contrast, we consider link capacity constraints that may exist anywhere in the network.

Topology building is also an important design dimension and has been studied in various works [14]–[16]. While most works focus on enforcing locality-awareness and/or improving ISP-friendliness, they make the simplifying assumption that the graph is fully connected, i.e., no node-degree bound is taken into consideration. Zhang et al. solve the problem of optimal P2P streaming under node-degree constraints [17]. However, their topology selection algorithm depends on global statistics, which are easily accessible in a live-streaming scenario. Directly applying their technique in the video-on-demand setting of interest in this paper would require global statistics of all users' utility functions, which can create enormous overhead. Our distributed algorithm requires knowledge of only local information on neighboring overlay link rates.

To the best of our knowledge, no other work jointly optimizes topology graph selection, content placement and link rate allocation. Our solution is fully distributed and adapts well to system dynamics.




Fig. 1: A simple example of the VoD caching problem. The system has two videos of size 1 GB and rate 1 Mbps, and there are 2 users requesting each video. The system employs 3 cache nodes with constraints on storage, bandwidth and out-degree as shown in (a). The problem is to decide, for each cache, which videos to store, which users to connect to, and how much bandwidth to allocate for each user. These questions are coupled. The connections between the cache nodes and users in (b) form a “bad” topology, and the content placement is non-optimal. The content placement in (c) is “good”. In (d), the topology is “good”, and with the same content placement strategy as in (c), only one user is in deficit of half a video. In general, finding the “best” storage, bandwidth and topology combination is a combinatorially hard problem.

Fig. 2: Caches and users connected by a physical network. The link capacity constraints can exist anywhere in the network.

III. MATHEMATICAL PROBLEM FORMULATION

In this section, we cast the VoD optimization problem as a convex optimization problem. We analyzed this formulation extensively in [3] and showed that a distributed algorithm closely approximates the optimal. Our formulation assumes a static setting: the video catalog, the set of users and subscriptions, and the set of caches are fixed.

As illustrated in Figure 2, a set of caches 𝐻 and a set of users 𝑈 are connected by a fixed physical network consisting of capacity-limited links. Caches and users are connected by an overlay graph configuration 𝑔, which is expressed by the set of overlay links 𝑅 and the corresponding routing matrix 𝐴; under a graph configuration 𝑔 we denote these by 𝑅𝑔 and 𝐴𝑔. Each overlay link 𝑟 ∈ 𝑅 consists of a set of underlay links 𝐿𝑟 ⊂ 𝐿, and we say 𝑙 ∈ 𝑟 if 𝑙 ∈ 𝐿𝑟. An overlay link 𝑟 = (ℎ, 𝑢) enables cache node ℎ to send data to node 𝑢 in the overlay graph by setting up TCP/UDP connections. The routing matrix 𝐴 := (𝐴𝑙𝑟, (𝑙, 𝑟) ∈ 𝐿 × 𝑅) is defined as usual, with 𝐴𝑙𝑟 = 1 if 𝑙 ∈ 𝑟 and 0 otherwise. We denote the connected neighborhood of node 𝑣 by 𝑁𝑔𝑣. Node 𝑣 cannot connect to more than 𝐵𝑣 neighbors, which leads to the node-I/O constraints. Let 𝐺 = {𝑔 : ∣𝑁𝑔𝑣∣ ≤ 𝐵𝑣 ∀𝑣 ∈ 𝐻 ∪ 𝑈} be the set of feasible overlay graphs. Let link 𝑙 ∈ 𝐿 have capacity 𝑐𝑙, and let 𝑥𝑟 be the rate on overlay link 𝑟. This leads to the natural routing constraints 𝐴𝑥 ≤ 𝑐, where 𝑥 is the column vector of the overlay rates 𝑥𝑟 and 𝑐 is the column vector of the link capacities 𝑐𝑙.

The video catalog consists of the videos in the set 𝑀. Each video 𝑚 has size 𝛽𝑚 and is streamed at a constant rate 𝛾𝑚. Since we assume a fixed demand in this model, we denote the set of users watching 𝑚 by 𝑈𝑚. To model the storage constraints, let 𝑠ℎ be the storage capacity of cache node ℎ, and denote by 𝑠 := (𝑠ℎ, ℎ ∈ 𝐻) the column vector of storage capacities. Let 𝑊 := (𝑊ℎ𝑚, ℎ ∈ 𝐻, 𝑚 ∈ 𝑀) be the storage matrix, where 𝑊ℎ𝑚 ∈ {0, 1} indicates whether video 𝑚 is stored on cache node ℎ (𝑊ℎ𝑚 = 1) or not (𝑊ℎ𝑚 = 0). Denote by 𝛽 := (𝛽𝑚, 𝑚 ∈ 𝑀) the vector of the sizes (in MB) of all videos. The storage constraints can then be expressed as 𝑊𝛽 ≤ 𝑠. Availability constraints are also modeled: caches can only serve stored movies. From cache ℎ to user 𝑢, the streaming rate must be 0 if the cache does not store the movie 𝑚 being viewed by the user; if the cache stores the movie, the streaming rate can be anything no greater than 𝛾𝑚. This is expressed as 𝑥𝑟:=(ℎ,𝑢) ≤ 𝑊ℎ𝑚𝛾𝑚.
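The three constraint families can be written as a small feasibility check. This is an illustrative sketch (the data layout and all numbers in the test instance are ours, not the paper's): availability bounds each overlay rate by 𝑊ℎ𝑚𝛾𝑚, routing sums the overlay rates crossing each underlay link, and storage bounds ∑𝑚 𝑊ℎ𝑚𝛽𝑚 at each cache.

```python
def feasible(x, W, A, c, beta, s, gamma, watching):
    """Check availability, routing, and storage constraints.

    x: {(h, u): overlay rate}          W: {(h, m): stored fraction}
    A: {l: set of overlay links routed over underlay link l}
    c: {l: capacity}  beta: {m: size}  s: {h: storage}
    gamma: {m: streaming rate}         watching: {u: movie m}
    """
    # Availability: x_(h,u) <= W_hm * gamma_m for the movie u watches
    for (h, u), rate in x.items():
        m = watching[u]
        if rate > W.get((h, m), 0.0) * gamma[m] + 1e-9:
            return False
    # Routing: total overlay rate over each underlay link <= its capacity
    for l, routes in A.items():
        if sum(x.get(r, 0.0) for r in routes) > c[l] + 1e-9:
            return False
    # Storage: stored video (fractions) fit within each cache's capacity
    for h in s:
        used = sum(frac * beta[m] for (hh, m), frac in W.items() if hh == h)
        if used > s[h] + 1e-9:
            return False
    return True
```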

Let 𝑧𝑢 = ∑𝑟=(ℎ,𝑢):ℎ∈𝑁𝑔𝑢 𝑥𝑟 be the total received rate of user 𝑢, and let 𝑉𝑢(𝑧) be a concave function that represents the utility of user 𝑢 when its received rate is 𝑧. Table I lists all relevant notation.

Now, we have the following optimization problem. The objective is to find a graph 𝑔, content placement 𝑊, and rate allocation 𝑥 which jointly maximize the sum of user utilities under the constraints:

max over 𝑔 ∈ 𝐺, 𝑥𝑟 ≥ 0, 𝑊ℎ𝑚 ∈ {0, 1} of ∑𝑢∈𝑈 𝑉𝑢(𝑧𝑢)
s.t. 𝑥𝑟:=(ℎ,𝑢) ≤ 𝑊ℎ𝑚𝛾𝑚, 𝐴𝑔𝑥 ≤ 𝑐, 𝑊𝛽 ≤ 𝑠.

The above optimization is very difficult to solve due to the exponentially large size of the feasible graph set 𝐺 and the integer constraints on the storage matrix 𝑊.

TABLE I: Key Notations

Parameters:
𝐻, 𝑈, 𝑀: set of caches / users / movies
𝑈𝑚: set of users watching video 𝑚
𝛾𝑚, 𝛽𝑚: video 𝑚's streaming rate and size
𝑠ℎ: storage capacity of cache ℎ
𝐺, 𝐵𝑣: set of feasible overlay graphs / node-I/O constraints
𝐿, 𝑐𝑙: set of underlay links / link capacity

Auxiliary variables:
𝜃𝑙: shadow price of link 𝑙
𝑞𝑟 = ∑𝑙∈𝑟 𝜃𝑙: shadow price of route 𝑟
𝜆𝑟, Σℎ,𝑚: demand index of route 𝑟 / of movie 𝑚 at cache ℎ
𝜔ℎ: storage price of cache ℎ

Decision variables:
𝑥𝑟: route rate of 𝑟
𝑊ℎ𝑚: storage of video 𝑚 on cache ℎ
𝑔, 𝐴𝑔, 𝑅𝑔: overlay graph / routing matrix / overlay links
𝑁𝑔𝑣: connected neighborhood of 𝑣 under 𝑔
𝑝𝑔: probability of each topology graph 𝑔

IV. SYSTEM

In this section, we describe the overall architecture of our system. Before introducing the architecture and the system components, we first briefly illustrate how we apply codes in the context of video streaming and caching. The coding technique is used to eliminate the combinatorial nature of the resource allocation problem by allowing for “fractional” content streaming and caching. We then describe the overall architecture of our system and provide a detailed explanation of each component, including the distributed algorithms used to optimally utilize cache resources.

A. Streaming and caching with codes

Fig. 3: A pictorial illustration of codes for video streaming. Each video is divided into multiple scenes, and each scene is sliced and then encoded using MDS codes. In this example, MDS codes with (𝑛, 𝑘) = (32, 16) are used.

Figure 3 shows the conceptual way of applying codes for streaming videos. We first cut a video into multiple scenes, each with a fixed duration. Then, we slice each scene into 𝑘 equal-sized chunks.¹ Lastly, for each scene, we encode the 𝑘 chunks using an (𝑛, 𝑘) MDS code, obtaining 𝑛 coded chunks. Media servers store all 𝑛 chunks of every scene.

Caches can partially store a video by storing an equal number of chunks of all scenes. For example, Figure 4 shows how caches store half of a video in our system. Assume that both Cache A and Cache B decide to store half of a video. Each cache will first randomly choose a subset of {1, 2, ..., 𝑛} of size 𝑘/2. After choosing this random set of indices, it will store the corresponding encoded chunks of all scenes. Users need any 𝑘 encoded chunks to decode a scene.

¹The last chunk will be zero-padded to make all chunks equal-sized.

Fig. 4: Caches randomly choose and store a subset of the encoded chunks. Users successively download chunks from the connected caches and the server (if the caches cannot cumulatively provide the required 𝑘 chunks per scene). After receiving 𝑘 chunks, users decode the scene and watch it after the current scene.
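The “any 𝑘 of 𝑛” decoding property can be illustrated with a toy Reed-Solomon construction over GF(257), one standard way to realize an (𝑛, 𝑘) MDS code. The paper does not specify its code construction, so this is a stand-in for illustration, not the system's implementation:

```python
# Toy (n, k) MDS code: Reed-Solomon over the prime field GF(257).
P = 257  # prime > 255, so byte-valued data symbols fit in the field

def encode(data, n):
    """Encode k data symbols as n chunks: evaluations at x = 1..n of the
    degree-(k-1) polynomial whose coefficients are the data symbols."""
    def eval_poly(x):
        y = 0
        for coef in reversed(data):  # Horner's rule
            y = (y * x + coef) % P
        return y
    return [(x, eval_poly(x)) for x in range(1, n + 1)]

def decode(chunks, k):
    """Recover the k data symbols from ANY k chunks, by Lagrange
    interpolation (a unique degree-(k-1) polynomial passes through
    any k points)."""
    pts = chunks[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(pts):
        basis = [1]  # coefficients (low-to-high) of prod_{j != i} (x - xj)
        denom = 1
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            nxt = [0] * (len(basis) + 1)
            for t, a in enumerate(basis):
                nxt[t] = (nxt[t] - xj * a) % P
                nxt[t + 1] = (nxt[t + 1] + a) % P
            basis = nxt
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, P - 2, P) % P  # division via Fermat inverse
        for t in range(k):
            coeffs[t] = (coeffs[t] + basis[t] * scale) % P
    return coeffs
```

A cache holding fraction 𝑊 of a video would keep a fixed random subset of roughly 𝑊·𝑘 of the 𝑛 chunk indices per scene; a user decodes as soon as any 𝑘 distinct chunks have arrived, from any mix of caches and the server.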

B. Overall Architecture

Fig. 5: System Architecture

The system (Figure 5) includes four components: a lightweight Tracker, Servers, Caches, and Users. Each component can run on a separate machine or be co-located. The tracker contains a list of videos and the identity of the server (in case there are multiple Seedbox servers, it would list multiple server identities). It also contains the IP addresses of each of the cache nodes in the system. Note that it does not contain information on which videos are stored on a given cache. In the following, we briefly explain the dynamics of the algorithm, and defer detailed explanations and pseudocode to subsequent sections.
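A minimal sketch of the tracker state implied by this description (class and method names are assumed, not taken from the implementation). Note what is deliberately absent: the tracker holds no mapping from caches to stored videos.

```python
import random

class Tracker:
    """Lightweight tracker: video catalog, server identity, live caches."""

    def __init__(self, videos, server_addr):
        self.videos = list(videos)      # catalog: video IDs only
        self.server_addr = server_addr  # single Seedbox identity
        self.caches = set()             # addresses of registered caches
        # Deliberately NO per-cache content index: caches manage their
        # own storage without reporting it to the tracker.

    def register_cache(self, addr):
        self.caches.add(addr)

    def sample_caches(self, count):
        """Random sample of online caches for a joining user; the sampled
        caches may or may not hold the user's desired video."""
        pool = list(self.caches)
        return random.sample(pool, min(count, len(pool)))
```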



When a cache joins the system, it first registers itself with the tracker and immediately starts running the resource allocation algorithm (see Section IV-C for more details). The resource allocation algorithm appropriately updates how many chunks of which movies to cache, and adjusts upload rates to connected users in a distributed manner. Note that while running the algorithm, the cache does not need to communicate with the tracker. This is possible due to the fully distributed nature of our algorithm, which therefore reduces protocol chatter dramatically.

When a user joins the system, it retrieves from the tracker a list of videos available in the system. After choosing a video to watch, it registers itself with the tracker as a user watching the chosen video. It then retrieves the IP address and port number of one of the servers storing the chosen video, and connects to that server. After this server connection is established, the user retrieves a set of randomly sampled online caches (which may or may not have the desired content). After retrieving this set of available caches, the user picks a random subset of them and connects to it. The maximum number of caches that the user connects to is strictly enforced, determined either by the user device's capabilities or by the user's preferences.

After these connections to the server and caches have been established, the user requests the first video frame from the server so that it can start watching the video immediately (when a user controls its playback, the frame corresponding to the playback's time pointer is also requested from the server). While watching the first frame, it keeps downloading available chunks of the next frame from the connected caches at the rate determined by the caches. If the user successfully downloads the number of chunks needed to decode that frame, it decodes the frame and sends a ‘SATISFIED’ signal to the connected caches. On the other hand, if the connected caches fail to stream the number of chunks needed to decode the next frame by a certain deadline (e.g., a targeted number of seconds before the current frame is about to be played out), the user fills in the gaps by requesting the missing chunks from the server. Once the combination of the server and connected caches supplies the required number of chunks, the user decodes the next frame in preparation to watch it, and signals an ‘UNSATISFIED’ message to the connected caches.

Since it is possible that the initial set of caches a user connects to does not have the video of interest, the user also runs a topology update algorithm (see Section IV-D) to find a better-matching set of caches that have the desired content. In [3], we proved that the topology updates are done in a manner that is guaranteed to approximately find the best matching set of caches for each user.

C. Cache Algorithms

The resource allocation algorithm at each cache periodically updates 1) the upload rates assigned to each connected user, and 2) the number of cached chunks for each video, while satisfying storage, bandwidth, and connectivity-degree constraints. Denote the upload rate to user 𝑢 by 𝑥𝑢, and the stored fraction of video 𝑚 by 𝑊𝑚. The upload rate variable 𝑥𝑢 is nonnegative and bounded by the streaming rate of the video being watched by user 𝑢. The cache variable for video 𝑚 is also nonnegative and bounded by 1, representing the fraction of video 𝑚 that is cached. The cache node also updates some auxiliary variables needed to satisfy the resource constraints. Every 𝑇update seconds, the cache node updates the upload rate variables and the storage variables. As the update equations indicate, these updates take as input the users' signals relating to their current level of satisfaction (i.e., ‘SATISFIED’ or ‘UNSATISFIED’ as described earlier). The cache counts the number of received ‘UNSATISFIED’ signals for each movie, and increases the stored amount of that movie and the upload rate proportionally. Note that an outward drift of a variable is not applied if the variable is out of its feasible range. These cache updates are regulated by auxiliary variables that represent shadow prices of the resources. For example, if during an update phase a cache exhausts all of its resources, these auxiliary variables increase, forcing the other variables to reduce their values in response. The pseudo-code of this part is presented in Algorithm 1.

All variables updated by the algorithm are in units of chunks. For example, the algorithm converts upload rates, which are real numbers, into units of ‘number of chunks per frame’. Consider a movie with a streaming rate of 2 Mbps, and assume we use an (𝑛, 𝑘) = (40, 20) code to encode the video per frame. Recall that this implies that each frame consists of 20 chunks, which are encoded into 40 chunks with the property that any 20 of these 40 chunks suffice to decode the frame. If an upload rate variable indicates 1.5 Mbps, the corresponding number of chunks of the frame to be sent to the user is (1.5 Mbps / 2 Mbps) × 20 chunks = 15 chunks. Similarly, the cache storage variables are also converted to physically meaningful units: 𝑊𝑚 = 0.25 is equivalent to storing one quarter of video 𝑚, or storing 5 chunks of each frame.² If the updated count of the number of chunks for a cache is greater than the actual number of stored chunks, the cache will download the missing chunks from the server. On the other hand, if the count becomes smaller, the cache will delete the appropriate number of stored chunks.
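The two unit conversions above, written out as code (the helper names are ours):

```python
def rate_to_chunks(upload_mbps, stream_mbps, k):
    """Convert an upload-rate variable to chunks per frame, where the
    frame's k chunks together carry the full streaming rate. Non-integer
    results are rounded down."""
    return int(upload_mbps / stream_mbps * k)

def fraction_to_chunks(w_m, k):
    """Convert a stored-fraction variable W_m to stored chunks per frame,
    rounded down."""
    return int(w_m * k)
```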

The (virtual) variables are updated every 𝑇update time units; in our system, the default is 𝑇update = 0.01 seconds. These frequently changing variables are applied to the system periodically, with longer periods: the upload rates are applied according to the corresponding variables every 𝑇rate seconds, and the cache variables every 𝑇storage seconds. In the performance evaluation section, we observe that 𝑇storage controls the tradeoff between the server-to-cache traffic and the update frequency of the caches.

²If the resulting number is not an integer, we round it down to the nearest integer.



Algorithm 1 Cache’s Resource Allocation Algorithm

1: 𝑐 = Available upload bandwidth of the cache
2: 𝑠 = Available storage of the cache
3: Wait 𝑇update.
4: for each connected user 𝑢 do
5:     𝑚 = ID of the video being watched by user 𝑢
6:     𝑔𝑢 = "Satisfaction-level" binary signal of user 𝑢
7:     Update rate 𝑥𝑢: Δ𝑥𝑢 = 𝜖(𝑔𝑢 − 𝑞 − 𝜆𝑢)
8:     Update availability price 𝜆𝑢: Δ𝜆𝑢 = 𝜖(𝑥𝑢 − 𝑊𝑚𝛾𝑚)
9: end for
10: Update bandwidth price 𝑞: Δ𝑞 = 𝜖(∑𝑢 𝑥𝑢 − 𝑐)
11: for each video 𝑚 do
12:     𝑈𝑚 = Set of connected users watching video 𝑚
13:     Λ𝑚 = ∑𝑢∈𝑈𝑚 𝜆𝑢 = Sum of availability prices
14:     Update stored fraction 𝑊𝑚: Δ𝑊𝑚 = 𝜖(Λ𝑚 − 𝛽𝑚𝜔)
15: end for
16: 𝑆 = ∑𝑚 𝑊𝑚𝛽𝑚 = Sum of used storage
17: Update storage price 𝜔: Δ𝜔 = 𝜖(𝑆 − 𝑠)
18: Repeat.
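One round of these updates can be sketched in Python. The function signature, the data containers, and the clipping helper are our illustrative assumptions; 𝛽𝑚 and 𝛾𝑚 denote video 𝑚's size and streaming rate as in the update equations, and drifts that would leave a variable's feasible range are clipped, as the text requires.

```python
def cache_update(users, videos, x, lam, W, q, omega, c, s, eps=0.01):
    """One T_update round of the cache's updates (a sketch of Algorithm 1).

    users : list of (user_id, video_id, g), g = 1 ('UNSATISFIED') or 0
    videos: video_id -> (beta, gamma), the video's size and streaming rate
    x, lam: per-user upload-rate variables and availability prices
    W     : per-video stored fractions; q, omega: bandwidth/storage prices
    """
    clip = lambda v, lo, hi: max(lo, min(hi, v))
    for uid, m, g in users:
        gamma = videos[m][1]
        x[uid] = clip(x[uid] + eps * (g - q - lam[uid]), 0.0, gamma)
        lam[uid] = max(0.0, lam[uid] + eps * (x[uid] - W[m] * gamma))
    q = max(0.0, q + eps * (sum(x.values()) - c))        # bandwidth price
    for m in videos:
        beta = videos[m][0]
        Lam = sum(lam[uid] for uid, vm, _ in users if vm == m)
        W[m] = clip(W[m] + eps * (Lam - beta * omega), 0.0, 1.0)
    S = sum(W[m] * videos[m][0] for m in videos)         # used storage
    omega = max(0.0, omega + eps * (S - s))              # storage price
    return x, lam, W, q, omega
```

Run repeatedly, the 'UNSATISFIED' signals (g = 1) push the upload rates and the stored fractions up, while the two price variables push back whenever the bandwidth budget 𝑐 or storage budget 𝑠 is exhausted.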

D. User Algorithms

As described above, when a user starts watching a video, it downloads the first frame from the connected server. After the download completes, it starts watching the first frame while simultaneously downloading the next frame from the connected caches and the server. Upon a user's request for a video, each cache sends 1) a list of its cached chunks to the user, and 2) a suppliable upload rate, in units of the number of chunks per frame the cache can offer the user. The user then decides which chunks to download from each cache. If the user is not able to download all of its needed chunks from the caches, it sends an 'UNSATISFIED' (1) signal to the caches and downloads the missing chunks from the server. If the connected caches can cumulatively provide the required number of chunks, the user sends a 'SATISFIED' (0) signal to the connected caches. As described in the cache's algorithm section, these satisfaction-level binary signals are collected by the connected caches and used as a way to measure the 'local' demand for a movie. Thus, even before the current frame ends, the next frame is decoded and ready to be watched seamlessly, as needed to sustain the streaming rate. While repeating the procedure described above to maintain uninterrupted streaming, users also run the topology update algorithm to search for a better-matching set of caches to connect to. The chunk selection algorithm and the topology update algorithm are described next.
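A minimal sketch of how a user might derive this binary signal from the caches' offers, assuming an (𝑛, 𝑘) MDS code; the function and argument names are ours.

```python
def satisfaction_signal(offered: dict, cached: dict, k: int):
    """Decide the user's per-frame signal (sketch).

    offered: cache_id -> chunks-per-frame the cache can upload
    cached : cache_id -> set of chunk indices the cache stores for the frame
    Returns (signal, deficit): signal is 1 ('UNSATISFIED') when the connected
    caches cannot jointly supply k distinct chunks, in which case `deficit`
    chunks must be fetched from the server.
    """
    # An (n, k) MDS code needs any k distinct chunks to decode the frame.
    distinct = set().union(*cached.values()) if cached else set()
    suppliable = min(sum(offered.values()), len(distinct))
    deficit = max(0, k - suppliable)
    return (1 if deficit > 0 else 0), deficit

# Two caches jointly hold 16 distinct chunks of a k = 20 frame:
# the user signals UNSATISFIED and fetches 4 chunks from the server.
sig, need = satisfaction_signal({1: 10, 2: 6},
                                {1: set(range(10)), 2: set(range(10, 16))},
                                k=20)
assert (sig, need) == (1, 4)
```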

The Chunk Selection Algorithm determines which chunks to request from each cache, with two major objectives: to maximize the number of unique chunks obtained from the caches per frame, and to make use of each cache in proportion to the rate it provides. We take a greedy approach to this problem, which we describe briefly. We first sort the chunks rarest-first, where rarity is determined by a chunk's occurrence across all caches for that frame. We then assign each chunk, from the rarest to the most common, to the cache that currently has the lowest ratio of assigned upload rate to providable upload rate. This greedy algorithm achieves the maximum number of downloadable chunks while minimizing the relative gap between the rates provided by the caches and the actual rates assigned by the user.
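The greedy rule can be sketched as follows; the function and variable names are ours, and ties in rarity are broken arbitrarily.

```python
from collections import Counter

def select_chunks(cached, rates, k):
    """Greedy rarest-first chunk assignment (sketch).

    cached: cache_id -> set of chunk indices held for this frame
    rates : cache_id -> chunks-per-frame the cache offers
    Assigns up to k distinct chunks, rarest first, each to the connected
    cache with the lowest assigned/offered ratio that still holds it.
    Returns cache_id -> list of chunk indices to request.
    """
    count = Counter(c for chunks in cached.values() for c in chunks)
    assignment = {cid: [] for cid in cached}
    # Rarest chunks first: they have the fewest alternative sources.
    for chunk, _ in sorted(count.items(), key=lambda kv: kv[1]):
        if sum(len(v) for v in assignment.values()) >= k:
            break
        holders = [cid for cid in cached
                   if chunk in cached[cid] and len(assignment[cid]) < rates[cid]]
        if holders:
            # Least-loaded holder relative to its offered rate.
            cid = min(holders, key=lambda c: len(assignment[c]) / rates[c])
            assignment[cid].append(chunk)
    return assignment
```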

The Topology Update Algorithm explores better matches between users and caches in terms of supply and demand. 𝑇normal seconds after the initial connection with the initial set of caches is established, a user randomly connects to a cache from its list of unconnected caches. It then waits 𝑇transit seconds to give the new cache enough time to adjust its assigned upload rates and stream the cached video chunks to the user. When this timer expires, the user drops one of the connected caches. Which cache is dropped depends on the quality of the match between the user and the cache, as measured by the current upload rate supplied by that cache to the user. In our algorithm, we employ what we call a "soft choking" rule. While a hard choking rule would deterministically drop the worst-performing cache, we use a softer rule in which the probability of dropping a cache depends on its supplied upload rate (specifically, the choking probability is proportional to the negative exponential of the supplied upload rate of each connected cache). Under this rule, every connected cache can be dropped, even the best-performing one, but the worst-performing cache is the most likely to be dropped. This way of choking is theoretically justified by the recently proposed Markov approximation method [18]. The user repeats this process; its pseudo-code is presented in Algorithm 2.

Algorithm 2 User’s Topology Update Algorithm

1: Wait 𝑇normal.
2: Randomly choose and connect to a new cache.
3: Wait 𝑇transit.
4: for each connected cache 𝑐 do
5:     𝑥𝑐 = Upload rate provided by cache 𝑐
6:     𝑝𝑐 ∝ exp(−𝑥𝑐) = Probability of dropping cache 𝑐
7: end for
8: Normalize 𝑝.
9: Pick and disconnect a random cache 𝑐 with probability 𝑝𝑐.
10: Repeat.
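The soft-choking step (lines 4–9 of Algorithm 2) can be sketched as follows; the sampling helper and its injectable random source are our own.

```python
import math
import random

def pick_cache_to_drop(rates, rng=random.random):
    """Soft-choking rule (sketch): drop cache c with probability
    proportional to exp(-x_c), where x_c is its supplied upload rate.
    Any cache can be dropped, but the worst performer is most likely.
    `rng` returns a uniform sample in [0, 1) and is injectable for testing.
    """
    weights = {c: math.exp(-x) for c, x in rates.items()}
    total = sum(weights.values())
    r, acc = rng() * total, 0.0
    for c, w in weights.items():
        acc += w
        if r <= acc:
            return c
    return c  # guard against floating-point shortfall
```

With rates {1: 0.0, 2: 5.0}, cache 1 (supplying nothing) is dropped with probability e⁰/(e⁰ + e⁻⁵) ≈ 0.993, yet cache 2 still has a small chance of being choked, which is what drives the Markov-approximation exploration.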

V. IMPLEMENTATION DETAILS

The Tracker is a stripped-down webserver based on web.py, so that HTTP commands can be used for updates and accesses. An important aspect of the implementation is the gathering of logs from the various caches, which is also managed by the tracker.

Customized FTP Protocol: The rest of the algorithm is implemented by modifying the FTP protocol to support coded video streaming. Specifically, we added two FTP commands, LIST CNKS and RETR CNKS, to pyftpdlib, an existing open-source FTP server library. When LIST CNKS is called by a user with the ID of a video, a cache responds with the list of cached chunk indices of the specified video. The RETR CNKS command takes the ID of a video, the index of a frame, and a list of chunk indices, and retrieves the set of chunks specified by those arguments. On the client side, the commands were implemented on top of ftplib, an open-source FTP client library.
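As a minimal sketch of the command strings these two verbs might carry: the paper fixes only the command names and their arguments, so the exact wire layout below (space-separated arguments, comma-separated chunk list) is our assumption.

```python
def list_cnks_cmd(video_id: int) -> str:
    """Ask a cache for its cached chunk indices of a video
    (argument layout is illustrative, not from the paper)."""
    return f"LIST CNKS {video_id}"

def retr_cnks_cmd(video_id: int, frame: int, chunks: list) -> str:
    """Retrieve the given chunks of one frame of a video."""
    return f"RETR CNKS {video_id} {frame} {','.join(map(str, chunks))}"

# E.g., request chunks 0, 4, and 7 of frame 10 of video 3.
assert retr_cnks_cmd(3, 10, [0, 4, 7]) == "RETR CNKS 3 10 0,4,7"
```

In a pyftpdlib-based server, such verbs would be dispatched to handler methods on the cache side, while a client built on ftplib would send them over the control connection.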

Thus, the caches and server(s) run instances of the modified FTP server, and each user runs several instances of the modified client. All of the algorithms described in the previous section are fully implemented as protocols.

VI. SIMULATION RESULTS

A. Packet-level Simulator

The mathematical formulation described in Section III applies to a given demand profile. While we can prove that the distributed system we derive from it works well in this static case, this is hardly good enough: it is important to study its performance under dynamic conditions such as user and cache churn, varying content popularity, and varying network conditions.

As a stepping stone to bridging theory and practice, we have implemented a large-scale packet-level simulator in MATLAB and equipped it with full functionality. The simulator allows us to test the robustness of the system against various dynamics by varying system parameters such as the number of users and caches, video popularity, and network conditions. The distributed nature of the system allowed us to run relatively large-scale simulations in parallel on multi-core processors. This section describes some of our findings.

We first run a small-scale simulation to validate the simulator. The setup is as follows. There are 100 users in the system, each watching one of 20 videos. Video popularity follows Zipf's law, a heavy-tailed distribution. There are 50 caches, each of which can store up to 2 videos. Each cache's upload bandwidth is at most two times the video streaming rate. The server is connected to all users. Figure 6 shows how the system evolves under the algorithms. After 2000 iterations, the caches provide 95.4% of the overall traffic, and the server needs to provide only the remaining 4.6%.
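The Zipf popularity used in this setup can be sketched as follows; the exponent alpha = 1 is our assumption, since the text states only that popularity follows Zipf's law.

```python
def zipf_pmf(n_videos: int, alpha: float = 1.0):
    """Zipf popularity over videos ranked 1..n: p(r) proportional to r^-alpha.
    (alpha = 1 assumed; the paper does not specify the exponent.)"""
    weights = [r ** -alpha for r in range(1, n_videos + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = zipf_pmf(20)
assert abs(sum(p) - 1.0) < 1e-9
# Heavy head: popularity is non-increasing in rank.
assert all(p[i] >= p[i + 1] for i in range(len(p) - 1))
```

Under this distribution the top-ranked video absorbs roughly 28% of the demand across 20 videos, which illustrates why a handful of videos dominate the caches' storage decisions.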

Figure 6 shows snapshots of the system at different times. The 100 circles at the bottom of each figure represent users; the color associated with a user indicates the index of the video being watched. The 50 circles on top represent caches. Above each cache circle, a box represents its disk, and the small colored boxes within it represent the stored amounts of the corresponding videos; a cache can fill its storage up to the height of the box by caching several videos. The overlay topology between users and caches is represented as a bipartite graph, and a red bar represents the server load. The visualized outputs also allowed us to check the validity of the algorithm and to make further interesting observations. The following sections cover extensive simulation results.

B. A toy example

Another small-scale simulation illustrates the complexity of the problem being solved by the system. Figure 7 shows a simple VoD system with three caches and six users. Let us first consider the resource allocation algorithm under the fixed overlay topology depicted in Figure 7a. The caches must maximize their joint upload rates so as to minimize the server load. The optimal caching and rate allocation can easily be found with any optimization solver; it turns out to be as follows, as depicted in Figure 7a.

1) Cache 1 : Store video A. Upload the full stream to user1 and user 2.

2) Cache 2 : Store half of video A and half of video B. Upload the fractional stream of video A at rate one half to user 3. Upload the fractional stream of video B at rate one half to users 4 and 5.

3) Cache 3 : Store video B. Upload the fractional streamof video B to users 4 and 5. Upload the full stream ofvideo B to user 6.

4) Server : Upload the fractional stream of video A at rate one half to user 3.

Figure 7b shows the convergence of the non-cache traffic when we run only the distributed resource allocation algorithm under the same configuration. The distributed algorithm converges to the same optimal point, where the non-cache traffic is 0.5Mbps.

Now, what is the optimal topology selection and resource allocation? Assume that the degree bounds of the three caches are 3, 4, and 3, respectively. By considering all possible C(6,3) × C(6,4) × C(6,3) = 6000 overlay topologies, one can find the optimal topology and resource allocation scheme, which is as follows and depicted in Figure 7c.

1) Cache 1 : Connect to User 1, User 2, and User 3. Storevideo A. Upload the full stream to user 1 and the halfstream to user 2 and user 3.

2) Cache 2 : Connect to User 2, User 3, User 4, and User5. Store half of video A and half of video B. Uploadthe half stream of video A to user 2 and user 3. Uploadthe half stream of video B to user 4 and user 5.

3) Cache 3 : Connect to User 4, User 5, and User 6. Storevideo B. Upload the half stream of video B to users 4and 5. Upload the full stream of video B to user 6.

4) Server : Do nothing.

Here, the server traffic is 0Mbps. Figure 7d shows the convergence of the non-cache traffic when we run the distributed resource allocation algorithm and the topology update algorithm together. The distributed algorithm quickly converges to the optimal point, where the non-cache traffic is 0Mbps.
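The brute-force count of candidate topologies quoted above can be reproduced with a short enumeration; labeling the six users 0–5 is our illustrative convention.

```python
from itertools import combinations, product

users = range(6)
# Degree bounds 3, 4, and 3 for caches 1, 2, and 3, as in the example:
# each cache independently picks which users to serve.
topologies = product(combinations(users, 3),
                     combinations(users, 4),
                     combinations(users, 3))
# C(6,3) * C(6,4) * C(6,3) = 20 * 15 * 20 = 6000 candidate topologies.
assert sum(1 for _ in topologies) == 6000
```

Even in this toy instance an exhaustive search touches 6000 topologies, each requiring its own optimal rate allocation, which is why the paper resorts to a distributed Markov-approximation search rather than enumeration.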

This confirms that the simulated distributed algorithm can indeed converge quickly to the theoretically optimal operating point. In the following sections, we show results from extensive experiments at larger scales.



(a) @t=10, server load = 98.6% (b) @t=50, server load = 62.8% (c) @t=100, server load = 11.8% (d) @t=1000, server load = 4.6%

Fig. 6: Visualized simulation results. The red bar on the left represents the ratio of the server traffic to the total users' demand. Each video is represented by a color. Large boxes and the colored segments in them represent caches and cached content, respectively. The overlay topology is represented as a graph, and the thickness of each link represents the upload rate on that link.

(a) The optimal resource allocation
(b) Simulation results without topology update algorithm
(c) The optimal resource allocation and topology
(d) Simulation results with topology update algorithm

Fig. 7: Small scale simulation results.

(a) Popular videos are added. (b) Inverted video popularity. (c) Cache collapse (d) Caches’ Aggregate Storage

Fig. 8: Simulation results from robustness test.

C. Robustness

Figure 8a shows the robustness of the system against changes in video demand. We simultaneously add several videos and make them the most popular videos in the system, which results in a spike in server traffic. However, the distributed caches quickly adapt to the change in demand by altering the set of stored movies, and within 200 iterations the system has converged to the new optimum. The same rapid adjustment is observed in Figure 8b when we invert the demand for all movies, i.e., the most popular movie becomes the least popular and so on. This is an extreme change in demand, especially given that the distribution is Zipf.

Figure 8c shows what happens when a randomly chosen half of the caches in the system simultaneously become inoperative. Remarkably, the surviving caches are able to detect the sudden increase in user demand, adapt their stored content, and update their upload rates to the users quickly. Figure 8d shows how quickly the surviving caches adjust their storage to adapt.

D. Scalability

Finally, Figure 9a provides evidence of scalability. With 10,000 users, 5,000 caches, and 1,000 videos, the system is able to offload more than 98% of the demand from the server to the caches. More interestingly, Figure 9b shows that the total amount of caching of a video in the system is roughly proportional to its popularity. This is interesting because, without any global knowledge of video popularity, the caches jointly discover it in a fully distributed manner.

(a) Server traffic at a large scale (b) Popularity and total amount of caching

Fig. 9: Scalability

VII. TESTBED RESULTS

In this section, in addition to corroborating our analytical and simulation-based findings, we study the system's performance in practice and examine various system-level tradeoffs. We first observe that the server-to-cache traffic is not negligible in practice, and study the split between server-to-cache and cache-to-device traffic. Second, we study the tradeoff between the cache update interval and the time taken for the system to adjust to changes in demand. Lastly, we test the topology update algorithm's performance in practice: even with several practical constraints, it is able to reduce the server traffic. These insights give us the confidence to claim that a much larger system, on the scale of hundreds of thousands of highly heterogeneous nodes, would perform as well as our current implementation.

A. Experimental Setup

We deploy our full implementation of the system on 20 Amazon EC2 medium instances located in Northern California, running 60 caches and 120 users. There are 20 high-definition (3.2Mbps) videos registered in the system, and users choose videos at random following a Zipf distribution. This implies that the most popular video is watched by more than 25 users, while the least popular videos are watched by only 1 or 2 users. When a user node finishes watching a video, it randomly chooses a new video to watch and connects to newly chosen caches.

Each cache can store up to 2 videos and has an upload bandwidth of four times the video streaming rate; we used FTP bandwidth throttling to emulate the caches' upload bandwidths on Amazon EC2. Each user can connect to up to 5 caches.

B. Traffic split between server to cache and cache to user

Fig. 10: Server traffic to users and caches

First, we look at how much traffic flows from the server to the users and to the caches. In this experiment, each user repeatedly chooses and watches a video. Figure 10 compares the traffic from the server to the caches with the traffic from the server to the users. At the beginning, the server-to-cache traffic is 29.0% of the server-to-user traffic, because the server must initially fill the empty caches. However, as time goes on, the server-to-cache traffic becomes negligible: it falls below 5.2% of the server-to-user traffic. This observation supports our modeling assumption that the server-to-cache traffic is non-negligible only in the initial phase, unless the video demand distribution varies too rapidly.

C. Tradeoff between cache update intervals and the time taken for the system to adjust to changes in demand

Second, we consider how frequently the caches should update their storage. If the server-to-cache traffic were not a concern, the caches could update their storage as frequently as they like without the system being penalized. In practice, however, if the caches update their storage too frequently, the server is burdened by constantly having to update them. A natural question therefore arises: how frequently should the caches update their storage so as to be appropriately responsive to changes in demand without overburdening the server? We examine this tradeoff by changing the video popularity abruptly in the middle of an experiment. We gave the caches 50% more resources so that the system could achieve near-zero server traffic; after the system reaches this state, we suddenly "invert" the video popularity histogram.

Figures 11a and 11b show the results with 𝑇storage = 0.5s and 𝑇storage = 5s, respectively. With the shorter cache update period, the caches adjust their storage so quickly that the server-to-user traffic stays quite low. With the longer update interval, the caches do not adjust their storage frequently, so more users download videos directly from the server, resulting in high traffic peaks. We believe the tradeoff observed here must be considered jointly with how quickly the video demand distribution changes.

D. Topology update algorithm in Practice

In practice, the topology update might hurt the system: a fast-varying topology can incur higher server-to-cache traffic because of the periodic soft-choking algorithm described earlier. Although the topology update algorithm is guaranteed, upon convergence, to achieve lower server-to-user traffic, it is not clear whether the transient server-to-cache traffic needed to attain this convergence is negligible. To test the performance of the topology update in practice, we consider the case where the users watch the same video in a continuous loop.

Figures 12a and 12b show the server traffic without and with the topology update algorithm, respectively. Upon convergence, the topology update algorithm achieves 19.3% lower server-to-user traffic and 37.9% lower server-to-cache traffic. This is because users who are watching the same video are more likely to connect to the same caches, thereby also reducing the traffic to the caches. These results imply that the topology update algorithm remains important even when the server-to-cache traffic is taken into account. More extensive tests under more dynamic demand-churn settings are part of ongoing work.

(a) Frequent cache update, 𝑇storage = 0.5sec (b) Infrequent cache update, 𝑇storage = 5sec

Fig. 11: Testbed results with different cache update intervals.

(a) Without topology update algorithm (b) With topology update algorithm

Fig. 12: Testbed results checking the usefulness of the topology update algorithm in practice.

VIII. CONCLUSION

In this paper, we proposed an architecture for distributed VoD when the components are potentially very unreliable, and when storage, bandwidth, and node degree constraints may be severe. We started from a theoretical formulation from which we derived a set of distributed algorithms that are highly robust to changes in demand, user churn, and device failures. In addition to exploring the behavior of the system via a packet-level simulator, we also related our experience with a full implementation. Experimental results from the testbed provide valuable insights into the design of a much larger practical system, which we argued could scale to large numbers of users and caches. We continue to gain experience from our testbed with a view to such larger deployments.

REFERENCES

[1] C. Labovitz, “The Other 50% of Internet Traffic,” in Proc. of NorthAmerican Network Operators Group Meeting, NANOG 54, 2012.

[2] Wikipedia, “Seedbox — Wikipedia, the free encyclopedia,”2012, [Online; accessed 7-October-2012]. [Online]. Available:http://en.wikipedia.org/wiki/Seedbox

[3] K. Lee, H. Zhang, Z. Shao, M. Chen, A. Parekh, and K. Ramchandran, "An optimized distributed video-on-demand streaming system: Theory and design," in Proc. of 50th Annual Allerton Conference on Communication, Control, and Computing, 2012.

[4] J. Almeida, D. Eager, M. Vernon, and S. Wright, “Minimizing de-livery cost in scalable streaming content distribution systems,” IEEETransactions on Multimedia, vol. 6, no. 2, pp. 356–365, 2004.

[5] Y. Boufkhad, F. Mathieu, F. de Montgolfier, D. Perino, and L. Viennot,“Achievable catalog size in peer-to-peer video-on-demand systems,” inProc. of IPTPS, 2008.

[6] X. Zhou and C. Xu, “Efficient algorithms of video replication andplacement on a cluster of streaming servers,” Journal of Network andComputer Applications, vol. 30, no. 2, pp. 515–540, 2007.

[7] N. Laoutaris, V. Zissimopoulos, and I. Stavrakakis, “On the optimiza-tion of storage capacity allocation for content distribution,” ComputerNetworks, vol. 47, no. 3, pp. 409–428, 2005.

[8] J. Wu and B. Li, “Keep cache replacement simple in peer-assisted vodsystems,” in Proc. of IEEE INFOCOM, 2009.

[9] B. Tan and L. Massoulie, “Brief announcement: adaptive contentplacement for peer-to-peer video-on-demand systems,” in Proc. ofACM PODC, 2010.

[10] D. Applegate, A. Archer, V. Gopalakrishnan, S. Lee, and K. Ramakrishnan, "Optimal content placement for a large-scale vod system," in Proc. of ACM CoNEXT, 2010.

[11] S. Borst, V. Gupta, and A. Walid, "Distributed caching algorithms for content distribution networks," in Proc. of IEEE INFOCOM, 2010, pp. 1–9.

[12] V. Valancius, N. Laoutaris, L. Massoulie, C. Diot, and P. Rodriguez,“Greening the internet with nano data centers,” in Proceedings of the5th international conference on Emerging networking experiments andtechnologies. ACM, 2009, pp. 37–48.

[13] X. Zhou and C. Xu, "Optimal video replication and placement on a cluster of video-on-demand servers," in Proc. of International Conference on Parallel Processing, 2002, pp. 547–555.

[14] Y. Liu, X. Liu, L. Xiao, L. Ni, and X. Zhang, "Location-aware topology matching in p2p systems," in Proc. of IEEE INFOCOM, 2004.

[15] N. Laoutaris, D. Carra, and P. Michiardi, “Uplink allocation beyondchoke/unchoke,” in Proc. of ACM CoNEXT, 2008.

[16] V. Aggarwal, O. Akonjang, and A. Feldmann, "Improving user and isp experience through isp-aided p2p locality," in Proc. of IEEE INFOCOM, 2008.

[17] S. Zhang, Z. Shao, and M. Chen, “Optimal distributed p2p streamingunder node degree bounds,” in Proc. of IEEE ICNP, 2010.

[18] M. Chen, S. Liew, Z. Shao, and C. Kai, “Markov approximation forcombinatorial network optimization,” in Proc. of IEEE INFOCOM,2010.
