
2168 IEEE COMMUNICATIONS LETTERS, VOL. 17, NO. 11, NOVEMBER 2013

Summarizing Data Center Network Traffic by Partitioned Conservative Update

Chi Harold Liu, Member, IEEE, Andreas Kind, Member, IEEE, and Tiancheng Liu

Abstract—Applications like search and massive data analysis running bandwidth-hungry algorithms like MapReduce in data center networks (DCNs) may lead to link congestion. Thus it is important to identify the sources of congestion in real time. In this letter, we propose a sketch-based data structure, called "P(d)-CU", to estimate the aggregated/summarized flow statistics over time; it guarantees high estimation accuracy with low computational complexity, and scales well with the increase of input data size. Considering the amount of skew for flows of different network services, it partitions a two-dimensional array of counters along its depth as an enhancement to the existing Conservative Update (CU) mechanism. We show its superior performance by theoretical analysis and extensive experimental results on a real DCN trace.

Index Terms—Data center network, flow analysis.

I. INTRODUCTION

EMERGING bandwidth-hungry applications running distributed algorithms, e.g., MapReduce [1], for search and massive data analysis are housed in modern data center networks (DCNs). They shuffle data of growing size from one virtual machine to another, potentially across different switches/routers. These flows usually come and go very quickly and dynamically, so a sudden traffic increase may leave some links in a DCN highly congested and their bandwidth overused. Therefore, we need an efficient algorithm that accurately identifies the sources of congestion down to the flow level, with negligible computational complexity, from the collected flow records.

Previous research has sought to identify the elephant flows [2]; however, application-oriented approaches [3] need specific application support, and per-flow based approaches [4] suffer from scalability issues [2]. sFlow [5] reduces the number of flow record collections by sampling the packet headers exported at a switch/router, but the overall huge volume still prohibits storing all of them in a persistent database and then identifying elephants via database querying. Given the exported flow records as inputs, streaming algorithms [6] treat them as a data stream composed of key-value pairs, where the key can represent the distinct pair of source-destination IP addresses, and the value is the amount of carried workload in that packet. The same key may therefore appear randomly and repetitively many times, and the goal is to identify, as elephants, the set of IP pairs carrying the most workload within a time period. The algorithms can be

Manuscript received January 13, 2013. The associate editor coordinating the review of this letter and approving it for publication was H.-P. Schwefel.

C. H. Liu is with the Beijing Institute of Technology, China (e-mail: [email protected]).

A. Kind and T. Liu are with the System Technology Department, IBM Research, China (e-mail: {zrlank, liutc}@cn.ibm.com).

Digital Object Identifier 10.1109/LCOMM.2013.091913.130094

implemented in different data structures. The first is counter-based algorithms [7], e.g., Lossy Counting [8] and Space Saving [9]. They use a one-dimensional array of counters to track a subset of inputs and, given the limited storage space, decide whether or not to store each newly arrived item. The second is sketch-based algorithms. The term "sketch" was introduced in [10] to denote a data structure formed as a linear projection of the input vector with a matrix. They perform a "point query", or simply an estimation, on the matrix and return the approximated answer. Unlike storing data in a database with growing size, this matrix is implemented by a fixed two-dimensional array of counters. Examples are Count-Min (CM) [10] and Conservative Update (CU) [11]; for details see Section II-B. Since CM updates all counters, it always overestimates the true value. CU improves on CM by conservatively updating a counter only if the sum of the point query result and the new update is bigger than the value stored in that counter. However, CU comes with a large time complexity. Finally, [10] reported that the workload distribution of different network services (DNS, HTTP, etc.) exhibits significant and differing amounts of skew, defined as a measure of the asymmetry of the probability distribution of the carried workload. This amount can be well modeled by the Zipfian parameter. However, none of these algorithms successfully captures this property during the analysis phase. Motivated by these facts, in this letter we explicitly make the following three contributions:

(a) the proposal of a partitioned CU approach, called "P(d)-CU", along the vertical dimension of the sketch, to consider the amount of skew of different network services and achieve low estimation error and computational complexity;

(b) time complexity analysis of the P(d)-CU algorithm and its bound;

(c) extensive experimental results on a real DCN trace, analyzing the compute time and estimation error in comparison to existing approaches.

The rest of the letter is organized as follows. Section II introduces background and motivation. Section III presents the proposed algorithm and its analysis. Section IV shows experimental results, and concluding remarks are given in Section V.

II. BACKGROUND AND MOTIVATION

A. A Real DCN Trace Study

We performed a trace study using data from a commercial DCN (composed of four virtual fabric 10G switches by BLADE Network Technologies) that hosts an airline travel booking service during the whole day of Jan. 1, 2008, with 29,614,720 received sFlow records in total. Fig. 1 shows the magnitude of workload for two of the four switches,

1089-7798/13$31.00 © 2013 IEEE


[Fig. 1 image omitted: workload (GB) over 0:00-24:00 for switches 10.75.22.10 and 10.75.22.11, and log-log frequency-vs-rank curves for Total, HTTP, DNS, HTTPs, and SCSRA traffic.]

Fig. 1. The change of workload over 24 hrs, and the Zipfian distributions of the frequency of the sorted amount of workload.

TABLE I
NETWORK SERVICE DIAGNOSIS ON FOUR WORKLOAD SPIKES

                     HTTP   SCSRA   HTTPs   shell   others
  2:10am-2:30am       41%     40%      3%      2%      15%
  3:00am-3:30am       35%     47%      2%      2%      14%
  5:00am-5:50am       18%     72%      0%      0%      10%
  10:00am-10:40am     27%     53%      3%      0%      17%

and we observe four spikes on switch 10.75.22.11, distributed between 2:10am-2:30am, 3:00am-3:30am, 5:00am-5:50am, and 10:00am-10:40am, respectively. Table I identifies the network services associated with the four spikes via offline database querying. We found that HTTP-based web browsing through a remote administration protocol (SCSRA) dominates, consistent with the offered travel booking services.

Nevertheless, real DCN operation requires runtime diagnosis with little computational complexity and high estimation accuracy. Also, we need to understand which particular sets of IP pairs contribute to a spike, to effectively help reduce the congestion. To this end, we aim to provide the analysis result of retrieving the top-K (e.g., default 100) workloads carried by source-destination IP addresses received on a list of physical ports of a specific switch/router. Note that the output can also be ranked by counting the number of appearances (i.e., heavy hitters). Our analysis also confirms the finding in [10] that the workload of each type of network service exhibits a strong Zipfian distribution, as shown in Fig. 1. The fitted z parameters are zHTTP = 1.53, zDNS = 1.93, and zothers = 1.02 (coefficient 0.95), respectively. Note that although we use collected sFlow records to motivate our problem, the applicability of the proposed data structure is not restricted to this format of record collection.

B. CM and CU Approaches

As described earlier, sFlow-like protocols export sampled packet headers of flows going through a switch/router, which are considered as inputs. Each is represented by a key-value pair. The key is generated from the source-destination IP addresses, and the value is the amount of workload specified in the packet. Then, considering only the distinct keys, we denote them as a vector a of known dimension m, presented in an implicit, incremental fashion. Its current state at time t is a(t) = [a1(t), . . . , ai(t), . . . , am(t)]. Initially, a is the zero vector. Updates to individual entries of the vector are presented as a stream of pairs (i, c), e.g., the i-th IP pair's total workload is increased by amount c. As shown in Fig. 2(a), the data structure of CM is represented by a two-dimensional array of counters with width w and depth d: counter[1, 1] . . . counter[d, w].

[Fig. 2 images omitted: panel (a) shows the CM/CU sketch with hash functions h1 . . . h6 and a collision; panels (b) and (c) show the counter arrays of the HTTP sketch, the DNS sketch, and the sketch for all other service types, before and after the update.]

Fig. 2. (a) CM/CU sketch structure with hash functions. An illustrative example of the P(d)-CU algorithm, (b) before the update, and (c) after the update, where only red numbers are updated while blue ones stay the same.

Each counter is initially zero. Additionally, we choose d hash functions h1 . . . hd : {1 . . . m} → {1 . . . w} uniformly at random from a pairwise-independent family, hashing the i-th element d times to counters [j, hj(i)], ∀1 ≤ j ≤ d. When updates arrive, all hashed counters are increased accordingly. Thus, they may store the aggregated values of multiple items. Such a collision is very unlikely to repeat in all rows simultaneously under different hash functions. Then, the estimation âi from the structure is given by âi = min1≤j≤d counter[j, hj(i)], i.e., the minimum of the d hashed counters. Different from CM, CU [11] first computes âi, and then counters are conservatively updated according to: ∀1 ≤ j ≤ d, counter[j, hj(i)] ← max {counter[j, hj(i)], âi + c}.
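The CM and CU mechanics described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the d pairwise-independent hash functions are approximated with salted built-in hashing.

```python
import random

class CUSketch:
    """Count-Min sketch with the Conservative Update rule (CM uses the
    same structure but unconditionally adds c to every hashed counter)."""

    def __init__(self, width, depth, seed=0):
        self.w, self.d = width, depth
        self.counters = [[0] * width for _ in range(depth)]
        rng = random.Random(seed)
        # Salts stand in for d pairwise-independent hash functions h_1..h_d.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def _col(self, key, j):
        return hash((self.salts[j], key)) % self.w

    def query(self, key):
        # Point query: minimum of the d hashed counters.
        return min(self.counters[j][self._col(key, j)] for j in range(self.d))

    def update(self, key, c):
        # Conservative update: raise each hashed counter only up to
        # (current estimate + c), never beyond it.
        target = self.query(key) + c
        for j in range(self.d):
            col = self._col(key, j)
            if self.counters[j][col] < target:
                self.counters[j][col] = target

sketch = CUSketch(width=1024, depth=4)
sketch.update(("10.75.22.10", "10.75.22.11"), 5)
sketch.update(("10.75.22.10", "10.75.22.11"), 9)
print(sketch.query(("10.75.22.10", "10.75.22.11")))  # 14 (single key, no collision possible)
```

With a single stored key the estimate is exact; with many keys, collisions can only make the estimate an overestimate, which CU's max-rule keeps as small as possible.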

III. PARTITIONED CU ALGORITHM

Section II-A shows that different network services conform to different Zipf's laws. Also, [10] proves that for CM to produce ε estimation accuracy with probability at least 1 − δ requires space O(ε−min{1,1/z} ln 1/δ). Clearly, when z > 1, a bigger z requires a smaller space to achieve the same accuracy. Therefore, different network services may use different portions of the sketch space to achieve the same error performance. In this section, we propose an enhanced algorithm that reduces the computational complexity and estimation error from this angle.
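As a rough numeric illustration of this space bound (treating the hidden constant as 1, an assumption, since [10] states the bound only asymptotically), the fitted z parameters from Section II-A translate into noticeably different counter budgets:

```python
import math

def cm_space(eps, delta, z):
    """Counters needed for a CM sketch on a stream with Zipf parameter z,
    per the O(eps^{-min(1, 1/z)} * ln(1/delta)) bound of [10].
    The hidden constant factor is assumed to be 1 for illustration."""
    width = eps ** -min(1.0, 1.0 / z)   # counters per row
    rows = math.log(1.0 / delta)        # number of rows
    return math.ceil(width) * math.ceil(rows)

# Fitted parameters from the trace study in Section II-A:
for name, z in [("HTTP", 1.53), ("DNS", 1.93), ("others", 1.02)]:
    print(name, cm_space(eps=0.01, delta=0.01, z=z))
```

The heavier-skewed DNS traffic (largest z) needs the fewest counters for the same (ε, δ) guarantee, which is exactly why partitioning by service type can save space or, equivalently, reduce error at equal space.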

We assume the types of network services going through the considered DCN have been monitored through sufficient flow statistics, and thus the associated z parameter for each type is known a priori. Note that the same network service usually has a similar Zipfian distribution [10] in different scenarios. Then, we partition the original sketch into multiple small sketches, each of which collects and processes flows of one type of network service; the processing is the same as in the existing CU algorithm. The partition can be performed either along the depth or the width dimension of the original sketch, while preserving the other dimension as a constant. This yields two different algorithms, denoted P(d)-CU and P(w)-CU,


respectively. Without loss of generality, let n denote the total number of input records (e.g., total received sFlow records in an hour, different from m), and let w and d denote the width and depth of the original sketch before partition, respectively. Then, after the partition, d = Σ_{k=1}^{K} d_k along dimension d, w = Σ_{k=1}^{K} w_k along dimension w, and n = Σ_{k=1}^{K} n_k, where n_k denotes the input data size for the k-th partition.

The processing steps of P(d)-CU are illustrated, as an example, in Fig. 2(b) and (c). The entire sketch of w = 7, d = 6 is vertically divided into K = 3 sketches, which have 2, 1, and 3 rows to process packets from HTTP, DNS, and other service types, respectively. As shown in Fig. 2(b), assume items a1, a2, a3 (representing different IP pairs) from the three categories are monitored; they are hashed into counters of different rows, and the stored counts before the update were (2, 9), 4, and (2, 3, 9), respectively. Then, to return the estimations of a1, a2, a3, we perform the point query on the three sketches, and the results are â1 = min{2, 9} = 2, â2 = 4, â3 = min{2, 3, 9} = 2, i.e., the minimum of all counts in the hashed counters. Now suppose new updates c1 = 5, c2 = 9, c3 = 6 arrive. The update rule increases a counter's value only if its stored value is less than the sum of the estimation result and the new update, i.e., â1 + c1 = 7, â2 + c2 = 13, â3 + c3 = 8. As a result, Fig. 2(c) shows the values after the update, which become (7, 9), 13, and (8, 8, 9), respectively.
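The dispatch logic of P(d)-CU can be sketched as follows. This is a minimal illustrative version, not the paper's implementation: salted built-in hashing stands in for the pairwise-independent hash family, and the service labels are assumptions.

```python
import random

def make_cu(width, depth, seed):
    """A minimal CU sub-sketch, returned as a (query, update) pair."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(depth)]
    rows = [[0] * width for _ in range(depth)]

    def query(key):
        # Point query: minimum over this sub-sketch's depth_k rows.
        return min(rows[j][hash((salts[j], key)) % width] for j in range(depth))

    def update(key, c):
        # Conservative update within the sub-sketch only.
        target = query(key) + c
        for j in range(depth):
            col = hash((salts[j], key)) % width
            rows[j][col] = max(rows[j][col], target)

    return query, update

class PdCU:
    """P(d)-CU: the original d rows are split among service classes,
    each class running plain CU on its own smaller sketch."""

    def __init__(self, width, depths):  # depths: {service: d_k}, sum d_k = d
        self.sketches = {svc: make_cu(width, dk, seed=i)
                         for i, (svc, dk) in enumerate(depths.items())}

    def update(self, service, key, c):
        self.sketches[service][1](key, c)

    def query(self, service, key):
        return self.sketches[service][0](key)

# Depth ratio 3:2:4, as allocated in Section IV from the inverse z ratio.
pdcu = PdCU(width=8192, depths={"HTTP": 3, "DNS": 2, "others": 4})
pdcu.update("HTTP", ("10.0.0.1", "10.0.0.2"), 5)
pdcu.update("DNS", ("10.0.0.1", "10.0.0.3"), 9)
print(pdcu.query("HTTP", ("10.0.0.1", "10.0.0.2")))  # 5
```

Each record costs only d_k counter touches instead of d, which is the source of the complexity gain analyzed next.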

The implementation of these K sketches can be either parallelized or serialized, and we next show its superiority even if the serialized approach is adopted. Since CM performs n simple updates for the entire input stream, its time complexity t_CM is O(n), the lower bound for all sketching algorithms. In contrast, CU performs the estimation for each update, and thus its time complexity is t_CU = O(nd). The time complexity of P(w)-CU is the same as CU, since the width w has no impact on the update time: t_P(w)-CU = O(Σ_{k=1}^{K} n_k d) = O(nd).

However, the compute time of P(d)-CU is significantly less than that of CU, since:

nd = Σ_{k=1}^{K} n_k d_k + Σ_{k=1}^{K} n_k (d − d_k) >> Σ_{k=1}^{K} n_k d_k.   (1)
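A quick numeric check of (1), with K = 3 and arbitrarily chosen illustrative values of n_k and d_k (these numbers are assumptions, not from the trace):

```python
# Illustrative values: K = 3 service-class partitions.
n_k = [1_000_000, 500_000, 1_500_000]  # records per class
d_k = [3, 2, 4]                        # rows per partitioned sketch
n, d = sum(n_k), sum(d_k)

cu_cost = n * d                                        # CU: every record touches all d rows
pdcu_cost = sum(nk * dk for nk, dk in zip(n_k, d_k))   # P(d)-CU: only its own d_k rows
print(cu_cost, pdcu_cost)  # 27000000 10000000
```

Here the counter-touch count drops from nd = 27M to Σ n_k d_k = 10M, and the gap Σ n_k (d − d_k) grows with d, matching (1).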

Theorem 3.1: A CU sketch with width w and depth d is able to achieve the minimum computational complexity O(nd/K) if partitioned into K sketches, irrespective of how the partition is performed, as long as the entire input stream (of size n) is equally fed into the K sketches.

Proof: See Appendix A.

Based on Theorem 3.1, the proposed P(d)-CU is also able to achieve the same lower-bound time complexity as CM when d = K, i.e., when each sketch is a one-dimensional array that collects and processes flows of one network service. We have t_CM ≤ t_P(d)-CU << t_CU = t_P(w)-CU.

IV. EXPERIMENTAL RESULTS

To assess the performance of all algorithms, we use the same data trace collected from a commercial travel booking website on Jan. 1, 2008. Four switches form a DCN exporting sFlow packets to a commercial server, which extracts the useful information from the packet header, including the source and destination IP addresses, workload in bytes, port number, and time, and formats it as a CSV record line. In total we received 29,614,720 lines of records. Our analyzer takes these data as input, and all results are computed on an ordinary personal laptop, a Thinkpad x220i with an Intel(R) Core(TM) i3-2310M CPU @ 2.10 GHz and 4 GB RAM. We use the port number to distinguish the DNS and HTTP traffic and feed them into the first two partitioned sketches, and treat all other services indistinguishably in the third sketch. Then, based on the inverse ratio of the fitted z parameters (to capture the amount of skew), we allocate their depths in a ratio of dHTTP : dDNS : dothers = 3 : 2 : 4.

We estimate the amount of workload carried by all IP pairs across the DCN, and compare our proposed algorithms P(w)-CU and P(d)-CU with the existing approaches CM and CU in terms of:

• compute time: the period of time spent generating the estimations of the reported workload;

• average relative error (ARE) of the reported workload: (1/m) Σ_{i=1}^{m} |âi − ai| / ai, where m is the dimension of a.

The comparison is under the fair constraint of equal space cost (measured in bytes), i.e., the sizes of CM and CU are the same as the sum of the space costs of the multiple sketches for P(w)-CU and P(d)-CU.
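The ARE metric is straightforward to compute once the true and estimated workloads are at hand; a small helper (variable names are illustrative):

```python
def average_relative_error(true_vals, estimates):
    """ARE = (1/m) * sum_i |a_hat_i - a_i| / a_i over the m distinct keys."""
    assert len(true_vals) == len(estimates)
    m = len(true_vals)
    return sum(abs(est - true) / true
               for true, est in zip(true_vals, estimates)) / m

# CM/CU-style sketches only overestimate, so each term (a_hat - a)/a is >= 0.
print(average_relative_error([10.0, 20.0], [12.0, 20.0]))  # 0.1
```

A perfect sketch gives ARE = 0; collisions push individual estimates, and hence the ARE, upward.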

The estimation error of all sketching algorithms depends not only on the width of the sketch (since w controls the collision probability when hashing different items into the same counter) but also on its depth (since the estimation takes the minimum over all d hashed counters). The ARE is thus inversely proportional to w and d, and proportional to m. Fig. 3(a) shows the ARE versus the total sketch depth d, with w = 8,192 held constant. It can be seen that P(d)-CU successfully reduces the estimation error by at least 50% compared with CU when d = 12, and this effect continues to hold as d increases, where the ARE can be as low as 0. CM does not produce satisfactory results, especially for smaller d, due to its overestimation (updating all hashed counters in the sketch). On the other hand, since w has a higher impact on the derived error than d, P(w)-CU performs even worse than CU due to the significantly smaller width allocated per partition, as confirmed in the figure.

Fig. 3(b) demonstrates the compute time, where P(d)-CU performs better than CU and very close to the lower-bound CM algorithm, consistent with our analysis in Section III. This gain becomes clearer as d increases, and the complexity reduction can reach up to 18% when d = 36. Meanwhile, P(w)-CU and CU achieve almost identical time performance, since the partition along w has no impact on it. As an overall trend, we observe their strict linearity, showing nice scalability with the space cost of the sketch.

Fig. 3(c) shows the ARE when estimating the workload of HTTP, DNS, and other services by P(d)-CU, where the partition allocates them the same dk. It further confirms that the z parameter has significant impact on the ARE, with the three network services experiencing different amounts of error. Since the ARE is inversely proportional to z, and zDNS is the biggest, DNS achieves the least error, only around 16% and 6% of that of HTTP and other services, respectively, when w = 2,048 under the same sketch size. For HTTP and then the others, this gain becomes weaker as w increases

[Fig. 3 images omitted: (a) ARE (%) vs. depth for CM, CU, P(d)-CU, and P(w)-CU; (b) compute time (seconds) vs. depth; (c) ARE (%) vs. width for HTTP-all, HTTP-80%, HTTP-40%, DNS, and others, with annotated points at 60% and 23%.]

Fig. 3. Experimental results of ARE and compute time vs. depth of the sketch, and ARE when estimating the workload of different services while varying w.

and after w = 7,000, the data structure successfully estimates their individual workloads almost without any error.

Finally, we discuss the impact of the number of distinct input items m on the time and error performance. We measure the compute time as the overall time consumed after producing results from all partitioned sketches in sequence, i.e., the overall input data size remains the same. Therefore, the proposed P(d)-CU algorithm fully accounts for the obtained improvement. Furthermore, not only the reduction of distinct input items (of size m, which is different from the number of records n) but also the proposed algorithm itself contributes to the obtained error improvement. As shown in Fig. 3(c), we feed into the sketch the HTTP traffic carried by 80% and 40% of the total m distinct source-destination IP pairs when w = 2,560, respectively. Then, by the pairwise independence of the hash functions mapping {1 . . . m} → {1 . . . w}, the amount of collision in a counter during an update will be proportionally reduced to 80% and 40%. From the figure we observe AREs of 60% and 23%, which shows that the algorithm itself achieves around 20% less error beyond the collision reduction brought by the smaller number of distinct input items m. Hence, all these results confirm that P(d)-CU yields better error performance than all other existing algorithms and a time complexity close to the lower-bound CM approach, and is thus applicable in dynamic DCNs managing data of growing size.

V. CONCLUSION AND FUTURE WORK

In this letter, we presented P(d)-CU as an enhancement of the existing CU algorithm, applied to identify the sources of congestion in DCNs in real time. It considers the amount of skew of different network services to successfully guarantee high accuracy with low computational complexity and, more importantly, to scale well with the increase of input data size. This is further confirmed by extensive experimental results on a real DCN trace. In the future, we plan to investigate the optimal sketch partition that achieves the least error based on the Zipfian parameters, and to theoretically derive this bound.

APPENDIX A

Proof of Theorem 3.1: We form the following optimization problem, i.e., minimizing the time complexity of the P(d)-CU approach under the constraint that the sum of the partitioned sketch depths equals the original sketch depth:

{d_k} = argmin_{d_k} Σ_{k=1}^{K} n_k d_k   s.t.   Σ_{k=1}^{K} d_k = d,   (2)

where n = Σ_{k=1}^{K} n_k. This is a classic constrained optimization problem, which can be solved using a Lagrangian multiplier λ. We set the gradient ∂L/∂d_k = 0, where L(λ) = Σ_{k=1}^{K} n_k d_k − λ(Σ_{k=1}^{K} d_k − d), and we have λ = n_k, ∀k (since Σ_k n_k = Σ_k λ = n, n_k = n/K, ∀k). Therefore, irrespective of how the partition is performed, the lowest computational complexity is always achieved when the input data stream is equally fed into each of the K sketches. Replacing n_k = n/K in the objective function completes the proof.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[2] A. R. Curtis, W. Kim, and P. Yalagandula, "Mahout: low-overhead datacenter traffic management using end-host-based elephant detection," in Proc. 2011 IEEE INFOCOM, pp. 1629–1637.

[3] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification," in Proc. 2004 ACM IMC, pp. 135–148.

[4] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, "Hedera: dynamic flow scheduling for data center networks," in Proc. 2010 NSDI, pp. 19–19.

[5] sFlow, http://www.sflow.org/.

[6] S. Muthukrishnan, "Data streams: algorithms and applications," Foundations and Trends in Theoretical Computer Science, vol. 1, no. 2, 2005.

[7] G. Cormode and M. Hadjieleftheriou, "Finding frequent items in data streams," VLDB J., vol. 1, no. 2, pp. 1530–1541, 2008.

[8] G. S. Manku and R. Motwani, "Approximate frequency counts over data streams," in Proc. 2002 VLDB Conf., pp. 346–357.

[9] A. Metwally, D. Agrawal, and A. E. Abbadi, "Efficient computation of frequent and top-k elements in data streams," in Proc. 2005 Int'l Conf. on Database Theory, pp. 398–412.

[10] G. Cormode and S. Muthukrishnan, "Summarizing and mining skewed data streams," in Proc. 2005 SIAM Conf. on Data Mining, pp. 44–55.

[11] A. Goyal, J. Jagarlamudi, H. Daumé III, and S. Venkatasubramanian, "Sketching techniques for large scale NLP," in Proc. 2010 NAACL HLT Sixth Web as Corpus Workshop, pp. 17–25.