
Mobile Netw Appl
DOI 10.1007/s11036-008-0110-0

Linear Representation of Network Traffic
With Special Application to Wireless Workload Generation

Stefan Karpinski · Elizabeth M. Belding · Kevin C. Almeroth · John R. Gilbert

© Springer Science + Business Media, LLC 2008

Abstract We propose a representation of wireless workload patterns as large, sparse matrices and provide a method for stochastically generating experimental workloads from a given matrix. The essential property of the algebraic representation is that the summation of vectors naturally yields a faithful description of the aggregate behavior of the corresponding flows. This deceptively simple property allows us to express many common concepts from traffic modeling succinctly in terms of a few linear transformations. The algebraic representation has many benefits: (1) it makes the meaning of generally understood but vague concepts, such as “uniform behavior,” mathematically precise and unambiguous; (2) it allows us to see clearly, through the lens of linear algebra, the implications of common modeling assumptions; (3) the implementation of traffic models becomes unprecedentedly simple and orthogonal, requiring only a handful of high-level matrix operations, which can be freely composed; (4) the vast body of algebraic theory and highly optimized numerical software may immediately be applied to traffic modeling. We use the paired differential simulation methodology introduced by the authors in previous work to experimentally demonstrate that the general matrix model accurately reproduces realistic network performance (Karpinski et al. 2007a, b). We use the same experimental methodology to explore the implications of various assumptions and simplifications that are commonly made in traffic modeling.

S. Karpinski (✉) · E. M. Belding · K. C. Almeroth · J. R. Gilbert
Department of Computer Science, University of California, Santa Barbara, USA
e-mail: [email protected]

E. M. Belding
e-mail: [email protected]

K. C. Almeroth
e-mail: [email protected]

J. R. Gilbert
e-mail: [email protected]

Keywords traffic modeling · traffic analysis · workload generation · wireless networks · wireless simulation · linear algebra · nonnegative matrix factorization

1 Introduction

More than 20 years of research in analysis and modeling of network traffic have yielded tremendous advances in our understanding of the complex behaviors that emerge from the interaction of millions of humans and computers on the Internet and in local-area networks (LANs). Both traffic analysis and realistic workload generation, however, remain active areas of research with many unanswered questions. One of the major challenges of modeling whole-network traffic is the interdependence of packet, flow,¹ and node behaviors.

¹ We use the common definition of a flow as a sequence of packets sharing the same “5-tuple”: IP protocol type, source and destination nodes, and TCP/UDP port numbers.

Three types of simplifying assumptions are commonly applied in both analysis and generation:

1. Uniform: that traffic properties are uniformly distributed. For example, assuming that all nodes in a network have equal probability of being the source or destination of each flow. Another common example is the assumption that all packet sizes and inter-packet intervals are the same; this is commonly known as the “constant bit-rate” (CBR) model.

2. Marginal: that traffic properties share a single, common marginal² distribution across the entire network, and individual values are drawn independently and unconditionally from this shared distribution. An example is assuming that a single distribution of packet sizes describes all flows in a network, as compared to equipping each flow with its own specific distribution of sizes.

3. Conditional: that traffic behaves marginally within equivalence classes defined by certain conditions. The assumption, for example, that all flows emanating from a common source node share the same packet size distribution is a conditional model, conditioned on the source nodes.
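The three assumptions can be contrasted on a toy example. The following is a minimal sketch, not a method from this paper: the `flows` data, the function names, and the choice of packet size as the modeled property are all illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical toy trace: each flow has a source node and observed packet sizes (bytes).
flows = [
    {"src": 1, "sizes": [1460, 1460, 1460, 1200]},  # bulk transfer
    {"src": 1, "sizes": [1460, 1400]},
    {"src": 2, "sizes": [48, 52, 48, 64, 48]},      # interactive session
]

def uniform_model(flows):
    """Uniform: every packet of every flow gets the same size (the overall mean)."""
    pool = [s for f in flows for s in f["sizes"]]
    mean = sum(pool) / len(pool)
    return lambda flow, rng: mean

def marginal_model(flows):
    """Marginal: one pooled empirical distribution shared by all flows."""
    pool = [s for f in flows for s in f["sizes"]]
    return lambda flow, rng: rng.choice(pool)

def conditional_model(flows):
    """Conditional: one pooled distribution per source node."""
    pools = defaultdict(list)
    for f in flows:
        pools[f["src"]].extend(f["sizes"])
    return lambda flow, rng: rng.choice(pools[flow["src"]])

rng = random.Random(0)
models = [("uniform", uniform_model), ("marginal", marginal_model),
          ("conditional", conditional_model)]
for name, build in models:
    sample = build(flows)
    # Draw four packet sizes for the interactive flow under each assumption.
    print(name, [sample(flows[2], rng) for _ in range(4)])
```

Under the uniform and marginal models, the interactive flow can be assigned packet sizes it never actually sent; only the conditional model (here conditioned on the source node) keeps its samples within the small sizes observed for that source.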

We would be arguing against a straw man if we implied that researchers who apply these assumptions actually believe them to be true. They clearly are not: different nodes initiate and receive drastically different numbers of packets and flows; flows have drastically different packet size distributions. Simply considering intuitive examples of common behavior illustrates this amply: BitTorrent users vs. occasional email checkers; downloading a large file vs. typing in an SSH session.

If no one actually believes that assumptions of uniformity and marginality accurately describe network behavior, then why are these assumptions so common? There are two reasons. The first is that these assumptions drastically simplify both traffic analysis and workload generation, making them tractable problems. The second is that there are no generally accepted alternatives for representing and simplifying traffic patterns that allow effective analysis and generation without assuming uniformity or marginality.

² The statistical term “marginal” refers to the margins of actuarial tables formerly used for statistical computations. The rows and columns of the table represent possible values of two properties. Each entry in a table contains a count of the number of events falling into the joint category for that row and column. The margins contain sums of the rows and columns, thus giving the unconditional distributions of each property.

Fundamentally, uniform, marginal and conditional models are applied so that network traffic behavior can be represented using a manageable collection of empirical or analytical statistical distributions. Reduction in representation size and complexity makes analysis and generation feasible. From this perspective, three questions immediately present themselves:

1. Do these assumptions distort the metrics we are trying to analyze, replicate, and predict?

2. Can we represent “unreduced” network behavior so that analysis and generation remain tractable?

3. Are there better ways to simplify and reduce the representation of network behavior?

This paper answers all three questions. We begin, however, by answering the second question, and then use our solution to address the other two.

To represent unreduced network behavior practically, we propose the concept of a linear representation of network traffic. A linear representation of traffic maps each flow to a vector, such that the following linearity condition is satisfied:

The sum of the representations of two flows is a vector that represents the aggregate behavior of the two flows.

The complete behavior of a network is expressed as a matrix of such vectors, having one row per flow. Linear representations express full network behavior, without imposing assumptions of uniformity or marginality, yet still allow effective traffic analysis and workload generation. We present a specific linear representation, called the general matrix model (GMM), and demonstrate experimentally that this model generates wireless workload that accurately reproduces the performance characteristics of real wireless network traffic.

To address the other two questions, we begin with a simple observation. In linear representations of traffic, uniform, marginal and conditional assumptions are assertions of equality between the rows and columns of the traffic matrix. Moreover, real traffic matrices can be transformed into uniform, marginal and conditional forms through very simple matrix multiplications. These transformations, however, are extremely destructive: the matrices are degenerate with very low rank, so multiplication destroys almost all information in the original traffic matrix. We demonstrate experimentally that this loss of information is not benign: uniform and marginal traffic models severely distort crucial network performance metrics.

Once traffic is represented linearly, alternatives to assuming uniformity or marginality become apparent.

[Figure 1, panels a to e. Panels a and b: node graphs with nodes labeled 0 to 9. Panels c and d: flow number versus time (seconds). Panel e: cumulative kilobytes versus time (seconds), with curves labeled “uniform (CBR)” and “trace (real)”.]

Figure 1 Examples illustrating real versus uniform behaviors at the node, flow and packet levels. a, b Example node behaviors. The width of each line is proportional to the logarithm of the number of flows between the nodes (zero is the Internet gateway). Uniform and trace flow behavior examples are plotted in c and d. The time axis indicates when flows start and end; the width of each flow line is proportional to the logarithm of its data rate. e Uniform (i.e. CBR) packet behavior with the trace of an actual flow. In the uniform model, the cumulative data sent increases smoothly over time, whereas in the actual packet trace, the transmissions are variable both in size and in inter-packet interval, leading to a “lumpy” cumulative data plot

We argue that matrix factorization, which is nondestructive and does not assume uniformity or marginality, is far superior to multiplication for this problem. To demonstrate this approach, we use nonnegative matrix factorization (NMF) [7] to find a low-dimensional approximation of the packet size and inter-packet interval matrices of a real traffic trace. This factorization extracts a small set of “basic behaviors” from the traffic matrix and simultaneously explains each flow’s observed behavior as a mixture of these basic behaviors. Although the factorization is computed without examining port numbers, basic behaviors nevertheless correspond to intuitive expectations for protocols. We find basic behaviors corresponding to typical network activities, including file transfer, typing via SSH, web surfing, ping traffic, peer-to-peer, and others.

The rest of the paper is organized as follows. In Section 2 we discuss motivation and related work. The notion of linear representation is presented in Section 3 and the general matrix model is presented in Section 4. Our experimental and analytical methodology for evaluating traffic models is presented in Section 5, while the results of our experiments are explained and analyzed in Section 6. The ramifications of these results are discussed in Section 7. Finally, in Section 8, we conclude with a summary of this research and its impact on the future of wireless networking.

2 Motivation and related work

The interaction of network users and applications with the lower layers of the networking stack is characterized by where, when, how much, and to whom data is transmitted. The joint pattern of traffic generation and mobility through time and space completely determines the effect of network usage on the lower levels of the stack. This is due to the data-agnostic nature of the protocol stack: by design, IP networks treat all data in the same manner [2].³ The credibility of conclusions derived from simulation or experimental deployment depends crucially on our confidence that the models used to generate traffic in experiments are sufficiently realistic. Figure 1 illustrates through examples how drastically different network traffic appears under assumptions of uniformity at different levels of behavior. These examples are taken from a wireless trace recorded at an IETF meeting; details of the trace are given in Section 5.2. Marginality assumptions are not as visually striking, but as we will demonstrate, they distort the performance characteristics of networks almost as much as, and sometimes more than, uniform models.

³ This is violated by some quality of service (QoS) schemes. However, we can simply add QoS metadata (such as traffic classes or urgency flags) to our models of user behavior and the rest of our arguments remain valid. The network is still disinterested in the exact content of the data being transported; only the QoS metadata is relevant.

While our work is not inherently specific to wireless networks, since the modeling techniques developed are equally applicable to traffic in wired networks, the results are of special importance to wireless research. Realistic traffic models have a drastic impact on experimental results in wireless networks [5, 6]. This is primarily because wireless networks are locally resource constrained. Compare this with the typical situation in wired LANs, where gigabit Ethernet and modern routers can easily handle all but the most intense levels of traffic. If a wired network operator expects larger volumes of traffic on a conditional basis, the solution is to upgrade the hardware, not to change the network protocols. In wireless networks, on the other hand, the physical medium cannot be upgraded; protocols must be improved to make better usage of the medium. The ability to perform experiments with realistic workload is a vital part of the ongoing process of improving wireless protocols. Without realistic workload models for experiments, performance comparisons may be misleading or even completely wrong [6].

Paxson and Floyd observed [8] that the interaction between endpoint behavior and the network conditions is inherently closed-loop in the sense that the emergent behavior is governed by a closed feedback loop. In the case of most non-TCP applications, this feedback does not occur: an open-loop model suffices. Hernández-Campos’s dissertation [3] addresses in great depth how to model the feedback between the network and typical TCP applications. The author shows how to abstract an application’s behavior from a TCP trace and replay it in a manner that accurately reflects how typical applications react in response to different network conditions. This research has subsequently been turned into a TCP flow generation tool called TMIX [12]. The acronym is somewhat misleading: TMIX does not provide models for application behaviors or for mixtures of applications in networks. Rather, it provides the ability to reproduce accurate closed-loop dynamics for a single TCP flow given an application behavior model as input. Our research complements this perfectly: we provide the means to analyze traffic and generate realistic whole-network application mixtures which can then serve as the necessary inputs to TMIX.

The two most prominent general traffic generation frameworks are Harpoon and D-ITG. Harpoon [11] uses a traffic trace for self-training, and can subsequently generate synthetic traffic with certain statistical properties based on the original trace. The properties reproduced are the empirical distributions of the following: “file size, inter-connection time, source and destination IP ranges, [and] number of active sessions.” Harpoon exemplifies the uniform and marginal modeling approaches. Source and destination nodes are treated marginally: empirically derived marginal frequencies of sources and destinations are used to choose endpoints independently for each flow. Flow sizes (i.e. file sizes) are likewise treated marginally: each flow’s total byte count is sampled from a marginal distribution of flow sizes. UDP flows are modeled as being constant bit-rate; TCP flows are all treated as file transfers, making them effectively uniform with bit-rate determined by TCP dynamics. Inter-connection times are sampled from a single empirically derived marginal distribution. While Harpoon may be sufficiently realistic for the intended purpose of generating Internet backbone traffic,⁴ for generating local-area traffic in wireless experiments, it is not. Previous research has shown that uniform packet behavior models used in Harpoon drastically distort vital performance metrics at all levels of the protocol stack [6]. This paper shows, moreover, that marginal models, like those used in Harpoon, also severely misrepresent performance metrics in wireless networks.

⁴ Even this is unclear: generated traffic is only compared with trace traffic using the very same metrics that are explicitly specified as part of the model. Unsurprisingly, the distributions of sampled metrics closely match the distributions those metrics were sampled from. Realism is not double-checked using other statistical metrics or actual performance metrics.

Like Harpoon, D-ITG [1] exemplifies uniform and marginal modeling assumptions. However, unlike Harpoon, D-ITG does not derive models from real traffic. Packet behavior can be modeled using standard statistical distributions (“constant, uniform, normal, Cauchy, Pareto, exponential, etc.”). Guidance is not provided, however, for choosing among these distributions, choosing parameter values, or selecting a mixture of distributions across a network. Flow and node behaviors are not specified at all, but rather left to the user’s discretion. The focus of this project is primarily on the engineering challenge of generating very large volumes of synthetic traffic and injecting it into networks in a distributed fashion. While D-ITG provides the mechanism for generating large volumes of traffic, it does not provide any real guidance as to what traffic to generate.

Hernández-Campos et al. [4] have defined the state of the art in parametric modeling of wireless traffic patterns. They use rigorous statistical methods to fit parametric models to a number of properties of wireless traffic. Flow arrivals fit a non-stationary Poisson model; flow inter-arrival times fit a Lognormal distribution; the number of flows per session fits a BiPareto model; flow sizes also fit a BiPareto model. The statistical analysis fitting these parametric models is thorough and convincing. The models, however, are marginal: a single statistical model applies to all flows and nodes; packet behavior is not considered. Each node has a different number of flows assigned to it, but otherwise nodes are identical. Moreover, all flows, regardless of source node, are statistically identical: flow sizes are drawn from a single marginal distribution, and nothing else distinguishes them. Yet, intuitively it is clear that different nodes in real networks have drastically different tendencies for flow size, flow rates, and packet behavior. Such interdependent behaviors cannot be captured using marginal modeling. In this paper, we demonstrate that this inability is not harmless: treating all flows and nodes statistically identically severely distorts crucial wireless performance metrics in experiments. We also provide sorely needed methods of representing and analyzing traffic patterns without making uniform or marginal assumptions about network behavior.

This work uses the technique known as paired differential simulation to evaluate the realism of modified or synthetically generated wireless workloads. This technique was proposed and subsequently refined by Karpinski et al. [5, 6]. They introduce the notion of sufficient realism, and propose paired differential simulation as an experimental methodology based on the standard technique of pairing control and test subjects. We describe the technique in Section 5. In the first paper [5], they explore various ways of modifying trace traffic patterns, as a form of sensitivity analysis to determine which aspects of network behavior are essential and must be preserved in realistic models, and which aspects can be discarded as inessential. Of particular interest is their conclusion that detailed time-series packet behavior for flows is inessential: it suffices, for each flow, to have accurate unconditional distributions of packet sizes and inter-packet intervals; sufficiently realistic flow behavior can be reconstructed from these distributions by repeated independent sampling. In the second paper [6], Karpinski et al. explore the impact of a broad variety of uniformity assumptions at the packet, flow and node levels of behavior. They find that all uniformity assumptions are detrimental, inducing significant and often extreme misrepresentation of a broad variety of wireless performance metrics at every level of the protocol stack. Moreover, they show, using OLSR and AODV, that uniform traffic models can completely invert the relative performance of network protocols.

3 Linear representation of traffic

The traffic workload pattern in a network is a collection of IP packets between hosts in the network sent at certain times. It is standard in network analysis, however, to aggregate sequences of packets sharing the same “5-tuple”: IP protocol, source and destination nodes, and source and destination TCP/UDP port numbers. Such a sequence of packets is called a flow. The traffic pattern of an entire network is simply the collected behaviors of all flows occurring in the network.

The behavior of a flow as it affects the network is characterized by the following properties: its IP protocol type (TCP, UDP, ICMP, etc.); its source and destination nodes; its start time; and the specific sizes of packets and intervals between their transmission. We begin our exposition of traffic representation by formalizing these properties mathematically as sets:

• types: $\mathit{type} \in \mathcal{Y} = \{1, \ldots, 255\} \subset \mathbb{N}$

• nodes: $\mathit{src}, \mathit{dst} \in \mathcal{N} = \{1, \ldots, n\} \subset \mathbb{N}$

• times: $\mathit{start} \in \mathcal{T} = \mathbb{R}$

• sizes: $\mathit{size} \in \mathcal{Z} = \{0, \ldots, \mathrm{MTU}\} \subset \mathbb{N}$

• intervals: $\mathit{ival} \in \mathcal{V} = \mathbb{R}^{+} \cup \{\infty\}$.

There are 255 possible IP types (0 is reserved for IPv6). The number $n$ is simply the total number of nodes in the network being modeled. MTU is the maximum transmission unit of the network (typically around 1,500 bytes). We formalize the packet behavior of a flow as a pair of infinite sequences of packet sizes and inter-packet intervals:

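\[
\mathit{sizes} = \langle \mathit{size}_1, \mathit{size}_2, \ldots \rangle \in \mathcal{Z}^{\mathbb{N}} \tag{1}
\]

\[
\mathit{ivals} = \langle \mathit{ival}_1, \mathit{ival}_2, \ldots \rangle \in \mathcal{V}^{\mathbb{N}}. \tag{2}
\]

Since flows contain only a finite number of packets, each flow has a packet count, $\mathit{pkts} \in \mathbb{N}$, and we require that the sequences $\mathit{sizes}$ and $\mathit{ivals}$ satisfy the following requirements:

\[
\forall k : \mathit{size}_k > 0 \iff k \le \mathit{pkts} \tag{3}
\]

\[
\forall k : \mathit{ival}_k < \infty \iff k < \mathit{pkts}. \tag{4}
\]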

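Actual packet sizes are required to be positive because we are modeling application layer data transfer: empty data transmission requests are non-events. The send time of the $k$th packet can be expressed as

\[
\mathit{time}_k = \mathit{start} + \sum_{i=1}^{k-1} \mathit{ival}_i. \tag{5}
\]

Equation 4 implies that this will be finite if and only if $k \le \mathit{pkts}$. Finally, we formalize the space of flows:

\[
\mathcal{F} = \mathcal{Y} \times \mathcal{N}^2 \times \mathcal{T} \times \mathbb{N} \times \mathcal{Z}^{\mathbb{N}} \times \mathcal{V}^{\mathbb{N}}. \tag{6}
\]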

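We write a generic element of $\mathcal{F}$ as

\[
\mathit{flow} = \langle \mathit{type}, \mathit{src}, \mathit{dst}, \mathit{start}, \mathit{pkts}, \mathit{sizes}, \mathit{ivals} \rangle \in \mathcal{F}. \tag{7}
\]

Let $2^{\mathcal{F}}$ denote the power set of $\mathcal{F}$, and $\mathcal{F}^* \subseteq 2^{\mathcal{F}}$ the set of all finite sets of flows:

\[
\mathcal{F}^* = \{ X \in 2^{\mathcal{F}} : |X| < \infty \}. \tag{8}
\]

A traffic sample is simply an element of $\mathcal{F}^*$. We will refer to elements of $\mathcal{F}^*$ interchangeably as “traffic samples” or “traffic instances.”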
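The formalization above can be mirrored in a small data structure. This is a minimal sketch under our own assumptions: the class and field names are hypothetical, only the first pkts entries of the infinite sequences are stored (the tail of zero sizes and infinite intervals is left implicit), and at least one packet per flow is required for simplicity.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import Tuple

@dataclass(frozen=True)
class Flow:
    type: int                  # IP protocol type, 1..255
    src: int                   # source node, 1..n
    dst: int                   # destination node, 1..n
    start: float               # start time
    sizes: Tuple[int, ...]     # the pkts positive packet sizes (Eq. 3)
    ivals: Tuple[float, ...]   # the pkts - 1 finite intervals (Eq. 4)

    def __post_init__(self):
        # Assumption of this sketch: every flow has at least one packet.
        if not self.sizes or any(s <= 0 for s in self.sizes):
            raise ValueError("need at least one packet, all of positive size")
        if len(self.ivals) != len(self.sizes) - 1 or any(v < 0 for v in self.ivals):
            raise ValueError("need exactly pkts - 1 nonnegative intervals")

    @property
    def pkts(self) -> int:
        return len(self.sizes)

    def send_times(self):
        # Eq. 5: time_k = start + ival_1 + ... + ival_{k-1}
        return list(accumulate(self.ivals, initial=self.start))

flow = Flow(type=6, src=1, dst=2, start=10.0, sizes=(1460, 1460, 512), ivals=(0.5, 1.5))
# A traffic sample is a finite set of flows.
sample = frozenset({flow, Flow(17, 2, 1, 0.0, (64,), ())})
print(flow.pkts, flow.send_times())
```

Making the dataclass frozen keeps flows hashable, so a traffic sample can be an actual finite set, as in Eq. 8.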


Having chosen a mathematical formalization of network traffic, we may now define the notion of a linear representation of network traffic.

Definition 1 A linear representation of network traffic is a pair, $\langle V, \psi \rangle$, where $V$ is a vector space and $\psi$ is a function, $\psi : \mathcal{F}^* \to V$, such that

\[
\forall T \in \mathcal{F}^* : \psi(T) = \sum_{\mathit{flow} \in T} \psi(\mathit{flow}). \tag{9}
\]

In plain words, every finite collection of flows is represented by a vector that is the sum of the elements representing the individual flows in the collection. The careful reader will notice that any function $\mathcal{F} \to V$ uniquely determines a linear representation: Eq. 9 uniquely extends the function to all of $\mathcal{F}^*$. Representing flows by arbitrary vectors, however, is not particularly useful or interesting. We focus instead on meaningful representations, where collections of flows are represented by vectors that naturally describe the aggregate behavior in some manner. In the next section, we clarify these concepts through examples.

3.1 Examples of linear representations

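Our first example is the packets-bytes-duration (PBD) representation. The representation is $\mathrm{pbd} : \mathcal{F}^* \to \mathbb{R}^3$, mapping $\mathit{flow} \mapsto \langle p, b, d \rangle \in \mathbb{R}^3$, where

\[
p = \mathit{pkts}, \qquad b = \sum_{i=1}^{p} \mathit{size}_i, \qquad d = \sum_{i=1}^{p-1} \mathit{ival}_i. \tag{10}
\]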

This representation encapsulates the information necessary to reproduce a CBR version of each flow: packets have payloads of $\lfloor b/p \rfloor$ bytes, and are sent every $d/p$ seconds.⁵ Given two flows with behavior vectors $x_1$ and $x_2$, their sum as vectors,

\[
x_1 + x_2 = \langle p_1 + p_2,\; b_1 + b_2,\; d_1 + d_2 \rangle, \tag{11}
\]

provides an appropriate description of the aggregate behavior of the two flows together. Similarly, their average, $\tfrac{1}{2}(x_1 + x_2)$, has the average number of packets, bytes, and duration, which is precisely what the “average behavior” of the two flows should intuitively be. In contrast, a “packets-payload-interval” representation, expressing the CBR parameters directly, does not have this property: the sum of average payloads is not the overall average of payloads; likewise for intervals. In the PBD representation, the linearity property leads to natural descriptions of aggregate behavior through vector addition, whereas in the packets-payload-interval representation, vector addition is “unnatural.”

⁵ There are many schemes to correct the discrepancy in bytes when $p$ does not divide $b$ evenly. The simplest is to ignore it; we leave more complex schemes to the reader.
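The contrast between the two representations can be checked numerically. The sketch below uses hypothetical helper names (`pbd`, `add`, `cbr_params`) and made-up flows; it ignores the rounding discrepancy in the payload by using integer division.

```python
def pbd(sizes, ivals):
    """PBD vector <p, b, d> of one flow (Eq. 10)."""
    return (len(sizes), sum(sizes), sum(ivals))

def add(x, y):
    """Componentwise vector sum (Eq. 11)."""
    return tuple(a + b for a, b in zip(x, y))

def cbr_params(x):
    """CBR payload (bytes) and send interval (seconds) implied by a PBD vector."""
    p, b, d = x
    return (b // p, d / p)

x1 = pbd([1000, 1000, 1000, 1000], [1.0, 1.0, 1.0])  # four large packets
x2 = pbd([100, 100], [4.0])                          # two small packets
total = add(x1, x2)

# Natural: the summed PBD vector gives the true overall mean payload.
overall_payload, overall_interval = cbr_params(total)

# Unnatural: averaging the per-flow CBR payloads weights both flows equally,
# regardless of how many packets each one sent.
mean_of_payloads = (cbr_params(x1)[0] + cbr_params(x2)[0]) / 2

print(total, overall_payload, mean_of_payloads)
```

The summed PBD vector yields the payload obtained by dividing total bytes by total packets, whereas combining the per-flow payloads directly gives a different number, because the packet counts needed for correct weighting have been discarded.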

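The PBD representation is highly lossy: it discards most of the information detailing the behavior of each flow. At the other extreme, we can give a “perfect” representation of flows, that allows us to completely reconstruct each flow from its representation. With care, the sum of representations of flows can still naturally describe their aggregate behavior. Let the representation be $\Phi : \mathcal{F}^* \to \mathcal{X}$, where

\[
\mathcal{X} = \mathbb{R}^{255} \times \mathbb{R}^{n} \times \mathbb{R}^{n} \times L^2[\mathbb{R}] \times L^2[\mathbb{R}]. \tag{12}
\]

Here $L^2[\mathbb{R}]$ is the vector space of real functions, $\mathbb{R}^{\mathbb{R}} = \{ f : \mathbb{R} \to \mathbb{R} \}$, with the usual norm [10]. We define

\[
\Phi(\mathit{flow}) = \langle \mathbf{e}_{\mathit{type}}, \mathbf{e}_{\mathit{src}}, \mathbf{e}_{\mathit{dst}}, f, g \rangle, \tag{13}
\]

where $\mathbf{e}_k$ is the $k$th standard basis vector, and $f, g \in L^2[\mathbb{R}]$ are functions defined by

\[
f(t) = \sum_{k=1}^{\mathit{pkts}} \begin{cases} 1 & \text{if } t \ge \mathit{time}_k \\ 0 & \text{if } t < \mathit{time}_k, \end{cases} \tag{14}
\]

\[
g(t) = \sum_{k=1}^{\mathit{pkts}} \begin{cases} \mathit{size}_k & \text{if } t \ge \mathit{time}_k \\ 0 & \text{if } t < \mathit{time}_k. \end{cases} \tag{15}
\]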

Thus, $f$ and $g$ are both integer-valued, monotonically increasing step functions, giving the cumulative number of packets and total bytes sent, respectively, at time $t$. From the representation, $\Phi(\mathit{flow})$, of a single flow, we can readily recover its type, source, destination, and the transmission times and sizes of packets.

Why is $\Phi$ natural in the same sense that the PBD representation is natural? Suppose we have a traffic sample, $T \in \mathcal{F}^*$. The first component of $\Phi(T)$ is a 255-dimensional vector, whose coordinates are a histogram of the number of times each flow type occurs in $T$. Similarly, the second and third components of $\Phi(T)$ are $n$-dimensional histograms of the number of times each node occurs as source and destination. The last components are step functions, like $f$ and $g$. Vector addition in $L^2[\mathbb{R}]$ dictates that these functions give the total number of cumulative packets and bytes sent, respectively, over all flows in $T$. Thus, summation of vectors under the representation $\Phi$ yields a natural and useful description of the aggregate behavior of flows.
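One possible machine encoding of $\Phi$ is sketched below. It is an illustration under our own assumptions, not an implementation from the paper: the three basis-vector components are stored sparsely as `Counter` histograms, and the step functions $f$ and $g$ are stored as a list of (time, size) packet events from which they are evaluated on demand. The names `Phi` and `phi` are hypothetical.

```python
from collections import Counter
from itertools import accumulate

class Phi:
    """Sparse encoding of a representation vector in the space X (Eq. 12)."""

    def __init__(self, types=None, srcs=None, dsts=None, events=None):
        self.types = Counter(types or {})    # histogram over IP types
        self.srcs = Counter(srcs or {})      # histogram over source nodes
        self.dsts = Counter(dsts or {})      # histogram over destination nodes
        self.events = sorted(events or [])   # (time_k, size_k) pairs

    def __add__(self, other):
        # Vector addition: histograms add, step functions add pointwise,
        # which is the same as pooling the packet events.
        return Phi(self.types + other.types, self.srcs + other.srcs,
                   self.dsts + other.dsts, self.events + other.events)

    def f(self, t):
        """Cumulative number of packets sent by time t (Eq. 14)."""
        return sum(1 for time, _ in self.events if time <= t)

    def g(self, t):
        """Cumulative bytes sent by time t (Eq. 15)."""
        return sum(size for time, size in self.events if time <= t)

def phi(type_, src, dst, start, sizes, ivals):
    """Representation of a single flow (Eq. 13)."""
    times = accumulate(ivals, initial=start)   # Eq. 5
    return Phi({type_: 1}, {src: 1}, {dst: 1}, list(zip(times, sizes)))

a = phi(6, 1, 2, 0.0, [100, 200], [1.0])   # TCP flow, packets at t = 0.0 and 1.0
b = phi(17, 1, 3, 0.5, [50], [])           # UDP flow, one packet at t = 0.5
total = a + b
print(dict(total.srcs), total.f(0.5), total.g(0.5), total.g(1.0))
```

Summing two encoded flows produces the histograms and cumulative step functions of the pair, which is the naturalness property described above.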

Consider, for contrast, if we had represented flow type simply by $\mathit{type} \in \mathbb{N}$ instead of $\mathbf{e}_{\mathit{type}} \in \mathbb{R}^{255}$. The natural numbers containing $\mathit{type}$ can readily be embedded into $\mathbb{R}^1$, which is a vector space. However, the addition of vectors is highly unnatural. The sum of protocol numbers is meaningless: two TCP flows plus five ICMP flows does not equal a single UDP flow, despite the fact that $2 \cdot 6 + 5 \cdot 1 = 17$. The summation of vectors in the representation $\Phi$, on the other hand, is $5\,\mathbf{e}_1 + 2\,\mathbf{e}_6$, indicating the frequency with which each type occurred.
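The protocol-number example can be reproduced directly. This short sketch uses hypothetical helpers `e` and `vec_sum`; the protocol numbers 6, 1 and 17 for TCP, ICMP and UDP are the ones used in the text.

```python
def e(k, dim=255):
    """k-th standard basis vector of R^dim (1-indexed)."""
    v = [0] * dim
    v[k - 1] = 1
    return v

def vec_sum(vectors):
    """Componentwise sum of a collection of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]

TCP, ICMP, UDP = 6, 1, 17
types = [TCP, TCP] + [ICMP] * 5

scalar = sum(types)                   # coincides with UDP's number, a meaningless result
hist = vec_sum(e(t) for t in types)   # 5 e_1 + 2 e_6, a histogram of types

print(scalar == UDP, hist[ICMP - 1], hist[TCP - 1], hist[UDP - 1])
```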

Both of the example representations discussed here can be applied in practice. The PBD representation naturally allows us to compute aggregate and average CBR behaviors of collections of flows. With a suitable machine encoding of step-functions, $\Phi$ can be implemented and used in practice. It provides complete information about individual flows and highly detailed (though not complete) information about collections of flows. Representations that preserve even more information about traffic instances can readily be constructed. We turn now, however, in a different direction, and introduce a specific linear representation that strikes a practical balance in fidelity between $\mathrm{pbd}$ and $\Phi$.

4 The general matrix model

The PBD representation of flows uses three dimensions to encode the behavior of each flow, while $\Phi$ uses $255 + 2n$ dimensions, plus the continuum of dimensions of two Banach spaces (i.e. the two copies of $L^2[\mathbb{R}]$). In this section, we define a representation, $\varphi$, that splits the difference and uses a large but finite number of dimensions to represent each flow. Once the vector representation of each flow is defined, we may construct matrices of representation vectors, with a row for each flow in a traffic sample. We call the matrix representation of network traffic the general matrix model (GMM). Having expressed network behavior in terms of matrices, we can very simply and efficiently compute a broad variety of network behavior properties using standard matrix operations. Furthermore, we can express uniformity and marginality assumptions about behavior as equality constraints between the rows and columns of the traffic matrix. Moreover, we can transform unconstrained traffic instances into uniform, marginal, or conditional approximations via right or left multiplications by easily computed matrices.
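To make the idea of a transforming multiplication concrete, the sketch below shows one plausible instance. It rests on our own assumption, which this section does not spell out: that a marginal approximation can be obtained by left-multiplying a traffic matrix by the rank-one averaging matrix $J = \tfrac{1}{f}\mathbf{1}\mathbf{1}^{\mathsf{T}}$, which replaces every row by the mean row. The toy matrices and the `matmul` helper are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Toy packet-size histogram matrix: one row per flow, one column per size bin.
Z = [[4, 0, 0],
     [0, 4, 4],
     [0, 0, 8],
     [4, 4, 0]]

f = len(Z)
J = [[1.0 / f] * f for _ in range(f)]   # averaging matrix: all entries 1/f, rank 1

M = matmul(J, Z)   # every row of M is the mean row of Z

# A very different traffic matrix with the same column sums...
Z_other = [[8, 8, 12],
           [0, 0, 0],
           [0, 0, 0],
           [0, 0, 0]]
# ...collapses to exactly the same marginal matrix.
M_other = matmul(J, Z_other)

print(M[0], M == M_other)
```

Because all rows of the product are identical, the result satisfies the row-equality constraint of a marginal model. The same example shows why such a multiplication is destructive: distinct traffic matrices with equal column sums become indistinguishable.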

The general strategy for defining $\varphi$ is to divide the descriptive properties of each flow into bins (one per value for properties that are discrete already and require no compression), and map values in each bin to a standard basis element of $\mathbb{R}^d$, where $d$ is the number of bins. In this scheme, representation vectors are effectively histograms, with coordinate values in each dimension counting items falling into the corresponding bin. Table 1 lists the properties that are used to define $\varphi$, together with the quantization and dequantization functions. Each quantization function takes a possible property value as input and maps it to an index value in $\{1, \ldots, d\}$, where $d$ is the number of bins, i.e. dimension, for the property in question.

The histogram-vector approach produces some representations of properties that are not immediately intuitive, and may seem to use an excessive number of representation dimensions. However, this approach automatically satisfies the “naturalness” criterion for linear representations. That is, the representation, $\varphi(T)$, naturally describes the aggregate behavior of flows in $T$. Perhaps even more importantly, arbitrary linear combinations of vectors can be meaningfully interpreted as describing some realizable, hypothetical traffic instance. In Section 4.5 we will describe how to generate flows from arbitrary composite representation vectors.

4.1 Model specification

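The quantization functions specified in Table 1 allow us to define the representation $\varphi : \mathcal{F}^* \to \mathbb{R}^N$:

\[
\varphi(\mathit{flow}) = \mathbf{f} = \langle \mathbf{t}, \mathbf{s}, \mathbf{d}, \mathbf{a}, \mathbf{b}, \mathbf{z}, \mathbf{v} \rangle, \tag{16}
\]

where $N = 255 + 2n + d_a + d_b + d_z + d_v$ and the component vectors are

\[
\begin{aligned}
\mathbf{t} &= \mathbf{e}_k, & k &= Q_{\mathit{type}}(\mathit{type}) \\
\mathbf{s} &= \mathbf{e}_k, & k &= Q_{\mathit{src}}(\mathit{src}) \\
\mathbf{d} &= \mathbf{e}_k, & k &= Q_{\mathit{dst}}(\mathit{dst}) \\
\mathbf{a} &= \mathbf{e}_k, & k &= Q_{\mathit{start}}(\mathit{start}) \\
\mathbf{b} &= \mathbf{e}_k, & k &= Q_{\mathit{bytes}}(\mathit{bytes}) \\
\mathbf{z} &= \textstyle\sum \mathbf{e}_k, & k &\in \{ Q_{\mathit{size}}(\mathit{size}_i) : i \le \mathit{pkts} \} \\
\mathbf{v} &= \textstyle\sum \mathbf{e}_k, & k &\in \{ Q_{\mathit{ival}}(\mathit{ival}_i) : i < \mathit{pkts} \}.
\end{aligned}
\]

Each packet contributes one term to the sums for $\mathbf{z}$ and $\mathbf{v}$, so these two vectors are histograms of the quantized packet sizes and inter-packet intervals.

Table 1 Quantization and dequantization functions for properties of flows

| Property (prop) | Domain | Quantization ($Q_{\mathit{prop}}$) | Dim. | Dequantization ($Q^{-1}_{\mathit{prop}}$) |
|---|---|---|---|---|
| IP type | $\mathit{type} \in \{1, \ldots, 255\}$ | $y \mapsto y$ | 255 | $x \mapsto \lceil x \rceil$ |
| Source node | $\mathit{src} \in \{1, \ldots, n\}$ | $s \mapsto s$ | $n$ | $x \mapsto \lceil x \rceil$ |
| Destination node | $\mathit{dst} \in \{1, \ldots, n\}$ | $d \mapsto d$ | $n$ | $x \mapsto \lceil x \rceil$ |
| Start time | $\mathit{start} \in [0, t_{\max}]$ | $t \mapsto \max\{1, \lceil d_a (t / t_{\max}) \rceil\}$ | $d_a$ | $x \mapsto t_{\max} (x / d_a)$ |
| Bytes (flow size) | $\mathit{bytes} \in \{1, \ldots, b_{\max}\}$ | $b \mapsto 1 + \lceil (d_b - 1) \left( \frac{b - 1}{b_{\max} - 1} \right)^{\beta} \rceil$ | $d_b$ | $x \mapsto \max\{1, \lceil (b_{\max} - 1) (x / d_b)^{1/\beta} \rceil\}$ |
| Packet size | $\mathit{size} \in \{1, \ldots, \mathrm{MTU}\}$ | $z \mapsto z$ | $d_z$ | $x \mapsto \lceil x \rceil$ |
| Inter-packet interval | $\mathit{ival} \in [0, v_{\max}]$ | $v \mapsto \max\{1, \lceil d_v \left( \frac{v - v_{\min}}{v_{\max} - v_{\min}} \right)^{\gamma} \rceil\}$ | $d_v$ | $x \mapsto v_{\min} + (v_{\max} - v_{\min}) (x / d_v)^{1/\gamma}$ |

The domain specifies the set of possible input values to the quantization function. Quantized values are elements of $\{1, \ldots, d\}$, where $d$ is the dimension specified. The dequantization function maps a real value, $x \in [0, d]$, back into the original domain.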

The functions $Q_{prop}$ are the respective quantization functions for each property. If there are f flows in a traffic instance, then the collective behavior can be described with seven matrices, one for each property; the rows of each matrix are the behavior vectors of the flows, indexed in some chosen order. We use the following notations for the matrices and indexed flow vectors:

– types: $T = (t_i) = (t_{ij}) \in \mathbb{R}^{f \times 255}$

– sources: $S = (s_i) = (s_{ij}) \in \mathbb{R}^{f \times n}$

– destinations: $D = (d_i) = (d_{ij}) \in \mathbb{R}^{f \times n}$

– start times: $A = (a_i) = (a_{ij}) \in \mathbb{R}^{f \times d_a}$

– sizes in bytes: $B = (b_i) = (b_{ij}) \in \mathbb{R}^{f \times d_b}$

– packet sizes: $Z = (z_i) = (z_{ij}) \in \mathbb{R}^{f \times d_z}$

– inter-packet intervals: $V = (v_i) = (v_{ij}) \in \mathbb{R}^{f \times d_v}$.

The total behavior matrix is the horizontal concatenation of these component matrices:

$$F = [\, T \;\; S \;\; D \;\; A \;\; B \;\; Z \;\; V \,] \in \mathbb{R}^{f \times N}. \tag{17}$$

We call this total representation the general matrix model (GMM). It is by no means the only possible representation of traffic in terms of matrices. However, as we demonstrate in the following sections, many common traffic modeling assumptions can be expressed succinctly and precisely in terms of transformations or conditions of the GMM and its components. Furthermore, realistic workload can be generated from matrix instances.
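As an illustration of Eqs. 16 and 17, the following minimal Python sketch assembles the representation vector of a flow from already quantized bin indices and shows that adding two such vectors yields the histogram of the aggregate. The helper names (`one_hot`, `histogram`, `represent_flow`), the dictionary layout of a flow, and the small toy dimensions are assumptions made for this example; they are not taken from the authors' implementation.

```python
def one_hot(k, dim):
    """Unit vector e_k in R^dim, with k a 1-based bin index."""
    v = [0.0] * dim
    v[k - 1] = 1.0
    return v


def histogram(bins, dim):
    """Sum of unit vectors e_k over a list of 1-based bin indices."""
    v = [0.0] * dim
    for k in bins:
        v[k - 1] += 1.0
    return v


def represent_flow(flow, n, d_a, d_b, d_z, d_v):
    """Concatenate <t, s, d, a, b, z, v> for one flow of quantized values."""
    return (one_hot(flow["type"], 255) + one_hot(flow["src"], n)
            + one_hot(flow["dst"], n) + one_hot(flow["start"], d_a)
            + one_hot(flow["bytes"], d_b) + histogram(flow["sizes"], d_z)
            + histogram(flow["ivals"], d_v))


# Toy dimensions (hypothetical; the paper uses 1,500 bins per property).
dims = dict(n=3, d_a=4, d_b=4, d_z=5, d_v=4)
flow1 = dict(type=6, src=1, dst=2, start=1, bytes=2,
             sizes=[5, 5, 2], ivals=[1, 3])
flow2 = dict(type=17, src=1, dst=3, start=4, bytes=1,
             sizes=[1], ivals=[])

f1 = represent_flow(flow1, **dims)
f2 = represent_flow(flow2, **dims)
F = [f1, f2]                                  # behavior matrix, one row per flow
aggregate = [x + y for x, y in zip(f1, f2)]   # describes both flows together
```

In the aggregate vector, the coordinate for source node 1 holds the value 2, because both toy flows originate there; this is the sense in which summation describes collective behavior.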

4.2 Choosing constants

Before continuing with less mundane topics, we must briefly address our choices of the constants appearing in Table 1. The externally predetermined constants are: $d_z$ = MTU = 1,500, determined by the wireless network medium; $b_{max}$ ≈ 109 MB, determined by the size of the largest flow in our trace data; $t_{max} = v_{max} = 600$ s, determined by our experimental methodology, which slices traces into 10-min scenarios; and $v_{min} = 10^{-6}$ s, determined by the time resolution of our trace. For lack of a better choice, we have simply chosen to use the same number of quantization bins as for packet size for the unfixed dimension constants: $d_a = d_b = d_v = 1{,}500$. Arguments can certainly be made for other choices, but this choice appears to work adequately in practice.

Two other unspecified constants appear in Table 1: β and γ. These parameters are non-linear scaling exponents that allow the quantization to shift smoothly from fine-grained resolution at the low end of the spectrum to coarse-grained resolution at the high end. We have chosen β = 1/2.7 and γ = 1/3 because these values yield desirable dynamic ranges: from single bytes at the lower end to megabytes at the high end for flow sizes; from microseconds to seconds for intervals.
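To make the effect of the scaling exponents concrete, the following Python sketch transcribes the byte and interval quantizers of Table 1 with the constants given above. It is an illustration under stated assumptions: "109 MB" is read as 109 × 10^6 bytes, and intervals outside $[v_{min}, v_{max}]$ are clamped before quantization (the table does not say how such values are handled).

```python
import math

# Constants from the text; reading "109 MB" as 109e6 bytes is an assumption.
B_MAX, D_B, BETA = 109 * 10**6, 1500, 1 / 2.7
V_MIN, V_MAX, D_V, GAMMA = 1e-6, 600.0, 1500, 1 / 3


def q_bytes(b):
    """Flow size in bytes -> bin index in {1, ..., D_B}."""
    return 1 + math.ceil((D_B - 1) * ((b - 1) / (B_MAX - 1)) ** BETA)


def q_ival(v):
    """Inter-packet interval in seconds -> bin index in {1, ..., D_V}."""
    v = min(max(v, V_MIN), V_MAX)  # clamp to [V_MIN, V_MAX] (an assumption)
    return max(1, math.ceil(D_V * ((v - V_MIN) / (V_MAX - V_MIN)) ** GAMMA))


def q_ival_inv(x):
    """Real bin position x in [0, D_V] -> interval in seconds."""
    return V_MIN + (V_MAX - V_MIN) * (x / D_V) ** (1 / GAMMA)


# Resolution is fine for small flows and coarse for large ones.
low_bins = q_bytes(1000) - q_bytes(1)                # bins spanned by the first kilobyte
high_bins = q_bytes(B_MAX) - q_bytes(B_MAX - 1000)   # bins spanned by the last kilobyte
```

With these exponents the first kilobyte of the flow-size range is spread over many bins, while the last kilobyte falls into a single bin, which is the intended shift from fine to coarse resolution.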

4.3 Computation of common properties

Having defined a specific matrix representation of network behavior, we turn now to deriving useful expressions for commonly used network properties in terms of these algebraic building blocks.

4.3.1 Packet size and inter-packet interval statistics

From the matrices Z and V we can compute a variety of statistics about packet sizes and inter-packet intervals. The row vectors in the matrices may represent individual flows or composites of many flows. The computations require knowing the values that each matrix column represents, whether size or interval. The vectors of representative values for Z and V respectively are:

$$\sigma = \langle 1, 2, 3, \ldots, d_z \rangle \tag{18}$$

$$\tau = \left\langle Q^{-1}_{ival}\!\left(k - \tfrac{1}{2}\right) \right\rangle_{k=1}^{d_v}. \tag{19}$$

The packet count, total byte, and total duration vectors, written here as pkts, bytes, and ivals, are computed as

$$\mathrm{pkts} = Z\mathbf{1} = V\mathbf{1} \in \mathbb{R}^f \tag{20}$$

$$\mathrm{bytes} = Z\sigma^T \in \mathbb{R}^f \tag{21}$$

$$\mathrm{ivals} = V\tau^T \in \mathbb{R}^f. \tag{22}$$

Here $\mathbf{1}$ is a column vector of all ones.6 Packet count and total bytes values are exact; the total durations are only approximate because information is irretrievably lost in the quantization process. Using these vectors of totals, we may readily compute the average packet sizes and inter-packet intervals:

$$\bar{z} = \mathrm{bytes} \;./\; \mathrm{pkts} = Z\sigma^T \;./\; Z\mathbf{1} \tag{23}$$

$$\bar{v} = \mathrm{ivals} \;./\; \mathrm{pkts} = V\tau^T \;./\; V\mathbf{1}. \tag{24}$$

The symbol ./ here represents element-wise division of vectors, as it does in Matlab.

6 Throughout this paper, $\mathbf{1}$ denotes a matrix of all ones; the dimensions of the matrix are inferred from context. Where necessary, we disambiguate by explicitly specifying the dimensions of the final matrix product, as we do here.
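The packet-size totals and averages can be checked on a toy example. The Python sketch below uses a hypothetical matrix Z with only five packet-size bins and two flows; `matvec` is a helper written for this illustration.

```python
def matvec(M, v):
    """Product of a matrix (list of rows) with a column vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


d_z = 5                              # toy number of packet-size bins
sigma = list(range(1, d_z + 1))      # representative size of each column
ones = [1] * d_z

Z = [[2, 0, 0, 0, 1],   # flow 0: two 1-byte packets and one 5-byte packet
     [0, 0, 4, 0, 0]]   # flow 1: four 3-byte packets

pkts = matvec(Z, ones)               # packet count per flow
total_bytes = matvec(Z, sigma)       # total bytes per flow
mean_size = [b / p for b, p in zip(total_bytes, pkts)]  # element-wise division
```

Because packet sizes are quantized one byte per bin, these totals are exact; the analogous computation on V with τ would only approximate the true durations.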

4.3.2 Inter-node communication graphs

The total number of source flows for each node is given by the vector $S^T\mathbf{1} \in \mathbb{R}^n$, and the total number of destination flows is given by $D^T\mathbf{1} \in \mathbb{R}^n$. The number of flows between each pair of nodes in the network is expressed by the matrix product $S^T D \in \mathbb{R}^{n \times n}$. That is, $(S^T D)_{ij}$ is the number of flows from node i to node j. Interpreting this sparse matrix as a weighted graph immediately yields the source-destination flow communication graph. Figure 1 gives two visual examples of such graphs. Similar graphs counting other quantities can readily be computed by inserting the appropriate weighting matrix. For example, we may compute the inter-node packet transmission graph:

$$S^T \operatorname{diag}(\mathrm{pkts})\, D \in \mathbb{R}^{n \times n}. \tag{25}$$

Here pkts is the packet count vector defined in Eq. 20, and the “diag” operator gives the square, diagonal matrix with the given vector of values on its diagonal. Similar graphs for total bytes and total duration of flows may be computed using the bytes or ivals vectors (Eqs. 21 and 22) in place of pkts.
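A small worked example may help. The following Python sketch builds the flow-count graph and the packet-weighted graph for three hypothetical flows among three nodes; `transpose` and `matmul` are plain-list helpers written for this illustration.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Rows are flows, columns are nodes (one-hot source and destination).
S = [[1, 0, 0],   # flow 0: source node 0
     [1, 0, 0],   # flow 1: source node 0
     [0, 1, 0]]   # flow 2: source node 1
D = [[0, 1, 0],   # flow 0: destination node 1
     [0, 0, 1],   # flow 1: destination node 2
     [0, 0, 1]]   # flow 2: destination node 2
pkts = [3, 4, 2]  # packets per flow (hypothetical)

flow_graph = matmul(transpose(S), D)

# diag(pkts) D scales row i of D by pkts[i].
weighted_D = [[p * x for x in row] for p, row in zip(pkts, D)]
packet_graph = matmul(transpose(S), weighted_D)
```

Entry (i, j) of `flow_graph` counts flows from node i to node j, and the same entry of `packet_graph` counts the packets those flows carry.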

4.4 Expression of common modeling concepts

The simplifying assumptions made by most traffic models can be classified into three categories. We use the terms uniform, marginal, and conditional to describe these classes of assumptions. They are all assumptions about the “sameness” of aspects of network behaviors; they differ, however, in whether they make stipulations about uniformity within each flow, across all flows, or within groups of flows. In the following sections, we explore how various modeling concepts can be expressed in the framework of the general matrix model. Models are expressed as transformations that convert arbitrary traffic matrix instances into similar ones that satisfy the model’s assumptions. These transformations later allow us to evaluate the effect of each model’s simplifications on the realism of generated workloads.

4.4.1 Uniform modeling

The concept of uniformity, as used here, entails that certain properties of each flow are assumed to be statistically or deterministically uniform. Examples include: assuming that each flow’s packets have the same size and inter-packet interval (i.e. the CBR flow model); assuming that each node is equally likely to be the source of a given flow; assuming that all start times for a flow are equally likely.

Since uniform behaviors entail homogeneity of properties with respect to each flow individually, a trace matrix can be transformed to a matrix satisfying the model’s assumptions via right matrix multiplication. Right multiplication mixes the elements within each row, and if the mixing weights are uniform, the result is uniform behavior. For example, to make the roles of the nodes with respect to each flow statistically identical, we transform S and D:

$$S' = \mathrm{rs}(S\mathbf{1}) = \tfrac{1}{n}\, S\mathbf{1} \in \mathbb{R}^{f \times n} \tag{26}$$

$$D' = \mathrm{rs}(D\mathbf{1}) = \tfrac{1}{n}\, D\mathbf{1} \in \mathbb{R}^{f \times n}. \tag{27}$$

The “rs” operator scales each row to sum to unity, thus making the entire matrix row-stochastic; in this case, this is equivalent to scaling by 1/n. These transformations of S and D make all the values in each flow’s row identically the average of the original n values. As another example, the constant bit-rate (CBR) flow model, in its strictest sense, is an intra-flow model, which stipulates that all the packets in each flow behave identically, having identical sizes and intervals; thus each flow’s bit-rate is constant, as the name implies. The strict CBR model can be easily expressed with the packets-bytes-duration (PBD) model used as an example in Section 3. The pair of matrices [Z V] is transformed to PBD form using the vectors defined in Section 4.3.1: $[Z \;\; V]\, M_{\mathrm{PBD}} \in \mathbb{R}^{f \times 3}$, where

$$M_{\mathrm{PBD}} = \begin{pmatrix} \mathbf{1} & \sigma^T & 0 \\ 0 & 0 & \tau^T \end{pmatrix} \in \mathbb{R}^{(d_v + d_z) \times 3}. \tag{28}$$

Figure 1 graphically depicts examples of uniform behaviors as compared to realistic behaviors from traces.
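The node-uniform transformation can be sketched in a few lines of Python. The helper names (`rs`, `ones`, `matmul`) and the toy source matrix are assumptions made for this illustration.

```python
def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def ones(rows, cols):
    return [[1.0] * cols for _ in range(rows)]


def rs(M):
    """Scale each row to sum to one; all-zero rows are left unchanged."""
    out = []
    for row in M:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out


n = 4
S = [[1.0, 0.0, 0.0, 0.0],   # flow 0 originates at node 0
     [0.0, 0.0, 1.0, 0.0]]   # flow 1 originates at node 2

# Right multiplication by the all-ones matrix spreads each row evenly.
S_uniform = rs(matmul(S, ones(n, n)))
```

Every row of `S_uniform` is the constant vector with entries 1/n, so each node is equally likely to be drawn as the source of each flow.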

4.4.2 Marginal modeling

Marginality is an orthogonal concept to uniformity: the distribution of some property is identical across all flows. For example, if we determine a distribution of flow sizes across the entire network and assume that this distribution applies independent of other flow properties, this is a marginal model for flow size. Marginal behavior is effected by left multiplication by a matrix with uniform columns. This specific example is generated by left-uniformizing the flow size matrix: $B' = (1/f)\,\mathbf{1}B$. The traffic models proposed by Hernández-Campos et al. are all marginal: they propose certain network-wide parametric distributions for start times, flow sizes and the number of flows per node. These models do not account for non-independence between these and other flow properties. We discuss this limitation and its implications when we analyze our experimental results.
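As a concrete illustration of left-uniformizing, the Python sketch below applies the size-marginal transformation to a hypothetical flow-size matrix with four flows and three size bins. The helper `matmul` is written for this example.

```python
def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


B = [[1, 0, 0],   # flow 0 falls in size bin 1
     [1, 0, 0],   # flow 1 falls in size bin 1
     [0, 0, 1],   # flow 2 falls in size bin 3
     [0, 1, 0]]   # flow 3 falls in size bin 2
f = len(B)

ones_ff = [[1] * f for _ in range(f)]                       # f-by-f all-ones matrix
B_marginal = [[x / f for x in row] for row in matmul(ones_ff, B)]
```

Every row of `B_marginal` is now the network-wide size distribution (0.5, 0.25, 0.25), so any dependence between a flow's size and its other properties has been discarded.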

4.4.3 Conditional modeling

The final form of homogenization we consider is conditionality. This is a restricted form of marginality: we may assume that marginal distributions apply to all the flows in certain groupings. For example, we can provide a distribution of flow sizes on a per-source basis. That is, each source node can have its own distribution, and the sizes of its flows are drawn from that distribution, independent of their other properties. This is simply accomplished by left-multiplication by a non-uniform matrix. For example, to make flow behavior uniform per source, consider the “source-conditioning” matrix $S S^T \in \mathbb{R}^{f \times f}$. When the rows of S represent individual flows, this matrix has $(S S^T)_{ij} = 1$ if and only if flows i and j have the same source node. To condition flow behavior by source, we left-multiply by the stochasticized source-conditioning matrix:

$$F' = \mathrm{rs}(S S^T)\, F. \tag{29}$$

In this case, rows do not have the same sums, so the “rs” operator is not equivalent to scalar multiplication. Per-destination uniformity can be achieved similarly, replacing S with D.

The GMM has the benefit of making the conditioning of behaviors completely explicit. It also allows us to easily define intermediate choices. If we wish, for example, to make the behavior of each flow a weighted average of the composite behaviors of its source and destination nodes, we can express this easily:

$$F' = \big((1 - \alpha)\, \mathrm{rs}(S S^T) + \alpha\, \mathrm{rs}(D D^T)\big) F. \tag{30}$$

The parameter α allows us to smoothly vary the weighting of the source and destination behaviors. Source-only behavior is given by α = 0; destination-only by α = 1; to weight source and destination equally, take α = 1/2.
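The conditioning transformation and its α-weighted variant can be sketched as follows. The conditioning matrix is formed here flow by flow, as S Sᵀ, so that it has one row and one column per flow and can left-multiply F. The toy matrices and the helper names (`rs`, `matmul`, `transpose`, `condition`) are assumptions made for this illustration.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def rs(M):
    """Scale each row to sum to one; all-zero rows are left unchanged."""
    return [[x / sum(row) for x in row] if sum(row) else list(row) for row in M]


def condition(F, S, D, alpha):
    """Blend per-source (alpha = 0) and per-destination (alpha = 1) averaging."""
    Cs = rs(matmul(S, transpose(S)))   # flows sharing a source
    Cd = rs(matmul(D, transpose(D)))   # flows sharing a destination
    mix = [[(1 - alpha) * s + alpha * d for s, d in zip(rs_row, rd_row)]
           for rs_row, rd_row in zip(Cs, Cd)]
    return matmul(mix, F)


S = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]    # flows 0 and 1 share source node 0
D = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]    # flows 1 and 2 share destination node 2
F = [[2.0, 0.0], [0.0, 2.0], [4.0, 0.0]] # toy two-column behavior matrix

by_source = condition(F, S, D, alpha=0.0)
blended = condition(F, S, D, alpha=0.5)
```

With α = 0 the two flows that share a source both receive the average of their original rows, while the third flow is unchanged; with α = 0.5 each row is an equal blend of its source-group and destination-group averages.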

4.5 Traffic generation

Given an instance of the general matrix model, how do we produce an actual workload from it? Each row of the model is used to randomly generate the behavior of a single flow. To do this, we use the “dequantization” functions listed in Table 1. The dequantization procedure works as follows. Let f be a row of the traffic matrix, and let prop be a property. Write $f_{prop}$ for the corresponding component vector of f. Choose an index value $k \in \{1, \ldots, d_{prop}\}$ randomly, with weight given by the coefficients of $f_{prop}$. Choose $u \in [0, 1]$ uniformly at random. Then $x = k - u \in [0, d_{prop}]$, with the probability of lying in the interval $[i - 1, i]$ proportional to the coefficient of $e_i$ in $f_{prop}$. The dequantization function for prop, namely $Q^{-1}_{prop}$, will map the value x back into the appropriate range of property values for prop.
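The sampling step described above can be sketched in Python as follows. The function names and the four-bin start-time vector are hypothetical; `q_start_inv` transcribes the start-time dequantization of Table 1, and the standard-library generator draws u from [0, 1) rather than the closed interval.

```python
import random


def sample_bin_position(weights, rng):
    """Draw x = k - u, with bin k (1-based) chosen in proportion to weights."""
    k = rng.choices(range(1, len(weights) + 1), weights=weights)[0]
    u = rng.random()          # uniform on [0, 1)
    return k - u


def q_start_inv(x, t_max, d_a):
    """Start-time dequantization from Table 1: x -> t_max * (x / d_a)."""
    return t_max * x / d_a


rng = random.Random(42)               # seeded for reproducibility
f_start = [0.0, 0.0, 2.0, 0.0]        # toy component vector: all weight in bin 3
x = sample_bin_position(f_start, rng)
start_time = q_start_inv(x, t_max=600.0, d_a=4)
```

Because all of the weight sits in the third of four bins, the sampled position lies in (2, 3] and the resulting start time falls in the third quarter of the 600 s window.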

The procedure above allows us to sample random values for each property in accordance with the behavioral description implied by f. This allows us to stochastically generate a sequence of packets in accordance with a row from an arbitrary GMM matrix instance using the following procedure:

1. Sample random source and destination node pairs. Since a flow must have distinct source and destination nodes, the sampling of endpoints must be done jointly, avoiding collisions. The simplest technique is the rejection method: if the same node is chosen for both source and destination, reject that pair and begin the selection process again.

2. Next, randomly sample the number of bytes.

3. The packet size and inter-packet interval vectors, z and v, allow us to compute the expected average packet size and inter-packet interval for the flow (see Section 4.3.1); call them $\bar{z}$ and $\bar{v}$. From these, together with the flow size in bytes, b, we estimate the expected duration of the flow: $\bar{d} = b\,\bar{v}/\bar{z}$.

4. Randomly sample a start time for the flow. Ensure that the start time plus the expected duration, $\bar{d}$, does not exceed the maximum simulation time for the workload by using the rejection method.

5. The first packet is sent at the chosen start time. The size of each packet is determined by sampling a random size value. The interval until the next packet is determined by sampling a random interval value. The packet stream ends when the number of bytes allocated to the flow has been emitted or when the packet emission time exceeds $t_{max}$. The last packet is only the size of the remaining number of bytes.
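The final step of the procedure, emitting the packet stream, can be sketched as a short loop. In this Python illustration the size and interval samplers are passed in as callables; the function name `generate_packets` and its signature are assumptions, and the samplers are expected to return sizes of at least one byte so that the loop makes progress.

```python
def generate_packets(start, total_bytes, sample_size, sample_interval, t_max=600.0):
    """Return a list of (emission time, size) pairs for one flow."""
    packets = []
    t, remaining = start, total_bytes
    # Stop when the byte budget is spent or the emission time passes t_max.
    while remaining > 0 and t <= t_max:
        size = min(sample_size(), remaining)   # last packet carries the remainder
        packets.append((t, size))
        remaining -= size
        t += sample_interval()
    return packets


# Deterministic stand-ins for the random samplers, for illustration only.
stream = generate_packets(10.0, 1000, lambda: 400, lambda: 0.5)
truncated = generate_packets(599.8, 1000, lambda: 400, lambda: 0.5)
```

In the first call the flow emits two 400-byte packets followed by a 200-byte remainder; in the second the stream is cut off after one packet because the next emission time would exceed the 600 s limit.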

Some remarks about the procedure:

– For some vectors in $\mathbb{R}^N$, it may be impossible to choose distinct source and destination nodes, or flow sizes and start times, satisfying the procedure. However, for any vector that is a sum of vectors representing valid flows (i.e. ones having distinct source and destination nodes and feasible combinations of start time, duration, and packet behaviors), the procedure will terminate.

– Because the representation φ models the seven properties independently of each other, it is impossible to do better than to sample the properties independently, modulo the requirements of the procedure. In the case of a row which is simply a representation of an actual flow, this makes no difference, except for the generation of packets (which are modeled independently). In the case of composite rows, however, this implies that flow properties are modeled independently of each other.

Despite these remarks, we will show in our experimental evaluation that this procedure generates sufficiently realistic wireless workloads [5, 6].

5 Experimental methodology

To evaluate whether traffic models are realistic or not, we use the method of paired differential simulation introduced in our previous work on traffic modeling [5, 6]. Paired differential simulation is based on the standard scientific technique of controlled experimentation. The hypothesis we are testing is that a given synthetic model accurately reproduces the performance metrics exhibited by real network traffic. To test this hypothesis, we conduct a series of paired experiments and compare the results for each test subject with a corresponding control subject. In this case, the experiments are wireless network simulations, and the subjects are real or synthetic workload instances. The control group is the set of simulations where the workload is an actual traffic trace, recorded as described in Section 5.2. The test group is a set of simulations where the workload is synthetically generated using a simplified traffic model. The experiments are paired: for each control simulation, there is a synthetic workload, with behavior as similar to the control as the model will allow, against which the results are compared. The output of the series of experiments is a set of paired values of performance metrics for each simulated scenario: one from the control, one from the test model.

5.1 Traffic models

The first model we evaluate is the general matrix model itself. We derive the matrix representation directly from the trace data and use it to generate workload as described in Section 4.5. This serves to test whether the GMM is sufficiently general: if it accurately reproduces all performance metrics across the collection of scenarios simulated, this provides strong evidence that the GMM captures enough of the detail of trace behavior to be considered a sufficiently general representation of network traffic patterns.

After validating the GMM, we explore the effects of various uniformity, marginality, and conditioning assumptions on the realism of traffic models. The uniform models are the most common, being the simplest both to conceptualize and to implement: simply make no distinction between start times, source or destination nodes. Choose them at random, using no information from reality other than the number of nodes and the possible range of start times. We do not evaluate uniform packet behavior because it has been sufficiently discredited in our previous work.

In more sophisticated modeling work, where uniformity is discarded as too simple a model for behavior, marginality is often applied instead. It is assumed, often implicitly, that marginal models are adequate to capture the essential aspects of network behavior. We put that implicit assumption to the test here. How well do perfect marginal models represent real behavior? When we say that our marginal models are “perfect,” we mean that they are nonparametric and calculated directly from trace behavior. These are the marginal models that most accurately describe the trace traffic, because they are directly derived from it and make no parametric approximations. This provides an upper bound on the quality of marginal models in general, since parametric marginal models are derived, in turn, as simplifications of these nonparametric marginal behaviors. Table 2 provides a complete list of all the traffic models we compare in our evaluation.

5.2 Trace data

Our general methodology is to compare performance metrics in simulations using real traffic patterns from traces to the same metrics in simulations using a variety of trace-based models and synthetic traffic models defined using the general matrix model. We use a 24-h trace recorded in an infrastructured 802.11g wireless LAN with 18 access points, deployed at the 60th Internet Engineering Task Force meeting (IETF60), held in San Diego during August of 2004. The traffic trace was captured using tcpdump at a single router, through which all wireless traffic for the meeting was routed, including traffic between wireless nodes. The snap length of the capture for each packet was 100 bytes, allowing IP, ICMP, UDP and TCP headers to be analyzed. We limit our work to the 24-h sub-trace recorded on Wednesday, August 4th. This trace contains a broad variety of behaviors and entails a very large volume of traffic: 2.1 million flows, 58 million packets, and 52 billion bytes.

Table 2 Matrix-based traffic models used in paired differential simulation experiments

| Model | Transformation | Description |
|---|---|---|
| GMM | none | The general matrix model |
| Time uniform | $A' = (1/d_a)\, A\mathbf{1}$ | All start times are equally likely |
| Node uniform | $S' = (1/n)\, S\mathbf{1}$, $D' = (1/n)\, D\mathbf{1}$ | All nodes are equally likely as source or destination |
| Full uniform | All uniform transformations | All start times, sources and destinations are equally likely |
| Time marginal | $A' = (1/f)\, \mathbf{1}A$ | Start times chosen from marginal distribution of start times |
| Size marginal | $B' = (1/f)\, \mathbf{1}B$ | Flow size (bytes) is chosen from marginal distribution of sizes |
| Node marginal | $S' = (1/f)\, \mathbf{1}S$, $D' = (1/f)\, \mathbf{1}D$ | Source and dest. chosen marginally, independent of other properties |
| Packet marginal | $Z' = (1/f)\, \mathbf{1}Z$, $V' = (1/f)\, \mathbf{1}V$ | Packet behavior is variable bit-rate, marginal across all flows |
| Full marginal | All marginal transformations | Applies all the marginality assumptions of the above four models |
| Source conditional | $F' = \mathrm{rs}(S S^T)\, F$ | Flow behaviors are aggregated on a per source-node basis |
| Dest. conditional | $F' = \mathrm{rs}(D D^T)\, F$ | Flow behaviors are aggregated on a per destination-node basis |

We do not assume or claim that the traffic found at IETF60 is representative of conference settings in general. The observed behaviors are also unlikely to resemble those found in a typical commercial or residential setting. We have chosen this trace, however, because within it can be found behaviors resembling many different types of wireless usage cases. Figure 2 shows the wide variations in the number of active flows and nodes over the course of the trace. In the night and morning hours, the traffic patterns are similar to those one might find in a moderately trafficked business or residential area. During working group sessions, we see highly concentrated, heavy usage patterns. At the zenith of activity, over 800 users, 33 thousand flows, and 1 million packets are seen in a single 10-min trace segment. At the nadir, a lone node sent only a single 61-byte packet over the course of 10 min. All levels of activity between these extremes are represented. Moreover, the mix of traffic types observed changes dramatically over the course of the day, providing a wide representation of possible blends of behavior. This heterogeneity and extreme range of behaviors make the IETF data set ideal for this evaluation. The variety of activity gives us greater confidence that success or failure of traffic models is not tied to any specific network condition, but is broadly and generally applicable.

Figure 2 The number of active nodes and flows over time (horizontal axis: hour of the day, 0 to 24; vertical axes: hundreds of nodes and thousands of flows)

Before using the traces, it is necessary to extract application-level behavior from the trace header data. First, we split the trace into individual packet flows. A flow is a series of packets sharing the following five attributes: IP and transport protocols (raw IP, ICMP, TCP, UDP); source and destination IP addresses and TCP/UDP port numbers. Next, the quantity of application-initiated data contained in each packet is calculated. For non-TCP packets, this quantity is simply the size of the transport-layer payload, but for TCP the calculation is more complicated: only new data transfers, explicitly initiated by the application, are counted. Data retransmitted by TCP is disregarded, and empty ACKs are ignored. SYN and FIN flags in packets (even empty ones) are counted as a single byte each, since they are explicitly signaled by the application.
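The flow-splitting step amounts to grouping packets by their five-attribute key. A minimal Python sketch follows; the dictionary field names (`proto`, `src`, `dst`, `sport`, `dport`, `payload`) are hypothetical and do not reflect the format of the actual trace or the authors' tooling, and the TCP-specific accounting of retransmissions, ACKs, and SYN/FIN flags is not modeled.

```python
from collections import defaultdict


def split_into_flows(packets):
    """Group packet records by (protocol, src, dst, src port, dst port)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["proto"], pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"])
        flows[key].append(pkt)
    return dict(flows)


# Three toy packet records; the first and third belong to the same flow.
packets = [
    dict(proto="UDP", src="10.0.0.1", dst="10.0.0.2", sport=5000, dport=53, payload=40),
    dict(proto="TCP", src="10.0.0.1", dst="10.0.0.3", sport=5001, dport=80, payload=300),
    dict(proto="UDP", src="10.0.0.1", dst="10.0.0.2", sport=5000, dport=53, payload=60),
]
flows = split_into_flows(packets)
udp_key = ("UDP", "10.0.0.1", "10.0.0.2", 5000, 53)
udp_bytes = sum(p["payload"] for p in flows[udp_key])
```

For the non-TCP flow in this example, the application-level byte count is simply the sum of the transport-layer payload sizes.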

5.3 Simulations

We use the Qualnet wireless network simulator (version 4.0.1) to perform our experiments. We simulate a stationary multi-hop 802.11g network using the Ad hoc On-demand Distance Vector (AODV) routing protocol [9], with nodes placed randomly in a square field with sides of 150 m but a radio range of only 20 m, both in accordance with an indoor network environment. In addition to the active nodes corresponding to trace IPs, half as many passive “infrastructure” nodes are added to each simulation: these nodes initiate no data and simply serve as additional network relays. Our simulations resemble multi-hop mesh networks of the kind that are increasingly studied and deployed for delivery of broadband access in residential, corporate and conference settings. We do not attempt to reproduce the physical environment of the original wireless network, nor do we simulate mobility. The only aspect of the original network’s behavior that is reproduced is the total pattern of network-wide traffic.

There are a number of potential objections to this approach. We use single-hop trace data to drive multi-hop simulations; the physical environment, node mobility, handover behavior, and closed-loop dynamics of the original wireless setting are not faithfully reproduced. One must keep in mind, however, that the goal of this research is not to understand the conditions of the original network. Rather, we are using the traffic behaviors observed as examples to help us better understand how different types of workload can affect performance metrics. In particular, we aim to understand how real workload compares with common synthetic traffic models. Of course, the reason for such objections is that networking researchers understand that the many aspects of behavior interact with each other in a complex and nearly inextricable manner. However, before we can hope to understand the interaction between workload and other features affecting network behavior, we must study traffic patterns alone, and learn to model them with reasonable accuracy in the absence of additional complicating factors. Accordingly, in this study, we detach application-level traffic patterns from the other factors influencing network conditions, and study them in isolation.

The 24-h trace is split into 144 10-min segments, each of which serves as the basis for a set of simulations using different traffic models. The traffic models range from a completely realistic trace-driven model to a standard CBR traffic model. Various partially synthetic intermediate models, described in Section 5.1, are simulated to study the impact of different aspects of traffic behavior on network performance. To preserve the fairness of the performance comparison, we keep as many features as possible constant across different traffic models. The traffic generated by each synthetic model preserves as many characteristics from the original trace as possible, within the constraints of the model. Moreover, the following features are preserved across all models: the number of wireless nodes, the number of flows, the number of application-initiated data units sent, the total bytes of application data sent, and the average flow duration (and therefore the average data rate).

In our previous work, we approximated TCP with a pair of half-duplex UDP flows [5, 6]. For this work, we have implemented a new Qualnet application driver that allows trace-driven full-duplex TCP flows. This allows us to accurately reproduce the full dynamics of TCP feedback. Moreover, raw IP, UDP and TCP flows are implemented using the same code, guaranteeing that all traffic is simulated uniformly and that all performance metrics are aggregated in the same manner.

5.4 Performance metrics

We have selected six performance metrics to present here. They are commonly used as indicators of network performance at the application, network, and link layers of the protocol stack:

1. Application: average end-to-end delay, packet delivery ratio, received throughput.

2. Network: AODV control overhead (RREQ/RREP/RERR), packets dropped in routing queues.

3. Link: 802.11 control overhead (RTS/CTS/ACK).

These metrics are commonly used to evaluate wireless protocols. We have examined a broad variety of additional wireless network performance metrics, and although we do not have room to present or discuss them here, the results shown are representative of the overall realism of the traffic models.

6 Results

Our simulation results are summarized in Fig. 3. Each subfigure shows a single performance metric. The distribution of log-ratio error values for each traffic model is visualized with a box-and-whisker plot. The box indicates the range from the 25th to 75th percentiles of values, while the “whiskers” indicate the full range, excluding outliers (which are shown as isolated points). These plots allow immediate assessment of realism: a good traffic model should have error values that are tightly clustered around the center, with a small, evenly balanced box, and relatively small whiskers. Additionally, the mean and median markers should be close to the center. It should be noted that the error scale is logarithmic, so even a slight increase in spread or deviation from the center indicates a disproportionately large decrease in model quality. Non-logarithmic scale markers are shown above each plot; the values indicate the factor of over- or under-representation for the performance metric.
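The log-ratio error for one model and one metric can be computed from the paired simulation outputs as sketched below. The base of the logarithm is not stated in this section, so base 2 is an assumption here, as are the function name and the toy metric values; scenarios where either metric is zero would need separate handling.

```python
import math
import statistics


def log_ratio_errors(control, test):
    """Log of test/control for each paired scenario (base 2 is an assumption)."""
    return [math.log2(t / c) for c, t in zip(control, test)]


# Hypothetical metric values for four paired scenarios.
control = [10.0, 20.0, 40.0, 8.0]   # trace-driven simulations
test = [20.0, 20.0, 10.0, 8.0]      # synthetic-model simulations

errors = log_ratio_errors(control, test)
median_error = statistics.median(errors)
q1, q2, q3 = statistics.quantiles(errors, n=4)   # box edges and median
```

An error of 0 means the model reproduced the metric exactly; positive values indicate overestimation and negative values underestimation, symmetric on the logarithmic scale.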

The primary result is that the GMM accurately reflects the performance of the trace data it is based on. It has small, centered error bars for every performance metric presented here, as well as for those we have looked at otherwise. This is an essential result, since otherwise, GMM is at best a shaky foundation for further modeling work. With these results, we can be assured that if we can approximate the matrix for a given example of trace traffic, then we have also approximated the original trace behavior. In a sense, this result is a reduction of the general traffic modeling problem to a more tractable matrix modeling problem.

Figure 3 Box-and-whisker plots of log-ratio error values for all metrics and traffic models (six panels, a–f; each panel lists the models GMM, time uniform, node uniform, full uniform, time marginal, size marginal, node marginal, packet marginal, full marginal, src conditional and dst conditional). The lower axis indicates the log-ratio, while the upper axis shows raw ratio values. Each box contains the central majority of log-ratio values: the left and right bounds are at the 25th and 75th percentiles. The dark middle line indicates the median value, while the diamond marks the mean. The whiskers (dotted lines) extend to the furthest non-outlier values, while the points beyond that are outliers. The notch in the middle of each bar indicates a 95% confidence interval for the true underlying median value; if two notches do not overlap, they are very unlikely to have the same median.

It is worth noting that GMM has competition from a few of the time-simplified models: the time marginal and time uniform models both do approximately as well, and in some cases better, for the metrics examined. Hernández-Campos et al. found strong evidence that session arrivals followed a time-varying Poisson arrival process, meaning that other than large scale time-varying arrival rate and very small scale clustering of flows within sessions, the overall flow arrival rate is fairly smooth. We suspect that on the scale of minutes to tens of minutes in which our simulations exist, it is sufficient to model both flow and session arrivals uniformly. This result indicates that the precise temporal placement of flows is not highly sensitive: they tend to be relatively evenly spaced out at short time-scales, and network performance is not sensitive to changes in start-time on that scale.

Not all marginal models have such a benign effect on the accuracy of performance metrics. The worst offender by far is the size marginal model. This model distorts behavior at every level, by more than a factor of 20 in 25% of scenarios in the case of routing queue packet drops. This is a highly significant result because modeling flow size using a network-wide marginal distribution is precisely what Hernández-Campos et al., for example, attempt. This result informs us that this approach is doomed to failure: flow size cannot be assigned without considering its relation to other flow properties: source and destination nodes, and packet behavior at the very least.

The packet marginal model is another offender, albeit not as egregiously. The direction of the misrepresentations in this case, however, is more dangerous: received throughput is overestimated by more than three times, 75% of the time. On the other hand, AODV control overhead is underestimated by half on average. Such serious misrepresentations are especially noteworthy given the apparently innocuous assumption applied by this model: the packet marginal model is essentially a standard variable bit-rate (VBR) packet behavior model, using the empirical network-wide distributions of packet sizes and inter-packet intervals. Otherwise the model matches the original trace behavior exactly. And yet these drastic distortions occur. In our previous work, we have shown similarly drastic misrepresentations to occur when CBR packet behavior is assumed [6]. This result indicates that to reproduce accurate network performance, we must provide packet behaviors on a more granular level: each node or even each flow must be assigned a “custom” packet behavior by some means.

These strong negative results for two fundamental and very common marginal models provide severe limits on the realism that can be achieved through marginal modeling. Traffic modeling research must move beyond simply finding parametric models for marginal distributions of network-wide properties. Realistic behavior cannot be generated using these distributions alone. Something must be captured about the interaction between the elements of flow behavior.

Finally, we turn to the conditional models. These demonstrate neither exceptionally good nor exceptionally bad behavior. That the conditional models should be similar to the full marginal model but somewhat more realistic is entirely expected: conditioning is a restricted form of marginal modeling, where each group of flows has its own marginal distribution. In these cases, the flows are grouped by source or by destination. Neither source nor destination conditioning appears to outperform the other: they perform identically in each case; neither is perfect, but both are better in general than any marginal or uniform model. The source and destination conditional models tend to err in the same direction as the full marginal model, but in a few cases where the aspects of marginality lead to errors in different directions, the outcome follows one of the aspects more than the others. In the case of received throughput, for example, the packet marginal tendency to overestimate overshadows the tendency of the other marginal aspects to underestimate. Overall, we see that the conditional models are more realistic than other models, but this effect occurs only as the number of nodes (i.e., the number of conditioning classes) approaches the full rank of the original traffic matrix. In our discussion, we present and demonstrate a far more compelling method of reducing and simplifying the representation of the behavior of large collections of flows.

7 Discussion

Although we have shown it to be sufficiently realistic for driving wireless simulations, the general matrix model is not intended to be a final model used by networking researchers. Nor is it intended to be the ultimate perfect model of network behavior. Rather, it serves as both a research tool and a stepping stone to a better understanding of network traffic. Because it captures so much detail about a traffic sample, without making assumptions of uniformity or marginality of flow behavior, it serves as a basis for deriving more advanced models that preserve more of the original behavior of networks than any uniform or marginal model ever can. The first important role it plays is in allowing us to express common models succinctly in mathematical terms. All of the traffic models included in this paper are implemented in only 19 lines of Matlab code, applying simple matrix transformations to the GMM representation of traffic. Through the lens of linear algebra, we see why uniform, marginal and even conditional models fail to adequately capture the gestalt behavior of networks: these simplifying assumptions force us to ignore most of the information present in a trace. In what follows, we address limitations of the results presented in this paper, but conclude with an exciting and promising new technique for traffic analysis that amply demonstrates the potential of linear representation of network traffic.

7.1 Generality of results

A significant limitation on the generality of our results is that they are based on a single data set from IETF60, albeit a large and varied one. It is possible that traffic in this trace happens to produce network performance that is unusually dissimilar to standard traffic behavior. This data set, however, represents a highly heterogeneous collection of network usage behaviors, from slow and steady off-peak usage to extremely heavy peak usage: over 800 users, 33 thousand flows, and 1 million packets in a single 10-min trace segment. Despite the broad variety of behaviors, the results are consistent: in all types of usage scenarios, uniform and marginal models, and to a lesser degree conditional ones, systematically skew important performance measurements at all levels of the network. While the precise results for other data sets might differ, it is very unlikely that these traffic models will happen to accurately reproduce realistic performance in other experiments. This paper provides strong evidence that better traffic models are needed for performance evaluations. Thus, while it is necessarily limited, our evaluation points us strongly in the direction away from uniform and marginal models.

Another potential concern about generality is the relatively limited set of performance metrics reported here. As mentioned in Section 5.4, we have evaluated a broad variety of performance metrics, 28 in all. The metrics reported here were selected both because they are among the most commonly used in evaluations, and because they are representative of the results for other metrics. In general, metrics at the same level of the protocol stack tend to behave in related ways: they may be positively or negatively correlated, but if one metric is severely distorted, it is unlikely that others are left intact. The direction of the distortion of metrics is highly sensitive to the experimental setup; the presence of error, however, is not. In other words, models that are sufficiently realistic in one experimental setup tend to be sufficiently realistic in other experiments; models that misrepresent performance metrics in certain cases tend to do so in other cases as well. The direction of the distortion may change, but the presence or absence of accuracy does not. See previous work using the same technique for evaluating realism for further examples with different experimental conditions and more performance metrics considered [5, 6].

Figure 4 The ten most prevalent components of flow behavior as pairs of packet payload size and inter-packet interval distributions. The distributions are shown as cumulative distribution functions: the x-axis represents payload size (in kilobytes) or inter-packet interval duration (in log-seconds, base 10), respectively; the y-axis indicates the probability of a value less than or equal to the x value. The behaviors are ordered in descending rank of how many packets they explain; to the left, each behavior's rank is indicated, together with the percentage of total packets associated with it, and the cumulative percentage of packets explained. To the right of each behavior is a pie chart showing the breakdown of associated packets by protocol, as determined by well-known port numbers. Pie slices are only labeled if they constitute at least 3% of associated packets.

[Figure 4 panels. Axes: payload size (kB), 0 to 1.4; intervals (log10 seconds), −5 to 3. Pie-chart labels: http, imaps, ssh, icmp, gnutella_svc and gnutella_alt (client and server sides), and other.]

7.2 Computational complexity and overhead

Another potential concern about the approach of linear representation is the additional computational overhead it may induce in experimental settings. Large matrix representations of traffic models are certainly more computationally costly than simpler models. Currently, however, realistic local-area traffic modeling is an unsolved problem. The purpose of this work is to establish how traffic can be represented and generated realistically at all, not how it can be done most efficiently. We observe, however, that generating application traffic for a simulation takes orders of magnitude less CPU time than running the simulation itself. Therefore, we consider traffic generation complexity a non-issue.

A related issue is the significant simulation time incurred when experiments use realistic volumes of traffic. Indeed, this is a problem: we have run our simulations on a 64-node Itanium cluster, and the experiments still take weeks to complete. Given the current state of the art in simulators and CPU power, however, this problem is unavoidable. Including model preparation times, running simulations with uniform or marginal models takes negligibly less time than running simulations using realistic models like the GMM. One of the possible effects of this research, however, may be the ability to classify and catalog a realistic set of “exemplary” traffic scenarios for various applications. Such a catalog is far from existing, but its existence would allow researchers to run far fewer simulations to achieve incomparably more general and reliable results.

7.3 Nonnegative factorization of packet behavior

It remains to show how we can simplify and reduce the representation of network behaviors without making unwarranted and manifestly false assumptions of uniformity or marginality of network behaviors. In Section 4.4, we observed that application of uniformity and marginality assumptions to traffic instances is equivalent, in the GMM, to right and left multiplication, respectively, by matrices of rank one. These transformations destroy almost all of the information in the original traffic matrix. Conditional models are equivalent to left-multiplication by somewhat less degenerate matrices, but they achieve realism only by approaching the GMM itself. In this section we describe a matrix factorization technique that can provide an approximate low-rank representation of the original traffic without assuming uniformity or marginality. This technique not only allows us to simplify and analyze traffic without destroying the information content, but also leads us to a deep analytical understanding of network behaviors that matches our intuitive understanding of typical modes of network behavior.
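The loss of information under these transformations can be illustrated numerically. The sketch below is a toy example under stated assumptions: flows are rows of a small random matrix of per-flow distributions, the marginal model is taken to be left-multiplication by the averaging matrix (1/f)·11ᵀ, the uniform model is taken to be right-multiplication by (1/d)·11ᵀ, and the conditional model averages within hypothetical source-node classes. The specific matrices and the `source` labels are illustrative choices, not the definitions from Section 4.4.

```python
import numpy as np

rng = np.random.default_rng(0)
f, d = 6, 4                          # 6 flows, 4 histogram bins (toy sizes)
P = rng.random((f, d))
P /= P.sum(axis=1, keepdims=True)    # each row is one flow's distribution

# Marginal model (assumed form): left-multiply by the rank-one averaging
# matrix, so every flow receives the same network-wide distribution.
A_marginal = np.full((f, f), 1.0 / f)
P_marginal = A_marginal @ P

# Uniform model (assumed form): right-multiply by a rank-one matrix, so every
# flow's distribution is flattened to uniform over the bins.
U = np.full((d, d), 1.0 / d)
P_uniform = P @ U

# Conditional model (assumed form): left-multiply by a block-averaging matrix
# with one block per conditioning class (hypothetical source-node labels).
source = np.array([0, 0, 0, 1, 1, 2])
same_class = (source[:, None] == source[None, :]).astype(float)
A_cond = same_class / same_class.sum(axis=1, keepdims=True)
P_cond = A_cond @ P

ranks = [int(np.linalg.matrix_rank(M))
         for M in (P, P_marginal, P_uniform, P_cond)]
```

Here `ranks` holds the ranks of the original, marginal, uniform and conditional matrices in that order. Both rank-one transformations collapse the matrix to rank one, while the conditional matrix retains at most one degree of freedom per conditioning class, which is why it approaches the original only as the number of classes grows.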

We present here preliminary results from applying nonnegative matrix factorization (NMF) to the packet behaviors of real flows. Recall from Section 4 that the matrices $Z$ and $V$ represent the packet size and inter-packet interval histograms of a collection of flows. We apply NMF to the concatenation of these matrices, $P = [Z\ V] \in \mathbb{R}^{f \times (d_z + d_v)}$. NMF attempts to approximate $P$ as the product of nonnegative matrices:

$$P \approx W \Sigma H, \tag{31}$$

where $W \in \mathbb{R}^{f \times b}$ is column-stochastic, $\Sigma \in \mathbb{R}^{b \times b}$ is diagonal with descending diagonal entries, and $H \in \mathbb{R}^{b \times (d_z + d_v)}$ is row-stochastic. The inner dimension, $b \in \mathbb{N}$, must be determined as well. What is the interpretation of such a factorization? The rows of $H$ are basic behaviors: pairs of packet size and inter-packet interval distributions, such that each flow's packet behavior can be approximated as a linear combination of these basic behaviors. Specifically, the matrix $W\Sigma$ gives the mixing coefficients for each flow. The diagonal entries of $\Sigma$ have a special meaning as well: they are the number of packets associated with each basic behavior. Thus the basic behaviors are sorted in descending order of how much traffic they explain.

Table 3 Components of flow packet behavior, ranked in order of prevalence, with percentage of packets explained

| Rank | Packets | Description | Well-known ports |
|------|---------|-------------|------------------|
| 1 | 26% | Download | HTTP, IMAPS, SSH |
| 2 | 14% | SSH typing | SSH |
| 3 | 9% | Web surfing | HTTP, SSH |
| 4 | 9% | ??? | ICMP, HTTP, Gnutella |
| 5 | 8% | ??? | Everything |
| 6 | 7% | Gnutella control | Gnutella, SSH, HTTP, other |
| 7 | 7% | Download | HTTP, Gnutella, other |
| 8 | 6% | Ping | ICMP, Gnutella |
| 9 | 6% | Download | HTTP, other |
| 10 | 3% | Gnutella server | Gnutella, other |

Also shown are our interpretations of each behavior and the well-known port numbers associated with each.
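A factorization of this shape can be sketched in a few lines. The example below is a minimal illustration, not the procedure used for the results in this paper: it uses the multiplicative updates of Lee and Seung [7] as one standard NMF algorithm, since the section does not state which algorithm or settings were used. It then rescales the two factors so that one is column-stochastic, the other is row-stochastic, and the remaining scale sits in a descending diagonal, written `sigma` here. The function names, the toy data and the iteration count are assumptions.

```python
import numpy as np

def nmf(P, b, iters=1000, seed=0, eps=1e-12):
    """Approximate nonnegative P (f x d) as W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    f, d = P.shape
    W = rng.random((f, b)) + 0.1     # strictly positive starting point
    H = rng.random((b, d)) + 0.1
    for _ in range(iters):
        H *= (W.T @ P) / (W.T @ W @ H + eps)
        W *= (P @ H.T) / (W @ H @ H.T + eps)
    return W, H

def to_stochastic_form(W, H):
    """Rescale W @ H into W_s @ diag(sigma) @ H_s without changing the product."""
    row_sums = H.sum(axis=1)
    H_s = H / row_sums[:, None]      # rows of H_s sum to 1
    W_scaled = W * row_sums          # column k absorbs the scale taken from row k
    sigma = W_scaled.sum(axis=0)
    W_s = W_scaled / sigma           # columns of W_s sum to 1
    order = np.argsort(-sigma)       # descending diagonal entries
    return W_s[:, order], sigma[order], H_s[order]

# Toy data: two hypothetical basic behaviors over six histogram bins, mixed
# by per-flow packet counts (rows are flows).
H_true = np.array([[0.0, 0.0, 0.0, 0.0, 0.1, 0.9],    # mostly large packets
                   [0.8, 0.2, 0.0, 0.0, 0.0, 0.0]])   # mostly small packets
counts = np.array([[1000, 0], [500, 0], [0, 100],
                   [0, 300], [200, 50], [50, 400]], dtype=float)
P = counts @ H_true

W, H = nmf(P, b=2)
W_s, sigma, H_s = to_stochastic_form(W, H)
rel_err = np.linalg.norm(P - W_s @ np.diag(sigma) @ H_s) / np.linalg.norm(P)
```

Because the columns of `W_s` and the rows of `H_s` each sum to one, the entries of `sigma` add up to the total mass of the approximation. That is what allows them to be read as the amount of traffic attributed to each basic behavior.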

The striking results of applying nonnegative matrix factorization to P are shown in Fig. 4. Twelve basic behaviors suffice to explain all the traffic in a “toy” sample of 4138 flows. The pairs of packet size and inter-packet interval distributions are almost immediately recognizable as resembling intuitive modes of network behavior. The first basic behavior, for example, has all packet sizes near the MTU, while intervals are smoothly distributed around a fraction of a millisecond. This is what we would expect from downloading large files; and indeed, the breakdown of packets by port number bears out this interpretation: almost all of the traffic is HTTP server traffic, with small slivers of IMAPS and SSH server traffic. The second basic behavior is equally intuitive: tiny packets with inter-packet intervals distributed around 0.1 s. This is all interactive SSH traffic, as indicated by the associated pie chart. The other behaviors are listed in Table 3 with interpretations. The true significance of this result lies in the fact that NMF succeeds in classifying flows by network behavior completely automatically, not only without using port numbers, but also without any information about the number or kind of plausible interpretations. The algorithm not only agrees with our intuition about network behavior, but exceeds it by discovering real but unknown behaviors awaiting interpretation.

8 Conclusions

We have presented a fundamentally new, algebraic way of representing traffic patterns in local area networks. We call this representation the general matrix model. Each flow of a traffic collection is represented as a single high-dimensional vector with components representing the IP protocol type, source and destination nodes, start time, flow size, and packet behavior. The essential property of the algebraic representation is that standard vector operations compute natural and useful descriptions of aggregate behavior over the collection of flows represented. Our experimental validation demonstrates that the general matrix model accurately reproduces performance characteristics of real traffic.

The benefits of a performance-preserving algebraic representation are multifold. The algebraic forms of the simplifying assumptions made by many common modeling approaches provide unprecedented clarity and coherence to a complex and confusing subject. For example, assumptions that various aspects of flow behavior are stochastically or deterministically uniform all take the same form algebraically: right matrix multiplication. The other major class of common simplifying assumptions made by traffic models is to apply some marginal behavior across the entire collection of flows or to disjoint subsets of the flows. Such simplifications correspond to left matrix multiplication in the GMM. Because of the simplicity and clarity that the algebraic structure brings to the subject, we can see that both common types of simplifying assumptions are naïve and simplistic. Our experimental results bolster this conclusion: both uniform and marginal models significantly misrepresent many important network performance metrics. With the general matrix model, however, we stand poised to use the powerful tools of modern linear algebra to develop general, elegant and effective models of network behavior.

References

1. Avallone S, Emma D, Pescapè A, Ventre G (2004) A distributed multiplatform architecture for traffic generation. In: International symposium on performance evaluation of computer and telecommunication systems, San Jose, CA, USA, July

2. Clark D (1988) The design philosophy of the DARPA internet protocols. In: ACM Sigcomm. ACM, New York

3. Hernández-Campos F (2006) Generation and validation of empirically-derived TCP application workloads. PhD thesis, Univ. of North Carolina, Chapel Hill

4. Hernández-Campos F, Karaliopoulos M, Papadopouli M, Shen H (2006) Spatio-temporal modeling of traffic workload in a campus WLAN. In: 2nd annual international workshop on wireless internet, Boston, MA, USA, August

5. Karpinski S, Belding E, Almeroth K (2007) Towards realistic models of wireless workload. In: 3rd workshop on wireless network measurements (WinMee), Limassol, Cyprus, April

6. Karpinski S, Belding E, Almeroth K (2007) Wireless traffic: the failure of CBR modeling. In: 4th international conference on broadband communications, networks, and systems (Broadnets), Raleigh, NC, USA, September

7. Lee D, Seung H (2001) Algorithms for non-negative matrix factorization. Adv Neural Inform Process 13:556–562

8. Paxson V, Floyd S (1995) Wide-area traffic: the failure of Poisson modeling. IEEE/ACM Trans Networking 3(3):226–244, June

9. Perkins C, Belding-Royer E, Das R (2003) Ad hoc on-demand distance vector (AODV) routing. Request for comments (Experimental) 3561, Internet Engineering Task Force, July

10. Rudin W (1987) Real and complex analysis, 3rd edn. McGraw Hill, New York, NY

11. Sommers J, Barford P (2004) Self-configuring network traffic generation. In: ACM/USENIX internet measurement conference, Taormina, Sicily, Italy, pp 68–81, October

12. Weigle M, Adurthi R, Hernández-Campos F, Jeffay K, Smith F (2006) Tmix: a tool for generating TCP application workloads in ns-2. In: ACM/SIGCOMM computer communication review, Pisa, Italy, July
