
Video Transmission over Lossy Wireless Networks: A Cross-layer Perspective

Scott Pudlewski, Member, IEEE, Nan Cen, Student Member, IEEE, Zhangyu Guan, Member, IEEE, and Tommaso Melodia, Member, IEEE

This paper is based upon work supported in part by the US Office of Naval Research under grant N00014-11-1-0848 and by the US National Science Foundation under grant CNS-1117121. S. Pudlewski is with the Massachusetts Institute of Technology (MIT) Lincoln Laboratory, Lexington, MA 02420 (email: [email protected]). N. Cen, Z. Guan, and T. Melodia are with the Department of Electrical Engineering, University at Buffalo, The State University of New York, Buffalo, NY 14260 (email: {nancen, zguan2, tmelodia}@buffalo.edu). Manuscript accepted for publication in the IEEE Journal of Selected Topics in Signal Processing.

Abstract—Video content currently makes up nearly half of the “fixed” Internet traffic and more than a third of the mobile traffic in North America, with most other regions showing similar trends. As mobile data rates continue to increase and more people rely on 802.11 wireless for home and commercial Internet access, the amount of video transmitted over at least one wireless hop will likely continue to increase. In addition, as cameras continue to become smaller and cheaper, the demand for video services in sensor and MANET networks will also increase. In this article, we examine the state of the art in wireless video communication at each layer of the networking stack. We consider both existing and emerging technologies at each layer of the protocol stack as well as cross-layer designs, and discuss how these solutions can improve the video experience for the end user. The goal of this work is to begin discussing the way forward for wireless video-enabled applications and devices.

I. INTRODUCTION

In 2013, it was reported [1] that 28.18% of all peak-time Internet traffic was attributed to Netflix [2], while YouTube [3] accounted for another 16.78%. To put this in perspective, all HTTP traffic only accounted for 9.26% of traffic [1]. In addition, as available data rates increase, interactive video services [4] are becoming more profitable and therefore more prevalent. As low-priced video-enabled mobile devices (such as smartphones) become more common, video creation and consumption on mobile devices is increasing dramatically. Most of these advances are due (at least in part) to advances in predictive video encoding algorithms (for example, H.264/AVC [5] and the soon-to-be-released HEVC [6]), which enable extreme compression of video traffic with very little distortion.

However, restrictions in data rate, computational complexity, battery life, cost, and channel quality still limit the performance of wireless video streaming on resource-constrained devices. The current practice of focusing predominantly on rate-distortion performance with little consideration for these restrictions has led to the development of encoders that often perform poorly in non-ideal network conditions. Current wirelessly networked streaming systems are therefore affected by the following limitations.

A. Data Rate Constraints

While the exact data rate required for a video stream is highly variable and depends on the specific parameters of the video encoder, video streaming traditionally requires high throughput. To use commercial systems as an example, the YouTube streaming rate is usually around 285 kbit/s, and it can reach rates of up to 1005 kbit/s [7]. Similarly, Netflix data rates range from 100 kbit/s to 1750 kbit/s for standard definition, and from 2350 kbit/s to 3600 kbit/s for high definition, with the application choosing the highest rate that the network is able to support [8]. Because these streaming services are designed to take full advantage of any available wireless capacity, they will often force the network to operate in a near-congested regime, forcing the application to adapt the video rate to prevent congestion as other flows enter the network.

B. Complexity Constraints

While high-end mobile devices have recently become commercially available (e.g., smartphones, tablets), these devices are for the most part battery-powered and resource-constrained. While they are capable of implementing complex video encoding algorithms, the real-time execution of such algorithms will drain the battery of the device very quickly [9]. This leads to a tradeoff between data rate requirements and power constraints. Reducing the complexity of the compression algorithm to conserve energy at the encoder decreases the resulting encoding efficiency, resulting in larger amounts of data that need to be transmitted.

C. Channel Conditions

Compensating for lossy wireless channels is a major challenge. It is well known that predictively encoded video is susceptible to bit errors. This is due in part to the use of variable-length coding (e.g., Huffman coding), in which a single bit error can cause the loss of entire blocks of data. In data networks, bit errors are usually dealt with using some form of error correction scheme that generally takes an all-or-nothing approach, in that a received packet is either entirely correct or is discarded and must be retransmitted. However, while too many errors can cause significant distortion to the end user, guaranteeing error-free reception is unnecessary [10]. While the quality does decrease sharply when the bit error rate (BER) increases beyond some threshold, for low levels of BER there is typically no perceptually observable decrease in video quality.

This leads to an obvious tradeoff between the quality of the received video and the techniques used to reduce the BER. This is shown in Fig. 1, which shows the performance of compressive video sensing (CVS) [9] and H.264/AVC [5] for increasing bit error rates. CVS is a video encoder based on compressed sensing principles, which will be discussed in detail in Section III-D.


Fig. 1: SSIM [11] vs. BER for H.264 and CVS encoders. The plot shows measured and theoretical SSIM for each encoder at a constant encoded video rate, for BERs from $10^{-6}$ to $10^{-1}$.

In these tests, we see that there is little or no effect on the perceivable quality of the received video for BERs of up to $10^{-4}$ for H.264 [5] or $10^{-3}$ for CVS [9]. This shows that, while channel conditions can be highly variable, in many cases we can simply ignore errors. When we look at a system implementation, this intuitively means that even with low-quality and low-power transceivers such as those in wireless sensor network (WSN) systems [12], [9], it is still possible to achieve high-quality video by i) avoiding the use of packet-level error detection schemes such as CRC, and ii) assuring that the BER is “low enough”. Below, we will discuss techniques at both the video encoder and the PHY layer that can potentially accomplish this.

D. Network Constraints

These tradeoffs become tightly coupled and much more difficult to understand when streaming video over large-scale networks, e.g., mesh, sensor, and vehicular networks. First, it is by no means easy for the video source to acquire exact, up-to-date state information for the entire network, which may comprise multiple hops and multiple paths, and be characterized by mobility and heterogeneity of devices. Second, information-centric (or content-aware) transmission is not easy at intermediate nodes that, unlike the original source, may not be aware of all the details of the videos to be transmitted. Instead, only limited video information (e.g., roughly classified priority level, delay requirements) may be available, while the packetized video content is typically not decoded at intermediate nodes. To achieve a good tradeoff between in-network processing and transmission, a cross-layer design may be needed that jointly considers application-specific Quality of Experience (QoE) requirements, end-to-end rate control, path and link maintenance, available radio resources, and the available energy budget of the nodes. It is still a challenge to dynamically and adaptively find the optimal paths under application-specific constraints to improve the end users’ QoE. Some approaches designed to address these challenges will be discussed in the remainder of this paper.

E. Summary

In this work, we will discuss how these challenges are being addressed by state-of-the-art solutions at all layers of the networking protocol stack. We will examine both the current state of the art as well as emerging trends in wireless video streaming. We will discuss recent work being performed at each layer to enhance video communications, discuss open problems in wireless video networking, and comment on the way forward.

Fig. 2: Network Categories: (a) cellular network topology; (b) wireless network topology; (c) tactical MANET network topology; (d) sensor network topology.

The remainder of this paper is structured as follows. In Section II, we will discuss the system models and network scenarios that will be considered throughout the article. Section III discusses challenges in the design of modern video encoders for wireless networks. We will then discuss solutions at each of the layers of the networking protocol stack. We will concentrate on the transport layer in Section IV, the network layer in Section V, the data link layer in Section VI, the physical layer in Section VII, and on cross-layer solutions in Section VIII. Finally, in Section IX we will draw the main conclusions.

II. WIRELESS VIDEO TRANSMISSION OVERVIEW

In this section, we briefly introduce the system models and network scenarios that will be considered as a reference in the remainder of this article. We first discuss a number of “typical” wireless paradigms, and discuss for each of them the advantages and challenges for video streaming. We then discuss each layer of the networking protocol stack, and discuss their roles and influence on wireless video transmission. The goal is to demonstrate how the received quality of wireless video is influenced by control decisions taken at each and every layer of the protocol stack.

A. Video Network Paradigms

Not surprisingly, the performance of wirelessly transmitted video is highly dependent on the type of wireless network considered. In general, it is more challenging to transmit video over networks with multiple wireless hops. While the solutions we present here attempt to be independent of any specific wireless networking protocol, they typically fall into one or more of the following categories.

• Cellular networks. As available data rates in cellular networks increase, the number of users watching or creating video on mobile phones is increasing dramatically. Current-generation cellular networks are wireless only at the first and last hop, leading to what is essentially a wired network with a wireless “last-mile” link¹. This wired backbone allows for much higher data rates than are possible in fully wireless networks. However, the devices are usually resource-constrained. A typical cellular network architecture is shown in Fig. 2a.

• Wireless local area network (WLAN). Similar to cellular networks, a home or commercial wireless local area network generally relies on existing wired infrastructure for the majority of the link, and some version of the IEEE 802.11 [13] standard to bridge the network from the fixed infrastructure to an end user. As in a cellular network, since there is only a single wireless hop, the data rate is typically limited by the rate achievable on the wireless link. The network architecture of a WLAN is shown in Fig. 2b.

• MANET and tactical MANET. A mobile ad-hoc network (MANET) is a self-organizing, self-configuring, infrastructureless multi-hop wireless network. While MANETs have seen limited commercial use, they are fundamental to military tactical networking solutions, and can include highly heterogeneous links. Because they rely on multiple wireless hops, the data rate of these networks is often very limited. An example of such a network is shown in Fig. 2c.

• Sensor networks. Sensor networks are similar to MANETs in that they also form an infrastructureless multi-hop wireless network. However, sensor networks are generally designed to be low-cost, single-purpose networks designed to sense and relay information about their local environment. This can be done through any number of sensing modalities, from simple light and temperature sensors to more complex audio and video collection. In most cases, sensor networks either take advantage of networks of existing devices (such as using existing smartphones to monitor an urban environment) or deploy purpose-built single-use devices. Often in the latter case, the wireless devices are deployed where accessing them is very difficult, and the devices are treated as disposable. An example of a sensor network attached to a wireless router is shown in Fig. 2d.

B. Video Traffic Paradigms

Along with the network topology, the performance of wireless video is also influenced by the nature of the video application. The specific application influences the tolerance to latency, distortion, and bandwidth constraints, as well as privacy or security restrictions and format requirements. Below, we will introduce a number of common traffic paradigms.

¹It is worth noting that LTE-Advanced, as well as next-generation standards, may take advantage of multiple wireless hops. However, the assertion that the majority of the link between two wireless cellular nodes is able to rely on wired infrastructure is still valid.

• Real-time video. Real-time video is a traffic paradigm in which the video is being used for some real-time application such as video telephony. Because the video is being used to communicate in real time, low latency is essential. Even small delays can have a significant impact on the quality of the video communication experience. Such services can be viewed as systems that attempt to minimize the delay between the content being captured at the source and the content being displayed by the receiver. Because of this emphasis on timely delivery, such systems may be willing to sacrifice quality for latency.

• Video gaming. Video gaming applications are similar to real-time video applications in that the latency of the delivered video is the primary concern. However, unlike video telephony, which can be viewed as a small number of independent video streams, video gaming applications tend to be highly interactive. Such systems add additional low-latency interactive requirements to the network.

• Video on demand. Unlike real-time video services, video on demand services transmit pre-recorded content based on the demands of the end user. Such services generally take advantage of a small buffer, as well as the availability of the entire video stream, to deliver much higher quality video than more time-sensitive applications.

• Interactive video. By its traditional definition, interactive video is essentially video on demand in which the user can interact with the playback of the video. This includes traditional playback commands such as pause, fast forward, and rewind. With the huge demand for Internet video services such as Netflix and YouTube, interactive video has become the dominant source of Internet traffic in many countries.

• Multimedia surveillance. Video surveillance networks use video applications to monitor an area, usually to attempt to detect unauthorized or unexpected activity. While they often deliver real-time video content to an end user, video surveillance applications may also deliver a high-quality version of the video to a storage device for later forensic use.

C. The Networking Protocol Stack

For this work, we will be using the networking protocol stack shown in Fig. 3. In the following sections, we will discuss how decisions at the application, transport, network, data link, and physical layers influence the received quality of video transmitted over a wireless network. Specifically:

• Application layer. The application layer is responsible for the compression and formatting of the video stream. In Section III we will discuss how different compression techniques can influence the quality of video transmitted in lossy channels.

• Transport layer. The transport layer controls the end-to-end delivery of the video packets. This influences both the rate at which the video packets can be transmitted through the network as well as any end-to-end delivery guarantees that may or may not be provided by the protocol. We will discuss techniques for rate control and admission control for wireless video in Section IV.


Fig. 3: Networking protocol stack for wireless video streaming. Application layer: video compression, formatting of the video bytestream. Transport layer: rate control, end-to-end delivery. Network layer: multi-hop path selection, end-to-end delay guarantee. Data link layer: channel access, one-hop delay guarantee. Physical layer: channel adaptation, error correction.

• Network layer. The network layer controls the path selection for the video packets. Section V will discuss how using video-specific metrics in the routing decisions can significantly affect the video quality.

• Data link layer. Among the functionalities handled at the data link layer, medium access control (MAC) is in charge of enabling fair sharing of the wireless broadcast medium among different users. We will see in Section VI that by designing the MAC protocol specifically for video transmission, we can greatly increase the received video quality.

• Physical layer. The physical (PHY) layer influences the data rate and bit error rate of the video transmission. In Section VII, we will discuss how new advances in cooperative communication techniques and in rateless coding, among others, can be leveraged to improve video quality.

By looking at the above list, it is easy to see that many advances can be achieved by taking video-specific (i.e., application-layer) information into account when making lower-layer decisions. This cross-layer approach underlies much of the video-quality optimization used in the solutions discussed in the remainder of this paper.

III. VIDEO ENCODERS

Wireless video streaming is fundamentally different from video streaming over wired networks. In this section, we will introduce some video encoding techniques that are relevant to wireless video transmission. The reader is referred to [14], [5], [15], [16] and the references therein for more details.

A. Intra-Frame Encoding

Intra-frame video encoders process each frame independently using still image encoding. The main difference between this class of encoders and a simple collection of images is that, while the frames are encoded independently, they are transmitted as a single video file (including video headers). While these encoders have relatively poor rate-distortion performance compared to the more complex predictive video encoders, they have low complexity at the encoder. This can be important for those multimedia devices that do not have the processing power (or battery capacity) to execute the more complex algorithms, e.g., wireless sensor nodes. In addition, since each frame is independent, there is far less visible distortion from channel errors.

1) Motion JPEG (MJPEG): MJPEG video encoding is an intra-frame encoding scheme based directly on the JPEG image compression standard [15]. Although there is no official MJPEG standard, in most implementations each frame is first divided into 8×8 blocks that are transformed to the frequency domain using the discrete cosine transform (DCT), creating an 8×8 block of DCT coefficients.

From this point, the DCT coefficients of each macroblock are quantized and entropy-encoded, resulting in significant compression. The DCT concentrates most of the signal energy into a few coefficients; this allows the bit allocation to easily give more weight to the low-frequency components. Moreover, this corresponds well to the human visual system, which is relatively insensitive to high-frequency content, allowing the encoder to remove many of the high-frequency components with little effect on the perceived quality of the resulting image. In MJPEG, this process is done for each frame independently. The resulting video has compression and quality comparable to JPEG image compression and can be produced without significant processing and power requirements at the encoder.
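To make the transform-quantization step concrete, the sketch below (a minimal illustration, not tied to any particular MJPEG implementation) applies an 8×8 DCT to one block and quantizes it with the standard JPEG luminance table; the rounding of most high-frequency coefficients to zero is where the compression comes from. Entropy coding is omitted.

```python
import numpy as np
from scipy.fftpack import dct, idct

# Standard JPEG luminance quantization table.
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def dct2(block):
    # Separable 2-D type-II DCT with orthonormal scaling.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

def encode_block(block):
    # Level-shift, transform, quantize: most high-frequency coefficients
    # round to zero and cost almost nothing after entropy coding.
    return np.round(dct2(block.astype(float) - 128) / Q).astype(int)

def decode_block(coeffs):
    # Dequantize and invert the transform.
    return np.clip(np.round(idct2(coeffs * Q)) + 128, 0, 255).astype(np.uint8)
```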

2) Motion JPEG 2000 (MJ2): Similar to MJPEG, MJ2 video is an intra-frame encoding standard based on the JPEG 2000 compression standard [16]. Each MJ2 frame is first split into rectangular regions referred to as tiles. Once the image has been decomposed into tiles, a discrete wavelet transform (DWT) is applied. At this point the wavelet-transformed coefficients are quantized and entropy-encoded. While the compression rate of MJ2 may be better than that of MJPEG [16], many of the advantages of MJ2 over the older MJPEG lie in the flexibility of the standard. These include a naturally scalable representation and the availability of regions of interest, where a spatial portion of the image is encoded at higher resolution than the rest of the image.

B. Predictive Encoding

Discussing the differences between common video encoders (i.e., MPEG-2 [17], H.263 [14], H.264/AVC [5], HEVC [6]) would require a level of detail that is beyond the scope of this work. Instead, we will use the H.264/AVC video encoder as defined in [5] as an example of predictive video encoders in general. For details of the other encoders, the reader is referred to the works cited above and the references therein.

Along with the frequency transform, quantization, and entropy encoding functions found in MJPEG and MJ2, predictive video encoders take each image block and compare it to other macroblocks, either within the same frame (intra-prediction) or in a previous frame (inter-prediction). By identifying the difference between two macroblocks and encoding only that difference, the encoder can significantly reduce the amount of information necessary to represent a video at a desired quality.

For wireless video transmission, the prediction operation has several important consequences. Primarily, it creates an interdependence between frames in a video stream. Generally, video streams are divided into three types of frames. There are intra-encoded (I) frames, predicted (P) frames, and bi-directionally predicted (B) frames. As the names suggest, the I frames are encoded using only blocks contained within that frame², P frames are encoded using both blocks within the frame as well as blocks in the previous I or P frame, and B frames are predicted from blocks within the frame and from both the previous and next I or P frame. Some encoders may also use more complex frame reference structures in which B frames are predicted from multiple I, P, or even other B frames. As we will see later, while prediction greatly increases the rate-distortion performance of the encoder, there are a number of tradeoffs in both error resiliency and complexity. Even though these systems attempt to be resilient to errors, their primary focus is on improved rate-distortion performance. Because of this, while state-of-the-art commercial encoders have very impressive rate-distortion performance, encoders that focus more on these other aspects may have other desirable properties in specific video streaming applications.

C. Distributed Video Encoding

It was shown in [18], [19] that the performance of side channel coding could theoretically match that of predictive coding. Since side channel coding naturally pushes much of the complexity from the encoder to the decoder, it could help address one of our primary challenges if it could be used to develop a video encoder. In [20], the authors accomplish this by developing the Power-efficient, Robust, hIgh-compression Syndrome-based Multimedia (PRISM) encoder.

To understand the key concepts of PRISM and similar encoders, first note that motion prediction is the primary source of computational complexity. Unfortunately, motion prediction is also the primary source of compression in most encoders. Therefore, an encoder that could take advantage of the motion prediction capabilities without the added complexity could be ideal for wireless devices. Side channel coding techniques enable this functionality. Assume, as in [20], that $X$ is the block to be coded in the current video frame, $Y$ is the block that represents the best predictor for $X$, and they are related through a noise term $N$ such that $X = Y + N$. Even though it is computationally infeasible to determine $Y$ at the source, it is appropriate to assume that the destination will have access to $Y$ and that the statistics of the “correlation noise” $N$ are known. We can then (after intra-encoding) use the knowledge of $N$ to partition the codeword space of $X$ and send only this coset of $X$ to the destination. The destination can then use $Y$ together with the statistics of $N$ to determine which member of the coset should be decoded.

Since there are fewer cosets of $X$ than there are original codewords of $X$, fewer actual bits need to be transmitted. However, since we know the relationship between $X$ and $Y$, we can still achieve compression rates similar to those of predictive coding without determining $Y$ at the source. In the case of multi-view video, this also provides the potential to exploit inter-view correlation at the receiver side, e.g., to jointly decode independently encoded multi-view videos [21].
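A toy scalar version of this coset idea is sketched below. It illustrates syndrome coding in general, not the actual PRISM codec: the encoder sends only the coset index of the quantized value of $X$, and the decoder resolves the ambiguity using the side information $Y$. All parameter values are illustrative.

```python
import numpy as np

def coset_encode(x, step, n_cosets):
    # Quantize the value and transmit only its coset index (a "syndrome"):
    # log2(n_cosets) bits instead of a full codeword.
    q = int(np.round(x / step))
    return q % n_cosets

def coset_decode(syndrome, y, step, n_cosets, search=64):
    # The decoder knows the side information y (the best predictor for x)
    # and picks the coset member closest to it. This succeeds as long as
    # |x - y| is small relative to step * n_cosets / 2.
    q_y = int(np.round(y / step))
    candidates = [q for q in range(q_y - search, q_y + search + 1)
                  if q % n_cosets == syndrome]
    q_hat = min(candidates, key=lambda q: abs(q * step - y))
    return q_hat * step

# Example: x and y are correlated (x = y + n, with n small).
y, x, step, m = 100.0, 103.0, 2.0, 8
s = coset_encode(x, step, m)          # only 3 bits on the wire
print(coset_decode(s, y, step, m))    # 104.0, the codeword x quantized to
```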

²In addition to intra-encoding similar to that described in Section III-A, most I frames also use intra-prediction, where blocks within a frame are predicted from other blocks within the same frame.

D. Compressed Sensing Based Video Encoding

Compressed sensing (CS, aka “compressive sampling”) is a new paradigm that allows the faithful recovery of signals from $M \ll N$ measurements, where $N$ is the number of samples required for Nyquist sampling [22]. Hence, CS can offer an alternative to traditional video encoders by enabling imaging systems that sense and compress data simultaneously at very low computational complexity for the encoder. CS images and video are also resilient to channel errors [10].

Similar to the distributed video encoding described above, compressed sensing based video encoders, specifically CVS [9], take advantage of known properties of the encoded video frame to reduce the number of bits that need to be transmitted to accurately represent the frame. In this case, rather than assuming some prediction signal (as in the $Y$ signal above), CVS leverages the sparsity of the DWT- or DCT-encoded frame, along with the sparsity of the difference between frames, to compress the video. Specifically, the intra-encoded frame $y_i$ is calculated from the raw frame $x_i$ as $y_i = \Phi x_i$, where $\Phi \in \mathbb{R}^{M \times N}$, $M \ll N$, is a “noise-like” under-determined sampling matrix. To recover $x_i$ from $y_i$ at the receiver, the decoder solves a sparse signal reconstruction problem, essentially finding the “sparsest” signal that fits the measurements in $y_i$. To take advantage of temporal correlation, CVS exploits the fact that the difference between two correlated frames is sparse, and uses this information to again compressively sample the difference between two encoded frames.
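The sketch below illustrates the measurement and recovery steps on a synthetic sparse signal (standing in for a sparse frame difference); the encoder is a single matrix multiply, while the decoder runs ISTA, one simple choice of sparse reconstruction solver. The dimensions and the solver are illustrative assumptions, not those used by CVS.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 80, 8                   # signal length, measurements, sparsity

# Sparse signal (e.g., the difference between two correlated frames).
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # "noise-like" sampling matrix
y = Phi @ x                                      # encoder: one matrix multiply

# Decoder: ISTA iterations for min ||y - Phi*s||^2 + lam*||s||_1.
lam, L = 0.01, np.linalg.norm(Phi, 2) ** 2       # L = squared spectral norm
s = np.zeros(N)
for _ in range(500):
    g = s + (Phi.T @ (y - Phi @ s)) / L          # gradient step
    s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold

print(np.linalg.norm(s - x) / np.linalg.norm(x))  # relative error, should be small
```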

While in general compressed-sensing-based encoders cannot achieve the same compression rates as predictive encoding schemes, additional properties including error resilience and extremely low computational complexity [9] make such encoders appealing for resource-constrained environments. In resource-constrained environments with a limited energy budget, CVS was shown to perform better than intra-encoded H.264/AVC [9] in terms of overall energy-rate-distortion performance.

IV. TRANSPORT PROTOCOLS

Streaming video has traditionally relied on the user datagram protocol (UDP) at the transport layer [23]. This was a direct result of TCP's delivery guarantees: for file transfer applications, it is essential that every packet is delivered, and because of this, TCP will (sometimes severely) increase latency in order to deliver lost packets. However, as we discussed above for real-time streaming video, latency is the critical parameter in many video streaming applications. This led to video being streamed over the potentially lossy UDP protocol, while the reliability responsibilities were transferred up the protocol stack to protocols such as RTP with RTCP.

However, as video on demand (interactive Internet video services) became more prevalent, the congestion control components of TCP became necessary to prevent congestion at major video distribution services. In addition, higher-bandwidth/lower-latency networks, along with cheap memory for buffering video packets, led many major video distribution services, including YouTube and Netflix, to rely on standard TCP connections [24].

While this is feasible in cellular or Wi-Fi type networks where the data rate is relatively high, MANET and sensor networks pose a different type of challenge. Because of the lossy nature of such networks, TCP performs too poorly to be practical. However, since the data rate of these networks is so low [25], [26], congestion control is essential for multiple video streams to share the network. We will show that integrating congestion control with the video encoder at the application layer can substantially increase the overall received video quality [25].

In this section, we will examine the impact of transport control on wireless video transmission. We will examine how traditional congestion control protocols can be used for different types of wireless video transmission, including an overview of how congestion control is implemented in YouTube. We will also examine some ways in which these traditional techniques can be improved in wireless networks. Finally, we will examine an approach to congestion control in highly-constrained sensor networks.

A. Congestion Control for Wireless Video Using Traditional Protocols

The limited capacity of wireless networks requires specific functionalities to control the video rate when the offered traffic exceeds the network capacity. Therefore, to begin, we will first examine the applicability of standard, well-known protocols to the congestion control problem for streaming video.

1) The “Call Manager” Approach: At a high level, a call manager is responsible for admission control into a bandwidth-limited network. The general concept is simply that if the total available capacity of the network is $C$, and each caller $i$ requires a minimum capacity $c_i$ to achieve an acceptable received call quality, then only $k$ callers such that $\sum_{i=1}^{k} c_i \leq C$ will be allowed into the network.

While this approach was originally designed for wired networks, it can be applied to wireless video streaming, even though some challenges need to be addressed. First, even in a relatively stable network, wireless links are affected by bandwidth fluctuations. Therefore, the value of $C$ will likely change over time because of channel fluctuations, requiring rules for “dropping” video streams that exceed the available capacity. Second, this approach requires a centralized control point in the network that has the ability to prevent new video streams from joining the network. Again, this may be challenging, especially in MANET networks where there is by definition no central coordinating node.
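A minimal sketch of such a call manager follows, assuming each stream declares a minimum rate $c_i$ up front; the drop-on-capacity-change policy (evicting the most recently admitted streams) is just one possible rule, not one prescribed by any particular standard.

```python
class CallManager:
    """Toy admission controller: the sum of per-stream minimum rates
    must stay at or below the (currently estimated) capacity C."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.streams = {}                 # stream id -> minimum required rate c_i

    def request(self, sid, rate):
        # Admit the new stream only if the aggregate still fits within C.
        if sum(self.streams.values()) + rate <= self.capacity:
            self.streams[sid] = rate
            return True                   # admitted
        return False                      # blocked: would exceed C

    def capacity_changed(self, new_capacity):
        # Wireless capacity fluctuates; drop the most recently admitted
        # streams (one possible policy) until the remaining ones fit again.
        self.capacity = new_capacity
        while self.streams and sum(self.streams.values()) > self.capacity:
            self.streams.popitem()        # dicts preserve insertion order
```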

Even with these limitations, this approach can enable video in networks where the source of the video is a remote node, the capacity of the network is limited, and/or the payload of the packets is encrypted (and it is therefore not possible to intelligently drop frames). If the video rate cannot be adapted at all, this approach may be the only feasible method for maintaining video quality for at least a subset of users.

2) Congestion Control Based on Standard TCP: If the latency constraints of a real-time streaming system can be relaxed, it is possible to implement video streaming services based on standard TCP protocols. Recently, this approach has become more prevalent as the need for congestion control of video streams has become more important. For TCP-based video streaming services to be able to deliver acceptable performance, however, a few requirements need to be observed:

Fig. 4: A time sequence graph for YouTube video streaming [27]: sequence offset (MB) vs. relative time (s), showing data in flight, the receive window, and acknowledged data.

• Sufficient capacity. Traditional TCP approaches are widely used in commercial systems (e.g., YouTube). This is largely due to the use of advanced video encoders (e.g., H.264 [5]). These encoders allow YouTube to compress video to rates far below the capacity of most cellular and wireless networks (from 285 kbit/s up to 1005 kbit/s [7]).

• Sufficient buffers. Capacity in wireless networks is highly variable. Ensuring that the receive buffers are large enough to “smooth over” capacity fluctuations allows for an uninterrupted user experience [27].

• Sufficient channel quality. TCP will dramatically underestimate the capacity of a wireless network that has a high packet loss rate [28], [29]. For video streaming to work with TCP, the packet loss rate needs to be very low, or some performance enhancing proxy (PEP) must be utilized to maintain a high enough rate.

Block-based flow control: In [27], in order to understand the operation of the proprietary YouTube video streaming server protocol, the authors investigate application flow control of streaming YouTube video. They were able to determine that the server was limiting the rate at which packets were introduced into the network based on the playback rate of the video. The graph of TCP sequence number vs. time from [27], shown in Fig. 4, demonstrates this “application flow control” scheme based on network traces of YouTube streaming video. For the first 7 seconds, the sequence resembles standard TCP slow-start. However, after this point, the packets are transmitted in “blocks” at regular intervals. The intuitive reason for this is that while the initial slow-start phase is required to build up a buffer at the receiver, there is no reason to transmit at a significantly higher rate than the playback rate of the video. It is conceptually similar to using a third playback window to avoid transmitting more packets than are required. There seem to be two main consequences of this additional window. First, it prevents the server from transmitting video packets that will not be viewed at the receiver, which over many video streams could lead to significant resource savings. It also prevents unnecessary reconstruction and buffering of video content at the receiver. It is also worth noting that while these results are based on YouTube traces, similar functionality has been noted in Netflix streams.

While this scheme effectively limits the transmission rate of the video to the video playback rate, as the authors of [27] point out, the block-based streaming scheme shown in Fig. 4 does not perform well in congested or nearly-congested networks. The blocks themselves are relatively large³ compared to a single packet. While the rate at which new blocks are introduced is limited, all of the packets from an individual block are transmitted at a much higher instantaneous rate. Since the blocks are large enough to overflow a partially full queue, it is possible that these short bursts of traffic can cause congestion, causing all of the flows to back off (including the offending video stream itself). In a nearly congested home wireless network, this would essentially cause a periodic decrease in the overall throughput of the network. This behavior is verified in [27] and shown in Fig. 5, which plots the TCP sequence offset vs. time. Each time the source transmits a block of video packets, the TCP protocol experiences a loss event and returns to a slow-start phase.

Fig. 5: Congestion caused by block-based streaming in YouTube [27].

Non-bandwidth-intrusive video streaming. While the block-based implementation may have some drawbacks, the concept of playback-limited flow control for streaming video can be highly beneficial. In [30], the authors also noted the block-based rate control. They then propose a new end-to-end solution that limits the rate of video packets into the network based on the video playback rate, but without the burstiness of the existing implementation.

The rate at which packets are introduced into the network is based on the playback rate. The server defines an application window that limits the number of in-flight packets based on the video playback buffer at the client. The window is limited based on how fast the user is watching the video. If the user pauses the video, the playback buffer will not drain, and the server will therefore reduce the introduction of new packets into the system. While this algorithm adapts the TCP sending rate to the video parameters, it does not take the wireless channel into account directly. This is examined explicitly below.
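A sketch of this kind of playback-limited window is shown below. It is a hypothetical reconstruction of the idea, not the algorithm from [30]; every parameter name is illustrative.

```python
def application_window(buffer_s, target_buffer_s, playback_rate_bps, rtt_s):
    """Return an in-flight byte budget tied to the client's playback buffer.

    If the buffer is full (e.g., the user paused), the deficit is zero and
    the window shrinks toward one RTT's worth of data, so the server stops
    injecting packets faster than they are consumed.
    """
    deficit_s = max(target_buffer_s - buffer_s, 0.0)
    # Enough bytes to refill the deficit, plus one RTT of data in flight
    # to keep the pipe from draining between window updates.
    return int((deficit_s + rtt_s) * playback_rate_bps / 8)

# Example: a 1 Mbit/s stream, 10 s target buffer, 8 s currently buffered.
print(application_window(8.0, 10.0, 1_000_000, 0.05))  # modest window
print(application_window(10.0, 10.0, 1_000_000, 0.05)) # paused: near zero
```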

B. TCP Proxy for Wireless Streaming

As wireless networks became more popular, it was clear that TCP alone would not perform well. From very early work in [28] to recent work in [29], it has been shown many times that allowing TCP to treat channel losses as congestion will drastically decrease throughput. Based on this insight, many protocols have been developed to distinguish between channel losses and congestion. Many approaches have been implemented, including retransmitting packets over the wireless link [28], using error correction codes [29], and using feedback to estimate the bandwidth explicitly [31], and most of them have been shown to greatly mitigate this problem.

³The authors propose that the block size may be related to the block storage of the videos. This would imply that the block-based behavior is a hardware-optimization choice as opposed to a network-optimization choice.

Unfortunately, in most cases we cannot control the TCP protocol parameters in use at the server. To compensate for this, a performance enhancing proxy (PEP) is often used to split the connection before the TCP data encounters links that have high losses and/or very high latency⁴. PEPs essentially split the TCP stream into two sections. The video is then streamed using TCP over the regions of the network where it performs well, and some other ad hoc protocol in the regions of the network where it does not. In [32], the authors describe such a PEP that is designed to enhance streaming video performance.

In [32], unlike the previous TCP applications we have discussed, the proposed system adapts the video quality to the channel quality. The PEP is located in the wireless access point and is responsible for the last-hop wireless delivery of the video stream. This placement allows the PEP to seamlessly integrate with existing protocols without any modifications at either the client or the server. The proposed scheme consists of two main components.

Fair resource sharing. One main advantage of placing the proxy at the access point (AP) of a wireless network is that there is a simple one-to-one connection between the AP and the wireless users. This allows the AP to very easily share the wireless resources using any available method (even simple round-robin scheduling could work for this).

Video adaptation. One advantage of using a PEP is that the proxy can inspect and selectively drop video packets in a way that results in a lower-quality but still acceptable video stream. Below we describe one simple method for accomplishing this for H.264/AVC [5] encoded video by taking advantage of the network abstraction layer (NAL).

First, the protocol reorders packets based on information in the NAL unit (NALU) headers. The NALU headers indicate, along with the frame type, the importance of the frame. For a full explanation of the method that the NAL uses to define the importance of a specific NALU, the reader is referred to [5]. For the purposes of this paper, this value allows the PEP to order the NALUs by importance to the received video quality. This ensures that the wireless user will receive the most important portions of the video first.

After the video frames are ordered, the PEP then tries to transmit as many frames as possible. If the wireless channel can support the playback video rate of all videos being transmitted, then there is no noticeable effect. If, however, the channel cannot support the full rate for every user, the PEP will allocate a fair portion of what is available to each node. Since the channel cannot support the full video rate, some packets will be dropped. By ordering and transmitting the packets in order of their importance to the received video quality, we can ensure that each receiver will receive the highest quality video that the channel can support.
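The following sketch illustrates this ordering-and-dropping step, assuming each NALU has already been tagged with a priority extracted from its header (e.g., the nal_ref_idc field in H.264); the data structures are hypothetical, not those of the PEP in [32].

```python
def schedule_nalus(nalus, budget_bytes):
    """Send NALUs in priority order and drop whatever no longer fits.

    Each NALU is a dict {"priority": int, "size": int, ...}; higher
    priority marks data more important to reconstruction quality.
    """
    ordered = sorted(nalus, key=lambda n: n["priority"], reverse=True)
    sent, used = [], 0
    for n in ordered:
        if used + n["size"] <= budget_bytes:
            sent.append(n)
            used += n["size"]
    # Dropped NALUs degrade quality gracefully instead of stalling playback.
    return sent
```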

⁴For example, PEPs are used extensively in SATCOM-on-the-move terminals, which experience both very high latency as well as satellite blockage losses.


C. Distortion Minimizing Rate Control (DMRC)

Wireless multimedia sensor networks (WMSNs) are a class of wireless sensor networks that record and collect video data. The classic example of such a network is an infrastructureless video surveillance network for tasks such as wildlife monitoring. Because of constraints in the cost of the devices (they are often damaged and/or lost, and must therefore be low cost), as well as power (they are usually battery-powered), WMSNs pose significant challenges for traditional congestion control protocols. However, one advantage of such networks is that, since they are essentially a “single-use” network, the designer has much more freedom than in traditional network design to optimize the network specifically for video transmission.

In this section, we look at the case where the streaming video server is able to intelligently adapt the video encoding rate based on measured channel statistics. In [25], we introduce the distortion-minimizing rate controller (DMRC), which uses the signal-to-noise ratio (SNR) and round-trip time (RTT) to determine the cause of distortion at the receiver. Unlike the solutions described above, we assume that the transport protocol is directly able to set the video encoding rate. This allows the transport protocol to make decisions based on the received video quality. Since the video quality may not be directly correlated to the encoding rate for different videos, this ensures that each client receives a “fair” quality video. In addition, DMRC uses forward error correction (FEC) codes to compensate for channel errors. This gives the protocol two “dials” it can adjust to deliver the best quality video possible to the receiver.

Specifically, to determine the overall rate $R_i$ at any decision period $i$, the video source first uses the change in the average RTT to determine the network “state”. If the RTT is on average increasing, then we can assume that the network is becoming more congested and we should decrease the video rate. Conversely, if the average RTT is decreasing, then we can assume that there is some unused capacity that we can take advantage of to increase the encoding rate. The magnitude by which the node increases or decreases the encoding rate is determined by the magnitude of the RTT change as well as the current video encoding rate. Intuitively, this weights the flows so that low-quality flows increase faster than high-quality flows, and high-quality flows decrease faster than low-quality flows. Overall, this has the effect that nodes are unlikely to stay in a low-quality state for an extended period of time.
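A hypothetical sketch of such an RTT-driven update rule is shown below; it captures the weighting just described, but it is not the exact DMRC controller from [25], and the gain constant and scaling are assumptions.

```python
def rtt_driven_rate_update(rate, rtt_trend, r_min, r_max, k=0.1):
    """One decision-period update of the encoding rate.

    rtt_trend > 0 means the average RTT is rising (congestion forming);
    rtt_trend < 0 means it is falling (spare capacity). The step is scaled
    so low-rate flows increase faster and high-rate flows decrease faster.
    """
    if rtt_trend > 0:
        step = -k * abs(rtt_trend) * (rate / r_max)          # back off
    else:
        step = k * abs(rtt_trend) * (r_max / max(rate, r_min))  # probe upward
    return min(max(rate * (1 + step), r_min), r_max)
```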

While we are able to show in [25] (and in [26] for compressed-sensing-based video encoding) that DMRC (and C-DMRC) outperform traditional rate controllers, this comes at the cost of system complexity. This protocol is designed to work directly with the video encoder, forcing the sender to encode the video in real time, which may be infeasible for many large-scale systems. However, in systems (such as Netflix) that already pre-encode video at many different encoding rates, there is some potential to adaptively select the encoding rate on the fly to adapt to changing network conditions.

V. ROUTING PROTOCOLS

Transmitting video streams over networks with multiple wireless hops (the MANET and sensor networks mentioned in Section II) is more challenging; here routing, aka path selection, plays a critical role. Different from traditional application-agnostic routing policies, state-of-the-art video-application-centric routing schemes need to consider user-perceived video quality, playback deadlines, and energy consumption rather than hop count and throughput as design metrics. In this section, we will discuss state-of-the-art video-specific routing protocols suited to the following three types of video applications: video on demand, real-time video services, and multimedia surveillance. To be specific, three representative video-centric routing approaches, one for each of the aforementioned applications, will be discussed in detail: i) minimum-distortion routing to assure the quality requirement of VoD, ii) packet-delay-deadline-aware routing to satisfy the stringent delay requirements of real-time video services, and iii) power-efficient routing to prolong the lifetime of multimedia surveillance sensors deployed in WMSNs.

A. Minimum-Distortion Routing for Video Streaming

Video on demand, as discussed in Sec. II, is usually delay-tolerant. This is because video frames are pre-coded at the server end and may be buffered before being played out at the receiver. However, because of the ever-increasing entertainment demands of users, routing protocols with assured minimum distortion are needed. Several existing routing schemes [33], [34], [35] have been proposed to maximize the received video quality for video transmission over multi-hop wireless networks. We start the discussion by looking at the Minimum Distortion Routing (MDR) [35] scheme, where an existing rate-distortion model is used to evaluate the impact of packet loss rate on end-to-end video quality.

Distortion model based on frame loss probability. In wireless multi-hop networks, video quality is mainly impacted, in addition to the compression strategy at the source, by frame losses (evaluated in terms of Packet Loss Rate (PLR)). Therefore, choosing an accurate distortion model with respect to frame loss is an important design consideration. In [35], the authors consider video frames with a Group of Pictures (GOP) structure that consists of an I-frame followed by (F − 1) P-frames. A temporal-distance-based distortion model is considered, where the distortion for the entire GOP is estimated, for the case in which frame $i \leq F - 1$ is the first frame lost, as

$$D(i) = (F - i) \cdot \frac{i \cdot F \cdot D_{\min} + (F - i - 1) \cdot D_{\max}}{(F - 1) \cdot F}, \qquad (1)$$

where $D_{\max} = D(0)$ is the resulting distortion when the first (I) frame is lost, and $D_{\min} = D(F-1)$ is the distortion if the last P frame is lost, respectively. A loss probability distribution function $\pi_t^k$ of the $k$th frame is measured at each hop $t$ based on a transition matrix, which represents a mapping from packet loss rate to frame loss rate. Since the value of the distortion $D$ at hop $t$ depends on the position of the first unrecoverable frame in the GOP, the aforementioned loss probability distribution function $\pi_t^k$ can be adopted to calculate the probability of the first lost frame in the GOP and hence to specify the distortion $D$. Finally, a modified distortion model based on $\pi_t^k$ is proposed. Readers are referred to [35] for further details.

MDR routing design. The algorithm then selects the optimal path $x^*$ from a set of candidate paths $X$ by minimizing the expected cost $C$, i.e., $x^* = \arg\min_{x \in X} C(x, D(x))$, where $D(x)$ represents the distortion estimated for path $x$ according to the model discussed above. A dynamic programming approach is used to solve this optimal routing problem. The MDR algorithm examined here therefore adopts an end-to-end video distortion model as the routing metric.
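Putting the pieces together, the sketch below evaluates the distortion model of Eq. (1) and picks the minimum-expected-distortion path from a candidate set. The per-path distribution of the first lost frame is assumed given here, whereas MDR derives it hop by hop from packet loss rates and solves the selection with dynamic programming.

```python
def gop_distortion(i, F, d_min, d_max):
    # Eq. (1): i is the position of the first unrecoverable frame in a GOP
    # of F frames (i = 0 is the I-frame, giving the maximum distortion).
    return (F - i) * (i * F * d_min + (F - i - 1) * d_max) / ((F - 1) * F)

def mdr_select(paths, F, d_min, d_max):
    """Toy MDR step: each candidate path carries a distribution
    path["first_loss_dist"][i] over the first lost frame position;
    pick the path with the lowest expected GOP distortion."""
    def expected(p):
        return sum(p[i] * gop_distortion(i, F, d_min, d_max) for i in range(F))
    return min(paths, key=lambda path: expected(path["first_loss_dist"]))
```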

B. Packet-Delay-Deadline-Aware Routing for Real-time Video Transmission

Video conferencing and online gaming are typical applications belonging to the real-time video traffic paradigm. In such scenarios, constraining the end-to-end delay is the most important design consideration, followed by the perceived video quality, which should be as high as possible. The work in [36] proposes a packet-delay-deadline-aware routing protocol to meet strict video playback deadline constraints while assuring the best user-perceived video quality. The technical approach is based on jointly selecting the best packet routing path and video encoding parameters (e.g., quantization step (QP) size).

Packet-delay-deadline-aware expected distortion. The design objective here is to choose the best transmission path and optimal video encoding parameter (i.e., QP) for each slice⁵ $i \leq I$, where $I$ is the total number of slices that compose the current video frame to be transmitted, so as to minimize the total expected distortion, under the constraint that the transmission delay $T_i$ of each slice $i$ is lower than a pre-defined frame decoding deadline $T^{\text{budget}}$, i.e.,

$$\underset{S_i,\,\xi_i}{\text{minimize}} \;\; \sum_{i=1}^{I} E[D_i] \quad \text{subject to} \;\; \max\{T_1, \ldots, T_I\} \leq T^{\text{budget}}, \qquad (2)$$

where $S_i$ and $\xi_i$ are the source coding parameters and the selected transmission path for packet $i$, and $E[D_i]$ is the resulting expected distortion. To evaluate $E[D_i]$, the authors elaborate a distortion model based on packet loss probabilities considering i) the packet drop probability due to violating the predefined decoding deadline and ii) the packet error probability caused by lossy wireless links. An M/G/1 queue model is adopted to estimate the packet delay.
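For a small number of slices, candidate paths, and QP values, problem (2) can be illustrated by exhaustive search, as in the sketch below; distortion() and delay() are placeholder callables standing in for the loss and M/G/1 delay models of [36], and the whole routine is an illustration rather than the authors' algorithm.

```python
import itertools

def select_paths_and_qps(slices, paths, qps, t_budget, distortion, delay):
    """Brute-force version of problem (2): choose a (QP, path) pair per
    slice that minimizes total expected distortion, subject to every
    slice meeting the frame decoding deadline t_budget."""
    best, best_d = None, float("inf")
    for choice in itertools.product(itertools.product(qps, paths),
                                    repeat=len(slices)):
        # Feasibility: the slowest slice must still make the deadline.
        if max(delay(s, qp, p) for s, (qp, p) in zip(slices, choice)) > t_budget:
            continue
        d = sum(distortion(s, qp, p) for s, (qp, p) in zip(slices, choice))
        if d < best_d:
            best, best_d = choice, d
    return best, best_d
```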

By explicitly introducing packet arrival deadline constraints in the optimization problem in (2), the proposed routing algorithm guarantees good perceived video quality while attempting to satisfy stringent delay requirements for real-time video services. Other routing approaches that consider packet delay constraints in routing optimization for video streaming services over multi-hop wireless networks include the Multi-path Multi-SPEED (MMSPEED) protocol [37] and the delay-bounded energy-constrained adaptive routing (DEAR) protocol [38].

C. Power Efficient Multimedia Routing

Energy conservation is another important consideration in

wireless networks, especially in networks (e.g., MANETs, tactical MANETs, and sensor networks) composed of battery-powered nodes whose batteries cannot easily be replaced. Here, the aforementioned multimedia surveillance is a typical application greatly in need of high energy efficiency [39], because of the significant energy consumption caused by acquiring and processing large volumes of video. For this reason, several energy-efficient route selection schemes have been proposed for streaming video traffic, including Energy-Efficient QoS Assurance Routing (EEQAR) [40], the Hierarchical Energy Aware Protocol for routing (HEAP) [41], and the Power Efficient Multimedia Routing (PEMuR) protocol [42]. In particular, PEMuR [42] is designed with the dual objective of reducing energy consumption and improving the perceived video quality in wireless video communications.
Energy efficient hierarchical routing protocol. Here, we discuss the energy-minimized routing scheme PEMuR proposed in [42], suitable for multimedia surveillance applications. In PEMuR, a hierarchical route selection scheme adopting the current residual energy of each sensor node as a metric is proposed, attempting to prolong the lifetime of the entire network. Specifically, the considered hierarchical model includes a base station and a set of homogeneous sensor nodes, randomly distributed within the area of interest. The topology is assumed to be static.

5A slice consists of a number of encoded macroblocks in a certain raster scan order within one video frame.

All nodes are grouped into clusters. One node in each cluster is elected as cluster head by the base station based on residual energy. Cluster heads are classified into two categories: a) upper level cluster heads and b) lower level cluster heads, depending on whether the nodes can directly transmit information to the base station or not. The routing path between the lower level cluster heads and the base station is selected based on

$$p_k = \max_{l \in A} RI(p_l), \qquad (3)$$

where $A$ is the set of all possible paths from one cluster head to the base station, and $RI(p_l)$ is the Routing Index (RI), defined as

$$RI(p_l) = \sum_{i=2}^{n-1} E_{r_i} - \sum_{i=1}^{n-1} E(p_l, c_i, c_{i+1}), \qquad (4)$$

where $E_{r_i}$ denotes the residual energy of the cluster head node $c_i$ and $E(p_l, c_i, c_{i+1})$ is the energy required to route a message between nodes $c_i$ and $c_{i+1}$.

By solving the optimization problem (3), the appropriate path from the cluster head to the base station is determined. In case the transmission rate exceeds the available bandwidth of the selected path, the combination of video packets whose loss results in minimum distortion is dropped, based on a distortion prediction model [43]. This approach is therefore particularly appealing for multimedia surveillance applications organized in a hierarchical structure.
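The following sketch illustrates the route choice of (3)-(4) under simplified assumptions; residual energies and per-hop costs are hypothetical inputs.

```python
# Sketch of the PEMuR route choice in Eqs. (3)-(4): for each candidate
# path (a sequence of cluster heads c_1..c_n), sum the residual energy
# of the intermediate heads and subtract the per-hop routing energy,
# then pick the path with the largest Routing Index.

def routing_index(residual, hop_energy):
    """RI of a path with nodes c_1..c_n; residual[i] is E_{r_i} and
    hop_energy[i] is the cost of hop c_i -> c_{i+1} (0-indexed here,
    matching i = 2..n-1 and i = 1..n-1 in Eq. (4))."""
    n = len(residual)
    gain = sum(residual[i] for i in range(1, n - 1))
    cost = sum(hop_energy[i] for i in range(n - 1))
    return gain - cost

# Hypothetical candidate paths: (residual energies, per-hop costs).
paths = {
    "p1": ([9.0, 7.5, 6.0, 8.0], [0.4, 0.5, 0.6]),
    "p2": ([9.0, 4.0, 8.0],      [0.3, 0.9]),
}
best = max(paths, key=lambda k: routing_index(*paths[k]))
print("PEMuR selects:", best)  # -> p1 (RI = 12.0 vs. 2.8)
```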

D. Future Work

Adaptive video-centric routing. As the amount of video content in wireless networks increases, there still does not exist an ideal routing protocol that can effectively improve perceived quality by jointly considering application-specific requirements such as timeliness, end-to-end quality, and energy consumption. To cater for the explosion of video-based content in wireless networks, a clean-slate architectural approach named Information-Centric Networking (ICN) has recently drawn significant attention [44]; it aims at providing a general infrastructure with in-network


caching so that content is distributed in a scalable, cost-efficient manner. However, how to select paths for requesting video content, which may or may not already be cached in the network, while satisfying diverse video-specific requirements is largely unexplored so far. Therefore, developing more efficient routing schemes for video transmission in information-centric networks may be another possible direction of future research.
Routing for CS-based video. So far, all routing protocols discussed above are designed for wireless video transmission adopting the traditional video encoding paradigms discussed in Section III-A and Section III-B. Recently, however, compressive-sampling-based video applications (e.g., CS-based multi-view video streaming [21]) have been emerging [22], which differ from conventional video encoding methods. We refer the reader to Section III-D for more details. As discussed in Section III-D, CS-based video encoding paradigms are more resilient to channel errors but achieve a lower compression rate compared to predictive coding approaches. In multi-hop wireless networks, however, the error resilience advantage is weakened. Hence, how to select paths that compensate for this weakened error resilience without further decreasing the compression ratio is unexplored and challenging.

VI. MAC LAYER SOLUTIONS

The Medium Access Control (MAC) layer aims at regulating channel access so that multiple wireless devices can share the same wireless link effectively and fairly. An efficient MAC protocol is essential for all of the video networking paradigms discussed in Section II. With the growing demand for multimedia applications, the MAC is expected to support various video applications with diverse Quality of Service (QoS) requirements. For example, real-time video is delay-sensitive, while large-volume video data access poses challenges to VoD and multimedia surveillance applications. Moreover, in MANET and sensor networks, battery-powered devices require energy-efficient access schemes. To address these challenges, a plethora of video-application-aware MAC protocols have been proposed [45][46][47][48][49]. MAC protocols are usually classified into two categories: contention-based and schedule-based. Below, we examine two MAC schemes: schedule-based priority-guaranteed real-time video transmission access [48] and contention-based massive video traffic access [49], which address delay-sensitive and energy-efficient high-volume transmission requirements, respectively.

A. Collision-free MAC with Priority Guaranteed for Real-time Multimedia Transmission

Conventional schedule-based (especially TDMA-based) MAC protocols are often used in wireless video transmission because they can guarantee collision-free and QoS-compliant transmissions [48], but with hard-to-control end-to-end delay. First we discuss a representative TDMA-like MAC scheme for wireless mesh backbone networks proposed in [48], which attempts to provide priority guarantees for real-time video applications while assuring the required quality level. We then examine two key techniques of the protocol: a) guaranteed priority access for real-time video traffic and b) improved per-flow fairness.
Guaranteed priority access for real-time video traffic. The core idea of the priority-guaranteed access policy for real-time video streaming is to use a novel distributed time slot structure that includes a control segment and a transmission segment. The control segment consists of mini-slots, further divided into real-time mini-slots with descending urgency level $U_i$, and non-real-time mini-slots, allocated to each node to contend for the corresponding slot if there are packets to send. For real-time priority access, the urgency level of a real-time packet $j$ is measured by the value $t_j / n_j$, with $t_j$ and $n_j$ being the remaining time and remaining hops, respectively. If $U_{i-1} \le t_j / n_j \le U_i$ and no other jamming signal is detected, the node owning packet $j$ wins the corresponding slot and later transmits that packet in the collision-free transmission slot. Therefore, the proposed scheme guarantees that nodes with real-time video packets have priority access over those with delay-tolerant packets. Moreover, a mini-slot reuse policy is proposed, allowing two devices more than two hops apart to use the same mini-slot. Based on this, the duration of the contention phase depends only on the number of devices in a two-hop neighborhood and not on the total number of nodes in the entire network, which overcomes the limited scalability of some schedule-based MAC protocols.
Improved per-flow fairness. Fairness among end users retrieving videos (per-flow) or intermediate nodes routing the videos (per-node) is another important factor in scheduling design. In [48], per-node fairness is achieved by using a mini-slot rotation scheme, on top of which per-flow fairness is implemented as follows: First, nodes exchange traffic information (e.g., the number of flows) with their one-hop and two-hop neighbors. Then, nodes redetermine the fraction of time to access the channel in proportion to the number of flows being transmitted. Although resources are allocated more fairly to flows, this introduces the extra control overhead required to exchange traffic information with neighbors. Therefore, there is a tradeoff between per-flow fairness and efficient resource allocation.
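A small sketch of the urgency computation is given below; the threshold values and the mapping of urgency bands to mini-slot indices are hypothetical simplifications of the rotating mini-slot structure in [48].

```python
# Sketch of the urgency-based mini-slot choice (simplification:
# thresholds are hypothetical and listed in ascending order, while the
# protocol enumerates mini-slots by descending urgency level U_i).

def mini_slot(t_j, n_j, thresholds):
    """Map packet j (remaining lifetime t_j, remaining hops n_j) to the
    index of the real-time mini-slot whose band contains t_j / n_j."""
    urgency = t_j / n_j                 # smaller value = more urgent packet
    for i, u_i in enumerate(thresholds):
        if urgency <= u_i:
            return i                    # contends in real-time mini-slot i
    return len(thresholds)              # non-real-time mini-slot segment

print(mini_slot(t_j=60.0, n_j=4, thresholds=[10.0, 20.0, 40.0]))  # -> 1
```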

To summarize, the idea of using rotating real-time mini-slots in a schedule-based MAC proposed in [48] achieves collision-free transmissions as well as application-specific requirements in terms of priority guarantees and fairness for real-time video streaming.

B. Massive Video Transmission Assured MAC Protocol

In the MAC layer of WMSNs, energy efficiency and large-scale transmission are required for delivering multimedia data, which can lead to high traffic volumes. Recently, some contention-based MAC policies for high-volume video data transmission have been proposed [49], [50]. Next, a representative scheme, the Massive Transmission Scheme (MTS) [49], is examined in detail.
MTS: contention-based massive MAC. MTS is a novel contention-based massive transmission scheme designed to support massive multimedia data delivery in WMSN systems. Specifically, the proposed scheme operates in two modes: a) Massive Transmission (MT) mode and b) Normal Operation (NO) mode. In MT mode, fast, continuous and massive video data transmission in a multi-hop environment is achieved with a reduced number of ACK frames and reduced contention time. Moreover, a Network Allocation Vector (NAV) is employed to reduce unnecessary listening energy consumption, avoid


collisions, and solve the hidden terminal problems caused by half-duplex operation.
Switch method between MT and NO. To concurrently support video streaming and common data transmission, a criterion is proposed for switching between MT and NO modes. For this purpose, a dynamic SDTL (Short Data Threshold Length) based switching criterion is proposed, given a value $\lambda_t$ at time $t$, expressed as

$$\lambda_t = \alpha \cdot (C_t - C_{t-1}) + \lambda_{t-1}, \qquad (5)$$

with $C_t$ and $C_{t-1}$ being the number of packets at time $t$ and $t-1$, respectively, and $\alpha$ an empirically predefined coefficient. Then, the value of SDTL is updated as $SDTL_t = SDTL_{t-1} + \lambda_t \cdot l_{MTU}$, where $l_{MTU}$ denotes the length of the maximum transmission unit (MTU). If the length of the remaining data in the queue is higher than $SDTL_t$, MT mode is started; otherwise the system stays in NO mode.
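The switching logic can be summarized in a few lines; $\alpha$, the MTU length and the packet counts below are hypothetical values, not those of [49].

```python
# Sketch of the MT/NO mode switch in MTS (Eq. (5)): the threshold SDTL
# tracks the packet-arrival trend via lambda_t, and MT mode starts when
# the queued backlog exceeds the threshold.

def update_and_switch(c_t, c_prev, lam_prev, sdtl_prev, backlog,
                      alpha=0.5, l_mtu=1500):
    lam = alpha * (c_t - c_prev) + lam_prev   # Eq. (5)
    sdtl = sdtl_prev + lam * l_mtu            # SDTL_t = SDTL_{t-1} + lam * l_MTU
    mode = "MT" if backlog > sdtl else "NO"
    return lam, sdtl, mode

lam, sdtl, mode = update_and_switch(c_t=40, c_prev=25, lam_prev=2.0,
                                    sdtl_prev=30000, backlog=60000)
print(mode, sdtl)  # -> MT 44250.0 (backlog exceeds the updated threshold)
```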

The idea of massively and successively transmitting video data by reserving the wireless link for some dedicated time can reduce the end-to-end delay. However, a switching criterion based only on the amount of data in the buffer is insufficient to assure a guaranteed level of perceived video quality, because time-varying channel conditions may cause drastic throughput degradation. Therefore, the MTS access policy could be enhanced by incorporating a combination of application-aware requirements (e.g., playback deadline) and instantaneous channel conditions into the switching threshold.

C. Future Work

Adaptive and application-centric MAC protocols. As discussed in Section II, various video applications may traverse heterogeneous wireless architectures with diverse QoS requirements. These heterogeneities must be dynamically handled by mobile terminals. Thus, the integration of the existing medium access schemes discussed above points to an adaptive and seamless medium access control layer [50] that can meet diverse QoS requirements and improve network utilization.

In view of several shortcomings of current medium access control protocols, including hidden-terminal-like problems caused by the half-duplex nature of current wireless devices, emerging full-duplex radios [51], [52] can be considered another promising basis for new MAC schemes able to satisfy the diverse QoS requirements of various kinds of video applications. However, self-interference cancellation remains the major challenge in designing full-duplex MAC schemes, and how to integrate full-duplex operation with video transmission is substantially unexplored.

VII. PHYSICAL LAYER SOLUTIONS

Wireless links provide only limited and time-varying data transmission rates, and may hence become the bottleneck in mobile video applications. This is conventionally alleviated by adopting techniques like joint source and channel resource allocation [53], [54], [55] or streaming video content in a scalable fashion [56], [57]. However, given the ever-increasing video traffic and low-energy-consumption requirements, wireless video streaming is far from being a mature technology. This has stimulated additional research efforts to redesign the video encoder/decoder to adapt more smoothly to time-varying channels with even higher received video quality, e.g., by leveraging channel state information (CSI) at a finer granularity or by integrating video streaming tightly with newly emerging physical layer technologies. While a comprehensive discussion of physical layer techniques for video streaming is clearly beyond the scope of this paper, in what follows we rely on three representative technologies to discuss newly emerging solutions and challenges: i) soft coding and decoding to exploit error resilience in wireless video streaming [58], ii) cooperative video streaming to enhance transmission reliability or rate [59], and iii) compressive and cooperative video streaming to achieve both error resilience and low-power transmission [60]. While the first approach is mostly suitable for cellular networks and WLANs, the latter two find broader applicability, including in wireless ad hoc and sensor networks.

A. Soft Video Coding and Decoding

In traditional video encoders, the quantized DCT or wavelet

coefficients are in general encoded using variable length coding (VLC) [61]. In the case of bit flips caused by errors in wireless transmission, this may cause synchronization problems in decoding VLC-coded packets (the “all-or-none” problem). Moreover, the wireless channel is usually quite different for each video terminal, and the decoding capabilities of each terminal can also differ. As a result, without perfect adaptation of transmission schemes (which is not easy to achieve because of the unpredictability of time-varying wireless channel quality), wireless video streaming may suffer from severe cliff effects and be rather choppy. In [58], [62], [63], [64], the authors address this challenge by proposing soft video coding and decoding, which, roughly speaking, leverages soft physical-layer information like bit-level probability beliefs (instead of only hard binary channel-decoding results), or uses linear analog coding (instead of variable-length digital coding) for video compression and channel error protection. Next, we discuss FlexCast [58] as an example. The core design objective of FlexCast is to gracefully adapt the reconstructed video quality to the time-varying channel quality, and hence avoid cliff effects.
FlexCast: constant-length representation. The core idea of FlexCast is to use rateless coding (whose benefits are discussed later) to protect constant-length-encoded coefficients (rather than VLC), and then decode the received, possibly noisy, coefficients by integrating soft information provided by physical-layer channel decoding.

The motivation for using constant-length coding (e.g., representing each quantized coefficient using, say, 8 bits) is that this completely avoids the problem of losing synchronization when decoding VLC-encoded bit sequences. To achieve high bandwidth efficiency, FlexCast assigns the available bit rate budget among the resulting raw bits in an unequal fashion based on rateless channel coding. For this purpose, the raw bits are grouped into multiple clusters with different importance levels according to their contribution to the overall reconstruction error if they are not successfully decoded. In [58], the authors use the bit positions in each 8-bit coefficient to indicate their importance, e.g., the MSBs are more important than the LSBs since flips at MSBs cause larger deviations from the original coefficient value, and hence possibly larger video reconstruction distortion.
Rateless coding and soft decoding. Given the allocated bit rate budget, error protection bits are generated for each group based on rateless coding and then appended to the raw bits.


The advantages of rateless coding include: i) it can produce coded bits with arbitrary, and hence very flexible, channel coding rates; and ii) by decoding the rateless coded bits, the receiver is able to compute soft information for each raw bit, i.e., the probability that a bit is “1” or “0” (rather than a hard decision). Denoting by $p_1(i)$ the resulting probability that the $i$th bit in the representation of a DCT component is “1”, FlexCast then estimates the DCT component as

$$y = \sum_{i=1}^{L-1} 2^i \cdot p_1(i), \qquad (6)$$

where $L$ is the quantization bit-depth. Since $p_1(i)$ contains soft information about the $i$th bit, the reconstruction error of the DCT component $y$ gracefully adapts to variations in channel coding rate and channel quality, with stronger error protection and better channel quality leading to lower distortion.
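A minimal sketch of the soft estimate in (6) follows; the bit probabilities below are hypothetical decoder outputs, and the code starts the sum at the LSB (bit index 0) as a simplification of the paper's indexing convention.

```python
# Sketch of the soft coefficient estimate in Eq. (6): given per-bit
# probabilities p1(i) produced by the rateless decoder, the DCT
# coefficient is reconstructed as the probability-weighted sum of bit
# weights rather than from hard bit decisions.

def soft_estimate(p1):
    """p1[i] = probability that bit i of the L-bit representation is 1."""
    return sum((2 ** i) * p for i, p in enumerate(p1))

# Hypothetical 8-bit coefficient: LSBs uncertain, MSBs decoded reliably.
p1 = [0.5, 0.4, 0.6, 0.9, 0.1, 0.05, 0.98, 0.02]
print(soft_estimate(p1))
```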

The idea of exploiting “soft information” to enable noise (error) resilience has also been leveraged in SoftCast [62], its MIMO-OFDM extension ParCast [63], and also in D-Cast, designed for distributed video coding [64]. There, the available transmission power budget is allocated among DCT coefficients proportionally to their magnitude levels, e.g., through magnitude scaling and DCT-chunk transformation in [62]. The authors in [58], [62], [63], [64] show that soft video streaming enables the received video quality to scale gracefully with wireless channel quality, and in practice to be higher than that of non-soft schemes; this graceful reception characteristic can also be achieved using the CVS encoder introduced in Section III-D. Hence, soft video streaming may provide a promising solution for mobility-rich video applications where the channel quality may change very fast, like VoD on smart phones and in-vehicle VoD, and for video broadcasting and multicasting applications where video terminals may experience different wireless channels and hence have different decoding capabilities.

B. Cooperative Video Streaming

Instead of redesigning video encoders, a second possible approach to address the choppiness of wireless video streaming is to enhance the transmission reliability, so that good or at least acceptable video quality can be received for most levels of channel quality. This approach is particularly appealing for applications like video surveillance with wireless sensor networks, where it is not easy to run intelligent but potentially very complex video encoders on battery-powered sensors due to energy and computational capability limitations.

At the physical layer, emerging technologies with potential for improved reliability include cooperative relaying [65] and interference alignment [66]. While a detailed treatment of these approaches is out of the scope of this paper, we take cooperative relaying as an example to show how these technologies can be used to enhance wireless video quality [59], [67].
Cooperative relaying. Cooperative relaying techniques attempt to leverage the spatial diversity of the wireless channel in a distributed fashion. While spatial diversity is traditionally obtained through multiple transceiver antennas, this may not be practical on sensor nodes of limited size. Instead, cooperative relaying relies on the antennas of neighboring devices to form a virtual multiple-input-single-output (VMISO) link and hence achieve spatial diversity [65].

A cooperative transmission is typically completed in two consecutive time slots. In the first time slot, the source node broadcasts information to both the destination and potential relay nodes; in the second, the relay node forwards the received information to the destination. The resulting cooperative link capacity $C_{\text{cop}}$ can be expressed as

$$C_{\text{cop}} = \frac{1}{2} C_{\text{two slot}}, \qquad (7)$$

with $C_{\text{two slot}}$ being the capacity achievable by combining at the destination the signals received in the two time slots.
Capacity-matched video streaming. In [59], we studied a cooperative video streaming network consisting of a set of video sensor nodes transmitting captured video sequences to their intended destinations, either through a direct link or through cooperative relaying. In the latter case, the source node optimally selects a relay from a set of potential relay nodes for cooperative transmission, resulting in the overall capacity expressed in (7). Because of the factor 1/2 in (7), the capacity $C_{\text{cop}}$ can be higher or lower than that obtained using direct transmission only. Hence, at the physical layer, it is important to decide for each session i) whether to transmit using a cooperative relay or the direct link only, and ii) which relay should be selected in the former case.

At the same time, at the application layer, the video encoding rate is jointly controlled to match the resulting physical-layer link capacity. While a low video rate causes high encoding distortion, too high a video rate may make the network more congested, causing a high packet drop rate as packets exceed the video playout deadline. In [59], we formulated the problem of optimal joint relay selection and rate control in a multi-user wireless network as a mixed-integer nonlinear, nonconvex combinatorial problem (MINLP), and solved it through newly designed distributed and centralized algorithms. We found that a noticeable gain in sum PSNR can be achieved through cooperative relaying, in practice with even lower average transmission power compared to using direct transmissions only. The effectiveness of cooperative relaying is also demonstrated in [67] for uplink wireless video streaming in cellular networks.
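The physical-layer side of this decision reduces to a simple comparison, sketched below with hypothetical link capacities; the actual formulation in [59] solves relay selection jointly with rate control across all sessions.

```python
# Sketch of the relay-vs-direct decision of Section VII-B: compare the
# direct-link capacity with the cooperative capacity of Eq. (7) for
# each candidate relay, and pick whichever alternative is largest.
# Capacities are hypothetical inputs (e.g., Shannon estimates per link).

def select_mode(c_direct, c_two_slot_per_relay):
    best = ("direct", c_direct)
    for relay, c_two_slot in c_two_slot_per_relay.items():
        c_cop = 0.5 * c_two_slot            # Eq. (7): two slots per packet
        if c_cop > best[1]:
            best = (relay, c_cop)
    return best

print(select_mode(2.0, {"relay_a": 5.2, "relay_b": 3.6}))  # ('relay_a', 2.6)
```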

C. Compressive and Cooperative Video Streaming (CCVS)

Finally, we discuss CCVS, an approach exploiting both error resilience and enhanced transmission for wireless video streaming [60]. Unlike the redesigned and relatively complex video encoders discussed in Section VII-A, in [60] we showed that error resilience can also be achieved while keeping video encoders very simple, hence facilitating video applications on resource-constrained devices with limited computational capabilities, e.g., video surveillance using the wireless sensor networks discussed in Section II. The core idea of CCVS is to leverage, at the application layer, the inherent error resilience properties of compressive sampling discussed in Section III-D, and, at the physical layer, to develop cooperative wireless networking based on the unique properties of video representation with compressed sensing.
Error resilient compressive sensing (ERCS). ERCS was first formalized in [68], which showed that a bit-vector can be exactly


recovered from a linearly encoded and then sparsely-error-corrupted version of the vector by solving an $\ell_1$-minimization problem. While ERCS may work well in principle, the additional communication overhead introduced by the linear encoding can be quite high for wireless video streaming. This is because i) quantizing real-valued CS samples may add additional quantization noise to the signal, and ii) removing all of the bit errors in CS samples requires getting every reconstructed sample exactly correct, which may require excessive overhead even when only unimportant bits (e.g., LSBs) are flipped.

In [60], we showed that both of these problems can be avoided by i) using real-valued (unquantized) CS samples to create the error correction samples, and ii) using the mean square error (MSE) of the reconstructed CS samples as the performance metric. Compared to bit error rate (BER), MSE naturally weights small-scale (e.g., LSB) errors less than more important (e.g., MSB) errors, and hence needs fewer parity bits.
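The intuition is easy to verify numerically: in the toy example below (hypothetical 8-bit samples), an MSB flip inflates the MSE by four orders of magnitude more than an LSB flip, which is exactly the distinction a BER metric would miss.

```python
# Toy illustration of why MSE is a natural metric for ERCS: a flip in
# an MSB of a reconstructed sample contributes far more squared error
# than an LSB flip, so low-order residual errors can be tolerated
# rather than corrected. All values are hypothetical 8-bit samples.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

clean = [120, 64, 200, 33]
lsb_flip = [121, 64, 200, 33]   # bit 0 flipped in one sample
msb_flip = [248, 64, 200, 33]   # bit 7 flipped in the same sample
print(mse(clean, lsb_flip), mse(clean, msb_flip))  # 0.25 vs 4096.0
```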

We showed that, with ERCS and cooperative networking, the reconstructed video quality in terms of SSIM can be considerably improved. This has the important consequence of potentially enabling systems that can transmit video at SNR values that are a fraction of those of traditional cooperative relaying systems without sacrificing video quality, hence enabling extra low-power video sensors.

D. Future Work

Softer video streaming. While different schemes have been studied to exploit error resilience and hence achieve soft video streaming, several challenges remain. i) Since FlexCast and ParCast are limited to unicast, and D-Cast targets only Gaussian channels, soft video streaming schemes still need to be designed for multicast, broadcast and P2P video streaming, and for fading channels. ii) SoftCast, ParCast and CCVS exploit only partial temporal correlations in video encoding, possibly wasting bandwidth; there is still significant room for creating encoders with higher bandwidth efficiency. iii) While all the discussed works focus on 2D video, designing soft streaming systems to support emerging 3D/Multi-view/Stereo video applications [69] is another potential research direction.

VIII. CROSS LAYER SOLUTIONS

We conclude our discussion by focusing on cross-layer design (CLD) work that has concentrated on the joint optimization of the application layer and multiple lower layers. We discuss several representative examples of cross-layer video streaming with different global design objectives: i) maximize the quality of experience (QoE) of users (application + transport + MAC), considering cellular networks and WLANs [70]; ii) minimize system energy consumption (application + MAC + physical), considering wireless ad hoc or sensor networks [9]; and iii) meet stringent delay constraints (application + network + physical), considering hybrid cellular/ad hoc networks (or cellular networks with device-to-device (D2D) communications enabled) [71]. In addition to discussing these different design objectives, we also highlight the potential advantages of incorporating CLD in newly emerged video applications such as the cloud-assisted wireless video gaming discussed in the last example. Readers are referred to [72], [73] and references therein for comprehensive surveys of this area.

A. DASH with Channel-content-aware Rate Control

As discussed in Section IV, TCP connections have been

used in several commercial streaming systems such as YouTube and Netflix. In wireless video streaming with remotely located video servers, server-to-user TCP connections may behave quite differently in their wired and wireless components in terms of average throughput, jitter, and delay. This may cause severe network congestion (when source rates are too high) or bandwidth waste (if the selected video rate is too low). To address this challenge, the authors of [70] proposed WiDASH, a DASH (Dynamic Adaptive Streaming over HTTP [74]) framework that jointly coordinates the wired and wireless TCP connections of multiple concurrent users. Different from the TCP performance enhancement proxy (PEP) discussed in Section IV-B, which focuses primarily on the adaptability of a single TCP proxy to wireless channel quality, here we discuss WiDASH and its application to the optimization of multiuser video streaming systems with concurrent TCP links.
WiDASH via TCP splitting. WiDASH splits a long TCP connection into multiple cascaded short TCP connections, so that several problems related to the wireless segment, e.g., higher transmission error rates and unpredictable time-varying channel quality, can be effectively isolated from the wired segment. Taking 3G UMTS as an example, WiDASH can be located at the Gateway GPRS Support Node (GGSN), which splits the server-to-user TCP link into two shorter ones, one between the server and the WiDASH proxy, and the other between WiDASH and the mobile users. Consequently, it is easier for the proxy to collect in real time the link quality information for both shorter links, and to use the collected information to optimally coordinate the wired and wireless transmissions.
WiDASH-based optimization. In 3G UMTS cellular networks, concurrent users are scheduled by differentiating two priority levels, say $\{p_{\text{low}}, p_{\text{high}}\}$. To assure QoE fairness between high-rate and low-rate video users, and between video traffic and background traffic (e.g., file downloading), it is desirable to dynamically assign a different priority to each video user and to each packet of a single user. Intuitively, this can help prevent a video user from requesting a high streaming rate while at the same time having its packets assigned high priority.

For this purpose, WiDASH adopts a linear-mapping-based priority-assignment framework, with mapping function

$$w(r) = \begin{cases} 0, & r \le r_{\text{low}} \\ \dfrac{r - r_{\text{low}}}{r_{\text{high}} - r_{\text{low}}}, & r_{\text{low}} < r \le r_{\text{high}} \\ 1, & r > r_{\text{high}} \end{cases} \qquad (8)$$

where $r$ is the video rate associated with the packet, and $r_{\text{low}}$ and $r_{\text{high}}$ are two thresholds representing low and high rates, respectively. WiDASH assigns high priority $p_{\text{high}}$ to a packet with probability $1 - w(r)$, and low priority $p_{\text{low}}$ with probability $w(r)$; the random priority mapping is used to avoid TCP synchronization among different DASH flows.
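A sketch of this randomized mapping is given below; the rate thresholds are hypothetical placeholders.

```python
# Sketch of the randomized priority assignment built on Eq. (8): a
# packet of a high-rate video flow is mapped to the low-priority class
# with probability w(r).

import random

def w(r, r_low, r_high):
    """Eq. (8): linear mapping of video rate to low-priority probability."""
    if r <= r_low:
        return 0.0
    if r > r_high:
        return 1.0
    return (r - r_low) / (r_high - r_low)

def assign_priority(r, r_low=500.0, r_high=3000.0):
    return "p_low" if random.random() < w(r, r_low, r_high) else "p_high"

print([assign_priority(1750.0) for _ in range(5)])  # mixed priorities at mid rate
```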

In this framework, WiDASH then maximizes the average QoE of the users by i) coordinating the concurrent wired TCP links through jointly deciding the video rate for each user, and ii) scheduling transmissions over the wireless TCP links based on real-time observation of the wireless channel quality.


Fig. 6: Energy-aware video encoding and transmission [9]. (Block diagram: a video sensor splits its energy budget between the video encoder and the channel encoder/transmitter sending over a lossy wireless channel to a sink; plot: SSIM versus energy budget (mJ) for CVS, H.264 and MJPEG.)

It is shown that WiDASH supports a higher video rate and smoother rate changes for each individual user compared with the competing Akamai scheme [70].

B. Rate-Energy-Distortion-Predictable Video Streaming

Different from the discussion above, where the objective of WiDASH is to maximize system QoE in cellular video streaming systems, here we concentrate on energy-efficient video streaming based on CLD, considering differential compressive video sensing (D-CVS) [9]. D-CVS integrates inter-frame prediction (explained below) into the compressive video sensing (CVS) framework introduced in Section III-D; D-CVS has been shown to have the potential to reduce the energy and computational complexity required for video capture, processing and streaming [9], and hence to boost applications like video surveillance in the wireless multimedia sensor networks discussed in Section II and ocean exploration in the envisioned Internet Underwater [75].

We discuss the rate-energy-distortion behavior of D-CVS, considering the interactions among application-layer video encoding, MAC-layer error handling, and physical-layer SNR adaptation. An illustrative scenario is given in Fig. 6, where the energy budget available at the source node is split between the energy needed for video encoding and that needed for transmission.
Differential compressive video sensing (D-CVS). While individual-frame-based CS encoding takes advantage of the spatial correlation within each video frame, in [9] we proposed D-CVS to also exploit inter-frame correlation while avoiding complex motion estimation operations. Consider a group of pictures (GOP) patterned as IPP···P. D-CVS encodes the I-frame in traditional intra fashion, while encoding the $i$th P-frame $y_i$ by encoding the difference vector $dv_i$ between the frame and a selected reference frame $y_{\text{ref}}$, computed as $dv_i = y_i - y_{\text{ref}}$. The advantage of doing so is that (as discussed in Section III-D), in the case of low or moderate motion, consecutive frames do not differ much from each other, and the resulting difference vector may be much sparser than the frames themselves. The data rate that needs to be transmitted over the lossy channel to achieve a given reconstructed video quality can then be considerably reduced.
Error-level-adaptive packet dropping. As discussed in Section I, keeping bad samples (i.e., samples with bit errors) may only slightly degrade the reconstructed video quality when the bit error rate is low (e.g., lower than $10^{-3}$ in Fig. 1). This implies that retransmitting a sample packet is necessary only if the BER is high. Then, together with the effects of D-CVS discussed above, the video quality can be empirically modelled as a low-pass-filter-like function of the encoding rate $r_v$ and transmission rate $r_{ch}$,

$$U(r_{ch}, r_v) = \frac{D_0 - \dfrac{\theta}{r_v - R_0}}{\sqrt{1 + \tau^2 \left(BER(r_{ch}, r_v)\right)^2}}, \qquad (9)$$

where $D_0$, $\theta$ and $R_0$ are video-dependent constants that can be determined through linear least squares estimation, and $\tau$ is an encoder-dependent constant that indicates the quality degradation level with respect to $BER(r_{ch}, r_v)$. Then, given an energy budget, the optimal rate allocation $r_v$ and $r_{ch}$, and hence the optimal energy allocation, can be obtained by solving a nonlinear optimization problem; here, the nonlinearity is due to the complex expression of (9). As shown in Fig. 6, D-CVS outperforms the competing schemes H.264 and MJPEG over a wide range of energy budgets in terms of structural similarity (SSIM).
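The model in (9) lends itself to a simple grid search, sketched below; the constants and the $BER(r_{ch}, r_v)$ relation are hypothetical placeholders, whereas in [9] they are fitted per video and per encoder.

```python
# Sketch of the empirical quality model in Eq. (9) and a rate search
# built on it. All constants and the BER model are hypothetical.

import math

def utility(r_ch, r_v, ber, d0=1.0, theta=20.0, r0=50.0, tau=800.0):
    """Eq. (9): quality as a low-pass-filter-like function of BER."""
    return (d0 - theta / (r_v - r0)) / math.sqrt(1.0 + (tau * ber) ** 2)

def ber_model(r_ch, r_v):
    # Placeholder: the error rate jumps once the source rate approaches
    # the channel rate (a stand-in for the physical-layer SNR model).
    return 1e-6 if r_v < 0.8 * r_ch else 1e-3

candidates = [(1000.0, r_v) for r_v in (200.0, 600.0, 900.0)]
best = max(candidates, key=lambda c: utility(*c, ber=ber_model(*c)))
print("best (r_ch, r_v):", best)  # -> (1000.0, 600.0)
```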

C. Cloud-assisted Mobile Gaming

In addition to low energy consumption and high reconstruction quality as discussed above, some applications like mobile video gaming also need to meet very stringent delay requirements. However, both the processing capability of mobile devices (e.g., smart phones, glasses) and the bandwidth of wireless links are limited, which may result in unacceptable processing and transmission delay.

Cloud computing technology has emerged with the potential to enable high-quality and energy-efficient wireless video streaming while still meeting stringent delay requirements [76], and hence to enable a wide set of multimedia-rich and mobility-rich applications, like mobile video gaming and 3D/Stereo/Multiview video streaming. Next, we discuss cloud-assisted cross-layer optimization, considering mobile gaming applications as an example [71].
Rate-minimized video encoding. In [71], the authors considered a gaming network consisting of a game cloud and a set $\mathcal{N}$ of mobile players. Different from traditional game servers, which respond to player commands by rendering and streaming their video content separately, the game cloud uses an additional video encoder server to exploit inter-player correlation in favor of a higher compression ratio.

For this purpose, in [71] the authors first group game players into different groups, e.g., players closely located in the game can be grouped together, since they are seeing similar game scenes.6 Then, the video sequences in each group can be encoded in the same way as multi-view 3D videos [69], which capture the same scene of interest from different viewpoints. Specifically, frames are encoded in three modes, $\{I, P_{\text{intra}}, P_{\text{inter}}\}$, to exploit intra-frame spatial correlation, inter-frame temporal correlation, and inter-view correlation, respectively. Experiments on encoding natural videos showed that exploiting the inter-view correlation may reduce the encoder output rate by up to 50%. By searching for the optimal player grouping scheme and frame encoding modes using the powerful video cloud, the authors of [71] showed that this rate reduction can reach 70% for the considered video game with 10 players.
Delay reduction through multi-hop avoidance. Since the video sequences are encoded in a correlated fashion, players

6In [71], the authors concentrated on third-person games, where players watch the whole game scene from a bird's-eye view. Examples of such games are Diablo and Command & Conquer.


still need, e.g., through WiFi- or Bluetooth-based ad hoc links, to share their reconstructed video frames with other players in the same group if their frames were selected as references. This implies that a player cannot decode his/her $P_{\text{inter}}$-encoded frames before receiving the reference frames from other players, which causes one hop of processing and transmission delay. If cross-reference encoding is used in the cloud, i.e., an inter-encoded frame of one player is further used as a reference for other players, the delay accumulates along the multi-hop video content sharing path. In [71], the authors avoid this by simply disabling cross-reference encoding. This, however, comes at the cost of roughly doubling the video encoding output rate compared with multi-hop sharing. Therefore, it is still necessary to explore the tradeoff between a low video encoding rate at the cloud server and the possibly high multi-hop sharing delay among players, by jointly considering game scene characteristics, location-aware multi-hop sharing, and wireless channel quality in a cross-layer manner.

D. Future Work

Cross-layer design for energy-efficient 3D video streaming. 3D/Multi-view/Stereo video applications have recently emerged as services with the potential to offer a higher Quality of Experience (QoE) than conventional 2D video [69]. However, the computational complexity of encoding 3D multi-view video and transmitting the encoded data may result in a high energy burden for mobile devices, which ultimately leads to short operational lifetimes. It is therefore essential to design novel transmission schemes, e.g., in a cross-layer manner, and clean-slate network architectures with higher energy efficiency, for instance by integrating compressive sampling technology [22].
Cloud-assisted mobile video streaming. Another potential direction is to integrate mobile cloud computing (MCC) technologies with wireless video streaming [77]. Mobile devices could then continuously offload their computationally intensive tasks to a remote cloud server, potentially extending battery lifetime. At the same time, optimizing the streaming strategies on a powerful cloud server may enable real-time, network-friendly and scalable video streaming. Through cloud-enabled mobile networks, we envision that high quality mobile video streaming will be supported without considerably increasing energy consumption.

IX. CONCLUSIONS

In this work, we discussed state-of-the-art video encoders and networking protocols for wireless video streaming. We first examined emerging video encoders, and discussed how compressed sensing and distributed systems can help enhance video systems in which the video is created by resource-constrained wireless devices. We then examined transport protocols, with particular attention to techniques used by large-scale streaming services such as YouTube. Next, we discussed application-centric routing protocols for video applications, including on-demand video streaming, real-time interactive video services and video surveillance in WSNs. At the MAC layer, channel access policies for high-volume video data transmission were examined. At the physical layer, we discussed error-resilient video streaming, highlighting soft video encoding/decoding and cooperative streaming. Finally, we examined cross-layer solutions, paying attention to newly emerging service architectures such as DASH-based and cloud-assisted wireless video streaming.

REFERENCES

[1] I. U. SANDVINE, “Global internet phenomena report: 2H 2013,” 2013.
[2] Netflix. [Online]. Available: http://www.netflix.com
[3] YouTube. [Online]. Available: http://www.youtube.com
[4] C. Huang, J. Li, and K. W. Ross, “Can internet video-on-demand be profitable?” in Proc. of ACM SIGCOMM. ACM, 2007, pp. 133–144.
[5] “Advanced Video Coding for Generic Audiovisual Services,” ITU-T Recommendation H.264, 2005.
[6] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, and T. Wiegand, “High efficiency video coding (HEVC) text specification draft 6,” Document JCTVC-H1003, 8th JCT-VC meeting, San Jose, California, 2012.
[7] X. Cheng, J. Liu, and C. Dale, “Understanding the characteristics of internet short video sharing: A YouTube-based measurement study,” IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1184–1194, 2013.
[8] V. Adhikari, Y. Guo, F. Hao, M. Varvello, V. Hilt, M. Steiner, and Z.-L. Zhang, “Unreeling Netflix: Understanding and improving multi-CDN movie delivery,” in Proc. of IEEE Conf. on Computer Communications (INFOCOM), 2012, pp. 1620–1628.
[9] S. Pudlewski and T. Melodia, “Compressive Video Streaming: Design and Rate-Energy-Distortion Analysis,” IEEE Trans. on Multimedia, In Press 2013.
[10] ——, “On the Performance of Compressive Video Streaming for Wireless Multimedia Sensor Networks,” in Proc. of IEEE Int. Conf. on Commun. (ICC), Cape Town, South Africa, May 2010.
[11] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.
[12] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless Sensor Networks: A Survey,” Computer Networks (Elsevier), vol. 38, no. 4, pp. 393–422, March 2002.
[13] “IEEE Std 802.11b-1999/Cor 1-2001,” 2001.
[14] “Video Coding for Low Bit Rate Communication,” ITU-T Recommendation H.263.
[15] “Digital Compression and Coding of Continuous-Tone Still Images - Requirements and Guidelines,” ITU-T Recommendation T.81, 1992.
[16] “JPEG2000 Requirements and Profiles,” ISO/IEC JTC1/SC29/WG1 N1271, March 1999.
[17] “Generic Coding of Moving Pictures and Associated Audio Information: Video,” ISO/IEC 13818-2, ITU-T Rec. H.262, 1995.
[18] D. Slepian and J. Wolf, “Noiseless Coding of Correlated Information Sources,” IEEE Trans. on Information Theory, vol. 19, no. 4, pp. 471–480, July 1973.

[19] A. Wyner and J. Ziv, “The Rate-distortion Function for Source Coding with Side Information at the Decoder,” IEEE Trans. on Information Theory, vol. 22, no. 1, pp. 1–10, January 1976.
[20] R. Puri and K. Ramchandran, “PRISM: A New Robust Video Coding Architecture Based on Distributed Compression Principles,” in Proc. of the Allerton Conf. on Communication, Control, and Computing, Allerton, IL, USA, October 2002.
[21] N. Cen, Z. Guan, and T. Melodia, “Joint Decoding of Independently Encoded Compressive Multi-view Video Streams,” in Proc. of Picture Coding Symp. (PCS), San Jose, CA, Dec. 2013.
[22] S. Pudlewski and T. Melodia, “A Tutorial on Encoding and Wireless Transmission of Compressively Sampled Videos,” IEEE Commun. Surveys and Tutorials, vol. 15, no. 2, pp. 754–767, Second Quarter 2013.
[23] D. Wu, Y. T. Hou, W. Zhu, Y.-Q. Zhang, and J. M. Peha, “Streaming video over the internet: approaches and directions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 282–300, 2001.
[24] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, “Network characteristics of video streaming traffic,” in Proc. of the Conference on Emerging Networking EXperiments and Technologies (CoNEXT), 2011, pp. 25:1–25:12.
[25] S. Pudlewski and T. Melodia, “A Distortion-minimizing Rate Controller for Wireless Multimedia Sensor Networks,” Computer Communications (Elsevier), vol. 33, no. 12, pp. 1380–1390, July 2010.
[26] S. Pudlewski, T. Melodia, and A. Prasanna, “Compressed-Sensing-Enabled Video Streaming for Wireless Multimedia Sensor Networks,” IEEE Transactions on Mobile Computing, vol. 11, no. 6, pp. 1060–1072, June 2012.
[27] S. Alcock and R. Nelson, “Application flow control in YouTube video streams,” SIGCOMM Comput. Commun. Rev., vol. 41, no. 2, pp. 24–30, Apr. 2011.
[28] H. Balakrishnan, S. Seshan, and R. Katz, “Improving reliable transport and handoff performance in cellular wireless networks,” ACM Wireless Networks Journal (WINET), vol. 1, no. 4, pp. 469–481, December 1995.
[29] J. K. Sundararajan, D. Shah, M. Medard, M. Mitzenmacher, and J. Barros, “Network coding meets TCP,” in Proc. of IEEE Conf. on Computer Communications (INFOCOM), April 2009, pp. 280–288.


[30] H. Hisamatsu, G. Hasegawa, and M. Murata, “Non bandwidth-intrusive video streaming over TCP,” in Information Technology: New Generations (ITNG), 2011 Eighth International Conference on, 2011, pp. 78–83.
[31] S. Mascolo, C. Casetti, M. Gerla, M. Y. Sanadidi, and R. Wang, “TCP Westwood: Bandwidth estimation for enhanced transport over wireless links,” in Proc. of ACM Intl. Conf. on Mobile Computing and Networking (MobiCom), Rome, Italy, July 2001.
[32] W. Pu, Z. Zou, and C. W. Chen, “New TCP video streaming proxy design for last-hop wireless networks,” in Image Processing (ICIP), 2011 18th IEEE International Conference on, 2011, pp. 2225–2228.
[33] X. Tong, Y. Andreopoulos, and M. van der Schaar, “Distortion-Driven Video Streaming over Multihop Wireless Networks with Path Diversity,” IEEE Trans. on Mobile Computing, vol. 6, no. 12, pp. 1343–1356, Oct. 2007.
[34] B. Rong, Y. Qian, K. Lu, R. Hu, and M. Kadoch, “Multipath Routing over Wireless Mesh Networks for Multiple Description Video Transmission,” IEEE J. on Selected Areas in Commun., vol. 28, no. 3, pp. 321–331, March 2010.
[35] G. Papageorgiou, S. Singh, S. Krishnamurthy, R. Govindan, and T. Porta, “Distortion-Resilient Routing for Video Flows in Wireless Multi-hop Networks,” in Proc. of IEEE Int. Conf. on Network Protocols (ICNP), Austin, TX, Nov. 2012.
[36] D. Wu, S. Ci, H. Wang, and A. Katsaggelos, “Application-Centric Routing for Video Streaming Over MultiHop Wireless Networks,” IEEE Trans. on Circuits and Syst. for Video Technology, vol. 20, no. 12, pp. 1721–1734, Dec. 2010.
[37] E. Felemban, C.-G. Lee, and E. Ekici, “MMSPEED: Multipath Multi-SPEED Protocol for QoS Guarantee of Reliability and Timeliness in Wireless Sensor Networks,” IEEE Trans. on Mobile Computing, vol. 5, no. 6, pp. 738–754, June 2006.
[38] S. Bai, W. Zhang, G. Xue, J. Tang, and C. Wang, “DEAR: Delay-bounded energy-constrained adaptive routing in wireless sensor networks,” in Proc. of IEEE Conference on Computer Communications (INFOCOM), Orlando, FL, March 2012.
[39] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, “A Survey on Wireless Multimedia Sensor Networks,” Computer Networks, vol. 51, no. 4, pp. 921–960, March 2007.
[40] K. Lin, J. Rodrigues, H. Ge, N. Xiong, and X. Liang, “Energy Efficiency QoS Assurance Routing in Wireless Multimedia Sensor Networks,” IEEE Syst. J., vol. 5, no. 4, pp. 495–505, Dec. 2011.
[41] M. Moazeni and A. Vahdatpour, “HEAP: a Hierarchical Energy Aware Protocol for Routing and Aggregation in Sensor Networks,” in Proc. of Int. Conf. on Wireless Internet (WICON), Austin, TX, Oct. 2007.
[42] D. Kandris, M. Tsagkaropoulos, I. Politis, A. Tzes, and S. Kotsopoulos, “Energy Efficient and Perceived QoS Aware Video Routing over Wireless Multimedia Sensor Networks,” Ad Hoc Networks, vol. 9, no. 4, pp. 591–607, June 2011.
[43] I. Politis, M. Tsagkaropoulos, T. Pliakas, and T. Dagiuklas, “Distortion optimized packet scheduling and prioritization of multiple video streams over 802.11e networks,” Advances in Multimedia, vol. 2007, pp. 1–11, Aug. 2007.
[44] L. Saino, I. Psaras, and G. Pavlou, “Hash-routing Schemes for Information Centric Networking,” in Proc. of ACM SIGCOMM Workshop on Information-centric Networking (ICN), Hong Kong, China, Aug. 2013.
[45] W. Ye, J. Heidemann, and D. Estrin, “An Energy-efficient MAC Protocol for Wireless Sensor Networks,” in Proc. of IEEE Conf. on Comput. Commun. (INFOCOM), New York, NY, June 2002.
[46] V. Rajendran, K. Obraczka, and J. J. Garcia-Luna-Aceves, “Energy-Efficient, Collision-Free Medium Access Control for Wireless Sensor Networks,” Wireless Networks, vol. 12, no. 1, pp. 63–78, Feb. 2006.
[47] P. Wang, R. Dai, and I. Akyildiz, “A Differential Coding-Based Scheduling Framework for Wireless Multimedia Sensor Networks,” IEEE Trans. on Multimedia, vol. 15, no. 3, pp. 684–697, April 2013.
[48] P. Wang and W. Zhuang, “A Collision-free MAC Scheme for Multimedia Wireless Mesh Backbone,” IEEE Trans. on Wireless Commun., vol. 8, no. 7, pp. 3577–3589, July 2009.
[49] J.-H. Lee, “A Massive Transmission Scheme in Contention-Based MAC for Wireless Multimedia Sensor Networks,” Wireless Personal Commun., vol. 71, no. 3, pp. 2079–2095, Aug. 2013.
[50] Y. Z. Zhao, C. Miao, M. Ma, J. B. Zhang, and C. Leung, “A Survey and Projection on Medium Access Control Protocols for Wireless Sensor Networks,” ACM Computing Surveys, vol. 45, no. 1, pp. 7:1–7:37, Dec. 2012.
[51] J. I. Choi, M. Jain, K. Srinivasan, P. Levis, and S. Katti, “Achieving Single Channel, Full Duplex Wireless Communication,” in Proc. of Int. Conf. on Mobile Computing and Networking (MobiCom), Chicago, IL, Sept. 2010.
[52] S. Gollakota and D. Katabi, “ZigZag Decoding: Combating Hidden Terminals in Wireless Networks,” SIGCOMM Comput. Commun. Review, vol. 38, no. 4, pp. 159–170, Oct. 2008.
[53] Y. Zhang, C. Zhu, and K.-H. Yap, “A Joint Source-Channel Video Coding Scheme Based on Distributed Source Coding,” IEEE Transactions on Multimedia, vol. 10, no. 8, pp. 1648–1656, Dec. 2008.
[54] D. Wang, V. S. Somayazulu, and J. R. Foerster, “Efficient Cross-layer Resource Allocation for H.264/SVC Video Transmission over Downlink of an LTE System,” in Proc. of IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), San Francisco, CA, June 2012.
[55] Z. Chen and D. Wu, “Rate-distortion optimized cross-layer rate control in wireless video communication,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 3, pp. 352–365, March 2012.
[56] H. Zhang, Y. Zheng, M. A. A. Khojastepour, and S. Rangarajan, “Cross-layer optimization for streaming scalable video over fading wireless networks,” IEEE Journal on Selected Areas in Communications, vol. 28, no. 3, pp. 344–353, April 2010.
[57] E. Ekmekcioglu, H. K. Arachchi, C. G. Gurler, and S. S. Savas, “Content Aware Delivery of Visual Attention based Scalable Multi-View Video over P2P,” in Proc. of IEEE International Packet Video Workshop, Munich, Germany, May 2012.
[58] S. T. Aditya and S. Katti, “FlexCast: Graceful Wireless Video Streaming,” in Proc. of ACM Int. Conf. on Mobile Computing and Networking (MobiCom), Las Vegas, NV, Sept. 2011.
[59] Z. Guan, T. Melodia, and D. Yuan, “Jointly Optimal Rate Control and Relay Selection for Cooperative Wireless Video Streaming,” IEEE/ACM Trans. on Networking, vol. 21, no. 4, pp. 1173–1186, Aug. 2013.
[60] S. Pudlewski and T. Melodia, “Cooperating to Stream Compressively Sampled Videos,” in Proc. of IEEE Int. Conf. on Commun. (ICC), Budapest, Hungary, June 2013.
[61] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of The H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.
[62] S. Jakubczak and D. Katabi, “A Cross-Layer Design for Scalable Mobile Video,” in Proc. of ACM Int. Conf. on Mobile Computing and Networking (MobiCom), Las Vegas, NV, Sept. 2011.
[63] X. L. Liu, W. Hu, Q. Pu, F. Wu, and Y. Zhang, “ParCast: Soft Video Delivery in MIMO-OFDM WLANs,” in Proc. of ACM Int. Conf. on Mobile Computing and Networking (MobiCom), Istanbul, Turkey, Aug. 2012.
[64] X. Fan, F. Wu, D. Zhao, and O. C. Au, “Distributed Wireless Visual Communication With Power Distortion Optimization,” IEEE Trans. on Circuits and Syst. for Video Technology, vol. 23, no. 6, pp. 1040–1053, June 2013.
[65] J. N. Laneman, D. N. C. Tse, and G. W. Wornell, “Cooperative Diversity in Wireless Networks: Efficient Protocols and Outage Behavior,” IEEE Trans. on Inf. Theory, vol. 50, no. 12, pp. 3062–3080, Dec. 2004.
[66] S. A. Jafar and S. Shamai, “Degrees of Freedom Region of the MIMO X Channel,” IEEE Trans. on Inf. Theory, vol. 54, no. 1, pp. 151–170, Jan. 2008.
[67] N. Mastronarde, F. Verde, D. Darsena, A. Scaglione, and M. van der Schaar, “Transmitting Important Bits and Sailing High Radio Waves: A Decentralized Cross-Layer Approach to Cooperative Video Transmission,” IEEE J. on Selected Areas in Commun., vol. 30, no. 9, pp. 1597–1604, Oct. 2012.
[68] E. Candes, M. Rudelson, T. Tao, and R. Vershynin, “Error Correction via Linear Programming,” in Proc. of IEEE Symp. on Foundations of Comput. Science (FOCS), Pittsburgh, PA, Oct. 2005.
[69] P. Merkle, J. B. Singla, K. Muller, and T. Wiegand, “Stereo Video Encoder Optimization for Mobile Applications,” in Proc. of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), Antalya, Turkey, May 2011.
[70] W. Pu, Z. Zou, and C. Chen, “Video Adaptation Proxy for Wireless Dynamic Adaptive Streaming over HTTP,” in Proc. Int. Packet Video Workshop (PV), Munich, Germany, May 2012.
[71] W. Cai and V. C. M. Leung, “Multiplayer Cloud Gaming System with Cooperative Video Sharing,” in Proc. of Int. Conf. on Cloud Computing Technology and Science (CloudCom), Taipei, Taiwan, Dec. 2012.
[72] A. Seema and M. Reisslein, “Towards Efficient Wireless Video Sensor Networks: A HW/SW Cross Layer Approach to Enabling Sensor Node Platforms,” IEEE COMSOC MMTC E-Letter, vol. 7, no. 4, pp. 6–9, April 2012.
[73] C. Zhu and Y. Li, Advanced Video Communications over Wireless Networks. USA: CRC Press, 2013.
[74] I. Sodagar, “The MPEG-DASH Standard for Multimedia Streaming Over the Internet,” IEEE Multimedia, vol. 18, no. 4, pp. 62–67, 2011.
[75] Y. Sun and T. Melodia, “The Internet Underwater: An IP-compatible Protocol Stack for Commercial Undersea Modems,” in Proc. of ACM International Conference on UnderWater Networks and Systems (WUWNet), Kaohsiung, Taiwan, Nov. 2013.
[76] Z. Guan and T. Melodia, “Cloud-assisted Smart-camera Networks for Energy-efficient 3D Video Streaming,” IEEE Computer, pp. 26–32, May 2014.
[77] W. Zhu, C. Luo, J. Wang, and S. Li, “Multimedia Cloud Computing,” IEEE Signal Proc. Magazine, vol. 28, no. 3, May 2011.
