

EVSO: Environment-aware Video Streaming Optimization of Power Consumption

Kyoungjun Park and Myungchul Kim
School of Computing
Korea Advanced Institute of Science and Technology
Email: {kyoungjun, mck}@kaist.ac.kr

Abstract—Streaming services increasingly support high-quality videos for a better user experience. However, streaming high-quality video on mobile devices consumes a considerable amount of energy. This paper presents the design and prototype of EVSO, which achieves power saving by applying adaptive frame rates to parts of videos with little degradation of the user experience. EVSO utilizes a novel perceptual similarity measurement method based on human visual perception, specialized for a video encoder. We also extend the media presentation description, in which video content is selected based only on the network bandwidth, to additionally consider the user's battery status. EVSO's streaming server preprocesses the video into several processed videos according to the similarity intensity of each part of the video and then provides the client with the processed video suitable for the network bandwidth and battery status of the client's mobile device. The EVSO system was implemented on the commonly used H.264/AVC encoder. We conducted various experiments and a user study with nine videos. Our experimental results show that EVSO effectively reduces the energy consumption of mobile devices using streaming services by 22% on average and up to 27% while maintaining the quality of the user experience.

I. INTRODUCTION

Video streaming on mobile devices such as smartphones and tablets has seen unprecedented growth in recent years. According to [1], mobile video traffic accounted for more than 60% of the total mobile traffic in 2017, and it is expected to grow by 54% annually, reaching 78% of all traffic by 2021.

Currently, many video streaming services and smartphones support high frame rates and high resolutions for better user experiences [2], [3]. In addition, both Virtual Reality (VR) and Augmented Reality (AR), which require high-quality video, are considered to be among the next big applications in mobile technology. However, high-quality videos require a significant proportion of device resources, mainly display-related components, resulting in much higher power consumption [4], [5].

Many mobile device manufacturers have put much effort into increasing the battery lifetime of mobile devices. According to Samsung, the battery capacity of their smartphones has experienced a Compound Annual Growth Rate (CAGR) of 11.17% from 2010 to the present, indicating that battery capacity has steadily improved [3]. However, since the amount of mobile video traffic will increase by 54% yearly, the current rate of improvement in battery capacity will not meet the power consumption requirements of video streaming in the near future.

There have been several efforts to reduce the power consumption of mobile games, which typically have high frame rates and resolutions that cause rapid battery drain. One approach, called Game Tuner [6], allows users to configure parameters related to frame rates or resolutions. However, the disadvantages of this approach are that the user experience is seriously degraded due to the static configuration and that user intervention is required. To solve these problems, several approaches attempt to dynamically scale frame rates based on the frame contents [5], [7]. This is achieved by measuring the structural similarity between frames and then dropping redundant or very similar frames. However, these approaches only drop frames on the client side; hence, the number of frames transmitted to the client is not affected. Therefore, the energy consumption required for wireless transmission on the client side remains the same. In addition, the network bandwidth on the client side is not utilized efficiently because unnecessary frames are transmitted.

Based on the aforementioned discussion, we propose Environment-aware Video Streaming Optimization (EVSO), which effectively applies adaptive frame rates to videos without requiring any user effort or incurring considerable computational overhead. The environment represents the status of the user's mobile device and the characteristics of the video that the streaming server delivers to the user. The basic idea behind EVSO is that not all parts of a video require high frame rates. In other words, adaptive frame rates can be applied to parts of the video according to the degree of motion change between frames. For example, in a video where a golfer prepares for a swing, a lower frame rate can be applied due to the low variation between frames [8]. On the other hand, a higher frame rate can be applied to parts with high variation between frames, such as a swing in motion or a moving golf ball, to avoid degrading the user experience.

EVSO utilizes the H.264/AVC encoder [9], which is the most widely used video encoder, to be interoperable with existing systems without additional deployment overhead. EVSO includes the Frame rate Scheduler (F-Scheduler), which scales the frame rates of specific parts of videos using a perception-aware analysis based on the information generated during the H.264/AVC encoding process. We note that the H.264/AVC encoder calculates the similarity between frames based on macroblocks during the video compression (or video coding) process. EVSO utilizes these macroblocks to schedule different parts of the video with appropriate frame rates.

arXiv:1905.06500v1 [cs.MM] 16 May 2019

Figure 1. Variation of SSIM between adjacent frames in the baseball video.

The EVSO system was implemented on a video streaming server built with Internet Information Services (IIS) [10]. We conducted a broad range of experiments and a user study with nine videos to evaluate the proposed system. Our experimental results show that EVSO can greatly reduce the energy requirement with little impact on the user experience. The system reduced the energy consumption rate by an average of 22% and up to 27%. In addition, the user study shows that users cannot clearly distinguish between the original videos and the videos processed through EVSO.

The main contributions of this paper are as follows:
• We propose a new perceptual similarity calculation method that leverages the information generated by the H.264/AVC encoding process.
• Based on the similarity calculation method, we present a novel scheduling technique that adaptively adjusts the frame rate according to the degree of motion intensity.
• We extend the media presentation description (MPD) to take into account not only network conditions but also battery status.
• We present the design and prototype of the EVSO system to reduce energy consumption when streaming videos.
• Various experiments and a user study show that the proposed system effectively preserves the user experience and reduces power consumption when streaming videos.

The rest of this paper is organized as follows. Section II explains the motivation behind EVSO. Section III describes the design of EVSO and its main functions. Section IV presents the detailed implementation of EVSO, and Section V reports the various experiments and a user study. Section VI introduces the related work, and Section VII concludes the study and discusses possible future work.

II. MOTIVATION AND BACKGROUND

This section highlights the necessity of EVSO and explains the information leveraged by EVSO during the H.264/AVC encoding process.

A. Similarity Analysis of Adjacent Video Frames

The similarity between frames can be measured using the Structural SIMilarity (SSIM) index [11]. Figure 1 shows the SSIM indices between adjacent frames in a baseball video [8]. The range of the SSIM index is from 0 to 1, and a value of 0.9 or higher is considered to indicate a strong similarity between two frames [12]. The red shaded mark in Figure 1 indicates that the SSIM value between adjacent frames is higher than 0.9, which means that the degree of change is fairly low. For example, Figure 2 shows examples of a slow movement, where the pitcher prepares to throw the ball (average SSIM: 0.9075), and a fast movement, where the ball is thrown and the camera quickly follows it (average SSIM: 0.3075). The difference between the average SSIM values for the two movement scenarios is 0.60, which is a large gap. This indicates that the degree of change can vary greatly according to which part of the video is playing, even within a single video. Therefore, the energy consumption requirement can be reduced by throttling down the frame rate during slow-moving parts.

Figure 2. Degree of frame change during fast and slow movements.

B. Video Upload Process and H.264/AVC Encoder

When a user uploads a video to a streaming media service such as YouTube, the server transcodes the video into a set of optimized video streams. Transcoding is a conversion process in which the encoding settings of the original video are reformatted. There are several reasons why the streaming server must perform transcoding when a video is uploaded. The first case is when the streaming server needs to lower the bitrate of the uploaded video, which is called transrating. The second case is when the streaming server needs to lower the resolution of the uploaded video, known as transsizing. Transrating and transsizing are always performed when the server creates several processed videos to provide DASH functionality [13].

The most commonly used video codec for streaming servers is H.264/AVC [2]. Therefore, when transcoding is performed on a streaming server, the H.264/AVC encoder is used to convert the original video into the desired videos. The video encoding of H.264/AVC follows a block-based approach, where each coded picture is represented by block-shaped units called macroblocks. A macroblock consists of four 8×8 luminance (Y) samples and two 8×8 chrominance (Cb and Cr) samples in the YCbCr color space with 4:2:0 chroma subsampling [9]. These Y samples are extracted and utilized in EVSO because the human visual system is sensitive to luminance [14].

III. EVSO DESIGN

Figure 3 shows the EVSO system, which consists of three major components: the Frame rate Scheduler (F-Scheduler), the Video Processor (V-Processor), and the Extended MPD (EMPD). When a video is uploaded to a streaming server, F-Scheduler calculates the perceptual similarity score between adjacent frames with little computational overhead (see Section III-A). Then, F-Scheduler determines how to split the video into multiple video chunks with similar motion intensity levels (see Section III-B) and schedules the appropriate frame rates for each video chunk according to three battery levels: High, Medium, and Low (see Section III-C). Although our study adopts three battery levels, other settings can also be applied. The scheduling results generated by F-Scheduler specify which frame rate is appropriate for each part of the video. Afterwards, V-Processor processes the original video according to what is specified in the scheduling results to produce videos suitable for the three battery levels. The detailed implementation of V-Processor is described in Section IV. Finally, EMPD allows users to request appropriate video chunks taking into consideration not only the network condition but also the battery status (see Section III-D).

Figure 3. Flow of EVSO when a video is uploaded to a streaming server.

Note that this paper focuses on H.264/AVC as the video codec; however, we emphasize that the proposed EVSO system is not limited solely to H.264/AVC. Since other codecs also perform video compression using the luminance (Y) values of macroblocks in a similar manner to H.264/AVC, EVSO can easily be applied to other codecs.

A. Calculating the Perceptual Similarity Score

The human visual system is more sensitive to changes in brightness (Y) than to changes in colors (Cb and Cr) [9], [14]. Hwang et al. [5] have shown that the Y-Difference (or Y-Diff) reflects changes in visual perception fairly well. Moreover, they found Y-Diff to be highly correlated with the commonly used SSIM method through various experiments and a user study. However, human vision is sensitive not only to brightness but also to object movement [15]. The Y-Diff value between two images is calculated as the Sum of Absolute Differences (SAD) of the luminance values on a per-frame basis, not on a per-block basis. Therefore, local information can be ignored because the detailed features of each block are mixed. In other words, Y-Diff effectively reflects brightness but can be less accurate with respect to the motion perception of objects. Although Y-Diff can partially reflect some perception of object movement [14], more attention needs to be paid to motion information in order to be more compatible with human perception.

To effectively calculate perceptual similarity scores by considering both brightness and object movement in video streaming environments, we define the Macroblock-Difference (or M-Diff) D_MB(f_a, f_b) between frames f_a and f_b as follows:

D_MB(f_a, f_b) = Σ_{i=0}^{N−1} Σ_{j=0}^{M−1} D_Y(f_a(i,j), f_b(i,j))    (1)

D_Y(f_a(i,j), f_b(i,j)) = { 1 if SAD_Y(f_a(i,j), f_b(i,j)) > θ
                          { 0 if SAD_Y(f_a(i,j), f_b(i,j)) ≤ θ          (2)

where f_k(i,j) is the k-th frame of N×M macroblocks, with i and j representing the macroblock coordinates; SAD_Y(f_a(i,j), f_b(i,j)) is the SAD value between the luminance values of f_a(i,j) and f_b(i,j); and D_Y(f_a(i,j), f_b(i,j)) is set to one when the SAD value between the luminance values of f_a(i,j) and f_b(i,j) exceeds a certain threshold θ. The value of θ is set to 320 because OpenH264 [16] classifies a block as a high-motion block when the SAD value of the luminance exceeds 320 when detecting a scene change. Finding the optimal value of θ is left as future work.

Figure 4. Correlation between Y-Diff and SSIM (Pearson's correlation coefficient: −0.7329).

Figure 5. Correlation between M-Diff and SSIM (Pearson's correlation coefficient: −0.7834).
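As a minimal illustration, Equations (1) and (2) reduce to counting the macroblocks whose luminance SAD exceeds θ. The sketch below assumes the per-macroblock SAD values have already been extracted from the encoder and are supplied as a 2-D list; the function name and input format are our own:

```python
# Sketch of Equations (1) and (2): count how many macroblocks changed
# "enough" between two frames. sad_grid[i][j] holds SAD_Y(f_a(i,j), f_b(i,j)),
# the sum of absolute luminance differences for macroblock (i, j), which the
# H.264/AVC encoder already computes during compression.

THETA = 320  # high-motion SAD threshold used by OpenH264 for scene-change detection

def m_diff(sad_grid, theta=THETA):
    """Macroblock-Difference D_MB: number of macroblocks whose luminance
    SAD between two frames exceeds theta (Equations (1) and (2))."""
    return sum(1 for row in sad_grid for sad in row if sad > theta)

# Toy 2x3 macroblock grid: only two blocks exceed the threshold.
sad_grid = [[10, 500, 319],
            [321, 0, 320]]   # note: 320 itself is NOT counted (strict '>')
print(m_diff(sad_grid))       # -> 2
```

Because the SAD values are a byproduct of encoding, this count costs a single pass over the macroblock grid per frame pair.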

EVSO uses M-Diff as a perceptual similarity measurement method to cover both brightness and object movement. M-Diff uses the luminance values of macroblocks to capture brightness perception and handles object movement through block-based calculation; the block-based approach performs the perceptual similarity calculation on the macroblock unit rather than on the frame unit. Figure 4 shows how closely Y-Diff matches SSIM in an analysis of adjacent frames of seven videos [8], and Figure 5 shows how closely M-Diff matches SSIM for the same videos. These figures indicate that M-Diff is highly correlated with SSIM, approximately 5% more strongly than Y-Diff. In other words, M-Diff has an accuracy comparable to or better than Y-Diff in perceptual similarity calculation.

How about using SSIM as a method for the perceptual similarity calculation? More than 72 hours of new video content is uploaded to YouTube every minute, and the volume of uploaded videos continues to grow [17]. For such a large number of videos, performing SSIM calculations for consecutive frames is exorbitant in terms of computational time and power. On a desktop PC equipped with a 3.5 GHz processor and 12 GB of memory, it took about 401 milliseconds to calculate the perceptual similarity between two frames at 1920×1080 resolution using SSIM. If SSIM is used to process a video with 60 frames per second for a duration of 30 minutes, the time overhead is approximately 12 hours. On the other hand, there is little or no overhead incurred in the M-Diff calculation because it uses the luminance differences between macroblocks generated by the H.264/AVC encoder when a video is uploaded.

B. Splitting the Video into Multiple Video Chunks

In order to apply multiple frame rates to a single video, how to properly split the video into multiple chunks needs to be considered. If a video is split too finely, an appropriate frame rate can be set to suit the characteristics of each video chunk, but it takes a very long time to process the video. In contrast, if a video is split too coarsely, it is difficult to calculate the appropriate frame rate because the characteristic of each video chunk becomes ambiguous, which can adversely impact the user experience.

Estimating the split threshold. A simple approach to splitting the video is to use the M-Diff value in the perceptual similarity score as a separation criterion. For example, if more than 80% of the corresponding macroblocks in two frames exceed the threshold θ, then there is a large enough difference between the two frames, and thus they can be separated. Another approach is to use the local statistical properties of the frame sequence, such as the mean and standard deviation of the degree of change, to define a dynamic threshold model. However, the above two approaches are used mainly to detect scene changes [16]. On the other hand, our approach is to design separation criteria that allow each video chunk to have a similar degree of variability.

We newly define the Estimated Split Threshold (EST) to separate the video into multiple chunks with similar variability levels as follows:

EST(f_n) = { 1 if (σ_n > α or D_MB(f_{n−1}, f_n) > β) and T > γ
           { 0 otherwise                                          (3)

σ_n = √( (1/(K−1)) Σ_{i=n−K+1}^{n−1} (D_MB(f_i, f_{i+1}) − m_n)² )    (4)

m_n = (1/K) Σ_{i=n−K+1}^{n−1} D_MB(f_i, f_{i+1})                  (5)

where m_n is the average M-Diff value of the previous K frames; σ_n is the standard deviation with window size K, representing the degree of M-Diff scattering over the previous K frames; and EST(f_n) is a threshold that determines whether to split. When the value of EST(f_n) is one, EVSO decides to split at f_n. In EST(f_n), D_MB(f_{n−1}, f_n) is used to detect clear scene changes. Finally, T is the number of frames in the current video chunk, and γ is the frame rate of the current video. The values of σ_n and D_MB(f_{n−1}, f_n) tend to be large around the high-motion parts of the video, which can cause the video to split too finely. Therefore, the inequality T > γ, representing the minimum condition, is applied so that each video chunk is at least one second long.

Determining the α, β, and K factors. The smaller the values of the constant factors α and β, the fewer frames are assigned to each video chunk, so each video chunk can have a more similar level of variability. In the extreme, setting α and β to 0 can be effective for the user experience; however, if α and β are too low, the number of separated video chunks and the processing time of V-Processor increase exponentially. That is, α and β should be selected in consideration of the tradeoff between the computational complexity and the accuracy of the frame rate estimation process.

Figure 6. Variation of the processing time according to the constant factors: (a) constant factor α; (b) constant factor β.
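The chunk-splitting rule in Equations (3)–(5) can be sketched in Python as follows; the function and variable names are our own, and the input is assumed to be the per-frame M-Diff sequence, with mdiffs[n] = D_MB(f_n, f_{n+1}):

```python
import math

def split_points(mdiffs, alpha=3000.0, beta=15000.0, K=10, gamma=30):
    """Sketch of Equations (3)-(5): return frame indices n where EST(f_n) = 1.

    mdiffs[n] = D_MB(f_n, f_{n+1}); gamma is the video frame rate, so the
    T > gamma condition keeps every chunk at least one second long.
    """
    splits = []
    T = 0  # frames accumulated in the current chunk
    for n in range(K, len(mdiffs)):
        T += 1
        # D_MB(f_i, f_{i+1}) for i = n-K+1 .. n-1, mirroring the summation bounds
        window = mdiffs[n - K + 1:n]
        m_n = sum(window) / K                          # Equation (5), as written
        sigma_n = math.sqrt(                           # Equation (4)
            sum((d - m_n) ** 2 for d in window) / (K - 1))
        # mdiffs[n - 1] = D_MB(f_{n-1}, f_n), the scene-change check
        if (sigma_n > alpha or mdiffs[n - 1] > beta) and T > gamma:
            splits.append(n)
            T = 0                                      # start a new chunk
    return splits

# Toy sequence: flat motion with one sharp spike at index 40.
print(split_points([0] * 40 + [20000] + [0] * 40, K=5, gamma=10))  # -> [41]
```

Note that Equation (5) divides the sum of the previous K − 1 M-Diff values by K, and the sketch reproduces that definition verbatim.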

Figure 6(a) shows the processing time when α is set to 1000, 2000, 3000, and 4000 with nine experimental videos on a server equipped with 2.20 GHz × 40 processors and 135 GB of memory. As α increases, the total number of video chunks to be separated decreases, reducing the overall processing time. Our findings indicate that in most videos, the processing time decreases drastically until α is set to 3000, after which it remains nearly the same. Likewise, Figure 6(b) shows that the processing time decreases as β increases. Moreover, in most of the videos, the processing time decreases sharply until β reaches 15000 and then remains almost the same. Consequently, α and β are set to 3000 and 15000, respectively, to allow efficient processing.

If the window size K is set too high, rapid motion changes cannot be detected as separation criteria. K values of 1, 5, 10, and 30 were tested to determine the appropriate value for detecting motion changes in our experimental environment. Based on numerous experiments, K is set to 10. The optimal value of the window size K depends on the characteristics of the video being processed.

C. Estimating Frame Rates for Video Chunks

Figure 7. The overall hierarchical structure of EMPD.

After splitting the video into multiple video chunks based on the discussion presented in Section III-B, an appropriate frame rate needs to be determined for each video chunk. Setting a frame rate that matches the variation of each video chunk is a crucial step in terms of the user experience.

Estimating proper frame rates for video chunks. Based on the analysis of the relationship between M-Diff values and frame rates, we define the Estimated Proper Frame rate (EPF) as follows:

EPF(f_n, f_{n+1}) = { s_1 × γ if D_MB(f_n, f_{n+1}) < τ_1
                    { s_2 × γ if τ_1 ≤ D_MB(f_n, f_{n+1}) < τ_2
                    { s_3 × γ if τ_2 ≤ D_MB(f_n, f_{n+1}) < τ_3
                    { s_4 × γ if τ_3 ≤ D_MB(f_n, f_{n+1}) < τ_4
                    { s_5 × γ if τ_4 ≤ D_MB(f_n, f_{n+1})          (6)

where s_1, s_2, s_3, s_4, and s_5 are the scaling factors of the frame rate γ that match the degree of change between f_n and f_{n+1}, and τ_1, τ_2, τ_3, and τ_4 are the scaling thresholds that distinguish M-Diff values based on similar variations. EPF(f_n, f_{n+1}) is utilized to calculate the appropriate frame rate for adjacent frames, and we then define the Estimated Video chunk Frame rate (EVF) to obtain the appropriate frame rate for each video chunk as follows:

EVF(S_k) = ( Σ_{i=l}^{m−1} EPF(f_i, f_{i+1}) ) / (m − l) + δ × σ_{S_k}    (7)

where S_k represents the k-th video chunk, i.e., (f_l, f_{l+1}, f_{l+2}, ..., f_m), whose appropriate frame rate is to be estimated, and σ_{S_k} is the standard deviation of adjacent frames in S_k based on Equation (4), which is used to determine whether the degree of M-Diff variation in the video chunk is constant or anomalous. If σ_{S_k} is high, it means that fast- and slow-moving parts coexist in a single video chunk. In this case, the overall frame rate can be adjusted to take into account the highly variable parts for a better user experience. Our experiments were performed with δ equal to 0.0001, but it can be tailored to the specific environment in which EVSO is deployed.

Determining the τ and s factors. To determine the τ factors, we analyzed the relationship between M-Diff and the commonly used SSIM using a linear regression based on Figure 5. The resulting equation is as follows:

SSIM(f_n, f_{n+1}) = 1.0063 − 1.5903 × 10⁻⁵ × D_MB(f_n, f_{n+1})    (8)

The statistical values of R² (R-squared) and the Pearson Correlation Coefficient (PCC) were calculated to measure the goodness of fit. The R² and PCC values of this regression model are 0.613 and −0.7834, respectively, indicating that the correlation between SSIM and M-Diff is sufficiently high.

If the SSIM index between two frames is above 0.9, the peak signal-to-noise ratio is above 50 dB, and the mean opinion score, which represents the quality of experience on a scale of 1 to 5, corresponds to 4 (Good) or 5 (Excellent; Identical) [12]. Based on this, we subdivided the SSIM index into 0.99, 0.98, 0.95, and 0.91. According to the regression model in Equation (8), the M-Diff values corresponding to these SSIM indices are 500, 1500, 3000, and 6000, and these values represent τ_1, τ_2, τ_3, and τ_4 in Equation (6), respectively.
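Putting Equations (6) and (7) together with the τ values above, the frame-rate estimation can be sketched as follows. The τ thresholds (500, 1500, 3000, 6000) and δ = 0.0001 come from the text; the s scaling factors shown are purely illustrative placeholders chosen as binary-friendly fractions (the actual s settings are deferred to Section V and depend on the battery level):

```python
import statistics

# tau thresholds from the regression analysis; s factors below are
# ILLUSTRATIVE ONLY (the paper sets them per battery level in Section V).
TAUS = (500, 1500, 3000, 6000)

def epf(mdiff, gamma, s=(0.25, 0.5, 0.75, 0.875, 1.0), taus=TAUS):
    """Estimated Proper Frame rate for one pair of adjacent frames (Eq. (6))."""
    for s_i, tau in zip(s, taus):
        if mdiff < tau:          # checked in order, so tau_{i-1} <= mdiff < tau_i
            return s_i * gamma
    return s[-1] * gamma         # tau_4 <= mdiff: keep the full frame rate

def evf(mdiffs, gamma, delta=0.0001):
    """Estimated Video chunk Frame rate (Eq. (7)): mean EPF over the chunk
    plus a correction delta * sigma_{S_k} for anomalous variation."""
    mean_epf = sum(epf(d, gamma) for d in mdiffs) / len(mdiffs)
    sigma = statistics.stdev(mdiffs) if len(mdiffs) > 1 else 0.0
    return mean_epf + delta * sigma

print(epf(200, 30))    # low motion  -> 7.5
print(epf(7000, 30))   # high motion -> 30.0
```

A chunk whose M-Diff values swing widely gets a higher σ and therefore a slightly higher EVF, matching the paper's intent of protecting highly variable chunks.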

The s factors are set differently depending on the battery level. The detailed settings of the s factors are described in Section V.

D. Extending the Media Presentation Description (MPD) for User Interaction

Dynamic Adaptive Streaming over HTTP (DASH) [13], also known as MPEG-DASH, is an adaptive bitrate streaming technology that allows a multimedia file to be partitioned into several segments and transmitted to the client over HTTP. In other words, when a video is uploaded, the streaming server must generate several processed videos by adjusting the bitrate and resolution to provide DASH functionality.

The Media Presentation Description (MPD) is a manifest that provides segment information such as the resolutions, bitrates, and URLs of the video contents. The DASH client chooses the segment with the highest resolution and bitrate that the network bandwidth can support and then fetches the corresponding segment through the MPD. EVSO extends the MPD to allow the client to select segments considering the battery status as well as the network bandwidth.

Figure 7 shows the hierarchical structure of Extended MPD(EMPD) which consists of one or more periods that describethe content duration Multiple periods can be used when it isnecessary to classify multiple videos into chapter-by-chapteror separate advertisements and contents Each period consistsof one or more adaptation sets which include media streamsThe period typically consists of separate adaptation sets ofaudio and video for efficient bandwidth management Theadaptation set of the video consists of several representationsthat contain specific information about alternative contentssuch as the resolution and MIME type (or content type)EMPD extended this adaptation set to include an EVSOLevelattribute allowing the client to select the representation thatis appropriate for their current battery situation Finally the

Video Name                              Length (min)  Resolution (pixel)  Bitrate (Kbps)  Frame rate (fps)  Type
NBC News Conference                     2:29          1920x1080           2130            29.98             A
MIT Course                              3:32          1920x1080           1191            29.97             A
Golf - 2017 Bank of Hope Founders Cup   1:06          1920x1080           2256            29.97             A
PyeongChang Olympic                     1:24          1920x1080           2908            30.00             B
Conan Show                              0:31          1920x1080           3755            29.98             B
Kershaw Baseball                        1:15          1280x720            1946            29.97             B
Pororo Animation                        15:31         1920x1080           1085            29.97             C
Tennis - Australian Open 2018           10:04         1280x720            3075            59.94             C
National Geographic                     13:53         1280x720            3115            59.94             C

Table I. The characteristics of the videos used for the experiments. The bold word in the video name is used to represent that video in the experiments.

representation provides the URL so that the client can fetch the media segments and play them back.
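To make the extension concrete, a hypothetical EMPD fragment might look as follows. The EVSOLevel attribute name comes from the description above, while the element layout follows a standard MPEG-DASH MPD; all identifiers, URLs, and values are illustrative, not taken from the paper.

```xml
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period duration="PT3M0S">
    <!-- EMPD adds an EVSOLevel attribute so the client can match the
         adaptation set to its battery status (High/Medium/Low). -->
    <AdaptationSet mimeType="video/mp4" EVSOLevel="Low">
      <Representation id="720p-low-battery" bandwidth="5000000"
                      width="1280" height="720">
        <BaseURL>video/720p_evso_low/</BaseURL>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
```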

IV. IMPLEMENTATION

In order to study the effectiveness of EVSO, we implemented a video streaming server with DASH functionality using IIS [10]. The video streaming server sends EMPD to the client to allow the users to select the appropriate video. We also implemented a video streaming client by modifying ExoPlayer [18], an open-source media player for Android, to interpret the EMPD manifest information taking into account the battery status and network bandwidth.

Frame rate Scheduler (F-Scheduler): The source code for OpenH264 [16] was modified to determine which frame rate is appropriate for each video chunk. The SAD values of the macroblocks are extracted when the H.264/AVC encoder performs the video compression process. These extracted SAD values of the macroblocks are utilized to analyze the motion variation of the video and to set the appropriate frame rates accordingly.

Video Processor (V-Processor): After the streaming server processes the uploaded video to generate videos with various resolutions and bitrates for the DASH service, V-Processor additionally processes the videos according to the schedule provided by F-Scheduler. This is achieved using the seeking and concatenate methods of FFmpeg [19] to generate videos with adaptive frame rates. The seeking method is used to split the video into multiple video chunks, and the frame rate of each video chunk is adjusted to match the schedule. Finally, the video chunks that have different frame rates are combined into a single video through the concatenate method.

Note that V-Processor is integrated into the H.264/AVC encoder for performance reasons. The implementation location of V-Processor can be changed depending on the deployment situation.
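The seeking/concatenate flow can be sketched as FFmpeg command construction. The flags (-ss, -t, -r, and the concat demuxer) are standard FFmpeg CLI options, while the function, path, and file names are illustrative, not the paper's actual code.

```python
# Hypothetical sketch of V-Processor's split/adjust/concat flow. The chunk
# boundaries and per-chunk frame rates come from F-Scheduler's schedule.
def build_ffmpeg_commands(src, schedule, out, tmpdir="/tmp/evso"):
    """schedule: list of (start_sec, duration_sec, target_fps) tuples.

    Returns the ffmpeg argv lists to run and the concat list-file body."""
    commands, chunks = [], []
    for i, (start, dur, fps) in enumerate(schedule):
        chunk = f"{tmpdir}/chunk{i}.mp4"
        chunks.append(chunk)
        # Seeking step: cut one chunk out of the source and re-encode it
        # at the frame rate F-Scheduler assigned to it.
        commands.append(["ffmpeg", "-y", "-ss", str(start), "-t", str(dur),
                         "-i", src, "-r", str(fps), chunk])
    # Concatenate step: stream-copy the chunks back into a single video
    # using FFmpeg's concat demuxer and a list file.
    listing = "".join(f"file '{c}'\n" for c in chunks)
    commands.append(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                     "-i", f"{tmpdir}/concat.txt", "-c", "copy", out])
    return commands, listing
```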

V. EVALUATION

We evaluated the performance of the EVSO system by determining 1) how much the frame rate can be reduced, 2) how well the quality of the processed videos is maintained, 3) how much total energy can be saved, 4) how much the

Video        EVSO   EVSO+  EVSO++
NBC          24.82  22.44  19.18
MIT          19.35  16.51  14.27
Golf         23.47  21.41  18.86
PyeongChang  27.64  26.78  24.06
Conan        25.95  24.46  21.84
Kershaw      26.36  24.44  21.64
Pororo       25.51  23.75  20.83
Tennis       50.94  47.82  42.17
Geographic   53.38  49.44  42.63

Table II. The average frame rate (fps) of the videos processed through EVSO.

processed videos affect the user experience, and 5) how much overhead is caused by EVSO.

Videos used for experiments: EVSO was evaluated based on nine videos of various categories, such as sports, talk shows, and lectures (available at [8]). As shown in Table I, these videos have various frame rates and resolutions in order to represent an environment similar to an actual streaming service. In addition, the videos are grouped into three types according to the degree of motion intensity: (A) Static, (B) Dynamic, and (C) Hybrid. The static group consists of videos that display almost identical scenes, such as lectures. The dynamic group consists of videos with a high level of motion intensity, such as sports. The hybrid group consists of videos that have attributes of both aforementioned groups. Unless otherwise noted, we conducted experiments on the videos processed to play for up to three minutes.

Four different settings used for experiments (EVSO, EVSO+, EVSO++, 2/3 FPS): We configured EPF with three different settings according to the user's battery status, i.e., EVSO, EVSO+, and EVSO++. The worse the battery condition, the more aggressively EVSO is used (denoted by more + signs), allowing for a longer battery lifetime. For this reason, through multiple experiments, the values of s1, s2, s3, s4, and s5 in Equation (6) are set to 0.6, 0.83, 0.9, 0.93, and 1 for EVSO; 0.5, 0.73, 0.83, 0.9, and 1 for EVSO+; and 0.43, 0.6, 0.7, 0.8, and 0.93 for EVSO++, respectively. As a comparison, we also created an experimental group called 2/3 FPS that naively reduces the frame rate of the original video to two-thirds.

Videos processed through EVSO: Table II shows the average frame rates of the videos processed according to battery levels. Since EVSO adaptively reduces the frame rate in consideration of the characteristics of the video, the resulting frame rate is quite different for each video. For example, the MIT video has little change in motion and is therefore reduced by more than one-third of the original frame rate on EVSO. On the other hand, the Geographic video has relatively fast-changing motion, so EVSO reduces the frame rate of the video to only about five-sixths of the original frame rate.

A. Video Quality Assessment

We used three objective quality metrics to measure how much EVSO affected the quality of the videos: SSIM [11], Video Quality Metric (VQM) [20], and Video Multimethod Assessment Fusion (VMAF) [21]. SSIM is an appropriate metric for calculating similarity based on images; however,

Video        EVSO (SSIM / VQM / VMAF)   EVSO+ (SSIM / VQM / VMAF)  EVSO++ (SSIM / VQM / VMAF)  2/3 FPS (SSIM / VQM / VMAF)
NBC          99.53 / 0.730 / 99.61      99.41 / 0.814 / 99.33      99.06 / 0.998 / 98.92       98.52 / 1.137 / 98.42
MIT          99.68 / 0.235 / 99.30      99.63 / 0.256 / 99.05      99.56 / 0.289 / 98.69       99.52 / 0.318 / 98.65
Golf         98.64 / 0.515 / 98.66      98.50 / 0.563 / 98.26      98.03 / 0.711 / 97.09       96.95 / 0.925 / 94.91
PyeongChang  99.29 / 0.366 / 99.27      99.12 / 0.431 / 98.77      97.87 / 0.847 / 95.63       94.20 / 1.527 / 85.57
Conan        99.04 / 0.818 / 98.77      98.76 / 0.942 / 97.76      98.20 / 1.106 / 96.46       96.66 / 1.513 / 92.82
Kershaw      98.54 / 0.707 / 98.81      98.18 / 0.839 / 98.20      96.60 / 1.324 / 95.72       93.57 / 1.866 / 91.01
Pororo       98.92 / 0.630 / 97.27      98.50 / 0.780 / 96.07      97.62 / 1.068 / 93.33       95.94 / 1.439 / 90.47
Tennis       98.90 / 0.806 / 99.02      98.76 / 0.892 / 98.62      98.42 / 1.051 / 97.29       97.93 / 1.249 / 95.32
Geographic   99.12 / 0.860 / 99.34      98.94 / 0.962 / 98.80      98.73 / 1.095 / 97.99       98.45 / 1.199 / 97.16

Table III. Video quality scores according to the metrics of SSIM, VQM, and VMAF. SSIM and VMAF are expressed in percentages.

Figure 8. The percentage of energy savings while watching each video by streaming (y-axis: Energy Saving (%); bars for EVSO, EVSO++, and 2/3 FPS across the nine videos).

because our experiments focused on videos, we also used VQM and VMAF. These metrics are more suited to measuring the subjective quality of the video than SSIM. A low value of VQM indicates a high-quality video. On the other hand, the VMAF index has a range from 0 to 1, and a high value of VMAF indicates a high-quality video.

Table III shows the quality of the videos processed through EVSO compared to 2/3 FPS. As can be seen, EVSO provides better overall video quality than 2/3 FPS in all video cases. For the static group, both 2/3 FPS and EVSO maintain high quality because the motion intensity in the videos is comparatively low. The most prominent gap between EVSO and 2/3 FPS occurs in the dynamic group, where 2/3 FPS causes the VMAF metric to be less than 90%, indicating a severe degradation of the user experience. In addition, in the Kershaw video, the average frame rates of EVSO++ and 2/3 FPS are similar at 21.66 and 20, respectively, whereas the VMAF metric of EVSO++ is measured to be 5% higher than that of 2/3 FPS. This is because EVSO adjusts the frame rate by analyzing each part of the video, so the video quality can be kept much higher than with 2/3 FPS, which naively reduces the frame rate.

B. Energy Saving

We used a Monsoon Power Monitor [22] to measure the energy consumption of processed videos on an LG Nexus 5. The brightness of the smartphone was configured to 30%, and airplane mode was activated to reduce the effect of external variables, with Wi-Fi turned on to stream videos from the streaming server. To prevent thermal throttling caused by overheating, the smartphone was cooled before each experiment.

We evaluated the energy consumption of nine videos in four different settings: Baseline, EVSO, EVSO++, and 2/3 FPS. Baseline is the original video and serves as a control group; EVSO, EVSO++, and 2/3 FPS use the same settings as described above. For accurate experiments, the energy consumption of each video was measured five times and then averaged.

Figure 8 shows the ratio of energy savings when comparing Baseline to EVSO, EVSO++, and 2/3 FPS. Because EVSO adjusts the frame rate according to the motion intensity, the amount of energy saved varies considerably depending on the video characteristics. For example, the energy used by the MIT video in EVSO was reduced by about 27%, but for the Tennis video the reduction was only about 11%. However, if the current battery status is bad, EVSO++ can be used to reduce the energy requirement similarly to the 2/3 FPS group, except for the PyeongChang video, with a reduction on average of 22% compared to Baseline.

C. User Study

To analyze how the videos processed through EVSO affect the user experience, we recruited 15 participants (10 males) for the user study. We adopted the Double Stimulus Impairment Scale (DSIS) and the Double Stimulus Continuous Quality Scale (DSCQS), which are widely used for subjective quality assessments of systems [5], [7]. For the sake of experimentation, participants were allowed to watch one minute per video.

DSIS: DSIS allows the participants to sequentially view two stimuli, i.e., the source video and the processed video, to assess the degree of impairment of the processed video. The evaluation index of DSIS is a five-point impairment scale of two stimuli: imperceptible (5 points), perceptible but not annoying (4 points), slightly annoying (3 points), annoying (2 points), and very annoying (1 point). In DSIS, the participants are informed about which videos are original or processed.

We measured how much impairment occurred on the videosprocessed through EVSO The participants were asked to

Figure 9. Results of the user study through DSIS and DSCQS: (a) quality rating from DSIS (scale 1-5) for the NBC, PyeongChang, and Geographic videos; (b) quality rating from DSCQS (scale 1-100); (c) quality rating distribution from DSCQS for Baseline, EVSO, EVSO++, and 2/3 FPS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they saw the original video and then the processed video in sequence.

Figure 9(a) shows the impairment rating of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying". This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three different types of video received high scores indicates that EVSO properly considers the motion intensity characteristics of the videos.

DSCQS: DSCQS allows the participants to view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the videos using four different settings in random order and evaluated the quality of each video. The participants were allowed to watch the previously viewed videos again if they wanted. Figures 9(b) and (c) show the quality rating and quality rating distribution using DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants were able to clearly discriminate 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate using EVSO leads to better quality than naively reducing the frame rate.

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz × 40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating the session five times for accuracy.

Scheduling overhead: We measured how much scheduling overhead occurred when the Frame rate Scheduler (F-Scheduler) was built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about 218.5 seconds and 220.2 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead: We measured how much processing overhead occurred when the Video Processor (V-Processor) processed a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video with various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes videos according to three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively. This indicates that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is internally handled by the streaming server and is a preliminary task performed prior to streaming the video to users. In addition, the time overhead can be significantly reduced if V-Processor is run based on a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead: Because the EVSO system generates additional videos based on the three battery levels, the amount of video that needs to be stored is nearly three times more than for a conventional streaming server. This may be a strain on the storage requirement of the streaming server; however, this problem can also be alleviated if V-Processor is selectively performed only on popular videos, as described above.

VI. RELATED WORK

A. Energy Saving when Streaming Videos

Lim et al. [4] extended the H.264/AVC encoder to provide a frame-skipping scheme during compression and transmission for mobile screen sharing applications. This scheme can reduce the energy consumption of mobile devices without significantly affecting the user experience, as similar frames are skipped during compression and transmission. In addition, this scheme does not transmit unnecessary frames to the mobile device because it works on the server side; thus, there is a benefit of increased available network bandwidth on the client side. However, this scheme incurs high computational overhead because it calculates the similarity of all frames with the SSIM method [5]. On the other hand, EVSO proposes a novel M-Diff method specialized for a video encoder and provides a flexible way to adjust the frame rate according to the current battery status of the mobile device.

Since the wireless interface consumes a considerable amount of energy on the mobile device, there have been several efforts to reduce the energy consumption by utilizing a playback buffer when streaming videos [23], [24]. These approaches download a large amount of data in advance and store it in the player's playback buffer to increase the idle time of the wireless interface. These approaches are interoperable with EVSO. However, if users frequently skip or quit while watching the video, network bandwidth can be wasted because the previously downloaded data is no longer used.

Kim et al. [7] adjusted the refresh rate of the screen by defining a content rate, which is the number of meaningful frames per second. If the content rate is low, this scheme reduces the refresh rate, which reduces the energy requirement without significantly affecting the user experience. However, the energy usage of the wireless interface, which is the main energy-consuming factor in video streaming, cannot be reduced at all because only the refresh rate on the user side is adjusted.

B. Energy Saving for Mobile Games

Hwang et al. [5] introduced a rate-scaling technique called RAVEN that skips similar frames in the rendering loop while playing games. This system uses the Y-Diff method because the SSIM method has high computational overhead. The authors showed that Y-Diff has accuracy comparable to SSIM in human visual perception. On the other hand, EVSO proposes a novel M-Diff method specialized for a video streaming service. Moreover, RAVEN cannot be applied to a video streaming service because it performs rate-scaling in the rendering loop for mobile games.

Chen et al. [25] proposed a method called FingerShadow that reduces the power consumption of the mobile device by applying a local dimming technique to the screen areas covered by the user's fingers while playing a game. However, this approach requires external interaction with user behavior. In contrast, EVSO is independent of external factors and performs power savings using the motion intensity information of the video.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose EVSO, a system that opportunistically applies adaptive frame rates to videos without requiring any user effort. We introduce a similarity calculation method specialized for a video encoder and present a novel scheduling technique that assigns an appropriate frame rate to each video chunk. Our experiments show that EVSO reduces the power requirement by as much as 27% on mobile devices with little degradation of the user experience.

As future work, we plan to optimize various factors used in EVSO. One potential direction would be to use reinforcement learning to optimize decisions and continue learning through policy training. Another direction would be to include more videos with various frame rates in the experiments.

ACKNOWLEDGMENT

We thank the anonymous reviewers as well as our colleague Jaehyun Park for their valuable comments and suggestions. This research was a part of the project titled "SMART-Navigation project" funded by the Ministry of Oceans and Fisheries, Korea. This research was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2017R1A2B4005865).

REFERENCES

[1] "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016-2021, White Paper," 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html

[2] "Recommended upload encoding settings," 2018. [Online]. Available: https://support.google.com/youtube/answer/1722171

[3] "Samsung Galaxy S series," 2018. [Online]. Available: https://www.samsung.com/levant/smartphones/

[4] K. W. Lim, J. Ha, P. Bae, J. Ko, and Y. B. Ko, "Adaptive frame skipping with screen dynamics for mobile screen sharing applications," IEEE Systems Journal, vol. 12, no. 2, pp. 1577-1588, 2016.

[5] C. Hwang, S. Pushp, C. Koh, J. Yoon, Y. Liu, S. Choi, and J. Song, "RAVEN: Perception-aware Optimization of Power Consumption for Mobile Games," ACM MobiCom, 2017.

[6] "Game Tuner," 2018. [Online]. Available: https://play.google.com/store/apps/details?id=com.samsung.android.gametuner.thin

[7] D. Kim, N. Jung, Y. Chon, and H. Cha, "Content-centric energy management of mobile displays," IEEE Transactions on Mobile Computing, vol. 15, no. 8, pp. 1925-1938, 2016.

[8] "Experimental videos," 2018. [Online]. Available: https://www.youtube.com/playlist?list=PL6zYoyhvrmRrmY6JNGVJhKjJdbJJZzy9z

[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.

[10] "Internet Information Services," 2018. [Online]. Available: https://www.iis.net

[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.

[12] E. Cuervo, A. Wolman, L. P. Cox, K. Lebeck, A. Razeen, S. Saroiu, and M. Musuvathi, "Kahawai: High-quality mobile gaming using GPU offload," ACM MobiSys, 2015.

[13] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," Proceedings of the Second Annual ACM Conference on Multimedia Systems, 2011.

[14] M. S. Livingstone, Vision and Art: The Biology of Seeing. Wadsworth, 2002.

[15] E. B. Goldstein and J. Brockmole, Sensation and Perception. Cengage Learning, 2016.

[16] "OpenH264," 2018. [Online]. Available: http://www.openh264.org

[17] "Data never sleeps 2.0," 2018. [Online]. Available: https://www.domo.com/blog/data-never-sleeps-2-0

[18] "GitHub - ExoPlayer," 2018. [Online]. Available: https://github.com/google/ExoPlayer

[19] "FFmpeg," 2018. [Online]. Available: https://ffmpeg.org

[20] F. Xiao et al., "DCT-based video quality evaluation," Final Project for EE392J, vol. 769, 2000.

[21] "Toward a practical perceptual video quality metric," 2016. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html

[22] "Monsoon Solutions Inc., Monsoon power monitor," 2018. [Online]. Available: https://www.msoon.com

[23] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, "Network characteristics of video streaming traffic," ACM CoNEXT, 2011.

[24] W. Hu and G. Cao, "Energy-aware video streaming on smartphones," IEEE INFOCOM, 2015.

[25] X. Chen, K. W. Nixon, H. Zhou, Y. Liu, and Y. Chen, "FingerShadow: An OLED Power Optimization Based on Smartphone Touch Interactions," USENIX HotPower, 2014.


Figure 1. Variation of SSIM between adjacent frames in the baseball video (x-axis: frame number [1, N-1]; y-axis: SSIM(fN, fN+1), 0 to 1).

different parts of the video with appropriate frame rates. The EVSO system was implemented on a video streaming server built with Internet Information Services (IIS) [10]. We conducted a broad range of experiments and a user study with nine videos to evaluate the proposed system. Our experimental results show that EVSO can greatly reduce the energy requirement with little impact on the user experience. The system reduced the energy consumption rate by an average of 22% and up to 27%. In addition, the user study shows that users cannot clearly distinguish between the original videos and videos processed through EVSO.

The main contributions of this paper are as follows:

• We propose a new perceptual similarity calculation method that leverages the information generated by the H.264/AVC encoding process.

• Based on the similarity calculation method, we present a novel scheduling technique that adaptively adjusts the frame rate according to the degree of motion intensity.

• We extend the media presentation description (MPD) to take into account not only network conditions but also battery status.

• We present the design and prototype of the EVSO system to reduce energy consumption when streaming videos.

• Various experiments and a user study show that the proposed system effectively preserves the user experience and reduces the power consumption when streaming videos.

The rest of this paper is organized as follows. Section II explains the motivation behind EVSO. Section III describes the design of EVSO and its main functions. Section IV presents the detailed implementation of EVSO, and Section V reports the various experiments and a user study. Section VI introduces the related work, and Section VII concludes the study and discusses possible future work.

II. MOTIVATION AND BACKGROUND

This section highlights the necessity of EVSO and explains the information leveraged by EVSO during the H.264/AVC encoding process.

A. Similarity Analysis of Adjacent Video Frames

The similarity between frames can be measured using the Structural SIMilarity (SSIM) index [11]. Figure 1 shows SSIM indices between adjacent frames in a baseball video [8]. The range of the SSIM index is from 0 to 1, and a value of 0.9 or higher is considered to indicate a strong similarity between two frames [12]. The red shaded mark in Figure 1 indicates that the SSIM value between adjacent frames is higher than

Figure 2. Degree of frame change during fast and slow movements (frames N, N+24, and N+48; slow movement: average SSIM 90.75%; fast movement: average SSIM 30.75%).

0.9, which means that the degree of change is fairly low. For example, Figure 2 shows examples of a slow movement, where the pitcher prepares to throw the ball, and a fast movement, where the ball is thrown and the camera quickly follows the ball. The difference between the average values of SSIM for the two movement scenarios is 60%, which is a large gap. This indicates that the degree of change can vary greatly according to which part of the video is playing, even within a single video. Therefore, the energy consumption requirement can be reduced by throttling down the frame rate during slow-moving parts.
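For reference, the SSIM index used above can be illustrated with a minimal single-window implementation of the formula from [11]. Production SSIM is computed over local windows and averaged, so this whole-frame version is a simplification for illustration only.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two whole luma frames (simplified from the
    windowed form of Wang et al. [11]); identical frames score 1.0."""
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the paper
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```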

B. Video Upload Process and H.264/AVC Encoder

When a user uploads a video to a streaming media service such as YouTube, the server transcodes the video into a set of optimized video streams. Transcoding is a conversion process in which the encoding settings of the original video are reformatted. There are several reasons why the streaming server must perform transcoding when a video is uploaded. The first case is when the streaming server needs to lower the bitrate of the uploaded video, which is called transrating. The second case is when the streaming server needs to lower the resolution of the uploaded video, known as transsizing. Transrating and transsizing are always performed when the server creates several processed videos to provide DASH functionality [13].

The most commonly used video codec for streaming servers is H.264/AVC [2]. Therefore, when transcoding is performed on a streaming server, the H.264/AVC encoder is used to convert the original video into the desired videos. The video encoding of H.264/AVC follows a block-based approach, where each coded picture is represented by block-shaped units called macroblocks. A macroblock consists of four 8×8 luminance (Y) samples and two 8×8 chrominance (Cb and Cr) samples in the YCbCr color space with 4:2:0 chroma subsampling [9]. These Y samples are extracted and utilized in EVSO because the human visual system is sensitive to luminance [14].

III. EVSO DESIGN

Figure 3 shows the EVSO system, which consists of three major components: Frame rate Scheduler (F-Scheduler), Video Processor (V-Processor), and Extended MPD (EMPD). When a video is uploaded to a streaming server, F-Scheduler calculates the perceptual similarity score between adjacent frames with little computational overhead (see Section III-A). Then F-Scheduler determines how to split the video into

Figure 3. Flow of EVSO when a video is uploaded to a streaming server: the H.264/MPEG-4 AVC transcoding pipeline (prediction, transform, encode) passes a copy of the macroblocks' Y (luminance) values to F-Scheduler (Sections 3.1-3.3), whose scheduling results drive V-Processor (Section 4); EMPD (Section 3.4) lets the user select among High/Medium/Low variants of each MPD representation (1080p, 720p, 480p).

multiple video chunks with similar motion intensity levels (see Section III-B) and schedules the appropriate frame rates for each video chunk according to three battery levels: High, Medium, and Low (see Section III-C). Although our study adopts three battery levels, other settings can also be applied. The scheduling results generated by F-Scheduler specify which frame rate is appropriate for each part of the video. Afterwards, V-Processor processes the original video according to what is specified in the scheduling results to produce videos suitable for the three battery levels. The detailed implementation of V-Processor is described in Section IV. Finally, EMPD allows users to request appropriate video chunks taking into consideration not only the network condition but also the battery status (see Section III-D).

Note that this paper focuses on H.264/AVC as the video codec; however, we emphasize that the proposed EVSO system is not limited solely to H.264/AVC. Since other codecs also perform video compression using the luminance (Y) values of the macroblock in a similar manner to H.264/AVC, EVSO can easily be applied to other codecs.

A. Calculating the Perceptual Similarity Score

The human visual system is more sensitive to changes in brightness (Y) than it is to changes in colors (Cb and Cr) [9], [14]. Hwang et al. [5] have shown that the Y-Difference (or Y-Diff) reflects the changes in visual perception fairly well. Moreover, they found Y-Diff to be highly correlated with the commonly used SSIM method through various experiments and a user study. However, human vision is sensitive not only to brightness but also to object movement [15]. The Y-Diff value between two images is calculated as the Sum of Absolute Differences (SAD) of the luminance values on a per-frame basis, not on a per-block basis. Therefore, local information can be ignored because the detailed features of each block are mixed. In other words, Y-Diff effectively reflects the brightness but can be less accurate for the motion perception of objects. Although Y-Diff can partially reflect some perception of object movement [14], more attention needs to be paid to motion information in order to be more compatible with human perception.

To effectively calculate perceptual similarity scores by considering both brightness and object movement in video streaming environments, we define the Macroblock-Difference

Figure 4. Correlation between Y-Diff and SSIM (Pearson's correlation coefficient: -0.7329).

Figure 5. Correlation between M-Diff and SSIM (Pearson's correlation coefficient: -0.7834).

(or M-Diff) $D_{MB}(f_a, f_b)$ between frames $f_a$ and $f_b$ as follows:

$$D_{MB}(f_a, f_b) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} D_Y(f_a(i,j), f_b(i,j)) \quad (1)$$

$$D_Y(f_a(i,j), f_b(i,j)) = \begin{cases} 1, & SAD_Y(f_a(i,j), f_b(i,j)) > \theta \\ 0, & SAD_Y(f_a(i,j), f_b(i,j)) \le \theta \end{cases} \quad (2)$$

where $f_k(i,j)$ is the $k$-th frame of $N \times M$ macroblocks, with $i$ and $j$ representing the macroblock coordinates; $SAD_Y(f_a(i,j), f_b(i,j))$ is the SAD value between the luminance values of $f_a(i,j)$ and $f_b(i,j)$; and $D_Y(f_a(i,j), f_b(i,j))$ is set to one when that SAD value exceeds a certain threshold $\theta$. The value of $\theta$ is set to 320 because OpenH264 [16] classifies a block as a high-motion block when the SAD value of the luminance exceeds 320 when detecting a scene change. Finding the optimal value of $\theta$ is left as future work.
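A direct transcription of Equations (1) and (2) can be sketched as follows, assuming a macroblock covers a 16×16 luma region (four 8×8 Y samples, per Section II-B) and using θ = 320 as stated above. In EVSO itself the per-macroblock SAD values come from the encoder rather than being recomputed, so this standalone version is for illustration.

```python
import numpy as np

THETA = 320  # OpenH264's high-motion SAD threshold, as cited in the text
MB = 16      # a macroblock covers a 16x16 luma region (four 8x8 Y samples)

def m_diff(ya, yb, theta=THETA):
    """M-Diff between two frames' luma planes (Equations (1)-(2)):
    the number of macroblocks whose luminance SAD exceeds theta."""
    h, w = ya.shape
    count = 0
    for i in range(0, h - h % MB, MB):
        for j in range(0, w - w % MB, MB):
            sad = np.abs(ya[i:i+MB, j:j+MB].astype(int) -
                         yb[i:i+MB, j:j+MB].astype(int)).sum()
            count += int(sad > theta)  # D_Y is 1 only above the threshold
    return count
```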

EVSO uses M-Diff as a perceptual similarity measurement method to cover both brightness and object movement. M-Diff uses the luminance values of macroblocks to perform brightness perception and handles object movement through block-based calculation, performing the perceptual similarity calculation on the macroblock unit rather than on the frame unit. Figure 4 shows how Y-Diff is similar to SSIM in the analysis of adjacent frames of seven videos [8], and Figure 5 shows how M-Diff is similar to SSIM in the same videos. These figures indicate that M-Diff is highly correlated with SSIM, approximately 5% more so than Y-Diff. In other words, M-Diff has an accuracy comparable to or better than Y-Diff in perceptual similarity calculation.

How about using SSIM as a method for the perceptual similarity calculation? More than 72 hours of new video content is uploaded to YouTube every minute, and the volume of uploaded videos continues to grow [17]. For such a large number of videos, performing SSIM calculations for consecutive frames is exorbitant in terms of computational time and power. On a desktop PC equipped with a 3.5 GHz processor and 12 GB of memory, it took about 401 milliseconds to calculate the perceptual similarity between two frames with 1920×1080 resolution using SSIM. If SSIM is used to process a video with 60 frames per second for a duration of 30 minutes, the time overhead is approximately 12 hours. On the other hand, there is little or no overhead incurred in the M-Diff calculation because it uses the luminance differences between macroblocks generated by the H.264/AVC encoder when a video is uploaded.
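The 12-hour figure follows directly from the measured per-frame cost; checking the arithmetic with the numbers given in the text:

```python
# Numbers from the text: 401 ms per 1920x1080 frame pair using SSIM.
ms_per_frame = 401
frames = 60 * 60 * 30            # 60 fps x 60 s x 30 min = 108,000 frame pairs
hours = frames * ms_per_frame / 1000 / 3600
print(round(hours, 2))  # 12.03 -- roughly the 12 hours quoted in the text
```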

B. Splitting Video into Multiple Video Chunks

In order to apply multiple frame rates to a single video, how to properly split the video into multiple chunks needs to be considered. If a video is split too finely, an appropriate frame rate can be set to suit the characteristics of each video chunk, but it takes a very long time to process the video. In contrast, if a video is split too coarsely, it is difficult to calculate the appropriate frame rate because the characteristic of each video chunk becomes ambiguous, which can adversely impact the user experience.

Estimating the split threshold. A simple approach to splitting the video is to use the M-Diff value in the perceptual similarity score as a separation criterion. For example, if more than 80% of the corresponding macroblocks in two frames exceed the threshold θ, then there is a large enough difference between the two frames, and thus they can be separated. Another approach is to use the local statistical properties of the frame sequence to define a dynamic threshold model. These local properties can include the mean and standard deviation to determine the degree of change in the frame sequence. However, the above two approaches are used mainly to detect scene changes [16]. On the other hand, our approach is to design separation criteria that allow each video chunk to have a similar degree of variability.

We newly define the Estimated Split Threshold (EST) to separate the video into multiple chunks with similar variability levels as follows:

$$EST(f_n) = \begin{cases} 1 & \text{if } (\sigma_n > \alpha \text{ or } D_{MB}(f_{n-1}, f_n) > \beta) \text{ and } T > \gamma \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

$$\sigma_n = \sqrt{\frac{1}{K-1} \sum_{i=n-K+1}^{n-1} \left(D_{MB}(f_i, f_{i+1}) - m_n\right)^2} \quad (4)$$

$$m_n = \frac{1}{K} \sum_{i=n-K+1}^{n-1} D_{MB}(f_i, f_{i+1}) \quad (5)$$

where $m_n$ is the average M-Diff value of the previous $K$ frames, $\sigma_n$ is the standard deviation with the window size $K$, representing the degree of M-Diff scattering over the previous $K$ frames, and $EST(f_n)$ is a threshold that determines whether to split. When the value of $EST(f_n)$ is one, EVSO decides to split at $f_n$. In $EST(f_n)$, $D_{MB}(f_{n-1}, f_n)$ is used to detect clear scene changes. Finally, $T$ is the number of frames in the current video chunk, and $\gamma$ is the frame rate of the current video. The values of $\sigma_n$ and $D_{MB}(f_{n-1}, f_n)$ tend to be large around the high-motion parts of the video, which can cause the video to split too finely. Therefore, the inequality $T > \gamma$, representing the minimum condition, is applied so that each video chunk is at least one second long.

Figure 6. Variation of the processing time according to the constant factors: (a) constant factor α; (b) constant factor β.

Determining the α, β, and K factors. The smaller the values of the constant factors α and β, the fewer frames are assigned to each video chunk, so each video chunk can have a more similar level of variability. In the extreme, setting α and β to 0 can be effective for the user experience; however, if α and β are too low, the number of separated video chunks and the processing time of V-Processor increase exponentially. That is, α and β should be appropriately selected in consideration of the tradeoff between the computational complexity and the accuracy of the frame rate estimation process.

Figure 6(a) shows the processing time when α is set to 1000, 2000, 3000, and 4000 with nine experimental videos on a server equipped with 2.20 GHz×40 processors and 135 GB of memory. As α increases, the total number of video chunks to be separated decreases, reducing the overall processing time. Our findings indicate that in most videos the processing time decreases drastically until α reaches 3000, after which it remains nearly the same. Likewise, Figure 6(b) shows that the processing time decreases as β increases. Moreover, in most of the videos the processing time decreases sharply until β reaches 15000 and then remains almost the same. Consequently, α and β are set to 3000 and 15000, respectively, to allow efficient processing.

If the window size K is set too high, rapid motion changes cannot be detected as separation criteria. K values of 1, 5, 10, and 30 were tested to determine the appropriate value for detecting motion changes in our experimental environment. Based on numerous experiments, K is set to 10. The optimal value of the window size K depends on the characteristics of the video being processed.
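The splitting procedure of Equations (3)-(5) can be sketched as follows, assuming the per-frame M-Diff sequence is available (function and variable names are illustrative; α = 3000, β = 15000, K = 10, and the one-second minimum chunk length come from the text):

```python
import math

ALPHA, BETA, K = 3000, 15000, 10   # constant factors chosen in the text

def split_points(mdiffs, fps):
    """Sketch of Equations (3)-(5).  mdiffs[i] holds D_MB(f_i, f_{i+1});
    fps plays the role of gamma, so each chunk spans at least one second.
    Returns the frame indices at which a new chunk starts."""
    splits, chunk_len = [], 0
    for n in range(K, len(mdiffs)):
        chunk_len += 1                              # T: frames in current chunk
        window = mdiffs[n - K + 1:n]                # previous adjacent diffs
        m_n = sum(window) / K                       # Eq. (5)
        sigma_n = math.sqrt(                        # Eq. (4)
            sum((d - m_n) ** 2 for d in window) / (K - 1))
        # Eq. (3): high local scatter or a clear scene change, but only
        # once the current chunk is longer than one second (T > gamma).
        if (sigma_n > ALPHA or mdiffs[n - 1] > BETA) and chunk_len > fps:
            splits.append(n)
            chunk_len = 0
    return splits

# A flat clip with one abrupt scene change at frame 60 splits once there.
print(split_points([100] * 60 + [20000] + [100] * 60, fps=30))  # [61]
```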

C. Estimating Frame Rates for Video Chunks

After splitting the video into multiple video chunks based on the discussion presented in Section III-B, an appropriate frame rate needs to be determined for each video chunk. Setting the proper frame rate to match the variation of each video chunk is a crucial step in terms of the user experience.

Estimating proper frame rates for video chunks. Based on the analysis of the relationship between the M-Diff values and frame rates, we define the Estimated Proper Frame rate (EPF)

Figure 7. The overall hierarchical structure of EMPD (periods → adaptation sets → representations → initialization and media segment URLs).

as follows:

$$EPF(f_n, f_{n+1}) = \begin{cases} s_1 \times \gamma & \text{if } D_{MB}(f_n, f_{n+1}) < \tau_1 \\ s_2 \times \gamma & \text{if } \tau_1 \le D_{MB}(f_n, f_{n+1}) < \tau_2 \\ s_3 \times \gamma & \text{if } \tau_2 \le D_{MB}(f_n, f_{n+1}) < \tau_3 \\ s_4 \times \gamma & \text{if } \tau_3 \le D_{MB}(f_n, f_{n+1}) < \tau_4 \\ s_5 \times \gamma & \text{if } \tau_4 \le D_{MB}(f_n, f_{n+1}) \end{cases} \quad (6)$$

where $s_1$, $s_2$, $s_3$, $s_4$, and $s_5$ are the scaling factors of the frame rate $\gamma$ that match the degree of change between $f_n$ and $f_{n+1}$, and $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ are the scaling thresholds that distinguish M-Diff values based on similar variations. $EPF(f_n, f_{n+1})$ is utilized to calculate the appropriate frame rate for adjacent frames, and then we define the Estimated Video chunk Frame rate (EVF) to obtain the appropriate frame rate for each video chunk as follows:

$$EVF(S_k) = \frac{\sum_{i=l}^{m-1} EPF(f_i, f_{i+1})}{m - l} + \delta \times \sigma_{S_k} \quad (7)$$

where $S_k$ represents the $k$-th video chunk, i.e., $(f_l, f_{l+1}, f_{l+2}, \ldots, f_m)$, for which the appropriate frame rate is to be estimated, and $\sigma_{S_k}$ is the standard deviation of adjacent frames in $S_k$ based on Equation (4), which is used to determine whether the degree of M-Diff variation in the video chunk is constant or anomalous. If $\sigma_{S_k}$ is high, it means that fast- and slow-moving parts coexist in a single video chunk. In this case, the overall frame rate can be adjusted to take into account the highly variable parts for a better user experience. Our experiments were performed with δ equal to 0.0001, but it can be tailored to the specific environment in which EVSO is deployed.

Determining the τ and s factors. To determine the τ factors, we analyzed the relationship between M-Diff and the commonly used SSIM using a linear regression based on Figure 5. The resulting equation is as follows:

$$SSIM(f_n, f_{n+1}) = 1.0063 - 1.5903 \times 10^{-5} \times D_{MB}(f_n, f_{n+1}) \quad (8)$$

The statistical values of $R^2$ (R-squared) and the Pearson Correlation Coefficient (PCC) were calculated to measure the goodness of fit. The $R^2$ and PCC values of this regression model are 0.613 and −0.7834, respectively, indicating that the correlation between SSIM and M-Diff is sufficiently high.

If the SSIM index between two frames is above 0.9, the peak signal-to-noise ratio is above 50 dB, and the mean opinion score, which represents the quality of experience on a scale of 1 to 5, corresponds to 4 (Good) or 5 (Excellent/Identical) [12]. Based on this, we subdivided the SSIM index into 0.99, 0.98, 0.95, and 0.91. According to the regression model in Equation (8), the M-Diff values corresponding to these SSIM indices are 500, 1500, 3000, and 6000, and these values represent $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ in Equation (6), respectively.

The s factors are set differently depending on the battery level. Detailed settings of the s factors are described in Section V.
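Equations (6)-(7) can be sketched as follows, using the τ thresholds above and, for illustration, the s factors that Section V reports for the basic EVSO level (the helper names are hypothetical):

```python
TAUS = (500, 1500, 3000, 6000)        # tau_1..tau_4 derived from the regression
S_EVSO = (0.6, 0.83, 0.9, 0.93, 1.0)  # s_1..s_5 for the basic EVSO level
DELTA = 0.0001

def epf(mdiff, gamma, s=S_EVSO):
    """Equation (6): frame rate suited to one adjacent frame pair."""
    for tau, scale in zip(TAUS, s):
        if mdiff < tau:
            return scale * gamma
    return s[-1] * gamma                # tau_4 <= mdiff: keep the full rate

def evf(chunk_mdiffs, gamma, sigma):
    """Equation (7): mean EPF over the chunk, nudged upward by delta * sigma
    when the chunk's motion variability (Eq. 4) is anomalous."""
    mean_epf = sum(epf(d, gamma) for d in chunk_mdiffs) / len(chunk_mdiffs)
    return mean_epf + DELTA * sigma

# A mostly calm chunk with one fast part keeps a moderate frame rate.
print(round(evf([400, 400, 2000, 7000], gamma=30, sigma=100.0), 2))  # 23.26
```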

D. Extending Media Presentation Description (MPD) for User Interaction

Dynamic Adaptive Streaming over HTTP (DASH) [13], also known as MPEG-DASH, is an adaptive bitrate streaming technology that allows a multimedia file to be partitioned into several segments and transmitted to the client over HTTP. In other words, when a video is uploaded, the streaming server must generate several processed videos by adjusting the bitrate and resolution to provide DASH functionality.

A Media Presentation Description (MPD) is a manifest that provides segment information such as the resolutions, bitrates, and URLs of the video contents. Therefore, the DASH client chooses the segment with the highest resolution and bitrate that can be supported by the network bandwidth and then fetches the corresponding segment through the MPD. EVSO extends the MPD to allow the client to select segments considering the battery status as well as the network bandwidth.

Figure 7 shows the hierarchical structure of the Extended MPD (EMPD), which consists of one or more periods that describe the content duration. Multiple periods can be used when it is necessary to classify multiple videos chapter by chapter or to separate advertisements from contents. Each period consists of one or more adaptation sets, which include media streams. A period typically consists of separate adaptation sets for audio and video for efficient bandwidth management. The adaptation set of the video consists of several representations that contain specific information about alternative contents, such as the resolution and MIME type (or content type). EMPD extends this adaptation set to include an EVSOLevel attribute, allowing the client to select the representation that is appropriate for its current battery situation. Finally, the

Video Name                             Length (min)  Resolution (pixel)  Bitrate (Kbps)  Frame rate (fps)  Type
NBC News Conference                    2.29          1920x1080           2130            29.98             A
MIT Course                             3.32          1920x1080           1191            29.97             A
Golf — 2017 Back of Hope Founders Cup  1.06          1920x1080           2256            29.97             A
PyeongChang Olympic                    1.24          1920x1080           2908            30.00             B
Conan Show                             3.1           1920x1080           3755            29.98             B
Kershaw Baseball                       1.15          1280x720            1946            29.97             B
Pororo Animation                       15.31         1920x1080           1085            29.97             C
Tennis — Australian Open 2018          10.04         1280x720            3075            59.94             C
National Geographic                    13.53         1280x720            3115            59.94             C

Table I. The characteristics of the videos used for the experiments. The bold word in the video name is used to represent that video in the experiments.

representation provides the URL so that the client can fetch the media segments and play them back.
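On the client side, selection against the extended manifest might look like the following sketch (the EVSOLevel attribute comes from the text, but the record layout and the battery-to-level thresholds are assumptions for illustration):

```python
def pick_representation(reps, bandwidth_kbps, battery_pct):
    """Choose the highest-bitrate representation that fits the network
    bandwidth, then match its EVSO level to the battery status."""
    # Hypothetical mapping of battery status to the server's three levels.
    if battery_pct > 60:
        level = "High"
    elif battery_pct > 30:
        level = "Medium"
    else:
        level = "Low"
    fitting = [r for r in reps
               if r["bitrate"] <= bandwidth_kbps and r["evso_level"] == level]
    return max(fitting, key=lambda r: r["bitrate"], default=None)

# Illustrative representation records parsed from an EMPD manifest.
reps = [
    {"id": 1, "bitrate": 5000, "evso_level": "High"},
    {"id": 2, "bitrate": 5000, "evso_level": "Medium"},
    {"id": 3, "bitrate": 5000, "evso_level": "Low"},
    {"id": 4, "bitrate": 2500, "evso_level": "High"},
    {"id": 5, "bitrate": 2500, "evso_level": "Medium"},
    {"id": 6, "bitrate": 2500, "evso_level": "Low"},
]
print(pick_representation(reps, bandwidth_kbps=3000, battery_pct=25)["id"])  # 6
```

With a nearly drained battery and a 3 Mbps link, the client falls back to the 2500 Kbps "Low" variant, mirroring how EMPD lets battery status join bandwidth in the adaptation decision.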

IV. IMPLEMENTATION

In order to study the effectiveness of EVSO, we implemented a video streaming server with DASH functionality using IIS [10]. The video streaming server sends the EMPD to the client to allow users to select the appropriate video. We also implemented a video streaming client by modifying ExoPlayer [18], an open-source media player for Android, to interpret the EMPD manifest information taking into account the battery status and network bandwidth.

Frame rate Scheduler (F-Scheduler). The source code of OpenH264 [16] was modified to determine which frame rate is appropriate for each video chunk. The SAD values of the macroblocks are extracted when the H.264/AVC encoder performs the video compression process. These extracted SAD values are utilized to analyze the motion variation of the video and to set the appropriate frame rates accordingly.

Video Processor (V-Processor). After the streaming server processes the uploaded video to generate videos with various resolutions and bitrates for the DASH service, V-Processor additionally processes the videos according to the schedule provided by F-Scheduler. This is achieved using the seeking and concatenate methods of FFmpeg [19] to generate videos with adaptive frame rates. The seeking method is used to split the video into multiple video chunks, and the frame rate of each video chunk is adjusted to match the schedule. Finally, the video chunks with different frame rates are combined into a single video through the concatenate method.

Note that V-Processor is integrated into the H.264/AVC encoder for performance reasons. The implementation location of V-Processor can be changed depending on the deployment situation.
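The seek-then-concatenate flow can be sketched as FFmpeg command construction (the text does not give the exact invocations; `-ss`/`-t` seeking, the `fps` video filter, and the concat demuxer assumed here are standard FFmpeg usage):

```python
def chunk_cmds(src, chunks):
    """Build one ffmpeg invocation per (start_sec, dur_sec, fps) chunk and a
    final concat-demuxer invocation.  Returns (commands, concat_list_lines)."""
    cmds, list_lines = [], []
    for k, (start, dur, fps) in enumerate(chunks):
        out = f"chunk{k}.mp4"
        # -ss/-t seek out one chunk; the fps filter re-encodes it at the
        # frame rate scheduled for that chunk.
        cmds.append(["ffmpeg", "-ss", str(start), "-t", str(dur), "-i", src,
                     "-filter:v", f"fps={fps}", out])
        list_lines.append(f"file '{out}'")
    # Concatenate the re-timed chunks without another re-encode.
    cmds.append(["ffmpeg", "-f", "concat", "-safe", "0",
                 "-i", "list.txt", "-c", "copy", "merged.mp4"])
    return cmds, list_lines

cmds, lines = chunk_cmds("input.mp4", [(0, 10, 18), (10, 5, 30)])
print(cmds[0][-2], "|", lines[0])  # fps=18 | file 'chunk0.mp4'
```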

V. EVALUATION

We evaluated the performance of the EVSO system by determining 1) how much the frame rate can be reduced, 2) how well the quality of the processed videos is maintained, 3) how much total energy can be saved, 4) how much the

Video        EVSO   EVSO+  EVSO++
NBC          24.82  22.44  19.18
MIT          19.35  16.51  14.27
Golf         23.47  21.41  18.86
PyeongChang  27.64  26.78  24.06
Conan        25.95  24.46  21.84
Kershaw      26.36  24.44  21.64
Pororo       25.51  23.75  20.83
Tennis       50.94  47.82  42.17
Geographic   53.38  49.44  42.63

Table II. The average frame rate of the videos processed through EVSO.

processed videos affect the user experience, and 5) how much overhead is caused by EVSO.

Videos used for experiments. EVSO was evaluated based on nine videos of various categories, such as sports, talk shows, and lectures (available at [8]). As shown in Table I, these videos have various frame rates and resolutions in order to represent an environment similar to an actual streaming service. In addition, the videos are grouped into three types according to the degree of motion intensity: (A) Static, (B) Dynamic, and (C) Hybrid. The static group consists of videos that display almost identical scenes, such as lectures. The dynamic group consists of videos with a high level of motion intensity, such as sports. The hybrid group consists of videos that have attributes of both aforementioned groups. Unless otherwise noted, we conducted experiments on the videos processed to play for up to three minutes.

Four different settings used for experiments (EVSO, EVSO+, EVSO++, 2/3 FPS). We configured EPF with three different settings according to the user's battery status, i.e., EVSO, EVSO+, and EVSO++. The worse the battery condition, the more aggressively EVSO is used (denoted by more + signs), allowing for a longer battery lifetime. For this reason, through multiple experiments, the values of s1, s2, s3, s4, and s5 in Equation (6) are set to 0.6, 0.83, 0.9, 0.93, and 1 for EVSO; 0.5, 0.73, 0.83, 0.9, and 1 for EVSO+; and 0.43, 0.6, 0.7, 0.8, and 0.93 for EVSO++, respectively. As a comparison, we also created an experimental group called 2/3 FPS that naively reduces the frame rate of the original video to two-thirds.

Videos processed through EVSO. Table II shows the average frame rates of the videos processed according to battery levels. Since EVSO adaptively reduces the frame rate in consideration of the characteristics of the video, the resulting frame rate is quite different for each video. For example, the MIT video has little change in motion and is therefore reduced by more than one-third of the original frame rate on EVSO. On the other hand, the Geographic video has relatively fast-changing motion, so EVSO reduces the frame rate of the video to only about five-sixths of the original frame rate.

A. Video Quality Assessment

We used three objective quality metrics to measure how much EVSO affected the quality of the videos: SSIM [11], the Video Quality Metric (VQM) [20], and Video Multimethod Assessment Fusion (VMAF) [21]. SSIM is an appropriate metric for calculating similarity based on images; however,

Video        |        EVSO         |        EVSO+        |       EVSO++        |       2/3 FPS
             | SSIM   VQM    VMAF  | SSIM   VQM    VMAF  | SSIM   VQM    VMAF  | SSIM   VQM    VMAF
NBC          | 99.53  0.730  99.61 | 99.41  0.814  99.33 | 99.06  0.998  98.92 | 98.52  1.137  98.42
MIT          | 99.68  0.235  99.30 | 99.63  0.256  99.05 | 99.56  0.289  98.69 | 99.52  0.318  98.65
Golf         | 98.64  0.515  98.66 | 98.50  0.563  98.26 | 98.03  0.711  97.09 | 96.95  0.925  94.91
PyeongChang  | 99.29  0.366  99.27 | 99.12  0.431  98.77 | 97.87  0.847  95.63 | 94.20  1.527  85.57
Conan        | 99.04  0.818  98.77 | 98.76  0.942  97.76 | 98.20  1.106  96.46 | 96.66  1.513  92.82
Kershaw      | 98.54  0.707  98.81 | 98.18  0.839  98.20 | 96.60  1.324  95.72 | 93.57  1.866  91.01
Pororo       | 98.92  0.630  97.27 | 98.50  0.780  96.07 | 97.62  1.068  93.33 | 95.94  1.439  90.47
Tennis       | 98.90  0.806  99.02 | 98.76  0.892  98.62 | 98.42  1.051  97.29 | 97.93  1.249  95.32
Geographic   | 99.12  0.860  99.34 | 98.94  0.962  98.80 | 98.73  1.095  97.99 | 98.45  1.199  97.16

Table III. Video quality scores according to the metrics of SSIM, VQM, and VMAF. SSIM and VMAF are expressed in percentages.

Figure 8. The percentage of energy savings while watching each video by streaming (EVSO, EVSO++, and 2/3 FPS for each of the nine videos).

because our experiments focused on videos, we also used VQM and VMAF. These metrics are more suited to measuring the subjective quality of a video than SSIM. A low value of VQM indicates a high-quality video. On the other hand, the VMAF index has a range from 0 to 1, and a high value of VMAF indicates a high-quality video.

Table III shows the quality of the videos processed through EVSO compared to 2/3 FPS. As can be seen, EVSO provides better overall video quality than 2/3 FPS in all video cases. For the static group, both 2/3 FPS and EVSO maintain high quality because the motion intensity in the videos is comparatively low. The most prominent gap between EVSO and 2/3 FPS occurs in the dynamic group, where 2/3 FPS causes the VMAF metric to fall below 90, indicating a severe degradation of the user experience. In addition, in the Kershaw video, the average frame rates of EVSO++ and 2/3 FPS are similar, at 21.66 and 20, respectively, whereas the VMAF metric of EVSO++ is measured to be 5% higher than that of 2/3 FPS. This is because EVSO adjusts the frame rate by analyzing each part of the video, so the video quality can be kept much higher than with 2/3 FPS, which naively reduces the frame rate.

B. Energy Saving

We used a Monsoon Power Monitor [22] to measure the energy consumption of processed videos on an LG Nexus 5. The brightness of the smartphone was set to 30%, airplane mode was activated to reduce the effect of external variables, and Wi-Fi was turned on to stream videos from the streaming server. To prevent thermal throttling caused by overheating, the smartphone was cooled before each experiment.

We evaluated the energy consumption of the nine videos in four different settings: Baseline, EVSO, EVSO++, and 2/3 FPS. Baseline is the original video and serves as a control group; EVSO, EVSO++, and 2/3 FPS use the same settings as described above. For accurate experiments, the energy consumption of each video was measured five times and then averaged.

Figure 8 shows the ratio of energy savings when comparing Baseline to EVSO, EVSO++, and 2/3 FPS. Because EVSO adjusts the frame rate according to the motion intensity, the amount of energy saved varies considerably depending on the video characteristics. For example, the energy used by the MIT video in EVSO was reduced by about 27%, but for the Tennis video the reduction was only about 11%. However, if the current battery status is bad, EVSO++ can be used to reduce the energy requirement similarly to the 2/3 FPS group (except for the PyeongChang video), with a reduction on average of 22% compared to Baseline.

C. User Study

To analyze how the videos processed through EVSO affect the user experience, we recruited 15 participants (10 males) for the user study. We adopted the Double Stimulus Impairment Scale (DSIS) and the Double Stimulus Continuous Quality Scale (DSCQS), which are widely used for subjective quality assessments of systems [5], [7]. For the sake of experimentation, participants were allowed to watch one minute per video.

DSIS. DSIS allows the participants to sequentially view two stimuli, i.e., the source video and the processed video, to assess the degree of impairment of the processed video. The evaluation index of DSIS is a five-point impairment scale of the two stimuli: imperceptible (5 points), perceptible but not annoying (4 points), slightly annoying (3 points), annoying (2 points), and very annoying (1 point). In DSIS, the participants are informed about which videos are original and which are processed.

We measured how much impairment occurred in the videos processed through EVSO. The participants were asked to

Figure 9. Results of the user study through DSIS and DSCQS: (a) quality rating from DSIS; (b) quality rating from DSCQS; (c) quality rating distribution from DSCQS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they saw the original video and then the processed video in sequence.

Figure 9(a) shows the impairment ratings of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying." This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three different types of video received high scores indicates that EVSO properly considers the motion intensity characteristics of the videos.

DSCQS. DSCQS allows the participants to view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the four different settings in random order and evaluated the quality of each. The participants were allowed to watch previously viewed videos again if they wanted. Figures 9(b) and (c) show the quality rating and the quality rating distribution using DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants were able to clearly discriminate 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate using EVSO leads to better quality than naively reducing the frame rate.

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz×40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating each session five times for accuracy.

Scheduling overhead. We measured how much scheduling overhead occurred when the Frame rate Scheduler (F-Scheduler) was built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about 21.85 seconds and 22.02 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead. We measured how much processing overhead occurred when the Video Processor (V-Processor) processed a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video into various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes videos according to the three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively. This indicates that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is handled internally by the streaming server and is a preliminary task performed prior to streaming the video to users. In addition, the time overhead can be significantly reduced if V-Processor is run based on a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead. Because the EVSO system generates additional videos based on the three battery levels, the amount of video that needs to be stored is nearly three times more than on a conventional streaming server. This may strain the storage requirements of the streaming server; however, this problem can also be alleviated if V-Processor is selectively performed only on popular videos, as described above.

VI. RELATED WORK

A. Energy Saving when Streaming Videos

Lim et al. [4] extended the H.264/AVC encoder to provide a frame-skipping scheme during compression and transmission for mobile screen-sharing applications. This scheme can reduce the energy consumption of mobile devices without significantly affecting the user experience, as similar frames are skipped during compression and transmission. In addition, this scheme does not transmit unnecessary frames to the mobile device because it works on the server side; thus, there is a benefit of increased available network bandwidth on the client side. However, this scheme incurs high computational overhead because it calculates the similarity of all frames with the SSIM method [5]. On the other hand, EVSO proposes a novel M-Diff method specialized for a video encoder and provides a flexible way to adjust the frame rate according to the current battery status of the mobile device.

Since the wireless interface consumes a considerable amount of energy on the mobile device, there have been several efforts to reduce the energy consumption by utilizing a playback buffer when streaming videos [23], [24]. These approaches download a large amount of data in advance and store it in the player's playback buffer to increase the idle time of the wireless interface. These approaches are interoperable with EVSO. However, if users frequently skip or quit while watching the video, network bandwidth can be wasted because the previously downloaded data is no longer used.

Kim et al. [7] adjusted the refresh rate of the screen by defining a content rate, which is the number of meaningful frames per second. If the content rate is low, this scheme reduces the refresh rate, which reduces the energy requirement without significantly affecting the user experience. However, the energy usage of the wireless interface, which is the main energy-consuming factor in video streaming, cannot be reduced at all because only the refresh rate on the user side is adjusted.

B. Energy Saving for Mobile Games

Hwang et al. [5] introduced a rate-scaling technique called RAVEN that skips similar frames in the rendering loop while playing games. This system uses the Y-Diff method because the SSIM method has high computational overhead. The authors showed that Y-Diff has accuracy comparable to SSIM in human visual perception. On the other hand, EVSO proposes a novel M-Diff method specialized for a video streaming service. Moreover, RAVEN cannot be applied to a video streaming service because it performs rate-scaling in the rendering loop for mobile games.

Chen et al. [25] proposed a method called FingerShadow that reduces the power consumption of the mobile device by applying a local dimming technique to the screen areas covered by the user's fingers while playing a game. However, this approach requires external interaction with user behavior. In contrast, EVSO is independent of external factors and performs power savings using the motion intensity information of the video.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose EVSO, a system that opportunistically applies adaptive frame rates to videos without requiring any user effort. We introduce a similarity calculation method specialized for a video encoder and present a novel scheduling technique that assigns an appropriate frame rate to each video chunk. Our experiments show that EVSO reduces the power requirement by as much as 27% on mobile devices with little degradation of the user experience.

As future work, we plan to optimize the various factors used in EVSO. One potential direction is to use reinforcement learning to optimize decisions and continue learning through policy training. Another direction is to include more videos with various frame rates in the experiments.

ACKNOWLEDGMENT

We thank the anonymous reviewers as well as our colleague Jaehyun Park for their valuable comments and suggestions. This research was a part of the project titled "SMART-Navigation project," funded by the Ministry of Oceans and Fisheries, Korea. This research was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2017R1A2B4005865).

REFERENCES

[1] "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016-2021 White Paper," 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html

[2] "Recommended upload encoding settings," 2018. [Online]. Available: https://support.google.com/youtube/answer/1722171

[3] "Samsung Galaxy S series," 2018. [Online]. Available: https://www.samsung.com/levant/smartphones/

[4] K. W. Lim, J. Ha, P. Bae, J. Ko, and Y. B. Ko, "Adaptive frame skipping with screen dynamics for mobile screen sharing applications," IEEE Systems Journal, vol. 12, no. 2, pp. 1577-1588, 2016.

[5] C. Hwang, S. Pushp, C. Koh, J. Yoon, Y. Liu, S. Choi, and J. Song, "RAVEN: Perception-aware Optimization of Power Consumption for Mobile Games," ACM MobiCom, 2017.

[6] "Game Tuner," 2018. [Online]. Available: https://play.google.com/store/apps/details?id=com.samsung.android.gametuner.thin

[7] D. Kim, N. Jung, Y. Chon, and H. Cha, "Content-centric energy management of mobile displays," IEEE Transactions on Mobile Computing, vol. 15, no. 8, pp. 1925-1938, 2016.

[8] "Experimental videos," 2018. [Online]. Available: https://www.youtube.com/playlist?list=PL6zYoyhvrmRrmY6JNGVJhKjJdbJJZzy9z

[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.

[10] "Internet Information Services," 2018. [Online]. Available: https://www.iis.net

[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.

[12] E. Cuervo, A. Wolman, L. P. Cox, K. Lebeck, A. Razeen, S. Saroiu, and M. Musuvathi, "Kahawai: High-quality mobile gaming using GPU offload," ACM MobiSys, 2015.

[13] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," Proceedings of the Second Annual ACM Conference on Multimedia Systems, 2011.

[14] M. S. Livingstone, Vision and Art: The Biology of Seeing. Wadsworth, 2002.

[15] E. B. Goldstein and J. Brockmole, Sensation and Perception. Cengage Learning, 2016.

[16] "OpenH264," 2018. [Online]. Available: http://www.openh264.org

[17] "Data never sleeps 2.0," 2018. [Online]. Available: https://www.domo.com/blog/data-never-sleeps-2-0

[18] "GitHub - ExoPlayer," 2018. [Online]. Available: https://github.com/google/ExoPlayer

[19] "FFmpeg," 2018. [Online]. Available: https://ffmpeg.org

[20] F. Xiao et al., "DCT-based video quality evaluation," Final Project for EE392J, vol. 769, 2000.

[21] "Toward a practical perceptual video quality metric," 2016. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html

[22] "Monsoon Solutions Inc., Monsoon Power Monitor," 2018. [Online]. Available: https://www.msoon.com

[23] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, "Network characteristics of video streaming traffic," ACM CoNEXT, 2011.

[24] W. Hu and G. Cao, "Energy-aware video streaming on smartphones," IEEE INFOCOM, 2015.

[25] X. Chen, K. W. Nixon, H. Zhou, Y. Liu, and Y. Chen, "FingerShadow: An OLED Power Optimization Based on Smartphone Touch Interactions," USENIX HotPower, 2014.


Figure 3. Flow of EVSO when a video is uploaded to a streaming server: the macroblock Y (luminance) values produced inside the H.264/MPEG-4 AVC encoder (transcoding, prediction, transform, encode) feed F-Scheduler (Secs. III-A to III-C), whose scheduling results drive V-Processor (Sec. IV); E-MPD (Sec. III-D) exposes the High/Medium/Low variants of each resolution (1080p, 720p, 480p) to the user.

multiple video chunks with similar motion intensity levels (see Section III-B) and schedules the appropriate frame rates for each video chunk according to three battery levels: High, Medium, and Low (see Section III-C). Although our study adopts three battery levels, other settings can also be applied. The scheduling results generated by F-Scheduler specify which frame rate is appropriate for each part of the video. Afterwards, V-Processor processes the original video according to what is specified in the scheduling results to produce videos suitable for the three battery levels. The detailed implementation of V-Processor is described in Section IV. Finally, EMPD allows users to request appropriate video chunks taking into consideration not only the network condition but also the battery status (see Section III-D).

Note that this paper focuses on H.264/AVC as the video codec; however, we emphasize that the proposed EVSO system is not limited solely to H.264/AVC. Since other codecs also perform video compression using the luminance (Y) values of macroblocks in a similar manner to H.264/AVC, EVSO can easily be applied to other codecs.

A. Calculating the Perceptual Similarity Score

The human visual system is more sensitive to changes in brightness (Y) than to changes in colors (Cb and Cr) [9], [14]. Hwang et al. [5] have shown that the Y-Difference (or Y-Diff) reflects changes in visual perception fairly well. Moreover, they found Y-Diff to be highly correlated with the commonly used SSIM method through various experiments and a user study. However, human vision is sensitive not only to brightness but also to object movement [15]. The Y-Diff value between two images is calculated as the Sum of Absolute Differences (SAD) of the luminance values on a per-frame basis, not on a per-block basis. Therefore, local information can be ignored because the detailed features of each block are mixed together. In other words, Y-Diff effectively reflects brightness but can be less accurate with respect to the motion perception of objects. Although Y-Diff can partially reflect some perception of object movement [14], more attention needs to be paid to motion information in order to be more compatible with human perception.

To effectively calculate perceptual similarity scores by considering both brightness and object movement in video streaming environments, we define the Macroblock-Difference

Figure 4. Correlation between Y-Diff and SSIM (Pearson's correlation coefficient: -0.7329).

Figure 5. Correlation between M-Diff and SSIM (Pearson's correlation coefficient: -0.7834).

(or M-Diff), $D_{MB}(f_a, f_b)$, between frames $f_a$ and $f_b$ as follows:

$$D_{MB}(f_a, f_b) = \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} D_Y\big(f_a(i,j), f_b(i,j)\big) \quad (1)$$

$$D_Y\big(f_a(i,j), f_b(i,j)\big) = \begin{cases} 1, & SAD_Y\big(f_a(i,j), f_b(i,j)\big) > \theta \\ 0, & SAD_Y\big(f_a(i,j), f_b(i,j)\big) \le \theta \end{cases} \quad (2)$$

where $f_k(i,j)$ denotes the macroblock at coordinates $(i,j)$ of the $k$-th frame, which consists of $N \times M$ macroblocks; $SAD_Y(f_a(i,j), f_b(i,j))$ is the SAD value between the luminance values of $f_a(i,j)$ and $f_b(i,j)$; and $D_Y(f_a(i,j), f_b(i,j))$ is set to one when that SAD value exceeds a certain threshold $\theta$. The value of $\theta$ is set to 320 because OpenH264 [16] classifies a block as a high-motion block when the SAD value of its luminance exceeds 320 while detecting a scene change. Finding the optimal value of $\theta$ is left as future work.
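As a concrete illustration of Equations (1) and (2), the following is a minimal Python sketch of the M-Diff calculation. It assumes 16x16 macroblocks and frames given as 2-D lists of luminance values; the function names are ours, not identifiers from the EVSO or OpenH264 source.

```python
THETA = 320  # per-block luminance SAD threshold used by OpenH264 scene-change detection

def block_sad(fa, fb, top, left, size=16):
    """Sum of Absolute Differences of Y values over one macroblock."""
    return sum(
        abs(fa[top + r][left + c] - fb[top + r][left + c])
        for r in range(size)
        for c in range(size)
    )

def m_diff(fa, fb, size=16):
    """M-Diff (Eq. 1): count of macroblocks whose SAD exceeds THETA (Eq. 2)."""
    rows, cols = len(fa) // size, len(fa[0]) // size
    return sum(
        1
        for i in range(rows)
        for j in range(cols)
        if block_sad(fa, fb, i * size, j * size, size) > THETA  # D_Y = 1
    )
```

Because the per-block SAD values are already produced by the encoder during compression, a real implementation reuses them instead of recomputing the inner loop.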

EVSO uses M-Diff as its perceptual similarity measurement method to cover both brightness and object movement. M-Diff uses the luminance values of macroblocks to capture brightness perception and handles object movement through block-based calculation: the perceptual similarity is computed per macroblock rather than per frame. Figure 4 shows how closely Y-Diff tracks SSIM in an analysis of adjacent frames of seven videos [8], and Figure 5 shows how closely M-Diff tracks SSIM on the same videos. These figures indicate that M-Diff is highly correlated with SSIM, approximately 5% more strongly than Y-Diff. In other words, M-Diff has an accuracy comparable to or better than Y-Diff in perceptual similarity calculation.

How about using SSIM itself for the perceptual similarity calculation? More than 72 hours of new video content is uploaded to YouTube every minute, and the volume of uploaded videos continues to grow [17]. For such a large number of videos, performing SSIM calculations on consecutive frames is exorbitant in terms of computational time and power. On a desktop PC equipped with a 3.5 GHz processor and 12 GB of memory, it took about 401 milliseconds to calculate the perceptual similarity between two frames with 1920x1080 resolution using SSIM. If SSIM were used to process a 60 frames per second video with a duration of 30 minutes, the time overhead would be approximately 12 hours. In contrast, there is little or no overhead incurred in the M-Diff calculation because it reuses the luminance differences between macroblocks generated by the H.264/AVC encoder when a video is uploaded.

B. Splitting Video into Multiple Video Chunks

In order to apply multiple frame rates to a single video, how to properly split the video into multiple chunks needs to be considered. If a video is split too finely, an appropriate frame rate can be set to suit the characteristics of each video chunk, but it takes a very long time to process the video. In contrast, if a video is split too coarsely, it is difficult to calculate the appropriate frame rate because the characteristic of each video chunk becomes ambiguous, which can adversely impact the user experience.

Estimating the split threshold. A simple approach to splitting the video is to use the M-Diff value in the perceptual similarity score as a separation criterion. For example, if more than 80% of the corresponding macroblocks in two frames exceed the threshold $\theta$, then there is a large enough difference between the two frames, and thus they can be separated. Another approach is to use the local statistical properties of the frame sequence to define a dynamic threshold model; these local properties can include the mean and standard deviation, which capture the degree of change in the frame sequence. However, the above two approaches are used mainly to detect scene changes [16]. In contrast, our approach is to design separation criteria that allow each video chunk to have a similar degree of variability.

We newly define the Estimated Split Threshold (EST) to separate the video into multiple chunks with similar variability levels as follows:

$$EST(f_n) = \begin{cases} 1, & \text{if } (\sigma_n > \alpha \text{ or } D_{MB}(f_{n-1}, f_n) > \beta) \text{ and } T > \gamma \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

$$\sigma_n = \sqrt{\frac{1}{K-1} \sum_{i=n-K+1}^{n-1} \big(D_{MB}(f_i, f_{i+1}) - m_n\big)^2} \quad (4)$$

$$m_n = \frac{1}{K} \sum_{i=n-K+1}^{n-1} D_{MB}(f_i, f_{i+1}) \quad (5)$$

where $m_n$ is the average M-Diff value over the previous $K$ frames, $\sigma_n$ is the standard deviation with window size $K$, representing the degree of M-Diff scattering over the previous $K$ frames, and $EST(f_n)$ is a threshold that determines whether to split. When the value of $EST(f_n)$ is one, EVSO splits at $f_n$. In $EST(f_n)$, $D_{MB}(f_{n-1}, f_n)$ is used to detect clear scene changes. Finally, $T$ is the number of frames in the current video chunk and $\gamma$ is the frame rate of the current video. The values of $\sigma_n$ and $D_{MB}(f_{n-1}, f_n)$ tend to be large around high-motion parts of the video, which can cause the video to split too finely. Therefore, the inequality $T > \gamma$, representing a minimum-length condition, is applied so that each video chunk is at least one second long.

Figure 6. Variation of the processing time according to the constant factors. (a) Constant factor α. (b) Constant factor β.

Determining the α, β, and K factors. The smaller the values of the constant factors α and β, the fewer frames are assigned to each video chunk, so each video chunk can have a more similar level of variability. In the extreme, setting α and β to 0 can be effective for the user experience; however, if α and β are too low, the number of separated video chunks and the processing time of V-Processor increase exponentially. That is, α and β should be selected in consideration of the tradeoff between the computational complexity and the accuracy of the frame rate estimation process.

Figure 6(a) shows the processing time when α is set to 1000, 2000, 3000, and 4000 for the nine experimental videos on a server equipped with 2.20 GHz x 40 processors and 135 GB of memory. As α increases, the total number of video chunks to be separated decreases, reducing the overall processing time. Our findings indicate that in most videos the processing time decreases drastically until α reaches 3000, after which it remains nearly the same. Likewise, Figure 6(b) shows that the processing time decreases as β increases. Moreover, in most of the videos the processing time decreases sharply until β reaches 15000 and then remains almost the same. Consequently, α and β are set to 3000 and 15000, respectively, to allow efficient processing.

If the window size K is set too high, rapid motion changes cannot be detected as separation criteria. K values of 1, 5, 10, and 30 were tested to determine the appropriate value for detecting motion changes in our experimental environment. Based on numerous experiments, K is set to 10. The optimal value of the window size K depends on the characteristics of the video being processed.
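The split rule of Equations (3)-(5) can be sketched as follows. This is an illustrative simplification, assuming the per-pair M-Diff values are given as a flat list (`diffs[n]` standing in for the M-Diff of the n-th adjacent frame pair) and using the paper's reported settings α = 3000, β = 15000, K = 10, and γ equal to the video frame rate; the function name and exact window indexing are ours.

```python
def split_points(diffs, alpha=3000, beta=15000, gamma=30, K=10):
    """Return indices where EST (Eq. 3) fires, i.e., where chunks are cut."""
    splits = []
    t = 0  # T in Eq. (3): frames accumulated in the current chunk
    for n in range(K, len(diffs)):
        t += 1
        window = diffs[n - K:n]            # previous K M-Diff values
        m_n = sum(window) / K              # mean, as in Eq. (5)
        var = sum((d - m_n) ** 2 for d in window) / (K - 1)
        sigma_n = var ** 0.5               # std. deviation, as in Eq. (4)
        # Split on high local variability or a clear scene change, but only
        # once the chunk is at least one second long (t > gamma).
        if (sigma_n > alpha or diffs[n] > beta) and t > gamma:
            splits.append(n)
            t = 0
    return splits
```

A single large M-Diff spike in otherwise static content therefore produces exactly one split, provided at least one second of video has accumulated.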

C. Estimating Frame Rates for Video Chunks

After splitting the video into multiple video chunks as discussed in Section III-B, an appropriate frame rate needs to be determined for each video chunk. Setting the proper frame rate to match the variation of each video chunk is a crucial step in terms of the user experience.

Estimating proper frame rates for video chunks. Based on the analysis of the relationship between M-Diff values and frame rates, we define the Estimated Proper Frame rate (EPF)

Figure 7. The overall hierarchical structure of EMPD: periods (with durations) contain adaptation sets, whose representations (e.g., 1080p/720p at the High, Medium, and Low levels) provide the initialization and media segment URLs.

as follows:

$$EPF(f_n, f_{n+1}) = \begin{cases} s_1 \times \gamma & \text{if } D_{MB}(f_n, f_{n+1}) < \tau_1 \\ s_2 \times \gamma & \text{if } \tau_1 \le D_{MB}(f_n, f_{n+1}) < \tau_2 \\ s_3 \times \gamma & \text{if } \tau_2 \le D_{MB}(f_n, f_{n+1}) < \tau_3 \\ s_4 \times \gamma & \text{if } \tau_3 \le D_{MB}(f_n, f_{n+1}) < \tau_4 \\ s_5 \times \gamma & \text{if } \tau_4 \le D_{MB}(f_n, f_{n+1}) \end{cases} \quad (6)$$

where $s_1$, $s_2$, $s_3$, $s_4$, and $s_5$ are the scaling factors of the frame rate $\gamma$ that match the degree of change between $f_n$ and $f_{n+1}$, and $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ are the scaling thresholds that group M-Diff values with similar variations. $EPF(f_n, f_{n+1})$ is utilized to calculate the appropriate frame rate for adjacent frames; we then define the Estimated Video chunk Frame rate (EVF) to obtain the appropriate frame rate for each video chunk as follows:

$$EVF(S_k) = \frac{\sum_{i=l}^{m-1} EPF(f_i, f_{i+1})}{m - l} + \delta \times \sigma_{S_k} \quad (7)$$

where $S_k$ represents the $k$-th video chunk, i.e., $(f_l, f_{l+1}, f_{l+2}, \dots, f_m)$, for which the appropriate frame rate is to be estimated, and $\sigma_{S_k}$ is the standard deviation of adjacent frames in $S_k$ based on Equation (4), which is used to determine whether the degree of M-Diff variation in the video chunk is constant or anomalous. If $\sigma_{S_k}$ is high, fast- and slow-moving parts coexist in a single video chunk; in this case, the overall frame rate can be adjusted to take the highly variable parts into account for a better user experience. Our experiments were performed with δ equal to 0.0001, but it can be tailored to the specific environment in which EVSO is deployed.

Determining the τ and s factors. To determine the τ factors, we analyzed the relationship between M-Diff and the commonly used SSIM using a linear regression based on Figure 5. The resulting equation is as follows:

$$SSIM(f_n, f_{n+1}) = 1.0063 - 1.5903 \times 10^{-5} \times D_{MB}(f_n, f_{n+1}) \quad (8)$$

The statistical values of $R^2$ (R-squared) and the Pearson Correlation Coefficient (PCC) were calculated to measure the goodness of fit. The $R^2$ and PCC values of this regression model are 0.613 and -0.7834, respectively, indicating that the correlation between SSIM and M-Diff is sufficiently high.

If the SSIM index between two frames is above 0.9, the peak signal-to-noise ratio is above 50 dB, and the mean opinion score, which represents the quality of experience on a scale of 1 to 5, corresponds to 4 (Good) or 5 (Excellent/Identical) [12]. Based on this, we subdivided the SSIM index into 0.99, 0.98, 0.95, and 0.91. According to the regression model in Equation (8), the M-Diff values corresponding to these SSIM indices are 500, 1500, 3000, and 6000, and these values represent $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ in Equation (6), respectively.

The s factors are set differently depending on the battery level. Detailed settings of the s factors are described in Section V.
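Putting Equations (6) and (7) together, the per-chunk frame rate estimation can be sketched as below. The τ thresholds (500, 1500, 3000, 6000) come from the regression above, and the s factors are those of the basic EVSO setting reported in Section V (0.6, 0.83, 0.9, 0.93, 1); the function names and list representation are illustrative only.

```python
TAUS = [500, 1500, 3000, 6000]
S_FACTORS = [0.6, 0.83, 0.9, 0.93, 1.0]  # EVSO setting (Section V)

def epf(m_diff_value, gamma):
    """Estimated Proper Frame rate for one adjacent-frame pair (Eq. 6)."""
    for s, tau in zip(S_FACTORS, TAUS):
        if m_diff_value < tau:
            return s * gamma
    return S_FACTORS[-1] * gamma  # M-Diff >= tau_4: keep the full frame rate

def evf(chunk_diffs, gamma, delta=0.0001):
    """Estimated Video chunk Frame rate (Eq. 7) from a chunk's M-Diff values."""
    mean_epf = sum(epf(d, gamma) for d in chunk_diffs) / len(chunk_diffs)
    mean = sum(chunk_diffs) / len(chunk_diffs)
    var = sum((d - mean) ** 2 for d in chunk_diffs) / max(len(chunk_diffs) - 1, 1)
    return mean_epf + delta * var ** 0.5  # delta * sigma_{S_k} correction
```

A nearly static chunk (M-Diff well below τ₁) is thus scheduled at 0.6γ, while a highly dynamic chunk stays close to the original frame rate γ.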

D. Extending Media Presentation Description (MPD) for User Interaction

Dynamic Adaptive Streaming over HTTP (DASH) [13], also known as MPEG-DASH, is an adaptive bitrate streaming technology that allows a multimedia file to be partitioned into several segments and transmitted to the client over HTTP. In other words, when a video is uploaded, the streaming server must generate several processed videos with adjusted bitrates and resolutions to provide DASH functionality.

A Media Presentation Description (MPD) is a manifest that provides segment information such as the resolutions, bitrates, and URLs of the video contents. The DASH client chooses the segment with the highest resolution and bitrate that the network bandwidth can support and then fetches the corresponding segment through the MPD. EVSO extends the MPD to allow the client to select segments considering the battery status as well as the network bandwidth.

Figure 7 shows the hierarchical structure of the Extended MPD (EMPD), which consists of one or more periods that describe the content duration. Multiple periods can be used when it is necessary to divide content into chapters or to separate advertisements from the main content. Each period consists of one or more adaptation sets, which contain the media streams; a period typically has separate adaptation sets for audio and video for efficient bandwidth management. The adaptation set of the video consists of several representations that contain specific information about alternative contents, such as the resolution and MIME type (or content type). EMPD extends this adaptation set with an EVSOLevel attribute, allowing the client to select the representation that is appropriate for its current battery situation. Finally, the representation provides the URL so that the client can fetch the media segments and play them back.

| Video Name | Length (min) | Resolution (pixel) | Bitrate (Kbps) | Frame rate (fps) | Type |
|---|---|---|---|---|---|
| NBC News Conference | 2.29 | 1920x1080 | 2130 | 29.98 | A |
| MIT Course | 3.32 | 1920x1080 | 1191 | 29.97 | A |
| Golf - 2017 Bank of Hope Founders Cup | 1.06 | 1920x1080 | 2256 | 29.97 | A |
| PyeongChang Olympic | 1.24 | 1920x1080 | 2908 | 30.00 | B |
| Conan Show | 3.1 | 1920x1080 | 3755 | 29.98 | B |
| Kershaw Baseball | 1.15 | 1280x720 | 1946 | 29.97 | B |
| Pororo Animation | 15.31 | 1920x1080 | 1085 | 29.97 | C |
| Tennis - Australian Open 2018 | 10.04 | 1280x720 | 3075 | 59.94 | C |
| National Geographic | 13.53 | 1280x720 | 3115 | 59.94 | C |

Table I. The characteristics of the videos used for the experiments. The bold word in each video name is used to represent that video in the experiments.
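The client-side selection logic enabled by EMPD can be illustrated with the following sketch. The real EMPD is an XML manifest interpreted by the modified ExoPlayer client; the dictionary fields (`evso_level`, `bitrate_kbps`) and the battery thresholds (60% / 30%) used here are our illustrative assumptions, not values specified in the paper.

```python
def pick_representation(representations, bandwidth_kbps, battery_pct):
    """Choose a representation by bandwidth AND battery, per the EMPD idea."""
    # Map the battery percentage onto EVSO's three processing levels
    # (threshold values are hypothetical).
    if battery_pct > 60:
        level = "High"
    elif battery_pct > 30:
        level = "Medium"
    else:
        level = "Low"
    # Among representations at that level, take the highest bitrate that
    # still fits within the measured network bandwidth (standard DASH logic).
    candidates = [
        r for r in representations
        if r["evso_level"] == level and r["bitrate_kbps"] <= bandwidth_kbps
    ]
    return max(candidates, key=lambda r: r["bitrate_kbps"]) if candidates else None
```

The only departure from plain DASH is the extra filter on the battery-dependent level; bitrate/resolution adaptation is unchanged.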

IV. IMPLEMENTATION

In order to study the effectiveness of EVSO, we implemented a video streaming server with DASH functionality using IIS [10]. The video streaming server sends the EMPD to the client to allow users to select the appropriate video. We also implemented a video streaming client by modifying ExoPlayer [18], an open-source media player for Android, to interpret the EMPD manifest information taking into account the battery status and network bandwidth.

Frame rate Scheduler (F-Scheduler). The source code of OpenH264 [16] was modified to determine which frame rate is appropriate for each video chunk. The SAD values of the macroblocks are extracted while the H.264/AVC encoder performs the video compression process. These extracted SAD values are used to analyze the motion variation of the video and to set the appropriate frame rates accordingly.

Video Processor (V-Processor). After the streaming server processes the uploaded video to generate videos with various resolutions and bitrates for the DASH service, V-Processor additionally processes the videos according to the schedule provided by F-Scheduler. This is achieved using the seeking and concatenate methods of FFmpeg [19] to generate videos with adaptive frame rates. The seeking method is used to split the video into multiple video chunks, and the frame rate of each video chunk is adjusted to match the schedule. Finally, the video chunks with different frame rates are combined into a single video through the concatenate method.
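The seek-then-concatenate pipeline described above could be driven roughly as follows. This is a sketch under assumptions: the schedule is taken as (start_sec, duration_sec, fps) tuples from F-Scheduler, and `build_commands` plus the chunk/output/list file names are hypothetical; only `-ss`/`-t` (seeking), `-r` (frame rate), and the concat demuxer are actual FFmpeg options.

```python
def build_commands(src, schedule):
    """Build FFmpeg command lines: one re-encode per chunk, then a concat."""
    cmds = []
    for idx, (start, dur, fps) in enumerate(schedule):
        # Cut one chunk out of the source and re-encode it at the scheduled
        # frame rate decided by F-Scheduler.
        cmds.append(["ffmpeg", "-ss", str(start), "-t", str(dur), "-i", src,
                     "-r", str(fps), f"chunk_{idx}.mp4"])
    # Join the processed chunks with the concat demuxer, which reads the
    # chunk file names from a list file and copies streams without re-encoding.
    cmds.append(["ffmpeg", "-f", "concat", "-safe", "0",
                 "-i", "chunks.txt", "-c", "copy", "out.mp4"])
    return cmds
```

In the actual system this logic is integrated into the encoder-side pipeline rather than run as separate command invocations, but the chunk/re-rate/concatenate structure is the same.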

Note that V-Processor is integrated into the H.264/AVC encoder for performance reasons. The implementation location of V-Processor can be changed depending on the deployment situation.

V. EVALUATION

We evaluated the performance of the EVSO system by determining: 1) how much the frame rate can be reduced; 2) how well the quality of the processed videos is maintained; 3) how much total energy can be saved; 4) how much the processed videos affect the user experience; and 5) how much overhead is caused by EVSO.

| Video | EVSO | EVSO+ | EVSO++ |
|---|---|---|---|
| NBC | 24.82 | 22.44 | 19.18 |
| MIT | 19.35 | 16.51 | 14.27 |
| Golf | 23.47 | 21.41 | 18.86 |
| PyeongChang | 27.64 | 26.78 | 24.06 |
| Conan | 25.95 | 24.46 | 21.84 |
| Kershaw | 26.36 | 24.44 | 21.64 |
| Pororo | 25.51 | 23.75 | 20.83 |
| Tennis | 50.94 | 47.82 | 42.17 |
| Geographic | 53.38 | 49.44 | 42.63 |

Table II. The average frame rate of the videos processed through EVSO.

Videos used for the experiments. EVSO was evaluated on nine videos of various categories such as sports, talk shows, lectures, etc. (available at [8]). As shown in Table I, these videos have various frame rates and resolutions in order to represent an environment similar to an actual streaming service. In addition, the videos are grouped into three types according to the degree of motion intensity: (A) Static, (B) Dynamic, and (C) Hybrid. The static group consists of videos that display almost identical scenes, such as lectures. The dynamic group consists of videos with a high level of motion intensity, such as sports. The hybrid group consists of videos that have the attributes of both of the aforementioned groups. Unless otherwise noted, we conducted experiments on the videos processed to play for up to three minutes.

Four different settings used for the experiments (EVSO, EVSO+, EVSO++, 2/3 FPS). We configured EPF with three different settings according to the user's battery status, i.e., EVSO, EVSO+, and EVSO++. The worse the battery condition, the more aggressively EVSO is applied (denoted by more + signs), allowing for a longer battery lifetime. For this reason, through multiple experiments, the values of s1, s2, s3, s4, and s5 in Equation (6) are set to 0.6, 0.83, 0.9, 0.93, and 1 for EVSO; 0.5, 0.73, 0.83, 0.9, and 1 for EVSO+; and 0.43, 0.6, 0.7, 0.8, and 0.93 for EVSO++, respectively. As a comparison, we also created an experimental group called 2/3 FPS that naively reduces the frame rate of the original video to two-thirds.

Videos processed through EVSO. Table II shows the average frame rates of the videos processed according to the battery levels. Since EVSO adaptively reduces the frame rate in consideration of the characteristics of the video, the resulting frame rate differs considerably across videos. For example, the MIT video has little change in motion and is therefore reduced by more than one-third of the original frame rate with EVSO. On the other hand, the Geographic video has relatively fast-changing motion, so EVSO reduces the frame rate of the video to only about five-sixths of the original frame rate.

A. Video Quality Assessment

We used three objective quality metrics to measure how much EVSO affects the quality of the videos: SSIM [11], the Video Quality Metric (VQM) [20], and Video Multimethod Assessment Fusion (VMAF) [21]. SSIM is an appropriate metric for calculating similarity between still images.

| Video | EVSO SSIM | EVSO VQM | EVSO VMAF | EVSO+ SSIM | EVSO+ VQM | EVSO+ VMAF | EVSO++ SSIM | EVSO++ VQM | EVSO++ VMAF | 2/3 FPS SSIM | 2/3 FPS VQM | 2/3 FPS VMAF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NBC | 99.53 | 0.730 | 99.61 | 99.41 | 0.814 | 99.33 | 99.06 | 0.998 | 98.92 | 98.52 | 1.137 | 98.42 |
| MIT | 99.68 | 0.235 | 99.30 | 99.63 | 0.256 | 99.05 | 99.56 | 0.289 | 98.69 | 99.52 | 0.318 | 98.65 |
| Golf | 98.64 | 0.515 | 98.66 | 98.50 | 0.563 | 98.26 | 98.03 | 0.711 | 97.09 | 96.95 | 0.925 | 94.91 |
| PyeongChang | 99.29 | 0.366 | 99.27 | 99.12 | 0.431 | 98.77 | 97.87 | 0.847 | 95.63 | 94.20 | 1.527 | 85.57 |
| Conan | 99.04 | 0.818 | 98.77 | 98.76 | 0.942 | 97.76 | 98.20 | 1.106 | 96.46 | 96.66 | 1.513 | 92.82 |
| Kershaw | 98.54 | 0.707 | 98.81 | 98.18 | 0.839 | 98.20 | 96.60 | 1.324 | 95.72 | 93.57 | 1.866 | 91.01 |
| Pororo | 98.92 | 0.630 | 97.27 | 98.50 | 0.780 | 96.07 | 97.62 | 1.068 | 93.33 | 95.94 | 1.439 | 90.47 |
| Tennis | 98.90 | 0.806 | 99.02 | 98.76 | 0.892 | 98.62 | 98.42 | 1.051 | 97.29 | 97.93 | 1.249 | 95.32 |
| Geographic | 99.12 | 0.860 | 99.34 | 98.94 | 0.962 | 98.80 | 98.73 | 1.095 | 97.99 | 98.45 | 1.199 | 97.16 |

Table III. Video quality scores according to the metrics of SSIM, VQM, and VMAF. SSIM and VMAF are expressed as percentages.

Figure 8. The percentage of energy savings while watching each video by streaming (EVSO, EVSO++, and 2/3 FPS).

However, because our experiments focus on videos, we also used VQM and VMAF; these metrics are better suited to measuring the subjective quality of video than SSIM. A low VQM value indicates a high-quality video. The VMAF index, on the other hand, ranges from 0 to 1, and a high VMAF value indicates a high-quality video.

Table III shows the quality of the videos processed through EVSO compared to 2/3 FPS. As can be seen, EVSO provides better overall video quality than 2/3 FPS in all cases. For the static group, both 2/3 FPS and EVSO maintain high quality because the motion intensity in these videos is comparatively low. The most prominent gap between EVSO and 2/3 FPS occurs in the dynamic group, where 2/3 FPS causes the VMAF metric to fall below 90%, indicating a severe degradation of the user experience. In addition, for the Kershaw video, the average frame rates of EVSO++ and 2/3 FPS are similar, at 21.66 and 20 respectively, whereas the VMAF metric of EVSO++ is about 5% higher than that of 2/3 FPS. This is because EVSO adjusts the frame rate by analyzing each part of the video, so the video quality can be kept much higher than with 2/3 FPS, which naively reduces the frame rate.

B. Energy Saving

We used a Monsoon Power Monitor [22] to measure the energy consumption of the processed videos on an LG Nexus 5. The brightness of the smartphone was set to 30%, airplane mode was activated to reduce the effect of external variables, and Wi-Fi was turned on to stream videos from the streaming server. To prevent thermal throttling caused by overheating, the smartphone was cooled down before each experiment.

We evaluated the energy consumption of the nine videos in four different settings: Baseline, EVSO, EVSO++, and 2/3 FPS. Baseline is the original video and serves as the control group; EVSO, EVSO++, and 2/3 FPS use the same settings as described above. For accuracy, the energy consumption of each video was measured five times and then averaged.

Figure 8 shows the ratio of energy savings when comparing Baseline to EVSO, EVSO++, and 2/3 FPS. Because EVSO adjusts the frame rate according to the motion intensity, the amount of energy saved varies considerably depending on the video characteristics. For example, the energy used by the MIT video with EVSO was reduced by about 27%, but for the Tennis video the reduction was only about 11%. However, if the current battery status is bad, EVSO++ can be used to reduce the energy requirement to nearly that of the 2/3 FPS group (except for the PyeongChang video), with an average reduction of 22% compared to Baseline.

C. User Study

To analyze how the videos processed through EVSO affect the user experience, we recruited 15 participants (10 males) for a user study. We adopted the Double Stimulus Impairment Scale (DSIS) and the Double Stimulus Continuous Quality Scale (DSCQS), which are widely used for subjective quality assessments [5], [7]. For the sake of experimentation, participants watched one minute of each video.

DSIS. DSIS has the participants sequentially view two stimuli, i.e., the source video and the processed video, to assess the degree of impairment of the processed video. The evaluation index of DSIS is a five-point impairment scale: imperceptible (5 points), perceptible but not annoying (4 points), slightly annoying (3 points), annoying (2 points), and very annoying (1 point). In DSIS, the participants are informed about which videos are original and which are processed.

We measured how much impairment occurred in the videos processed through EVSO. The participants were asked to

Figure 9. Results of the user study through DSIS and DSCQS. (a) Quality rating from DSIS. (b) Quality rating from DSCQS. (c) Quality rating distribution from DSCQS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they watched the original video first and then the processed video.

Figure 9(a) shows the impairment ratings of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying". This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three types of video received high scores indicates that EVSO properly accounts for the motion intensity characteristics of the videos.

DSCQS. DSCQS has the participants view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the four different settings in random order and evaluated the quality of each one; they were allowed to rewatch previously viewed videos if they wished. Figures 9(b) and (c) show the quality ratings and the quality rating distribution from DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants clearly identified 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to that of 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate with EVSO leads to better quality than naively reducing the frame rate.

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz x 40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating each session five times for accuracy.

Scheduling overhead. We measured how much scheduling overhead occurs when the Frame rate Scheduler (F-Scheduler) is built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about 21.85 seconds and 22.02 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is thus only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead. We measured how much processing overhead occurs when the Video Processor (V-Processor) processes a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video into various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes the videos for the three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively, indicating that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is handled internally by the streaming server and is a preliminary task performed before the video is streamed to users. In addition, the time overhead can be significantly reduced if V-Processor follows a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead. Because the EVSO system generates additional videos for the three battery levels, the amount of video that needs to be stored is nearly three times that of a conventional streaming server. This may strain the storage requirement of the streaming server; however, this problem can also be alleviated if V-Processor is performed selectively, only on popular videos, as described above.

VI. RELATED WORK

A. Energy Saving when Streaming Videos

Lim et al. [4] extended the H.264/AVC encoder to provide a frame-skipping scheme during compression and transmission for mobile screen-sharing applications. This scheme can reduce the energy consumption of mobile devices without significantly affecting the user experience, as similar frames are skipped during compression and transmission. In addition, because it works on the server side, this scheme does not transmit unnecessary frames to the mobile device, with the added benefit of increased available network bandwidth on the client side. However, it incurs a high computational overhead because it calculates the similarity of all frames with the SSIM method [5]. In contrast, EVSO proposes a novel M-Diff method specialized for a video encoder and provides a flexible way to adjust the frame rate according to the current battery status of the mobile device.

Since the wireless interface consumes a considerable amount of energy on the mobile device, there have been several efforts to reduce the energy consumption by utilizing a playback buffer when streaming videos [23], [24]. These approaches download a large amount of data in advance and store it in the player's playback buffer to increase the idle time of the wireless interface. These approaches are interoperable with EVSO. However, if users frequently skip or quit while watching the video, network bandwidth can be wasted because the previously downloaded data is no longer used.

Kim et al. [7] adjusted the refresh rate of the screen by defining a content rate, which is the number of meaningful frames per second. If the content rate is low, this scheme reduces the refresh rate, which lowers the energy requirement without significantly affecting the user experience. However, the energy usage of the wireless interface, which is the main energy-consuming factor in video streaming, cannot be reduced at all because only the refresh rate on the user side is adjusted.

B. Energy Saving for Mobile Games

Hwang et al. [5] introduced a rate-scaling technique called RAVEN that skips similar frames in the rendering loop while playing games. This system uses the Y-Diff method because the SSIM method has high computational overhead. The authors showed that Y-Diff has accuracy comparable to SSIM in human visual perception. On the other hand, EVSO proposes a novel M-Diff method specialized for a video streaming service. Moreover, RAVEN cannot be applied to a video streaming service because it performs rate-scaling in the rendering loop for mobile games.

Chen et al. [25] proposed a method called FingerShadow that reduces the power consumption of the mobile device by applying a local dimming technique to the screen areas covered by the user's fingers while playing a game. However, this approach depends on external interaction with user behavior. In contrast, EVSO is independent of external factors and performs power savings using the motion intensity information of the video.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose EVSO, a system that opportunistically applies adaptive frame rates to videos without requiring any user effort. We introduce a similarity calculation method specialized for a video encoder and present a novel scheduling technique that assigns an appropriate frame rate to each video chunk. Our experiments show that EVSO reduces the power requirement by as much as 27% on mobile devices with little degradation of the user experience.

As future work, we plan to optimize various factors used in EVSO. One potential direction is to use reinforcement learning to optimize decisions and continue learning through policy training. Another direction is to include more videos with various frame rates in the experiments.

ACKNOWLEDGMENT

We thank the anonymous reviewers as well as our colleague Jaehyun Park for their valuable comments and suggestions. This research was a part of the project titled "SMART-Navigation project" funded by the Ministry of Oceans and Fisheries, Korea. This research was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2017R1A2B4005865).

REFERENCES

[1] "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016-2021 White Paper," 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html
[2] "Recommended upload encoding settings," 2018. [Online]. Available: https://support.google.com/youtube/answer/1722171
[3] "Samsung Galaxy S series," 2018. [Online]. Available: https://www.samsung.com/levant/smartphones/
[4] K. W. Lim, J. Ha, P. Bae, J. Ko, and Y. B. Ko, "Adaptive frame skipping with screen dynamics for mobile screen sharing applications," IEEE Systems Journal, vol. 12, no. 2, pp. 1577-1588, 2016.
[5] C. Hwang, S. Pushp, C. Koh, J. Yoon, Y. Liu, S. Choi, and J. Song, "RAVEN: Perception-aware Optimization of Power Consumption for Mobile Games," ACM MobiCom, 2017.
[6] "Game Tuner," 2018. [Online]. Available: https://play.google.com/store/apps/details?id=com.samsung.android.gametunerthin
[7] D. Kim, N. Jung, Y. Chon, and H. Cha, "Content-centric energy management of mobile displays," IEEE Transactions on Mobile Computing, vol. 15, no. 8, pp. 1925-1938, 2016.
[8] "Experimental videos," 2018. [Online]. Available: https://www.youtube.com/playlist?list=PL6zYoyhvrmRrmY6JNGVJhKjJdbJJZzy9z
[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.
[10] "Internet Information Services," 2018. [Online]. Available: https://www.iis.net
[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[12] E. Cuervo, A. Wolman, L. P. Cox, K. Lebeck, A. Razeen, S. Saroiu, and M. Musuvathi, "Kahawai: High-quality mobile gaming using GPU offload," ACM MobiSys, 2015.
[13] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," Proceedings of the Second Annual ACM Conference on Multimedia Systems, 2011.
[14] M. S. Livingstone, Vision and Art: The Biology of Seeing. Wadsworth, 2002.
[15] E. B. Goldstein and J. Brockmole, Sensation and Perception. Cengage Learning, 2016.
[16] "OpenH264," 2018. [Online]. Available: http://www.openh264.org
[17] "Data never sleeps 2.0," 2018. [Online]. Available: https://www.domo.com/blog/data-never-sleeps-2-0
[18] "GitHub - ExoPlayer," 2018. [Online]. Available: https://github.com/google/ExoPlayer
[19] "FFmpeg," 2018. [Online]. Available: https://ffmpeg.org
[20] F. Xiao et al., "DCT-based video quality evaluation," Final Project for EE392J, vol. 769, 2000.
[21] "Toward a practical perceptual video quality metric," 2016. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html
[22] "Monsoon Solutions Inc., Monsoon power monitor," 2018. [Online]. Available: https://www.msoon.com
[23] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, "Network characteristics of video streaming traffic," ACM CoNEXT, 2011.
[24] W. Hu and G. Cao, "Energy-aware video streaming on smartphones," IEEE INFOCOM, 2015.
[25] X. Chen, K. W. Nixon, H. Zhou, Y. Liu, and Y. Chen, "FingerShadow: An OLED Power Optimization Based on Smartphone Touch Interactions," USENIX HotPower, 2014.


and 12 GB of memory, it took about 401 milliseconds to calculate the perceptual similarity between two frames with 1920x1080 resolution using SSIM. If SSIM is used to process a video with 60 frames per second for a duration of 30 minutes, the time overhead is approximately 12 hours. On the other hand, there is little or no overhead incurred in the M-Diff calculation because it uses the luminance differences between macroblocks generated by the H.264/AVC encoder when a video is uploaded.
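To see why the M-Diff inputs are essentially free, consider what they aggregate: per-macroblock sums of absolute luminance differences (SAD), which the H.264/AVC encoder already produces during motion estimation. The sketch below recomputes such SADs directly for illustration only; the 16x16 macroblock size is standard, but the aggregation into a single per-frame-pair score is our assumption, not the paper's implementation.

```python
def mdiff(prev_y, curr_y, mb=16):
    # Sum of per-macroblock luminance SADs between two frames, given as
    # 2-D lists of 8-bit luma samples. Illustrative only: EVSO reads these
    # SADs out of the H.264/AVC encoder instead of recomputing them.
    h, w = len(prev_y), len(prev_y[0])
    total = 0
    for by in range(0, h - h % mb, mb):        # iterate macroblock rows
        for bx in range(0, w - w % mb, mb):    # iterate macroblock columns
            sad = 0
            for y in range(by, by + mb):
                for x in range(bx, bx + mb):
                    sad += abs(prev_y[y][x] - curr_y[y][x])
            total += sad
    return total
```

Because this is a single pass of integer subtractions, it is far cheaper per pixel than SSIM's windowed statistics, and in EVSO even this pass is folded into encoding work that happens anyway.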

B. Splitting Video into Multiple Video Chunks

In order to apply multiple frame rates to a single video, how to properly split the video into multiple chunks needs to be considered. If a video is split too finely, an appropriate frame rate can be set to suit the characteristics of each video chunk, but it takes a very long time to process the video. In contrast, if a video is split too coarsely, it is difficult to calculate the appropriate frame rate because the characteristic of each video chunk becomes ambiguous, which can adversely impact the user experience.

Estimating the split threshold. A simple approach to splitting the video is to use the M-Diff value in the perceptual similarity score as a separation criterion. For example, if more than 80% of the corresponding macroblocks in two frames exceed the threshold θ, then there is a large enough difference between the two frames, and thus they can be separated. Another approach is to use the local statistical properties of the frame sequence to define a dynamic threshold model; these local properties can include the mean and standard deviation to determine the degree of change in the frame sequence. However, the above two approaches are used mainly to detect scene changes [16]. In contrast, our approach is to design separation criteria that allow each video chunk to have a similar degree of variability.

We newly define the Estimated Split Threshold (EST) to separate the video into multiple chunks with similar variability levels as follows:

EST(f_n) = \begin{cases} 1 & \text{if } (\sigma_n > \alpha \text{ or } D_{MB}(f_{n-1}, f_n) > \beta) \text{ and } T > \gamma \\ 0 & \text{otherwise} \end{cases} \quad (3)

\sigma_n = \sqrt{\frac{1}{K-1} \sum_{i=n-K+1}^{n-1} \left( D_{MB}(f_i, f_{i+1}) - m_n \right)^2} \quad (4)

m_n = \frac{1}{K} \sum_{i=n-K+1}^{n-1} D_{MB}(f_i, f_{i+1}) \quad (5)

where m_n is the average M-Diff value of the previous K frames, σ_n is the standard deviation with window size K, representing the degree of M-Diff scattering over the previous K frames, and EST(f_n) determines whether to split. When the value of EST(f_n) is one, EVSO decides to split at f_n. In EST(f_n), D_{MB}(f_{n-1}, f_n) is used to detect clear scene changes. Finally, T is the number of frames in the current video chunk, and γ is the frame rate of the current video. The values of σ_n and

Figure 6. Variation of the processing time according to the constant factors: (a) constant factor α; (b) constant factor β.

D_{MB}(f_{n-1}, f_n) tend to be large around the high-motion parts of the video, which can cause the video to split too finely. Therefore, the inequality T > γ, representing the minimum condition, is applied so that each video chunk is at least one second long.

Determining the α, β, and K factors. The smaller the values of the constant factors α and β, the fewer frames are assigned to each video chunk, so each video chunk can have a more similar level of variability. In the extreme, setting α and β to 0 can be effective for the user experience; however, if α and β are too low, the number of separated video chunks and the processing time of V-Processor increase exponentially. That is, α and β should be appropriately selected in consideration of the tradeoff between the computational complexity and the accuracy of the frame rate estimation process.

Figure 6(a) shows the processing time when α is set to 1000, 2000, 3000, and 4000 with the nine experimental videos on a server equipped with 2.20 GHz x 40 processors and 135 GB of memory. As α increases, the total number of video chunks to be separated decreases, reducing the overall processing time. Our findings indicate that in most videos the processing time decreases drastically until α reaches 3000, after which it remains nearly the same. Likewise, Figure 6(b) shows that the processing time decreases as β increases. Moreover, in most of the videos the processing time decreases sharply until β reaches 15000 and then remains almost the same. Consequently, α and β are set to 3000 and 15000, respectively, to allow efficient processing.

If the window size K is set too high, rapid motion changes cannot be detected as separation criteria. K values of 1, 5, 10, and 30 were tested to determine the appropriate value for detecting motion changes in our experimental environment. Based on numerous experiments, K is set to 10. The optimal value of the window size K depends on the characteristics of the video being processed.
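The splitting rule of Equations (3)-(5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the input `mdiffs` holds one M-Diff value per adjacent frame pair (`mdiffs[i]` stands for D_MB(f_i, f_{i+1})), and the chunk-length bookkeeping for the T > γ condition is our own.

```python
from math import sqrt

def split_points(mdiffs, alpha, beta, gamma, K=10):
    # Split at frame n when the windowed M-Diff deviation sigma_n exceeds
    # alpha, or the frame-to-frame M-Diff exceeds beta (a scene change),
    # provided the current chunk already holds more than gamma frames,
    # i.e. at least one second of video (Equations (3)-(5)).
    splits, chunk_len = [], 0
    for n in range(K, len(mdiffs)):
        window = mdiffs[n - K:n]                 # previous K M-Diff values
        m_n = sum(window) / K                    # Equation (5)
        sigma_n = sqrt(sum((d - m_n) ** 2 for d in window) / (K - 1))  # Eq. (4)
        chunk_len += 1
        if (sigma_n > alpha or mdiffs[n - 1] > beta) and chunk_len > gamma:
            splits.append(n)                     # EST(f_n) = 1: split here
            chunk_len = 0
    return splits
```

With the paper's settings (α = 3000, β = 15000, K = 10), a single M-Diff spike produces a split at the spike and the windowed deviation keeps the surrounding high-variability frames in short chunks, which matches the intent of grouping frames with a similar degree of variability.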

C. Estimating Frame Rates for Video Chunks

After splitting the video into multiple video chunks based on the discussion presented in Section III-B, an appropriate frame rate needs to be determined for each video chunk. Setting the proper frame rate to match the variation of each video chunk is a crucial step in terms of the user experience.

Estimating proper frame rates for video chunks. Based on the analysis of the relationship between the M-Diff values and frame rates, we define the Estimated Proper Frame rate (EPF)

Figure 7. The overall hierarchical structure of EMPD.

as follows:

EPF(f_n, f_{n+1}) = \begin{cases} s_1 \times \gamma & \text{if } D_{MB}(f_n, f_{n+1}) < \tau_1 \\ s_2 \times \gamma & \text{if } \tau_1 \le D_{MB}(f_n, f_{n+1}) < \tau_2 \\ s_3 \times \gamma & \text{if } \tau_2 \le D_{MB}(f_n, f_{n+1}) < \tau_3 \\ s_4 \times \gamma & \text{if } \tau_3 \le D_{MB}(f_n, f_{n+1}) < \tau_4 \\ s_5 \times \gamma & \text{if } \tau_4 \le D_{MB}(f_n, f_{n+1}) \end{cases} \quad (6)

where s_1, s_2, s_3, s_4, and s_5 are the scaling factors of the frame rate γ that match the degree of change between f_n and f_{n+1}, and τ_1, τ_2, τ_3, and τ_4 are the scaling thresholds that distinguish M-Diff values based on similar variations. EPF(f_n, f_{n+1}) is utilized to calculate the appropriate frame rate for adjacent frames, and then we define the Estimated Video chunk Frame rate (EVF) to obtain the appropriate frame rate for each video chunk as follows:

EVF(S_k) = \frac{\sum_{i=l}^{m-1} EPF(f_i, f_{i+1})}{m - l} + \delta \times \sigma_{S_k} \quad (7)

where S_k represents the k-th video chunk, i.e., (f_l, f_{l+1}, f_{l+2}, ..., f_m), for which the appropriate frame rate is estimated, and σ_{S_k} is the standard deviation of adjacent frames in S_k based on Equation (4), which is used to determine whether the degree of M-Diff variation in the video chunk is constant or anomalous. If σ_{S_k} is high, it means that fast and slow moving parts coexist in a single video chunk. In this case, the overall frame rate can be adjusted to take into account the highly variable parts for a better user experience. Our experiments were performed with δ equal to 0.0001, but it can be tailored to the specific environment in which EVSO is deployed.

Determining the τ and s factors. To determine the τ factors, we analyzed the relationship between M-Diff and the commonly used SSIM using a linear regression based on Figure 5. The resulting equation is as follows:

SSIM(f_n, f_{n+1}) = 1.0063 - 1.5903 \times 10^{-5} \times D_{MB}(f_n, f_{n+1}) \quad (8)

The statistical values of R² (R-squared) and the Pearson Correlation Coefficient (PCC) were calculated to measure the goodness of fit. The R² and PCC values of this regression model are 0.613 and -0.7834, respectively, indicating that the correlation between SSIM and M-Diff is sufficiently high.

If the SSIM index between two frames is above 0.9, the peak signal-to-noise ratio is above 50 dB, and the mean opinion score, which represents the quality of experience on a scale of 1 to 5, corresponds to 4 (Good) or 5 (Excellent, Identical) [12]. Based on this, we subdivided the SSIM index into 0.99, 0.98, 0.95, and 0.91. According to the regression model in Equation (8), the M-Diff values corresponding to these SSIM indices are 500, 1500, 3000, and 6000, and these values represent τ_1, τ_2, τ_3, and τ_4 in Equation (6), respectively.

The s factors are set differently depending on the battery level. Detailed settings of the s factors are described in Section V.
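Putting Equations (6) and (7) together, the per-chunk frame rate estimate can be sketched as below. This is an illustration, not the paper's code: the τ thresholds are the values derived from the regression above, and the s factors are those reported for the default EVSO battery level in Section V.

```python
from math import sqrt

TAUS = (500, 1500, 3000, 6000)          # tau_1..tau_4 from the SSIM regression
SCALES = (0.6, 0.83, 0.9, 0.93, 1.0)    # s_1..s_5 for the default EVSO level

def epf(d_mb, gamma, taus=TAUS, scales=SCALES):
    # Equation (6): map an M-Diff value onto a scaled frame rate.
    for tau, s in zip(taus, scales):
        if d_mb < tau:
            return s * gamma
    return scales[-1] * gamma           # d_mb >= tau_4: keep the full rate

def evf(chunk_mdiffs, gamma, delta=0.0001):
    # Equation (7): average per-frame EPF over the chunk, plus a penalty
    # proportional to the chunk's M-Diff standard deviation so that chunks
    # mixing fast and slow motion keep a higher frame rate.
    n = len(chunk_mdiffs)
    mean_epf = sum(epf(d, gamma) for d in chunk_mdiffs) / n
    m = sum(chunk_mdiffs) / n
    sigma = sqrt(sum((d - m) ** 2 for d in chunk_mdiffs) / (n - 1)) if n > 1 else 0.0
    return mean_epf + delta * sigma
```

For a 30 fps chunk whose M-Diff values all sit below τ_1, this yields 0.6 x 30 = 18 fps, while a chunk with M-Diff above τ_4 keeps the full 30 fps.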

D. Extending Media Presentation Description (MPD) for User Interaction

Dynamic Adaptive Streaming over HTTP (DASH) [13], also known as MPEG-DASH, is an adaptive bitrate streaming technology that allows a multimedia file to be partitioned into several segments and transmitted to the client over HTTP. In other words, when a video is uploaded, the streaming server must generate several processed videos by adjusting the bitrate and resolution to provide DASH functionality.

The Media Presentation Description (MPD) is a manifest that provides segment information such as the resolutions, bitrates, and URLs of the video contents. The DASH client chooses the segment with the highest resolution and bitrate that can be supported by the network bandwidth and then fetches the corresponding segment through the MPD. EVSO extends the MPD to allow the client to select segments considering the battery status as well as the network bandwidth.

Figure 7 shows the hierarchical structure of the Extended MPD (EMPD), which consists of one or more periods that describe the content duration. Multiple periods can be used when it is necessary to classify multiple videos chapter by chapter or to separate advertisements and contents. Each period consists of one or more adaptation sets, which include media streams. The period typically consists of separate adaptation sets for audio and video for efficient bandwidth management. The adaptation set of the video consists of several representations that contain specific information about alternative contents, such as the resolution and MIME type (or content type). EMPD extends this adaptation set to include an EVSOLevel attribute, allowing the client to select the representation that is appropriate for its current battery situation. Finally, the

| Video Name | Length (min) | Resolution (pixel) | Bitrate (Kbps) | Frame rate (fps) | Type |
|---|---|---|---|---|---|
| NBC News Conference | 2:29 | 1920x1080 | 2130 | 29.98 | A |
| MIT Course | 3:32 | 1920x1080 | 1191 | 29.97 | A |
| Golf - 2017 Bank of Hope Founders Cup | 1:06 | 1920x1080 | 2256 | 29.97 | A |
| PyeongChang Olympic | 1:24 | 1920x1080 | 2908 | 30.00 | B |
| Conan Show | 0:31 | 1920x1080 | 3755 | 29.98 | B |
| Kershaw Baseball | 1:15 | 1280x720 | 1946 | 29.97 | B |
| Pororo Animation | 15:31 | 1920x1080 | 1085 | 29.97 | C |
| Tennis - Australian Open 2018 | 10:04 | 1280x720 | 3075 | 59.94 | C |
| National Geographic | 13:53 | 1280x720 | 3115 | 59.94 | C |

Table I. The characteristics of the videos used for the experiments. The bold word in the video name is used to represent that video in the experiments.

representation provides the URL so that the client can fetch the media segments and play them back.
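For illustration, a trimmed EMPD fragment might look like the following. The EVSOLevel attribute is the extension described in the text, and the initialization segment name is taken from Figure 7; the surrounding element layout follows standard MPD conventions, and the remaining concrete values (durations, IDs, segment names) are invented for this sketch.

```xml
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period duration="PT5M">
    <!-- EVSOLevel marks which battery level this adaptation set targets -->
    <AdaptationSet mimeType="video/mp4" EVSOLevel="2">
      <Representation id="6" width="1280" height="720" bandwidth="5000000">
        <SegmentList duration="4">
          <Initialization sourceURL="MIT_course_1280x720_5000k_low_dash_init.mp4"/>
          <SegmentURL media="segment-1.m4s"/>
          <SegmentURL media="segment-2.m4s"/>
        </SegmentList>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
```

A client that is low on battery would pick an adaptation set whose EVSOLevel matches its battery status, then apply the usual bandwidth-based representation selection within that set.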

IV. IMPLEMENTATION

In order to study the effectiveness of EVSO, we implemented a video streaming server with DASH functionality using IIS [10]. The video streaming server sends the EMPD to the client to allow users to select the appropriate video. We also implemented a video streaming client by modifying ExoPlayer [18], an open-source media player for Android, to interpret the EMPD manifest information taking into account the battery status and network bandwidth.

Frame rate Scheduler (F-Scheduler). The source code of OpenH264 [16] was modified to determine which frame rate is appropriate for each video chunk. The SAD values of the macroblocks are extracted when the H.264/AVC encoder performs the video compression process. These extracted SAD values are utilized to analyze the motion variation of the video and to set the appropriate frame rates accordingly.

Video Processor (V-Processor). After the streaming server processes the uploaded video to generate videos with various resolutions and bitrates for the DASH service, V-Processor additionally processes the videos according to the schedule provided by F-Scheduler. This is achieved using the seeking and concatenate methods of FFmpeg [19] to generate videos with adaptive frame rates. The seeking method is used to split the video into multiple video chunks, and the frame rate of each video chunk is adjusted to match the schedule. Finally, the video chunks with different frame rates are combined into a single video through the concatenate method.

Note that V-Processor is integrated into the H.264/AVC encoder for performance reasons. The implementation location of V-Processor can be changed depending on the deployment situation.
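The split-adjust-concatenate pipeline described above can be sketched by generating the corresponding FFmpeg invocations. This is a hypothetical reconstruction: the flags used (`-ss`/`-t` seeking, the `fps` filter, and the concat demuxer) are standard FFmpeg usage, but the schedule format (start, duration, fps per chunk) and the file naming are our own assumptions, not the paper's code.

```python
def vprocessor_cmds(src, schedule, out_prefix):
    # Build the FFmpeg command lines a V-Processor-like tool could run:
    # cut each chunk out of the source with seeking, re-time it to its
    # scheduled frame rate, then concatenate the chunks back together.
    cmds, names = [], []
    for i, (start, dur, fps) in enumerate(schedule):
        name = f"{out_prefix}_{i:03d}.mp4"
        names.append(name)
        # Seek to the chunk and resample it to the scheduled frame rate.
        cmds.append(f"ffmpeg -ss {start} -t {dur} -i {src} -vf fps={fps} {name}")
    # Write the concat demuxer's file list, then stitch without re-encoding.
    listing = "\\n".join(f"file '{n}'" for n in names)
    cmds.append(f"printf \"{listing}\\n\" > {out_prefix}_list.txt")
    cmds.append(f"ffmpeg -f concat -safe 0 -i {out_prefix}_list.txt -c copy {out_prefix}.mp4")
    return cmds
```

The per-chunk re-encode is unavoidable (the frame rate changes), but the final concatenation can stream-copy, which keeps the second pass cheap.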

V. EVALUATION

We evaluated the performance of the EVSO system by determining 1) how much the frame rate can be reduced, 2) how well the quality of the processed videos is maintained, 3) how much total energy can be saved, 4) how much the

| Video | EVSO | EVSO+ | EVSO++ |
|---|---|---|---|
| NBC | 24.82 | 22.44 | 19.18 |
| MIT | 19.35 | 16.51 | 14.27 |
| Golf | 23.47 | 21.41 | 18.86 |
| PyeongChang | 27.64 | 26.78 | 24.06 |
| Conan | 25.95 | 24.46 | 21.84 |
| Kershaw | 26.36 | 24.44 | 21.64 |
| Pororo | 25.51 | 23.75 | 20.83 |
| Tennis | 50.94 | 47.82 | 42.17 |
| Geographic | 53.38 | 49.44 | 42.63 |

Table II. The average frame rates (fps) of the videos processed through EVSO.

processed videos affect the user experience, and 5) how much overhead is caused by EVSO.

Videos used for experiments. EVSO was evaluated on nine videos of various categories such as sports, talk shows, lectures, etc. (available at [8]). As shown in Table I, these videos have various frame rates and resolutions in order to represent an environment similar to an actual streaming service. In addition, the videos are grouped into three types according to the degree of motion intensity: (A) Static, (B) Dynamic, and (C) Hybrid. The static group consists of videos that display almost identical scenes, such as lectures. The dynamic group consists of videos with a high level of motion intensity, such as sports. The hybrid group consists of videos that have attributes of both of the aforementioned groups. Unless otherwise noted, we conducted experiments on the videos processed to play for up to three minutes.

Four different settings used for experiments (EVSO, EVSO+, EVSO++, 2/3 FPS). We configured EPF with three different settings according to the user's battery status, i.e., EVSO, EVSO+, and EVSO++. The worse the battery condition, the more aggressively EVSO is applied (denoted by more + signs), allowing for a longer battery lifetime. For this reason, through multiple experiments, the values of s_1, s_2, s_3, s_4, and s_5 in Equation (6) are set to 0.6, 0.83, 0.9, 0.93, and 1 for EVSO; 0.5, 0.73, 0.83, 0.9, and 1 for EVSO+; and 0.43, 0.6, 0.7, 0.8, and 0.93 for EVSO++, respectively. As a comparison, we also created an experimental group called 2/3 FPS that naively reduces the frame rate of the original video to two-thirds.

Videos processed through EVSO. Table II shows the average frame rates of the videos processed according to battery levels. Since EVSO adaptively reduces the frame rate in consideration of the characteristics of the video, the resulting frame rate is quite different for each video. For example, the MIT video has little change in motion, and its frame rate is therefore reduced by more than one-third on EVSO. On the other hand, the Geographic video has relatively fast-changing motion, so EVSO reduces the frame rate of the video to only about five-sixths of the original.

A. Video Quality Assessment

We used three objective quality metrics to measure how much EVSO affected the quality of the videos: SSIM [11], the Video Quality Metric (VQM) [20], and Video Multimethod Assessment Fusion (VMAF) [21]. SSIM is an appropriate metric for calculating similarity based on images; however,

| Video | EVSO (SSIM / VQM / VMAF) | EVSO+ (SSIM / VQM / VMAF) | EVSO++ (SSIM / VQM / VMAF) | 2/3 FPS (SSIM / VQM / VMAF) |
|---|---|---|---|---|
| NBC | 99.53 / 0.730 / 99.61 | 99.41 / 0.814 / 99.33 | 99.06 / 0.998 / 98.92 | 98.52 / 1.137 / 98.42 |
| MIT | 99.68 / 0.235 / 99.30 | 99.63 / 0.256 / 99.05 | 99.56 / 0.289 / 98.69 | 99.52 / 0.318 / 98.65 |
| Golf | 98.64 / 0.515 / 98.66 | 98.50 / 0.563 / 98.26 | 98.03 / 0.711 / 97.09 | 96.95 / 0.925 / 94.91 |
| PyeongChang | 99.29 / 0.366 / 99.27 | 99.12 / 0.431 / 98.77 | 97.87 / 0.847 / 95.63 | 94.20 / 1.527 / 85.57 |
| Conan | 99.04 / 0.818 / 98.77 | 98.76 / 0.942 / 97.76 | 98.20 / 1.106 / 96.46 | 96.66 / 1.513 / 92.82 |
| Kershaw | 98.54 / 0.707 / 98.81 | 98.18 / 0.839 / 98.20 | 96.60 / 1.324 / 95.72 | 93.57 / 1.866 / 91.01 |
| Pororo | 98.92 / 0.630 / 97.27 | 98.50 / 0.780 / 96.07 | 97.62 / 1.068 / 93.33 | 95.94 / 1.439 / 90.47 |
| Tennis | 98.90 / 0.806 / 99.02 | 98.76 / 0.892 / 98.62 | 98.42 / 1.051 / 97.29 | 97.93 / 1.249 / 95.32 |
| Geographic | 99.12 / 0.860 / 99.34 | 98.94 / 0.962 / 98.80 | 98.73 / 1.095 / 97.99 | 98.45 / 1.199 / 97.16 |

Table III. Video quality scores according to the metrics of SSIM, VQM, and VMAF. SSIM and VMAF are expressed in percentages.

Figure 8. The percentage of energy savings while watching each video by streaming (EVSO, EVSO++, and 2/3 FPS).

because our experiments focused on videos, we also used VQM and VMAF. These metrics are more suited to measuring the subjective quality of a video than SSIM. A low value of VQM indicates a high-quality video. The VMAF index, on the other hand, has a range from 0 to 1, and a high value of VMAF indicates a high-quality video.

Table III shows the quality of the videos processed through EVSO compared to 2/3 FPS. As can be seen, EVSO provides better overall video quality than 2/3 FPS in all video cases. For the static group, both 2/3 FPS and EVSO maintain high quality because the motion intensity in the videos is comparatively low. The most prominent gap between EVSO and 2/3 FPS occurs in the dynamic group, where 2/3 FPS causes the VMAF metric to fall below 90%, indicating a severe degradation of the user experience. In addition, in the Kershaw video, the average frame rates of EVSO++ and 2/3 FPS are similar, at 21.66 and 20, respectively, whereas the VMAF metric of EVSO++ is measured to be 5% higher than that of 2/3 FPS. This is because EVSO adjusts the frame rate by analyzing each part of the video, so the video quality can be kept much higher than with 2/3 FPS, which naively reduces the frame rate.

B. Energy Saving

We used a Monsoon Power Monitor [22] to measure the energy consumption of the processed videos on an LG Nexus 5. The brightness of the smartphone was configured to 30%, airplane mode was activated to reduce the effects of external variables, and Wi-Fi was turned on to stream videos from the streaming server. To prevent thermal throttling caused by overheating, the temperature of the smartphone was cooled down before each experiment.

We evaluated the energy consumption of the nine videos in four different settings: Baseline, EVSO, EVSO++, and 2/3 FPS. Baseline is the original video and serves as the control group; EVSO, EVSO++, and 2/3 FPS use the same settings as described above. For accurate experiments, the energy consumption of each video was measured five times and then averaged.

Figure 8 shows the ratio of energy savings when comparing Baseline to EVSO, EVSO++, and 2/3 FPS. Because EVSO adjusts the frame rate according to the motion intensity, the amount of energy saved varies considerably depending on the video characteristics. For example, the energy used by the MIT video in EVSO was reduced by about 27%, but for the Tennis video the reduction was only about 11%. However, if the current battery status is bad, EVSO++ can be used to reduce the energy requirement similarly to the 2/3 FPS group, except for the PyeongChang video, with a reduction on average of 22% compared to Baseline.

C. User Study

To analyze how the videos processed through EVSO affect the user experience, we recruited 15 participants (10 males) for the user study. We adopted the Double Stimulus Impairment Scale (DSIS) and the Double Stimulus Continuous Quality Scale (DSCQS), which are widely used for subjective quality assessments of systems [5], [7]. For the sake of experimentation, participants were allowed to watch one minute of each video.

DSIS. DSIS allows the participants to sequentially view two stimuli, i.e., the source video and the processed video, to assess the degree of impairment of the processed video. The evaluation index of DSIS is a five-point impairment scale: imperceptible (5 points), perceptible but not annoying (4 points), slightly annoying (3 points), annoying (2 points), and very annoying (1 point). In DSIS, the participants are informed about which videos are original or processed.

We measured how much impairment occurred in the videos processed through EVSO. The participants were asked to

Figure 9. Results of the user study through DSIS and DSCQS: (a) quality rating from DSIS; (b) quality rating from DSCQS; (c) quality rating distribution from DSCQS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they saw the original video and then the processed video in sequence.

Figure 9(a) shows the impairment ratings of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying". This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three types of video received high scores indicates that EVSO properly considers the motion intensity characteristics of the videos.

DSCQS. DSCQS allows the participants to view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the four different settings in random order and evaluated the quality of each one. The participants were allowed to watch the previously viewed videos again if they wanted. Figures 9(b) and (c) show the quality rating and the quality rating distribution using DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants were able to clearly discriminate 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate using EVSO leads to better quality than naively reducing the frame rate.

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz x 40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating each session five times for accuracy.

Scheduling overhead. We measured how much scheduling overhead occurred when the Frame rate Scheduler (F-Scheduler) was built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about
21.85 seconds and 22.02 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead. We measured how much processing overhead occurred when the Video Processor (V-Processor) processed a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video into various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes videos according to the three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively. This indicates that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is handled internally by the streaming server and is a preliminary task performed prior to streaming the video to users. In addition, the time overhead can be significantly reduced if V-Processor is applied based on a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead. Because the EVSO system generates additional videos for the three battery levels, the amount of video that needs to be stored is nearly three times that of a conventional streaming server. This may strain the storage requirement of the streaming server; however, this problem can also be alleviated if V-Processor is selectively applied only to popular videos, as described above.

VI. RELATED WORK

A. Energy Saving when Streaming Videos

Lim et al. [4] extended the H.264/AVC encoder to provide a frame-skipping scheme during compression and transmission for mobile screen sharing applications. This scheme can reduce the energy consumption of mobile devices without significantly affecting the user experience, as similar frames are skipped during compression and transmission. In addition, this scheme does not transmit unnecessary frames to the mobile device because it works on the server side; thus, there is a benefit of increased available network bandwidth on the client side. However, this scheme incurs high computational overhead because it calculates the similarity of all frames with the SSIM method [5]. On the other hand, EVSO proposes a novel M-Diff method specialized for a video encoder and provides a flexible way to adjust the frame rate according to the current battery status of the mobile device.

Since the wireless interface consumes a considerable amount of energy on the mobile device, there have been several efforts to reduce the energy consumption by utilizing a playback buffer when streaming videos [23], [24]. These approaches download a large amount of data in advance and store it in the player's playback buffer to increase the idle time of the wireless interface. These approaches are interoperable with EVSO. However, if users frequently skip or quit while watching the video, network bandwidth can be wasted because the previously downloaded data is no longer used.

Kim et al. [7] adjusted the refresh rate of the screen by defining a content rate, which is the number of meaningful frames per second. If the content rate is low, this scheme reduces the refresh rate, which reduces the energy requirement without significantly affecting the user experience. However, the energy usage of the wireless interface, which is the main energy-consuming factor in video streaming, cannot be reduced at all because only the refresh rate on the user side is adjusted.

B. Energy Saving for Mobile Games

Hwang et al. [5] introduced a rate-scaling technique called RAVEN that skips similar frames in the rendering loop while playing games. This system uses the Y-Diff method because the SSIM method has high computational overhead. The authors showed that Y-Diff has accuracy comparable to SSIM in human visual perception. On the other hand, EVSO proposes a novel M-Diff method specialized for a video streaming service. Moreover, RAVEN cannot be applied to a video streaming service because it performs rate-scaling in the rendering loop for mobile games.

Chen et al. [25] proposed a method called FingerShadow that reduces the power consumption of the mobile device by applying a local dimming technique to the screen areas covered by the user's fingers while playing a game. However, this approach requires external interaction with user behavior. In contrast, EVSO is independent of external factors and performs power savings using the motion intensity information of the video.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose EVSO, a system that opportunistically applies adaptive frame rates to videos without requiring any user effort. We introduce a similarity calculation method specialized for a video encoder and present a novel scheduling technique that assigns an appropriate frame rate to each video chunk. Our experiments show that EVSO reduces the power requirement by as much as 27% on mobile devices with little degradation of the user experience.

As future work, we plan to optimize various factors used in EVSO. A potential direction would be to use reinforcement learning to optimize decisions and continue learning through policy training. Another direction would be to include more videos with various frame rates in the experiments.

ACKNOWLEDGMENT

We thank the anonymous reviewers as well as our colleague Jaehyun Park for their valuable comments and suggestions. This research was a part of the project titled "SMART-Navigation project" funded by the Ministry of Oceans and Fisheries, Korea. This research was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2017R1A2B4005865).

REFERENCES

[1] "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016-2021 White Paper," 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html
[2] "Recommended upload encoding settings," 2018. [Online]. Available: https://support.google.com/youtube/answer/1722171
[3] "Samsung Galaxy S series," 2018. [Online]. Available: https://www.samsung.com/levant/smartphones/
[4] K. W. Lim, J. Ha, P. Bae, J. Ko, and Y. B. Ko, "Adaptive frame skipping with screen dynamics for mobile screen sharing applications," IEEE Systems Journal, vol. 12, no. 2, pp. 1577-1588, 2016.
[5] C. Hwang, S. Pushp, C. Koh, J. Yoon, Y. Liu, S. Choi, and J. Song, "RAVEN: Perception-aware Optimization of Power Consumption for Mobile Games," ACM MobiCom, 2017.
[6] "Game Tuner," 2018. [Online]. Available: https://play.google.com/store/apps/details?id=com.samsung.android.gametuner.thin
[7] D. Kim, N. Jung, Y. Chon, and H. Cha, "Content-centric energy management of mobile displays," IEEE Transactions on Mobile Computing, vol. 15, no. 8, pp. 1925-1938, 2016.
[8] "Experimental videos," 2018. [Online]. Available: https://www.youtube.com/playlist?list=PL6zYoyhvrmRrmY6JNGVJhKjJdbJJZzy9z
[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.
[10] "Internet Information Services," 2018. [Online]. Available: https://www.iis.net
[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[12] E. Cuervo, A. Wolman, L. P. Cox, K. Lebeck, A. Razeen, S. Saroiu, and M. Musuvathi, "Kahawai: High-quality mobile gaming using GPU offload," ACM MobiSys, 2015.
[13] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," Proceedings of the Second Annual ACM Conference on Multimedia Systems, 2011.
[14] M. S. Livingstone, Vision and Art: The Biology of Seeing. Wadsworth, 2002.
[15] E. B. Goldstein and J. Brockmole, Sensation and Perception. Cengage Learning, 2016.
[16] "OpenH264," 2018. [Online]. Available: http://www.openh264.org
[17] "Data never sleeps 2.0," 2018. [Online]. Available: https://www.domo.com/blog/data-never-sleeps-2-0
[18] "GitHub - ExoPlayer," 2018. [Online]. Available: https://github.com/google/ExoPlayer
[19] "FFmpeg," 2018. [Online]. Available: https://ffmpeg.org
[20] F. Xiao et al., "DCT-based video quality evaluation," Final Project for EE392J, vol. 769, 2000.
[21] "Toward a practical perceptual video quality metric," 2016. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html
[22] "Monsoon Solutions Inc.: Monsoon power monitor," 2018. [Online]. Available: https://www.msoon.com
[23] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, "Network characteristics of video streaming traffic," ACM CoNEXT, 2011.
[24] W. Hu and G. Cao, "Energy-aware video streaming on smartphones," IEEE INFOCOM, 2015.
[25] X. Chen, K. W. Nixon, H. Zhou, Y. Liu, and Y. Chen, "FingerShadow: An OLED Power Optimization Based on Smartphone Touch Interactions," USENIX HotPower, 2014.


[Figure 7: An EMPD contains one or more Periods (with durations such as "PT30S", "PT5M", "PT7M"); each Period contains Adaptation Sets; each Adaptation Set contains Representations (e.g., Representation 1: 1080p High, mimeType = "video/mp4"); and each Representation points to an initialization segment URL (e.g., MIT_course_1280x720_5000k_high_dash_init.mp4) and a list of media segment URLs.]

Figure 7. The overall hierarchical structure of EMPD.

as follows:

$$EPF(f_n, f_{n+1}) =
\begin{cases}
s_1 \times \gamma & \text{if } D_{MB}(f_n, f_{n+1}) < \tau_1 \\
s_2 \times \gamma & \text{if } \tau_1 \le D_{MB}(f_n, f_{n+1}) < \tau_2 \\
s_3 \times \gamma & \text{if } \tau_2 \le D_{MB}(f_n, f_{n+1}) < \tau_3 \\
s_4 \times \gamma & \text{if } \tau_3 \le D_{MB}(f_n, f_{n+1}) < \tau_4 \\
s_5 \times \gamma & \text{if } \tau_4 \le D_{MB}(f_n, f_{n+1})
\end{cases} \quad (6)$$

where $s_1$, $s_2$, $s_3$, $s_4$, and $s_5$ are the scaling factors of the frame rate $\gamma$ that match the degree of change between $f_n$ and $f_{n+1}$, and $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ are the scaling thresholds that distinguish M-Diff values based on similar variations. $EPF(f_n, f_{n+1})$ is utilized to calculate the appropriate frame rate for adjacent frames; we then define the Estimated Video chunk Frame rate (EVF) to obtain the appropriate frame rate for each video chunk as follows:

$$EVF(S_k) = \frac{\sum_{i=l}^{m-1} EPF(f_i, f_{i+1})}{m - l} + \delta \times \sigma_{S_k} \quad (7)$$

where $S_k$ represents the k-th video chunk, i.e., $(f_l, f_{l+1}, f_{l+2}, \ldots, f_m)$, for which the appropriate frame rate is estimated, and $\sigma_{S_k}$ is the standard deviation of adjacent frames in $S_k$ based on Equation (4), which is used to determine whether the degree of M-Diff variation in the video chunk is constant or anomalous. If $\sigma_{S_k}$ is high, it means that fast and slow moving parts coexist in a single video chunk. In this case, the overall frame rate can be adjusted to take into account the highly variable parts for a better user experience. Our experiments were performed with $\delta$ equal to 0.0001, but it can be tailored to the specific environment in which EVSO is deployed.

Determining τ and s factors. To determine the τ factors, we analyzed the relationship between M-Diff and the commonly used SSIM using a linear regression based on Figure 5. The resulting equation is as follows:

$$SSIM(f_n, f_{n+1}) = 1.0063 - 1.5903 \times 10^{-5} \times D_{MB}(f_n, f_{n+1}) \quad (8)$$

The statistical values of $R^2$ (R-squared) and the Pearson Correlation Coefficient (PCC) were calculated to measure the goodness of fit. The $R^2$ and PCC values of this regression model are 0.613 and -0.7834, respectively, indicating that the correlation between SSIM and M-Diff is sufficiently high.

If the SSIM index between two frames is above 0.9, the peak signal-to-noise ratio is above 50 dB, and the mean opinion score, which represents the quality of experience on a scale of 1 to 5, corresponds to 4 (Good) or 5 (Excellent, Identical) [12]. Based on this, we subdivided the SSIM index into 0.99, 0.98, 0.95, and 0.91. According to the regression model in Equation (8), the M-Diff values corresponding to these SSIM indices are 500, 1500, 3000, and 6000, and these values represent $\tau_1$, $\tau_2$, $\tau_3$, and $\tau_4$ in Equation (6), respectively.
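The mapping in Equation (8) can be sanity-checked numerically. A minimal sketch (coefficients taken from the regression above) plugs the four chosen thresholds back into the model and recovers SSIM values close to the target bands:

```python
# Predicted SSIM for a given M-Diff value, per the linear regression in Eq. (8).
def ssim_from_mdiff(d_mb: float) -> float:
    return 1.0063 - 1.5903e-5 * d_mb

# The four scaling thresholds tau_1..tau_4 map back to SSIM values near the
# target bands 0.99, 0.98, 0.95, and 0.91 discussed above.
for tau in (500, 1500, 3000, 6000):
    print(f"tau = {tau:>4}  ->  predicted SSIM = {ssim_from_mdiff(tau):.3f}")
```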

The s factors are set differently depending on the battery level. Detailed settings of the s factors are described in Section V.
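Equations (6) and (7) amount to a small threshold lookup followed by an average. A minimal sketch, using the τ values derived from Equation (8) and the EVSO-level s factors given in Section V; the M-Diff values in the example chunk are illustrative inputs, not data from the paper:

```python
import statistics

TAUS = (500, 1500, 3000, 6000)        # tau_1..tau_4 from Equation (6)
S_EVSO = (0.6, 0.83, 0.9, 0.93, 1.0)  # s_1..s_5 for the EVSO level (Section V)
DELTA = 0.0001                        # delta used in Equation (7)

def epf(d_mb: float, gamma: float, s=S_EVSO) -> float:
    """Equation (6): estimated per-frame rate for an adjacent frame pair,
    given its M-Diff value d_mb and the original frame rate gamma."""
    for tau, scale in zip(TAUS, s):
        if d_mb < tau:
            return scale * gamma
    return s[4] * gamma  # tau_4 <= d_mb: keep the full frame rate

def evf(mdiffs, gamma: float) -> float:
    """Equation (7): estimated frame rate for a chunk, given the M-Diff
    values of its m - l adjacent frame pairs."""
    mean_epf = sum(epf(d, gamma) for d in mdiffs) / len(mdiffs)
    return mean_epf + DELTA * statistics.pstdev(mdiffs)

# Illustrative chunk: mostly static frames with one fast-moving stretch,
# so the high standard deviation nudges the assigned frame rate upward.
chunk = [200, 250, 300, 5500, 5200, 400]
print(round(evf(chunk, 29.97), 2))
```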

D. Extending Media Presentation Description (MPD) for User Interaction

Dynamic Adaptive Streaming over HTTP (DASH) [13], also known as MPEG-DASH, is an adaptive bitrate streaming technology that allows a multimedia file to be partitioned into several segments and transmitted to the client over HTTP. In other words, when a video is uploaded, the streaming server must generate several processed videos by adjusting the bitrate and resolution to provide DASH functionality.

Media Presentation Description (MPD) refers to a manifest that provides segment information such as resolutions, bitrates, and URLs of the video contents. Therefore, the DASH client chooses the segment with the highest possible resolution and bitrate that can be supported by the network bandwidth and then fetches the corresponding segment through the MPD. EVSO extends the MPD to allow the client to select segments considering the battery status as well as the network bandwidth.

Figure 7 shows the hierarchical structure of the Extended MPD (EMPD), which consists of one or more periods that describe the content duration. Multiple periods can be used when it is necessary to classify multiple videos chapter by chapter or to separate advertisements and contents. Each period consists of one or more adaptation sets, which include media streams. The period typically consists of separate adaptation sets for audio and video for efficient bandwidth management. The adaptation set of the video consists of several representations that contain specific information about alternative contents, such as the resolution and MIME type (or content type). EMPD extends this adaptation set to include an EVSOLevel attribute, allowing the client to select the representation that is appropriate for its current battery situation.

Video Name                              Length (min)  Resolution (pixel)  Bitrate (Kbps)  Frame rate (fps)  Type
NBC News Conference                     2:29          1920x1080           2130            29.98             A
MIT Course                              3:32          1920x1080           1191            29.97             A
Golf - 2017 Bank of Hope Founders Cup   1:06          1920x1080           2256            29.97             A
PyeongChang Olympic                     1:24          1920x1080           2908            30.00             B
Conan Show                              31            1920x1080           3755            29.98             B
Kershaw Baseball                        1:15          1280x720            1946            29.97             B
Pororo Animation                        15:31         1920x1080           1085            29.97             C
Tennis - Australian Open 2018           10:04         1280x720            3075            59.94             C
National Geographic                     13:53         1280x720            3115            59.94             C

Table I. The characteristics of the videos used for the experiments. The bold word in the video name is used to represent that video in the experiments.

Finally, the representation provides the URL so that the client can fetch the media segments and play them back.
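The EVSOLevel extension can be pictured as one extra attribute on the DASH adaptation set. A minimal sketch using Python's xml.etree; the element names follow standard MPD conventions, and the value "2" is an illustrative battery level, since the paper does not give the exact EMPD schema:

```python
import xml.etree.ElementTree as ET

# Build a toy EMPD: one period, one video adaptation set carrying the
# EVSOLevel attribute, and one representation per quality level.
mpd = ET.Element("MPD")
period = ET.SubElement(mpd, "Period", duration="PT5M")
aset = ET.SubElement(period, "AdaptationSet",
                     mimeType="video/mp4",
                     EVSOLevel="2")  # EVSO extension: battery-level hint
ET.SubElement(aset, "Representation",
              id="1", width="1920", height="1080", bandwidth="5000000")
ET.SubElement(aset, "Representation",
              id="4", width="1280", height="720", bandwidth="2500000")

xml_text = ET.tostring(mpd, encoding="unicode")
print(xml_text)
```

A client in this sketch would parse the manifest, filter adaptation sets whose EVSOLevel matches its current battery state, and only then apply the usual bandwidth-based representation selection.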

IV. IMPLEMENTATION

In order to study the effectiveness of EVSO, we implemented a video streaming server with DASH functionality using IIS [10]. The video streaming server sends the EMPD to the client to allow the users to select the appropriate video. We also implemented a video streaming client by modifying ExoPlayer [18], an open-source media player for Android, to interpret the EMPD manifest information taking into account the battery status and network bandwidth.

Frame rate Scheduler (F-Scheduler). The source code of OpenH264 [16] was modified to determine which frame rate is appropriate for each video chunk. The SAD values of the macroblocks are extracted when the H.264/AVC encoder performs the video compression process. These extracted SAD values are utilized to analyze the motion variation of the video and to set the appropriate frame rates accordingly.

Video Processor (V-Processor). After the streaming server processes the uploaded video to generate videos with various resolutions and bitrates for the DASH service, V-Processor additionally processes the videos according to the schedule provided by F-Scheduler. This is achieved using the seeking and concatenate methods of FFmpeg [19] to generate videos with adaptive frame rates. The seeking method is used to split the video into multiple video chunks, and the frame rate of each video chunk is adjusted to match the schedule. Finally, the video chunks that have different frame rates are combined into a single video through the concatenate method.

Note that V-Processor is integrated into the H.264/AVC encoder for performance reasons. The implementation location of V-Processor can be changed depending on the deployment situation.
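The seek-then-concatenate pipeline described above can be sketched as command construction. This only assembles illustrative ffmpeg invocations (input seeking with -ss/-t, output frame rate with -r, and the concat demuxer); the chunk boundaries and target frame rates are made-up inputs standing in for F-Scheduler's output:

```python
def chunk_commands(src, schedule):
    """Build ffmpeg commands that cut each chunk out of src and re-encode it
    at the frame rate assigned by the scheduler, plus a final concat-demuxer
    command that joins the chunks back into a single video.

    schedule: list of (start_seconds, length_seconds, target_fps) tuples.
    Returns (commands, concat_listing); the listing would be written to
    chunks.txt before running the final command."""
    cmds, listing = [], []
    for i, (start, length, fps) in enumerate(schedule):
        out = f"chunk{i}.mp4"
        cmds.append(["ffmpeg", "-ss", str(start), "-t", str(length),
                     "-i", src, "-r", str(fps), out])
        listing.append(f"file '{out}'")
    cmds.append(["ffmpeg", "-f", "concat", "-i", "chunks.txt",
                 "-c", "copy", "out.mp4"])
    return cmds, "\n".join(listing)

# Two hypothetical chunks: a slow part dropped to 18 fps, a fast part kept as is.
cmds, concat_list = chunk_commands("video.mp4", [(0, 10, 18), (10, 10, 29.97)])
```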

V. EVALUATION

We evaluated the performance of the EVSO system by determining 1) how much the frame rate can be reduced, 2) how well the quality of the processed videos is maintained, 3) how much total energy can be saved, 4) how much the

Video         EVSO    EVSO+   EVSO++
NBC           24.82   22.44   19.18
MIT           19.35   16.51   14.27
Golf          23.47   21.41   18.86
PyeongChang   27.64   26.78   24.06
Conan         25.95   24.46   21.84
Kershaw       26.36   24.44   21.64
Pororo        25.51   23.75   20.83
Tennis        50.94   47.82   42.17
Geographic    53.38   49.44   42.63

Table II. The average frame rate of the videos processed through EVSO.

processed videos affect the user experience, and 5) how much overhead is caused by EVSO.

Videos used for experiments. EVSO was evaluated based on nine videos of various categories, such as sports, talk shows, lectures, etc. (available at [8]). As shown in Table I, these videos have various frame rates and resolutions in order to represent an environment similar to an actual streaming service. In addition, the videos are grouped into three types according to the degree of motion intensity: (A) Static, (B) Dynamic, and (C) Hybrid. The static group consists of videos that display almost identical scenes, such as lectures. The dynamic group consists of videos with a high level of motion intensity, such as sports. The hybrid group consists of videos that have both attributes of the two aforementioned groups. Unless otherwise noted, we conducted experiments on the videos processed to play for up to three minutes.

Four different settings used for experiments (EVSO, EVSO+, EVSO++, 2/3 FPS). We configured EPF with three different settings according to the user's battery status, i.e., EVSO, EVSO+, and EVSO++. The worse the battery condition, the more aggressively EVSO is used (denoted by more + signs), allowing for a longer battery lifetime. For this reason, through multiple experiments, the values of s1, s2, s3, s4, and s5 in Equation (6) are set to 0.6, 0.83, 0.9, 0.93, and 1 for EVSO; 0.5, 0.73, 0.83, 0.9, and 1 for EVSO+; and 0.43, 0.6, 0.7, 0.8, and 0.93 for EVSO++, respectively. As a comparison, we also created an experimental group called 2/3 FPS that naively reduces the frame rate of the original video to two-thirds.

Videos processed through EVSO. Table II shows the average frame rates of the videos processed according to battery levels. Since EVSO adaptively reduces the frame rate in consideration of the characteristics of the video, the resulting frame rate is quite different for each video. For example, the MIT video has little change in motion, and its frame rate is therefore reduced by more than one-third of the original on EVSO. On the other hand, the Geographic video has relatively fast-changing motion, so EVSO reduces the frame rate of the video to only about five-sixths of the original frame rate.

A. Video Quality Assessment

We used three objective quality metrics to measure how much EVSO affected the quality of the videos: SSIM [11], the Video Quality Metric (VQM) [20], and Video Multimethod Assessment Fusion (VMAF) [21]. SSIM is an appropriate metric for calculating similarity based on images; however,

Video         |        EVSO         |        EVSO+        |       EVSO++        |       2/3 FPS
              | SSIM   VQM    VMAF  | SSIM   VQM    VMAF  | SSIM   VQM    VMAF  | SSIM   VQM    VMAF
NBC           | 99.53  0.730  99.61 | 99.41  0.814  99.33 | 99.06  0.998  98.92 | 98.52  1.137  98.42
MIT           | 99.68  0.235  99.30 | 99.63  0.256  99.05 | 99.56  0.289  98.69 | 99.52  0.318  98.65
Golf          | 98.64  0.515  98.66 | 98.50  0.563  98.26 | 98.03  0.711  97.09 | 96.95  0.925  94.91
PyeongChang   | 99.29  0.366  99.27 | 99.12  0.431  98.77 | 97.87  0.847  95.63 | 94.20  1.527  85.57
Conan         | 99.04  0.818  98.77 | 98.76  0.942  97.76 | 98.20  1.106  96.46 | 96.66  1.513  92.82
Kershaw       | 98.54  0.707  98.81 | 98.18  0.839  98.20 | 96.60  1.324  95.72 | 93.57  1.866  91.01
Pororo        | 98.92  0.630  97.27 | 98.50  0.780  96.07 | 97.62  1.068  93.33 | 95.94  1.439  90.47
Tennis        | 98.90  0.806  99.02 | 98.76  0.892  98.62 | 98.42  1.051  97.29 | 97.93  1.249  95.32
Geographic    | 99.12  0.860  99.34 | 98.94  0.962  98.80 | 98.73  1.095  97.99 | 98.45  1.199  97.16

Table III. Video quality scores according to the metrics of SSIM, VQM, and VMAF. SSIM and VMAF are expressed in percentages.

[Figure 8: bar chart of the energy saving percentage (0-30%) for EVSO, EVSO++, and 2/3 FPS across the nine videos.]

Figure 8. The percentage of energy savings while watching each video by streaming.

because our experiments focused on videos, we also used VQM and VMAF. These metrics are more suited to measuring the subjective quality of a video than SSIM. A low value of VQM indicates a high-quality video. On the other hand, the VMAF index has a range from 0 to 1, and a high value of VMAF indicates a high-quality video.
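For intuition about the SSIM scores reported in Table III, the single-window (global) form of SSIM [11] can be written out directly. Real implementations apply it over local, Gaussian-weighted windows and average the result, so this is only a sketch of the formula:

```python
def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale images given
    as flat lists of pixel values; C1 and C2 are the standard stabilizing
    constants (0.01 and 0.03 times the data range, squared) from [11]."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n                       # means
    vx = sum((p - mx) ** 2 for p in x) / n                # variances
    vy = sum((q - my) ** 2 for q in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Identical frames score exactly 1.0; a uniformly brightened frame scores
# slightly below 1.0 (structure preserved, luminance shifted).
frame = [10, 20, 30, 40, 50, 60]
print(ssim_global(frame, frame))  # 1.0
print(ssim_global(frame, [p + 5 for p in frame]))
```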

Table III shows the quality of the videos processed through EVSO compared to 2/3 FPS. As can be seen, EVSO provides better overall video quality than 2/3 FPS in all video cases. For the static group, both 2/3 FPS and EVSO maintain high quality because the motion intensity in the videos is comparatively low. The most prominent gap between EVSO and 2/3 FPS occurs in the dynamic group, where 2/3 FPS causes the VMAF metric to fall below 90%, indicating a severe degradation of the user experience. In addition, in the Kershaw video, the average frame rates of EVSO++ and 2/3 FPS are similar at 21.66 and 20, respectively, whereas the VMAF metric of EVSO++ is measured to be 5% higher than that of 2/3 FPS. This is because EVSO adjusts the frame rate by analyzing each part of the video, so that the video quality can be kept much higher than with 2/3 FPS, which naively reduces the frame rate.

B. Energy Saving

We used a Monsoon Power Monitor [22] to measure the energy consumption of processed videos on an LG Nexus 5. The brightness of the smartphone was configured to 30%, and airplane mode was activated to reduce the effect of external variables, with Wi-Fi turned on to stream videos from the streaming server. To prevent thermal throttling caused by overheating, the temperature of the smartphone was cooled down before each experiment.

We evaluated the energy consumption of nine videos in four different settings: Baseline, EVSO, EVSO++, and 2/3 FPS. Baseline is the original video and serves as a control group; EVSO, EVSO++, and 2/3 FPS use the same settings as described above. For accurate experiments, the energy consumption of each video was measured five times and then averaged.

Figure 8 shows the ratio of energy savings when comparing Baseline to EVSO, EVSO++, and 2/3 FPS. Because EVSO adjusts the frame rate according to the motion intensity, the amount of energy saved varies considerably depending on the video characteristics. For example, the energy used by the MIT video in EVSO was reduced by about 27%, but for the Tennis video the reduction was only about 11%. However, if the current battery status is bad, EVSO++ can be used to reduce the energy requirement similarly to the 2/3 FPS group (except for the PyeongChang video), with a reduction on average of 22% compared to Baseline.

C. User Study

To analyze how the videos processed through EVSO affect the user experience, we recruited 15 participants (10 males) for the user study. We adopted the Double Stimulus Impairment Scale (DSIS) and the Double Stimulus Continuous Quality Scale (DSCQS), which are widely used for subjective quality assessments of systems [5], [7]. For the sake of experimentation, participants were allowed to watch one minute per video.

DSIS. DSIS allows the participants to sequentially view two stimuli, i.e., the source video and the processed video, to assess the degree of impairment of the processed video. The evaluation index of DSIS is a five-point impairment scale of two stimuli: imperceptible (5 points), perceptible but not annoying (4 points), slightly annoying (3 points), annoying (2 points), and very annoying (1 point). In DSIS, the participants are informed about which videos are original or processed.

We measured how much impairment occurred on the videos processed through EVSO. The participants were asked to

[Figure 9: user study results for the NBC, PyeongChang, and Geographic videos. (a) Quality rating (1-5) from DSIS; (b) quality rating (1-100) from DSCQS under Baseline, EVSO, EVSO++, and 2/3 FPS; (c) quality rating distribution (1-100) from DSCQS per setting.]

Figure 9. Results of the user study through DSIS and DSCQS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they saw the original video and then the processed video in sequence.

Figure 9(a) shows the impairment rating of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying". This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three different types of video received high scores indicates that EVSO properly considers the motion intensity characteristics of the videos.

DSCQS. DSCQS allows the participants to view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the videos using four different settings in random order and evaluated the quality of each video. The participants were allowed to watch the previously viewed videos again if they wanted. Figures 9(b) and (c) show the quality rating and the quality rating distribution using DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants were able to clearly discriminate 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate using EVSO leads to better quality than naively reducing the frame rate.

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz x 40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating the session five times for accuracy.

Scheduling overhead. We measured how much scheduling overhead occurred when the Frame rate Scheduler (F-Scheduler) was built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about 21.85 seconds and 22.02 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead. We measured how much processing overhead occurred when the Video Processor (V-Processor) processed a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video with various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes videos according to three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively. This indicates that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is internally handled by the streaming server and is a preliminary task performed prior to streaming the video to users. In addition, the time overhead can be significantly reduced if V-Processor is performed based on a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead. Because the EVSO system generates additional videos based on the three battery levels, the amount of video that needs to be stored is nearly three times more than for a conventional streaming server. This may be a strain on the storage requirement of the streaming server; however, this problem can also be alleviated if V-Processor is selectively performed only on popular videos, as described above.
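The reported overheads follow directly from the timings. A quick check using the (rounded) numbers from this section, which lands close to the ~0.76% and ~100% figures quoted above:

```python
def overhead_pct(baseline_s, with_feature_s):
    """Relative processing overhead as a percentage of the baseline time."""
    return (with_feature_s - baseline_s) / baseline_s * 100

print(round(overhead_pct(21.85, 22.02), 2))  # F-Scheduler: well under 1%
print(round(overhead_pct(143, 289), 1))      # V-Processor: roughly doubles the time
```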


[14] M S Livingstone Vision and Art The Biology of Seeing Wadsworth2002

[15] E B Goldstein and J Brockmole Sensation and Perception CengageLearning 2016

[16] ldquoOpenH264rdquo 2018 [Online] Available httpwwwopenh264org[17] ldquoData never sleeps 20rdquo 2018 [Online] Available httpswwwdomo

comblogdata-never-sleeps-2-0[18] ldquoGithub - ExoPlayerrdquo 2018 [Online] Available httpsgithubcom

googleExoPlayer[19] ldquoFFmpegrdquo 2018 [Online] Available httpsffmpegorg[20] F Xiao et al ldquoDCT-based video quality evaluationrdquo Final Project for

EE392J vol 769 2000[21] ldquoToward a practical perceptual video quality metricrdquo

2016 [Online] Available httptechblognetflixcom201606toward-practical-perceptual-videohtml

[22] ldquoMonsoon solutions inc monsoon power monitorrdquo 2018 [Online]Available httpswwwmsooncom

[23] A Rao A Legout Y-s Lim D Towsley C Barakat and W DabbousldquoNetwork characteristics of video streaming trafficrdquo ACM CoNEXT2011

[24] W Hu and G Cao ldquoEnergy-aware video streaming on smartphonesrdquoIEEE INFOCOM 2015

[25] X Chen K W Nixon H Zhou Y Liu and Y Chen ldquoFingerShadowAn OLED Power Optimization Based on Smartphone Touch Interac-tionsrdquo USENIX HotPower 2014


| Video Name | Length (min) | Resolution (pixels) | Bitrate (Kbps) | Frame rate (fps) | Type |
|---|---|---|---|---|---|
| NBC News Conference | 2:29 | 1920x1080 | 2130 | 29.98 | A (Static) |
| MIT Course | 3:32 | 1920x1080 | 1191 | 29.97 | A (Static) |
| Golf - 2017 Bank of Hope Founders Cup | 1:06 | 1920x1080 | 2256 | 29.97 | A (Static) |
| PyeongChang Olympic | 1:24 | 1920x1080 | 2908 | 30.00 | B (Dynamic) |
| Conan Show | 0:31 | 1920x1080 | 3755 | 29.98 | B (Dynamic) |
| Kershaw Baseball | 1:15 | 1280x720 | 1946 | 29.97 | B (Dynamic) |
| Pororo Animation | 15:31 | 1920x1080 | 1085 | 29.97 | C (Hybrid) |
| Tennis - Australian Open 2018 | 10:04 | 1280x720 | 3075 | 59.94 | C (Hybrid) |
| National Geographic | 13:53 | 1280x720 | 3115 | 59.94 | C (Hybrid) |

Table I. The characteristics of the videos used for the experiments. The bold word in each video name (NBC, MIT, Golf, PyeongChang, Conan, Kershaw, Pororo, Tennis, Geographic) is used to represent that video in the experiments.

representation provides the URL so that the client can fetch the media segments and play them back.
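For illustration, the EMPD described above might extend a standard DASH MPD along these lines. The `batteryLevel` attribute and the representation names are purely hypothetical, since the paper's exact schema is not shown in this excerpt:

```xml
<!-- Hypothetical EMPD fragment: a standard DASH MPD extended with an
     illustrative batteryLevel attribute (an assumption, not the paper's
     actual schema). One representation per (bandwidth, battery) pair. -->
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="720p-full" bandwidth="5000000" batteryLevel="high"
                      width="1280" height="720">
        <BaseURL>video_720p_full.mp4</BaseURL>
      </Representation>
      <Representation id="720p-evso" bandwidth="5000000" batteryLevel="low"
                      width="1280" height="720">
        <BaseURL>video_720p_evso.mp4</BaseURL>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
```

The client would pick a representation by matching both its measured bandwidth and its current battery level against these attributes.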

IV. IMPLEMENTATION

In order to study the effectiveness of EVSO, we implemented a video streaming server with DASH functionality using IIS [10]. The video streaming server sends EMPD to the client to allow the users to select the appropriate video. We also implemented a video streaming client by modifying ExoPlayer [18], an open-source media player for Android, to interpret the EMPD manifest information taking into account the battery status and network bandwidth.

Frame rate Scheduler (F-Scheduler): The source code for OpenH264 [16] was modified to determine which frame rate is appropriate for each video chunk. The SAD values of the macroblocks are extracted when the H.264/AVC encoder performs the video compression process. These extracted SAD values are utilized to analyze the motion variation of the video and to set the appropriate frame rates accordingly.

Video Processor (V-Processor): After the streaming server processes the uploaded video to generate videos with various resolutions and bitrates for the DASH service, V-Processor additionally processes the videos according to the schedule provided by F-Scheduler. This is achieved using the seeking and concatenate methods of FFmpeg [19] to generate videos with adaptive frame rates. The seeking method is used to split the video into multiple video chunks, and the frame rate of each video chunk is adjusted to match the schedule. Finally, the video chunks with different frame rates are combined into a single video through the concatenate method.
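As a concrete illustration, the split-adjust-concatenate pipeline above could be driven by FFmpeg commands like the ones built below. This is a sketch, not the paper's code: the file names and the example schedule are made up, while the `-ss`/`-t` seeking options and the concat demuxer are standard FFmpeg features.

```python
# Sketch of V-Processor's FFmpeg usage (file names and schedule are
# hypothetical; only the FFmpeg options themselves are real).

def split_cmd(src, start, duration, fps, out):
    """Cut [start, start+duration) out of src and resample it to fps."""
    return ["ffmpeg", "-ss", str(start), "-t", str(duration),
            "-i", src, "-r", str(fps), out]

def concat_cmd(list_file, out):
    """Join the chunk files listed (one 'file ...' line each) in list_file."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", out]

# Example schedule from F-Scheduler: (start_sec, duration_sec, fps) per chunk.
schedule = [(0, 4, 29.97), (4, 4, 20.0), (8, 4, 15.0)]
cmds = [split_cmd("input.mp4", s, d, f, "chunk%d.mp4" % i)
        for i, (s, d, f) in enumerate(schedule)]
# Each command would be run (e.g., via subprocess.run), and the resulting
# chunks joined with concat_cmd("chunks.txt", "output.mp4").
```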

Note that V-Processor is integrated into the H.264/AVC encoder for performance reasons. The implementation location of V-Processor can be changed depending on the deployment situation.
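For intuition, the per-macroblock SAD values that F-Scheduler reads out of the encoder can be approximated offline as below. This is an illustrative NumPy re-implementation over co-located 16x16 luma macroblocks, not the modified OpenH264 code (the real encoder obtains SAD against motion-compensated predictions as a by-product of motion estimation):

```python
import numpy as np

# Offline approximation of per-macroblock SAD between two luma frames:
# sum of absolute differences over co-located 16x16 blocks.

def macroblock_sad(prev, cur, mb=16):
    """Return an array of SAD values, one per 16x16 macroblock."""
    h, w = prev.shape
    h, w = h - h % mb, w - w % mb  # drop partial blocks at the edges
    diff = np.abs(cur[:h, :w].astype(np.int32) - prev[:h, :w].astype(np.int32))
    # Sum within each macroblock via a blocked reshape.
    return diff.reshape(h // mb, mb, w // mb, mb).sum(axis=(1, 3))

prev = np.zeros((64, 64), dtype=np.uint8)
cur = prev.copy()
cur[:16, :16] = 10                # motion confined to the top-left block
sads = macroblock_sad(prev, cur)  # 4x4 grid of SAD values
```

A chunk whose frames yield uniformly small SAD values would be judged low-motion and assigned a lower frame rate.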

V. EVALUATION

We evaluated the performance of the EVSO system by determining: 1) how much the frame rate can be reduced; 2) how well the quality of the processed videos is maintained; 3) how much total energy can be saved; 4) how much the

| Video | EVSO | EVSO+ | EVSO++ |
|---|---|---|---|
| NBC | 24.82 | 22.44 | 19.18 |
| MIT | 19.35 | 16.51 | 14.27 |
| Golf | 23.47 | 21.41 | 18.86 |
| PyeongChang | 27.64 | 26.78 | 24.06 |
| Conan | 25.95 | 24.46 | 21.84 |
| Kershaw | 26.36 | 24.44 | 21.64 |
| Pororo | 25.51 | 23.75 | 20.83 |
| Tennis | 50.94 | 47.82 | 42.17 |
| Geographic | 53.38 | 49.44 | 42.63 |

Table II. The average frame rate (fps) of the videos processed through EVSO.

processed videos affect the user experience; and 5) how much overhead is caused by EVSO.

Videos used for experiments: EVSO was evaluated based on nine videos of various categories, such as sports, talk shows, lectures, etc. (available at [8]). As shown in Table I, these videos have various frame rates and resolutions in order to represent an environment similar to an actual streaming service. In addition, the videos are grouped into three types according to the degree of motion intensity: (A) Static, (B) Dynamic, and (C) Hybrid. The static group consists of videos that display almost identical scenes, such as lectures. The dynamic group consists of videos with a high level of motion intensity, such as sports. The hybrid group consists of videos that have attributes of both of the aforementioned groups. Unless otherwise noted, we conducted experiments on the videos processed to play for up to three minutes.

Four different settings used for experiments (EVSO, EVSO+, EVSO++, 2/3 FPS): We configured EPF with three different settings according to the user's battery status, i.e., EVSO, EVSO+, and EVSO++. The worse the battery condition, the more aggressively EVSO is used (denoted by more + signs), allowing for a longer battery lifetime. For this reason, through multiple experiments, the values of s1, s2, s3, s4, and s5 in Equation (6) are set to 0.6, 0.83, 0.9, 0.93, and 1 for EVSO; 0.5, 0.73, 0.83, 0.9, and 1 for EVSO+; and 0.43, 0.6, 0.7, 0.8, and 0.93 for EVSO++, respectively. As a comparison, we also created an experimental group called 2/3 FPS that naively reduces the frame rate of the original video to two-thirds.

Videos processed through EVSO: Table II shows the average frame rates of the videos processed according to battery levels. Since EVSO adaptively reduces the frame rate in consideration of the characteristics of the video, the resulting frame rate is quite different for each video. For example, the MIT video has little change in motion and is therefore reduced by more than one-third of the original frame rate with EVSO. On the other hand, the Geographic video has relatively fast-changing motion, so EVSO reduces the frame rate of the video to only about five-sixths of the original frame rate.
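To make the threshold sets concrete, the sketch below maps a chunk's perceptual-similarity score to a frame rate using the s1-s5 values quoted above. The bucket-to-frame-rate factors are illustrative assumptions; the paper's Equation (6) is not reproduced in this excerpt.

```python
# EPF threshold sets quoted in the text; the FACTORS mapping each bucket to
# a fraction of the original frame rate is a hypothetical choice.
EPF = {
    "EVSO":   [0.60, 0.83, 0.90, 0.93, 1.00],
    "EVSO+":  [0.50, 0.73, 0.83, 0.90, 1.00],
    "EVSO++": [0.43, 0.60, 0.70, 0.80, 0.93],
}
FACTORS = [1.0, 5 / 6, 2 / 3, 1 / 2, 1 / 3, 1 / 4]  # assumed per-bucket keep ratio

def chunk_fps(similarity, original_fps, mode="EVSO"):
    """Higher inter-frame similarity -> lower frame rate for the chunk."""
    for i, threshold in enumerate(EPF[mode]):
        if similarity < threshold:
            return original_fps * FACTORS[i]
    return original_fps * FACTORS[-1]
```

Under this mapping a low-motion chunk (similarity near 1) is pushed toward the smallest factor, and the more aggressive EVSO++ thresholds reach that factor at lower similarity scores.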

A. Video Quality Assessment

We used three objective quality metrics to measure how much EVSO affected the quality of the videos: SSIM [11], Video Quality Metric (VQM) [20], and Video Multimethod Assessment Fusion (VMAF) [21]. SSIM is an appropriate metric for calculating similarity based on images.

| Video | EVSO (SSIM / VQM / VMAF) | EVSO+ (SSIM / VQM / VMAF) | EVSO++ (SSIM / VQM / VMAF) | 2/3 FPS (SSIM / VQM / VMAF) |
|---|---|---|---|---|
| NBC | 99.53 / 0.730 / 99.61 | 99.41 / 0.814 / 99.33 | 99.06 / 0.998 / 98.92 | 98.52 / 1.137 / 98.42 |
| MIT | 99.68 / 0.235 / 99.30 | 99.63 / 0.256 / 99.05 | 99.56 / 0.289 / 98.69 | 99.52 / 0.318 / 98.65 |
| Golf | 98.64 / 0.515 / 98.66 | 98.50 / 0.563 / 98.26 | 98.03 / 0.711 / 97.09 | 96.95 / 0.925 / 94.91 |
| PyeongChang | 99.29 / 0.366 / 99.27 | 99.12 / 0.431 / 98.77 | 97.87 / 0.847 / 95.63 | 94.20 / 1.527 / 85.57 |
| Conan | 99.04 / 0.818 / 98.77 | 98.76 / 0.942 / 97.76 | 98.20 / 1.106 / 96.46 | 96.66 / 1.513 / 92.82 |
| Kershaw | 98.54 / 0.707 / 98.81 | 98.18 / 0.839 / 98.20 | 96.60 / 1.324 / 95.72 | 93.57 / 1.866 / 91.01 |
| Pororo | 98.92 / 0.630 / 97.27 | 98.50 / 0.780 / 96.07 | 97.62 / 1.068 / 93.33 | 95.94 / 1.439 / 90.47 |
| Tennis | 98.90 / 0.806 / 99.02 | 98.76 / 0.892 / 98.62 | 98.42 / 1.051 / 97.29 | 97.93 / 1.249 / 95.32 |
| Geographic | 99.12 / 0.860 / 99.34 | 98.94 / 0.962 / 98.80 | 98.73 / 1.095 / 97.99 | 98.45 / 1.199 / 97.16 |

Table III. Video quality scores according to the metrics of SSIM, VQM, and VMAF. SSIM and VMAF are expressed in percentages.

[Figure 8: bar chart omitted. Y-axis: Energy Saving (%), 0-30; x-axis: NBC, MIT, Golf, PyeongChang, Conan, Kershaw, Pororo, Tennis, Geographic; bars: EVSO, EVSO++, 2/3 FPS.]

Figure 8. The percentage of energy savings while watching each video by streaming.

However, because our experiments focused on videos, we also used VQM and VMAF. These metrics are more suited to measuring the subjective quality of a video than SSIM. A low value of VQM indicates a high-quality video. On the other hand, the VMAF index has a range from 0 to 1, and a high value of VMAF indicates a high-quality video.
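For reference, a simplified single-window version of SSIM can be written directly from its definition. The standard metric [11] averages this statistic over local windows rather than computing it globally, so this sketch only illustrates how a pair of frames is scored:

```python
import numpy as np

# Global-statistics SSIM between two grayscale frames (a simplification of
# the windowed metric in [11]; identical frames score exactly 1).

def global_ssim(x, y, data_range=255.0):
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

a = np.tile(np.arange(64, dtype=np.uint8), (64, 1))  # synthetic gradient frame
score_same = global_ssim(a, a)        # ~1.0 for identical frames
score_diff = global_ssim(a, a // 2)   # < 1.0 for a degraded frame
```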

Table III shows the quality of the videos processed through EVSO compared to 2/3 FPS. As can be seen, EVSO provides better overall video quality than 2/3 FPS in all video cases. For the static group, both 2/3 FPS and EVSO maintain high quality because the motion intensity in the videos is comparatively low. The most prominent gap between EVSO and 2/3 FPS occurs in the dynamic group, where 2/3 FPS causes the VMAF metric to drop below 90, indicating a severe degradation of the user experience. In addition, in the Kershaw video, the average frame rates of EVSO++ and 2/3 FPS are similar at 21.66 and 20 fps, respectively, whereas the VMAF metric of EVSO++ is measured to be about 5 points higher than that of 2/3 FPS. This is because EVSO adjusts the frame rate by analyzing each part of the video, so the video quality can be kept much higher than with 2/3 FPS, which naively reduces the frame rate.

B. Energy Saving

We used a Monsoon Power Monitor [22] to measure the energy consumption of processed videos on an LG Nexus 5. The brightness of the smartphone was configured to 30%, and airplane mode was activated to reduce the effect of external variables, with Wi-Fi then turned back on to stream videos from the streaming server. To prevent thermal throttling caused by overheating, the smartphone was cooled before each experiment.

We evaluated the energy consumption of the nine videos in four different settings: Baseline, EVSO, EVSO++, and 2/3 FPS. Baseline is the original video and serves as a control group; EVSO, EVSO++, and 2/3 FPS use the same settings as described above. For accurate experiments, the energy consumption of each video was measured five times and then averaged.

Figure 8 shows the ratio of energy savings when comparing Baseline to EVSO, EVSO++, and 2/3 FPS. Because EVSO adjusts the frame rate according to the motion intensity, the amount of energy saved varies considerably depending on the video characteristics. For example, the energy used by the MIT video in EVSO was reduced by about 27%, but for the Tennis video the reduction was only about 11%. However, if the current battery status is bad, EVSO++ can be used to reduce the energy requirement similarly to the 2/3 FPS group (except for the PyeongChang video), with a reduction on average of 22% compared to Baseline.
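The percentages in Figure 8 follow from the averaged Monsoon measurements as (E_Baseline - E_mode) / E_Baseline; the sketch below shows the arithmetic with made-up readings, not the paper's data:

```python
# Energy-saving percentage from averaged measurements (all numbers below
# are illustrative, not measured values from the paper).

def savings_pct(e_baseline, e_mode):
    """Percent energy saved by a mode relative to Baseline."""
    return 100.0 * (e_baseline - e_mode) / e_baseline

def avg(xs):
    return sum(xs) / len(xs)

runs_baseline = [41.0, 40.5, 41.2, 40.8, 41.0]  # joules per run (hypothetical)
runs_evso     = [30.1, 29.8, 30.3, 30.0, 29.9]
pct = savings_pct(avg(runs_baseline), avg(runs_evso))
```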

C. User Study

To analyze how the videos processed through EVSO affect the user experience, we recruited 15 participants (10 males) for the user study. We adopted the Double Stimulus Impairment Scale (DSIS) and the Double Stimulus Continuous Quality Scale (DSCQS), which are widely used for subjective quality assessments of systems [5], [7]. For the sake of experimentation, participants were allowed to watch one minute per video.

DSIS: DSIS allows the participants to sequentially view two stimuli, i.e., the source video and the processed video, to assess the degree of impairment to the processed video. The evaluation index of DSIS is a five-point impairment scale of the two stimuli: imperceptible (5 points), perceptible but not annoying (4 points), slightly annoying (3 points), annoying (2 points), and very annoying (1 point). In DSIS, the participants are informed about which videos are original or processed.

We measured how much impairment occurred on the videos processed through EVSO. The participants were asked to

[Figure 9: charts omitted. (a) Quality rating from DSIS (scale 1-5) for NBC, PyeongChang, and Geographic; (b) quality rating from DSCQS (scale 1-100) for Baseline, EVSO, EVSO++, and 2/3 FPS; (c) quality rating distribution from DSCQS (scale 1-100).]

Figure 9. Results of the user study through DSIS and DSCQS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they saw the original video and then the processed video in sequence.

Figure 9(a) shows the impairment rating of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying". This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three types of video received high scores indicates that EVSO properly considers the motion intensity characteristics of the videos.

DSCQS: DSCQS allows the participants to view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the four different settings in random order and evaluated the quality of each one. The participants were allowed to rewatch previously viewed videos if they wanted. Figures 9(b) and (c) show the quality rating and quality rating distribution using DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants were able to clearly discriminate 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate using EVSO leads to better quality than naively reducing the frame rate.

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz x 40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating each session five times for accuracy.

Scheduling overhead: We measured how much scheduling overhead occurred when the Frame rate Scheduler (F-Scheduler) was built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about 21.85 seconds and 22.02 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead: We measured how much processing overhead occurred when the Video Processor (V-Processor) processed a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video into various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes videos according to the three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively. This indicates that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is handled internally by the streaming server and is a preliminary task performed prior to streaming the video to users. In addition, the time overhead can be significantly reduced if V-Processor is run with a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead: Because the EVSO system generates additional videos for the three battery levels, the amount of video that needs to be stored is nearly three times that of a conventional streaming server. This may strain the storage requirement of the streaming server; however, this problem can also be alleviated if V-Processor is selectively applied only to popular videos, as described above.

VI. RELATED WORK

A. Energy Saving when Streaming Videos

Lim et al. [4] extended the H.264/AVC encoder to provide a frame-skipping scheme during compression and transmission for mobile screen sharing applications. This scheme can reduce the energy consumption of mobile devices without significantly affecting the user experience, as similar frames are skipped during compression and transmission. In addition, this scheme does not transmit unnecessary frames to the mobile device because it works on the server side; thus, there is a benefit of increased available network bandwidth on the client side. However, this scheme incurs high computational overhead because it calculates the similarity of all frames with the SSIM method [5]. On the other hand, EVSO proposes a novel M-Diff method specialized for a video encoder and provides a flexible way to adjust the frame rate according to the current battery status of the mobile device.

Since the wireless interface consumes a considerable amount of energy on the mobile device, there have been several efforts to reduce the energy consumption by utilizing a playback buffer when streaming videos [23], [24]. These approaches download a large amount of data in advance and store it in the player's playback buffer to increase the idle time of the wireless interface. These approaches are interoperable with EVSO. However, if users frequently skip or quit while watching the video, network bandwidth can be wasted because the previously downloaded data is no longer used.

Kim et al. [7] adjusted the refresh rate of the screen by defining a content rate, which is the number of meaningful frames per second. If the content rate is low, this scheme reduces the refresh rate, which reduces the energy requirement without significantly affecting the user experience. However, the energy usage of the wireless interface, which is the main energy-consuming factor in video streaming, cannot be reduced at all because only the refresh rate on the user side is adjusted.

B. Energy Saving for Mobile Games

Hwang et al. [5] introduced a rate-scaling technique called RAVEN that skips similar frames in the rendering loop while playing games. This system uses the Y-Diff method because the SSIM method has high computational overhead. The authors showed that Y-Diff has accuracy comparable to SSIM in human visual perception. On the other hand, EVSO proposes a novel M-Diff method specialized for a video streaming service. Moreover, RAVEN cannot be applied to a video streaming service because it performs rate-scaling in the rendering loop for mobile games.

Chen et al. [25] proposed a method called FingerShadow that reduces the power consumption of the mobile device by applying a local dimming technique to the screen areas covered by the user's fingers while playing a game. However, this approach depends on external interaction, namely user behavior. In contrast, EVSO is independent of external factors and achieves power savings using the motion intensity information of the video.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose EVSO, a system that opportunistically applies adaptive frame rates to videos without requiring any user effort. We introduce a similarity calculation method specialized for a video encoder and present a novel scheduling technique that assigns an appropriate frame rate to each video chunk. Our experiments show that EVSO reduces the power requirement by as much as 27% on mobile devices with little degradation of the user experience.

As future work, we plan to optimize various factors used in EVSO. One potential direction is to use reinforcement learning to optimize decisions and continue learning through policy training. Another direction is to include more videos with various frame rates in the experiments.

ACKNOWLEDGMENT

We thank the anonymous reviewers as well as our colleague Jaehyun Park for their valuable comments and suggestions. This research was a part of the project titled "SMART-Navigation Project" funded by the Ministry of Oceans and Fisheries, Korea. This research was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2017R1A2B4005865).

REFERENCES

[1] "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016-2021 White Paper," 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html

[2] "Recommended upload encoding settings," 2018. [Online]. Available: https://support.google.com/youtube/answer/1722171

[3] "Samsung Galaxy S series," 2018. [Online]. Available: https://www.samsung.com/levant/smartphones/

[4] K. W. Lim, J. Ha, P. Bae, J. Ko, and Y. B. Ko, "Adaptive frame skipping with screen dynamics for mobile screen sharing applications," IEEE Systems Journal, vol. 12, no. 2, pp. 1577-1588, 2016.

[5] C. Hwang, S. Pushp, C. Koh, J. Yoon, Y. Liu, S. Choi, and J. Song, "RAVEN: Perception-aware Optimization of Power Consumption for Mobile Games," ACM MobiCom, 2017.

[6] "Game Tuner," 2018. [Online]. Available: https://play.google.com/store/apps/details?id=com.samsung.android.gametuner.thin

[7] D. Kim, N. Jung, Y. Chon, and H. Cha, "Content-centric energy management of mobile displays," IEEE Transactions on Mobile Computing, vol. 15, no. 8, pp. 1925-1938, 2016.

[8] "Experimental videos," 2018. [Online]. Available: https://www.youtube.com/playlist?list=PL6zYoyhvrmRrmY6JNGVJhKjJdbJJZzy9z

[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.

[10] "Internet Information Services," 2018. [Online]. Available: https://www.iis.net

[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.

[12] E. Cuervo, A. Wolman, L. P. Cox, K. Lebeck, A. Razeen, S. Saroiu, and M. Musuvathi, "Kahawai: High-quality mobile gaming using GPU offload," ACM MobiSys, 2015.

[13] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," Proceedings of the Second Annual ACM Conference on Multimedia Systems, 2011.

[14] M. S. Livingstone, Vision and Art: The Biology of Seeing. Wadsworth, 2002.

[15] E. B. Goldstein and J. Brockmole, Sensation and Perception. Cengage Learning, 2016.

[16] "OpenH264," 2018. [Online]. Available: http://www.openh264.org

[17] "Data never sleeps 2.0," 2018. [Online]. Available: https://www.domo.com/blog/data-never-sleeps-2-0

[18] "GitHub - ExoPlayer," 2018. [Online]. Available: https://github.com/google/ExoPlayer

[19] "FFmpeg," 2018. [Online]. Available: https://ffmpeg.org

[20] F. Xiao et al., "DCT-based video quality evaluation," Final Project for EE392J, vol. 769, 2000.

[21] "Toward a practical perceptual video quality metric," 2016. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html

[22] "Monsoon Solutions Inc., Monsoon power monitor," 2018. [Online]. Available: https://www.msoon.com

[23] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, "Network characteristics of video streaming traffic," ACM CoNEXT, 2011.

[24] W. Hu and G. Cao, "Energy-aware video streaming on smartphones," IEEE INFOCOM, 2015.

[25] X. Chen, K. W. Nixon, H. Zhou, Y. Liu, and Y. Chen, "FingerShadow: An OLED Power Optimization Based on Smartphone Touch Interactions," USENIX HotPower, 2014.

Page 7: EVSO: Environment-aware Video Streaming …We also extend the media presentation description, in which the video content is selected based only on the network bandwidth, to allow for

Video EVSO EVSO+ EVSO++ 23 FPS

SSIM VQM VMAF SSIM VQM VMAF SSIM VQM VMAF SSIM VQM VMAF

NBC 9953 0730 9961 9941 0814 9933 9906 0998 9892 9852 1137 9842MIT 9968 0235 9930 9963 0256 9905 9956 0289 9869 9952 0318 9865Golf 9864 0515 9866 9850 0563 9826 9803 0711 9709 9695 0925 9491

PyeongChang 9929 0366 9927 9912 0431 9877 9787 0847 9563 9420 1527 8557Conan 9904 0818 9877 9876 0942 9776 9820 1106 9646 9666 1513 9282Kershaw 9854 0707 9881 9818 0839 9820 9660 1324 9572 9357 1866 9101

Pororo 9892 0630 9727 9850 0780 9607 9762 1068 9333 9594 1439 9047Tennis 9890 0806 9902 9876 0892 9862 9842 1051 9729 9793 1249 9532Geographic 9912 0860 9934 9894 0962 9880 9873 1095 9799 9845 1199 9716

Table III Video quality scores according to the metrics of SSIM VQM and VMAF SSIM and VMAF are expressed in percentages

NBC MIT Golf PyeongChang Conan Kershaw Pororo Tennis Geographic0

10

20

30

En

erg

y S

avin

g (

)

EVSO EVSO++ 23 FPS

Figure 8 The percentage of energy savings while watching each video by streaming

because our experiments focused on videos we also usedVQM and VMAF These metrics are more suited to measuringthe subjective quality of the video than SSIM A low value ofVQM indicates a high quality video On the other hand theVMAF index has a range from 0 to 1 and a high value ofVMAF indicates a high quality video

Table III shows the quality of the videos processed throughEVSO compared to 23 FPS As can be seen EVSO providesbetter overall video quality than 23 FPS in all video cases Forthe static group both 23 FPS and EVSO maintain high qualitybecause the motion intensity in the videos is comparativelylow The most prominent gap between EVSO and 23 FPSoccurs in the dynamic group where 23 FPS causes the VMAFmetric to be less than 90 indicating a severe degradationof the user experience In addition in the Kershaw videothe average frame rates of EVSO++ and 23 FPS are similarat 2166 and 20 respectively whereas the VMAF metric ofEVSO++ is measured to be 5 higher than 23 FPS This isbecause EVSO adjusts the frame rate by analyzing each partof the video so that the video quality can be kept much higherthan 23 FPS which naively reduces the frame rate

B Energy Saving

We used a Monsoon Power Monitor [22] to measure theenergy consumption of processed videos on a LG Nexus 5The brightness of the smartphone was configured to 30and the airplane mode was activated to reduce the effect fromexternal variables and Wi-Fi turned on to stream videos fromthe streaming server To prevent thermal throttling caused byoverheating the temperature of the smartphone was cooledbefore each experiment

We evaluated the energy consumption of nine videos infour different settings Baseline EVSO EVSO++ and 23FPS Baseline is an original video and serves as a control

group EVSO EVSO++ and 23 FPS use the same settingsas described above For accurate experiments the energyconsumption of each video was measured five times and thenaveraged

Figure 8 shows the ratio of energy savings when comparingBaseline to EVSO EVSO++ and 23 FPS Because EVSOadjusts the frame rate according to the motion intensity theamount of energy saved varies considerably depending on thevideo characteristics For example the energy used by theMIT video in EVSO was reduced by about 27 but for theTennis video the reduction was only about 11 Howeverif the current battery status is bad EVSO++ can be usedto reduce the energy requirement similarly to the 23 FPSgroup except for the PyeongChang video with a reductionon average of 22 compared to Baseline

C User Study

To analyze how the videos processed through EVSO affectthe user experience we recruited 15 participants (10 males) forthe user study We adopted the Double Stimulus ImpairmentScale (DSIS) and the Double Stimulus Continuous QualityScale (DSCQS) which are widely used for subjective qualityassessments of systems [5] [7] For the sake of experimenta-tion participants were allowed to watch one minute per videoDSIS DSIS allows the participants to sequentially view twostimuli ie the source video and the processed video toassess the degree of impairment to the processed video Theevaluation index of DSIS is a five-point impairment scaleof two stimuli imperceptible (5 points) perceptible but notannoying (4 points) slightly annoying (3 points) annoying (2points) and very annoying (1 points) In DSIS the participantsare informed about which videos were original or processed

We measured how much impairment occurred on the videosprocessed through EVSO The participants were asked to

Figure 9. Results of the user study through DSIS and DSCQS: (a) quality rating from DSIS (1-5 scale); (b) quality rating from DSCQS (1-100 scale); (c) quality rating distribution from DSCQS, for Baseline, EVSO, EVSO++, and 2/3 FPS.

watch three video types: NBC (Static), PyeongChang (Dynamic), and Geographic (Hybrid). The participants were informed in advance of which videos were processed, and they saw the original video and then the processed video in sequence.

Figure 9(a) shows the impairment rating of the videos. Most participants scored the processed videos as either "imperceptible" or "perceptible but not annoying". This indicates that most participants did not recognize the difference between the original video and the processed video. In addition, the fact that all three different types of video received high scores indicates that EVSO properly considers the motion intensity characteristics of the videos.

DSCQS: DSCQS allows the participants to view two stimuli, i.e., the source video and the processed video, in random order to evaluate the differences in perceived visual quality. The evaluation index of DSCQS is a [0-100] scale for each stimulus. Unlike DSIS, DSCQS does not inform the participants about the setting of each video.

The participants were asked to watch the same three videos as in the DSIS method. For each video, the participants watched the videos using four different settings in random order and evaluated the quality of each video. The participants were allowed to watch the previously viewed videos again if they wanted. Figures 9(b) and (c) show the quality rating and quality rating distribution using DSCQS. The participants could not distinguish between the original video (i.e., Baseline) and the processed video (i.e., EVSO). On the other hand, the participants were able to clearly discriminate 2/3 FPS, whose frame rate was naively reduced to two-thirds. In addition, EVSO++, which aggressively reduces the frame rate, has an average frame rate similar to 2/3 FPS, but its average quality score is 16 points higher. This indicates that reducing the frame rate using EVSO leads to better quality than naively reducing the frame rate.
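The DSCQS comparison boils down to averaging each setting's 0-100 ratings into a mean opinion score and comparing settings by their score gap. A minimal sketch, using hypothetical per-participant ratings (not the study's raw data) chosen to mirror the reported 16-point gap:

```python
# Hypothetical DSCQS ratings (0-100) per participant for two settings.
dscqs_ratings = {
    "EVSO++":  [78, 82, 75, 80, 77],
    "2/3 FPS": [62, 66, 59, 64, 61],
}

def mos(scores):
    """Mean opinion score: average rating over all participants."""
    return sum(scores) / len(scores)

gap = mos(dscqs_ratings["EVSO++"]) - mos(dscqs_ratings["2/3 FPS"])
print(round(gap, 1))  # → 16.0, mirroring the 16-point advantage reported above
```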

D. System Overhead

We measured the system overhead on a server equipped with 2.20 GHz × 40 processors and 135 GB of memory. Experiments were performed with the NBC video, and the average was measured after repeating the session five times for accuracy.

Scheduling overhead: We measured how much scheduling overhead occurred when the Frame rate Scheduler (F-Scheduler) was built into the OpenH264 encoder. The OpenH264 encoder without and with F-Scheduler takes about

21.85 seconds and 22.02 seconds on average, respectively. The processing overhead incurred by F-Scheduler to schedule an adaptive frame rate for each video chunk is only 0.76% on average, which clearly indicates that there is little or no overhead caused by the addition of F-Scheduler.

Video processing overhead: We measured how much processing overhead occurred when the Video Processor (V-Processor) processed a video based on the scheduling results from F-Scheduler. A video streaming server without V-Processor processes an uploaded video with various resolutions and bitrates, such as 720p (5 Mbps) and 480p (2.5 Mbps), to provide DASH functionality [13]. In addition to this task, V-Processor processes videos according to three battery levels. The video streaming server without and with V-Processor takes about 143 seconds and 289 seconds on average, respectively. This indicates that the processing overhead caused by V-Processor is approximately 100%. However, this video processing is internally handled by the streaming server and is a preliminary task performed prior to streaming the video to users. In addition, the time overhead can be significantly reduced if V-Processor is performed based on a selective strategy, such as processing only popular videos. Therefore, the overhead caused by V-Processor can be tolerated by both the user and the streaming server.

Storage overhead: Because the EVSO system generates additional videos based on the three battery levels, the amount of video that needs to be stored is nearly three times more than for a conventional streaming server. This may be a strain on the storage requirement of the streaming server; however, this problem can also be alleviated if V-Processor is selectively performed only on popular videos, as described above.
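The overhead figures above follow from a single ratio: the extra time a component adds, divided by the time without it. A short sketch applying that ratio to the timings reported in the text:

```python
def overhead_percent(without_s: float, with_s: float) -> float:
    """Relative time overhead, in percent, of enabling a component."""
    return 100.0 * (with_s - without_s) / without_s

# Timings reported in the text (seconds, averaged over five sessions).
f_scheduler = overhead_percent(21.85, 22.02)   # ≈ 0.8%, consistent with the ~0.76% reported
v_processor = overhead_percent(143.0, 289.0)   # ≈ 102%, i.e. "approximately 100%"
print(round(f_scheduler, 2), round(v_processor, 1))
```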

VI. RELATED WORK

A. Energy Saving when Streaming Videos

Lim et al. [4] extended the H.264/AVC encoder to provide a frame-skipping scheme during compression and transmission for mobile screen sharing applications. This scheme can reduce the energy consumption of mobile devices without significantly affecting the user experience, as similar frames are skipped during compression and transmission. In addition, this scheme does not transmit unnecessary frames to the mobile device because it works on the server side; thus, there is a benefit of increased available network bandwidth on the client side. However, this scheme incurs high computational overhead because it calculates the similarity of all frames with the SSIM method [5]. On the other hand, EVSO proposes

a novel M-Diff method specialized for a video encoder and provides a flexible way to adjust the frame rate according to the current battery status of the mobile device.

Since the wireless interface consumes a considerable amount of energy on the mobile device, there have been several efforts to reduce energy consumption by utilizing a playback buffer when streaming videos [23], [24]. These approaches download a large amount of data in advance and store it in the player's playback buffer to increase the idle time of the wireless interface. These approaches are interoperable with EVSO. However, if users frequently skip or quit while watching the video, network bandwidth can be wasted because the previously downloaded data is no longer used.

Kim et al. [7] adjusted the refresh rate of the screen by defining a content rate, which is the number of meaningful frames per second. If the content rate is low, this scheme reduces the refresh rate, which reduces the energy requirement without significantly affecting the user experience. However, the energy usage of the wireless interface, which is the main energy-consuming factor in video streaming, cannot be reduced at all because only the refresh rate on the user side is adjusted.

B. Energy Saving for Mobile Games

Hwang et al. [5] introduced a rate-scaling technique called

RAVEN that skips similar frames in the rendering loop while playing games. This system uses the Y-Diff method because the SSIM method has high computational overhead. The authors showed that Y-Diff has accuracy comparable to SSIM in human visual perception. On the other hand, EVSO proposes a novel M-Diff method specialized for a video streaming service. Moreover, RAVEN cannot be applied to a video streaming service because it performs rate-scaling in the rendering loop for mobile games.

Chen et al. [25] proposed a method called FingerShadow that reduces the power consumption of the mobile device by applying a local dimming technique to the screen areas covered by the user's finger while playing a game. However, this approach requires external interaction with user behavior. In contrast, EVSO is independent of external factors and performs power savings using the motion intensity information of the video.

VII. CONCLUSION AND FUTURE WORK

In this paper, we propose EVSO, a system that opportunistically applies adaptive frame rates to videos without requiring any user effort. We introduce a similarity calculation method specialized for a video encoder and present a novel scheduling technique that assigns an appropriate frame rate to each video chunk. Our various experiments show that EVSO reduces the power requirement by as much as 27% on mobile devices with little degradation of the user experience.

As future work, we plan to optimize various factors used in EVSO. A potential direction would be to use reinforcement learning to optimize decisions and continue learning through policy training. Another direction would be to include more videos with various frame rates in the experiments.

ACKNOWLEDGMENT

We thank the anonymous reviewers as well as our colleague Jaehyun Park for their valuable comments and suggestions. This research was a part of the project titled "SMART-Navigation project" funded by the Ministry of Oceans and Fisheries, Korea. This research was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2017R1A2B4005865).

REFERENCES

[1] "Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016-2021 White Paper," 2018. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html

[2] "Recommended upload encoding settings," 2018. [Online]. Available: https://support.google.com/youtube/answer/1722171

[3] "Samsung Galaxy S series," 2018. [Online]. Available: https://www.samsung.com/levant/smartphones

[4] K. W. Lim, J. Ha, P. Bae, J. Ko, and Y. B. Ko, "Adaptive frame skipping with screen dynamics for mobile screen sharing applications," IEEE Systems Journal, vol. 12, no. 2, pp. 1577-1588, 2016.

[5] C. Hwang, S. Pushp, C. Koh, J. Yoon, Y. Liu, S. Choi, and J. Song, "RAVEN: Perception-aware Optimization of Power Consumption for Mobile Games," ACM MobiCom, 2017.

[6] "Game Tuner," 2018. [Online]. Available: https://play.google.com/store/apps/details?id=com.samsung.android.gametuner.thin

[7] D. Kim, N. Jung, Y. Chon, and H. Cha, "Content-centric energy management of mobile displays," IEEE Transactions on Mobile Computing, vol. 15, no. 8, pp. 1925-1938, 2016.

[8] "Experimental videos," 2018. [Online]. Available: https://www.youtube.com/playlist?list=PL6zYoyhvrmRrmY6JNGVJhKjJdbJJZzy9z

[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, 2003.

[10] "Internet Information Services," 2018. [Online]. Available: https://www.iis.net

[11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.

[12] E. Cuervo, A. Wolman, L. P. Cox, K. Lebeck, A. Razeen, S. Saroiu, and M. Musuvathi, "Kahawai: High-quality mobile gaming using GPU offload," ACM MobiSys, 2015.

[13] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," Proceedings of the Second Annual ACM Conference on Multimedia Systems, 2011.

[14] M. S. Livingstone, Vision and Art: The Biology of Seeing. Wadsworth, 2002.

[15] E. B. Goldstein and J. Brockmole, Sensation and Perception. Cengage Learning, 2016.

[16] "OpenH264," 2018. [Online]. Available: http://www.openh264.org

[17] "Data never sleeps 2.0," 2018. [Online]. Available: https://www.domo.com/blog/data-never-sleeps-2-0

[18] "GitHub - ExoPlayer," 2018. [Online]. Available: https://github.com/google/ExoPlayer

[19] "FFmpeg," 2018. [Online]. Available: https://ffmpeg.org

[20] F. Xiao et al., "DCT-based video quality evaluation," Final Project for EE392J, vol. 769, 2000.

[21] "Toward a practical perceptual video quality metric," 2016. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html

[22] "Monsoon Solutions Inc.: Monsoon power monitor," 2018. [Online]. Available: https://www.msoon.com

[23] A. Rao, A. Legout, Y.-s. Lim, D. Towsley, C. Barakat, and W. Dabbous, "Network characteristics of video streaming traffic," ACM CoNEXT, 2011.

[24] W. Hu and G. Cao, "Energy-aware video streaming on smartphones," IEEE INFOCOM, 2015.

[25] X. Chen, K. W. Nixon, H. Zhou, Y. Liu, and Y. Chen, "FingerShadow: An OLED Power Optimization Based on Smartphone Touch Interactions," USENIX HotPower, 2014.
