ee 6850 lecture #10sfchang/course/vis/slide/lecture10-part1-handout.pdf · ee 6850, f'02,...

EE 6850, F'02, Chang, Columbia U.

EE 6850 Lecture #10 (Nov. 13, 2002)

� Part I:Content Adaptation for Ubiquitous Media Access� On-demand dynamic distillation� Scalable delivery� Content-adaptive streaming and messaging

� References� A. Fox, E. Brewer, S. Gribble, and E. Amir, “Adapting to Network and Client Variability via On-Demand

Dynamic Distillation,” ASPLOS, 1996. � J. R. Smith, R. Mohan, and C.-S. Li, “Scalable Multimedia Delivery for Pervasive Computing,” ACM

Multimedia Conference, Orlando, FL, Oct. 1999.� T.-L. Pham, G. Schneider and S. Goose, A situated computing framework for mobile and ubiquitous

multimedia access using small screen and composite devices,“ ACM Multimedia Conference, Oct. 2000, Los Angeles, CA.

� S.-F. Chang, D. Zhong, and R. Kumar, Real-Time Content-Based Adaptive Streaming of Sports Video, Columbia University ADVENT Technical Report #121, July 2001. Also IEEE Workshop on Content-Based Access to Video/Image Library, Hawaii, Dec. 2001.

� H. Sundaram and S.-F Chang, Condensing Computable Scenes using Visual Complexity and Film Syntax Analysis, IEEE Conference on Multimedia and Exhibition, Tokyo, Japan, Aug. 22-25, 2001.

� A. Jaimes, T. Echigo, M. Teraguchi, and F. Satoh, “LEARNING PERSONALIZED VIDEO HIGHLIGHTS FROM DETAILED MPEG-7 METADATA,” IEEE Intern. Conf. On Image Processing, NY, 2002.s

� S.F. Chang, “Optimal Video Adaptation and Skimming Using a Utility-Based Framework, TyrrhenianInternational Workshop on Digital Communications (IWDC-2002), Capri Island, Italy, Sept. 2002.


The Need for Video Adaptation

� Heterogeneous users, networks, and terminals� Content analysis to assist video adaptation decision

Broa dca stCont ent


Adaptation in 3-tier architecture

Metadata:• Transcoding

hints• Content

adaptability

adaptationtool library

resourceabstraction

& monitoring

proxy/gateway

server

videouser

terminalnetwork

optimal adaptation


Need for a Systematic Framework

task

device

active (e.g. search)passive (e.g. watching tv)

+

• what are allowable adaptations?• how much resource is available?• how to model and optimize quality?

UI resources

cpu speed

bandwidth

audio?


Active Proxy

� Active proxy dynamic distillers [Fox, Brewer, et al 96]� Datatype-specific

distillation� intermediate proxy – low

end-to-end latency� network/application

interface separation� Address different data

types: image, text, and video (global level)

Client variability (‘96 data)

Constraints: bandwidth, latency, data size, spatial-temporal size, color depth, audio


Datatype and adaptation axes

Distillation latency of Images

Variations:network

Variations:display

Variations:software


Effect on end-to-end latency

Text Datatype (PDF, text, HTML)


Video


IBM InfoPyramid

� InfoPyramid, Universal Tuner [Smith, Li, Mohand 99]� Content adaptation with various

fidelity and modality� Translation and transcoding

� Transcoding focused on images� Allowable operators

� Resolution, bit rate, color depth, modality substitute

� Each IP object is associated with a (rate, quality) descriptor

• Did not Address rich multi-level elements and structures in video• Did not address rapidly changing User/network conditions


Image transcoding methods


Detecting image purpose

A General Utility-Based Framework (Chang’02)

Entitye

a1

a2

a3

Adaptation Space

r1 r2

r3 Resource Space

u1

u2

u3

Utility Space

CR

� Entity: � A set of video data with consistency constraints on certain attributes� Basic data unit undergoing adaptation� Examples:

� Program, Shot, Scene� Frame, Object, Region� Syntactic or semantic elements, e.g., anchors, scoring segments� Synchronous multimedia entities, e.g., dialog, talking face, explosion


Multiple Degrees of Freedom for Adaptation

� Signal level� change of bit rate, frame rate, resolution, color

depth, SNR

� Time� Condensation by uniform time scaling,

content-based filtering (selection/dropping)

� Modality conversion� Key frame shows, video posters, spatial

summaries


Adaptation-Resource-Utility Spaces

� Adaptation Space� Temporal

� Frame rate reduction� Time condensation� Entity removal

� Spatial� Resolution� SNR, color depth� Object

� Modality conversion

� Bandwidth� Computation� Memory

� Resource Space� Power� Display� Viewer time

� Utility Space� Objective (SNR)� Subjective Scores� Information

Comprehension


Resource Constrained Utility Maximization

� Given a content entity e, and resource constraints (CR), find the optimal adaptation operation aopt within the search space to maximize the utility

� aopt = arg MinA U(e, a), s.t. R < CR

� Similar to the rate-distortion approach used in video coding

� Can be extended to a utility-constrained resource minimization formulation


Key Issues

� Given an entity, what are allowable adaptations in a specific environment?� frames droppable from MPEG, shots removable in a

scene� Ambiguity in Specifying Adaptation Operations, e.g.,

� Drop 10% of transform coefficients� Uniform allocation vs. R/D optimization

� Options� Standard compliant scalable adaptations� Empirical variation bounds among implementations� Describe rankings among competing operations, not their

utility values


Key Issues� Utility measure and modeling

� How to measure and combine resource and utility of different modalities and levels� Power, bandwidth, display, user time;

Audio, video, signal, subjective, comprehension� Signal level vs. comprehension level

� Relationships with user tasks and preferences� Representation schemes for ARU spaces and relationships

� Sample the permissible adaptation region and describe the associated R and U� E.g., binary selection from N shots � N(N-1) points� E.g., 4 frame rates, 2 resolutions � 8 adaptation points

� Given resource constraints, sample the permissible adaptation region� Divide adaptation space to separate dimensions

� Lose correlation among dimensions


Example 1: Video Skim Generation(with H. Sundaram 2001)

1. How much time is available? (CR)

2. What are allowable operations? (A)3. How will operations alter aesthetic affects,

perceptual quality, and semantic understanding? (U)

4. A Resource Constrained Utility Maximization Problem

Generalized utility framework for video skimming

dropped frames

Original scene � condensed clip (video skims)

Original-1 30% Skim

Original-news 17% Skim


The entities preserved in this case

� Video shot (duration altered)� The fundamental video entity; try to maximize the

coherence of each retained video shot� Segment beginnings (SBEG’s) significant phrase

� This is an element of the speech discourse� Synchronous multimedia segments

� Ensures maximum skim coherence� Elements of visual syntax

� dialogs, regular anchors, shot phrases� Film rhythm

� Preserves the “pace” of the film


How to Construct Computable Utility Model?

� How much time is required for generic comprehension (who, what, where, when)?

� Is comprehension time related to the computable spatio-temporal complexity of the shot ?

� Explore Viewer Perceptual Model from Film Theory� The presence of detail robs a shot of its screen time [Sharff 1982].

(a) (b)


Measuring Comprehension Time

� Represent the shot by its key-frame� A shot is selected at random [3600 shots]� The subject was asked to correctly answer four

questions in minimum time: � Who ?� What ?� When ?� Where ?

“Why” was not asked


Utility Function

complexity →

time →

0

02.

54.

5

1

reduce to upper boundoriginal

shot Ub

LbUtility function between the bounds

t: duration, c: complexity: selection indicator sequence

( ) 2 .4 0 1 .1 1

( ) 0 .6 1 0 .6 8b

b

U c c

L c c

= += +

� Plot of average time vs. complexity shows two bounds


Factors affecting comprehension

� The comprehension time is influenced by many factors:� Visual complexity� The viewer task (active vs. passive)� Prior knowledge of the viewer

100101000

� This example only focused on visual complexity since it is measurable.


Rich Structure: Considering syntax

� Minimum number of shots in a scene� The particular ordering of the shots

(cut)� The specific duration of the shots,

to direct viewer attention� Changing the scale of the shots

The specific arrangement of shots so as to bring out their mutual relationship. [ sharff 82 ].

Film makers think in terms of phrases of shots and not individual shots � choose the right entity for adaptation


The progressive phrase

Hence, a phrase (a group of shots) must at least have three shots.

“Two well chosen shots will create expectations of the development of narrative; the third well-chosen shot will resolve those expectations.”[ sharff 82 ].

Maximal shot removal:eliminate all the dark shots.


Structure (dialog)

Hence, a dialog must at least have six shots

“Depicting a conversation between m people requires 3m shots.” [ sharff 82 ].

Maximal adaptation:eliminate all the dark shots.

SVM classifier

Time-dependentViterbi decoder-temporal consistency

Audio content analysis

silence

significant phrases

Prosody analysis: (pitch, pause, energy)[Hirshberg, Nakatani, Arons, ‘92]

audio-scenes

silence removal

non-speech cleannoisy

speech

Tied Audio-Video Constraint

� Tied segments:� Include all significant phrases� Audio and video boundaries are fully synchronized� Cannot be condensed or de-synchronized� Allow viewers to “catch up” when viewing skims

� Untied segments:� Audio-video can be dropped, condensed, reduced� Audio-video segments do not have to synchronize

Significant Phrases

opening closingsignificant phrase

dialog


Utility Framework for Skim Generation

shot detection

video utility function

audio utility function

utility function

iterative

maximization

skim generation

shot duration and visual complexity

target skim time

audio constraints

visual syntax constraints

constraints

original

skim

[Chang, IWDC 02]

Example 2: Utility-Based MPEG-4 Video Transcoding(with J. Kim and Y. Wang)

(no FD,50%)

(B1,42%)

(all B,27%)

� Original bit rate � reduced rate (resource change)� Adaptation Space: FD: frame dropping, CD: coefficient dropping,

and combinations

� Utility Ranking Description: {(Ri, Rank(Ai), Ui, Consistency-Flag), i = 1,2, …}


MPEG-4 Fine-Grained Scalability

I PBase Layer

Spatial

Enhancement Layer

Temporal

Enhancement Layer

� Adaptation Space: Temporal frame rate and SNR bit planes


Utility Function of Subjective Spatio-Temporal Preference(w. R. Kumar and M. van der Schaar)

� SNR is inadequate for measuring utility

� Conduct study where users chose preferred frame-rate at different bit-rates

� As the bit-rate goes up, people prefer better frame-rates

� Preference varies with video category� High-motion videos (Stefan,

Coastguard) require a higher frame rate

0102030405060708090

100

0 5

200300500

Spatio-Temporal Preferences

0

2

4

6

8

10

12

0 500 1000 1500

Bitrate (kbps)

Fram

e-R

ate

(fp

s)

Foreman

Stefan

Coastguard

Mobile


Utility Model Guided FGS

� When more bit rate available, increase SNR quality and fix temporal rate to a predetermined bitrate

� Then improve temporal quality

� Further improve SNR quality at the new frame-rate

� And so on….

Enhanced-reference scheme

Scheme using fixed reference rate

(B) SNR quality of enhanced-scheme

Bandwidth

SNR dB 2T fps

T fps

4T fps

3T fps


FGS+ Scheme

I PBase Layer

Spatial

Enhancement Layer

Temporal

Enhancement Layer

� Solve the issue of determining optimal rate for motion prediction reference


Performance

� Improvement over FGS varies from 0.19 dB to 1.28 dB

� At low bit-rates simple videos benefit (Coastguard)

� At high bit-rates complex videos benefit (Mobile)

Improvement of FGS+ (B-frames) over

FGS

0

0.2

0.4

0.6

0.8

1

1.2

1.4

300 800 1300kbps

SNR (dB)

Foreman StefanCoastguard Mobile


Content-based utility classification and prediction

Challenge: predict utility function for an arbitrary video?

Content Features Classesobject motion

object complexity

camera operation

content genre

TD iC →

Content-basedUtility Function

classifier T i

T 1

T N

content-utility correlation & content-based classification


Example 3: Utility Adaptive Video Streaming

non-important segments: still frame + audio + captions

Content Adaptation

time

channel rate

important segments: video mode

Encoder andScheduler

Real-time Event Detection

Live video

View/text recognition

network

playback

� Model utility based on content “importance” vs. “non-importance”� Utility-based adaptive rate allocation


Detecting recurrent semantic units: Serve

Adaptive Color Filter

Object level verification

Canonical View Detection

Shot Detection

Compressed video

Every I/P frames Key frames/shot Down-sampled I/P frames

Color,

lines, player

(Chang, Zhong, Kumar ’00)

System Issues for Adaptive Streaming

� Increase buffer size feasible� 8MB buffer can store 2000 sec (32Kbps) - 250 sec (256Kbps)

� But playback latency is the main constraint� E.g., deliver real-time event within 60 sec

� Client error more serious than server error� Degrade to Channel Rate, Freeze and resume, or adaptive

playback speed.

t1t0

Input channel

accumulatedrate

time

Server Underflow

Client Underflow

Output

networkvideoencoder

playback

Recap

Entitye a1

a2

a3

Adaptation Space

r1 r2

r3 Resource Space

u1

u2

u3Utility Space

CR

� Content adaptation needed for UMA applications� A generalized conceptual framework for

� Modeling relationships among content entities, adaptation processes, utility, and resources

� Formulate problems as constrained optimization tasks, e.g.,� Time condensed skims� Modeling spatio-temporal utility preference� Content-adaptive streaming

� Open issues� utility model, high-dimensional representation, search

ee 6850 lecture #10sfchang/course/vis/slide/lecture10-part1-handout.pdf · ee 6850, f'02,...

Documents