Multimedia Signal Processing Group Swiss Federal Institute of Technology

Video Coding based on Audio-Visual Attention

Jong-Seok Lee, Francesca De Simone, Touradj Ebrahimi

EPFL, Switzerland

1 July 2009

Introduction

Objective of video coding
– Better quality with a smaller number of bits

How to achieve better video coding efficiency?
– Using the statistics of the signal
– Using the characteristics of the human visual system: focus of attention
  Only a small region around the fixation point is captured at high spatial resolution.

Attended region: less compression
Unattended region: more compression

Introduction

Which region draws attention?

Conspicuity-based (Itti, 2004)

Moving object-based (Cavallaro, 2005)

Face-based (Boccignone, 2008)

No consideration of cross-modal (audio-visual) interaction!

Audio-Visual Focus of Attention

– An abrupt sound draws visual attention to the location of the sound source. (Spence, 1997)

– Attending to auditory stimuli at a given location enhances the processing of visual stimuli at the same location. (Spence, 1996)

We define the sound-emitting region as the attended region.

Overall Procedure

Original frame → Source localization → Priority map → Blurring (Gaussian pyramid) → Compression (H.264/AVC)

Audio-Visual Source Localization

Proposed method
– Is applicable to normal video with a mono audio channel.
– Makes no assumption about the sound source.
– Does not require a training phase.

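Audio-Visual Source Localization

Canonical correlation analysis
– To find $\mathbf{w}_a$ and $\mathbf{w}_v$ which maximize the canonical correlation:

$$
\frac{E\left[\mathbf{w}_v^T \mathbf{v}\,\mathbf{a}^T \mathbf{w}_a\right]}
     {\sqrt{E\left[\mathbf{w}_v^T \mathbf{v}\mathbf{v}^T \mathbf{w}_v\right]\,
            E\left[\mathbf{w}_a^T \mathbf{a}\mathbf{a}^T \mathbf{w}_a\right]}}
=
\frac{\mathbf{w}_v^T V^T A\,\mathbf{w}_a}
     {\sqrt{\mathbf{w}_v^T V^T V\,\mathbf{w}_v \;\; \mathbf{w}_a^T A^T A\,\mathbf{w}_a}}
$$

V: collection of visual features
A: collection of audio features

– Equivalently,

$$V\mathbf{w}_v = A\mathbf{w}_a$$

One-dimensional audio feature:

$$
V\mathbf{w}_v = A, \qquad
V \in \mathbb{R}^{T \times N_v}, \qquad
A = \left[a(1),\; a(2),\; \ldots,\; a(T)\right]^T
$$

(Rows of V: time (T). Columns of V: spatial location (N_v).)

Many solutions exist!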
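The remark that many solutions exist can be illustrated numerically. The following is a minimal sketch, assuming random stand-in data and hypothetical sizes (T = 8 frames, N_v = 50 locations, neither taken from the slides): when there are more spatial locations than frames, the system V w_v = A is underdetermined, so adding any null-space vector of V to one solution gives another.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Nv = 8, 50                      # hypothetical sizes: 8 frames, 50 spatial locations
V = rng.standard_normal((T, Nv))   # stand-in visual features, one row per frame
A = rng.standard_normal(T)         # stand-in one-dimensional audio feature per frame

# Minimum-L2-norm solution of the underdetermined system V w = A.
w_min = np.linalg.pinv(V) @ A

# Any vector in the null space of V can be added without violating V w = A.
_, _, Vt = np.linalg.svd(V)        # the last Nv - T rows of Vt span the null space
w_other = w_min + 3.0 * Vt[-1]

print(np.allclose(V @ w_min, A), np.allclose(V @ w_other, A))
print(np.linalg.norm(w_min - w_other))
```

Both vectors satisfy the constraint, which is why an additional criterion is needed to pick one of them.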

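Audio-Visual Source Localization

Sparsity principle (Kidron, 2007)
– We want to have the energy concentrated in a small region.

$$\min_{\mathbf{w}_v} \|\mathbf{w}_v\|_1 \quad \text{subject to} \quad V\mathbf{w}_v = A$$

Spatio-temporal consistency

[Figure: consecutive frames t, t+1, t+2]

$$\min_{\mathbf{w}_v} \sum_{i=1}^{N_v} f_i\,|w_{vi}| \quad \text{subject to} \quad V\mathbf{w}_v = A$$

$\mathbf{w}_v^{\text{old}}$ → Gaussian filtering → $\{f_i\}$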
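Both minimization problems on this slide can be posed as linear programs. The sketch below is one possible way to solve them, not necessarily the solver used by the authors: it assumes SciPy's `linprog`, random stand-in data, and a hypothetical helper name `weighted_l1_solution`. With all weights f_i = 1 it reduces to the plain L1 problem.

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1_solution(V, A, f):
    """Solve min sum_i f_i |w_i| subject to V w = A as a linear program."""
    Nv = V.shape[1]
    # Split w = p - q with p, q >= 0; for positive f, |w_i| = p_i + q_i at the optimum.
    cost = np.concatenate([f, f])
    A_eq = np.hstack([V, -V])
    res = linprog(cost, A_eq=A_eq, b_eq=A, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:Nv] - res.x[Nv:]

rng = np.random.default_rng(1)
T, Nv = 20, 60                            # hypothetical sizes
V = rng.standard_normal((T, Nv))
w_true = np.zeros(Nv)
w_true[[10, 11, 12]] = [1.0, 2.0, 1.5]    # energy concentrated in a small region
A = V @ w_true

w_sparse = weighted_l1_solution(V, A, np.ones(Nv))   # f_i = 1: plain L1 objective
w_l2 = np.linalg.pinv(V) @ A                         # minimum-L2 solution, for contrast

# Number of non-negligible weights in each solution.
print(np.sum(np.abs(w_sparse) > 1e-6), np.sum(np.abs(w_l2) > 1e-6))
```

Passing a non-uniform `f` gives the weighted variant. The slide indicates only that the weights come from Gaussian filtering of the previous solution; the exact mapping from the filtered result to f_i is not shown, so it is left out here.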

Video Coding

Localization result → Priority map → Gaussian pyramid of L levels → Blurred image → Compression (H.264/AVC)
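The priority-driven blurring step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a stack of repeatedly low-pass-filtered frames as a stand-in for a true Gaussian pyramid (which would also downsample), a priority map with values in [0, 1], and hypothetical helper names `blur` and `foveate`. The H.264/AVC encoding step is not shown.

```python
import numpy as np

def blur(img):
    """Separable 5-tap binomial low-pass filter with edge replication."""
    k = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0
    padded = np.pad(img, 2, mode="edge")
    # Filter along columns (horizontal pass), then along rows (vertical pass).
    rows = sum(k[i] * padded[:, i:i + img.shape[1]] for i in range(5))
    return sum(k[i] * rows[i:i + img.shape[0], :] for i in range(5))

def foveate(frame, priority, levels):
    """Pick, per pixel, a blur level that grows as priority drops from 1 to 0."""
    stack = [frame.astype(float)]
    for _ in range(levels - 1):
        stack.append(blur(stack[-1]))
    stack = np.stack(stack)                               # shape (levels, H, W)
    idx = np.rint((1.0 - priority) * (levels - 1)).astype(int)
    return np.take_along_axis(stack, idx[None], axis=0)[0]

rng = np.random.default_rng(2)
frame = rng.uniform(0, 255, size=(32, 32))   # stand-in luminance frame
priority = np.zeros((32, 32))
priority[8:24, 8:24] = 1.0                   # hypothetical sound-emitting region
out = foveate(frame, priority, levels=4)
```

Pixels inside the attended region are passed through unchanged, while the rest of the frame loses high-frequency detail, which is what lets the subsequent encoder spend fewer bits there.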

Experiments

4 test sequences including multiple moving objects in the scene

Audio-visual source localization
– Visual features: differential images
– Audio features: frame energy

H.264/AVC coding: constant quantization parameter (QP) mode

Subjective test
– Is the quality degradation acceptable?
– Double stimulus continuous quality scale (DSCQS)
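The two feature types listed above could be computed along the following lines. This is a minimal sketch with assumptions that the slides do not confirm: the differential image is taken as the absolute difference between consecutive frames, frame energy as the sum of squared audio samples within each video-frame-long window, and the function names are hypothetical.

```python
import numpy as np

def differential_images(frames):
    """Absolute difference between consecutive frames, flattened to (T, Nv)."""
    frames = np.asarray(frames, dtype=float)
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.reshape(diffs.shape[0], -1)

def frame_energy(audio, samples_per_frame):
    """Sum of squared samples in each video-frame-long audio window."""
    n = len(audio) // samples_per_frame
    windows = np.asarray(audio[:n * samples_per_frame], dtype=float)
    return np.sum(windows.reshape(n, samples_per_frame) ** 2, axis=1)

# Toy data: a 5-frame 4x4 video in which one pixel changes at frame 2,
# and a mono audio track that becomes active at the same moment.
video = np.zeros((5, 4, 4))
video[2:, 1, 1] = 10.0
audio = np.concatenate([np.zeros(200), np.ones(300)])

V = differential_images(video)     # shape (4, 16)
a = frame_energy(audio, 100)       # shape (5,)
```

Differencing yields one row fewer than the number of frames, so in practice one audio value would be dropped to align the two feature sets in time.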

Results: Localization

[Figure: localization results. Panels: Conventional (Kidron, 2007); Proposed]

Results: Coding Efficiency & Subjective Quality

[Figure: differential mean opinion score (DMOS, axis 0–100) at constant QP=26 for five conditions: no blurring; blurring outside sound source (L=2); blurring outside moving regions (L=2); blurring outside sound source (L=6); blurring outside moving regions (L=6). Gain labels, in the order they appear in the figure: 24%, 17%, 51%, 40%.]
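For readers unfamiliar with the two quantities in this figure, the sketch below shows how they are commonly computed. It rests on assumptions not confirmed by the slides: DMOS is taken as the mean, over subjects, of the reference score minus the test score, and "gain" is read as the relative bitrate saving against the encode without blurring at the same QP. All numbers are made up for illustration and are not the study's data.

```python
import numpy as np

def dmos(ref_scores, test_scores):
    """Mean over subjects of (reference score - test score); higher means more degradation."""
    ref = np.asarray(ref_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    return float(np.mean(ref - test))

def bitrate_gain_percent(bits_reference, bits_test):
    """Relative bitrate saving of the test encode against the reference encode."""
    return 100.0 * (bits_reference - bits_test) / bits_reference

# Hypothetical ratings from three subjects on a 0-100 scale.
score = dmos([80, 70, 90], [75, 70, 80])
# Hypothetical bit counts for the unblurred and blurred encodes.
gain = bitrate_gain_percent(1000, 760)
```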

Conclusion & Discussion

Audio-visual focus of attention can be used for efficient video coding.

Discarding information outside the focus of attention does not degrade perceived quality significantly.
– However, perceived quality may vary with the original content, which may need to be taken into account.

AV FoA does not explain everything. It should be combined with other attention mechanisms.

References

L. Itti, “Automatic foveation for video compression using a neurobiological model of visual attention,” IEEE Trans. Image Process., 2004

A. Cavallaro, O. Steiger, T. Ebrahimi, “Semantic video analysis for adaptive content delivery and automatic description,” IEEE Trans. Circuits Syst. Video Technol., 2005

G. Boccignone, A. Marcelli, P. Napoletano, G. D. Fiore, G. Iacovoni, S. Morsa, “Bayesian integration of face and low-level cues for foveated video coding,” IEEE Trans. Circuits Syst. Video Technol., 2008

B. Stein, M. Meredith, “The Merging of the Senses,” MIT Press, 1993

R. Sharma, V. I. Pavlovic, T. S. Huang, “Toward multimodal human-computer interface,” Proc. IEEE, 1998

H. McGurk, J. MacDonald, “Hearing lips and seeing voices,” Nature, 1976

J.-S. Lee, C. H. Park, “Robust audio-visual speech recognition based on late integration,” IEEE Trans. Multimedia, 2008

M. Sargin, Y. Yemez, E. Erzin, A. Tekalp, “Audiovisual synchronization and fusion using canonical correlation analysis,” IEEE Trans. Multimedia, 2007

P. Perez, J. Vermaak, A. Blake, “Data fusion for visual tracking with particles,” Proc. IEEE, 2004

B. Rivet, L. Girin, C. Jutten, “Mixing audiovisual speech processing and blind source separation for the extraction of speech signal from convolutive mixtures,” IEEE Trans. Multimedia, 2007

C. Spence, J. Driver, “Audiovisual links in exogenous covert spatial orienting,” Perception & Psychophysics, 1997

C. Spence, J. Driver, “Audiovisual links in endogenous covert spatial attention,” J. Experimental Psychology: Human Perception & Performance, 1996

E. Kidron, Y. Schechner, M. Elad, “Cross-modal localization via sparsity,” IEEE Trans. Signal Process., 2007

Questions/comments are welcome!

Contact

http://mmspg.epfl.ch

[email protected]