Saliency and Tracking in Compressed

Video

by

Sayed Hossein Khatoonabadi

M.Sc., Amirkabir University of Technology, 2005

B.Sc., University of Kerman, 2002

Thesis Submitted in Partial Fulfillment

of the Requirements for the Degree of

Doctor of Philosophy

in the

School of Engineering Science

Faculty of Applied Sciences

© Sayed Hossein Khatoonabadi 2015

SIMON FRASER UNIVERSITY

Summer 2015

All rights reserved.

However, in accordance with the Copyright Act of Canada, this work may be

reproduced without authorization under the conditions for “Fair Dealing”.

Therefore, limited reproduction of this work for the purposes of private study,

research, criticism, review and news reporting is likely to be in accordance

with the law, particularly if cited appropriately.

APPROVAL

Name: Sayed Hossein Khatoonabadi

Degree: Doctor of Philosophy

Title: Saliency and Tracking in Compressed Video

Examining Committee: Chair: Dr. Rodney G. Vaughan

Professor, Simon Fraser University

Dr. Ivan V. Bajic

Associate Professor, Simon Fraser University

Senior Supervisor

Dr. Parvaneh Saeedi


Supervisor

Dr. Nuno Vasconcelos

Professor, University of California, San Diego

Supervisor

Dr. Jie Liang


Internal Examiner

Dr. Z. Jane Wang

Professor, University of British Columbia

External Examiner

Date Approved: June 30th, 2015

Abstract

Visual saliency is the propensity of a part of the scene to attract attention. Computational

modeling of visual saliency has become an important research problem in recent

years, with applications in quality assessment, compression, object tracking, and so on.

While most saliency estimation models for dynamic scenes operate on raw video, their high

computational complexity is a serious drawback when it comes to practical applications.

Our approach for decreasing the complexity and memory requirements is to avoid decoding

the compressed bitstream as much as possible. Since most modern cameras incorporate

video encoders, this paves the way for in-camera saliency estimation, which could be useful

in a variety of computer vision applications. In this dissertation we present compressed-

domain features that are highly indicative of saliency in natural video. Using these features,

we construct two simple and effective saliency estimation models for compressed video. The

proposed models have been extensively tested on two ground truth datasets using several

accuracy metrics, and shown to yield considerable improvement over several state-of-the-art

compressed-domain and pixel-domain saliency models. Another contribution is a tracking

algorithm that also uses only compressed-domain information to isolate moving regions

and estimate their trajectories. The algorithm has been tested on a number of standard

sequences, and the results demonstrate its advantages over the state of the art in

compressed-domain tracking and segmentation, with over 30% improvement in F-measure.

Keywords: Compressed-Domain Processing, Video Object Tracking, Visual Attention,

Visual Saliency Modeling

To my mother and father for their constant support, encouragement, and love.

To my brothers for their constant friendship and belief in me and my dreams.

Acknowledgements

I am grateful for the opportunity to have spent a wonderful time at SFU, a

pioneering research facility and an exceptionally friendly environment. This has been and will

always be a marvelous and memorable experience.

First and foremost, I would like to express my deepest gratitude to my supervisor, Prof.

Ivan V. Bajic, for his generous guidance and mentorship during my PhD. His insightful

comments, invaluable assistance, support and enthusiasm kept me on track with this

dissertation and provided me with numerous opportunities to learn and grow. Prof. Bajic’s

willingness to share his knowledge and the opportunities he has provided me is very much

appreciated.

I wish to acknowledge Prof. Nuno Vasconcelos for giving me the opportunity to

collaborate with the Statistical Visual Computing Laboratory team at the University of California,

San Diego under his supervision for six months. His valuable comments and superb ideas

have undoubtedly had a great impact on the success of this project. I am very fortunate to

have collaborated with and learned from one of the best in the field.

It is my pleasure to thank my supervisory committee member, Prof. Parvaneh Saeedi,

for her insightful comments on my research that greatly improved the clarity and the quality

of this work. I would also like to thank Prof. Jie Liang, and Prof. Z. Jane Wang, my internal

and external examiners, who offered priceless feedback, support and encouragement for my

research. I also express my great thanks to Prof. Rodney G. Vaughan who kindly agreed to

coordinate my defense. I would also like to acknowledge the financial support I received

from Cisco Systems, Inc. In particular, I owe thanks to Dr. Yufeng Shan for his support

and encouragement.

Finally, none of this would have been possible without the full support of my friends

and family. I would like to thank them for their unwavering support and encouragement throughout

my life. They have done their best to pave the road to success for me, which

led me to this stage, where they can finally see the results of

their lifetime endeavors. For now, I can only hope to make them proud and to make sure

they know how much I love them. Mom, more than anyone else, this thesis is dedicated to

you.

Contents

Approval

Abstract

Dedication

Acknowledgements

Contents

List of Tables

List of Figures

List of Symbols

List of Acronyms

1 Introduction

1.1 Background and Motivation

1.2 Preview and Contributions

1.2.1 Bottom-Up Saliency Estimation

1.2.2 Saliency Model Evaluation Framework

1.2.3 A Comparison of Compressed-Domain Saliency Models

1.2.4 Proposed Saliency Estimation Methods

1.2.5 Compressed-Domain Tracking

1.3 Scholarly Publications

1.3.1 Journal papers

1.3.2 Conference papers

2 Bottom-Up Saliency Estimation

2.1 The Itti-Koch-Niebur (IKN) Saliency Model

2.1.1 Feature Map Extraction

2.1.2 The Master Saliency Map

2.1.3 Focus of Attention

2.2 Spatio-Temporal Saliency Estimation

3 Saliency Model Evaluation Framework

3.1 Eye-Tracking Video Datasets

3.1.1 The SFU Dataset

3.1.2 The DIEM Dataset

3.1.3 Benchmark Models

3.2 Accuracy Evaluation

3.2.1 Area Under Curve (AUC)

3.2.2 Kullback-Leibler Divergence (KLD) and J-Divergence (JD)

3.2.3 Jensen-Shannon Divergence (JSD)

3.2.4 Normalized Scanpath Saliency (NSS)

3.2.5 Pearson Correlation Coefficient (PCC)

3.3 Data Analysis Considerations

3.3.1 Gaze Point Uncertainty

3.3.2 Center Bias and Border Effects

4 Compressed-Domain Saliency Estimation

4.1 Compressed-Domain Visual Saliency Models

4.1.1 Saliency Based on Perceived Motion Energy Spectrum

4.1.2 Saliency Based on Motion Attention Model

4.1.3 Saliency Based on Perceptual Importance Map

4.1.4 Saliency Based on Motion Center-Surround Difference Model

4.1.5 Saliency Based on Center-Surround Difference

4.1.6 Saliency Based on Motion Saliency Map and Similarity Map

4.1.7 A Convex Approximation to IKN Saliency

4.2 Experiments

4.2.1 Experimental Setup

4.2.2 Results

4.3 Discussion

4.4 Conclusions

5 Compressed-Domain Correlates of Fixations

5.1 Compressed-domain features

5.1.1 Motion Vector Entropy (MVE)

5.1.2 Smoothed Residual Norm (SRN)

5.1.3 Operational Block Description Length (OBDL)

5.2 Discriminative Power of the Proposed Compressed-Domain Features

6 Proposed Saliency Estimation Algorithms

6.1 MVE+SRN Saliency Estimation Model

6.2 OBDL-MRF Saliency Estimation Model

6.2.1 MRF Model

6.2.2 Temporal Consistency

6.2.3 Observation Coherence

6.2.4 Compactness

6.2.5 Optimization

6.2.6 Final Saliency Map

6.3 Experiments

6.3.1 Experimental setup

6.3.2 Results

6.4 Discussion and conclusions

7 Compressed-Domain Tracking

7.1 MRF-Based Tracking

7.1.1 Temporal Continuity

7.1.2 Context Coherence

7.1.3 Compactness

7.1.4 Optimization

7.2 Preprocessing

7.2.1 Polar Vector Median for Intra-Coded Blocks

7.2.2 Global Motion Compensation

7.3 Experiments

7.3.1 Experimental setup

7.3.2 Results

8 Conclusions and Future Work

8.1 Summary and Conclusions

8.2 Future Work

Bibliography

List of Tables

3.1 Datasets used in our study to evaluate different visual saliency models.

3.2 Summary of evaluation metrics used in the study.

4.1 Compressed-domain visual saliency models included in the study (MVF: Motion Vector Field; DCT-R: Discrete Cosine Transformation of residual blocks; DCT-P: Discrete Cosine Transformation of pixel blocks; KLD: Kullback-Leibler Divergence; AUC: Area Under Curve; ROC: Receiver Operating Characteristic).

4.2 Ranking of test sequences according to average scores across all models excluding IO and GAUSS.

4.3 Ranking of test sequences according to different metrics for IO.

4.4 The PSNR value (in dB) for various QP values on sequences from the SFU dataset.

4.5 Average processing time in milliseconds per frame.

5.1 Results of statistical comparison of test and control samples. For each sequence, the p-value of a two-sample t-test and the percentage (%) of frames where the test sample mean is larger than the control sample mean are shown.

6.1 Saliency estimation algorithms used in our evaluation. D: target domain (cmp: compressed; pxl: pixel); I: Implementation (M: Matlab; P: Matlab p-code; C: C/C++; E: Executable).

6.2 Average processing time (ms) per frame.

7.1 Parameter values used in our experiments.

7.2 Total average of Precision (P), Recall (R), and F-Measure (F) in percent for different methods.

List of Figures

2.1 Architecture of the IKN model from [73].

2.2 An example of MaxNorm normalization of a feature from [73].

2.3 FancyNorm of two feature maps, from [72]: (a) contains a strong activation peak and several weaker activation peaks, (b) contains various strong activation peaks.

2.4 From [73]: the processing steps of the IKN model, demonstrated by an example. Three conspicuity maps of color (C), intensity (I) and orientation (O) are derived from image features and integrated to create the master saliency map (S). The FOA is first directed to the most salient location, specified by an arrow in 92 ms simulated time, after which the next FOAs are successively determined considering inhibition-of-return feedback.

3.1 Sample gaze visualization from the SFU Dataset. The gaze points from the first viewing are indicated as white squares, those from the second viewing as black squares.

3.2 Sample gaze visualization from the DIEM Dataset. The gaze points of the right eye are shown as white squares, those of the left eye as black squares.

3.3 The heatmap visualization of gaze points combined across all frames and all observers, for the first viewing in the SFU dataset and the right eye in the DIEM dataset. Gaze points accumulate near the center of the frame.

4.1 Functional diagrams for various compressed-domain saliency models.

4.2 Sample saliency maps obtained by various models.

4.3 Accuracy of various saliency models over the SFU and DIEM datasets according to AUC′ score for (top) I-frames and (bottom) P-frames. The 2-D color map shows the average AUC′ score of each model on each sequence. Top: Average AUC′ score for each sequence, across all models. Right: Average AUC′ score for each model across all sequences. Error bars represent standard error of the mean (SEM), σ/√n, where σ is the sample standard deviation of n samples.

4.4 Accuracy of various saliency models over the SFU and DIEM datasets according to NSS′.

4.5 Accuracy of various saliency models over the SFU and DIEM datasets according to JSD′.

4.6 Accuracy of various saliency models over the SFU and DIEM datasets according to PCC.

4.7 Evaluation of models using various metrics.

4.8 Illustration of multiple comparison test for (left) Hall Monitor and (right) Mobile Calendar sequences using AUC′.

4.9 The number of appearances among top performers, using various evaluation metrics. Results for I-frames are shown at the top, those for P-frames at the bottom.

4.10 The relationship between the QP parameter and the models’ accuracy on the SFU dataset.

5.1 Motion cube N(n) associated with block n (shown in red) is its causal spatio-temporal neighborhood of size W × W × L.

5.2 Scatter plots of the pairs (control sample mean, test sample mean) in each frame, for MVE (top), SRN (middle) and OBDL (bottom). Dots above the diagonal show that feature values at fixation points are higher than at randomly selected points.

6.1 Block diagram of the proposed MVE+SRN saliency estimation algorithm.

6.2 Sample saliency maps obtained by various algorithms.

6.3 Accuracy of various saliency algorithms over the two datasets according to AUC′. Each 2-D color map shows the average AUC′ score of each algorithm on each sequence. The average AUC′ performance across sequences/algorithms is shown in the sidebar/topbar. Error bars represent standard error of the mean.

6.4 Accuracy of various saliency algorithms over the two datasets according to NSS′.

6.5 The number of appearances among top performers, using AUC′ and NSS′ evaluation metrics.

6.6 The frequencies of saliency values estimated by different algorithms at the fixation locations (narrow blue bars) and random points from non-fixation locations (wide green bars) vs the number of human fixations. The JSD between the two distributions corresponding to each algorithm indicates how much the distributions diverge from each other (histograms are sorted from left to right and top to bottom according to the JSD metric).

6.7 The relationship between the average PSNR and the models’ accuracy.

7.1 The flowchart of our proposed moving object tracking.

7.2 Distribution of d(n) and d′(n) for the ball in frame #4 of the Mobile Calendar sequence.

7.3 MV assignment for an intra-coded MB. One of the first-order neighboring MBs is also intra-coded, and the remaining neighbors have MVs assigned to variable size blocks.

7.4 Polar Vector Median: (a) original vectors, (b) angles of vectors; cyan vectors: candidate vectors for computing representative angle, red vector: representative angle, (c) lengths of vectors, red line: representative length, (d) result of polar vector median, green vector: standard vector median [14], red vector: polar vector median.

7.5 The effect of assigning PVM to intra-coded blocks: (a) intra-coded blocks indicated as yellow squares; (b) tracking result without PVM assignment, (c) tracking result with PVM assignment. TP are shown as green, FP as blue, FN as red.

7.6 Small block modes (8 × 4, 4 × 8 and 4 × 4) of frame #2 for Coastguard and Stefan sequences.

7.7 Object detection during ST-MRF-based tracking: (a) frame #70 of Coastguard, (b) target superimposed by scaled MVF after GMC, (c) the heatmap visualization of MRF energy, (d) tracking result by the proposed method, (e) segmentation result from [119], and (f) [163].

7.8 Object detection/tracking by (a) proposed method and (b) method from [103] for frame #97 of Mobile Calendar; (c) segmentation result from [119], and (d) [163].

7.9 Trajectory results of (a) Mobile Calendar, (b) Coastguard, (c) Stefan CIF, (d) Stefan SIF, (e) Hall Monitor and (f) Flower Garden sequences (blue lines: proposed algorithm, yellow lines: ground truth).

List of Symbols

I The intensity channel of an image

R The red channel of an image

G The green channel of an image

B The blue channel of an image

M The motion channel of a frame

F The flicker channel of a frame

R The tuned color component of red

G The tuned color component of green

B The tuned color component of blue

Y The tuned color component of yellow

I The intensity feature map

RG The opponent color feature map of (red,green)

BY The opponent color feature map of (blue,yellow)

Oθ The orientation feature map with the orientation of θ

Mθ The motion feature map with the orientation of θ

F The flicker feature map

K A feature map

I The intensity conspicuity map

C The color conspicuity map

O The orientation conspicuity map

M The motion conspicuity map

F The flicker conspicuity map

K A conspicuity map

Cf The conspicuity map of feature f

G The ground-truth saliency map

S The predicted saliency map

S ′ The normalized saliency map according to normalized scanpath saliency

Ss The static/spatial saliency map

St The motion/temporal saliency map

S ′t The temporal saliency map after global motion compensation

S−1s The static saliency map of the previous non-predicted frame

S−1t The motion saliency map of the previous predicted frame

SMVE The saliency map of motion vector entropy

SSRN The saliency map of smoothed residual norm

Mk The motion saliency map at level k

Em The normalized motion magnitude

Eg The global angle coherence

Es The spatial angle coherence

Et The temporal angle coherence

Ef The spatial frequency content

Ee The edge energy

X The DCT coefficients

Xn The DCT coefficient of the block n

Xr The DCT coefficient of the residual block r

XS The DCT coefficients of the desired signal

XN The DCT coefficients of the undesired signal

SS The power spectral densities of the desired signal

SN The power spectral densities of the undesired signal

(x, y) A pixel coordinate

(x, y, t) A pixel coordinate of frame t

n A block of pixels in an image

m A block of pixels in an image

p A block of pixels in an image

r A residual block in an image

v(n) The motion vector assigned to block n

(vx, vy) The motion vector assigned to a block at coordinate (x, y)

v′(n) The preprocessed motion vector assigned to block n

v The representative vector of a region’s motion

V A list of motion vectors

V A list of motion vectors containing a subset of V

ot The observed information of frame t

κt The block coding mode and partition size of frame t

T The transfer function

Υθ The shifted image by one pixel orthogonal to the Gabor orientation of θ

D The difference of Gaussians map

Oθ The Gabor orientation of an image with the orientation of θ

NG The 2-D Gaussian density map

P The probability distribution

Q The probability distribution

b The Bernoulli distribution

H The histogram

H The entropy

L The low-pass filter

corr The Pearson correlation coefficient

cov The covariance function

inf The infimum function

sup The supremum function

F The set of pixel coordinates of fixations

E The energy function

ξ The block-wise energy/error function

Z The partition function

w The weighting function

mi A global motion parameter (m1, m4: translation; m2, m6: zoom; m3, m5: rotation)

ρ The strength of camera motion

θ The orientation/angle

ϑ The resolution level (scale) of an image

Λk The motion vectors of 8-connected neighbors at level k

z The quantized transformed residual of a macroblock

dc The DC value of a block

ωt The class labels of frame t

ωt∗ The optimal label assignment for frame t

Ωt The set of all possible label assignments for frame t

ψ A sample label assignment

Φ The measure of saliency in the neighborhood

W The spatial dimension

L The temporal dimension

µ The sample mean

µ The weighted sample mean

m The maximum value

σ The Gaussian parameter or sample standard deviation

σ The weighted standard deviation

γ(·, ·) The degree of overlap between two class labels

c(·, ·) The consistency between the two locations

d(·, ·) The distance function between two locations

d(·) The length of a vector

d′(·) The normalized length of a vector

∆f (·, ·) The absolute difference between the feature values of two blocks

δ(·) The dissimilarity between a motion vector and its neighboring motion vectors

N(·) The neighborhood of a given block

N+(·) The first-order neighborhood of a given block

N×(·) The second-order neighborhood of a given block

N (·) The normalization operator

ℓp The p-norm

τ The threshold

0 The matrix of 0s

1 The matrix of 1s

T Transpose

∝ The proportional relationship

∀ The for all operator

2⇓ The downsampling operator by a factor of two

2⇑ The upsampling operator by a factor of two

⊕ The center-surround combination operator

The center-surround difference operator

The pointwise multiplication operator

∗ The convolution operator

∠ The polar angle

|| · ||2 The Euclidean distance

|| · ||s2 The Euclidean distance along the spatial dimension

|| · ||t2 The Euclidean distance along the temporal dimension

| · | Absolute value operator or cardinality

|·|≥0 The operator that sets the negative values to zero

List of Acronyms

abb advert-bbc4-bees

abl advert-bbc4-library

ai advert-iphone

aic ami-ib4010-closeup

ail ami-ib4010-left

blicb bbc-life-in-cold-blood

bws bbc-wildlife-serpent

ds diy-sos

hp6t harry-potter-6-trailer

mg music-gummybear

mtnin music-trailer-nine-inch-nails

nim nightlife-in-mozambique

ntbr news-tony-blair-resignation

os one-show

pas pingpong-angle-shot

pnb pingpong-no-bodies

ss sport-scramblers

swff sport-wimbledon-federer-final

tucf tv-uni-challenge-final

ufci university-forum-construction-ionic

2-D Two-Dimensional

3-D Three-Dimensional

APPROX Approximation to IKN

ASP Advanced Simple Profile

AUC Area Under Curve

AUC′ The center-bias-corrected Area Under Curve

AVC Advanced Video Coding

AWS Adaptive Whitening Saliency

B-frame Bi-Predicted frame

BCM Block Coding Mode

CIF Common Intermediate Format

CRCNS Collaborative Research in Computational Neuroscience

DCT Discrete Cosine Transform

DCT-P Discrete Cosine Transformation of Pixel blocks

DCT-R Discrete Cosine Transformation of Residual blocks

DIEM Dynamic Images and Eye Movements

DIOFM DKL-color, Intensity, Orientation, Flicker, and Motion

DoG Difference of Gaussians

EER Equalized Error Rate

FN False Negative

FOA Focus of Attention

FP False Positive

FPR False Positive Rate

GAUS-CS Gaussian Center-Surround

GAUSS Gaussian center-bias

GBVS Graph-Based Visual Saliency

GM Global Motion

GMC Global Motion Compensation

GME Global Motion Estimation

GOP Group-of-Pictures

HEVC High Efficiency Video Coding

HVS Human Visual System

ICM Iterated Conditional Modes

IKN Itti-Koch-Niebur

I-frame Intra-coded frame

IO Intra-Observer

JD J-Divergence

JSD Jensen-Shannon Divergence

JSD′ The center-bias-corrected Jensen-Shannon Divergence

KLD Kullback-Leibler Divergence

MAM Motion Attention Model

MAP Maximum a Posteriori

MB Macroblock

MCSDM Motion Center-Surround Difference Model

MPEG Moving Picture Experts Group

MRF Markov Random Field

MSM-SM Motion Saliency Map - Similarity Map

MV Motion Vector

MVF Motion Vector Field

MVE Motion Vector Entropy

NSS Normalized Scanpath Saliency

NSS′ The center-bias-corrected Normalized Scanpath Saliency

OBDL Operational Block Description Length

P-frame Predicted frame

PCC Pearson Correlation Coefficient

PIM-MCS Perceptual Importance Map based on Motion Center-Surround

PIM-ZEN Perceptual Importance Map based on Zen method

PMES Perceived Motion Energy Spectrum

PNSP-CS Parametrized Normalization, Sum and Product - Center-Surround

PSNR Peak Signal-to-Noise Ratio

PVM Polar Vector Median

QCIF Quarter Common Intermediate Format

QP Quantization Parameter

RN Residual Norm

ROC Receiver Operating Characteristic

SEM Standard Error of the Mean

SFC Spatial Frequency Content

SFU Simon Fraser University

SIF Source Input Format

SORM Self-Ordinal Resemblance Measure

STSD Space Time Saliency Detection

SP Simple Profile

SR Stochastic Relaxation

SRN Smoothed Residual Norm

ST-MRF Spatio-Temporal Markov Random Field

TP True Positive

TPR True Positive Rate

Chapter 1

Introduction

1.1 Background and Motivation

Visual attention in humans is a set of strategies in the early stages of visual processing

that filter the stream of data collected by the eyes. Visual attention enables the visual system

to parse complex and highly cluttered scenes rapidly. The Human Visual System (HVS)

reduces the complexity of visual scene analysis by automatically shifting the Focus of

Attention (FOA) across the scene [128]. This ability allows the brain to restrict high-level

processing of a scene to a relatively small part at any given time. Regions that draw

attention are called salient and are subject to further processing for high-level perception of the

scene.

In humans, attention mechanisms are driven by two components: 1) observer biases (so-

called top-down attention) that enable high-level perception and 2) visual stimuli (so-called

bottom-up attention) that are the characteristics of the scene itself. Top-down attention

is a cognitive mechanism for maintaining goal-directed behavior. Bottom-up attention, on

the other hand, deals with low-level features of the scene, such as contrast between various

regions and their surroundings.

Recently, the development of visual attention models has attracted much interest in the

computer vision and image processing communities. Although there have been a couple of

attempts to model the cognitive influence in the HVS (top-down attention), most of the

efforts have been devoted to modeling the stimulus-driven component (bottom-up attention),

typically through the development of visual saliency models. Bottom-up attention has long been believed

to be a part of the early stages of vision, via the projection of the visual stimulus along the

features computed in the early visual cortex, and to consist of a center-surround operation.

In general, regions of the field of view that are distinctive compared to their surroundings

attract attention [21]. A potentially more accurate approach to modeling visual attention could

be to combine bottom-up saliency and high-level priors, or even broader perception [127,

107, 76, 24, 69, 47, 141, 61].

Early approaches to computational attention modeled bottom-up saliency using a

center-surround difference operator [73]. Under these deterministic models, if a given region

resembles its surround, then the stimulus would be suppressive, resulting in low saliency, and

if it differs from its surround, the stimulus would be excitatory, leading to high saliency.

Variations on the details of the center-surround computation have given rise to a multitude

of saliency models in the past decade and a half.
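
As a rough illustration of the center-surround idea described above, the following Python sketch scores each pixel by the absolute difference between a small center average and a larger surround average. It is a simplified stand-in, not the operator of [73]: the function names, the box-shaped windows, and the radii are assumptions made for this example, and the surround window here also contains the center.

```python
def box_mean(img, r, c, radius):
    """Mean of img over a (2*radius+1)^2 window centered at (r, c), clipped at the borders."""
    rows, cols = len(img), len(img[0])
    vals = [img[i][j]
            for i in range(max(0, r - radius), min(rows, r + radius + 1))
            for j in range(max(0, c - radius), min(cols, c + radius + 1))]
    return sum(vals) / len(vals)


def center_surround_saliency(img, center_radius=0, surround_radius=2):
    """Per-pixel |center mean - surround mean| (hypothetical helper, box windows assumed)."""
    rows, cols = len(img), len(img[0])
    return [[abs(box_mean(img, r, c, center_radius) - box_mean(img, r, c, surround_radius))
             for c in range(cols)]
            for r in range(rows)]


# Toy input: a dark 7x7 image with a single bright pixel in the middle.
img = [[0.0] * 7 for _ in range(7)]
img[3][3] = 1.0
sal = center_surround_saliency(img)

# A uniform image, where every region resembles its surround.
flat = [[0.5] * 7 for _ in range(7)]
flat_sal = center_surround_saliency(flat)
```

On this toy input the isolated bright pixel receives the largest response, while the uniform image yields a zero response everywhere, matching the excitatory and suppressive cases in the text.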

Another class of saliency models attempts to describe the principles of bottom-up at-

tention in probabilistic terms [15, 16, 17, 145]. This approach is typically inspired by the

cognitive science view in which the brain is considered a probabilistic network [89]. A

common view in the cognitive science literature is that the human brain operates as a universal

compression device [16], where each layer eliminates as much signal redundancy as

possible from its input, while preserving all the information necessary for scene perception.

This principle has led to important developments in signal processing and computer vision

techniques, such as wavelet theory [113], sparse representations [130], and, more recently,

compression-based models of saliency.

The models that rely on probabilistic reasoning can be divided into two groups:

1. Stimulus-based information maximization

2. Signal compressibility

In the first group, saliency is hypothesized to be stimulus self-information [24, 164, 129].

More specifically, visual attention is governed by information maximization, where each

region is assigned self-information [30] with respect to the distribution of certain features

in the neighborhood. If a stimulus has low probability according to the feature distribution

in the surround, this leads to high self-information and subsequently high saliency, whereas

if the stimulus has high probability, self-information is low, leading to low saliency. A

similar idea has been cast in a Bayesian framework in [69], where saliency is related to

the divergence between the prior feature distribution in the surround, and the posterior

distribution computed after observing the features in the center, termed Bayesian surprise.
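
As a worked illustration of these two quantities, the block below writes self-information and Bayesian surprise in a common textbook form. The notation ($p$, $P_{\mathrm{prior}}$, $P_{\mathrm{post}}$, the feature index $f$, and the base-2 logarithm) is assumed here for illustration and is not taken from the cited works.

```latex
\begin{align*}
I(x) &= -\log_2 p(x), \\
p(x) = \tfrac{1}{2} &\;\Rightarrow\; I(x) = 1 \text{ bit}, \qquad
p(x) = \tfrac{1}{64} \;\Rightarrow\; I(x) = 6 \text{ bits}, \\
D_{\mathrm{KL}}\left(P_{\mathrm{post}} \,\|\, P_{\mathrm{prior}}\right)
  &= \sum_{f} P_{\mathrm{post}}(f) \log_2 \frac{P_{\mathrm{post}}(f)}{P_{\mathrm{prior}}(f)} .
\end{align*}
```

A stimulus that is rare under the surround distribution therefore carries more bits of self-information, and the divergence is zero when observing the center leaves the feature distribution unchanged.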

In the second group of probabilistic models, saliency is equated to the reconstruction

error of a compressed representation of the stimulus. Particularly, at each location, the stim-

ulus is compressed, for example via principal component analysis [64, 114, 48], wavelet [140],

or sparse decomposition [65, 98], and the reconstruction error from this compressed rep-

resentation is measured. Large reconstruction error indicates incompressibility, which is

considered as high saliency. On the other hand, easily compressible regions are considered

to have low saliency in this framework. A recent comparative study [22] has shown that

saliency models based on the compression principle tend to have excellent accuracy for the

prediction of eye fixations. In fact, several of these models predict saliency with accuracy

near the probability of agreement among observers. It could thus be claimed that “the

bottom-up saliency modeling problem is solved.”

There are, nevertheless, three main problems with the current state-of-the-art on visual

saliency modeling:

• While it is true that high accuracy has been extensively documented for free viewing

of still images, the same is not true for dynamic stimuli, which have received much less

attention.

• While many implementations of the compression principle for saliency modeling have

been proposed, none has really used a direct measure of compressibility. From a

scientific point of view, this weakens the arguments in support of the principle.

• While many implementations of the “saliency as compression” principle have been

proposed, much less attention has been devoted to implementation complexity.

The last item above is of critical importance for many real-world applications. For exam-

ple, consider automatic monitoring of video quality in intermediate network nodes, where

the uncompressed video is generally not accessible. According to physiological and psy-

chological evidence, the impact of distortion in salient and non-salient areas is not equally

important in terms of perceived quality. For large-scale in-network deployment, video qual-

ity monitoring that considers saliency must be of reasonably low complexity and memory

requirements. As another example, for anomaly detection [110] or background subtrac-

tion [146] in large camera networks, saliency estimation should ideally be performed in the

cameras themselves, so that the system consumes only the power and bandwidth necessary

to transmit video when faced with salient or anomalous events. This, however, requires

highly efficient saliency algorithms.

These observations have motivated us to investigate alternative measures of saliency,

based on compressed-domain features using the data already computed by the video

encoder. In this dissertation, we propose two approaches to extract compressed-domain

features that are highly indicative of saliency in natural video. In the first approach, two

video features are extracted from the compressed video bitstream using motion vectors

(MVs), block coding modes (BCMs), and transformed prediction residuals. In the second

approach, the central idea is that there is no need to define new indirect measures of saliency,

since a direct measure of compressibility, namely the number of bits, is readily available

in the compressed bitstream. In fact, owing to the extensive research on video

compression over the last decades, modern video compression systems have become highly efficient. It

follows that the number of bits produced by a modern video codec is a fairly good measure

of compressibility of the video being processed. Because modern codecs work very hard to

assign bits efficiently to different locations of the visual field, the spatial distribution of bits

can be seen as a saliency measure, which directly implements the compressibility principle.

Under this view, regions that require more bits to compress are more salient, while regions

that require fewer bits are less salient.

1.2 Preview and Contributions

This research is aimed at compressed-domain video processing. The goal is to reduce

computational requirements of two important computer vision tasks - visual saliency mod-

eling and region tracking - by reusing, as much as possible, the data already produced by

the video encoder. As will be seen, however, the focus on compressed-domain information

does not only improve algorithmic efficiency. In the case of saliency modeling, which forms

the larger part of the dissertation, it also leads to higher accuracy. In the following, we give

a preview of the various chapters in the dissertation and summarize the main contributions.

1.2.1 Bottom-Up Saliency Estimation

Many computational models have been introduced during the past 25 years to estimate

visual saliency. An excellent review of the state of the art on pixel-domain saliency estima-

tion is given in [21, 22]. To introduce the main concepts, we briefly review a gold-standard

saliency model, the so-called Itti-Koch-Niebur (IKN) model in Chapter 2. In this model, the

visual saliency is estimated using the center-surround difference mechanism implemented

as the difference between a fine and a coarse resolution for a given feature.

1.2.2 Saliency Model Evaluation Framework

Eye-tracking data is the most typical psychophysical ground truth for both bottom-up

and top-down visual attention models [39]. In this dissertation, we compare models’ per-

formance using sequences from two popular datasets, SFU [56] and DIEM [2]. In addition,

to illuminate various aspects of the models’ performance, a number of different comparison

metrics are used. The advantages and limitations of the existing evaluation metrics are

discussed, and, accordingly, new metrics are introduced that overcome the existing metrics’

shortfalls. The details of the datasets and the evaluation metrics used in this research are

described in [85, 82], and are discussed in Chapter 3.

1.2.3 A Comparison of Compressed-Domain Saliency Models

The overwhelming majority of existing saliency models operate on raw pixels, rather

than compressed images or video. However, a few attempts have been made to use com-

pressed video data, such as MVs, BCMs, motion-compensated prediction residuals, or their

transform coefficients, in saliency modeling. The compressed-domain approach is typically

adopted for efficiency reasons, i.e., to avoid recomputing information already present in the

compressed bitstream. The extracted data is a proxy for many of the features frequently

used in saliency modeling. For example, the motion vector field (MVF) is an approximation

to optical flow, while BCMs and prediction residuals are indicative of motion complexity.

Furthermore, the extraction of these features only requires partial decoding of the com-

pressed video file, while the recovery of the actual pixel values is not necessary. A compara-

tive study of available compressed-domain saliency models is presented in [82] and discussed

in Chapter 4.

1.2.4 Proposed Saliency Estimation Methods

In [83, 84], we described two new video features for saliency modeling, namely Motion

Vector Entropy (MVE) and Smoothed Residual Norm (SRN), both of which can be com-

puted from the compressed video bitstream using MVs, BCMs, and transformed prediction

residuals with partial decoding. The variation of motion and the split size of blocks are

used to generate the MVE feature map, while the energy of prediction residuals is used to

construct the SRN feature map.

We also proposed a simple compressed-domain video feature called the Operational

Block Description Length (OBDL) as a measure of saliency in [86]. The OBDL is the num-

ber of bits required to compress a given block of video data under a distortion criterion.

This saliency measure addresses the three main limitations of the state of the art. First,

it is a direct measure of stimulus compressibility, namely “how many bits it takes to com-

press.” By leveraging decades of research on video compression, this is a far more accurate

measure of compressibility than previous proposals, such as Bayesian surprise or mutual

information. Second, it is equally easy to apply to images and video. For example, it does

not require weighting the contributions of spatial and temporal components, as the video

encoder already uses motion estimation and compensation, and performs rate-distortion op-

timized bit assignment. Finally, because most modern cameras already contain an on-chip

video compressor, it has trivial complexity for most computer vision applications. In fact,

it only requires the very first step of decoding of the compressed bitstream to determine

the number of bits assigned to each region. In Chapter 5, we will show that the three

above-mentioned compressed-domain features are powerful enough to discriminate fixation

points from non-fixation points in natural video.

A simple and effective saliency estimation method for compressed video can be con-

structed using the proposed features. In [83], we described a method called MVE+SRN

to fuse MVE and SRN feature maps into the final saliency map. We have also proposed

an implementation of the OBDL measure in [86], and showed that saliency can be esti-

mated using a simple feature derived from it. However, while video compression systems

produce very effective measures of compressibility, this measure is strictly local, since all

processing is restricted to image blocks. Saliency, on the other hand, has both a local and

global character, e.g. saliency maps are usually smooth. To account for this property we

embed the OBDL features in a Markov Random Field (MRF) model. Our extensive ex-

periments show that the resulting MVE+SRN and OBDL-MRF saliency measures make

accurate predictions of eye fixations in dynamic scenes. Both methods will be described in

Chapter 6.

1.2.5 Compressed-Domain Tracking

In Chapter 7 we describe a method for compressed-domain region tracking, which was

first presented in [81]. While the material in this chapter can stand on its own and is

applicable to various problems outside of saliency modeling, it opens up interesting possi-

bilities in conjunction with compressed-domain saliency estimation, such as salient region

tracking. The proposed tracking framework makes use of only the MVs and BCMs from the

H.264/AVC-compressed video bitstream. This method tracks a single object by computing

the maximum a posteriori (MAP) estimate of a Spatio-Temporal Markov Random Field

(ST-MRF) at each P- or B-frame. In Chapter 7, the details of the framework are presented,

and the accuracy of the proposed method is evaluated through simulation.

1.3 Scholarly Publications

The research efforts during this Ph.D. study have resulted in the following 8 scholarly

publications, comprising 3 journal papers (two accepted, one submitted) and 5 conference

papers (all accepted). Some works that have been completed during this period are not

presented in this dissertation, particularly the work on still visualization of object mo-

tion (conference paper 4). All work has been performed in a reproducible research man-

ner [154]. MATLAB implementation of the proposed methods (including our implementa-

tion of several compressed-domain saliency models from the literature) and the evaluation

data (ground truth data, as well as implementations of various evaluation metrics) used

in this study have been made available online at http://www.sfu.ca/~ibajic/software.html.

1.3.1 Journal papers

1. S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. “Compressed-domain visual saliency

models: A comparative study,” submitted to IEEE Trans. Image Process., 2014.

2. S. H. Khatoonabadi, I. V. Bajic and Y. Shan, “Compressed-domain correlates of

human fixations in dynamic scenes,” accepted for publication in Multimedia Tools and

Applications, Special Issue on Perception Inspired Video Processing, 2015. (Invited)

3. S. H. Khatoonabadi and I. V. Bajic. “Video object tracking in the compressed domain

using spatio-temporal Markov random fields,” IEEE Trans. Image Process.,

22(1):300-313, 2013.

1.3.2 Conference papers

1. S. H. Khatoonabadi, N. Vasconcelos, I. V. Bajic, and Y. Shan. “How many bits does

it take for a stimulus to be salient?,” In Proc. IEEE CVPR’15, pages 5501-5510,

2015.

2. S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. “Compressed-domain correlates of fixations

in video,” In Proc. 1st Intl. Workshop on Perception Inspired Video Processing,

PIVP’14, pages 3-8, 2014.

3. S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. “Comparison of visual saliency models

for compressed video,” In Proc. IEEE ICIP’14, pages 1081-1085, 2014.

4. S. H. Khatoonabadi and I. V. Bajic. “Still visualization of object motion in compressed

video,” In Proc. IEEE ICME’13 Workshop: MMIX, 2013.

5. S. H. Khatoonabadi and I. V. Bajic. “Compressed-domain global motion estimation

based on the normalized direct linear transform algorithm,” In Proc. International

Technical Conference on Circuits/Systems, Computers and Communications

(ITC-CSCC’13), 2013.

Chapter 2

Bottom-Up Saliency Estimation

The Human Visual System (HVS) is able to automatically shift the Focus of Attention

(FOA) to salient regions in the pre-attentive, early vision phase. This ability allows the

brain to restrict high-level processing of a scene to a relatively small part at any given time.

Many computational models have been introduced to imitate the HVS in order to predict

human visual attention. Many models rely on physiological and psychophysical findings [73].

Visual saliency estimation can benefit a large number of applications in image processing

and computer vision, such as quality assessment [158, 44, 102, 123, 93, 40, 106, 165, 138, 32],

compression [66, 157, 50, 99, 52, 53], guiding visual attention [57, 115, 116], retargeting [41,

105], segmentation [121, 45], anomaly detection [110], background subtraction [146], object

recognition [58], object tracking [112], video abstraction [75], error concealment [55], data

hiding [78], and so on.

In this chapter, we review one representative pixel-domain saliency estimation method

known as the Itti-Koch-Niebur (IKN) model. This is one of the most cited saliency models,

regarded as a gold standard in the field. A particular version of this model has recently

been found [117] to be the most accurate among several publicly available saliency models

on a dataset from [56]. We refer readers to [21, 22] for a more comprehensive overview of

pixel-domain saliency estimation.

2.1 The Itti-Koch-Niebur (IKN) Saliency Model

In [73], Itti et al. proposed an architecture for building a bottom-up saliency map of

visual attention for static images. This biologically-plausible architecture was inspired by

Figure 2.1: Architecture of the IKN model from [73].

the Feature Integration Theory of Attention, introduced by Treisman and Gelade [153].

This theory explains how the HVS extracts important features and combines them to find the

FOA. A set of topographic feature maps is first extracted from a scene. Next, salient spatial

locations within each feature map are selected through competition. Multiple image features

are then fused into a single topographical saliency map, known as the Master Saliency Map.

The competition for saliency was inspired by computational principles in the retina,

lateral geniculate nucleus, and primary visual cortex [96], called Center-Surround strategy,

in which the spatial location (the “center”) that stands out relative to its neighborhood

(“surround”) is located. In the standard IKN model [73], center-surround difference was

computed as the difference between fine and coarse resolutions (scales). The center was

defined as a pixel at a fine resolution level, and surround was derived from the corresponding

pixel at a coarser resolution level. A feature map was then determined by the difference

between the maps at the two resolution levels, by interpolating the coarser resolution level

to the finer resolution level and subtracting point by point. The architecture of the IKN

model is depicted in Fig. 2.1, and the process is elaborated below.

2.1.1 Feature Map Extraction

Three groups of image features, namely intensity, color and orientation, are extracted in

the IKN model. Associated with each image feature, a pyramid of the image is constructed

at nine levels, by iteratively low-pass filtering and subsampling by a factor of two. The

original image is at level “0” and the coarsest resolution image is at level “8.” Center-

surround feature maps are defined as the differences between a fine level (center) and a

coarser level (surround).

Let $I(\vartheta)$ be the intensity of an image at the resolution level $\vartheta \in \{0, 1, ..., 8\}$ of the

constructed Gaussian pyramid. A set of six feature maps is extracted corresponding to

intensity channel of the image:

\[ I(c, s) = |I(c) \ominus I(s)|, \tag{2.1} \]

where operator $\ominus$ represents the center-surround difference, $c \in \{2, 3, 4\}$ and $s = c + \delta$

with $\delta \in \{3, 4\}$. The center-surround difference is obtained by interpolating the coarser

resolution level (s) to the finer resolution level (c) through up-sampling followed by pointwise

subtraction. An intensity feature map represents the sensitivity of neurons to intensity

contrast (bright center surrounded by a dark neighborhood, or dark center surrounded by

a bright neighborhood) that conforms to the functionality of the early HVS [96].
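
The across-scale difference described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IKN implementation: 2x2 block averaging stands in for the low-pass filtering and subsampling, nearest-neighbour interpolation stands in for the up-sampling, and all function names are hypothetical. The input is assumed to be a square image whose side is a multiple of 256, so that all nine pyramid levels exist.

```python
def downsample(img):
    """Halve the resolution by 2x2 block averaging (stand-in for low-pass + subsampling)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels=9):
    """Return [level 0, ..., level 8]; level 0 is the input image."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def upsample_to(img, h, w):
    """Nearest-neighbour interpolation of img to h x w pixels (an assumed interpolator)."""
    src_h, src_w = len(img), len(img[0])
    return [[img[y * src_h // h][x * src_w // w] for x in range(w)] for y in range(h)]


def center_surround(pyramid, c, s):
    """Interpolate level s to the size of level c, then take the pointwise absolute difference."""
    center = pyramid[c]
    h, w = len(center), len(center[0])
    surround = upsample_to(pyramid[s], h, w)
    return [[abs(center[y][x] - surround[y][x]) for x in range(w)] for y in range(h)]


def intensity_feature_maps(img):
    """Six maps, one per (c, s) pair with c in {2, 3, 4} and s = c + delta, delta in {3, 4}."""
    pyramid = build_pyramid(img)
    return {(c, c + d): center_surround(pyramid, c, c + d)
            for c in (2, 3, 4) for d in (3, 4)}


# A 256x256 test image: bright square on a dark background.
size = 256
image = [[1.0 if 96 <= y < 160 and 96 <= x < 160 else 0.0 for x in range(size)]
         for y in range(size)]
maps = intensity_feature_maps(image)
print(sorted(maps))  # the six (c, s) pairs
```

Each map has the resolution of its center level, so the maps for c = 2 are 64 x 64 and those for c = 4 are 16 x 16 for this input. A uniform image yields all-zero maps, since center and surround then agree everywhere.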

Similar to intensity feature maps, a set of color feature maps is also extracted. First,

tuned color components of red ($R$), green ($G$), blue ($B$), and yellow ($Y$) are defined as

\[ R = \max\left(0,\; r - \frac{g + b}{2}\right), \tag{2.2} \]

\[ G = \max\left(0,\; g - \frac{r + b}{2}\right), \tag{2.3} \]

\[ B = \max\left(0,\; b - \frac{g + r}{2}\right), \tag{2.4} \]

\[ Y = \max\left(0,\; \frac{r + g}{2} - \frac{|r - g|}{2} - b\right), \tag{2.5} \]

where $r$, $g$ and $b$ are the red, green, and blue components of the image, respectively.
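
The tuned color components of Eqs. (2.2)-(2.5) can be checked on single pixels with a short Python sketch. The function name `tuned_colors` and the lowercase arguments `r`, `g`, `b` (the raw red, green, and blue values of a pixel, assumed to lie in [0, 1]) are illustrative choices.

```python
def tuned_colors(r, g, b):
    """Return the tuned (R, G, B, Y) components of one pixel; negative responses are clipped to 0."""
    R = max(0.0, r - (g + b) / 2.0)
    G = max(0.0, g - (r + b) / 2.0)
    B = max(0.0, b - (g + r) / 2.0)
    Y = max(0.0, (r + g) / 2.0 - abs(r - g) / 2.0 - b)
    return R, G, B, Y


print(tuned_colors(1.0, 0.0, 0.0))  # pure red
print(tuned_colors(1.0, 1.0, 0.0))  # pure yellow
print(tuned_colors(0.5, 0.5, 0.5))  # gray
```

Pure red excites only the red channel, pure yellow gives a full yellow response with half-strength red and green responses, and any gray pixel yields zero in all four channels.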

The definition of color feature maps follows the Color Double-Opponent model [38],

which states that neurons in the center of their receptive fields are activated by one color

(e.g., yellow) and inhibited by its opponent color (e.g., blue), or vice versa. The pairs of

such opponent colors in HVS are (red, green) and (blue, yellow). Consequently, two color

components are extracted based on these opponent colors:

\[ RG(c, s) = |(R(c) - G(c)) \ominus (G(s) - R(s))|, \tag{2.6} \]

\[ BY(c, s) = |(B(c) - Y(c)) \ominus (Y(s) - B(s))|, \tag{2.7} \]

where $c$ and $s$ are defined as before. Given $c \in \{2, 3, 4\}$ and $s = c + \delta$ with $\delta \in \{3, 4\}$, a set

of six feature maps is extracted for each color feature. Accordingly, 12 maps are defined for

color features.

A set of orientation feature maps is also specified to account for the receptive field

sensitivity profile of orientation-selective neurons in the HVS [96]. These maps are created

based on the local orientation contrast between the center and surround:

\[ O_\theta(c, s) = |O_\theta(c) \ominus O_\theta(s)|, \tag{2.8} \]

where $O_\theta(\vartheta)$ is the Gabor orientation, obtained by Gabor filter [34], at the resolution level

$\vartheta \in \{0, 1, ..., 8\}$ with the orientation of $\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$. Considering the number

of possible orientations and resolution levels, 24 maps are derived for orientation features.

Together with 6 intensity feature maps and 12 color feature maps, a total of 6+12+24 = 42

feature maps are extracted from the image.
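
The count follows directly from the index sets given above, as the short tally below shows (three center levels, two values of $\delta$, two opponent-color features, and four orientations):

```latex
\begin{align*}
\text{center-surround pairs } (c, s) &: \; 3 \times 2 = 6, \\
\text{intensity maps} &: \; 1 \times 6 = 6, \\
\text{color maps} &: \; 2 \times 6 = 12, \\
\text{orientation maps} &: \; 4 \times 6 = 24, \\
\text{total} &: \; 6 + 12 + 24 = 42.
\end{align*}
```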

2.1.2 The Master Saliency Map

The master saliency map, or briefly the saliency map, is used frequently in the literature

for quantifying the saliency at every location within an image. It is a gray scale image in

which the brighter pixels represent the more salient locations. In the IKN model, all 42

extracted feature maps are combined together to create the saliency map. In the following,

we review two methods used in various versions of the IKN model for combining feature

maps.

MaxNorm normalization

In the standard IKN model [73], all feature maps are first normalized such that the

values of each map range from 0 (dark) to m (bright). All local maxima, except the global

Figure 2.2: An example of MaxNorm normalization of a feature from [73].

maximum value m, are then identified within each map, and their sample mean, µ, is

computed. Each map is individually normalized by multiplying its values by $(m - \mu)^2$ to

intensify strong peaks that stand out from other peaks in the map, or to suppress the

map globally if there is no distinctive peak compared to the average of local maxima.

This normalization procedure is called MaxNorm normalization and is denoted by N (·).

The MaxNorm normalization attempts to account for the neuro-biological principle that

neighboring similar features inhibit each other in HVS [25]. An example of MaxNorm

normalization is illustrated in Fig. 2.2.
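
The MaxNorm steps described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the text does not specify how local maxima are detected, so the sketch assumes a strict maximum over the 8-neighbourhood, and it assumes that the mean is taken as 0 when no local maximum other than the global one exists. The function name `max_norm` is hypothetical.

```python
def max_norm(fmap, m=1.0):
    """Scale fmap to [0, m], then multiply by (m - mu)^2, mu = mean of the other local maxima."""
    lo = min(min(row) for row in fmap)
    hi = max(max(row) for row in fmap)
    if hi == lo:  # flat map: nothing stands out
        return [[0.0 for _ in row] for row in fmap]
    scaled = [[m * (v - lo) / (hi - lo) for v in row] for row in fmap]
    h, w = len(scaled), len(scaled[0])

    peaks = []
    for y in range(h):
        for x in range(w):
            v = scaled[y][x]
            neighbours = [scaled[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))
                          if (j, i) != (y, x)]
            # Assumed definition: strict 8-neighbourhood maximum, excluding the global maximum m.
            if v < m and all(v > n for n in neighbours):
                peaks.append(v)

    mu = sum(peaks) / len(peaks) if peaks else 0.0  # assumed convention when no other peak exists
    weight = (m - mu) ** 2
    return [[v * weight for v in row] for row in scaled]


# One isolated peak: the map is kept at full strength.
single = [[0.0] * 5 for _ in range(5)]
single[2][2] = 1.0

# Two peaks of similar height: the whole map is strongly suppressed.
double = [[0.0] * 5 for _ in range(5)]
double[1][1] = 1.0
double[3][3] = 0.8

print(max_norm(single)[2][2], max_norm(double)[1][1])
```

With the second map, the only other local maximum is 0.8, so every value is multiplied by (1 - 0.8)^2, which is about 0.04, while the single-peak map is left unchanged.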

After MaxNorm normalization, all feature maps within each group of image features are

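combined, resulting in three separate “conspicuity” maps: $\bar{I}$ for intensity, $\bar{C}$ for color, and

$\bar{O}$ for orientation:

\[ \bar{I} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} \mathcal{N}(I(c, s)), \tag{2.9} \]

\[ \bar{C} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} \left[ \mathcal{N}(RG(c, s)) + \mathcal{N}(BY(c, s)) \right], \tag{2.10} \]

\[ \bar{O} = \sum_{\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}} \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} \mathcal{N}(O_\theta(c, s)). \tag{2.11} \]

Combining feature maps is accomplished by subsampling to the resolution level four (the

coarsest resolution level of centers) and pointwise summation, denoted by $\oplus$. Eventually,

all conspicuity maps are normalized once more by the MaxNorm operation and summed to

form the master saliency map:

\[ S = \frac{1}{3} \sum_{K \in \{\bar{I}, \bar{C}, \bar{O}\}} \mathcal{N}(K). \tag{2.12} \]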

Biologically, features within an individual conspicuity map compete for saliency, whereas features

in different conspicuity maps support each other. Hence, all feature maps in the same

class are first combined and normalized, and then resulting feature maps of different classes

are integrated to create the master saliency map.

FancyNorm Normalization

Itti and Koch in [72] proposed an alternative biologically-plausible normalization. The

method is called Iterative Localized Normalization, also known as FancyNorm in the litera-

ture, and has been shown to offer high accuracy in terms of gaze prediction [117]. MaxNorm

relies on the global maximum, while FancyNorm is based on local computations, which is

consistent with the local connectivity of cortical neurons [72]. For this reason, FancyNorm

might be considered more biologically-plausible than MaxNorm.

In FancyNorm, all feature maps are first normalized to range [0, 1]. Then, each feature

map is convolved with the difference of Gaussians (DoG) filter given by

\[ D(x, y) = \frac{c_{exc}^2}{2\pi\sigma_{exc}^2} \, e^{-(x^2 + y^2)/(2\sigma_{exc}^2)} - \frac{c_{inh}^2}{2\pi\sigma_{inh}^2} \, e^{-(x^2 + y^2)/(2\sigma_{inh}^2)}, \tag{2.13} \]

where $c_{exc}^2$ and $c_{inh}^2$ represent, respectively, the impact of local excitation and inhibition from

neighboring locations. Parameters $\sigma_{exc}$ and $\sigma_{inh}$ are, respectively, the widths (standard deviations) of the

excitation and inhibition Gaussians.

A given feature map K can be iteratively normalized by the DoG filter as

\[ K = |K + K * D - C_{inh}|_{\geq 0}, \tag{2.14} \]

where operator $*$ denotes convolution, and operator $|\cdot|_{\geq 0}$ means setting negative values

to zero. $C_{inh}$ is a constant inhibitory term which discards non-salient regions of uniform

textures.
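
The iteration can be sketched in plain Python as below. This is a sketch under stated assumptions rather than the implementation of [72]: the DoG kernel is written in the standard normalized-Gaussian form with exponent $(x^2 + y^2)/(2\sigma^2)$, it is truncated to a finite radius, the convolution is zero-padded, and all parameter values (`c_exc`, `sigma_exc`, `c_inh`, `sigma_inh`, `c_inh_const`) are illustrative choices that do not come from the text.

```python
import math


def dog_kernel(radius, c_exc, sigma_exc, c_inh, sigma_inh):
    """Truncated difference-of-Gaussians kernel of size (2 * radius + 1)^2."""
    def gauss(c, sigma, x, y):
        return c * c / (2 * math.pi * sigma * sigma) * math.exp(-(x * x + y * y) / (2 * sigma * sigma))
    return [[gauss(c_exc, sigma_exc, x, y) - gauss(c_inh, sigma_inh, x, y)
             for x in range(-radius, radius + 1)]
            for y in range(-radius, radius + 1)]


def convolve(fmap, kernel):
    """Zero-padded 2-D convolution that keeps the size of fmap."""
    h, w, r = len(fmap), len(fmap[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y - dy, x - dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + r][dx + r] * fmap[yy][xx]
            out[y][x] = acc
    return out


def fancy_norm(fmap, iterations, kernel, c_inh_const=0.02):
    """Scale fmap to [0, 1], then repeat K = max(0, K + K * D - C_inh) `iterations` times."""
    lo = min(min(row) for row in fmap)
    hi = max(max(row) for row in fmap)
    span = (hi - lo) or 1.0
    k = [[(v - lo) / span for v in row] for row in fmap]
    for _ in range(iterations):
        conv = convolve(k, kernel)
        k = [[max(0.0, k[y][x] + conv[y][x] - c_inh_const) for x in range(len(k[0]))]
             for y in range(len(k))]
    return k


# Illustrative parameters (assumptions): narrow excitation, broad inhibition.
kernel = dog_kernel(radius=8, c_exc=0.5, sigma_exc=1.0, c_inh=1.5, sigma_inh=5.0)
fmap = [[0.0] * 12 for _ in range(12)]
fmap[3][3], fmap[8][8], fmap[8][3] = 1.0, 0.4, 0.4
result = fancy_norm(fmap, iterations=3, kernel=kernel)
```

Calling `fancy_norm` with `iterations=1` corresponds to what the text below calls FancyOne, `iterations=2` to FancyTwo, and so on.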

Fig. 2.3 shows the result of applying FancyNorm procedure on two feature maps: one

containing a strong activation peak surrounded by numerous weaker activation peaks, and

the other containing numerous strong activation peaks. In the former, shown at the top, the

stronger peak at iteration 0 becomes excessively dominant after a few iterations, while

in the latter, no peak stands out after a number of iterations. If this function is applied in

a single iteration, the resulting normalization is called FancyOne; if two iterations are used,

the resulting normalization is called FancyTwo, and so on.

[Figure 2.3 panels, columns: Feature Map, Iteration 0, Iteration 4, Iteration 8, Iteration 12; rows: (a), (b)]

Figure 2.3: FancyNorm of two feature maps, from [72]: (a) contains a strong activation peak and several weaker activation peaks, (b) contains various strong activation peaks.

2.1.3 Focus of Attention

The predicted Focus of Attention (FOA) is the maximum of the master saliency map.

To determine how the FOA jumps from one salient location to the next for a given saliency map,

the IKN model uses a biologically-plausible, 2-D winner-take-all neural network. In this

neural network, each neuron is associated with a saliency map pixel, and is described by a

capacitance in which the potential of more salient locations increases faster. The capacitance

integrates excitatory inputs from the saliency map until one of them (the capacitance of

the winner neuron) first reaches the voltage threshold and fires. At that time, the FOA is

directed to the location of the winner, the capacitances of all neurons are reset, and the area

around the winner location (in the IKN model it is a disk with a radius determined by the

size of the input image) is transiently deactivated (inhibited). The neurons again integrate

the charge until the next FOA is identified and the process repeats. Fig. 2.4 shows an

example.
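
The resulting scan path can be approximated with a much simpler procedure than the neural network described above. The sketch below is a deliberate simplification: it replaces the integrate-and-fire dynamics with a direct search for the maximum, and it makes the inhibition permanent rather than transient. The function name `scan_foa` and the `radius` parameter are hypothetical.

```python
def scan_foa(saliency, n_fixations, radius):
    """Return successive FOA locations (row, col): pick the maximum, suppress a disk, repeat."""
    s = [row[:] for row in saliency]  # work on a copy
    h, w = len(s), len(s[0])
    path = []
    for _ in range(n_fixations):
        # Winner-take-all, simplified to an argmax over the current map.
        y0, x0 = max(((y, x) for y in range(h) for x in range(w)),
                     key=lambda p: s[p[0]][p[1]])
        path.append((y0, x0))
        # Inhibition of return, simplified to zeroing a disk around the winner.
        for y in range(h):
            for x in range(w):
                if (y - y0) ** 2 + (x - x0) ** 2 <= radius ** 2:
                    s[y][x] = 0.0
    return path


saliency = [[0.0] * 8 for _ in range(8)]
saliency[1][1] = 0.9   # most salient location
saliency[1][2] = 0.8   # close neighbour, falls inside the inhibited disk
saliency[6][6] = 0.5   # second attended location
print(scan_foa(saliency, n_fixations=2, radius=2))
```

Because the location next to the first winner lies inside the inhibited disk, the second fixation moves to the more distant, weaker peak instead.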

The inhibition-of-return in the IKN model is such that the FOA is inhibited for ap-

proximately 500-900 ms, similar to HVS characteristics [133]. In addition, to correlate with

HVS, the voltage threshold of the capacitance should be chosen such that the time interval

for jumping between two FOAs is approximately 30-70 ms [133].

2.2 Spatio-Temporal Saliency Estimation

In [71, 66], Itti et al. further extended the IKN model to address spatio-temporal saliency

estimation. Two new groups of features, namely motion and flicker contrasts, were added to

the basic IKN model [73]. Motion features were computed for different orientations, similar

to features defined for orientation in static images. To do this, pyramid images of two

successive frames were spatially-shifted orthogonal to the Gabor orientation and subtracted

based on the Reichardt model [134]:

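\[ M_\theta^n(\vartheta) = \left| O_\theta^n(\vartheta) \odot \Upsilon_\theta^{n-1}(\vartheta) - O_\theta^{n-1}(\vartheta) \odot \Upsilon_\theta^n(\vartheta) \right|. \tag{2.15} \]

In the above equation, the symbol $\odot$ denotes pointwise multiplication. $M_\theta^n(\vartheta)$ is the

motion feature of frame $n$ at the resolution level $\vartheta \in \{0, 1, ..., 8\}$ and orientation

$\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$. $\Upsilon_\theta^n(\vartheta)$ is obtained by shifting the pyramid image one pixel orthogonal

to the Gabor orientation $O_\theta^n(\vartheta)$. Note that one pixel shift at the coarsest level, $\vartheta = 8$,

corresponds to $2^8 = 256$ pixels shift at the finest level, so a wide range of velocities can be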

handled.
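
The structure of this motion feature can be illustrated on a toy example. The sketch below assumes that the orientation maps of two successive frames are already available as 2-D lists (the Gabor filtering is skipped), and it uses a one-pixel shift to the right with zero fill as the shift for a single orientation. The names `shift_right` and `reichardt_motion` are hypothetical.

```python
def shift_right(img):
    """Shift every row one pixel to the right, filling the vacated column with zeros."""
    return [[0.0] + row[:-1] for row in img]


def reichardt_motion(o_cur, o_prev, shift=shift_right):
    """|O^n . U^(n-1) - O^(n-1) . U^n| with pointwise products, where U is the shifted map."""
    u_cur, u_prev = shift(o_cur), shift(o_prev)
    return [[abs(o_cur[y][x] * u_prev[y][x] - o_prev[y][x] * u_cur[y][x])
             for x in range(len(o_cur[0]))]
            for y in range(len(o_cur))]


# A single active pixel that moves one position to the right between frames.
prev_frame = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]]
cur_frame = [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]

print(reichardt_motion(cur_frame, prev_frame))   # response at the new position
print(reichardt_motion(cur_frame, cur_frame))    # static scene: no response
```

The detector responds where the current map coincides with the shifted previous map, that is, where the pattern has moved by the shift amount, and it gives no response for a static scene.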

The flicker for frame n was computed as the absolute difference between the intensity

of the current frame and that of the previous frame:

\[ F^n(\vartheta) = \left| I^n(\vartheta) - I^{n-1}(\vartheta) \right|. \tag{2.16} \]

Again, having six different pairs of (c, s) and four possible values for θ, 24 feature maps for

motion and 6 feature maps for flicker were extracted as

\[ M_\theta^n(c, s) = |M_\theta^n(c) - M_\theta^n(s)|, \tag{2.17} \]

\[ F^n(c, s) = |F^n(c) - F^n(s)|. \tag{2.18} \]

From the extracted features, two new conspicuity maps were respectively defined for

motion and flicker as

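\[ \bar{M} = \sum_{\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}} \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} \mathcal{N}(M_\theta(c, s)), \tag{2.19} \]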

Figure 2.4: From [73]: the processing steps of the IKN model, demonstrated by an example. Three conspicuity maps of color (C), intensity (I) and orientation (O) are derived from image features and integrated to create the master saliency map (S). The FOA is first directed to the most salient location, specified by an arrow in 92 ms simulated time, after which the next FOAs are successively determined considering inhibition-of-return feedback.

\[ \bar{F} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} \mathcal{N}(F(c, s)). \tag{2.20} \]

Finally, the master saliency map in this model was computed as the summation of 5

conspicuity maps, containing 72 feature maps in total: 6 for intensity, 12 for color, 24 for

orientation, 24 for motion and 6 for flicker.

\[ S = \frac{1}{5} \sum_{K \in \{\bar{I}, \bar{C}, \bar{O}, \bar{M}, \bar{F}\}} \mathcal{N}(K). \tag{2.21} \]

Chapter 3

Saliency Model Evaluation Framework

In order to quantify research progress on a particular problem, one needs to be able to

compare various solutions in a common framework. In the case of visual saliency model

comparison, we have to make decisions about ground-truth datasets, evaluation metrics,

and the comparison methodology.

3.1 Eye-Tracking Video Datasets

Eye-tracking data is the most typical psychophysical ground truth for visual saliency

models [39]. To evaluate saliency models, each model’s saliency map is compared with

recorded gaze locations of the subjects. Two recent publicly available eye-tracking datasets

were used in this study. The reader is referred to [160] for an overview of other existing

datasets in the field.

3.1.1 The SFU Dataset

The Simon Fraser University (SFU) eye-tracking dataset [56, 3] consists of twelve CIF

(Common Intermediate Format, 352 × 288) sequences that have become popular in the

video compression and communications community: Bus, City, Crew, Foreman, Flower

Garden, Hall Monitor, Harbour, Mobile Calendar, Mother and Daughter, Soccer, Stefan,

and Tempete. A total of 15 participants watched all 12 videos while wearing a Locarna Pt-

mini head-mounted eye tracker [6]. Each participant took part in the test twice, resulting

in two sets of viewings per participant for each video. The first viewing is used as ground

truth for evaluating the performance of saliency models, whereas the data from the second

[Figure 3.1 panels: Bus, City, Crew, Foreman, Garden, Hall, Harbour, Mobile, Mother, Soccer, Stefan, Tempete]

Figure 3.1: Sample gaze visualization from the SFU Dataset. The gaze points from the first viewing are indicated as white squares, those from the second viewing as black squares.

viewing is used to construct benchmark models, as described in Section 3.1.3. The results

in [56] showed that gaze locations in the first and second viewings can differ notably; however,

they remain relatively close to each other when there is a single dominant salient region

in the scene (for example, the face in the Foreman sequence). As a result, it is reasonable

to expect that good saliency models will produce high scores for those frames where the

first and second viewing data agree. A sample frame from each video is shown in Fig. 3.1,

overlaid with the gaze locations from both viewings. The visualization is such that the less-

attended regions (according to the first viewing) are indicated by darker colors. Further

details about this dataset are shown in Table 3.1.

3.1.2 The DIEM Dataset

The Dynamic Images and Eye Movements (DIEM) project [2] provides tools and data to

study how people look at dynamic scenes. So far, DIEM has collected gaze data for 85 sequences

of 30 fps videos varying in the number of frames and resolution, using the SR Research

Eyelink 1000 eye tracker [8]. The videos were taken from various categories including movie

Table 3.1: Datasets used in our study to evaluate different visual saliency models.

Dataset              SFU            DIEM
Year                 2012           2011
Sequences            12             85
Display Resolution   704 × 576∗     varying
Format               RAW            MPEG-4
Frames per Second    30             30
Frames               90-300         888-3401
Participants         15             35-53§
Viewings             2†             2‡
Screen Resolution    1280 × 1024    1600 × 1200
Screen Diagonal      19"            21.3"
Viewing Distance     80 cm          90 cm

∗ The original video resolution (352 × 288) was doubled during the presentation to the participants.
§ A total of 250 subjects participated in the study, but not all of them viewed each video; the number of viewers per video was 35-53.
† Each participant watched each sequence twice, after several minutes.
‡ Viewings for the left/right eye are available.

trailers, music videos, documentary, news and advertisements. For the purpose of the study,

the frames of the sequences from the DIEM dataset were resized to a height of 288 pixels while

preserving the original aspect ratio, resulting in five different resolutions: 352×288, 384×288,

512 × 288, 640 × 288 and 672 × 288. Among 85 available videos, 20 sequences similar to

those used in [22] were chosen for the study, and only the first 300 frames were used in

the comparison to match the length of the SFU sequences. In the DIEM dataset, the gaze

locations of both eyes are available. The gaze locations of the right eye were used as ground

truth in the study, while gaze locations of the left eye were used to construct benchmark

models, as described in Section 3.1.3. Clearly, the gaze points of the two eyes are very close

to each other, closer than the gaze points of the first and second viewing in the SFU dataset.

A sample frame from each selected sequence, overlaid with gaze locations of both eyes, is

illustrated in Fig. 3.2. The visualization is such that the less-attended regions (according

to the right eye) are indicated by darker colors.

3.1.3 Benchmark Models

In addition to the computational saliency models, we consider two additional models:

Intra-Observer (IO) and Gaussian center-bias (GAUSS). The IO saliency map is obtained

by the convolution of a 2-D Gaussian blob (with standard deviation of 1° of visual angle)

Figure 3.2: Sample gaze visualization from the DIEM Dataset. The gaze points of the right eye are shown as white squares, those of the left eye as black squares. (Panels, left to right and top to bottom: abb, abl, ai, aic, ail, blicb, bws, ds, hp6t, mg, mtnin, ntbr, nim, os, pas, pnb, ss, swff, tucf, ufci.)

Figure 3.3: The heatmap visualization of gaze points combined across all frames and all observers, for the first viewing in the SFU dataset and the right eye in the DIEM dataset. Gaze points accumulate near the center of the frame. (Panels: SFU, DIEM.)

with the second set of gaze points of the same observer within the dataset. Recall that both

datasets have two sets of gaze points for each sequence and each observer – first/second

viewing in the SFU dataset, right/left eye in the DIEM dataset. So the IO saliency maps

for the sequences in the SFU dataset are obtained using the gaze points from the second

viewing, while IO saliency maps for the sequences from the DIEM dataset are obtained using

the gaze points of the left eye. These IO saliency maps can be considered as indicators of

the best possible performance of a visual saliency model, especially in the DIEM dataset

where the right and left eye gaze points are always close to each other.

On the other hand, the GAUSS saliency map is simply a 2-D Gaussian blob with a standard

deviation of 1° located at the center of the frame. This model assumes that the center

of the frame is the most salient point. Center bias turns out to be surprisingly powerful

and has been used occasionally to boost the performance of saliency models without taking

scene content into account. The underlying assumption is that the person recording the

image or video will attempt to keep the salient objects at or near the center of the frame.

Fig. 3.3 shows the heatmaps indicating cumulative gaze point locations across all sequences

and all participants in the SFU dataset (first viewing) and DIEM dataset (right eye). As

seen in the figure, aggregate gaze point locations do indeed cluster around the center of the

frame. However, since GAUSS does not take content into account, one could expect a good

saliency model to outperform it.
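As a rough illustration of the two benchmark maps, the sketch below builds both from one helper. The function names, the tiny frame sizes, and `sigma_px` (the pixel equivalent of the Gaussian standard deviation, which depends on the screen size and viewing distance) are assumptions made for this example only. Convolving a map of gaze impulses with a Gaussian is equivalent, up to a constant factor, to summing one Gaussian blob per gaze point, which is what the helper does.

```python
import math

def gaussian_blob_map(width, height, points, sigma_px):
    """Sum of isotropic 2-D Gaussian blobs centered at the given (x, y) points.

    The normalization constant of the Gaussian is omitted (assumption), so a
    single blob has a peak value of 1.
    """
    saliency = [[0.0] * width for _ in range(height)]
    for (px, py) in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                saliency[y][x] += math.exp(-d2 / (2.0 * sigma_px ** 2))
    return saliency

def io_map(width, height, second_set_gaze, sigma_px):
    # IO: blobs at the second set of gaze points (second viewing or left eye)
    return gaussian_blob_map(width, height, second_set_gaze, sigma_px)

def gauss_map(width, height, sigma_px):
    # GAUSS: a single blob at the frame center, independent of content
    center = ((width - 1) / 2.0, (height - 1) / 2.0)
    return gaussian_blob_map(width, height, [center], sigma_px)
```

For example, `gauss_map(5, 5, 1.0)` peaks at the middle pixel, whereas `io_map(5, 5, [(1, 1)], 1.0)` peaks at the supplied gaze point.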


3.2 Accuracy Evaluation

A number of methods have been used to evaluate the accuracy of visual saliency models

with respect to gaze point data [21, 22, 37, 67, 68, 95]. Since each method emphasizes a

particular aspect of a model’s performance, a collection of methods and metrics is employed

in this study to make the evaluation balanced. A model that achieves high scores across many

metrics can be considered to be accurate.

3.2.1 Area Under Curve (AUC)

The Area Under Curve (AUC) or, more precisely, the area under Receiver Operating

Characteristic (ROC) curve, is computed from the graph of the true positive rate (TPR)

versus the false positive rate (FPR) at various threshold parameters [148]. In the context

of saliency maps, the saliency values are first divided into positive and negative sets corre-

sponding to gaze and non-gaze points. Then for any given threshold, TPR and FPR are,

respectively, obtained as the fraction of elements in the positive set and in the negative

set that are greater than the threshold. Essentially, by varying the threshold, the ROC

curve of TPR versus FPR is generated, visualizing the performance of a saliency model

across all possible thresholds. The area under this curve quantifies the performance and

shows how well the saliency map can predict gaze points. A larger AUC implies a greater

correspondence between gaze locations and saliency predictions, while a small AUC indicates

weaker correspondence. The AUC is in the range [0, 1]: values near 1 indicate that the

saliency algorithm performs well, a value of 0.5 represents pure chance performance, and

a value below 0.5 represents worse than chance performance. This metric is

also invariant to monotonic scaling of saliency maps [23].

It is worth mentioning that instead of using all non-gaze saliency values, these are

usually sampled [139, 37]. The idea behind this approach is that an effective saliency model

would have higher values at fixation points than at randomly sampled points. Control

points for non-gaze saliency values are obtained with the help of a nonparametric bootstrap

technique [36], and sampled with replacement, with sample size equal to the number of

gaze points, from non-gaze parts of the frame, multiple times. Finally, the average of the

statistic over all bootstrap subsamples is taken as a sample mean.
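The sampled AUC described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study: the function names, the number of bootstrap rounds, and the seed are assumptions. The AUC itself is computed through its well-known equivalent form, the probability that a randomly chosen gaze value exceeds a randomly chosen control value (ties counted as one half).

```python
import random

def auc(positives, negatives):
    """Area under the ROC curve via pairwise comparison (ties count as 1/2)."""
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def bootstrap_auc(saliency, gaze_points, n_rounds=100, seed=0):
    """Average AUC over repeated draws of non-gaze control points."""
    rng = random.Random(seed)
    height, width = len(saliency), len(saliency[0])
    gaze = set(gaze_points)
    non_gaze = [(x, y) for y in range(height) for x in range(width)
                if (x, y) not in gaze]
    positives = [saliency[y][x] for (x, y) in gaze_points]
    total = 0.0
    for _ in range(n_rounds):
        # Sample with replacement; sample size equals the number of gaze points
        controls = [rng.choice(non_gaze) for _ in gaze_points]
        negatives = [saliency[y][x] for (x, y) in controls]
        total += auc(positives, negatives)
    return total / n_rounds
```

Because only the ordering of saliency values enters the comparison, this form also makes the invariance to monotonic scaling easy to see.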


3.2.2 Kullback-Leibler Divergence (KLD) and J-Divergence (JD)

The Kullback-Leibler Divergence (KLD) is often used to obtain the divergence between

two probability distributions. It is given by the relative entropy of one distribution with

respect to another [92]

$$ KLD(P \| Q) = \sum_{i=1}^{r} P(i) \cdot \log_b \left( \frac{P(i)}{Q(i)} \right), \tag{3.1} $$

where P and Q are discrete probability distributions, b is the logarithmic base, and r

indicates the number of bins in each distribution. Note that KLD is asymmetric. The

symmetric version of KLD, also called J-Divergence (JD), is [74]

$$ JD(P \| Q) = KLD(P \| Q) + KLD(Q \| P). \tag{3.2} $$

To assess how accurately a saliency model predicts gaze locations based on JD, the

distribution of saliency values at the gaze locations is compared against the distribution

of saliency values at some random points from non-gaze locations [67, 69, 68]. If these

two distributions overlap substantially, i.e., if JD approaches zero, then the saliency model

predicts gaze points no better than a random guess. On the other hand, as one distribution

diverges from the other and JD increases, the saliency model is better able to predict gaze

points.

Specifically, let there be n gaze points in a frame. Another n points different from the

gaze points are randomly selected from the frame. The saliency values at the gaze points and

the randomly selected points constitute the two distributions, P and Q. A good saliency

model would produce a large JD, because saliency values at gaze points would be large,

while saliency values at non-gaze points would be small. The process of choosing random

samples and computing the JD is usually repeated many times and the resulting JD values

are averaged to minimize the effect of random variations. While JD has certain advantages

over KLD (see [68, 21] for details), it also shares several problems. One of the problems

with both KLD and JD is the lack of an upper bound [91]. Another problem is that if P (i)

or Q(i) is zero for some i, one of the terms in (3.2) is undefined. For these reasons, KLD

and JD were not used in the present study.
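The two problems mentioned above, asymmetry of KLD and the lack of an upper bound, can be seen in a small numerical sketch of (3.1) and (3.2). The function names and the handling of zero bins are assumptions made for illustration: terms with P(i) = 0 are taken to contribute zero, and a zero in Q(i) where P(i) > 0 is reported as infinity.

```python
import math

def kld(p, q, base=2.0):
    """KLD(P||Q) following (3.1) for two histograms of equal length."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention assumed here: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # Q has no mass where P does: no finite value
        total += pi * math.log(pi / qi, base)
    return total

def jd(p, q, base=2.0):
    """J-Divergence following (3.2): the symmetrized KLD."""
    return kld(p, q, base) + kld(q, p, base)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kld(p, q), kld(q, p))        # two different values: KLD is asymmetric
print(jd(p, q) == jd(q, p))        # JD is symmetric by construction
print(jd([1.0, 0.0], [0.5, 0.5]))  # an empty bin makes JD unbounded
```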


3.2.3 Jensen-Shannon Divergence (JSD)

The Jensen-Shannon divergence (JSD) is a KLD-based metric that avoids some of the

problems faced by KLD and JD [100]. For two probability distributions P and Q, JSD is

defined as [33]:

$$ JSD(P \| Q) = \frac{KLD(P \| R) + KLD(Q \| R)}{2}, \tag{3.3} $$

where

$$ R = \frac{P + Q}{2}. \tag{3.4} $$

Unlike KLD, JSD is symmetric in P and Q, its square root is a proper metric, and it is bounded in [0, 1] if

the logarithmic base is set to b = 2 [100]. The value of the JSD for the saliency map that

perfectly predicts gaze points will be equal to 1. The same sampling strategy employed in

AUC and KLD/JD computation can also be used for computing JSD.
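A short sketch of (3.3) and (3.4) with b = 2 shows the bounds directly. The function names are illustrative assumptions. Since R(i) > 0 whenever P(i) > 0 or Q(i) > 0, no term is undefined, in contrast to KLD and JD.

```python
import math

def kld(p, q, base=2.0):
    # Terms with P(i) = 0 are skipped (assumed convention: 0 * log 0 = 0)
    return sum(pi * math.log(pi / qi, base)
               for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence following (3.3) and (3.4)."""
    r = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return (kld(p, r, base) + kld(q, r, base)) / 2.0

print(jsd([1.0, 0.0], [0.0, 1.0]))  # distributions with disjoint support
print(jsd([0.3, 0.7], [0.3, 0.7]))  # identical distributions
```

The first call corresponds to the upper bound of 1 and the second to the lower bound of 0.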

3.2.4 Normalized Scanpath Saliency (NSS)

The Normalized Scanpath Saliency (NSS) measures the strength of normalized saliency

values at gaze locations [132]. Normalization is affine so that the resulting normalized

saliency map has zero mean and unit standard deviation. The NSS is defined as the average

of normalized saliency values at gaze points:

$$ NSS = \frac{\sum_{(x,y) \in F} S'(x, y)}{|F|}, \tag{3.5} $$

where F is the set of pixel coordinates of fixations, |·| is cardinality, and

$$ S'(x, y) = \frac{S(x, y) - \mu}{\sigma}, \tag{3.6} $$

in which µ and σ are the mean and the standard deviation of the saliency map, respectively.

A positive normalized saliency value at a certain gaze point indicates that the gaze point

matches one of the predicted salient regions, zero indicates no link between predictions and

the gaze point, while a negative value indicates that the gaze point has fallen into an area

predicted to be non-salient.
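A minimal sketch of (3.5) and (3.6) follows. The function name is an assumption, and so is the use of the population standard deviation (division by the number of pixels), which the text does not specify.

```python
import math

def nss(saliency, fixations):
    """Normalized Scanpath Saliency for one frame.

    saliency: 2-D list indexed as saliency[y][x]
    fixations: list of (x, y) pixel coordinates
    """
    values = [v for row in saliency for v in row]
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (assumption)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    normalized = [(saliency[y][x] - mu) / sigma for (x, y) in fixations]
    return sum(normalized) / len(normalized)

sal = [[0.0, 0.0], [0.0, 4.0]]
print(nss(sal, [(1, 1)]))  # fixation on the salient pixel: positive
print(nss(sal, [(0, 0)]))  # fixation on a non-salient pixel: negative
```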

3.2.5 Pearson Correlation Coefficient (PCC)

The Pearson Correlation Coefficient (PCC) measures the strength of a linear relationship

between the predicted saliency map S and the ground truth map G. First, the ground truth


map G is obtained by convolving the gaze point map with a 2-D Gaussian function having

a standard deviation of 1° of visual angle [95]. Then S and G are treated as random

variables whose paired samples are given by values of the two maps at each pixel position

in the frame. The Pearson correlation coefficient is defined as

$$ \mathrm{corr}(S, G) = \frac{\mathrm{cov}(S, G)}{\sigma_S \sigma_G}, \tag{3.7} $$

where cov(·, ·) denotes covariance and σS and σG are, respectively, the standard deviations

of the predicted saliency map and the ground truth map. The value of PCC is between

−1 and 1; the value of ±1 indicates the strongest linear relationship, whereas the value

of 0 indicates no correlation. If the model’s saliency values tend to increase as the values

in the ground truth map increase, the PCC is positive. Otherwise, if the model’s saliency

values tend to decrease as the ground truth values increase, the PCC is negative. In this

context, a PCC value of −1 would mean that the model predicts non-salient regions as

salient, and salient regions as non-salient. While this is the opposite of what is needed, such a

model can still be considered accurate if its saliency map is inverted. While PCC is widely

used for studying relationships between random variables, in its default form it has some

shortcomings in the context of saliency model evaluation, especially due to center bias, as

discussed in the next section.
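The pixel-wise pairing in (3.7) can be sketched as below. The function name is an assumption; the maps are plain nested lists of equal size, and the ground truth map is assumed to have already been smoothed as described above.

```python
import math

def pcc(s_map, g_map):
    """Pearson correlation between two maps, paired pixel by pixel."""
    s = [v for row in s_map for v in row]
    g = [v for row in g_map for v in row]
    n = len(s)
    mean_s, mean_g = sum(s) / n, sum(g) / n
    cov = sum((a - mean_s) * (b - mean_g) for a, b in zip(s, g)) / n
    sigma_s = math.sqrt(sum((a - mean_s) ** 2 for a in s) / n)
    sigma_g = math.sqrt(sum((b - mean_g) ** 2 for b in g) / n)
    return cov / (sigma_s * sigma_g)

s = [[0.0, 1.0], [2.0, 3.0]]
inverted = [[3.0 - v for v in row] for row in s]
print(pcc(s, s))         # close to +1: perfect agreement
print(pcc(s, inverted))  # close to -1: salient and non-salient swapped
```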

3.3 Data Analysis Considerations

Here, we discuss several considerations about the ground truth data, and the methods

and metrics used in the evaluation.

3.3.1 Gaze Point Uncertainty

Eye-tracking datasets usually report a single point (x, y) as the gaze point of a given

subject in a given frame. However, such data should not be treated as absolute. There

are at least two sources of uncertainty in the measurement of gaze points. One is the

eye-tracker’s measurement error, which is usually on the order of 0.5° to 1° of visual

angle [6, 8, 122]. The other source of uncertainty is the involuntary eye movement during

fixations. The human eye does not concentrate on a stationary point during a fixation,

but instead constantly makes small rapid movements to make the image clearer [26].

Depending on the implementation, the eye tracker may filter those rapid movements out,

either due to undersampling or to create an impression of a more stable fixation. For at

least these two reasons, the gaze point measurement reported by an eye tracker contains

some uncertainty. At the current state of technology, the eye tracker measurement errors

seem to be larger than the uncertainty caused by involuntary drifts, and so we take them as

the dominant source of noise in the ground truth data. To account for this noise, we apply a

local maximum operator in a radius of 0.5° of visual angle. In other words, when computing

a saliency value of a given point in a frame, the maximum value within its neighborhood is

used.
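The local maximum operator can be sketched as follows. The function name and `radius_px` (the radius converted from visual angle to pixels for a particular viewing setup) are assumptions for illustration.

```python
def local_max_saliency(saliency, x, y, radius_px):
    """Maximum saliency within a disc of radius_px around pixel (x, y)."""
    height, width = len(saliency), len(saliency[0])
    r = int(radius_px)
    best = saliency[y][x]
    for yy in range(max(0, y - r), min(height, y + r + 1)):
        for xx in range(max(0, x - r), min(width, x + r + 1)):
            # Keep only pixels inside the circular neighborhood
            if (xx - x) ** 2 + (yy - y) ** 2 <= radius_px ** 2:
                best = max(best, saliency[yy][xx])
    return best

sal = [[0, 0, 0],
       [0, 1, 5],
       [0, 0, 0]]
print(local_max_saliency(sal, 1, 1, 1))  # picks up the neighboring value 5
```

The value returned for a gaze point then replaces the raw saliency value at that point in the metrics.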

3.3.2 Center Bias and Border Effects

A person recording a video will generally tend to put regions of interest near the center

of the frame [151, 131]. In addition, people also have a tendency to look at the center of

the image [150], presumably to maximize the coverage of the displayed image by their field

of view. These phenomena are known as center bias. Fig. 3.3 illustrates the center bias in

the SFU and DIEM datasets by displaying the locations of gaze points accumulated over

all sequences and all frames.

Interestingly, Kanan et al. [76] and Borji et al. [22] showed that creating a saliency map

merely by placing a Gaussian blob at the center of the frame may result in fairly high scores.

Such high scores are partly caused by using a uniform spatial distribution over the image

when selecting control samples. Specifically, the computation of AUC, KLD and JSD for a

given model involves choosing non-gaze control points randomly in an image. If these are

chosen according to a uniform distribution across the image, the process results in many

control points near the border, which, empirically, have little chance of being salient. As a

result, the saliency values of those control points tend to be small, resulting in an artificially

high score for the model under test. At the same time, since gaze points are likely located

near the center of the frame, a centered Gaussian blob would tend to match many of the

gaze points, which would make its NSS and PCC scores high.

Additionally, Zhang et al. [164] thoroughly investigated the effect of dummy zero borders

on evaluation metrics. Adding dummy zero saliency values at the border of the image

changes the distribution of saliency of the random samples as well as the normalization

parameters in NSS, leading to different scores while the saliency prediction is unchanged. To

decrease sensitivity to the center bias and the border effect, Tatler et al. [151] and Parkhurst

and Niebur [131] suggested distributing random samples according to the measured gaze

points. To this end, Tatler et al. [151] distributed random samples from human saccades

and chose control points for the current image randomly from fixation points in other

images in their dataset. Kanan et al. [76] also picked saliency values at the gaze points in

the current image, while control samples were chosen randomly from the fixations in other

images in the dataset. For both techniques, control points are drawn from a non-uniform

random distribution according to the measured fixations, decreasing the effect of center

bias. Furthermore, this way, dummy zero borders will not affect the distribution of random

samples.

In this thesis, we use a similar approach for handling center bias and border effects.

Instead of directly using the accumulated gaze points over all frames in the dataset (Fig. 3.3),

we fit an omni-directional 2-D Gaussian distribution to the accumulated gaze points across

both SFU and DIEM datasets. Then, control samples are chosen randomly from the fitted

2-D Gaussian distribution. This reduces center bias in AUC and JSD.
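The control-point selection can be sketched as below, under several assumptions that are not stated in the text: "omni-directional" is read as an isotropic Gaussian with a single standard deviation, draws are rounded to pixel positions, and draws that fall outside the frame are rejected. The function names and the seed are also illustrative.

```python
import math
import random

def fit_isotropic_gaussian(points):
    """Mean and single standard deviation of 2-D points (isotropic fit)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Pool the variance over both axes
    var = sum((x - mean_x) ** 2 + (y - mean_y) ** 2
              for x, y in points) / (2.0 * n)
    return (mean_x, mean_y), math.sqrt(var)

def sample_control_points(mean, sigma, count, width, height, seed=0):
    """Draw control points from the fitted Gaussian, kept inside the frame."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        x = round(rng.gauss(mean[0], sigma))
        y = round(rng.gauss(mean[1], sigma))
        if 0 <= x < width and 0 <= y < height:
            samples.append((x, y))
    return samples
```

Control points drawn this way concentrate near the center just as the gaze points do, so a purely center-biased map gains little from them.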

To reduce center bias and border effects in NSS, we modify the normalization as

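$$ S'(x, y) = \frac{S(x, y) - \mu}{\sigma}, \tag{3.8} $$

where

$$ \mu = \frac{1}{n} \sum_{(x,y)} N_G(x, y) \cdot S(x, y), \tag{3.9} $$

$$ \sigma = \sqrt{ \frac{1}{n-1} \sum_{(x,y)} \left( N_G(x, y) \cdot S(x, y) - \mu \right)^2 }. \tag{3.10} $$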

In the above equations, (x, y) are the pixel coordinates, n is the total number of pixels, and

NG(x, y) is the fitted 2-D Gaussian density evaluated at (x, y) normalized such that it sums

up to 1. In the normalization described by the above three equations, the pixels located

near the center of the image are given more significance due to NG. In other words, saliency

predictions have the same bias as observers’ fixations. The accuracy measures that are

modified to reduce the center bias and border effects are indicated by a prime (′) and referred

to as NSS′, AUC′, and JSD′.
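The modified normalization can be sketched by following (3.8) to (3.10) term by term. The function names, the Gaussian center, and `sigma_px` are assumptions for this example; in the study the Gaussian is the one fitted to the accumulated gaze points.

```python
import math

def normalized_gaussian(width, height, center, sigma_px):
    """2-D Gaussian evaluated on the pixel grid, scaled to sum to 1 (N_G)."""
    cx, cy = center
    weights = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma_px ** 2))
                for x in range(width)] for y in range(height)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def nss_prime(saliency, fixations, n_g):
    """NSS with the Gaussian-weighted mean and deviation of (3.9)-(3.10)."""
    height, width = len(saliency), len(saliency[0])
    n = width * height
    weighted = [n_g[y][x] * saliency[y][x]
                for y in range(height) for x in range(width)]
    mu = sum(weighted) / n
    sigma = math.sqrt(sum((w - mu) ** 2 for w in weighted) / (n - 1))
    scores = [(saliency[y][x] - mu) / sigma for (x, y) in fixations]
    return sum(scores) / len(scores)
```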

We summarize all above-mentioned metrics in Table 3.2. Metrics can be divided by

symmetry (column 2) or boundedness (column 3). Some metrics favor center-biased saliency

Table 3.2: Summary of evaluation metrics used in the study.

Metric   Symmetric   Bounded   Center-biased   Applicability   Input
AUC      Yes         Yes       Yes             General         Location
AUC′     Yes         Yes       No              Saliency        Location
KLD      No          No        Yes             General         Distribution
JD       Yes         No        Yes             General         Distribution
JSD      Yes         Yes       Yes             General         Distribution
JSD′     Yes         Yes       No              Saliency        Distribution
NSS      Yes         No        Yes             Saliency        Value
NSS′     Yes         No        No              Saliency        Value
PCC      Yes         Yes       Yes             General         Distribution

models (column 4). Also, some metrics are specific to saliency while others have more

general applicability (column 5), e.g., for comparing two distributions. The input data

for the various metrics comes from three sources (column 6): 1) the locations associated with

estimated saliency, 2) the distribution of estimated saliency, and 3) the values of estimated

saliency at fixation points.


Chapter 4

Compressed-Domain Saliency Estimation

Image and video processing methods can be distinguished by the domain in which they

operate: pixel domain vs. compressed domain. The former have the potential for higher

accuracy, but also have higher computational complexity. In contrast to the pixel-domain

methods, compressed-domain approaches make use of the data from the compressed video

bitstream, such as motion vectors (MVs), block coding modes (BCMs), motion-compensated

prediction residuals or their transform coefficients, etc. The lack of full pixel information

often leads to lower accuracy, but the main advantage of compressed-domain methods in

practical applications is their generally lower computational cost. This is due to the fact that

part of decoding can be avoided, a smaller amount of data needs to be processed compared

to pixel-domain methods, and some of the information produced during encoding (e.g., MVs

and transform coefficients) can be reused. Therefore, compressed-domain methods are more

suitable for real-time applications.

Due to its potential for lower complexity compared to pixel-domain methods, compressed-

domain visual saliency estimation is starting to be used in a number of applications such as

video compression [51], motion-based video retrieval [108], video skimming and summariza-

tion [109], video transcoding [161, 104, 143], quality estimation [101], image retargeting [41],

salient motion detection [126], and so on. Some methods only use MVs [108, 109, 161, 104]

while others also take advantage of other data produced during encoding, such as trans-

form coefficients [11, 143, 35, 42]. Although there are relatively few compressed-domain

saliency models compared to their pixel-domain counterparts, their potential for practical

deployment makes them an important research topic.


In this chapter, our goal is to evaluate visual saliency models for video that have been

designed explicitly for, or have the potential to work in, the compressed domain. This

means that they should operate with the kind of information found in a compressed video

bitstream, such as block-based motion vector field (MVF), BCMs, prediction residuals or

their transforms, etc. Finally, the strategies that have shown success in compressed-domain

saliency modeling are highlighted, and certain challenges are identified as potential avenues

for further improvement. To our knowledge, this is the first comprehensive comparison of

compressed-domain saliency models. Preliminary findings have been presented in [82] and

a more complete study has been submitted for publication [85].

4.1 Compressed-Domain Visual Saliency Models

In the following, nine prominent models listed in Table 4.1, sorted according to the

publication year, are reviewed. Different models assume different coding standards, for ex-

ample MPEG-1, MPEG-2, MPEG-4 SP (Simple Profile), MPEG-4 ASP (Advanced Simple

Profile), and MPEG-4 part 10, better known as H.264/AVC (Advanced Video Coding).

For each model, the data used from the compressed bitstream, their intended application,

as well as data and evaluation method, if any, are also included in the table. As seen in

the table, only a few of the most recent models have been evaluated using gaze data from

eye-tracking experiments, which is widely considered to be the ultimate test for a visual

saliency model. This fact makes the comparison study all the more relevant. Fig. 4.1 illus-

trates functional diagrams of the various models in the study, which will complement their

brief description below.

Figure 4.1: Functional diagrams for various compressed-domain saliency models. Block labels shown in each panel:

(a) PMES and MAM: MVF, MV Energy, Phase Consistency, Fusion, Saliency.

(b) PIM-ZEN and PIM-MCS: MVF, DCT-R, MV Energy, Edge Energy, SFC, Fusion, Saliency.

(c) MCSDM: MVF, MVs in 4X4 level, MVs in 16X16 level, Center-Surround Difference, Fusion, Saliency.

(d) GAUS-CS and PNSP-CS: MVF, DCT-P, MV Feature, Luminance Feature, Color Feature, Texture Feature, Gaussian Center-Surround Difference, Fusion, Saliency.

(e) MSM-SM: DCT-P, MVF, DCT-R, Center-Surround Difference, Edge Energy, Motion Saliency, Dissimilarity Measure, Fusion, Saliency.

(f) APPROX: DCT-P, MVF, MV Energy, Spectrum Analysis, GMC, Fusion, Saliency.

Table 4.1: Compressed-domain visual saliency models included in the study (MVF: Motion Vector Field; DCT-R: Discrete Cosine Transformation of residual blocks; DCT-P: Discrete Cosine Transformation of pixel blocks; KLD: Kullback-Leibler Divergence; AUC: Area Under Curve; ROC: Receiver Operating Characteristic)

#   Model     First Author       Year   Codec        Data          Application         Sequences          Gaze data   Metric(s)
1   PMES      Ma [108]           2001   MPEG-1/2     MVF           Video Retrieval     MPEG-7 [125]       -           -
2   MAM       Ma [109]           2002   MPEG-1/2     MVF           Video Skimming      Specific [109]     -           Human Score
3   PIM-ZEN   Agarwal [11]       2003   MPEG-1/2     MVF+DCT-R     ROI Detection       QCIF Standard      -           -
4   PIM-MCS   Sinha [143]        2004   MPEG-4 SP    MVF+DCT-R     Video Transcoding   QCIF Various       -           -
5   MCSDM     Liu [104]          2009   H.264/AVC    MVF           Rate Control        QCIF Standard      -           -
6   GAUS-CS   Fang [42]          2012   MPEG-4 ASP   MVF+DCT-P     Saliency Detection  CRCNS [69],[70]    Yes         KLD
7   MSM-SM    Muthuswamy [126]   2013   MPEG-2       MVF+DCT-P/R   Saliency Detection  Mahadevan [111]    -           ROC
8   APPROX    Hadizadeh [51]     2013   -            MVF+DCT-P     Video Compression   SFU [56]           Yes         KLD+AUC
9   PNSP-CS   Fang [43]          2014   MPEG-4 ASP   MVF+DCT-P     Saliency Detection  CRCNS [69],[70]    Yes         KLD+AUC

4.1.1 Saliency Based on Perceived Motion Energy Spectrum

Ma and Zhang [108] used the magnitude of an object’s motion and its duration as cues

for detecting salient regions. According to this saliency model, the so-called Perceived

Motion Energy Spectrum (PMES) model (Fig. 4.1-a), a motion with a large magnitude

attracts human attention. Moreover, the angle of camera motion in a shot is assumed to

be more stable than the angle of a salient object’s motion.

The saliency map in [108] was constructed by using MVs only:

$$ S = E_m \odot E_g, \tag{4.1} $$

where matrices Em and Eg respectively denote the normalized motion magnitude and global

angle coherence, and operator ⊙ represents element-wise multiplication. Matrix Em

was obtained from the average motion magnitude over a constant duration of a shot after

removing outliers. Specifically, at each MV, a 3-D spatio-temporal tracking volume from

which the motion magnitude is computed was considered. Inlier MVs were filtered by an

α-trimmed average filter followed by normalization, that is

$$ E_m(n) = \begin{cases} x/\tau, & x \le \tau \\ 1, & x > \tau, \end{cases} \tag{4.2} $$

where x is the filtered motion magnitude of block n and τ denotes the truncating threshold.

So, if Em of a MV approaches 1, the corresponding block has a large motion

magnitude and therefore probably attracts human attention.

The global angle coherence of a MV over its corresponding duration of the shot was

computed from the normalized entropy:

$$ E_g(n) = \frac{-\sum_{\theta} P(\theta) \cdot \log P(\theta)}{\log n}, \tag{4.3} $$

where P(θ) is the probability mass function (i.e., normalized histogram) of MV angles, and

n is the number of histogram bins within the associated tracking volume. The denominator

used here, log n, is for normalizing the value of Eg(n) to the range [0, 1]. Note that when

P(θ) = 1/n for all θ in the tracking volume associated with the block n, Eg(n) will be 1,

and when P(θ) → 1 for a certain θ, Eg(n) approaches 0. Thus, if Eg(n) approaches

0, it implies that the camera motion is dominant, and if it approaches 1, it indicates

inconsistency of the motion, which likely attracts human attention.
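The two PMES terms can be sketched for a single block as follows. This is an illustration of (4.1) to (4.3) only: the function names, the number of angle bins, and the use of degrees are assumptions, and the tracking-volume construction and α-trimmed filtering described above are taken as already done.

```python
import math

def motion_energy(x, tau):
    """Normalized motion magnitude of (4.2): x / tau, truncated at 1."""
    return x / tau if x <= tau else 1.0

def angle_coherence(angles_deg, n_bins=8):
    """Normalized entropy of the MV angle histogram, as in (4.3)."""
    hist = [0] * n_bins
    bin_width = 360.0 / n_bins
    for a in angles_deg:
        hist[int((a % 360.0) // bin_width) % n_bins] += 1
    total = len(angles_deg)
    entropy = -sum((c / total) * math.log(c / total) for c in hist if c > 0)
    return entropy / math.log(n_bins)

def pmes_block_saliency(filtered_magnitude, tau, angles_deg, n_bins=8):
    # Per-block product corresponding to the element-wise form of (4.1)
    return motion_energy(filtered_magnitude, tau) * angle_coherence(angles_deg, n_bins)

print(angle_coherence([10.0] * 8))                       # one dominant angle
print(angle_coherence([i * 45.0 for i in range(8)]))     # angles spread evenly
```

A single dominant angle yields a coherence near 0 (camera-like motion), while evenly spread angles yield a value near 1.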


The saliency map computed by (4.1) takes two rules into consideration: the sensitivity

of the Human Visual System (HVS) to large-magnitude motion and the stability of camera

motion. It is unclear whether the assumptions made about HVS in this paper actually hold

in practice. Furthermore, experimental evaluation was undertaken in the context of video

retrieval, rather than gaze prediction.

4.1.2 Saliency Based on Motion Attention Model

In the Motion Attention Model (MAM) [109] (Fig. 4.1-a), developed by the same group

as PMES in Section 4.1.1, two new rules are further added to reduce false detection: the

MVs of a moving object tend to have coherent angles, and if the magnitudes of an object’s MVs

are large and their angles are incoherent, the motion information is not reliable. Therefore,

the saliency map in the MAM is computed as

$$ S = E_m \odot E_t \odot (\mathbf{1} - E_m \odot E_s), \tag{4.4} $$

where matrices Em, Es and Et are, respectively, the normalized motion magnitude, spatial

angle coherence and temporal angle coherence, and 1 is a matrix of 1s. Note that operators

⊙ and − are element-wise operations. Spatial angle coherence was defined within a spatial

window, while temporal angle coherence was obtained within a temporal window, both as

normalized entropies, similar to (4.3).
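The element-wise combination in (4.4) can be written directly on nested lists. The function name is an assumption, and the three input maps are assumed to be precomputed, equally sized, and normalized to [0, 1].

```python
def mam_saliency(e_m, e_s, e_t):
    """Element-wise S = Em * Et * (1 - Em * Es) over equally sized maps."""
    return [[m * t * (1.0 - m * s) for m, s, t in zip(row_m, row_s, row_t)]
            for row_m, row_s, row_t in zip(e_m, e_s, e_t)]

# Large magnitude with fully incoherent spatial angles is suppressed
print(mam_saliency([[1.0]], [[1.0]], [[1.0]]))
# Large magnitude with coherent spatial angles passes through, scaled by Et
print(mam_saliency([[1.0]], [[0.0]], [[0.5]]))
```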

The target application of [109] was video skimming. The proposed motion saliency

model was evaluated using the following technique. Both the video and the obtained motion

saliency model were shown to 10 viewers, who assigned the saliency model a score of 0,

50, or 100 for each shot. The viewers gave a score of 100 if they felt the saliency model

best predicted the Foci of Attention (FOAs); they gave a score of 50 if they believed the

estimated saliency model was not the best but satisfactory; otherwise they gave a score of 0.

No comparison with other saliency estimation methods was reported.

4.1.3 Saliency Based on Perceptual Importance Map

The Perceptual Importance Map [11] based on the Zen method [162] (PIM-ZEN) (Fig. 4.1-b)

computes the saliency map based on MVs and DCT values. In this model, the saliency

map was computed as

$$ S = E_m + E_f + E_e, \tag{4.5} $$

where matrix Em is the normalized motion magnitude, matrix Ef represents the Spatial

Frequency Content (SFC), and matrix Ee indicates the edge energy. The edge energy was

calculated by applying a Laplacian of Gaussian to the DCT coefficient values in each macroblock (MB).

The SFC was also computed for each MB as the number of DCT coefficient values that are

larger than a predefined threshold. The biological motivation behind Ef is that the SFC

at the eye-fixated locations in an image is noticeably higher than, on average, at random

locations [135].

In contrast to the PIM-ZEN model that gives the same weight to each energy term

above, in the Perceptual Importance Map based on Motion Center-Surround (PIM-MCS)

model [143] (Fig. 4.1-b) the saliency is estimated by the weighted linear combination of the

three energy terms,

$$ S = 4E_m + 2E_f + E_e. \tag{4.6} $$

Sinha et al. [143] employed a motion energy term obtained by normalizing $(E_m^1 + E_m^2)$, in

which $E_m^1$ and $E_m^2$ of the block n in P- or B-frames are, respectively, defined as

$$ E_m^1(n) = \frac{\|v(n)\|^2}{dc(n)}, \tag{4.7} $$

$$ E_m^2(n) = \left| \, \|v(n)\| - \overset{2}{\Uparrow}\left( \overset{2}{\Downarrow}\left( L(\|v(n)\|) \right) \right) \right|, \tag{4.8} $$

where dc(n) indicates the DC value of the block n, or equivalently, the normalized sum of

absolute differences of each decoded block. Here DCT means the DCT of the residue for a MB

in P- and B-frames. Essentially, a large DC value reflects an unreliable MV. Symbol v(n)

represents the MV of block n, L denotes a low-pass filter, and operators $\overset{2}{\Downarrow}$ and $\overset{2}{\Uparrow}$ represent,

respectively, downsampling and upsampling by a factor of two. Note that function $E_m^2$

was inspired by a center-surround mechanism and attempts to measure how different the

magnitude of v(n) is from the motion magnitudes in its neighborhood.

In [143], the objective was video transcoding using region-of-interest analysis. Unfortunately,

the saliency estimation performance was not evaluated on eye-tracking data.

4.1.4 Saliency Based on Motion Center-Surround Difference Model

Liu et al. [104] proposed a method for detecting saliency using MVs based on the Motion

Center-Surround Difference Model (MCSDM) (Fig. 4.1-c). Seven different resolutions were

constructed according to MV sizes, k ∈ {4×4, 4×8, 8×4, 8×8, 8×16, 16×8, 16×16},

where 4 × 4 and 16 × 16 are, respectively, the finest and the coarsest resolution. The MV

at any resolution was obtained by averaging the corresponding 4 × 4 MVs; for example, a

MV at resolution 16× 16 is computed by averaging the 16 corresponding 4× 4 MVs. The

motion saliency at level k for the MV v(n) was defined as the average magnitude difference

from 8-connected neighbors at the same level, denoted by Λk(n):

$$ M_k(n) = \frac{\sum_{m \in \Lambda_k(n)} \|v(n) - v(m)\|}{|\Lambda_k(n)|}, \tag{4.9} $$

in which |·| represents cardinality. The final saliency map was defined at the lowest level

(MV size 4× 4) by averaging multiscale motion saliency maps:

$$ S(n) = \frac{1}{7} \sum_{k=4\times 4}^{16\times 16} M_k(n). \tag{4.10} $$

In [104], saliency is used in the context of rate control in video coding. There was no

evaluation of the saliency model on gaze data.
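The multiscale computation in (4.9) and (4.10) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of [104]: the input is a grid of 4 × 4 MVs whose height and width are multiples of four cells, each coarser level is mapped back to the 4 × 4 grid by replication, and blocks without neighbors receive zero saliency. The names `pool`, `level_saliency` and `mcsdm` are hypothetical.

```python
import math

# Block sizes in units of 4x4 cells (rows, cols), covering the seven MV sizes
# 4x4, 4x8, 8x4, 8x8, 8x16, 16x8 and 16x16.
LEVELS = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 4), (4, 2), (4, 4)]

def pool(field, bh, bw):
    """Average the 4x4 MVs inside each group of bh-by-bw cells."""
    out = []
    for r in range(len(field) // bh):
        row = []
        for c in range(len(field[0]) // bw):
            cells = [field[r * bh + i][c * bw + j] for i in range(bh) for j in range(bw)]
            row.append((sum(v[0] for v in cells) / len(cells),
                        sum(v[1] for v in cells) / len(cells)))
        out.append(row)
    return out

def level_saliency(field):
    """Eq. (4.9): mean MV difference magnitude over the 8-connected neighbors."""
    rows, cols = len(field), len(field[0])
    sal = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            diffs = [math.hypot(field[r][c][0] - field[rr][cc][0],
                                field[r][c][1] - field[rr][cc][1])
                     for rr in range(max(r - 1, 0), min(r + 2, rows))
                     for cc in range(max(c - 1, 0), min(c + 2, cols))
                     if (rr, cc) != (r, c)]
            sal[r][c] = sum(diffs) / len(diffs) if diffs else 0.0
    return sal

def mcsdm(field):
    """Eq. (4.10): average the seven per-level maps on the 4x4 grid."""
    rows, cols = len(field), len(field[0])
    total = [[0.0] * cols for _ in range(rows)]
    for bh, bw in LEVELS:
        sal = level_saliency(pool(field, bh, bw))
        for r in range(rows):
            for c in range(cols):
                total[r][c] += sal[r // bh][c // bw]
    return [[v / len(LEVELS) for v in row] for row in total]
```

A uniform MV field yields zero saliency everywhere, whereas a single block moving differently from its surroundings receives the highest value.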

4.1.5 Saliency Based on Center-Surround Difference

In the Gaussian Center-Surround difference (GAUS-CS) model [42] and the Parametrized

Normalization, Sum and Product - Center-Surround difference (PNSP-CS) model [43]

(Fig. 4.1-d), saliency is identified through static and motion saliency maps. In this method, assuming

a YCrCb video sequence had been encoded by an MPEG-4 encoder, the static saliency was

determined from Intra-coded frames (I-frames) while the motion saliency was computed

from Predicted frames (P-frames) and Bi-predicted frames (B-frames).

Three features were extracted from I-frames from which static saliency was computed:

1. Luminance feature (L): DC coefficients from the Y channel,

2. Color feature (Cr,Cb): Two DC coefficients, one from each color channel,

3. Texture feature (T ): a vector containing the first nine AC coefficients in the Y channel.

These coefficients usually contain most of the energy in each DCT block [152].

From P- and B-frames, the set of MVs was extracted as the feature of motion saliency (M).

Based on the Feature Integration Theory of Attention [153], a conspicuity map was constructed for each feature f ∈ {L, Cr, Cb, T, M} using the center-surround difference mechanism. The sensitivity of the HVS to center-surround differences decreases as the distance between the center and surround increases. Fang et al. simulated this relationship by a Gaussian function:

$$C_f(n) = \sum_{m \neq n} \frac{1}{\sigma\sqrt{2\pi}} \cdot \Delta_f(n,m) \cdot \exp\!\left(-\frac{\|n-m\|_2^2}{2\sigma^2}\right) \qquad (4.11)$$

where Cf denotes the conspicuity of feature f , m and n are 2-D block indices, ||n −m||2

is the Euclidean distance between the centers of blocks n and m, σ is the parameter of

the Gaussian function, and ∆f (n,m) represents the absolute difference between the feature

values of blocks n and m.
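As an illustration of (4.11), the following Python sketch computes the conspicuity map of a scalar feature (for example, the luminance DC value). It assumes that block distances are measured in block units and that the feature difference is a plain absolute difference of scalars; vector-valued features such as texture or motion would need a suitable difference measure. The function name is hypothetical.

```python
import math

def conspicuity(feature, sigma):
    """Eq. (4.11) for a 2-D grid of scalar feature values, one per block."""
    rows, cols = len(feature), len(feature[0])
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for rr in range(rows):
                for cc in range(cols):
                    if (rr, cc) == (r, c):
                        continue  # the sum in (4.11) excludes m = n
                    dist2 = (r - rr) ** 2 + (c - cc) ** 2  # squared Euclidean distance
                    delta = abs(feature[r][c] - feature[rr][cc])
                    total += norm * delta * math.exp(-dist2 / (2.0 * sigma ** 2))
            out[r][c] = total
    return out
```

The direct double loop costs O(N²) for N blocks, which is acceptable for a sketch; a practical implementation would likely truncate the Gaussian to a local window.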

Feature maps corresponding to luminance, Cr component, Cb component and texture

were all averaged into the static saliency map:

$$S_s(n) = \frac{1}{4} \sum_{f \in \{L, Cr, Cb, T\}} C_f(n). \qquad (4.12)$$

On the other hand, the motion saliency map was obtained simply as St(n) = CM (n). The

final saliency map was computed separately for I- and P/B-frames:

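$$S = \begin{cases} \dfrac{S_s + S_t^{-1}}{2} & \text{for I-frames}, \\[4pt] \dfrac{S_t + S_s^{-1}}{2} & \text{for P/B-frames}, \end{cases} \qquad (4.13)$$

where $S_t^{-1}$ and $S_s^{-1}$ are, respectively, the motion saliency map of the previous predicted frame and the static saliency map of the previous I-frame.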
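The frame-type-dependent fusion in (4.12) and (4.13) amounts to simple bookkeeping, sketched below in Python. The handling of the very first frames, for which no previous map of the other type exists yet, is an assumption (the current map is used alone); the function names are hypothetical.

```python
def average_maps(maps):
    """Element-wise average of equally sized 2-D maps, as in Eq. (4.12)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]

def fuse_sequence(frames):
    """Apply Eq. (4.13) to a list of (frame_type, saliency_map) pairs.

    For 'I' frames the map is the static saliency S_s; for 'P' or 'B' frames
    it is the motion saliency S_t. Each output is the average of the current
    map and the most recent map of the other type.
    """
    prev_static, prev_motion = None, None
    fused = []
    for frame_type, sal in frames:
        if frame_type == "I":
            other, prev_static = prev_motion, sal
        else:
            other, prev_motion = prev_static, sal
        # Assumption: with no earlier map of the other type, use the current map alone.
        fused.append(sal if other is None else average_maps([sal, other]))
    return fused
```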

Fang et al. reported high accuracy of their proposed saliency method as compared to

the well-known saliency models [69, 73, 50] when the comparison was performed on the

public database in [69] using KLD and ROC area measurements.

4.1.6 Saliency Based on Motion Saliency Map and Similarity Map

The Motion Saliency Map - Similarity Map (MSM-SM) model [126] (Fig. 4.1-e) consists

of two steps for generating the final motion saliency map. In the first step, the edge

strength of each block is computed based on low-frequency AC coefficients of the luma

and chroma channels [142]. These features (luma edge and chroma edge) are then used to

construct a spatial saliency map according to center-surround differences. The motion

saliency map is the result of refining the spatial saliency map by using an accumulated

binary motion map across neighboring frames. Each binary motion frame is obtained by

thresholding the magnitude of MVs, and the refining is carried out by element-wise

multiplication. In the second step, the dissimilarity of DC images of the luma and chroma

channels among co-located blocks over the frames is calculated by entropy. The final

saliency map is the product of the spatial and motion saliency maps. Muthuswamy and

Rajan [126] evaluated their model on a segmented ground-truth video dataset using

Equalized Error Rate (EER), the value at which the false alarm rate is equal to the miss

rate [111].

4.1.7 A Convex Approximation to IKN Saliency

In the APPROX model of Hadizadeh [51] (Fig. 4.1-f), the goal was to approximate the standard IKN saliency model [73] using only DCT coefficients of image blocks. Specifically, the method searches for the portion of the image in the normalized frequency range [π/256, π/16], assuming the image spectrum is in the normalized frequency range [0, π]. The range [π/256, π/16] was chosen based on the analysis of the pyramid structure in the IKN model [54]. The Wiener filter was used to extract the portion of the original image signal in the frequency range [π/256, π/16] from the spectrum of each block.

The (j, l)-th 2-D DCT coefficient of the block n is given by

$$X_n(j,l) = \frac{1}{4} C_j C_l \sum_{x=0}^{W-1} \sum_{y=0}^{W-1} n(x,y) \cdot \cos\!\left(\frac{(2x+1) j \pi}{2W}\right) \cdot \cos\!\left(\frac{(2y+1) l \pi}{2W}\right), \qquad (4.14)$$

where W is the width and height of the block n, $C_0 = 1/\sqrt{2}$ and $C_i = 1$ for $i \neq 0$. The Wiener transfer function to separate the desired signal from the undesired signal, when the two signals are statistically orthogonal, has the following form

$$T(j,l) = \frac{S_S(j,l)}{S_S(j,l) + S_N(j,l)}, \qquad (4.15)$$

where SS(j, l) and SN (j, l) are the power spectral densities of the desired and the undesired

signal, respectively.

Hadizadeh used a 1/f model to compute the Wiener transfer function T. To do this, two deterministic 1/f 2-D signals with the size of the original image were constructed such that one covers the frequency band [π/256, π/16] and the other covers the remainder of the spectrum. Then, a block of a given size (say 16 × 16) was extracted from the center of each signal, and the 2-D DCT was performed, resulting in $X_S(j,l)$ for the desired signal and $X_N(j,l)$ for the undesired signal. The DCT-domain Wiener transfer function was obtained as

$$T(j,l) = \frac{X_S^2(j,l)}{X_S^2(j,l) + X_N^2(j,l)}, \qquad (4.16)$$

in which $X_S^2(j,l)$ and $X_N^2(j,l)$ are, respectively, the powers of the desired and the undesired signal associated with DCT coefficient (j, l). The Wiener transfer function can be pre-computed for typical resolutions and block sizes.

The approximation to the spatial saliency of the block n was computed as

$$S_s(n) = \sum_{(j,l) \in n} T^2(j,l) \cdot X_n^2(j,l). \qquad (4.17)$$

Similarly, the temporal saliency of the block n was computed as

$$S_t(n) = \sum_{(j,l) \in n} T^2(j,l) \cdot X_r^2(j,l), \qquad (4.18)$$

where $X_r(j,l)$ is the (j, l)-th 2-D DCT coefficient of the residual block r. The residual block r was obtained as the absolute difference between the block n in the current frame and its co-located block in the previous frame.
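Given the DCT coefficients of the desired and undesired reference blocks, the weighting in (4.16) and the per-block saliency in (4.17) and (4.18) reduce to a few lines. The Python sketch below takes these coefficient arrays as inputs and does not construct the 1/f reference signals; setting the weight to zero where both reference powers vanish is an assumption, and the function names are hypothetical.

```python
def wiener_weights(xs, xn):
    """Eq. (4.16): T = X_S^2 / (X_S^2 + X_N^2), element-wise over a DCT block."""
    weights = []
    for row_s, row_n in zip(xs, xn):
        row = []
        for s, n in zip(row_s, row_n):
            power = s * s + n * n
            # Assumption: a coefficient with zero power in both signals gets weight 0.
            row.append(s * s / power if power > 0 else 0.0)
        weights.append(row)
    return weights

def block_saliency(weights, coeffs):
    """Eqs. (4.17) and (4.18): sum of T^2 * X^2 over the coefficients of a block."""
    return sum((t * x) ** 2
               for row_t, row_x in zip(weights, coeffs)
               for t, x in zip(row_t, row_x))
```

The same `block_saliency` call serves for both spatial saliency (with the DCT of the pixel block) and temporal saliency (with the DCT of the residual block), since the weights depend only on the block size and resolution.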

Finally, the overall saliency was approximated by the weighted sum of the obtained

spatial and temporal saliency:

$$S = (1-\alpha) \cdot S_s + \alpha \cdot S_t, \qquad (4.19)$$

where α is a positive constant.

Hadizadeh also proposed another method for estimating the temporal saliency in case of camera motion. In this method, the camera motion is compensated prior to temporal saliency computation. Let $S'_t$ be the temporal saliency after global motion compensation (GMC). Then, the overall saliency was refined as

$$S = (1-\alpha_1) \cdot S_s + \alpha_1 \cdot S'_t + \alpha_2 \cdot S_s S'_t, \qquad (4.20)$$

where α1 and α2 are some normalizing constants. In (4.20), α1 trades off between the spatial saliency and the temporal saliency, and α2 controls the mutual reinforcement.
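The fusion rules (4.19) and (4.20) can be written as one small function, shown here as a Python sketch. It assumes that the product of the spatial and temporal maps in (4.20) is taken element-wise, and that setting the second weight to zero recovers (4.19); the function name and parameter names are hypothetical.

```python
def fuse_saliency(ss, st, alpha1, alpha2=0.0):
    """Eq. (4.20); with alpha2 = 0 this reduces to Eq. (4.19).

    ss and st are equally sized 2-D maps of spatial and temporal saliency.
    The mutual-reinforcement term is assumed to be an element-wise product.
    """
    return [[(1.0 - alpha1) * s + alpha1 * t + alpha2 * s * t
             for s, t in zip(row_s, row_t)]
            for row_s, row_t in zip(ss, st)]
```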

Hadizadeh [51] evaluated his spatial saliency approximation on two common eye-tracking

datasets for images, Toronto [24] and MIT [7], using AUC [151] and KLD [31] measures.

Based on the reported results, his spatial saliency method is statistically very close to the

standard IKN saliency model for images [73], according to a t-test comparison [118]. For

evaluating the combination of spatial and temporal saliency approximation to the IKN

model, the KLD on the eye-tracking dataset in [56] containing 12 standard sequences was

used. The comparison was made against the IKN model with motion and flicker

channels [66], and with MaxNorm normalization. Again, the two methods were shown to be

statistically similar. Because saliency based on the models in [51] is convex as a

function of image data, these models are referred to as convex approximations to the IKN

model.

4.2 Experiments

In the literature, existing models have been developed for different applications and

their evaluation was based on different datasets and quantitative criteria. Furthermore,

models are often tailored to a particular video coding standard, and the encoding param-

eter settings used in the evaluation are often not reported. All of this makes a fair and

comprehensive comparison more challenging. To enable meaningful comparison, in this

work we reimplemented all compared methods on the same platform, and evaluated them

under the same encoding conditions. The results of the comparison indicate which strate-

gies seem promising in the context of compressed-domain saliency estimation for video, and

point the way towards improving existing models and developing new ones.

4.2.1 Experimental Setup

In order to have a unified framework for comparison, we have implemented all models

in MATLAB 8.0 on the same machine, with an Intel Core i7 CPU at 3.40 GHz and

16 GB of RAM, running 64-bit Windows 8.1. Where possible, we verified the implementation

by comparing the results with those presented in the corresponding papers and/or by

contacting the authors. As seen in Table 4.1, each model assumed a certain video coding

standard. However, fundamentally, they all rely on the same type of information – MVs

and DCT of residual blocks (DCT-R) or pixel blocks (DCT-P). The main difference is in

the size of the blocks to which MVs are assigned or to which DCT is applied. In standards

up to MPEG-4 ASP, the minimum block size was 8× 8, whereas H.264/AVC allowed block

sizes down to 4 × 4 [159]. To ensure a fair comparison, in which all models

accept the same input data, we chose to encode all videos in the MPEG-4 ASP format.

This choice ensured that seven out of nine models in the study did not require modification.

Minor modifications were necessary to two models in order for them to accept MPEG-4

ASP input data, as noted below.

First, in MCSDM [104], which is intended to operate on MVs from an H.264/AVC bitstream (Table 4.1), we changed the minimum block size from 4 × 4 to 8 × 8. Second, in the APPROX model [51], where the spatial saliency map relies on DCT values of 16 × 16 pixel blocks, the 16 × 16 DCT was computed from the 8 × 8 DCTs using a fast algorithm from [60]. Also, the minimum MV block size was set to 8 × 8.

To encode the videos used in the evaluation, the GOP (Group-of-Pictures) structure

was set to IPPP with the GOP size of 12, i.e., the first frame is coded as intra (I), the

next 11 frames are coded predictively (P), then the next frame is coded as I, and so on.

The MV search range was set to 16 with 1/4-pel motion compensation. The quantization

parameter (QP) value was set to 12. For this QP setting, the resulting Y-channel

peak signal-to-noise ratio (PSNR) ranged from 28.70 dB (for Mobile Calendar) to 39.82 dB

(for harry-potter-6-trailer). In the decoding stage, the DCT-P values (in I-frames) and

DCT-R values (in P-frames), as well as MVs (in P-frames), were extracted for each 8 × 8

block. Encoding and partial decoding to extract the required data were accomplished using

the FFmpeg library [4] (version 2.0.1).
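The GOP structure described above determines which compressed-domain features are available in each frame. The following Python sketch makes this mapping explicit for the IPPP structure with a GOP size of 12; the function names are hypothetical and frame indices are assumed to start at zero.

```python
def frame_type(index, gop_size=12):
    """Frame type under an IPPP structure: every gop_size-th frame is intra-coded."""
    return "I" if index % gop_size == 0 else "P"

def available_features(index, gop_size=12):
    """Features extracted per 8x8 block: DCT-P in I-frames, DCT-R and MVs in P-frames."""
    if frame_type(index, gop_size) == "I":
        return ["DCT-P"]
    return ["DCT-R", "MV"]
```

For example, frames 0 and 12 provide only DCT-P values, which is why models that rely solely on MVs cannot produce a saliency map for I-frames.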

4.2.2 Results

In this section, the performance of the nine compressed-domain saliency models de-

scribed above is compared amongst themselves, and also against three high-performing

pixel-domain models in order to gain insight into the relationship between the accuracy of

the current state of the art in pixel-domain and compressed-domain saliency estimation for

video.

Among the pixel-domain models, we chose AWS (Adaptive Whitening Saliency) [1],

which takes only spatial information into account, as well as MaxNorm [73] (including tem-

poral and flicker channels) and GBVS (Graph-Based Visual Saliency) [59] with DIOFM

channels (DKL-color, Intensity, Orientation, Flicker, and Motion), which take both spa-

tial and temporal information into account for estimating the saliency. AWS is frequently

reported as one of the top performing models on still natural images [22, 87]. MaxNorm

is known as a gold-standard model for saliency estimation. GBVS is another well-known

model, often used as a benchmark for comparison. While MATLAB implementations of the

AWS and GBVS models are available, the MaxNorm implementation is only available in C.

Therefore, we used MATLAB implementations for all pixel-domain and compressed-domain

models under study except MaxNorm.

Before presenting quantitative evaluation, we show a qualitative comparison of saliency

maps produced by various models on a few specific examples. Fig. 4.2 shows the saliency

maps for frame #150 of City, frame #150 of Mobile Calendar, frame #300 of one-show and

frame #300 of advert-iphone. In the figure, the MVF of each frame is also shown beneath

the corresponding frame.

In City, all the motion is due to camera movement. While observers typically look at the

building in the center of the frame (cf. IO in Fig. 4.2), all models declare the boundary of the

building as salient, where local motion is different from the global motion (GM). Meanwhile,

APPROX is also able to detect the central building as salient. Note that APPROX is the

only model in the study that employs GMC and its high scores on City are an indication

that other models could benefit from incorporating GMC.

The salient objects in Mobile Calendar, i.e., the ball and the train, are easily detectable

due to large and distinctive MVs. Therefore, all models, except AWS and GBVS, are able

to successfully estimate saliency in this sequence. Recall that AWS does not use temporal

information at all. GBVS also shows weak performance on this frame, in part because of

its implicit center-bias applied to the saliency map.

In one-show, large noisy MVs in low-texture areas cause all compressed-domain models

except MSM-SM to mistakenly declare them as salient regions. Recall that MSM-SM does

not directly use motion magnitude but rather uses processed MVs in the form of a motion

binary map. In this sequence, observers mostly focus on the face (cf. IO in Fig. 4.2), so

a model able to perform face detection would have done well in this example.

Unfortunately, none of the models is currently able to do face detection in the compressed

domain; this seems like a rather challenging problem. All three pixel-domain models

(MaxNorm, GBVS and AWS) also declare some part of non-salient regions as salient.

Finally, advert-iphone does not have any salient motion. Again, models typically detect

noisy MVs as salient. MSM-SM does not detect any saliency at all since none of the MVs

are strong enough according to the criteria of this model to activate its motion binary map.

In this example, as well as the previous one, the sensitivity of GAUS-CS and PNSP-CS

to spatial saliency is clearly visible. It is not surprising that AWS performs the best in this

example, because the true saliency in this example does not depend on motion.

Figure 4.2: Sample saliency maps obtained by various models. For each of the four frames (City, Mobile Calendar, one-show, advert-iphone), the panels show the frame, its MVF, IO, PMES, MAM, PIM-ZEN, PIM-MCS, MCSDM, GAUS-CS, PNSP-CS, MSM-SM, APPROX, MaxNorm, GBVS and AWS.

Next, we present quantitative assessment of the saliency models using the data from the

SFU and DIEM datasets. We start with the assessment based on AUC′. Fig. 4.3 shows the

average AUC′ scores of various models across the test sequences. Note that all models are

able to produce saliency maps for P-frames, while only some of them are able to produce a

saliency map for I-frames. Hence, Fig. 4.3 (top) shows the average AUC′ scores on I-frames

for those models able to handle I-frames, while Fig. 4.3 (bottom) shows the average AUC′

scores for all models on P-frames. Sequences from the SFU dataset are indicated with

capital first letter.

As seen in the figure, all models achieved average AUC′ scores between those of IO, which

represents a kind of upper bound (especially on the DIEM dataset), and GAUSS, which

represents a center-biased, content-independent static saliency map. Note that GAUSS itself

has a slightly better AUC′ score than the pure chance score of 0.5. Recall that AUC′ corrects

for center bias by random sampling of control points based on empirical gaze distribution

across all frames and all sequences. It is encouraging that all models are able to surpass

GAUSS and achieve average AUC′ scores around 0.6.

Another interesting point in Fig. 4.3 is an indication of how difficult or easy saliency

prediction is in a given sequence according to AUC′. In the figure, the sequences are sorted

along the horizontal axis in decreasing order of average AUC′ score across all models.

Although the order is not the same for I- and P-frames, overall, it seems that one-show,

sport-scramblers, advert-bbc4-library, bbc-life-in-cold-blood, and ami-ib4010-closeup are the

sequences where the saliency is easiest to predict, whereas City, Harbour, Tempete,

news-tony-blair-resignation, and university-forum-construction-ionic are the sequences where saliency

is hardest to predict. We will return to this issue shortly. Note that IO has better

performance on the sequences from the DIEM dataset. Here, IO saliency maps are formed by the

left eye gaze points and represent an excellent indicator of the ground-truth right eye gaze

points. In the sequences from the SFU dataset, where the IO saliency map is formed from the

gaze points of the second viewing, the IO scores are not as high because the second-viewing

gaze points are not as good a predictor of the ground-truth first-viewing gaze points.

Similar results quantifying the models’ performance according to NSS′, JSD′

and PCC are shown in Figs. 4.4, 4.5 and 4.6. As seen in Figs. 4.3, 4.4, 4.5 and 4.6, the

models that are able to handle I-frames (top parts of the figures) achieve similar average

scores on the I-frames as they do on the P-frames (bottom parts of the figures). For this

reason, in the remainder of the chapter the results for I- and P-frames will sometimes be

reported jointly. That is, in such cases, all scores will be the averages across all frames that

the model is able to handle. Since the number of I-frames is much smaller than the number

of P-frames, for the models that are able to handle I-frames, the effect of I-frame scores

on the combined score is relatively small. Note that PCC is not center bias-corrected, so

PCC scores for GAUSS are higher than all other models except IO. Also note that apart

from IO, which is always ranked first, the ranking of the models depends on the metric. For

example, MSM-SM scores well according to NSS′, but poorly according to JSD′.

Table 4.2 shows the ranking of test sequences according to the average scores across all

models except IO and GAUSS. The sequences are ranked in decreasing order of average

scores – the highest-ranked sequences are those for which the average scores are highest,

and therefore seem to be the easiest for saliency prediction. Meanwhile, the lowest-ranked

sequences are those for which saliency prediction seems the most difficult. Although the

ranking differs somewhat for different metrics, overall, Mobile Calendar seems to be among

the easiest sequences for saliency prediction, while City, Tempete and pingpong-no-bodies are

among the hardest. Mobile Calendar contains several moving objects, including a ball and a

train. The motion of each of these is sufficiently strong and different from the surroundings

that almost all models are able to correctly predict viewers’ gaze locations. It should be

noted that the background of this sequence involves many static colorful regions that, in the

absence of motion, would have the potential to attract attention. It is encouraging that the

compressed-domain models are generally able to identify the salient moving objects against

such a colorful and potentially attention-grabbing background. Meanwhile, AWS and GBVS

show relatively poor performance on this sequence.

On the other hand, City, Tempete and pingpong-no-bodies do not contain salient moving

objects. In fact, City does not contain any moving objects; all the motion in this sequence

is due to camera movement. Tempete also contains significant camera motion (zoom out)

and in addition shows falling yellow leaves that act like motion noise, as they do not attract

viewers’ attention. pingpong-no-bodies does not include any salient moving object at all.

While all models get confused by the falling leaves in Tempete, APPROX achieves a decent

performance on City due to its use of GMC. APPROX is the only model in the study

Figure 4.3: Accuracy of various saliency models over SFU and DIEM dataset according to AUC′ score for (top) I-frames and (bottom) P-frames. The 2-D color map shows the average AUC′ score of each model on each sequence. Top: Average AUC′ score for each sequence, across all models. Right: Average AUC′ scores of each model across all sequences. Error bars represent standard error of the mean (SEM), σ/√n, where σ is the sample standard deviation of n samples.

Figure 4.4: Accuracy of various saliency models over SFU and DIEM dataset according to NSS′.

Figure 4.5: Accuracy of various saliency models over SFU and DIEM dataset according to JSD′.

Figure 4.6: Accuracy of various saliency models over SFU and DIEM dataset according to PCC.

Table 4.2: Ranking of test sequences according to average scores across all models excluding IO and GAUSS.

Rank  AUC′     JSD′     NSS′     PCC
1     os       os       abl      Stefan
2     abl      Mobile   Mobile   Hall
3     Mobile   Hall     mtnin    abl
4     Stefan   Stefan   os       mtnin
5     ail      abl      blicb    Mobile
6     aic      Soccer   ail      blicb
7     blicb    ail      Stefan   aic
8     Hall     Mother   ss       bws
9     ss       aic      swff     Soccer
10    Garden   ss       abb      hp6t
11    mtnin    Crew     Garden   ss
12    bws      Garden   aic      Mother
13    Soccer   Bus      Soccer   mg
14    abb      mtnin    bws      Crew
15    swff     blicb    Hall     swff
16    Mother   Harbour  ufci     os
17    pas      Foreman  ntbr     ds
18    ds       bws      pas      ail
19    Harbour  City     Mother   Harbour
20    nim      tucf     Harbour  Bus
21    mg       abb      ds       abb
22    Bus      swff     hp6t     pas
23    ntbr     ntbr     nim      Foreman
24    Foreman  mg       mg       nim
25    hp6t     ds       Bus      ntbr
26    ai       hp6t     pnb      tucf
27    pnb      ai       Foreman  City
28    Crew     Tempete  ai       Garden
29    tucf     pas      Crew     ufci
30    Tempete  ufci     tucf     ai
31    ufci     nim      Tempete  pnb
32    City     pnb      City     Tempete

that employs GMC and its success on City is an indication that other models could be

improved by incorporating GMC. Note that AWS also scores well on City because, as a

spatial saliency model, it ignores motion and therefore does not get confused by camera

motion in this sequence.

Table 4.3 lists the ranking of sequences according to IO scores. Recall that IO measures

the congruence between the left and right eye gaze point in the DIEM dataset, and between

the first and second viewing in the SFU dataset. Not surprisingly, the sequences from the

DIEM dataset mostly occupy the top of the table, since there is high agreement between

the left and right eye gaze point in each frame of each sequence. While there are some

Table 4.3: Ranking of test sequences according to different metrics for IO.

Rank  AUC′     JSD′     NSS′     PCC
1     os       os       os       blicb
2     abl      abl      abl      nim
3     mtnin    mtnin    mtnin    mg
4     ss       ufci     Mobile   os
5     ufci     ss       ufci     pnb
6     blicb    tucf     ss       bws
7     bws      ai       swff     pas
8     ai       Stefan   ail      ds
9     mg       blicb    blicb    hp6t
10    ds       Mother   ds       ai
11    swff     Foreman  ai       ufci
12    hp6t     ail      bws      ss
13    ail      mg       pas      mtnin
14    aic      bws      tucf     aic
15    tucf     aic      hp6t     abl
16    pas      ds       mg       abb
17    Stefan   Hall     Mother   ail
18    abb      Soccer   Soccer   swff
19    nim      Mobile   abb      Stefan
20    Mother   swff     aic      ntbr
21    Foreman  hp6t     Stefan   tucf
22    pnb      Garden   pnb      Foreman
23    Mobile   pas      nim      Hall
24    Hall     abb      Foreman  Tempete
25    Soccer   Bus      ntbr     Mother
26    ntbr     Crew     Hall     Soccer
27    Bus      ntbr     Garden   Bus
28    Garden   nim      Bus      Mobile
29    Tempete  Harbour  Tempete  Garden
30    Crew     Tempete  Crew     Crew
31    Harbour  pnb      Harbour  Harbour
32    City     City     City     City

sequences from the SFU dataset that appear in the top half of the table according to some

metrics (for example, Stefan, Mother and Daughter, and Foreman according to JSD′, and

Mobile Calendar according to NSS′), most SFU sequences are at the bottom of the table.

This is not surprising, because the congruence between the first and second viewing of the

sequence is much lower than that between the left and right eye gaze point. In particular,

City is at the bottom of the table according to all four metrics. In this sequence, the central

building attracts viewers’ gaze in both viewings, but because the building occupies a large

portion of the frame, the actual gaze points on the first and second viewing may end up

being very far apart, leading to low scores even for IO.


The average scores of saliency models across all sequences in both datasets are shown

in Fig. 4.7 for various accuracy metrics. Note that the horizontal axis has been focused

on the relevant range of scores. Not surprisingly, IO achieves the highest scores regardless

of the metric. At the same time, the effect of center bias is easily revealed by comparing

AUC and NSS scores to their center bias-corrected versions AUC′ and NSS′. For example,

the AUC measures the accuracy of saliency prediction of a particular model against a

control distribution drawn uniformly across the frame. Since the uniform distribution is a

relatively poor control distribution for saliency and easy to outperform, all models achieve

a higher AUC score compared to their AUC′ score, which uses a control distribution fitted

to the empirical gaze points shown in Fig. 3.3. This effect is most visible in the GAUSS

benchmark model, which has the AUC score of around 0.8 (higher than all the models

except IO), but the AUC′ score of only slightly above 0.5 (lower than all other models).

This exaggeration of the accuracy of a simple scheme such as GAUSS when plain

AUC is used was the reason why [131, 76] suggested center bias correction via non-uniform

control sampling. The center bias-corrected AUC′ score is a better reflection of the models’

accuracy. Center bias also has a significant effect on NSS, but a less pronounced effect on

JSD. It can also be observed that GAUSS (and then GBVS) achieves a higher PCC score

than any other method except IO, due to the accumulation of fixations near the center of

the frame.

In addition to the average scores, another type of assessment of a model’s performance is

counting its number of appearances among top performing models for each sequence [117].

To this end, a multiple comparison test is performed using Tukey’s honestly significant

difference as the criterion [63]. Specifically, for each sequence, we compute the average

score of a model across all frames, as well as the 95% confidence interval for the average

score. Then we find the model with the highest average score (excluding IO and GAUSS),

and find all the models whose 95% confidence interval overlaps that of the highest-scoring

model. All such models are considered top performers for the given sequence. Fig. 4.8

shows two examples. In the left panel, MaxNorm has the highest average AUC′ score, and

its 95% confidence interval does not overlap any of the other models’ intervals. Hence, in

this case, MaxNorm is the sole top performer. On the other hand, in the right panel, the

95% confidence interval of the top-scoring MCSDM overlaps the corresponding intervals of

PMES, PIM-MCS, APPROX and MaxNorm. In this case, all five models are considered top

Figure 4.7: Evaluation of models using various metrics. Panels show, for each model (PMES, MAM, PIM-ZEN, PIM-MCS, MCSDM, GAUS-CS, PNSP-CS, MSM-SM, APPROX, MaxNorm, GBVS, AWS, GAUSS and IO), the average AUC′, AUC, JSD′, JSD, NSS′, NSS and PCC scores.

performers. The number of appearances among top performers for each model is shown in

Fig. 4.9. These results show similar trends as average scores, with MaxNorm, AWS, GBVS,

PMES, GAUS-CS and PNSP-CS often being among top performers, while APPROX, MAM

and MCSDM rarely offering top scores.
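The selection of top performers for one sequence can be sketched in Python as follows. This is a simplified illustration only: it uses a normal-approximation 95% confidence interval (mean ± 1.96 · SEM) for each model, rather than the intervals produced by a multiple comparison test with Tukey's honestly significant difference criterion, so the resulting intervals will generally differ from those in Fig. 4.8. The function names and the `z` parameter are hypothetical.

```python
import math

def mean_ci(scores, z=1.96):
    """Mean and approximate 95% confidence interval, using SEM = sigma / sqrt(n)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width

def top_performers(per_model_scores, exclude=("IO", "GAUSS")):
    """Models whose interval overlaps that of the model with the highest mean score.

    per_model_scores maps a model name to its per-frame scores on one sequence.
    """
    stats = {m: mean_ci(s) for m, s in per_model_scores.items() if m not in exclude}
    best = max(stats, key=lambda m: stats[m][0])
    _, best_lo, best_hi = stats[best]
    return sorted(m for m, (_, lo, hi) in stats.items()
                  if lo <= best_hi and hi >= best_lo)
```

Counting how often a model appears in the returned list across all sequences gives the kind of tally shown in Fig. 4.9.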

Since a compressed video representation always involves some amount of information

loss, it is important to determine the sensitivity of the compressed-domain saliency model

to the amount of compression. Note that the predictive power of MVs and DCT coefficients

could change dramatically across the compression range. To study this issue, we repeated

the experiments described above for different amounts of compression, by varying the QP

parameter. The quality of encoded video drops as the QP increases, as shown by the PSNR

values on the SFU dataset for QP ∈ {4, 8, 12, 16, 20, 24} in Table 4.4.

Fig. 4.10 shows how the average AUC′ and NSS′ scores change as a function of the QP

parameter. In this experiment, MaxNorm, AWS and GBVS were applied to the decoded

video, hence they effectively used the same data as compressed-domain models, but in the

pixel domain after full video reconstruction. As seen in the figure, the models typically

score slightly poorer as the video quality drops, because of the less accurate MVs and DCT

Figure 4.8: Illustration of multiple comparison test for (left) Hall Monitor and (right) Mobile Calendar sequences using AUC′. The horizontal axis shows the average AUC′ score of each model.

Figure 4.9: The number of appearances among top performers, using various evaluation metrics (AUC′, JSD′, NSS′, PCC). Results for I-frames are shown at the top, those for P-frames at the bottom.

Table 4.4: The PSNR value (in dB) for various QP values on sequences from the SFU dataset.

Sequence  Bus   City  Crew  Foreman  Garden  Hall  Harbour  Mobile  Mother  Soccer  Stefan  Tempete  Avg.
QP=4      36.8  37.6  38.8  38.2     37.3    39.3  36.6     36.2    41.0    38.6    38.0    36.9     37.9
QP=8      32.0  33.4  35.0  34.3     32.1    36.1  32.0     31.4    38.0    34.4    33.2    32.4     33.7
QP=12     29.4  31.3  33.0  32.4     29.3    34.3  29.6     28.7    36.2    32.4    30.5    30.0     31.4
QP=16     27.9  30.1  31.8  31.1     27.4    32.9  28.1     27.0    35.1    31.2    28.7    28.4     30.0
QP=20     26.6  29.1  30.9  30.1     26.0    31.8  26.9     25.6    34.2    30.3    27.3    27.3     28.8
QP=24     25.8  28.5  30.3  29.4     25.0    30.8  26.2     24.6    33.5    29.6    26.2    26.5     28.0

Table 4.5: Average processing time in milliseconds per frame.

Model     Time (ms)
AWS       1535
GBVS      783
MAM       166
PMES      98
MaxNorm   91
GAUS-CS   57
PNSP-CS   57
PIM-ZEN   18
APPROX    11
PIM-MCS   9
MSM-SM    7
MCSDM     4

coefficients at higher compression ratios. Nonetheless, the saliency prediction performance

is still reasonably consistent over this range of QP values, leading to the conclusion that the

models’ performance is not too sensitive to encoding parameters over a reasonable range

of video qualities. This observation is consistent with the studies by Le Meur [94], and

Milanfar and Kim [87].

The average processing time per frame on the SFU dataset (CIF resolution videos at

30 fps) is listed in Table 4.5. The time taken for extracting MVs and DCT values from the

bitstream is excluded. Note that these results correspond to MATLAB implementations of the models, except for MaxNorm, which was implemented in C, and that the processing time could be decreased significantly by implementing the models in a lower-level programming language such as C/C++. Even so, some of the models are fast enough for real-time performance (under 33 ms per frame) even when implemented in MATLAB. Discussion of accuracy and

complexity of the models is presented in the next section.

Figure 4.10: The relationship between the QP parameter and the models’ accuracy on the SFU dataset. [Two panels: average AUC′ versus QP and average NSS′ versus QP, for QP = 4, 8, 12, 16, 20, 24.]

4.3 Discussion

Considering the results in Fig. 4.7 and Fig. 4.9, MaxNorm, GBVS, AWS, PMES, GAUS-

CS and PNSP-CS consistently achieve high scores across different metrics. It is encouraging

that the accuracy of some compressed-domain models is competitive with that of pixel-

domain models. This was already claimed by Fang et al. [42, 43], though based on a smaller study. Note that, in general, achieving a high score with one metric does not guarantee a high score with other metrics. As an example, MSM-SM achieves relatively high average scores across several metrics, but the lowest JSD and JSD′ scores. Hence, the fact that

MaxNorm, GBVS, AWS, PMES, GAUS-CS and PNSP-CS perform consistently well across

all metrics considered in this study lends additional confidence in their accuracy.

PMES was the first compressed-domain saliency model, proposed in 2001, and it only

uses MVs to estimate saliency. It is well known that motion is a strong indicator of saliency

in dynamic visual scenes [120, 111, 68], so it is not surprising that MVs would be a pow-

erful cue for saliency estimation. PMES estimates saliency by considering two properties:

large motion magnitude in a spatio-temporal region, and the lack of coherence among MV

angles in that region. These two properties seem to describe salient objects reasonably well

in most cases, as demonstrated by the results. Taken together, they resemble a center-

surround mechanism where a region is considered salient if it sufficiently “stands out” from

its surroundings.

GAUS-CS and PNSP-CS show high accuracy in both I- and P-frames. Both models are

based on the center-surround difference mechanism, and both employ MVs for saliency esti-

mation in P-frames and DCT of pixel values in I-frames. The capability of center-surround

difference mechanism to predict where people look has been discussed extensively [73], so

their success is also not surprising.

Although PIM-MCS and MSM-SM also attempt to employ the center-surround differ-

ence mechanism, their scores are not as consistently high as those of GAUS-CS and PNSP-

CS. The reason may be that in GAUS-CS and PNSP-CS models, the contrast is inversely

proportional to the distance between the current DCT block and all other DCT blocks

in the frame, which means that they consider not only the contrast between blocks, but

also the distance between them. This seems to be a good strategy for compressed-domain

saliency estimation.


According to the results in the previous section, the lowest-scoring models on most met-

rics were APPROX and MCSDM. Incidentally, both these models were originally developed

for a different type of input data and had to be modified for this comparison, which may

degrade their performance. Specifically, both models were developed for MVs correspond-

ing to 4 × 4 blocks, whereas the evaluation in this work employed MVs corresponding to

8× 8 blocks. Additionally, APPROX originally assumed DCT coefficients of 16× 16 blocks

from a raw (uncompressed) frame, whereas in this evaluation, 16× 16 DCT was computed

from four 8× 8 DCTs of compressed frames, which involved quantization noise. Looking at

the results in Figs. 4.3 and 4.4, the gap between GAUS-CS (PNSP-CS) and APPROX is

smaller in the top parts of the figures (I-frames) than in the bottom parts (P-frames), which

indicates that the effect of quantization noise was not as detrimental to the performance of

APPROX as the switch from 4× 4 MVs to 8× 8 MVs.

The influence of global (camera) motion on visual saliency is still an open research

problem, with limited work in the literature addressing this issue. Reference [9] studied

separately the effect of pan/tilt and zoom-in/-out. It was found that in the case of pan/tilt,

the gaze points tend to shift towards the direction of pan/tilt, in the case of zoom-in,

they tend to concentrate near the frame center, and in the case of zoom-out, they tend to

scatter further out. On the other hand, according to [18], the presence of camera motion

tends to concentrate gaze points around the center of the frame “according to the direction

orthogonal to the tracking speed vector.”

Among the models tested in the present study, only APPROX took GM into account

by removing it before the analysis of MVs. This paid off in the case of City, which was

overall the most difficult sequence for other spatio-temporal saliency models in Figs. 4.3

and 4.4. However, GMC did not help much in the case of Tempete or Flower Garden.

In fact, Tempete contains strong zoom-out, which, according to [9], would tend to scatter

the gaze points around the frame. However, Figs. 4.3 and 4.4 show that GAUSS, with its

simple center-biased saliency map, scores well here (even with center-bias-corrected metrics),

suggesting that the gaze points are still located near the center of the frame. This is due to

the presence of a yellow bunch of flowers in the center of the frame, which turns out to be

highly attention-grabbing. Apparently, the key to accurate saliency estimation in Tempete

is not in the motion, but rather in the color present in the scene. Flower Garden is another

example where GMC did not pay off. The viewers’ gaze in this sequence is attracted to


the objects in the background, specifically the windmill and the pedestrians, whose motion

tends to be zeroed out after GMC on 8 × 8 MVs. Overall, the results suggest that GM

is not sufficiently well handled by current compressed-domain methods, and that further

research is needed to make progress on this front.

Considering models’ complexity and processing time in Table 4.5, MCSDM is the fastest

while AWS is the most demanding in terms of processing. While MaxNorm, AWS, GBVS,

PMES, GAUS-CS and PNSP-CS all scored highly in terms of accuracy, the processing time

of GAUS-CS and PNSP-CS is only half that of PMES and MaxNorm, one fifteenth that

of GBVS, and one thirtieth that of AWS, which would make them preferable in real-world

applications. The entropy decoding time of MVs and DCT residuals was not taken into account, and PMES skips I-frames and uses only MVs in P-frames. However, GAUS-CS and PNSP-CS can also be forced to skip I-frames and rely only on MVs in P-frames, which would make the decoding time required for these models equal to that of PMES. Hence, the relative processing times in Table 4.5 would still be a good indicator of their relative complexity.

It is interesting to note that the five fastest models (MCSDM, MSM-SM, PIM-MCS,

APPROX and PIM-ZEN) are able to offer real-time performance (under 33 ms per frame)

on CIF sequences even with a relatively inefficient MATLAB implementation. This suggests

that an optimized implementation of some of these models may be a good candidate for

low-weight saliency estimation for applications such as real-time video stream analysis and

video quality monitoring.
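The real-time claim can be checked directly against Table 4.5. The following Python sketch lists the models whose average processing time fits within the per-frame budget at 30 fps. The timing values are copied from Table 4.5; the variable and function names are ours and are not part of the original study.

```python
# Average processing time per frame (ms), copied from Table 4.5.
TIME_MS = {
    "AWS": 1535, "GBVS": 783, "MAM": 166, "PMES": 98, "MaxNorm": 91,
    "GAUS-CS": 57, "PNSP-CS": 57, "PIM-ZEN": 18, "APPROX": 11,
    "PIM-MCS": 9, "MSM-SM": 7, "MCSDM": 4,
}


def real_time_models(times_ms, fps=30.0):
    """Return models whose per-frame time fits within 1000/fps ms, fastest first."""
    budget_ms = 1000.0 / fps  # about 33.3 ms at 30 fps
    fast = [name for name, t in times_ms.items() if t <= budget_ms]
    return sorted(fast, key=times_ms.get)
```

Calling `real_time_models(TIME_MS)` returns the same five models named in the paragraph above, ordered from fastest to slowest.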

As mentioned before, this study was performed on MPEG-4 ASP bitstreams because this

way, the majority of the models (seven out of nine) did not require modification. The two

models that required modification, because they were developed for H.264/AVC bitstreams,

did not score particularly well, presumably because their parameters were tailored to the smaller MV block sizes of H.264/AVC. Had the test been done on H.264/AVC bitstreams, these two models might have scored higher, but there would have been considerable ambiguity in how to extend the remaining models to handle smaller block sizes. Looking

towards the future, however, bitstreams that conform to H.264/AVC and the more recent

High Efficiency Video Coding (HEVC) [147] standard are of considerable interest, and

further research is needed to build saliency models that take advantage of their peculiarities.


4.4 Conclusions

In this chapter we attempted to provide a comprehensive comparison of nine compressed-

domain visual saliency models for video. All methods were reimplemented in MATLAB and

tested on two eye-tracking datasets using several accuracy metrics. Care was taken to cor-

rect for center bias and border effects in the employed metrics, which were issues found

in earlier studies on visual saliency model evaluation. The results indicate that in many

cases, reasonably accurate visual saliency estimation is possible using only motion vectors

from the compressed video bitstream. This is encouraging considering that motion vectors

occupy a relatively small portion of the bitstream (usually around 20%) and no further

decoding is required. Several compressed-domain saliency models showed competitive ac-

curacy with some of the best currently known pixel-domain models. On top of that, the

fastest compressed-domain methods are fast enough for real-time saliency estimation on

CIF video even with a relatively inefficient MATLAB implementation, which suggests that

their optimized implementation could be used for online saliency estimation in a variety of

applications.

Many sequences that have turned out to be difficult for models to handle contain global

(camera) motion. The influence of GM on visual saliency is not very well understood, and

most models in the study did not account for it. A number of compressed-domain global

motion estimation (GME) methods, based on motion vectors alone, have been developed

recently (cf. Section 7.2.2), so it is reasonable to expect that compressed-domain saliency

models should be able to benefit from these developments.


Chapter 5

Compressed-Domain Correlates of Fixations

In this chapter, we first present two video features called Motion Vector Entropy (MVE)

and Smoothed Residual Norm (SRN) that can be computed from the compressed video

bitstream using motion vectors (MVs), block coding modes (BCMs), and transformed pre-

diction residuals. Next, we propose another feature as a measure of incompressibility, called

Operational Block Description Length (OBDL) that is computed directly from the output

of the entropy decoder, which is the first processing block in a video decoder. No further

decoding of the compressed bitstream is needed for computing OBDL, whereas partial de-

coding is required to extract MVE and SRN. Finally, the potential of these three features

to predict saliency is demonstrated by comparing their statistics around human fixation

points in a number of videos against the non-attended points selected randomly away from

fixations. The high statistical agreement between feature values and fixation points qualifies

these features as correlates of fixations. That is to say, these features are highly indicative

of attended regions in video.

5.1 Compressed-domain features

Typical video compression consists of motion estimation and motion-compensated pre-

diction, followed by transformation, quantization and entropy coding of prediction resid-

uals and MVs. These processing blocks have existed since the earliest video coding stan-

dards, getting more sophisticated over time. Our compressed-domain features are computed


from the outputs of these basic processing blocks. For concreteness, we shall focus on the

H.264/AVC coding standard [159], but the feature computation can be adjusted to other

video coding standards, including the latest High Efficiency Video Coding (HEVC) [147].

Due to the focus on H.264/AVC, our terminology involves 16 × 16 macroblocks (MBs),

block coding modes (BCM: INTER, INTRA, SKIP) for various block sizes (4 × 4, 8 × 8,

etc.), and the 4 × 4 integer transform that we shall refer to as “DCT” although it is only

an approximation to the actual Discrete Cosine Transform.

5.1.1 Motion Vector Entropy (MVE)

Motion vectors (MVs) in the video bitstream carry important cues regarding tempo-

ral changes in the scene. A MV is a two-dimensional vector v = (vx, vy) assigned to a

block, which represents its offset from the best-matching block in a reference frame. The

best-matching block is found via motion estimation, often coupled with rate-distortion op-

timization. The MV field can be considered as an approximation to the optical flow.

When a moving object passes through a certain region in the scene, it will generate dif-

ferent MVs in the corresponding spatio-temporal neighborhood; some MVs will correspond

to the background, others to the object itself, and object’s MVs themselves may be very

different from each other, especially if the object is flexible. On the other hand, an area

of the scene covered entirely by the background will tend to have consistent MVs, caused

mostly by the camera motion. From this point of view, variation of MVs in a given spatio-

temporal neighborhood could be used as an indicator of the presence of moving objects,

which in turn shows potential for attracting attention.

Before computing the feature that describes the above-mentioned concept, block pro-

cessing in the frame is performed as follows. SKIP blocks are assigned a zero MV, while

INTRA-coded blocks are excluded from analysis. Then all MVs in the frame are mapped

to 4× 4 blocks; for example, a MV assigned to an 8× 8 block is allocated to all four of its

constituent 4× 4 blocks, etc.

We define a motion cube as a causal spatio-temporal neighborhood of a given 4×4 block

n, as illustrated in Fig. 5.1. The spatial dimension of the cube (W ) is selected to be twice

the size of the fovea (2° of visual angle [66]) while the temporal dimension (L) is set to 200

ms. For example, for CIF resolution (352× 288) video at 30 frames per second and viewing

conditions specified in [56], these values are W = 52 pixels and L = 6 frames.


Figure 5.1: Motion cube N(n) associated with block n (shown in red) is its causal spatio-temporal neighborhood of size W ×W × L.

To give a quantitative interpretation of MV variability within the motion cube, we use

the normalized Motion Vector Entropy (MVE) map, defined as

\[
S_{\mathrm{MVE}}(n) = -\frac{1}{\log N} \sum_{i \in H(\mathcal{N}(n))} \frac{n_i}{N} \log\left(\frac{n_i}{N}\right), \tag{5.1}
\]

where $\mathcal{N}(n)$ is the motion cube associated with the 4×4 block $n$, $H(\cdot)$ is the histogram, $i$ is the bin index, $n_i$ is the number of MVs in bin $i$, and $N = \sum_i n_i$. The factor $1/\log N$ in (5.1) serves to normalize MVE so that its maximum value is 1, achieved when $n_i = n_j, \forall i, j$. The minimum value of MVE is 0, achieved when $n_i = 0$ for all $i$ except one.

Histogram H (N(n)) is constructed from MVs of inter-coded blocks within the motion

cube. Depending on the encoder settings, such as search range and MV accuracy (full pixel,

half pixel, etc.), each MV can be represented using a finite number of pairs of values (vx, vy).

Each possible pair of values (vx, vy) under the given motion estimation settings defines one

bin of the histogram. The cube is scanned and every occurrence of a particular (vx, vy)

results in incrementing the corresponding ni by 1.
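The histogram construction and the normalized entropy in (5.1) can be sketched as follows. This is a minimal illustration, assuming the MVs of the inter-coded blocks in one motion cube have already been collected into a list of `(vx, vy)` tuples; the function name is ours. Only bins that actually occur are stored, which gives the same sum because empty bins contribute nothing. Returning 0 for cubes with fewer than two MVs is our convention to avoid dividing by log 1 = 0.

```python
import math
from collections import Counter


def motion_vector_entropy(mvs):
    """Normalized entropy of the MVs in one motion cube, following (5.1).

    mvs is a list of (vx, vy) tuples, one per inter-coded 4x4 block.
    """
    n_total = len(mvs)
    if n_total < 2:
        # log(1) = 0 would make the normalization undefined (our convention).
        return 0.0
    counts = Counter(mvs)  # one histogram bin per distinct (vx, vy) pair
    entropy = sum((c / n_total) * math.log(n_total / c) for c in counts.values())
    return entropy / math.log(n_total)
```

A cube in which every MV is identical yields 0, while a cube in which every MV is different yields 1, in line with the minimum and maximum discussed for (5.1).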

It has been observed that large-size blocks are more likely to be part of the background,

whereas small-size blocks, arising from splitting during the motion estimation, are more

likely to belong to moving objects [13] (cf. Fig. 7.6). To take this into account, during block

processing, 4 × 4 INTER blocks are assigned random vectors from a uniform distribution

over the motion search range prior to mapping MVs from larger INTER and SKIP blocks

to their constituent 4× 4 blocks. This way, a motion cube that ended up with many 4× 4

INTER blocks during encoding is forced to have high MVE.


5.1.2 Smoothed Residual Norm (SRN)

Large motion-compensated prediction residual is an indication that the best-matching

block in the reference frame is not a very good match to the current block. This in turn

means that the motion of the current block cannot be well predicted using the block trans-

lation model, either due to the presence of higher-order motion or due to (dis)occlusions. Of

these two, (dis)occlusions often yield higher residuals. Moreover, (dis)occlusions are associ-

ated with surprise, as a new object enters the scene or gets revealed behind another moving

object, so they represent a potential attractor of attention. Therefore, large residuals might

be an indicator of regions that have the potential to attract attention.

The “size” of the residual is usually measured using a certain norm, for example the $\ell_p$ norm for some p ≥ 0. In this work we employ the $\ell_0$ norm, i.e., the number of non-zero elements, since it is easier to compute than other popular norms such as $\ell_1$ or $\ell_2$. For any macroblock (MB), we define the Residual Norm (RN) as the norm of the quantized transformed prediction residual of the MB, normalized to the range [0, 1]. For the $\ell_0$ norm employed here, RN is

\[
\mathrm{RN}(z) = \frac{1}{256} \|z\|_0, \tag{5.2}
\]

where z denotes the quantized transformed residual of a 16× 16 MB.

The map of MB residual norms is then smoothed spatially using a 3×3 averaging filter, temporally using a moving average filter over the previous L frames, and finally upsampled by

a factor of 4 using bilinear interpolation. The result is the Smoothed Residual Norm (SRN)

map, with one value per 4× 4 block, just like the MVE map.
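The first two steps of the SRN computation can be sketched as below: the residual norm of (5.2) and the 3×3 spatial averaging. This is a simplified illustration rather than the implementation used in the experiments. The function names are ours, the handling of border MBs (averaging over the neighbours that exist) is an assumption, and the temporal moving average and bilinear upsampling are omitted.

```python
def residual_norm(coeffs):
    """RN of one 16x16 MB per (5.2): fraction of non-zero quantized coefficients.

    coeffs is a flat sequence of the 256 quantized transform coefficients.
    """
    if len(coeffs) != 256:
        raise ValueError("expected 256 coefficients for a 16x16 MB")
    return sum(1 for c in coeffs if c != 0) / 256.0


def smooth_3x3(rn_map):
    """3x3 spatial averaging of a 2-D list of RN values.

    Border MBs are averaged over the neighbours that exist; this border
    handling is an assumption, not something specified in the text.
    """
    rows, cols = len(rn_map), len(rn_map[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [rn_map[rr][cc]
                    for rr in range(max(0, r - 1), min(rows, r + 2))
                    for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out
```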

5.1.3 Operational Block Description Length (OBDL)

The Operational Block Description Length (OBDL) is computed directly from the out-

put of the entropy decoder, which is the first processing block in a video decoder. No

further decoding of the compressed bitstream is needed for computing this feature. The

number of bits spent on encoding each MB is extracted and mapped to the unit interval

[0, 1], where the value of 0 is assigned to the MB(s) requiring the least bits to code and

the value of 1 is assigned to the MB(s) requiring the most bits to code, among all MBs in

the frame. The normalized OBDL map is smoothed by convolution with a 2-D Gaussian

of standard deviation equal to 2° of visual angle. Although the spatially smoothed OBDL

map is already a solid saliency measure, we observed that an additional improvement in

the accuracy of saliency predictions is possible by performing further temporal smoothing.

This conforms with what is known about biological vision [12, 10, 124], where temporal

filtering is known to occur in the earliest layers of visual cortex. Specifically, we apply a

simple causal temporal averaging over 100 ms to obtain a feature derived from the OBDL.
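The per-frame normalization and the causal temporal averaging described above can be sketched as follows. The function names are ours, and the Gaussian spatial smoothing step is omitted. Returning an all-zero map when every MB costs the same number of bits is our convention, and the number of frames that make up the 100 ms window (about three at 30 fps) is left to the caller, since the text does not say how it is rounded.

```python
def normalize_obdl(bits_per_mb):
    """Map per-MB bit counts of one frame (2-D list) to [0, 1] by min-max scaling."""
    flat = [b for row in bits_per_mb for b in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # All MBs cost the same number of bits; returning zeros is our convention.
        return [[0.0 for _ in row] for row in bits_per_mb]
    return [[(b - lo) / (hi - lo) for b in row] for row in bits_per_mb]


def causal_temporal_average(maps):
    """Average equally sized 2-D maps: the current frame and the preceding
    frames that fall inside the temporal window."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)]
            for r in range(rows)]
```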

The OBDL simplifies many of the previously proposed compression-based measures of

saliency. For example, representing the block by a set of DCT coefficients resembles the

subspace [64, 48, 114], sparse [65, 98] or independent component [24] decomposition, which

are at the core of various saliency measures. The differential encoding of DCT coefficients or,

more generally, spatial block prediction, resembles the center-surround operations of [73],

while motion-compensated prediction resembles the surprise mechanism of [69]. In fact,

given the well known convergence of modern entropy coders to the entropy rate of the

source being compressed

\[
H = \frac{1}{n} \sum_i \log \frac{1}{P(x_i)}, \tag{5.3}
\]

where P (x) is the probability of symbol x, the number of bits produced by the entropy coder

is a measure of the conditional self information of each block. Hence, a video compressor

is a very sophisticated implementation of the saliency principle of Bruce and Tsotsos [24],

which evaluates saliency as

\[
S(x) = \log \frac{1}{P(x)}. \tag{5.4}
\]

While Bruce and Tsotsos [24] proposed a simple independent component analysis to extract

features x from the image pixels, the video compressor performs a sequence of operations in-

volving motion compensated prediction, DCT transform of the residuals, predictive coding

of DCT coefficients, quantization, and entropy coding, all within a rate-distortion opti-

mization framework. This results in a much more accurate measure of information and,

moreover, is much simpler to obtain in practice, given the widespread availability of video

codecs.

The proposed OBDL is even simpler to extract from compressed bitstreams than the

other forms of compressed-domain information including our compressed-domain features

in Sections 5.1.1 and 5.1.2, because the recovery of MVs or residuals is not required.

Overall, the OBDL combines the accuracy of the pixel-domain saliency measures (which


will be demonstrated later in the dissertation) with the computational efficiency of their

compressed-domain counterparts.

5.2 Discriminative Power of the Proposed Compressed-Domain Features

To assess how indicative of fixations the three proposed features are, we use an approach

similar to the protocol of Reinagel and Zador [135], who performed an analogous analysis

for two other features – spatial contrast and local pixel correlation – on still natural images.

Their analysis showed that in still images, on average, spatial contrast is higher, while local

pixel correlation is lower, around fixation points compared to random points.

We follow the same protocol, using two eye-tracking datasets, the SFU [56] and DIEM [2]

(cf. Section 3.1 for details on the datasets). In these experiments, each video was encoded

in the H.264/AVC format using the FFMPEG library [4] (version 2.3) with a quantization

parameter (QP) of 30 and 1/4-pixel MV accuracy with no range restriction, and up to four

MVs per MB. After encoding, transformed residuals (DCT), BCMs, MVs and the number of

bits assigned to each MB of P-frames were extracted and the three features were computed

as explained above.

In each frame, feature values at fixation points were selected as the test sample, while

feature values at non-fixation points were selected as the control sample. Specifically, the

control sample was obtained by applying a nonparametric bootstrap technique [36] to all

non-fixation points of the video frame. Control points were sampled with replacement,

multiple times, with sample size equal to the number of fixation points. The average of the

feature values over all bootstrap (sub)samples was taken as the control sample mean.
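The bootstrap step for the control sample can be sketched as follows. This is an illustration under stated assumptions: the function name, the number of bootstrap repetitions (`n_boot`) and the fixed random seed are ours, since the text only says that sampling was repeated "multiple times".

```python
import random


def control_sample_mean(non_fixation_values, n_fixations, n_boot=100, seed=0):
    """Bootstrap estimate of the mean feature value at non-fixation points.

    Draws n_boot samples of size n_fixations, with replacement, from the
    feature values at non-fixation points of one frame, and returns the
    average of the sample means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(non_fixation_values, k=n_fixations)
        means.append(sum(sample) / n_fixations)
    return sum(means) / n_boot
```

The test sample mean for the same frame is simply the average of the feature values at the fixation points, so each frame contributes one (control sample mean, test sample mean) pair.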

The pairs of values (control sample mean, test sample mean) are shown in Fig. 5.2 for

each frame as a green dot. The top scatter plot corresponds to MVE, the middle scatter

plot to SRN and the bottom scatter plot to OBDL. From these plots, it is easy to see

that, on average, MVE, SRN and OBDL values at fixation points tend to be higher than,

respectively, MVE, SRN and OBDL values at randomly-selected non-fixation points. This

suggests that MVE, SRN and OBDL could be used as indicators of possible fixations in

video.

Figure 5.2: Scatter plots of the pairs (control sample mean, test sample mean) in each frame, for MVE (top), SRN (middle) and OBDL (bottom). Dots above the diagonal show that feature values at fixation points are higher than at randomly selected points.

To validate this hypothesis, we performed a two-sample t-test [90] using the control and test samples of each sequence. The null hypothesis was that the two samples originate from populations with the same mean. A separate test was performed for each of MVE, SRN and OBDL. The null hypothesis was rejected by the two-sample t-test, at the 1% significance level, for all sequences and all three features. The p-values obtained for each video sequence are listed in Table 5.1, along with the percentage of frames where the test sample mean is greater than the control sample mean. Note that the p-values in most cases are very low, indicating a low overall risk of wrongly rejecting the null hypothesis. However, the percentage columns show that, while the test sample mean is higher than the control sample mean in most frames of most sequences, there are also sequences (e.g., Foreman, abb, ai, pnb) with a significant percentage of frames where this is not true.
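For reference, the statistic behind a two-sample t-test can be sketched as below. The text does not state which variant of the test was used; this sketch assumes the classical pooled-variance (equal-variance) form, and the function name is ours. Converting the statistic to a p-value requires the Student t distribution, which is not part of the Python standard library and is therefore omitted.

```python
import math


def two_sample_t(test, control):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    Assumes equal population variances (an assumption of this sketch) and
    at least two values per sample with non-zero pooled variance.
    """
    n1, n2 = len(test), len(control)
    m1, m2 = sum(test) / n1, sum(control) / n2
    v1 = sum((x - m1) ** 2 for x in test) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2
```

A large positive statistic means that the test sample mean (feature values at fixations) exceeds the control sample mean by much more than the within-sample variability would explain.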

Overall, these results lend strong support to the assertion that MVE, SRN and OBDL

are compressed-domain correlates of fixations in natural video. In the next chapter, we

describe two simple approaches to visual saliency estimation using the proposed features,

and then proceed to compare the proposed approaches against several state-of-the-art visual

saliency models.

Table 5.1: Results of statistical comparison of test and control samples. For each sequence, the p-value of a two-sample t-test and the percentage (%) of frames where the test sample mean is larger than the control sample mean are shown.

              MVE             SRN             OBDL
#   Seq.      p        %      p        %      p        %
1   Bus       10^-66   100    10^-114  99     10^-112  99
2   City      10^-8    72     10^-51   84     10^-16   73
3   Crew      10^-50   90     10^-66   93     10^-29   83
4   Foreman   10^-26   76     10^-14   58     10^-10   56
5   Garden    10^-20   83     10^-53   88     10^-52   90
6   Hall      10^-223  97     10^-222  95     10^-211  96
7   Harbour   10^-55   88     10^-130  100    10^-83   98
8   Mobile    10^-66   94     10^-86   91     10^-58   88
9   Mother    10^-181  99     10^-168  100    10^-120  100
10  Soccer    10^-69   97     10^-93   97     10^-68   94
11  Stefan    10^-48   100    10^-69   100    10^-53   98
12  Tempete   10^-5    63     10^-54   97     10^-31   89
13  abb       10^-24   66     10^-66   75     10^-36   76
14  abl       10^-119  93     10^-144  96     10^-86   92
15  ai        10^-24   65     10^-59   78     10^-37   73
16  aic       10^-Inf  100    10^-262  100    10^-192  100
17  ail       10^-216  98     10^-199  98     10^-162  98
18  blicb     10^-55   96     10^-100  98     10^-79   96
19  bws       10^-70   99     10^-98   95     10^-82   98
20  ds        10^-49   95     10^-46   84     10^-50   92
21  hp6t      10^-31   92     10^-62   97     10^-42   84
22  mg        10^-62   97     10^-73   97     10^-49   91
23  mtnin     10^-116  98     10^-177  97     10^-87   83
24  ntbr      10^-25   96     10^-40   93     10^-57   98
25  nim       10^-38   89     10^-30   68     10^-16   68
26  os        10^-112  95     10^-141  100    10^-123  97
27  pas       10^-78   94     10^-92   98     10^-112  99
28  pnb       10^-2    39     10^-12   55     10^-12   56
29  ss        10^-129  100    10^-177  99     10^-135  96
30  swff      10^-9    68     10^-25   62     10^-16   60
31  tucf      10^-8    85     10^-48   92     10^-22   86
32  ufci      10^-13   83     10^-33   94     10^-7    65

Chapter 6

Proposed Saliency Estimation Algorithms

The three compressed-domain features identified in Chapter 5 as visual correlates of

fixations in video suggest that a simple saliency estimate may be obtained without fully

reconstructing the video. In this chapter, we present two new visual saliency models for

compressed video that have higher accuracy in predicting fixations compared to state-of-

the-art models, even the pixel-domain ones. One model, called MVE+SRN, is built upon

MVE and SRN, as its name suggests, while another model, called OBDL-MRF, uses only

the OBDL feature.

6.1 MVE+SRN Saliency Estimation Model

Fig. 6.1 shows the block diagram of the proposed MVE+SRN algorithm to estimate

visual saliency. For each inter-coded frame, motion vectors (MVs), block coding modes

(BCMs) and transformed residuals (DCT) are entropy-decoded from the video bitstream.

MVs and BCMs are used to construct the Motion Vector Entropy (MVE) map, illustrated

in the left branch in Fig. 6.1 for a frame from sequence Stefan. Transformed residuals are

used to construct the Smoothed Residual Norm (SRN) map, shown in the right branch

in Fig. 6.1. Operations performed by various processing blocks were described when the

corresponding features were introduced in Sections 5.1.1 and 5.1.2. The final saliency map

is obtained by fusing the two feature maps.

A number of feature fusion methods have been investigated in the context of saliency

estimation [66, 73]. The appropriate fusion method will depend on whether the features in

question are independent, and whether their mutual action reinforces or diminishes saliency.

Figure 6.1: Block diagram of the proposed MVE+SRN saliency estimation algorithm. [The H.264/AVC bitstream is split into MV+BCM and DCT data. Left branch: block processing, histogram construction, entropy, MVE map. Right branch: residual norm, filtering, resizing, SRN map. The two maps are fused into the saliency map, shown alongside the original frame superimposed by fixations.]

In our case, we note that MVE and SRN are somewhat independent, in the sense that one

could imagine a region in the scene with high MVE and low SRN, and vice versa. Also,

their combined action is likely to increase saliency – when both MVE and SRN are large,

the region is not only likely to contain moving objects (large MVE), but also contains parts

that are surprising and not easily predictable from previous frames (large SRN). Hence, our

fusion involves both additive and multiplicative combination of MVE and SRN maps,

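\[
S = \mathcal{N}\left(S_{\mathrm{MVE}} + S_{\mathrm{SRN}} + S_{\mathrm{MVE}} \odot S_{\mathrm{SRN}}\right), \tag{6.1}
\]

where the symbol $\odot$ denotes pointwise multiplication and $\mathcal{N}(\cdot)$ indicates normalization to the range [0, 1].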
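The fusion rule in (6.1) can be sketched in a few lines. This is a minimal illustration, assuming the MVE and SRN maps are equally sized 2-D lists and reading the normalization operator as min-max scaling; that reading, the function names, and the all-zero output for a constant map are our assumptions.

```python
def normalize(m):
    """Min-max scale a 2-D list to [0, 1] (our reading of the normalization operator)."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant map: returning zeros is our convention.
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def fuse(mve, srn):
    """Pointwise S_MVE + S_SRN + S_MVE * S_SRN, followed by normalization."""
    combined = [[a + b + a * b for a, b in zip(row_m, row_s)]
                for row_m, row_s in zip(mve, srn)]
    return normalize(combined)
```

The additive terms let either feature raise saliency on its own, while the product term gives an extra boost where both features are large at the same location.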

6.2 OBDL-MRF Saliency Estimation Model

The next proposed method measures visual saliency based on a Markov Random Field

(MRF) model of Operational Block Description Length (OBDL) feature responses.

6.2.1 MRF Model

While video compression algorithms are very sophisticated estimators of local infor-

mation content, they only produce local information estimates, since all the processing is

spatially and temporally localized to the macroblock (MB) unit. On the other hand, saliency

has both a local and a global component. For example, many saliency models implement

inhibition of return mechanisms [73], which suppress the saliency of image locations in the

neighborhood of a saliency peak. To account for these effects, we rely on a MRF model [156].

More specifically, the saliency detection problem is formulated as one of inferring the

maximum a posteriori (MAP) solution of a Spatio-Temporal Markov Random Field (ST-

MRF) model. This is defined with respect to a binary classification problem, where salient

blocks of 16 × 16 pixels belong to class 1 and non-salient blocks to class 0. The goal is to determine the class labels $\omega^t \in \{0, 1\}$ of the blocks of frame $t$, given the labels $\omega^{1\cdots t-1}$ of the previous frames, and all previously observed compressed information $o^{1\cdots t}$. The optimal label assignment $\omega^{t}_{*}$ is that which maximizes the posterior probability $P(\omega^t|\omega^{1\cdots t-1}, o^{1\cdots t})$. By application of Bayes rule this can be written as

$$\begin{aligned} P(\omega^t|\omega^{1\cdots t-1}, o^{1\cdots t}) &\propto P(\omega^{1\cdots t-1}|\omega^t, o^{1\cdots t}) \cdot P(\omega^t|o^{1\cdots t}) \\ &\propto P(\omega^{1\cdots t-1}|\omega^t, o^{1\cdots t}) \cdot P(o^{1\cdots t}|\omega^t) \cdot P(\omega^t), \end{aligned} \tag{6.2}$$

where ∝ denotes equality up to a normalization constant. Considering the monotonicity of

the logarithm, the MAP solution for the saliency labels ωt is then given by

$$\omega^{t}_{*} = \arg\min_{\psi\in\Omega^t} \left\{ -\log P(\omega^{1\cdots t-1}|\psi, o^{1\cdots t}) - \log P(o^{1\cdots t}|\psi) - \log P(\psi) \right\}, \tag{6.3}$$

where Ωt denotes the set of all possible labeling configurations for frame t. From the

Hammersley-Clifford theorem [19], the probabilities in (6.3) can be expressed as Gibbs

distributions $P(x) = \frac{1}{Z}\exp\left\{-\frac{E(x)}{C}\right\}$, where $E(x)$ is an energy function, $C$ a constant, sometimes

referred to as “temperature,” and Z a partition function. This enables the reformulation of

the MAP estimation problem as

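$$\omega^{t}_{*} = \arg\min_{\psi\in\Omega^t} \left\{ \frac{1}{C_t}E(\psi;\omega^{1\cdots t-1}, o^{1\cdots t}) + \frac{1}{C_o}E(\psi; o^{1\cdots t}) + \frac{1}{C_c}E(\psi) \right\}. \tag{6.4}$$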

The components E(ψ;ω1···t−1, o1···t), E(ψ; o1···t), and E(ψ) of the energy function, re-

spectively, measure the degree of temporal consistency of the saliency labels, the coherence

between labels and feature observations, and the spatial compactness of the label field. A

more precise definition of these three components is given in the following sections. Finally,

the minimization problem (6.4) is solved by the method of Iterated Conditional Modes

(ICM) [20] (cf. Section 6.2.5).

6.2.2 Temporal Consistency

Given a block at image location n = (x, y) of frame t, the spatio-temporal neighborhood

N(n) is defined as the set of blocks m = (x′, y′, t′) such that |x− x′| ≤ 1, |y − y′| ≤ 1 and

t − L < t′ < t for some L. The temporal consistency of the label field is measured locally,

using

$$E(\psi;\omega^{1\cdots t-1}, o^{1\cdots t}) = \sum_{n} E_t(n), \tag{6.5}$$

where $E_t(n)$ is a measure of inconsistency within $N(n)$, which penalizes temporally inconsistent label assignments, i.e., $\omega^t(x, y) \neq \omega^{t'}(x', y')$.

The saliency label $\omega(m)$ of block $m$ is assumed to be Bernoulli distributed with parameter proportional to the strength of features $o(m)$, i.e., $P(\omega(m)) = o(m)^{\omega(m)} \cdot (1 - o(m))^{1-\omega(m)}$. It follows that the probability $b(n,m)$ that block $m$ will bind with block $n$ (i.e., have label $\psi(n)$) is

$$b(n,m) = o(m)^{\psi(n)} \cdot (1 - o(m))^{1-\psi(n)}. \tag{6.6}$$

The consistency measure weights this probability by a similarity function, based on a Gaus-

sian function of the distance between n and m,

$$d(n,m) \propto \exp\left(-\frac{\|m-n\|_2^{s}}{2\sigma_s^2}\right) \cdot \exp\left(-\frac{\|m-n\|_2^{t}}{2\sigma_t^2}\right), \tag{6.7}$$

where $\|\cdot\|_2^{s}$ and $\|\cdot\|_2^{t}$ are the Euclidean distances along the spatial and temporal dimension, respectively, and $\sigma_s^2$, $\sigma_t^2$ are two normalization parameters. The expected consistency between

the two locations is then

$$c(n,m) = \frac{b(n,m) \cdot d(n,m)}{\sum_{p\in N(n)} \left(b(n,p) \cdot d(n,p)\right)}. \tag{6.8}$$

This determines a prior expectation for the consistency of the labels, based on the observed

features o(m). The energy function then penalizes inconsistent labelings, proportionally to

this prior expectation of consistency

$$E_t(n) = \sum_{m\in N(n)} c(n,m) \cdot (1 - \omega(m))^{\psi(n)} \cdot \omega(m)^{1-\psi(n)}. \tag{6.9}$$

Note that $E_t(n)$ ranges from 0 to 1, taking the value 0 when all neighboring blocks $m \in N(n)$ have the same label as block $n$, and the value 1 when all neighboring blocks have a label different from $\psi(n)$.
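As an illustration of (6.6)-(6.9), the following sketch evaluates the temporal energy of a single block from the features, labels, and distances of its spatio-temporal neighbors. The function name, the default values of the normalization parameters, and the handling of the degenerate case in which all weights vanish are assumptions made for this example only.

```python
import numpy as np

def temporal_energy(psi_n, o_m, omega_m, dist_s, dist_t,
                    sigma_s2=1.0, sigma_t2=1.0):
    """Temporal inconsistency E_t(n) of candidate label psi_n, per (6.6)-(6.9).

    o_m, omega_m   : features and labels of the neighbors m in N(n)
    dist_s, dist_t : spatial and temporal distances between n and each m
    """
    o_m = np.asarray(o_m, dtype=float)
    omega_m = np.asarray(omega_m, dtype=int)
    # (6.6): probability that neighbor m binds with the candidate label
    b = o_m if psi_n == 1 else 1.0 - o_m
    # (6.7): similarity weight decaying with spatial and temporal distance
    d = (np.exp(-np.asarray(dist_s, dtype=float) / (2.0 * sigma_s2)) *
         np.exp(-np.asarray(dist_t, dtype=float) / (2.0 * sigma_t2)))
    w = b * d
    total = w.sum()
    if total == 0:
        return 0.0  # assumption: no penalty when (6.8) is undefined
    c = w / total  # (6.8): expected consistency, sums to 1 over N(n)
    # (6.9): the exponent pattern selects neighbors whose label differs
    mismatch = (omega_m != psi_n).astype(float)
    return float(np.sum(c * mismatch))
```

Because the weights c(n, m) sum to one, the returned value is the weighted fraction of neighbors that disagree with the candidate label, which is why it stays within [0, 1].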

6.2.3 Observation Coherence

The incoherence between the observation and label fields at time t is measured with

an energy function E(ψ; o1···t). While this supports the dependence of ωt on all prior

observations (o1···t−1), we assume that the current labels are dependent only on the current

observations (ot). Incoherence is then measured by the energy function

$$E(\psi; o^{1\cdots t}) = \sum_{n} \left(\inf_{m} o(m)\right)^{1-\psi(n)} \cdot \left(1 - \sup_{m} o(m)\right)^{\psi(n)}, \tag{6.10}$$

where infimum inf and supremum sup are defined over m = (x′, y′) such that |x− x′| ≤ 1,

|y − y′| ≤ 1. This is again in [0, 1] and penalizes the labeling of block n as non-salient, i.e.,

ψ(n) = 0 when the infimum of feature value infm o(m) is large, or as salient, i.e., ψ(n) = 1,

when the supremum of feature value supm o(m) is small.
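A minimal sketch of (6.10) on a 2-D grid of blocks is given below. It assumes that feature values lie in [0, 1] and that the 3 × 3 neighborhood is simply clipped at the frame borders, a detail the text leaves open; the function name is hypothetical.

```python
import numpy as np

def observation_energy(psi, o):
    """Observation incoherence (6.10) summed over a 2-D grid of blocks.

    psi : binary candidate labels, shape (rows, cols)
    o   : feature values in [0, 1], same shape
    """
    psi = np.asarray(psi, dtype=int)
    o = np.asarray(o, dtype=float)
    rows, cols = o.shape
    energy = 0.0
    for y in range(rows):
        for x in range(cols):
            # 3x3 neighborhood, clipped at the borders (assumption)
            patch = o[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            if psi[y, x] == 1:
                energy += 1.0 - patch.max()  # salient label, weak support
            else:
                energy += patch.min()        # non-salient label, strong features
    return energy
```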


6.2.4 Compactness

In general, the probability of a block being labeled salient should increase if many of its

neighbors are salient. The last energy component in (6.4) encourages this type of behavior.

It is defined as

$$E(\psi) = \sum_{n} \Phi(n)^{1-\psi(n)} \cdot (1 - \Phi(n))^{\psi(n)}, \tag{6.11}$$

where Φ(n) is a measure of saliency in the neighborhood of n. This is defined as

$$\Phi(n) = \alpha \sum_{m\in N^{+}(n)} \psi(m) + \beta \sum_{m\in N^{\times}(n)} \psi(m), \tag{6.12}$$

where $N^{+}(n)$ and $N^{\times}(n)$ are, respectively, the first-order (North, South, East, and West) and the second-order (North-East, North-West, South-East, and South-West) neighborhoods of block $n$. In our experiments, we set $\alpha = \frac{1}{6}$ and $\beta = \frac{1}{12}$, to give higher weight to first-order neighbors.
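The compactness term (6.11)-(6.12) can be sketched as follows. Blocks outside the frame are treated as non-salient through zero padding, which is an assumption of this example; the function names are hypothetical. With α = 1/6 and β = 1/12, Φ(n) reaches 1 exactly when all eight neighbors are salient.

```python
import numpy as np

ALPHA, BETA = 1.0 / 6.0, 1.0 / 12.0  # weights reported in the text

def neighborhood_saliency(psi):
    """Phi(n) of (6.12) for every block of a 2-D binary label field."""
    p = np.pad(np.asarray(psi, dtype=float), 1)  # zero padding (assumption)
    first = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]  # N, S, W, E
    second = p[:-2, :-2] + p[:-2, 2:] + p[2:, :-2] + p[2:, 2:]       # diagonals
    return ALPHA * first + BETA * second

def compactness_energy(psi):
    """E(psi) of (6.11): isolated salient blocks and holes are penalized."""
    psi = np.asarray(psi, dtype=int)
    phi = neighborhood_saliency(psi)
    return float(np.where(psi == 1, 1.0 - phi, phi).sum())
```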

6.2.5 Optimization

The solution of (6.4) can be found with many numerical procedures. Two popular

methods are Stochastic Relaxation (SR) [49] and ICM [20]. SR has been reported to have

some advantage in accuracy over ICM, but at a higher computational cost [149]. In this

work, we adopt ICM, mainly due to its simplicity. The label of each block is initialized

according to the corresponding feature value, o(n), i.e. the block is labeled salient if o(n) >

0.5 and non-salient otherwise. Each block is then relabeled with the label (0 or 1) that

produces the largest reduction in the energy function. This relabeling is iterated until no

further energy reduction is possible. We limit the iterations to eight in our experiments. It

is worth mentioning that ICM is prone to getting trapped in local minima and the results

are dependent on the initial labeling.
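The ICM procedure described above can be sketched as a simple sweep over the blocks. The local energy used here, `toy_energy`, is a hypothetical stand-in (a data term plus a penalty for disagreeing with the four nearest neighbors) and not the three-term energy of (6.4); only the initialization at 0.5, the greedy relabeling, and the limit of eight iterations follow the text.

```python
import numpy as np

def icm(o, local_energy, max_iters=8):
    """Iterated Conditional Modes over a 2-D grid of block features o."""
    o = np.asarray(o, dtype=float)
    psi = (o > 0.5).astype(int)  # initial labels from the feature values
    for _ in range(max_iters):
        changed = False
        for y in range(psi.shape[0]):
            for x in range(psi.shape[1]):
                costs = [local_energy(psi, o, y, x, label) for label in (0, 1)]
                best = int(np.argmin(costs))
                if costs[best] < costs[psi[y, x]]:  # relabel only if energy drops
                    psi[y, x] = best
                    changed = True
        if not changed:  # no further reduction is possible
            break
    return psi

def toy_energy(psi, o, y, x, label):
    """Hypothetical local energy used for illustration only."""
    data = 1.0 - o[y, x] if label == 1 else o[y, x]
    rows, cols = psi.shape
    nbrs = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    disagree = sum(int(psi[j, i] != label) for j, i in nbrs
                   if 0 <= j < rows and 0 <= i < cols)
    return data + 0.5 * disagree
```

For example, with a 3 × 3 feature map equal to 0.9 everywhere except 0.4 at the center, the center starts as non-salient and is flipped to salient in the first sweep because all of its neighbors are salient.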

6.2.6 Final Saliency Map

The procedure above produces the most probable, a posteriori, map of salient block

labels. To emphasize the locations with higher probability of attracting attention, the OBDL

of a block declared salient (non-salient) by the MRF is increased (decreased) according to

the OBDLs in its neighborhood. The process is formulated as

$$S(n) = \left(\sup_{m} o(m) \cdot d(m,n)\right)^{\psi(n)} \cdot \left(1 - \sup_{m} (1 - o(m)) \cdot d(m,n)\right)^{1-\psi(n)}, \tag{6.13}$$

where the suprema are taken over the blocks $m = (x', y', t')$ such that $|x - x'| \le 1$, $|y - y'| \le 1$ and $t - L < t' \le t$, and $d(m,n)$ is as in (6.7). In this way, a block $n$ labeled as salient

by the MRF inference is assigned a saliency equal to the largest feature value within its

neighborhood, weighted by its distance from n. On the other hand, for a block n declared

as non-salient, this operation is applied to the complement of the saliency values within N(n).

The complement of this value is then assigned as the saliency value of n.

6.3 Experiments

This section presents experimental evaluation of the proposed saliency estimation algorithms and their comparison with several state-of-the-art saliency models.

6.3.1 Experimental setup

The proposed algorithms were compared with a number of state-of-the-art algorithms for

saliency estimation in video. These methods are listed in Table 6.1. For each algorithm, the

target domain - pixel (pxl) or compressed (cmp) - and the implementation details are also

indicated (cf. Section 4.2.1 for more implementation details of cmp-domain algorithms.)

Encoding was done using the FFMPEG library [4] (version 2.3) with QP ∈ {3, 6, ..., 51} in

the baseline profile, with default Group-of-Pictures (GOP) structure. For each MB, there

exists up to four MVs having 1/4-pixel accuracy with no range restriction. We avoided

the use of I- and B-frames since many of the compressed-domain methods did not specify

how to handle these frames. In principle, there are several possibilities for B-frames, such

as flipping forward MVs around the frame to create backward MVs (as in P-frames), or

forming two saliency maps, one from forward MVs and one from backward MVs, and then

averaging them. However, to avoid speculation and stay true to the algorithms the way

they were presented, we used only P-frames in the evaluation.

6.3.2 Results

A set of experiments was performed to compare the proposed algorithms to state-of-the-art saliency algorithms. These experiments used quantization parameter QP = 30, i.e., reasonably good video quality, with an average peak signal-to-noise ratio (PSNR) across encoded sequences of 35.8 dB. Fig. 6.2 illustrates the differences between the saliency predictions


Table 6.1: Saliency estimation algorithms used in our evaluation. D: target domain (cmp: compressed; pxl: pixel); I: Implementation (M: Matlab; P: Matlab p-code; C: C/C++; E: Executable).

#   Algorithm  First Author                            Year  D    I
1   MaxNorm    Itti [73] (ilab.usc.edu/toolkit)        1998  pxl  C
2   Fancy1     Itti [66] (ilab.usc.edu/toolkit)        2004  pxl  C
3   SURP       Itti [69] (ilab.usc.edu/toolkit)        2006  pxl  C
4   GBVS       Harel [59] (DIOFM channel)              2007  pxl  M
5   STSD       Seo [141]                               2009  pxl  M
6   SORM       Kim [88]                                2011  pxl  E
7   AWS        Diaz [48]                               2012  pxl  P
8   PMES       Ma [108]                                2001  cmp  M
9   MAM        Ma [109]                                2002  cmp  M
10  PIM-ZEN    Agarwal [11]                            2003  cmp  M
11  PIM-MCS    Sinha [143]                             2004  cmp  M
12  MCSDM      Liu [104]                               2009  cmp  M
13  MSM-SM     Muthuswamy [126]                        2013  cmp  M
14  PNSP-CS    Fang [43]                               2014  cmp  M

of various algorithms for a few sample video frames. In the figure, the motion vector field

(MVF), Intra-Observer (IO) and the raw OBDL are also shown for the corresponding frame.

Figs. 6.3 and 6.4 show the average AUC′ and NSS′ score, respectively, of various al-

gorithms across the test sequences. Not surprisingly, on average, pixel-domain methods

perform better than compressed-domain ones. However, our proposed compressed-domain

methods, MVE+SRN and OBDL-MRF, top all other methods, including pixel-domain ones,

on both metrics. Based on these results, while MVE+SRN is the best saliency predictor,

OBDL-MRF achieves very close prediction accuracy. Also note that the SRN feature, by itself, has prediction capability close to that of MVE+SRN. Including the MVE feature in saliency prediction helps on sequences with a considerable amount of motion, such as harry-potter-6-trailer, but in many cases, SRN is sufficient. Another interesting point is that the raw OBDL feature, by itself, achieves scores comparable to the state of the art, while incorporating

spatial (OBDL-S) and temporal (OBDL-T) filtering as well as MRF inference results in

improved predictions, with best results produced by the full-blown OBDL-MRF.

The performance of the various saliency models was also evaluated using a multiple

comparison test [63] similar to what we did in Section 4.2.2. For each sequence, the average

score of a given model across all frames is computed, along with the 95% confidence interval

for the average score. The model with the highest average score is a top performer on that

Figure 6.2: Sample saliency maps obtained by various algorithms. (Sample frames: Stefan, Crew, ami-ib4010-left, and advert-bbc4-library. Panels for each frame: original frame, MVF, IO, OBDL, OBDL-MRF, MVE+SRN, MaxNorm, Fancy1, SURP, GBVS, STSD, SORM, AWS, PMES, MAM, PIM-ZEN, PIM-MCS, MCSDM, MSM-SM, and PNSP-CS.)

Figure 6.3: Accuracy of various saliency algorithms over the two datasets according to AUC′. Each 2-D color map shows the average AUC′ score of each algorithm on each sequence. The average AUC′ performance across sequences/algorithms is shown in the sidebar/topbar. Error bars represent standard error of the mean. (Algorithms in the order listed in the figure: MVE+SRN, OBDL-MRF, SRN, OBDL-T, MaxNorm, OBDL-S, MVE, Fancy1, OBDL, SURP, STSD, GBVS, AWS, PIM-ZEN, SORM, PIM-MCS, PMES, PNSP-CS, MAM, MCSDM, MSM-SM.)

Figure 6.4: Accuracy of various saliency algorithms over the two datasets according to NSS′. (Algorithms in the order listed in the figure: MVE+SRN, SRN, OBDL-MRF, OBDL-T, OBDL-S, MVE, OBDL, Fancy1, MaxNorm, PIM-ZEN, PMES, STSD, PIM-MCS, MSM-SM, SORM, GBVS, SURP, AWS, MAM, PNSP-CS, MCSDM.)

Figure 6.5: The number of appearances among top performers, using AUC′ and NSS′ evaluation metrics.

sequence; however, all other models whose 95% confidence interval overlaps that of the

highest-scoring model are also considered top performers on that sequence. The number of

appearances among top performers for each model is shown in Fig. 6.5. Again, pixel-domain

methods tend to do better than compressed-domain ones, but our methods top both groups.

As before, MVE+SRN comes in first in terms of both AUC′ and NSS′, while OBDL-MRF

is second.
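The selection of top performers on one sequence can be sketched as follows. This is a simplified illustration that assumes a normal-approximation 95% confidence interval for the mean (z = 1.96); the multiple comparison test of [63] may compute its intervals differently, and the function name and input layout are hypothetical.

```python
import numpy as np

def top_performers(scores, z=1.96):
    """Return the models whose confidence interval overlaps that of the best one.

    scores : dict mapping model name -> per-frame scores on one sequence
    z      : normal quantile for the interval (1.96 for 95%, an assumption)
    """
    stats = {}
    for name, vals in scores.items():
        vals = np.asarray(vals, dtype=float)
        mean = vals.mean()
        half = z * vals.std(ddof=1) / np.sqrt(len(vals))  # CI half-width
        stats[name] = (mean, mean - half, mean + half)
    best = max(stats, key=lambda k: stats[k][0])
    best_lo = stats[best][1]
    # a lower-scoring model overlaps the best one iff its upper bound
    # reaches the lower bound of the best model's interval
    return sorted(name for name, (_, lo, hi) in stats.items() if hi >= best_lo)
```

Counting, for each model, the number of sequences on which it appears in the returned list gives a tally of the kind reported in Fig. 6.5.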

It is worth mentioning that due to the very weak performance of AWS on sequences

where most other methods scored well in Figs. 6.3 and 6.4 (such as advert-bbc4-library,

ami-ib4010-left and Mobile), the average AUC′ and NSS′ scores of AWS are not particularly

high - the average was dragged down by the low scores on these few sequences. However,

in the multiple comparison test in Fig. 6.5, AWS shows strong performance, because it is

among top performers on many other sequences. These results corroborate the results of

the comparison made in [22] (which was performed on DIEM and CRCNS [69],[70]) where

AWS was among the highest-scoring algorithms under study.


We also compare the distribution of saliency values at the fixation locations against the

distribution of saliency values at random points from non-fixation locations in Fig. 6.6 in

terms of JSD. If these two distributions overlap substantially, then the saliency model pre-

dicts fixation points no better than a random guess. On the other hand, as one distribution

diverges from the other, the saliency model is better able to predict fixation points. As

seen in Fig. 6.6, MVE+SRN and OBDL-MRF produce, in that order, the highest divergence between the two distributions among the compared models.
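For reference, the Jensen-Shannon divergence between two histograms of saliency values can be computed as in the sketch below. The base-2 logarithm (which bounds the result by 1) and the function name are assumptions of this example; the inputs are bin counts over the same bins for fixation and non-fixation locations.

```python
import numpy as np

def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two histograms."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p = p / p.sum()  # turn counts into probability distributions
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical histograms give a divergence of 0, while histograms with no common occupied bin give 1, so larger values indicate a model that separates fixated from non-fixated locations more clearly.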

As discussed in Section 4.2.2, the quality of the encoded video, measured in terms of

PSNR, drops as QP increases due to the larger amount of compression. Fig. 6.7 shows

how the average AUC′ and NSS′ scores change as a function of the average PSNR (across

sequences), by varying QP ∈ {3, 6, ..., 51}. Note that pixel-domain methods are also sensitive

to compression - they do not use compressed-domain information, but they are impacted

by the accuracy of decoded pixel values. Our MVE+SRN and OBDL-MRF achieve their

best performance at PSNR around 35 dB. This is excellent news for our proposed models

because this range of PSNR is thought to be very appropriate in terms of balance between

objective quality and compression efficiency. Overall, MVE+SRN achieves the highest

accuracy followed by OBDL-MRF across most of the compression range.

Based on the results in Fig. 6.7, it appears that around the PSNR value of 35 dB,

compressed-domain features MVE, SRN and OBDL are all sufficiently informative and

sufficiently accurate. As the amount of compression reduces (i.e., quality increases) SRN

becomes less informative, since small quantization step-size makes each residual have large

ℓ0 norm. At the same time, MVs may become too noisy, since rate-distortion optimization

does not impose sufficient constraints for smoothness. On the other hand, as the amount

of compression increases (i.e., quality reduces), both MVE and SRN become less accurate.

Both extremes are detrimental to saliency prediction. Since the OBDL feature incorporates

the number of bits needed for both transformed prediction residuals and MVs, it is not

surprising that the accuracy of OBDL-MRF degrades substantially at the extremes of the

compression range. While at low rates there are too few bits to enable a precise measurement

of saliency, at high rates there are too many bits available, and all blocks become salient

according to this feature.

To assess the complexity of various algorithms, processing time was measured on an

Intel (R) Core (TM) i7 CPU at 3.40 GHz and 16 GB RAM running 64-bit Windows 8.1


Table 6.2: Average processing time (ms) per frame.

Algorithm  Time (ms)
AWS        1492
GBVS       923
PNSP-CS    895
MAM        778
PMES       579
SURP       323
STSD       227
Fancy1     98
SORM       92
MaxNorm    89
PIM-ZEN    43
OBDL-MRF   39
MVE+SRN    30
MCSDM      15
PIM-MCS    10
MSM-SM     8

implemented as indicated in Table 6.1. The results are shown in Table 6.2. As expected,

compressed-domain models tend to require far less processing time than their pixel-domain

counterparts. The proposed methods, MVE+SRN and OBDL-MRF, implemented in MAT-

LAB, required, respectively, an average of 30 ms and 39 ms per CIF video frame. While

this is slower than some of the other compressed-domain algorithms, it enables the com-

putation of saliency within or close to the real-time requirements, even in MATLAB. Note

that the decoding time is not included in these results. Recall that OBDL-MRF only re-

quires entropy decoding to get data necessary to process, whereas other compressed-domain

algorithms, including MVE+SRN, require additional decoding effort.

6.4 Discussion and conclusions

Using the compressed-domain features in Chapter 5, we constructed two simple visual

saliency estimation methods and compared them with fourteen other saliency prediction

methods for video. Some of these methods also made use of compressed-domain features,

while others operated in the pixel domain. Comparison was made using a number of es-

tablished metrics. The results showed that both proposed methods outperformed all other

methods, including pixel-domain ones.

A natural question to ask is - how come a compressed-domain method, which seems

to be restricted to a relatively constrained set of data, can outperform pixel-domain meth-

ods in terms of saliency prediction? To answer this question, one needs to realize that a

compressed-domain method is not a stand-alone entity. Its front end is the video encoder,

an extremely sophisticated algorithm whose goal is to provide the most compact repre-

sentation of video. Surely, such compact representation contains useful information about

various aspects of the video signal, including saliency.

Figure 6.6: The frequencies of saliency values estimated by different algorithms at the fixation locations (narrow blue bars) and random points from non-fixation locations (wide green bars) vs the number of human fixations. The JSD between the two distributions corresponding to each algorithm indicates how much the distributions diverge from each other (histograms are sorted from left to right and top to bottom according to the JSD metric). JSD values (× 10^−2): MVE+SRN 7.92; OBDL-MRF 6.3; Fancy1 5.79; MaxNorm 5.5; AWS 4.14; SURP 4.02; PIM-ZEN 3.19; STSD 3.03; PIM-MCS 2.98; PMES 2.61; MSM-SM 2.43; GBVS 2.27; SORM 2.26; PNSP-CS 1.49; MAM 1.42; MCSDM 1.38.

Figure 6.7: The relationship between the average PSNR and the models’ accuracy. (Two panels: average AUC′ and average NSS′ versus average PSNR in dB, for MVE+SRN, OBDL-MRF, MaxNorm, Fancy1, SURP, GBVS, STSD, AWS, PMES, MAM, PIM-ZEN, PIM-MCS, MCSDM, MSM-SM, and PNSP-CS.)

Chapter 7

Compressed-Domain Tracking

This chapter presents our approach for region tracking in H.264/AVC-compressed video,

which could be used to enhance salient region analysis and, more generally, scene interpre-

tation, in the compressed domain. It is well known that motion has a strong impact on

saliency in dynamic scenes [120, 111, 68]. Hence, a number of saliency estimation meth-

ods have been proposed that rely on temporal changes [108, 109, 161, 104], as discussed

in Chapter 4. These methods estimate saliency in each frame using motion information,

extracted from the compressed bitstream, to decrease complexity. In video, the viewer’s at-

tention usually remains on a salient region for several consecutive frames. For example, [62]

showed that fixation duration ranges from 100 ms to 400 ms, mostly around 200 ms, which

equals 5 frames for a 25 fps video, or 6 frames for a 30 fps video. Fixation duration is

the period of time when the eyes fixate on a single location at the center of the gaze. Note

that fixation and attention are tightly inter-linked in normal viewing, although one is able

to change their attention independently of where the eyes are fixated [39]. Although the

relationship between visual saliency and tracking is an interesting one, in this chapter we

develop a general algorithm for region tracking that could be used independently of saliency

modeling. The algorithm was first presented in [81].

We focus on tracking a single moving region in the compressed domain using a Spatio-

Temporal Markov Random Field (ST-MRF) model. A ST-MRF model naturally integrates

the spatial and temporal aspects of the region’s motion. Built upon such a model, the

proposed method uses only the motion vectors (MVs) and block coding modes (BCMs)

from the compressed bitstream to perform tracking. First, the MVs are pre-processed

through intra-coded block motion approximation and global motion compensation (GMC).

Figure 7.1: The flowchart of our proposed moving object tracking. (Diagram blocks: H.264/AVC bitstream; preprocessing, comprising intra-coded block processing, GME, and GMC; initialization by automatic labeling; ST-MRF model; labeling.)

At each frame, the decision of whether a particular block belongs to the region being tracked

is made with the help of the ST-MRF model, which is updated from frame to frame in order

to follow the changes in the region’s motion.

The flowchart of our proposed moving region tracking is shown in Fig. 7.1. In each frame,

the proposed method first approximates MVs of intra-coded blocks, estimates global motion

(GM) parameters, and removes GM from the motion vector field (MVF). The estimated

GM parameters are used to initialize a rough position of the region in the current frame

by projecting its previous position into the current frame. Eventually, the procedure of

Iterated Conditional Modes (ICM) updates and refines the predicted position according to

spatial and temporal coherence under the maximum a posteriori (MAP) criterion, defined

by the ST-MRF model.

7.1 MRF-Based Tracking

A moving rigid object/region is generally characterized by spatial compactness (i.e., not

dispersed across different parts of the frame), relative similarity of motion within the region,

and a continuous motion trajectory. Our ST-MRF model is based on rigid object/region

motion characteristics. We treat moving region tracking as a problem of inferring the MAP

solution of the ST-MRF model. More specifically, we consider the frame to be divided into

small blocks (4× 4 in our experiments). Region blocks will be labeled 1, non-region blocks

0. We want to infer the block labels $\omega^t \in \{0, 1\}$ in frame $t$, given the labels $\omega^{t-1}$ in frame $t - 1$, and the observed motion information $o^t = (v^t, \kappa^t)$. Here, the motion information $o^t$ consists of the MVs from the compressed bitstream, denoted $v^t(n)$, and the BCM and partition size $\kappa^t(n)$, where $n = (x, y)$ indicates the position of the block within the frame. The criterion for choosing “the best” labeling $\omega^{t}_{*}$ is that it should maximize the posterior probability $P(\omega^t|\omega^{t-1}, o^t)$. Similar to (6.2) and (6.3), this problem is solved by a Bayesian framework, where we express the posterior probability in terms of the inter-frame likelihood $P(\omega^{t-1}|\omega^t, o^t)$, the intra-frame likelihood $P(o^t|\omega^t)$, and the a priori probability $P(\omega^t)$ as follows

$$\omega^{t}_{*} = \arg\max_{\psi\in\Omega} \left\{ P(\omega^{t-1}|\psi, o^t) \cdot P(o^t|\psi) \cdot P(\psi) \right\}, \tag{7.1}$$

where Ω denotes the set of all possible labeling configurations for frame t. The solution to

the maximization problem in (7.1) is the same as the solution to the following minimization

problem

$$\omega^{t}_{*} = \arg\min_{\psi\in\Omega} \left\{ -\log P(\omega^{t-1}|\psi, o^t) - \log P(o^t|\psi) - \log P(\psi) \right\}. \tag{7.2}$$

According to the Hammersley-Clifford theorem [19], the probabilities in (7.2) can be

expressed as Gibbs distributions of the form $\frac{1}{Z}\exp\left\{-\frac{E(x)}{C}\right\}$ for some energy function $E(x)$,

partition function Z and normalizing constant C. Hence, we write

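$$P(\omega^{t-1}|\psi, o^t) = \frac{1}{Z_\Gamma}\exp\left\{-\frac{1}{C_\Gamma}E(\psi;\omega^{t-1}, o^t)\right\}, \tag{7.3}$$

$$P(o^t|\psi) = \frac{1}{Z_\Lambda}\exp\left\{-\frac{1}{C_\Lambda}E(\psi; o^t)\right\}, \tag{7.4}$$

$$P(\psi) = \frac{1}{Z_\Phi}\exp\left\{-\frac{1}{C_\Phi}E(\psi)\right\}. \tag{7.5}$$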

In the above equations, the three energy functions E(ψ;ωt−1, ot), E(ψ; ot), and E(ψ) rep-

resent the degree of inconsistency in temporal continuity, spatial context coherence, and

compactness, respectively. The parameters CΓ, CΛ and CΦ are scaling constants. In our

model, each of the three energy functions E is expressed as the summation of the corre-

sponding block-wise energy terms ξ over the object blocks, so that the optimization problem

becomes

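$$\omega^{t}_{*} = \arg\min_{\psi\in\Omega} \left\{ \frac{1}{C_\Gamma}\sum_{n:\psi(n)=1}\xi(n;\omega^{t-1}, o^t) + \frac{1}{C_\Lambda}\sum_{n:\psi(n)=1}\xi(n; o^t) + \frac{1}{C_\Phi}\sum_{n:\psi(n)=1}\xi(n) \right\}. \tag{7.6}$$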

The first term measures the temporal discontinuity of labeling between consecutive frames

- the larger the difference between the labeling ωt−1 and the backwards-projected candidate

labeling ψ, the larger this term will be. The second term represents spatial incoherence

among region’s MVs - the larger the difference among the MVs within the labeled region


under the candidate labeling ψ, the larger this term will be. The compactness of region’s

shape is accounted for in the last term. All three terms will be defined more precisely in the

following sections. Finally, the minimization problem (7.6) will be solved by the method of

ICM [20].

7.1.1 Temporal Continuity

Temporal continuity is measured by the overlap between the labeling of the previous

frame, ωt−1, and the backwards-projected candidate labeling ψ for the current frame. Con-

sider a block n in the current frame that is assigned to the object by the candidate labeling

ψ, i.e. ψ(n) = 1. The block is projected backwards into the previous frame along its MV,

vt(n) = (vx(n), vy(n)), and the degree of overlap γ(n + vt(n), ωt−1) is computed as the

fraction of pixels within the projected block in the previous frame that carry the label 1

under the labeling ωt−1. The energy term for block n is taken to be

$$\xi(n;\omega^{t-1}, o^t) = -\gamma\left(n + v^t(n), \omega^{t-1}\right). \tag{7.7}$$
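The overlap γ in (7.7) can be sketched as follows. This example assumes that the previous labeling is available as a pixel-level binary mask, that the MV has been rounded to integer pixels (H.264/AVC MVs have quarter-pixel accuracy), and that pixels falling outside the frame count as non-region; the function name and argument layout are hypothetical.

```python
import numpy as np

def overlap_fraction(prev_labels, n_xy, mv, block=4):
    """Fraction of region pixels inside the block projected into frame t-1.

    prev_labels : 2-D binary mask of frame t-1 (1 = region), indexed [y, x]
    n_xy        : (x, y) of the top-left pixel of block n in frame t
    mv          : (vx, vy) motion vector in integer pixels (assumption)
    """
    h, w = prev_labels.shape
    x0 = n_xy[0] + mv[0]
    y0 = n_xy[1] + mv[1]
    # clip the projected block to the frame; the clipped-away part counts as 0
    x_lo, x_hi = np.clip([x0, x0 + block], 0, w)
    y_lo, y_hi = np.clip([y0, y0 + block], 0, h)
    inside = prev_labels[y_lo:y_hi, x_lo:x_hi]
    return float(inside.sum()) / (block * block)

def temporal_continuity_energy(prev_labels, n_xy, mv, block=4):
    """Block-wise energy of (7.7): the negated overlap."""
    return -overlap_fraction(prev_labels, n_xy, mv, block)
```

A block whose projection lands entirely inside the previously tracked region receives the lowest energy, −1, and is therefore favored as a region block in the minimization of (7.6).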

Our experiments suggest that temporal continuity is a powerful cue for region tracking,

and by itself is able to provide relatively accurate tracking in many cases. It is also rela-

tively simple to compute. However, its performance may suffer due to noisy or inaccurate

MVs, especially near region boundaries. Hence, a robust tracking algorithm should not

rely exclusively on temporal continuity, and should incorporate some spatial properties of

the MVF. Two such concepts, corresponding to the second and third terms in (7.6), are

discussed next.

7.1.2 Context Coherence

One of the characteristics of rigid object/region motion in natural video is the relative

consistency (coherence) of the MVs belonging to the object. This is true even if the motion

is not pure translation, so long as the frame rate is not too low. Such motion coherence has

frequently been used in compressed-domain segmentation [27, 119, 28, 163, 103]. A popular

approach is to model the MVs within a region with an independent bivariate Gaussian

distribution whose parameters are estimated from the decoded MVs. Once the parameter

estimates are obtained, one can adjust the segmentation by testing how well the MVs fit

into the assumed model. A particular problem with parameter estimation in this context


is the presence of outliers - incorrect or noisy MVs that are often found in flat-texture

regions or near region boundaries. For small regions, even a few outliers can lead to large

estimation errors. Sample variance is especially sensitive to outliers [137]. To resolve the

problem we employ robust statistics methods [137], which tend to be more resistant to the

effect of outliers compared to classical statistics.

A number of robust statistics methods have been proposed to estimate the central tendency and dispersion of the data in the presence of outliers, e.g., Median, Trimmed Mean, Median Absolute Deviation, and Interquartile Range [137, 155, 136]. We have tested these methods in the

context of our problem, and finally settled on a Modified Trimmed Mean method, as it

gave the best results. The main difference between our proposed Modified Trimmed Mean

and the conventional Trimmed Mean [137] is that instead of truncating a fixed percentage

of the data set from one or both ends of the distribution, in our method the outliers are

adaptively recognized and truncated, as explained below.

For the purpose of MV coherence analysis, the region’s motion is represented by a

single representative MV v. This vector is computed based on the Polar Vector Median

(cf. Section 7.2.1) of MVs that are assigned to the region by the candidate labeling ψ, i.e.

ψ(n) = 1. At this stage, we are using preprocessed MVs, v′(n) (cf. Section 7.2). The

deviation of a MV of the block n from the region’s representative vector v is computed as

the Euclidean distance between v′(n) and v:

$$d(n) = \left\| v'(n) - v \right\|_2. \tag{7.8}$$

It is observed that the deviation d(n) for the blocks belonging to the object can be

modeled reasonably well by the rectified (non-negative) Gaussian distribution, except for

the outliers. Thus, we identify the outliers in the data by checking whether d(n) > τ , where

τ is computed as

$$\tau = \max\left\{ 2 \cdot \sigma_d,\ 1 \right\}, \tag{7.9}$$

and σd is the sample standard deviation of d(n) assuming the average value of zero. The

distribution of d(n) for the rotating ball object in frame #4 of the Mobile Calendar sequence

is shown in Fig. 7.2 (top). In this example, the candidate labeling ψ is initially predicted by

projecting the previous frame labeling, i.e. ψ3, via current GM parameters. Two MVs with

d(n) ≈ 24.7 and one with d(n) ≈ 29 are recognized as the outliers based on the threshold

defined in (7.9). The fitted rectified Gaussian distributions before and after outlier removal

Figure 7.2: Distribution of d(n) and d′(n) for the ball in frame #4 of the Mobile Calendar sequence. (Top: distribution of d(n) among object and non-object blocks, with the fitted Gaussian before and after outlier removal. Bottom: distribution of d′(n) among object and non-object blocks, with the object/non-object separator.)

are also shown. It can be seen that outliers significantly increase the sample variance and

that after their removal, the rectified Gaussian can be used as a reasonable model for d(n).

After removing the outliers, the sample standard deviation σd is recalculated. In certain

cases, when the region’s MVs are close to identical, the recalculated σd is close to zero. We

clip σd from below to 0.5 in order to avoid problems with subsequent computations. After

that, the MV deviation is normalized to the interval [−1, 1] as follows

$$d'(n) = \min\left\{ \frac{d(n)/\sigma_d - 2}{2},\; 1 \right\}. \tag{7.10}$$

The higher the value of the normalized deviation d′(n), the less similar is v′(n) to v, so the

less likely is block n to belong to the region being tracked, as far as MV similarity goes.

Fig. 7.2 (bottom) shows the distribution of d′(n).
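The outlier handling and normalization in (7.8) to (7.10) can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function name `normalized_deviations`, its argument names, and the choice to divide by N (rather than N − 1) when computing the zero-mean standard deviation are assumptions.

```python
import math


def normalized_deviations(mvs, rep, sigma_floor=0.5):
    """Return d'(n) for every MV in `mvs`, given the representative vector `rep`.

    mvs: list of (vx, vy) preprocessed MVs assigned to the region.
    rep: (vx, vy) representative vector of the region.
    """
    # (7.8): Euclidean distance of each MV from the representative vector.
    d = [math.hypot(vx - rep[0], vy - rep[1]) for vx, vy in mvs]

    def zero_mean_std(values):
        # Standard deviation under an assumed mean of zero (root mean square).
        # Dividing by N instead of N - 1 is an assumption of this sketch.
        if not values:
            return 0.0
        return math.sqrt(sum(v * v for v in values) / len(values))

    # (7.9): adaptive outlier threshold.
    tau = max(2.0 * zero_mean_std(d), 1.0)
    inliers = [v for v in d if v <= tau]

    # Recompute sigma without the outliers and clip it from below.
    sigma = max(zero_mean_std(inliers), sigma_floor)

    # (7.10): map the deviations to the interval [-1, 1].
    return [min((v / sigma - 2.0) / 2.0, 1.0) for v in d]
```

With nine MVs equal to the representative vector and one far-away MV, the nine coherent blocks map to −1 and the outlier saturates at 1.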

The effect of GM is taken into account by dividing d′(n) by a GM parameter ρ defined

as

ρ = 2− exp (−c1 · (ρA + ρB)) , (7.11)

where

$$\rho_A = c_2 \cdot \left(|m_1| + |m_4|\right)^{c_3}, \tag{7.12}$$

ρB = |1−m2|+ |m3|+ |m5|+ |1−m6|, (7.13)

where mi, i = 1, 2, ..., 6, are the six affine GM parameters (cf. Section 7.2.2) and c1, c2 and

c3 are constant values. The logic behind (7.11) is as follows: the larger the camera motion,

the closer the value of ρ is to 2; the smaller the camera motion, the closer its value is to

1. Parameters m1 and m4 represent translation (value 0 means no translation), m2 and

m6 represent zoom (value 1 means no zoom), and m3 and m5 represent rotation (value 0

means no rotation). Hence, as any component of the affine motion increases, the exponent

in (7.11) becomes more negative, and the value of ρ gets closer to 2, which is the highest

it can be. In the case of fast camera motion, MVs are less reliable, and the influence of

context coherence on tracking decisions should be reduced. Hence, the block-wise context

coherence energy term in (7.6) is computed by dividing the normalized MV deviation in

(7.10) by ρ, that is

ξ(n; ot) = d′(n)/ρ. (7.14)
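As a hedged illustration of (7.11) to (7.14), the following Python sketch computes ρ from the six affine parameters and scales a normalized deviation by it. The function names are hypothetical; the default constants are the values of c1, c2 and c3 listed in Table 7.1.

```python
import math


def gm_parameter(m, c1=1.0 / 128, c2=1.0, c3=2.0):
    """Compute rho from the affine GM parameters m = (m1, ..., m6)."""
    m1, m2, m3, m4, m5, m6 = m
    # (7.12): translation component.
    rho_a = c2 * (abs(m1) + abs(m4)) ** c3
    # (7.13): zoom and rotation components.
    rho_b = abs(1 - m2) + abs(m3) + abs(m5) + abs(1 - m6)
    # (7.11): rho is 1 for a static camera and approaches 2 for large GM.
    return 2.0 - math.exp(-c1 * (rho_a + rho_b))


def context_coherence(d_norm, rho):
    """(7.14): block-wise context coherence term."""
    return d_norm / rho
```

For the identity parameters (0, 1, 0, 0, 0, 1) the function returns exactly 1, so the normalized deviation is left unchanged; for a large translation ρ approaches 2 and the influence of the term is halved.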

7.1.3 Compactness

While there are certainly counterexamples to this observation, most rigid objects/regions

in natural video tend to have compact shape, meaning that the chance of a block belonging

to the region being tracked is increased if many of its neighbors are known to belong to

the same region. We take this observation into account through the last term in (7.6). We

employ the 8-adjacency neighborhood with different weights for the first-order and second-

order neighbors. The block-wise energy term for compactness is computed by the weighted

sum of labels in the neighborhood of the current block:

$$\xi(n) = -\alpha \cdot \sum_{p \in N_{+}(n)} \psi(p) \;-\; \beta \cdot \sum_{p \in N_{\times}(n)} \psi(p), \tag{7.15}$$

where N+(n) and N×(n) are, respectively, the first-order (North, South, East, and West)

and the second-order (North-East, North-West, South-East, and South-West) neighbor-

hoods of block n. Recall from (7.6) that inside of the sum, ψ(n) = 1. A neighboring block

with label 0 will not change the value of (7.15), while a neighboring block with label 1

will make (7.15) more negative. Hence, the higher the number of neighboring blocks that

belong to the region being tracked (label 1), the lower the value of the energy term in (7.15),

meaning that the more likely it is that block n also belongs to that region. We set α = 1/6

and β = 1/12 in our experiments to give higher weight to first-order neighbors.
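The compactness term in (7.15) can be sketched as below. This is only an illustration: the function name, the 2-D list representation of the labeling, and the treatment of neighbors that fall outside the frame (counted as label 0) are assumptions, and the weights are supplied by the caller.

```python
def compactness_energy(labels, row, col, alpha, beta):
    """Evaluate (7.15) for the block at (row, col).

    labels: 2-D list of 0/1 block labels (the candidate labeling).
    alpha, beta: weights of first-order and second-order neighbors.
    """
    first_order = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # N, S, W, E
    second_order = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # NW, NE, SW, SE

    def label_sum(offsets):
        total = 0
        for dr, dc in offsets:
            r, c = row + dr, col + dc
            # Neighbors outside the frame contribute nothing (assumption).
            if 0 <= r < len(labels) and 0 <= c < len(labels[0]):
                total += labels[r][c]
        return total

    return -alpha * label_sum(first_order) - beta * label_sum(second_order)
```

A block fully surrounded by region blocks receives the lowest (most negative) energy, −4α − 4β, while a block with no labeled neighbors receives zero.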

7.1.4 Optimization

Now that each term in (7.6) is defined, the optimization problem needs to be solved. As

discussed in Section 6.2.5, we use ICM to solve (7.6), mainly due to its simplicity. At the

beginning, the label of each block is initialized by projecting the previous frame labeling

ωt−1 into the current frame using the current GM parameters. After that, each block is

relabeled with the label (0 or 1) that leads to the largest reduction in the energy function.

This relabeling procedure is iterated until no further energy reduction is achieved. Usually,

six iterations are enough to reach a local minimum.
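The relabeling loop described above can be sketched generically as follows. Since the energy in (7.6) is defined elsewhere, it is represented here by a hypothetical callback `block_energy(labels, r, c, value)` that returns the energy contribution of block (r, c) if it took the given label with all other labels fixed. The cap `max_sweeps` is a safety limit added for this sketch.

```python
def icm(initial_labels, block_energy, max_sweeps=50):
    """Iterated Conditional Modes over a 2-D grid of binary labels.

    initial_labels: 2-D list of 0/1 labels (e.g., the projected labeling).
    block_energy:   hypothetical callback (labels, r, c, value) -> float.
    """
    labels = [row[:] for row in initial_labels]  # work on a copy
    for _ in range(max_sweeps):
        changed = False
        for r in range(len(labels)):
            for c in range(len(labels[r])):
                e0 = block_energy(labels, r, c, 0)
                e1 = block_energy(labels, r, c, 1)
                current = labels[r][c]
                # Pick the label with the lower energy; keep the current
                # label on ties so that the loop terminates.
                if e1 < e0:
                    best = 1
                elif e0 < e1:
                    best = 0
                else:
                    best = current
                if best != current:
                    labels[r][c] = best
                    changed = True
        if not changed:
            break  # no further energy reduction: local minimum reached
    return labels
```

Because each relabeling can only lower the energy, the procedure converges to a local minimum; the quality of that minimum depends on the initial labeling.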

7.2 Preprocessing

Our proposed tracking algorithm makes use of two types of information from the H.264/AVC-

compressed bitstream: BCM (partition) information and MVs. Texture data does not need

to be decoded in the proposed method. H.264/AVC defines four basic macroblock (MB)

modes [159]: 16× 16, 16× 8, 8× 16, and 8× 8, where the 8× 8 mode can be further split

into 8×4, 4×8, and 4×4 modes. Since the smallest coding mode (partition) in H.264/AVC

is 4× 4, in order to have a uniformly sampled MVF, we map all MVs to 4× 4 blocks. This

is straightforward in inter-coded blocks, as well as SKIP blocks where the MV is simply

set to zero. However, interpreting the motion in the intra-coded blocks is more involved.

In this section, we describe two preprocessing steps that are employed before the actual

ST-MRF optimization discussed in the previous section. These steps are the management

of intra-coded blocks and eliminating GM.

7.2.1 Polar Vector Median for Intra-Coded Blocks

Intra-coded blocks have no associated MVs. However, for the purpose of running the

ST-MRF optimization in the previous section, it is useful to assign MVs to these blocks.

We propose to do this based on the neighboring MVs using a new method called Polar

Vector Median (PVM). For this purpose, we employ MVs of the first-order neighboring

MBs (North, West, South, and East) that are not intra-coded. Fig. 7.3 shows a sample

intra-coded MB along with its first-order neighboring MBs.

[Figure 7.3 diagram: an intra-coded MB whose block MV v(b) is unknown, bordered by neighboring blocks with MVs v1 to v6.]

Figure 7.3: MV assignment for an intra-coded MB. One of the first-order neighboring MBs is also intra-coded, and the remaining neighbors have MVs assigned to variable size blocks.

In this example, the goal is to

find v(b) for all blocks b in the intra-coded MB. We collect MVs of the 4× 4 blocks from

the neighboring MBs that are closest to the current intra-coded MB and store them in the

list V . For this example, the list of MVs is V = (v1,v2,v3,v3,v4,v4,v5,v5,v6,v6,v6,v6).

Note that v6 appears four times in the list, because it is assigned to four 4 × 4 blocks

along the southern boundary of the intra-coded MB. For the same reason, v3, v4, and v5

appear twice, while v1 and v2 appear only once in the list. The list will contain at most 16

vectors, which happens when all first-order neighboring MBs are inter-coded. In the above

example, one of the neighboring MBs is intra-coded, so the list contains only 12 vectors.

The next step is to assign a representative vector from this collection of vectors. The

standard vector median [14] would be the vector from the list with the minimum total

distance to other vectors in the list. The existence of outliers, however, has an adverse

effect on the vector median. We therefore propose another method called PVM, which has

proved less computationally demanding and more robust in our study.

In PVM, the representative vector is computed in polar coordinates as follows. Let

$V = (\mathbf{v}_i)_{i=1:n}$ be the list of n input vectors, sorted according to their angle (in radians) from

−π to +π. Then, a collection of $m = \lfloor (n+1)/2 \rfloor$ vectors is selected as $\hat{V} = (\mathbf{v}_i)_{i=k:k+m-1}$,

where index k is found as

$$k = \arg\min_{j} \sum_{i=j}^{j+m-2} \theta_i, \tag{7.16}$$

and $\theta_i$ denotes the angle between vectors $\mathbf{v}_i$ and $\mathbf{v}_{i+1}$ (let $\mathbf{v}_{n+1} \equiv \mathbf{v}_1$). The new list $\hat{V}$

contains approximately half the original number of vectors in V, chosen such that the sum

of the angles between them is minimum. Hence, they are clustered in a narrow beam.

Figure 7.4: Polar Vector Median: (a) original vectors, (b) angles of vectors; cyan vectors: candidate vectors for computing representative angle, red vector: representative angle, (c) lengths of vectors, red line: representative length, (d) result of polar vector median, green vector: standard vector median [14], red vector: polar vector median.

The PVM v is constructed as follows: its angle is chosen to be the median of angles of the

vectors in $\hat{V}$, while its magnitude is set to the median of magnitudes of the vectors in V,

that is

$$\angle \mathbf{v} = \operatorname{median}\left(\angle \mathbf{v}_i\right)_{i=k:k+m-1}, \tag{7.17}$$

$$\lVert \mathbf{v} \rVert_2 = \operatorname{median}\left(\lVert \mathbf{v}_i \rVert_2\right)_{i=1:n}. \tag{7.18}$$

Fig. 7.4 shows an example of computing the PVM. It should be mentioned that zero-

vectors are excluded from angle calculations, since their angle is indeterminate. Once PVM

v is computed, it is assigned to all 4 × 4 blocks within the intra-coded MB (e.g., v(b) in

Fig. 7.3).
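A possible Python sketch of the PVM computation in (7.16) to (7.18) is given below. It is an illustration under stated assumptions, not the code used in this work: the function name is hypothetical, the window of m vectors is allowed to wrap around ±π, angles in a wrapped window are unwrapped by 2π before taking the median, and `statistics.median` (which averages the two middle values for an even count) is used for both medians.

```python
import math
from statistics import median


def polar_vector_median(vectors):
    """Polar Vector Median of a list of (x, y) motion vectors."""
    if not vectors:
        return (0.0, 0.0)

    # (7.18): the magnitude is the median magnitude over all input vectors.
    magnitude = median(math.hypot(x, y) for x, y in vectors)

    # Zero vectors have no defined angle and are left out of the angle step.
    angles = sorted(math.atan2(y, x) for x, y in vectors if (x, y) != (0, 0))
    n = len(angles)
    if n == 0:
        return (0.0, 0.0)
    m = (n + 1) // 2

    # gaps[i]: angle from vector i to its successor, taken circularly.
    gaps = [(angles[(i + 1) % n] - angles[i]) % (2 * math.pi) for i in range(n)]

    # (7.16): start index of the m consecutive vectors with the narrowest spread.
    k = min(range(n), key=lambda j: sum(gaps[(j + i) % n] for i in range(m - 1)))

    # Unwrap the selected angles so a window crossing +/-pi stays contiguous.
    window = [
        angles[(k + i) % n] + (2 * math.pi if k + i >= n else 0.0)
        for i in range(m)
    ]

    # (7.17): the angle is the median angle of the selected vectors.
    angle = median(window)
    return (magnitude * math.cos(angle), magnitude * math.sin(angle))
```

For the input [(1, 0), (1, 0), (1, 0), (-5, 5)], the outlier affects neither the selected beam nor the median magnitude, so the result is the vector (1, 0).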

Fig. 7.5 shows the effect of assigning PVM of neighboring blocks to intra-coded blocks on

frame #35 of the Hall Monitor sequence. The intra-coded blocks are indicated in Fig. 7.5(a),

the labeling with zero MV assignment to intra-coded MBs is shown in Fig. 7.5(b), while the

labeling with PVM assignment to intra-coded MBs is shown in Fig. 7.5(c). When the block

labeling is computed as discussed in Section 7.1, all pixels in a given block are assigned

the label of that block. By comparing with the manually segmented ground truth, one can

identify correctly labeled region pixels (true positives - TP ), non-region pixels incorrectly

labeled as region pixels (false positives - FP ), and missed region pixels that are labeled

as non-region pixels (false negatives - FN). The numbers are TP = 3340, FP = 580,

FN = 199 for Fig. 7.5(c), where PVM is used, against TP = 3313, FP = 591, FN = 226

in Fig. 7.5(b), where zero-vector is used instead of the PVM. It is easy to see that detection

is improved by PVM assignment around the man’s feet in the bottom parts of Fig. 7.5(b)

and Fig. 7.5(c).

Figure 7.5: The effect of assigning PVM to intra-coded blocks: (a) intra-coded blocks indicated as yellow squares; (b) tracking result without PVM assignment; (c) tracking result with PVM assignment. TP are shown as green, FP as blue, FN as red.

7.2.2 Global Motion Compensation

Global motion (GM), caused by camera movement, affects all pixels in the frame. Since

GM adds to the object’s native motion, for accurate tracking, it is important to remove

GM from the MV field prior to further processing. In this work we use the 6-parameter

affine model [37] to represent GM. Although less flexible than the 8-parameter perspective

model, the affine model has been widely used in the literature to represent GM, in part

because there are fewer parameters to be estimated, which often leads to higher estimation

accuracy. Given the parameters of the affine model, [m1, . . . ,m6], a block centered at (x, y)

in the current frame will be transformed to a quadrangle centered at (x′, y′) in the reference

frame, where (x′, y′) is given by

x′ = m1 +m2x+m3y, y′ = m4 +m5x+m6y. (7.19)

The MV due to affine motion is given by

v(x, y) = (x′ − x, y′ − y) (7.20)
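Equations (7.19) and (7.20) translate directly into code. In the sketch below the function names are hypothetical, and removing GM by subtracting the affine MV from the observed MV is an assumption about how the compensation is applied.

```python
def affine_mv(m, x, y):
    """MV induced by the affine GM model at block centre (x, y)."""
    m1, m2, m3, m4, m5, m6 = m
    x_ref = m1 + m2 * x + m3 * y   # (7.19)
    y_ref = m4 + m5 * x + m6 * y
    return (x_ref - x, y_ref - y)  # (7.20)


def compensate(mv, m, x, y):
    """Remove the GM component from an observed MV (assumed subtraction)."""
    gx, gy = affine_mv(m, x, y)
    return (mv[0] - gx, mv[1] - gy)
```

With the identity parameters (0, 1, 0, 0, 0, 1) the induced MV is zero everywhere, and a pure translation (m1, 1, 0, m4, 0, 1) induces the same MV (m1, m4) at every block.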

To estimate GM parameters [m1, . . . ,m6], we use a modified version of the M-Estimator

introduced in [13] by Arvanitidou et al., which is an extension of [144]. This method

reduces the influence of outliers in a re-weighted iteration procedure based on the estimation

error obtained using least squares estimation. This iterative procedure continues until

convergence. Chen et al. [29] showed that the performance of this approach depends on the

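tuning constant in weighting factor calculation. In [13], the weighting factor w(ξi) for the i-th

MV, which controls the strength of outlier suppression, is calculated as

$$w(\xi_i) = \begin{cases} \left(1 - \dfrac{\xi_i^2}{\tau^2 \cdot \mu_\xi^2}\right)^2, & \xi_i < \tau \cdot \mu_\xi \\ 0, & \xi_i \ge \tau \cdot \mu_\xi \end{cases} \tag{7.21}$$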

where τ is the tuning constant, ξi is the estimation error calculated as the Manhattan norm

between the observed MV and the MV obtained from the estimated model (7.19)-(7.20),

and µξ is the average estimation error over all the MVs in the frame. In our slightly modified

approach, to obtain the decision boundary for outlier suppression (which is τ ·µξ in (7.21)),

instead of using all the MVs in the frame, we choose only those MVs that are not currently

declared as outliers. In particular, the weighting factor is calculated iteratively as

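$$w(\xi_i) = \begin{cases} \left(1 - \dfrac{\xi_i^2}{(\mu_\xi + 2\sigma_\xi)^2}\right)^2, & \xi_i < \mu_\xi + 2\sigma_\xi \\ 0, & \xi_i \ge \mu_\xi + 2\sigma_\xi \end{cases} \tag{7.22}$$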

where ξ is the set of estimation errors for MVs with a non-zero weighting factor after each

iteration, and σξ is the standard deviation of the new set ξ. Once the weighting factor for

a MV becomes zero, it will be discarded from the set ξ in the following iterations. In this

approach, the decision boundary for outlier suppression uses both the average value and

standard deviation of estimation errors over the set ξ, which is more robust in comparison

to the conventional approach where only the average value of estimation errors over all MVs

is used.
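One reweighting pass of (7.22) might look as follows in Python. The least-squares fit that produces the estimation errors is omitted; the function name, the use of the population standard deviation, and the guard for the degenerate case in which all remaining errors are equal are assumptions of this sketch.

```python
import math


def update_weights(errors, weights):
    """Apply (7.22) once.

    errors:  estimation error of each MV under the current GM estimate.
    weights: current weights; a zero weight marks an MV already discarded.
    """
    active = [e for e, w in zip(errors, weights) if w > 0]
    if not active:
        return [0.0] * len(errors)

    mu = sum(active) / len(active)
    sigma = math.sqrt(sum((e - mu) ** 2 for e in active) / len(active))
    if sigma == 0:
        # Degenerate case: keep the current weights (guard added here).
        return list(weights)

    bound = mu + 2.0 * sigma  # decision boundary for outlier suppression
    new_weights = []
    for e, w in zip(errors, weights):
        if w <= 0 or e >= bound:
            new_weights.append(0.0)  # discarded now or in an earlier pass
        else:
            new_weights.append((1.0 - e * e / (bound * bound)) ** 2)
    return new_weights
```

In a full estimator, this pass would alternate with a weighted least-squares fit of the six affine parameters until the weights stop changing.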

Arvanitidou et al. [13] observed that large block modes are more likely to be part of

the background, whereas small block modes, arising from splitting in the motion estimation

procedure at the encoder, are more likely to belong to foreground moving objects. As an

example, Fig. 7.6 shows small block modes (8 × 4, 4 × 8 and 4 × 4) indicated in red in a

couple of frames from Coastguard and Stefan sequences. Hence, only MVs of large blocks

(16 × 16, 16 × 8, 8 × 16 and 8 × 8) are used in global motion estimation (GME), while

MVs of small blocks (8 × 4, 4 × 8 and 4 × 4) and intra-coded blocks are discarded from this

process.

Figure 7.6: Small block modes (8 × 4, 4 × 8 and 4 × 4) of frame #2 for Coastguard and Stefan sequences.

In our approach, for the purpose of GME, we also discard the MVs from the region

that was occupied by the object in the previous frame. To get fast convergence to a stable

solution and avoid being trapped in local minima, we initialize the weighting

factor for the i-th MV based on its dissimilarity to neighboring MVs, denoted by δ(ξi), which is

computed as the median of its Manhattan differences from MVs of its neighbors. Therefore,

the initial weight for the i-th MV is computed by

w(ξi) = exp (−δ(ξi)). (7.23)
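The initialization in (7.23) can be illustrated with a short Python sketch. The function name is hypothetical, the set of neighboring MVs is supplied by the caller, and returning a weight of 1 when no neighbors are available is an assumption.

```python
import math
from statistics import median


def initial_weight(mv, neighbour_mvs):
    """Initial weight of an MV from its dissimilarity to neighboring MVs."""
    if not neighbour_mvs:
        return 1.0  # no neighbors to compare against (assumption)
    # delta: median Manhattan difference between the MV and its neighbors.
    delta = median(abs(mv[0] - nx) + abs(mv[1] - ny) for nx, ny in neighbour_mvs)
    return math.exp(-delta)  # (7.23)
```

An MV that agrees with most of its neighbors has δ = 0 and starts with full weight, while an isolated, dissimilar MV starts with a weight close to zero.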

7.3 Experiments

7.3.1 Experimental setup

A number of standard test sequences were used to evaluate the performance of our

proposed approach. Sequences were in the YUV 4:2:0 format, at two resolutions, CIF

(Common Intermediate Format, 352×288 pixels) and SIF (Source Input Format, 352×240

pixels), all at the frame rate of 30 fps. All sequences were encoded using the H.264/AVC JM

v.18.0 encoder [5], at various bitrates, with the GOP (Group-of-Pictures) structure IPPP,

i.e., the first frame is coded as intra (I), and the subsequent frames are coded predictively

(P). Motion and partition information were extracted from the compressed bitstream, and

MVs were remapped to 4× 4 blocks, as explained in Section 7.2.

Two notable characteristics of the proposed approach are its robustness and stability.

To show this, we use the same parameters throughout all the experiments, as listed in

Table 7.1.

Table 7.1: Parameter values used in our experiments

Parameter   CΓ   CΦ    CΛ     c1      c2   c3
Value       1    2/3   0.25   1/128   1    2

We found that the average performance does not change much if some of these

parameter values are changed, especially the parameters c1, c2 and c3 that represent the

effect of camera motion on energy function, and only affect a few frames.

7.3.2 Results

Fig. 7.7 illustrates a few intermediate results from the tracking process for a sample

frame from Coastguard. As seen from Fig. 7.7(b), the MVF around the target (small

boat) is somewhat erratic, due to fast camera motion in this part of the sequence. The

proposed ST-MRF-based tracking algorithm computes the energy function for the chosen

MRF model, which is shown in Fig. 7.7(c). As this energy map indicates, despite the erratic MVF, the target

seems to be localized reasonably well. Fig. 7.7(d) shows the detected target region after

the tracking process has been completed. Pixels shaded with different colors indicate TP,

FP and FN, as explained in the caption of Fig. 7.5. As seen here, some erroneous decisions

are made around the boundary of the target object, but the object interior is detected well.

For comparison purposes, the segmentation results from two other methods, [119] and [163],

are illustrated in Fig. 7.7(e) and Fig. 7.7(f), respectively. Clearly, the proposed method has

produced a much better result compared to these two methods in the face of sudden and

fast camera movement.

In Fig. 7.8 our proposed method is compared with the methods of Liu et al. [103], Chen

et al. [119], and Zeng et al. [163] in terms of visual segmentation results on frame #97 of

Mobile Calendar. The region being tracked is the ball. In frame #97, which corresponds

to the moment when the ball touches the train and changes its direction, the blocks within

the ball and the train have almost equal MVs. This may cause confusion in the tracking

or segmentation task. As a consequence, the proposed method declares some parts of the

train as the ball; nonetheless, the ball itself is detected correctly (Fig. 7.8(a)). The method

from [103], on the other hand, misses a large part of the ball (Fig. 7.8(b)). It only detects

a small part of the ball and misclassifies some parts of the train and the background as the

target. Segmentation results of the methods from [119] (Fig. 7.8(c)) and [163] (Fig. 7.8(d))

show that the ball is not separated from the train, which is not surprising, since

these methods rely on spatial segmentation of MVs.

Figure 7.7: Object detection during ST-MRF-based tracking: (a) frame #70 of Coastguard, (b) target superimposed by scaled MVF after GMC, (c) the heatmap visualization of MRF energy, (d) tracking result by the proposed method, (e) segmentation result from [119], and (f) segmentation result from [163].

Figure 7.8: Object detection/tracking for frame #97 of Mobile Calendar by (a) the proposed method and (b) the method from [103]; segmentation result from (c) [119] and (d) [163].

Another example that illustrates the

robustness of the proposed method is the Coastguard sequence. Here, our method is able

to track the small boat throughout the sequence, even during the fast camera movement,

as shown by the trajectory in Fig. 7.9(b). By comparison, none of the other three methods

were able to segment and/or track the small boat in the frames #68 to #76, where the

camera motion overwhelms the object motion.

The average values of Precision, Recall and F-measure for several sequences where

segmentation ground truth was available are shown in Table 7.2 for [119], [163], and the

proposed method. Precision is defined as the number of TP divided by the total number

of labeled pixels, i.e., the sum of TP and FP. Recall is defined as the number of TP divided

by the total number of ground truth labels, i.e., the sum of TP and FN. F-measure is the

harmonic mean of precision and recall. Unfortunately, for the method from [103], we were

only able to obtain object masks for Mobile Calendar from the authors.

Table 7.2: Total average of Precision (P), Recall (R), and F-Measure (F) in percent for different methods

Method    Measure  Mobile  Coastguard  Stefan (CIF)  Stefan (SIF)  Hall  Garden  Tennis  City  Foreman  Avg.
Proposed  P        75.9    64.3        84.2          84.7          72.8  82.9    94.1    92.9  92.3     82.7
Proposed  R        88.4    89.4        68.3          67.8          84.4  95.8    88.0    96.5  90.4     85.5
Proposed  F        81.2    74.4        74.1          74.3          78.1  88.8    90.8    94.6  91.2     83.0
[119]     P        31.8    4.8         18.5          18.4          27.9  53.2    76.3    86.8  85.8     44.8
[119]     R        91.6    86.3        86.0          88.5          91.9  99.0    72.9    96.9  64.3     86.4
[119]     F        40.5    8.1         25.5          24.4          37.3  68.8    69.9    91.5  69.9     48.4
[163]     P        6.6     3.0         10.7          10.5          15.6  34.6    48.9    77.8  81.2     32.1
[163]     R        93.3    95.9        87.8          88.4          90.1  99.2    76.2    97.0  65.3     88.1
[163]     F        12.1    5.8         18.3          17.9          22.9  50.7    52.2    84.2  69.5     37.1

As seen in this table, the proposed method has the highest Precision and F-measure across all sequences,

averaging almost a two-fold improvement in both of these metrics. The reason why its Recall

is sometimes lower than that of the other two methods is that it tracks the boundary of the

target object fairly closely in a block-based fashion, and therefore excludes from the object

those pixels that fall into boundary blocks that get declared as the background, resulting

in higher FN. By comparison, the other two methods often include the object and a large

portion of the background into the segmented region, resulting in a smaller FN.
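The three measures defined above can be computed from pixel counts as in the following sketch. The function name and the convention of returning 0 when a denominator is zero are assumptions.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, Recall and F-measure from pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

As a usage example, the counts reported for Fig. 7.5(c), TP = 3340, FP = 580 and FN = 199, give a Precision of about 85.2%, a Recall of about 94.4%, and an F-measure of about 89.6% for that single frame.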

Fig. 7.9 shows the tracked trajectories produced by the proposed

method for several standard sequences, superimposed onto the last frame of each sequence.

Each trajectory is obtained by connecting the center of gravity of the tracked target through the

frames. The blue lines connect the centers of gravity produced by the proposed method,

while yellow lines connect the centers of gravity of the ground truth. The trajectory result

of ball tracking in Mobile Calendar is shown in Fig. 7.9(a). Although the ball has unre-

liable MVs due to its rotational movement and relatively small size, and even changes its

direction twice, the proposed algorithm is able to track it reasonably well. In Coastguard

in Fig. 7.9(b), the small boat moves from right to left; the camera at first follows the small

boat until frame #66, then moves upwards quickly, and after that it follows the big boat

starting at frame #75. During the first 66 frames, because the camera motion is in line

with the movement of the small boat, the location of the small boat relative to the frame is

fairly static. During the frames #67 to #74, when the camera moves upwards quickly, the

MVs are rather erratic. The method of [103] has problems in this part of the sequence and

fails to track the small boat, as noted in [103]. Similarly, the methods of [163] and [119]

fail here as well, since the camera motion completely overwhelms object motion. However,

the proposed method is able to track the small boat fairly accurately even through this

challenging part of the sequence. Fig. 7.9(c) and Fig. 7.9(d) (Stefan CIF and SIF) are

good examples of having both rapid movement of the target object (the player) and the

camera. Since our algorithm is not able to detect the player’s feet as a part of the target

region, due to their small size relative to the block size, the center of gravity produced by

the proposed algorithm is above the ground truth center of gravity. Other than that, the

proposed method correctly tracks the player even in this challenging MVF involving fast

camera motion and rapid player’s movement. Fig. 7.9(e) shows the trajectory results for

Hall Monitor, where the camera is fixed and the man is moving from the door to the end of

the hall. In Fig. 7.9(f) the camera motion is toward the right, therefore the position of the

target (tree trunk) changes from right to left within the frame. The center of gravity goes

upwards in the latter part because the bottom of the tree trunk disappears and the center

of gravity rises as a result.
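The trajectories discussed above connect per-frame centers of gravity. A minimal sketch of that computation, assuming the tracked region is available as a 2-D binary mask (the function name and the (x, y) return convention are hypothetical), is:

```python
def center_of_gravity(mask):
    """Return the (x, y) centroid of the non-zero entries of a 2-D mask.

    Returns None when the mask contains no labeled entries.
    """
    count = sum_x = sum_y = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                count += 1
                sum_x += x
                sum_y += y
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)
```

Applying this to the mask of each frame and joining consecutive results yields a trajectory; when the lower part of a target leaves the frame, the centroid moves upwards, as observed for the tree trunk.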

Figure 7.9: Trajectory results of (a) Mobile Calendar, (b) Coastguard, (c) Stefan CIF, (d) Stefan SIF, (e) Hall Monitor and (f) Flower Garden sequences (blue lines: proposed algorithm, yellow lines: ground truth)

Chapter 8

Conclusions and Future Work

8.1 Summary and Conclusions

In this dissertation, we studied two important video processing problems – saliency esti-

mation and tracking – in the compressed domain. Although pixel-based methods have the

potential to yield more accurate results compared to compressed-domain methods, their

high computational complexity limits their use in practice. Since all digital video content

available to the end-users is in a compressed form anyway, pixel-domain algorithms require

additional computational effort to decode the video bitstream before they can be applied.

On the other hand, compressed-domain algorithms are able to significantly reduce computa-

tional complexity by utilizing information from the encoded video bitstream at the expense

of losing some accuracy.

In terms of saliency estimation, we proposed three compressed-domain features as a

measure of saliency, and showed that they have high correlation with human gaze points.

Subsequently, two saliency estimation algorithms have been proposed and shown to have

superior accuracy with respect to state-of-the-art algorithms, even pixel-domain ones.

In particular, in Chapter 2 we presented an overview of a popular pixel-domain saliency

model, the so-called Itti-Koch-Niebur (IKN). The IKN saliency map is estimated from color

images in a biologically plausible manner. This bottom-up model of visual attention

analyzes various pre-attentive independent feature channels such as intensity, color and

orientation. Each feature is calculated and contrasted within two different resolutions using

a center-surround mechanism where each resolution is obtained by progressively low-pass

filtering and down-sampling the input image. More specifically, 6 intensity feature maps, 12

color feature maps, and 24 orientation feature maps are created from the image.

Finally, all computed feature maps are combined across all resolutions as well as feature

channels to create the master saliency map. We also described some variants of IKN in

Chapter 2. Two normalization operators for combining feature maps, namely MaxNorm

and FancyOne, were explained. In contrast to the MaxNorm operator, FancyOne is more

in tune with the local connectivity of cortical neurons and produces sparser saliency maps

with sharper peaks, leading to more accurate prediction of human fixations. In another

variation, two new features that are responsible for motion and flicker contrasts were added

to the standard IKN model to address spatio-temporal saliency estimation. Particularly,

24 motion feature maps and 6 flicker feature maps are created for various orientations, and

combined with the feature maps from basic IKN model to generate the spatio-temporal

saliency map.

A unified framework for accuracy measurement of visual saliency models was introduced

in Chapter 3. Two popular eye-tracking datasets for video, namely the SFU and DIEM

datasets, were used in our study. In addition to these ground-truth data, two saliency maps,

i.e., Intra-Observer (IO) and Gaussian center-bias (GAUSS), were used as benchmarks to

evaluate computational models. IO saliency map is obtained by the convolution of a 2-

D Gaussian blob with the second set of eye tracking fixations of the same observer. IO

saliency map was used as a benchmark for top-performing saliency models. GAUSS map is

simply a 2-D Gaussian blob located at the image center, considered as the output of a fully

center-biased model. Several metrics for evaluating the accuracy of visual saliency models

were also reviewed in this chapter, and then several novel metrics, namely AUC′, NSS′ and

JSD′, were proposed to overcome the shortcomings of conventional evaluation metrics, such

as gaze point uncertainty, center bias and border effects. A highly accurate computational

saliency model is expected to perform well across most metrics.

A comparison among existing compressed-domain saliency estimation models was given

in Chapter 4. Three well-known pixel-domain saliency models, i.e., MaxNorm, GBVS and

AWS, were also considered in this comparison to gain insight into the performance gap

between the current compressed-domain and pixel-domain state-of-the-art. In total, nine

compressed-domain visual saliency models were included in the study. While different video

coding standards were assumed by different compressed-domain models, our comparison was

carried out using the MPEG-4 ASP format. In this case, only two models (MCSDM and

APPROX) required minor modifications. The experimental results revealed that MaxNorm,

AWS, GBVS, PMES, GAUS-CS and PNSP-CS sustain superior performance according to

different accuracy metrics. In our experimental environment, AWS needs 1535 ms on average

per CIF video frame, GBVS 783 ms, and MaxNorm 91 ms, whereas the three top-performing

compressed-domain models need between 57 ms and 98 ms. It is encouraging that using only

compressed-domain information such as motion vectors (MVs), one can obtain comparable

accuracy to pixel-domain models with much less complexity. Furthermore, based on our

analysis, even though the model’s accuracy gets worse as the quality of encoded video drops,

the performance of a saliency model is relatively consistent over a meaningful range of Peak

Signal-to-Noise Ratio (PSNR).

In Chapter 5, we proposed three novel compressed-domain features – Motion Vector En-

tropy (MVE), Smoothed Residual Norm (SRN) and Operational Block Description Length

(OBDL). MVE is computed by MVs and block coding modes (BCMs), and SRN by trans-

formed prediction residuals. OBDL, on the other hand, is obtained directly from the output

of the entropy decoder and is considered as a measure of incompressibility. Both MVE and

SRN are computed from the H.264/AVC bitstream in this work, but can be extended to

other video coding standards such as the newly developed HEVC. The adaptation of OBDL to

any encoding standard is even simpler since OBDL is simply the number of bits required

to encode a block within a frame. We analyzed the statistics of the proposed compressed-

domain features around fixation points and randomly-selected non-fixation points, and val-

idated using a two-sample t-test the hypothesis that on average, feature values at fixation

points tend to be higher than at non-fixation points.

In Chapter 6, two saliency estimation models were introduced based on the compressed-

domain features proposed in Chapter 5. The first model is called MVE+SRN, and com-

putes the saliency map by fusing MVE and SRN feature maps. In the second model, called

OBDL-MRF, OBDL feature maps are filtered spatially and temporally, and the results are

incorporated into a Markov Random Field (MRF) model. In other words, the saliency

labeling of blocks within a frame at a certain time is solved using a maximum a posteriori

(MAP) of a MRF model. This allows formulating the log-posterior as the sum of temporal

consistency, observation coherence and compactness. To find the optimum label assignment,

we used Iterated Conditional Modes (ICM) mainly due to its low computational complexity.

The resulting saliency label assignments are then exploited to enhance the initial saliency

value. While, at a high level, OBDL-MRF is similar to well-known saliency models, such as

those based on self-information and surprise, it has the distinct advantage of being readily

available at the output of any video encoder, which already exists in most modern cam-

eras. Furthermore, the compressibility measure now proposed naturally takes into account

the trade-off between spatial and temporal information, because the video encoder already

performs rate-distortion optimization to minimize the number of bits needed to reconstruct

different regions in video. In this sense, the proposed solution is a much more sophisti-

cated measure of compressibility than previous measures based on reconstruction error, or

cruder measurements of self-information. We compared our proposed saliency models, in-

cluding MVE+SRN and OBDL-MRF, with seven compressed-domain saliency models as

well as seven prominent pixel-domain counterparts on H.264/AVC-compressed video. The

resulting saliency measures were shown to be highly accurate for the prediction of eye fixations,

achieving state-of-the-art results on standard benchmarks. This is complemented by very

low complexity, an average of 30 ms for MVE+SRN and 39 ms for OBDL-MRF per CIF

video frame in MATLAB, which makes these models appropriate for practical deployment.

Our analysis of the relationship between the performance of our saliency models and objec-

tive quality showed that both MVE+SRN and OBDL-MRF achieve their highest accuracy

at PSNR around 35 dB while their performance degrades dramatically at both extremes of

PSNR, below 26 dB and above 42 dB.

Finally, in Chapter 7, we presented a novel approach for tracking a moving object/region

in an H.264/AVC-compressed video. The only data from the compressed stream

used in the proposed method are the MVs and BCMs. As a result, the proposed method

has a fairly low processing time, yet still provides high accuracy. After the preprocessing

stage, which consists of intra-coded block approximation and global motion compensation

(GMC), we employed an MRF model to detect and track a moving target. Using this model,

an estimate of the labeling of the current frame was formed based on the previous frame

labeling and current motion information. Experimental evaluations on ground-truth video

demonstrate the superior functionality and accuracy of our approach compared with other

state-of-the-art compressed-domain segmentation/tracking approaches.

8.2 Future Work

In this research study, our proposed compressed-domain saliency features were extensively

tested on H.264/AVC-compressed video. While H.264/AVC is in wide use today in the media

and entertainment industry, the newer coding standard, HEVC, has continued to

attract market interest since its publication in 2013. Additionally, HEVC doubles the compression

ratio compared to H.264/AVC at the same level of video quality. Consequently, it

would be highly interesting to investigate if there is a stronger correlation between our video

features, specifically OBDL, and human fixation points in HEVC compared to H.264/AVC

at the same level of quality.

In our proposed MRF model for saliency detection and also for video object tracking, we

used fixed temperature parameters. Although our algorithm works well even with fixed parameter

values, better performance may be obtainable through adaptive tuning, at the cost of

generally higher complexity. Along these lines, dynamic parameter tuning

as proposed by Kato et al. [77] would be worth investigating as future work. We also be-

lieve that our relatively good results of MRF inference are partly due to reasonably accurate

initialization of the ICM, i.e., ICM is able to find good solutions in our case because the

labels are initialized to a reasonably good configuration. As possible future work, this assertion

can be tested by comparing ICM results with those of modern energy minimization algorithms for

MRFs, such as graph cuts [149]. In addition, since the MRF model has some weaknesses compared to the

Conditional Random Field (CRF) [97], the labeling could probably be improved

incrementally in the future, for example by moving to a CRF formulation.
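The dependence of ICM on its initialization can be illustrated with a minimal Python sketch on a binary label grid. This is not the thesis implementation: the names `icm_binary` and `energy`, the per-block cost array `unary`, and the Potts smoothness weight `beta` are assumptions made for illustration, standing in for the actual energy terms of the proposed MRF models.

```python
import numpy as np

def icm_binary(unary, init_labels, beta=1.0, max_iters=10):
    """Iterated Conditional Modes on a 4-connected binary label grid.

    unary[r, c, k] is a hypothetical cost of assigning label k to block (r, c);
    beta weights a Potts penalty for each pair of disagreeing neighbours.
    """
    labels = init_labels.copy()
    rows, cols = labels.shape
    for _ in range(max_iters):
        changed = False
        for r in range(rows):
            for c in range(cols):
                # Local conditional cost of each label given the current neighbours.
                costs = unary[r, c].astype(float)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        for k in (0, 1):
                            if labels[rr, cc] != k:
                                costs[k] += beta
                best = int(np.argmin(costs))
                if best != labels[r, c]:
                    labels[r, c] = best
                    changed = True
        if not changed:  # no label changed in a full sweep: local minimum reached
            break
    return labels

def energy(unary, labels, beta=1.0):
    """Total energy: unary costs plus Potts penalties over 4-connected pairs."""
    rows, cols = labels.shape
    e = sum(unary[r, c, labels[r, c]] for r in range(rows) for c in range(cols))
    e += beta * np.sum(labels[1:, :] != labels[:-1, :])
    e += beta * np.sum(labels[:, 1:] != labels[:, :-1])
    return float(e)
```

Each update can only lower (or keep) the total energy, so ICM stops at a local minimum that depends on `init_labels`; comparing the final energy against that of a global method such as graph cuts is the kind of test suggested above.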

In the MVE+SRN method, we used additive and multiplicative operations to com-

bine the compressed features MVE and SRN with the same weight for each of the three

terms in (6.1). Possibly more accurate saliency results can be obtained by setting differ-

ent weights for different features and operations. For example, since SRN achieves more

accurate saliency prediction according to different metrics (Section 6.3.2), it would seem rea-

sonable to assign higher weight to SRN compared to MVE in saliency estimation. Also, one

can expect improvement by weighting the impact of mutual reinforcement (multiplication

between MVE and SRN).
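A weighted variant of this combination can be sketched in Python as follows. The sketch assumes that the three terms in (6.1) are the two normalized feature maps and their product; the function names `normalize` and `fuse` and the weights `w_mve`, `w_srn`, and `w_joint` are hypothetical, and how the weights should be chosen is left open.

```python
import numpy as np

def normalize(x):
    """Rescale a feature map to [0, 1]; a constant map becomes all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def fuse(mve, srn, w_mve=1.0, w_srn=1.0, w_joint=1.0):
    """Weighted sum of two normalized maps and their product.

    The product term models mutual reinforcement between the two features.
    With all weights equal to 1, every term contributes equally.
    """
    m, s = normalize(mve), normalize(srn)
    return w_mve * m + w_srn * s + w_joint * (m * s)
```

For instance, `fuse(mve, srn, w_mve=0.5, w_srn=2.0)` would emphasize SRN over MVE while leaving the reinforcement term unchanged.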

We discussed various situations where most algorithms fail to estimate saliency. For

example, the falling yellow leaves in Tempete confuse most of the algorithms as they predict

these leaves to be salient, whereas the static object (the flower) in the middle of the frame

is what actually attracts human attention. Also, for video sequences such as City and

Flower Garden, where motion is produced only via camera motion, the algorithms generally

get confused by the objects closer to the camera. Another example would be dynamic

background (e.g., moving trees) which may require many bits to code, and may therefore

appear as salient to algorithms relying on compressed-domain features such as OBDL. These

are only a few examples where our compressed-domain algorithms, and in fact most existing

algorithms, might not perform well. More research on characterizing such situations and

finding appropriate solutions is needed.

Our MRF-based region tracking method attempts to track a single moving region, de-

tected in an intra-coded frame, through subsequent inter-coded frames. To be able to use

this method in a general setting, however, it should be extended to multiple region tracking.

To that end, each region can be tracked individually by our proposed MRF framework.

However, the framework should also be able to detect and resolve occlusion. We suggest using the

Merge-Split approach [46] for occlusion reasoning. In this approach, once several regions

are predicted as being occluded, they are merged and encapsulated into a new region. This

new region is then tracked using its new feature characteristics. Upon separation, the merged

region is split back into separate regions. An occlusion in the next frame can be predicted from

the estimated positions of blocks belonging to tracked regions, computed using their current MVs. For

tracking an occluded region, the MRF framework can still be applied. To do this, all energy

terms in Eq. (7.6) can remain the same, except for the context coherence energy, because a newly

encapsulated region combines several different context coherence attributes. To manage the context

coherence energy in the case of occlusion, our suggestion is to define several representative

MVs rather than a single representative MV. In turn, the context coherence energy needs

to be redefined to handle several groups of consistent motions.
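The occlusion prediction and merge steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data layout (a dictionary from region id to block coordinates and per-block displacements), the 4-pixel block size, the convention that each displacement points forward in time, and the names `project_blocks`, `predict_occlusions`, and `merge_regions` are all hypothetical and are not taken from [46] or from Chapter 7.

```python
def project_blocks(blocks, mvs, block_size=4):
    """Predict next-frame block coordinates from per-block displacements.

    blocks: list of (row, col) block indices; mvs: list of (dy, dx) in pixels,
    assumed here to point forward in time.
    """
    projected = set()
    for (r, c), (dy, dx) in zip(blocks, mvs):
        projected.add((r + round(dy / block_size), c + round(dx / block_size)))
    return projected

def predict_occlusions(regions, block_size=4):
    """Return pairs of region ids whose projected blocks overlap.

    regions: dict mapping region id -> (blocks, mvs).
    """
    proj = {rid: project_blocks(b, m, block_size) for rid, (b, m) in regions.items()}
    ids = sorted(proj)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:] if proj[a] & proj[b]]

def merge_regions(regions, pairs):
    """Group regions predicted to occlude each other (union-find over ids)."""
    parent = {rid: rid for rid in regions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(b)] = find(a)
    groups = {}
    for rid in regions:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())
```

Each merged group would then be tracked as one encapsulated region, with one representative MV kept per original member so that the context coherence energy can account for several consistent motions.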

It would also be interesting to investigate a new compressed-domain saliency model,

which detects salient regions in intra-coded frames and simply tracks them using the

proposed ST-MRF framework through inter-coded frames for a certain period of time.

This would also enable hybrid pixel-compressed-domain saliency estimation, where a pixel-

domain estimate is obtained in intra-coded frames, and then tracked and/or refined in

subsequent inter-coded frames. Typically, the fraction of intra-coded frames is relatively small

compared to the total number of frames in the video and they require less computational

effort for decoding compared to inter-coded frames. This allows a more complex saliency

estimation procedure to be employed on intra-coded frames. At the same time, there are

highly accurate still image saliency models in the pixel-domain, such as AWS, that could

be employed for saliency estimation in these frames.
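The frame-level scheduling of such a hybrid scheme can be sketched in Python as follows. The callables `pixel_model` (for example, a still-image model applied to decoded intra-coded frames) and `propagate` (a compressed-domain tracking or refinement step) are hypothetical placeholders, as is the function name `hybrid_saliency`; the sketch shows only the control flow, not a proposed model.

```python
def hybrid_saliency(frames, pixel_model, propagate):
    """Compute one saliency map per frame.

    frames: iterable of (frame_type, payload); frame_type "I" marks an
        intra-coded frame, anything else is treated as inter-coded.
    pixel_model: hypothetical callable mapping a decoded intra frame to a map.
    propagate: hypothetical callable mapping (previous map, compressed-domain
        data of the current inter-coded frame) to the current map.
    """
    maps, prev = [], None
    for frame_type, payload in frames:
        if frame_type == "I":
            prev = pixel_model(payload)       # costly model, run rarely
        elif prev is None:
            raise ValueError("sequence must start with an intra-coded frame")
        else:
            prev = propagate(prev, payload)   # cheap update, run on most frames
        maps.append(prev)
    return maps
```

Because the expensive model runs only on the small fraction of intra-coded frames, the average per-frame cost is dominated by the cheaper propagation step.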

While most work on saliency estimation has focused on high-quality, noiseless video,

much less attention has been given to low-quality video. Since one of the main application

areas of saliency-based processing is video surveillance, it is important for a practical solu-

tion to perform well on low-quality video taken by cheap cameras and in unfavorable lighting

conditions. Unfortunately, at this time, an eye-tracking dataset including video sequences

with different sources of noise is not available, so it is not possible to comprehensively test

existing saliency models on such video sequences. Creating such a dataset would therefore

enable further progress on this topic.

In addition to the material presented in this dissertation, we have proposed a new

method for compressed-domain estimation of global motion (GM) in [79], which has the

potential to enhance tracking and possibly saliency estimation accuracy, by incorporat-

ing camera motion into saliency computation. The influence of GMC on the accuracy of

saliency estimation has been shown by Hadizadeh in [51] and also discussed in Chapter 4,

so it is reasonable to expect that proper handling of global motion would enhance saliency

estimation.

Finally, in [80] we presented a method for visualizing the motion of an object on a

background sprite (mosaic). The background sprite was generated efficiently by using a

limited amount of information from the compressed video stream: MVs from each frame

and pixel values from a selected subset of frames. As a further extension, this method can

be adapted for saliency visualization to illustrate trajectories of estimated Foci of Attention

(FOAs) over a video sequence on the constructed background sprite.

Bibliography

[1] Adaptive whitening saliency model (AWS). http://persoal.citius.usc.es/xose.vidal/research/aws/AWSmodel.html. [Online]. 44

[2] The dynamic images and eye movements (DIEM) project. http://thediemproject.wordpress.com. [Online]. 5, 20, 70

[3] Eye tracking database for standard video sequences. http://www.sfu.ca/~ibajic/datasets.html. [Online]. 19

[4] FFMPEG project. http://www.ffmpeg.org. [Online]. 44, 70, 80

[5] H.264/AVC reference software. http://iphome.hhi.de/suehring/tml/. [Online]. 105

[6] Locarna systems. http://www.locarna.com. [Online]. 19, 27

[7] Saliency benchmark datasets. http://people.csail.mit.edu/tjudd/SaliencyBenchmark/. [Online]. 42

[8] SR Research. http://www.sr-research.com. [Online]. 20, 27

[9] G. Abdollahian, Z. Pizlo, and E. J. Delp. A study on the effect of camera motion on human visual attention. In Proc. IEEE ICIP’08, pages 693–696, 2008. 62

[10] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am, 2(2):284–299, 1985. 69

[11] G. Agarwal, A. Anbu, and A. Sinha. A fast algorithm to find the region-of-interest in the compressed MPEG domain. In Proc. IEEE ICME’03, volume 2, pages 133–136, 2003. 31, 35, 37, 81

[12] S. M. Anstis and D. M. Mackay. The perception of apparent movement [and discussion]. Philosophical Transactions of the Royal Society of London. B, Biological Sciences, 290(1038):153–168, 1980. 69

[13] M. G. Arvanitidou, A. Glantz, A. Krutz, T. Sikora, M. Mrak, and A. Kondoz. Global motion estimation using variable block sizes and its application to object segmentation. In Proc. IEEE WIAMIS’09, pages 173–176, 2009. 67, 104

[14] J. Astola, P. Haavisto, and Y. Neuvo. Vector median filters. Proceedings of the IEEE, 78(4):678–689, 1990. xiv, 101, 102

[15] F. Attneave. Informational aspects of visual perception. Psychological Review, 61:183–193, 1954. 2

[16] H. Barlow. Cerebral cortex as a model builder. In Models of the Visual Cortex, pages 37–46, 1985. 2

[17] H. Barlow. Redundancy reduction revisited. Network: Computation in Neural Systems, 12:241–253, 2001. 2

[18] E. Baudrier, V. Rosselli, and M. Larabi. Camera motion influence on dynamic saliency central bias. In Proc. IEEE ICASSP’09, pages 817–820, 2009. 62

[19] J. Besag. Spatial interaction and the spatial analysis of lattice systems. Journal of the Royal Statistical Society. Series B, 36:192–236, 1974. 77, 95

[20] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society. Series B, 48:259–302, 1986. 77, 79, 96

[21] A. Borji and L. Itti. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):185–207, 2013. 2, 5, 9, 24, 25

[22] A. Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Trans. Image Process., 22(1):55–69, 2013. 3, 5, 9, 21, 24, 28, 44, 88

[23] C. D. Brown and H. T. Davis. Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 80(1):24–38, 2006. 24

[24] N. Bruce and J. Tsotsos. Saliency based on information maximization. Advances in Neural Information Processing Systems, 18:155, 2006. 2, 42, 69

[25] M. W. Cannon and S. C. Fullenkamp. A model for inhibitory lateral interaction effects in perceived contrast. Vision Research, 36(8):1115–1125, 1996. 13

[26] N. R. Carlson. Psychology: The Science of Behaviour, 4th edition. Pearson Education Canada, 2010. 27

[27] Y. M. Chen and I. V. Bajic. A joint approach to global motion estimation and motion segmentation from a coarsely sampled motion vector field. IEEE Trans. Circuits Syst. Video Technol., 21(9):1316–1328, 2011. 96

[28] Y. M. Chen, I. V. Bajic, and P. Saeedi. Motion segmentation in compressed video using Markov random fields. In Proc. IEEE ICME’10, pages 760–765, 2010. 96

[29] Y. M. Chen, I. V. Bajic, and P. Saeedi. Moving region segmentation from compressed video using global motion estimation and Markov random fields. IEEE Trans. Multimedia, 13(3):421–431, 2011. 104

[30] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006. 2

[31] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, 2012. 42

[32] D. Culibrk, M. Mirkovic, V. Zlokolica, M. Pokric, V. Crnojevic, and D. Kukolj. Salient motion features for video quality assessment. IEEE Trans. Image Process., 20(4):948–958, 2011. 9

[33] I. Dagan, L. Lee, and F. Pereira. Similarity-based methods for word sense disambiguation. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pages 56–63. Association for Computational Linguistics, 1997. 26

[34] J. G. Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Optical Society of America A, 2(7):1160–1169, 1985. 12

[35] S. Dogan, A. H. Sadka, A. M. Kondoz, E. Kasutani, and T. Ebrahimi. Fast region of interest selection in the transform domain for video transcoding. In Proc. IEEE WIAMIS’05, 2005. 31

[36] B. Efron and R. Tibshirani. An introduction to the bootstrap, volume 57. CRC press, 1993. 24, 70

[37] W. Einhauser, M. Spain, and P. Perona. Objects predict fixations better than early saliency. Journal of Vision, 8(14), 2008. 24

[38] S. Engel, X. Zhang, and B. Wandell. Colour tuning in human visual cortex measured with functional magnetic resonance imaging. Nature, 388(6637):68–71, 1997. 11

[39] U. Engelke, H. Kaprykowsky, H. J. Zepernick, and P. Ndjiki-Nya. Visual attention in quality assessment. IEEE Signal Process. Mag., 28(6):50–59, 2011. 5, 19, 93

[40] U. Engelke and H. J. Zepernick. Framework for optimal region of interest–based quality assessment in wireless imaging. Journal of Electronic Imaging, 19(1):1–13, 2010. 9

[41] Y. Fang, Z. Chen, W. Lin, and C. W. Lin. Saliency detection in the compressed domain for adaptive image retargeting. IEEE Trans. Image Process., 21(9):3888–3901, 2012. 9, 31

[42] Y. Fang, W. Lin, Z. Chen, C. M. Tsai, and C. W. Lin. Video saliency detection in the compressed domain. In Proc. 20th ACM Intl. Conf. Multimedia, pages 697–700. ACM, 2012. 31, 35, 39, 61

[43] Y. Fang, W. Lin, Z. Chen, C. M. Tsai, and C. W. Lin. A video saliency detection model in compressed domain. IEEE Trans. Circuits Syst. Video Technol., 24(1):27–38, 2014. 35, 39, 61, 81

[44] X. Feng, T. Liu, D. Yang, and Y. Wang. Saliency inspired full-reference quality metrics for packet-loss-impaired video. IEEE Trans. Broadcast., 57(1):81–88, 2011. 9

[45] K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato. Saliency-based video segmentation with graph cuts and sequentially updated priors. In Proc. IEEE ICME’09, pages 638–641, 2009. 9

[46] P. F. Gabriel, J. G. Verly, J. H. Piater, and A. Genon. The state of the art in multiple object tracking under occlusion in video sequences. In Advanced Concepts for Intelligent Vision Systems, pages 166–173, 2003. 117

[47] D. Gao and N. Vasconcelos. On the plausibility of the discriminant center-surround hypothesis for visual saliency. Journal of Vision, 8, 7:1–18, 2008. 2

[48] A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, and R. Dosil. Saliency from hierarchical adaptation through decorrelation and variance normalization. Image and Vision Computing, 30(1):51–64, 2012. 3, 69, 81

[49] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, 1984. 79

[50] C. Guo and L. Zhang. A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE Trans. Image Process., 19(1):185–198, 2010. 9, 40

[51] H. Hadizadeh. Visual saliency in video compression and transmission. PhD thesis, Simon Fraser University, Apr. 2013. 31, 35, 41, 42, 43, 44, 118

[52] H. Hadizadeh and I. V. Bajic. Saliency-preserving video compression. In Proc. IEEE ICME’11, pages 1–6, 2011. 9

[53] H. Hadizadeh and I. V. Bajic. Saliency-aware video compression. IEEE Trans. Image Process., 23(1):19–33, Jan. 2014. 9

[54] H. Hadizadeh, I. V. Bajic, and G. Cheung. Saliency-cognizant error concealment in loss-corrupted streaming video. In Proc. IEEE ICME’12, pages 73–78, 2012. 41

[55] H. Hadizadeh, I. V. Bajic, and G. Cheung. Video error concealment using a computation-efficient low saliency prior. IEEE Trans. Multimedia, 15(8):2099–2113, Dec. 2013. 9

[56] H. Hadizadeh, M. J. Enriquez, and I. V. Bajic. Eye-tracking database for a set of standard video sequences. IEEE Trans. Image Process., 21(2):898–903, Feb. 2012. 5, 9, 19, 20, 35, 43, 66, 70

[57] A. Hagiwara, A. Sugimoto, and K. Kawamoto. Saliency-based image editing for guiding visual attention. In Proc. PETMEI’11, pages 43–48, Sep. 2011. 9

[58] S. Han and N. Vasconcelos. Biologically plausible saliency mechanisms improve feedforward object recognition. Vision Research, 50(22):2295–2307, 2010. 9

[59] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. Advances in Neural Information Processing Systems, 19:545–552, 2007. 44, 81

[60] Z. He, M. Bystrom, and S. H. Nawab. Bidirectional conversion between DCT coefficients of blocks and their subblocks. IEEE Trans. Signal Process., 53(8):2835–2841, 2005. 44

[61] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282–4286, 1995. 2

[62] T. Ho-Phuoc, N. Guyader, F. Landragin, and A. Guerin-Dugue. When viewing natural scenes, do abnormal colors impact on spatial or temporal parameters of eye movements? Journal of Vision, 12(2), 2012. 93

[63] Y. Hochberg and A. C. Tamhane. Multiple comparison procedures. John Wiley & Sons, Inc., 1987. 56, 81

[64] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In Proc. IEEE CVPR’07, pages 1–8, 2007. 3, 69

[65] X. Hou and L. Zhang. Dynamic visual attention: Searching for coding length increments. Advances in Neural Information Processing Systems, 21:681–688, 2008. 3, 69

[66] L. Itti. Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Trans. Image Process., 13(10):1304–1318, 2004. 9, 16, 43, 66, 74, 81

[67] L. Itti and P. Baldi. A principled approach to detecting surprising events in video. In Proc. IEEE CVPR’05, volume 1, pages 631–637, 2005. 24, 25

[68] L. Itti and P. Baldi. Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306, 2009. 24, 25, 61, 93

[69] L. Itti and P. F. Baldi. Bayesian surprise attracts human attention. Advances in Neural Information Processing Systems, 19:547–554, 2006. 2, 25, 35, 40, 69, 81, 88

[70] L. Itti and R. Carmi. Eye-tracking data from human volunteers watching complex video stimuli. 2009. CRCNS.org. 35, 88

[71] L. Itti, N. Dhavale, and F. Pighin. Realistic avatar eye and head animation using a neurobiological model of visual attention. In Optical Science and Technology, SPIE’s 48th Annual Meeting, pages 64–78. International Society for Optics and Photonics, 2004. 16

[72] L. Itti and C. Koch. Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10(1):161–169, 2001. xii, 14, 15

[73] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259, 1998. xii, 2, 9, 10, 12, 13, 16, 17, 40, 41, 42, 44, 61, 69, 74, 76, 81

[74] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946. 25

[75] Q. G. Ji, Z. D. Fang, Z. H. Xie, and Z. M. Lu. Video abstraction based on the visual attention model and online clustering. Signal Processing: Image Commun., 28(3):241–253, 2013. 9

[76] C. Kanan, M. H. Tong, L. Zhang, and G. W. Cottrell. SUN: Top-down saliency using natural statistics. Visual Cognition, 17(6-7):979–1003, 2009. 2, 28, 29, 56

[77] Z. Kato, T. C. Pong, and J. Chung-Mong Lee. Color image segmentation and parameter estimation in a Markovian framework. Pattern Recognition Letters, 22(3):309–321, 2001. 116

[78] H. Khalilian and I. V. Bajic. Video watermarking with empirical PCA-based decoding. IEEE Trans. Image Process., 22(12):4825–4840, Dec. 2013. 9

[79] S. H. Khatoonabadi and I. V. Bajic. Compressed-domain global motion estimation based on the normalized direct linear transform algorithm. In Proc. International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC’13), 2013. 118

[80] S. H. Khatoonabadi and I. V. Bajic. Still visualization of object motion in compressed video. In Proc. IEEE ICME’13 Workshop: MMIX, 2013. 118

[81] S. H. Khatoonabadi and I. V. Bajic. Video object tracking in the compressed domain using spatio-temporal Markov random fields. IEEE Trans. Image Process., 22(1):300–313, 2013. 7, 93

[82] S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. Comparison of visual saliency models for compressed video. In Proc. IEEE ICIP’14, pages 1081–1085, 2014. 5, 32

[83] S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. Compressed-domain correlates of fixations in video. In ACM Multimedia PIVP, PIVP ’14, pages 3–8, 2014. 6

[84] S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. Compressed-domain correlates of human fixations in dynamic scenes. Multimedia Tools and Applications, 2015. [Accepted for publication]. 6

[85] S. H. Khatoonabadi, I. V. Bajic, and Y. Shan. Compressed-domain visual saliency models: A comparative study. IEEE Trans. Image Process., 2015. [Submitted]. 5, 32

[86] S. H. Khatoonabadi, N. Vasconcelos, I. V. Bajic, and Y. Shan. How many bits does it take for a stimulus to be salient? In Proc. IEEE CVPR’15, pages 5501–5510, 2015. 6

[87] C. Kim and P. Milanfar. Visual saliency in noisy images. Journal of Vision, 13(4):1–14, 2013. 44, 59

[88] W. Kim, C. Jung, and C. Kim. Spatiotemporal saliency detection and its applications in static and dynamic scenes. IEEE Trans. Circuits Syst. Video Technol., 21(4):446–456, 2011. 81

[89] D. Knill and W. Richards. Perception as Bayesian Inference. Cambridge Univ. Press, 1996. 2

[90] E. Kreyszig. Introductory mathematical statistics: principles and methods. Wiley New York, 1970. 72

[91] S. Kullback. Information theory and statistics. Courier Dover Publications, 1997. 25

[92] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. 25

[93] E. C. Larson, C. Vu, and D. M. Chandler. Can visual fixation patterns improve image fidelity assessment? In Proc. IEEE ICIP’08, pages 2572–2575, 2008. 9

[94] O. Le Meur. Robustness and repeatability of saliency models subjected to visual degradations. In Proc. IEEE ICIP’11, pages 3285–3288, 2011. 59

[95] O. Le Meur and T. Baccino. Methods for comparing scanpaths and saliency maps: Strengths and weaknesses. Behavior Research Methods, 45(1):251–266, 2013. 24, 27

[96] A. G. Leventhal. The neural basis of visual function. Vision and Visual Dysfunction. Boca Raton, CRC Press, 1991. 10, 11, 12

[97] S. Z. Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009. 116

[98] X. Li, H. Lu, L. Zhang, X. Ruan, and M. H. Yang. Saliency detection via dense and sparse reconstruction. In Proc. IEEE ICCV’13, pages 2976–2983, 2013. 3, 69

[99] Z. Li, S. Qin, and L. Itti. Visual attention guided bit allocation in video compression. Image and Vision Computing, 21(1):1–19, Jan. 2011. 9

[100] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory, 37(1):145–151, 1991. 26

[101] X. Lin, H. Ma, L. Luo, and Y. Chen. No-reference video quality assessment in the compressed domain. IEEE Trans. Consum. Electron., 58(2):505–512, 2012. 31

[102] H. Liu and I. Heynderickx. Visual attention in objective image quality assessment: based on eye-tracking data. IEEE Trans. Circuits Syst. Video Technol., 21(7):971–982, 2011. 9

[103] Z. Liu, Y. Lu, and Z. Zhang. Real-time spatiotemporal segmentation of video objects in the H.264 compressed domain. J. Vis. Commun. Image R., 18(3):275–290, 2007. xv, 96, 106, 108, 109

[104] Z. Liu, H. Yan, L. Shen, Y. Wang, and Z. Zhang. A motion attention model based rate control algorithm for H.264/AVC. In The 8th IEEE/ACIS International Conference on Computer and Information Science (ICIS’09), pages 568–573, 2009. 31, 35, 38, 39, 44, 81, 93

[105] T. Lu, Z. Yuan, Y. Huang, D. Wu, and H. Yu. Video retargeting with nonlinear spatial-temporal saliency fusion. In Proc. IEEE ICIP’10, pages 1801–1804, 2010. 9

[106] Z. Lu, W. Lin, E. P. Ong, X. Yang, and S. Yao. PQSM-based RR and NR video quality metrics. In Proceedings-SPIE The International Society of Optical Engineering, volume 5150, pages 633–640. International Society for Optical Engineering, 2003. 9

[107] Y. F. Ma, X. S. Hua, L. Lu, and H. J. Zhang. A generic framework of user attention model and its application in video summarization. IEEE Trans. Multimedia, 7(5):907–919, 2005. 2

[108] Y. F. Ma and H. J. Zhang. A new perceived motion based shot content representation. In Proc. IEEE ICIP’01, volume 3, pages 426–429, 2001. 31, 35, 36, 81, 93

[109] Y. F. Ma and H. J. Zhang. A model of motion attention for video skimming. In Proc. IEEE ICIP’02, volume 1, pages 129–132, 2002. 31, 35, 37, 81, 93

[110] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In Proc. IEEE CVPR’10, pages 1975–1981, 2010. 3, 9

[111] V. Mahadevan and N. Vasconcelos. Spatiotemporal saliency in dynamic scenes. IEEE Trans. Pattern Anal. Mach. Intell., 32(1):171–177, 2010. 35, 41, 61, 93

[112] V. Mahadevan and N. Vasconcelos. Biologically inspired object tracking using center-surround saliency mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 35(3):541–554, 2013. 9

[113] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell., 11:674–693, July 1989. 2

[114] R. Margolin, A. Tal, and L. Zelnik-Manor. What makes a patch distinct? In Proc. IEEE CVPR’13, pages 1139–1146, 2013. 3, 69

[115] V. A. Mateescu and I. V. Bajic. Guiding visual attention by manipulating orientation in images. In Proc. IEEE ICME’11, pages 1–6, Jul. 2013. 9

[116] V. A. Mateescu and I. V. Bajic. Attention retargeting by color manipulation in images. In Proc. 1st Intl. Workshop on Perception Inspired Video Processing, PIVP ’14, pages 15–20, 2014. 9

[117] V. A. Mateescu, H. Hadizadeh, and I. V. Bajic. Evaluation of several visual saliency models in terms of gaze prediction accuracy on video. In Proc. IEEE Globecom’12 Workshop: QoEMC, pages 1304–1308, Dec. 2012. 9, 14, 56

[118] J. T. McClave and T. Sincich. Statistics, 9th edition, 2003. 42

[119] Y. M. Meng, I. V. Bajic, and P. Saeedi. Moving region segmentation from compressed video using global motion estimation and Markov random fields. IEEE Trans. Multimedia, 13(3):421–431, 2011. xv, 96, 106, 107, 108, 109

[120] R. Milanese, S. Gil, and T. Pun. Attentive mechanisms for dynamic and static scene analysis. Optical Engineering, 34(8):2428–2434, 1995. 61, 93

[121] A. K. Mishra, Y. Aloimonos, L. F. Cheong, and A. Kassim. Active visual segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 34(4):639–653, 2012. 9

[122] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson. Clustering of gaze during dynamic scene viewing is predicted by motion. Cognitive Computation, 3(1):5–24, 2011. 27

[123] A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE J. Sel. Topics Signal Process., 3(2):193–201, 2009. 9

[124] B. Moulden, J. Renshaw, and G. Mather. Two channels for flicker in the human visual system. Perception, 13(4):387–400, 1984. 69

[125] MPEG-7 Output Document ISO/MPEG. MPEG-7 visual part of experimentation model (XM) version 2.0. 1999. 35

[126] K. Muthuswamy and D. Rajan. Salient motion detection in compressed domain. IEEE Signal Process. Lett., 20(10):996–999, Oct. 2013. 31, 35, 40, 41, 81

[127] V. Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research, 45(2):205–231, 2005. 2

[128] E. Niebur and C. Koch. Computational architectures for attention. chapter 9, pages 163–186. Cambridge, MA: MIT Press, 1998. 1

[129] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006. 2

[130] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996. 2

[131] D. J. Parkhurst and E. Niebur. Scene content selected by active vision. Spatial Vision, 16(2):125–154, 2003. 28, 29, 56

[132] R. J. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom-up gaze allocation in natural images. Vision Research, 45(18):2397–2416, 2005. 26

[133] M. I. Posner and Y. Cohen. Components of visual orienting. Attention and Performance X: Control of Language Processes, 32:531–556, 1984. 15, 16

[134] W. Reichardt. Evaluation of optical motion information by movement detectors. Journal of Comparative Physiology A: Neuroethology, Sensory, Neural, and Behavioral Physiology, 161(4):533–547, 1987. 16

[135] P. Reinagel and A. M. Zador. Natural scene statistics at the center of gaze. Network: Computation in Neural Systems, 10:1–10, 1999. 38, 70

[136] P. J. Rousseeuw and C. Croux. Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424):1273–1283, 1993. 97

[137] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. Wiley Series in Probability and Mathematical Statistics, New York, 2003. 97

[138] N. G. Sadaka, L. J. Karam, R. Ferzli, and G. P. Abousleman. A no-reference perceptual image sharpness metric based on saliency-weighted foveal pooling. In Proc. IEEE ICIP’08, pages 369–372, 2008. 9

[139] E. F. Schisterman, D. Faraggi, B. Reiser, and M. Trevisan. Statistical inference for the area under the receiver operating characteristic curve in the presence of random measurement error. American Journal of Epidemiology, 154(2):174–179, 2001. 24

[140] N. Sebe and M. S. Lew. Comparing salient point detectors. Pattern Recognition Letters, 24(1):89–96, 2003. 3

[141] H. J. Seo and P. Milanfar. Static and space-time visual saliency detection by self-resemblance. Journal of Vision, 9(12):15, 2009. 2, 81

[142] B. Shen and I. K. Sethi. Convolution-based edge detection for image/video in block DCT domain. J. Vis. Commun. Image R., 7(4):411–423, 1996. 40

[143] A. Sinha, G. Agarwal, and A. Anbu. Region-of-interest based compressed domainvideo transcoding scheme. In Proc. IEEE ICASSP’04, volume 3, pages 161–164,2004. 31, 35, 38, 81

[144] A. Smolic, M. Hoeynck, and J. R. Ohm. Low-complexity global motion estimationfrom P-frame motion vectors for MPEG-7 applications. In Proc. IEEE ICIP’00,volume 2, pages 271–274, 2000. 104

[145] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S. C. Zhu. On advances in statisticalmodeling of natural images. Journal of mathematical imaging and vision, 18(1):17–33,2003. 2

[146] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Proc. IEEE CVPR’99, pages 246–252, 1999. 3, 9

[147] G. J. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol., 22(12):1649–1668, 2012. 63, 66

[148] J. A. Swets. Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers. Lawrence Erlbaum Associates, Inc., 1996. 24

[149] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields. In Proc. ECCV’06, volume 2, pages 16–29, 2006. 79, 116

[150] B. W. Tatler. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision, 7(14), 2007. 28

[151] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist. Visual correlates of fixation selection: Effects of scale and time. Vision Research, 45(5):643–659, 2005. 28, 29, 42

[152] C. Theoharatos, V. K. Pothos, N. A. Laskaris, G. Economou, and S. Fotopoulos. Multivariate image similarity in the compressed domain using statistical graph matching. Pattern Recognition, 39(10):1892–1904, 2006. 39

[153] A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980. 10, 39

[154] P. Vandewalle, J. Kovacevic, and M. Vetterli. Reproducible research in signal processing. IEEE Signal Process. Mag., 26(3):37–47, 2009. 7

[155] W. N. Venables and B. D. Ripley. Modern applied statistics with S. Springer, 2002. 97

[156] C. Wang, N. Komodakis, and N. Paragios. Markov random field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117(11):1610–1627, 2013. 76

[157] Z. Wang and A. C. Bovik. Foveated image and video coding. In H. R. Wu and B. D. Rao, editors, Digital Video Image Quality and Perceptual Coding, pages 431–458. 2005. 9

[158] Z. Wang and Q. Li. Information content weighting for perceptual image quality assessment. IEEE Trans. Image Process., 20(5):1185–1198, 2011. 9

[159] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol., 13(7):560–576, 2003. 43, 66, 100

[160] S. Winkler and R. Subramanian. Overview of eye tracking datasets. In 5th Intl. Workshop on Quality of Multimedia Experience (QoMEX), pages 212–217. IEEE, 2013. 19

[161] R. Xie and S. Yu. Region-of-interest-based video transcoding from MPEG-2 to H.264 in the compressed domain. Optical Engineering, 47(9):097001–097001, 2008. 31, 93

[162] H. Zen, T. Hasegawa, and S. Ozawa. Moving object detection from MPEG coded picture. In Proc. IEEE ICIP’99, volume 4, pages 25–29, 1999. 37

[163] W. Zeng, J. Du, W. Gao, and Q. Huang. Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model. Real-Time Imaging, 11(4):290–299, 2005. xv, 96, 106, 107, 108, 109

[164] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7), 2008. 2, 28

[165] S. H. Zhong, Y. Liu, Y. Liu, and F. L. Chung. A semantic no-reference image sharpness metric based on top-down and bottom-up saliency map modeling. In Proc. IEEE ICIP’10, pages 1553–1556, 2010. 9
