
Bitrate Reduction Techniques for Low-Complexity Surveillance Video Coding

by

Pushkar Gorur

Submitted to the

Department of Electrical Communication Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

INDIAN INSTITUTE OF SCIENCE

July 2016

© Pushkar Gorur

2016

All rights reserved


Abstract

High resolution surveillance video cameras are invaluable resources for effective crime prevention and forensic investigations. However, the increasing communication bandwidth requirements of high definition surveillance videos severely limit the number of cameras that can be deployed. Higher bitrates also increase operating expenses due to higher data communication and storage costs. Hence, it is essential to develop low complexity algorithms which reduce the data rate of the compressed video stream without affecting the image fidelity. In this thesis, a computer vision aided H.264 surveillance video encoder and four associated algorithms are proposed to reduce the bitrate and computational complexity. The proposed techniques are (I) speeded up foreground segmentation, (II) skip decision, (III) reference frame selection and (IV) face Region-of-Interest (ROI) coding.

In the first part of the thesis, a modification to the adaptive Gaussian Mixture Model (GMM) based foreground segmentation algorithm is proposed to reduce computational complexity. This is achieved by replacing expensive floating point computations with low cost integer operations. To maintain accuracy, we compute periodic floating point updates for the GMM weight parameter using the value of an integer counter. Experiments show speedups in the range of 1.33 to 1.44 on standard video datasets where a large fraction of pixels are multimodal.
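The idea of deferring the weight update can be illustrated with a minimal sketch. It is not the update derived in this thesis: it assumes the common per-frame rule w ← (1 − α)w + α for the matched component (and w ← (1 − α)w otherwise), counts matches with an integer counter, and every `interval` frames blends the old weight with the observed match fraction. The class name, the blending rule and the parameter names are assumptions made for illustration. The batched result coincides with the per-frame rule when a component matches in every frame or in none, and is an approximation otherwise.

```python
def per_frame_update(weights, matched, alpha):
    """Reference rule: one floating point update per component per frame."""
    return [(1.0 - alpha) * w + (alpha if k == matched else 0.0)
            for k, w in enumerate(weights)]


class DeferredWeightUpdater:
    """Counts matches with integers and updates the weights periodically.

    Illustrative assumption: the weights are refreshed once every
    `interval` frames from the integer match counters.
    """

    def __init__(self, weights, alpha, interval):
        self.weights = list(weights)
        self.interval = interval
        self.counts = [0] * len(weights)
        self.frames = 0
        # Decay accumulated over one interval, computed once.
        self.decay = (1.0 - alpha) ** interval

    def observe(self, matched):
        # Per-frame work is integer only.
        self.counts[matched] += 1
        self.frames += 1
        if self.frames == self.interval:
            self._flush()

    def _flush(self):
        # Single floating point update per component per interval.
        gain = 1.0 - self.decay
        self.weights = [self.decay * w + gain * c / self.interval
                        for w, c in zip(self.weights, self.counts)]
        self.counts = [0] * len(self.weights)
        self.frames = 0
```

With this structure the floating point cost is paid once per interval instead of once per frame, and the weights still sum to one after each refresh because the counters sum to the interval length.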

In the second part, we propose a skip decision technique that uses a spatial sampler to sample pixels. The sampled pixels are segmented using the speeded up GMM algorithm. The storage pattern of the GMM parameters in memory is also modified to improve cache performance. Skip selection is performed using the segmentation results of the sampled pixels. In the third part, a reference frame selection algorithm is proposed to maximize the number of background Macroblocks (MB’s), i.e. MB’s that contain background image content, in the Decoded Picture Buffer. This reduces the cost of coding uncovered background regions. Distortion over foreground pixels is measured to quantify the performance of the skip decision and reference frame selection techniques. Experimental results show bit rate savings of up to 94.5% over methods proposed in the literature on video surveillance data sets. The proposed techniques also provide up to 74.5% reduction in compression complexity without increasing the distortion over the foreground regions in the video sequence.
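A sampler based skip decision can be sketched as follows. This is a simplified illustration, not the sampler proposed in the thesis: it assumes a uniform systematic grid with a fixed stride inside each 16×16 MB, and a caller-supplied pixel classifier `is_foreground(x, y)` standing in for the GMM segmentation. The function name and the default stride are hypothetical.

```python
def skip_decision(is_foreground, width, height, mb_size=16, stride=4):
    """Return a 2D list: True where a macroblock can be marked as Skip.

    Only every `stride`-th pixel in each direction is classified, so the
    per-frame segmentation cost drops by roughly stride * stride.
    """
    mbs_x = width // mb_size
    mbs_y = height // mb_size
    skip = []
    for my in range(mbs_y):
        row = []
        for mx in range(mbs_x):
            x0, y0 = mx * mb_size, my * mb_size
            # The MB is Skip only if no sampled pixel is foreground.
            has_fg = any(
                is_foreground(x, y)
                for y in range(y0, y0 + mb_size, stride)
                for x in range(x0, x0 + mb_size, stride)
            )
            row.append(not has_fg)
        skip.append(row)
    return skip
```

A fixed grid can miss objects smaller than the stride, which is one reason an adaptive sampler is attractive in practice.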

In the final part of the thesis, face and shadow region detection is combined with the skip decision algorithm to perform ROI coding for pedestrian surveillance videos. Since person identification requires high quality face images, MB’s containing face image content are encoded with a low Quantization Parameter (QP) setting (i.e. high quality). Other regions of the body in the image are considered Regions of Reduced Interest (RORI) and are encoded at low quality. The shadow regions are marked as Skip. Techniques that use only facial features to detect faces (e.g. the Viola-Jones face detector) are not robust in real world scenarios. Hence, we propose to initially detect pedestrians using deformable part models. The face region is determined using the deformed part locations. Detected pedestrians are tracked using an optical flow based tracker combined with a Kalman filter. The tracker improves the accuracy and also avoids the need to run the object detector on already detected pedestrians. Shadow and skin detector scores are computed over super pixels. Bilattice based logic inference is used to combine multiple likelihood scores and classify the super pixels as ROI, RORI or RONI (Region of No Interest). The coding mode and QP values of the MB’s are determined using the super pixel labels. The proposed techniques provide a further reduction in bitrate of up to 50.2%.
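The mapping from region labels to a coding mode and QP can be sketched as below. The rule used here is an assumption for illustration: an MB takes the most important label found among the pixels it covers (ROI over RORI over RONI). The QP constants are hypothetical values within the H.264 range of 0 to 51 and are not the settings used in the thesis.

```python
from enum import IntEnum


class Label(IntEnum):
    RONI = 0  # region of no interest (e.g. background, shadow)
    RORI = 1  # region of reduced interest (e.g. body)
    ROI = 2   # region of interest (e.g. face)


QP_ROI = 24   # hypothetical: low QP, high quality
QP_RORI = 36  # hypothetical: high QP, low quality


def macroblock_decision(pixel_labels):
    """Return (mode, qp) for one MB from the labels of its pixels."""
    top = max(pixel_labels)  # most important label wins (assumed rule)
    if top == Label.ROI:
        return ("code", QP_ROI)
    if top == Label.RORI:
        return ("code", QP_RORI)
    return ("skip", None)
```

Letting the most important label win is a conservative choice: an MB that only partly overlaps a face is still coded at high quality.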


Acknowledgements

Firstly, I would like to thank my adviser Prof. Bharadwaj Amrutur for the patient guidance

and freedom he has provided me throughout my PhD program. When I was sitting like a

frog in the VLSI (P & N) wells, he nudged me to come out and explore the field of video

signal processing. His support and reassurances during the early days of my PhD when I

was working on the H.264 encoder have been invaluable. His detailed feedback about my

writing has helped me to present my research work more clearly. His emphasis on solving

the practical problem of surveillance video bitrate reduction helped me to stay focussed.

Without his constant course corrections, I would have drifted away and lost track (like

early versions of my pedestrian tracker!)

I take this opportunity to thank Prof. P. S. Sastry and Prof. Vittal Rao for the mathematics

concepts they imparted to me through their courses. I am very fortunate to have had the

continuous support and guidance of Prof. A. G. Menon. He was instrumental in my decision

to join the PhD program at ECE IISc. I would like to thank Prof. K. R. Ramakrishnan for

having allowed me to join the surveillance related discussions with the Bengaluru City

police officers. I would also like to thank him for the discussions related to the lampTop

project.

I would like to thank TCS for supporting my research through the TCS fellowship program. I would also like to thank Dr. Balamuralidhar for taking time to discuss my research work at TCS labs, Bengaluru.


I have been fortunate to have received help from many wonderful colleagues. Suhas

Kashyap has helped a lot in collecting surveillance videos. He also provided ground truth

segmentation data for the test videos used to validate the skip decision and reference frame

selection algorithms. He features prominently in a lot of videos that I have used in this

thesis! I thank Harish for discussions about the inference for the Face ROI encoder. I thank

Bhargava for helping me collect surveillance videos and test the pedestrian ROI encoder. I

wish to learn a lot of deep learning from him now! Thanks to Anirudh for developing the

SSE based convolution code for DPM. Samik helped to perform experiments on shadows

and initial feasibility studies of ROI coding. I would like to thank Ajit Gupte for taking time

to discuss my research at TI and Qualcomm. Working with Doney on the lampTop project

was a lot of fun.

I thank the ECE staff for all support. I would like to thank Srinivas Murthy Sir and

Radhika Madam in particular for all the help that they have provided me. I would like

to express my appreciation for the help and support I received from friends in our lab.

BT, PD, Rajath, Anand, Kaushik, Mohan, Manikandan, Satyam, Janaki, Hitesh, Vikram,

Viveka, Doney, Syam, Siva, Bhargava, Sagar, Karthik, Prachet, Akshay, Pratik, Mallikarjun,

Nagaraju, Auritro, Balram and Abhishek kept the lab environment enjoyable and fun. I used

to pull them out of lab to capture surveillance videos for testing!

I thank the Robert Bosch Center for Cyber Physical Systems for supporting my travel to

the ITS conference to present the lampTop project research work.

This dissertation would not have been possible without the peaceful walks at Sankey

tank. I thank all the people who have worked and continue to work to make it such a nice

place. I also thank all the staff in IISc who keep the gardens and campus beautiful. The

very thought of wading through Bengaluru traffic to get to work after the PhD is frightening.

Finally, I wish to thank my family for their support during these years.


List of publications from this thesis

Journal Articles

• Pushkar Gorur, Bharadwaj Amrutur, Skip Decision and Reference Frame Selection for

Low Complexity H.264/AVC Surveillance Video Coding, IEEE Transactions on Circuits

and Systems for Video Technology, vol. 24, no. 7, pp. 1156-1169, July 2014.

• Pushkar Gorur, Bhargava Srivatsa, Bharadwaj Amrutur, Region-of-Interest (ROI) Video

Coding for Pedestrian Surveillance Cameras (to be submitted).

Conference Proceeding

• Pushkar Gorur, Bharadwaj Amrutur, Speeded up Gaussian Mixture Model Algorithm

for Background Subtraction, IEEE Conference on Advanced Video and Signal Based

Surveillance (AVSS), pp. 386-391, 2011.


Abbreviations

AVC Advanced Video Coding

BG Background

CABAC Context-Adaptive Binary Arithmetic Coding

CCTV Closed Circuit Television

DCT Discrete Cosine Transform

DPCM Differential Pulse-Code Modulation

DPB Decoded Picture Buffer

DVR Digital Video Recorder

EM Expectation Maximization

FG Foreground

GMM Gaussian Mixture Model

HEVC High Efficiency Video Coding

HD High Definition

HOG Histogram of Oriented Gradients

HQF High Quality Frame

IDR Instantaneous Decoder Refresh

JM Joint Model

KL Kullback-Leibler

LAN Local Area Network


LLC Last Level Cache

MAD Mean Absolute Difference

MB Macroblock

MP Megapixel

MPEG Moving Picture Experts Group

MSE Mean Square Error

MV Motion Vector

NAL Network Abstraction Layer

PIR Passive Infrared

PMV Predicted Motion Vector

PoE Power over Ethernet

POC Picture Order Count

PSNR Peak Signal-to-Noise Ratio

QP Quantization Parameter

RC Rate Control

RD Rate Distortion

RDO Rate Distortion Optimization

ROI Region Of Interest

RONI Region Of No Interest

RORI Region Of Reduced Interest

RPB Reference Picture Buffer

RPLR Reference Picture List Reordering

SE Syntax element

S-MD Sampler based Motion Detection

SRL Statistical Relational Learning

SVM Support Vector Machine

VBR Variable Bitrate

VGA Video Graphics Array

VJ Viola Jones


Contents

Abstract iii

Acknowledgements v

List of publications from this thesis vii

Abbreviations viii

1 Introduction 1

1.1 Recent Trends in Video Surveillance . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Bitrate increase in HD surveillance . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 How much resolution is ‘good enough’? . . . . . . . . . . . . . . . . 2

1.2.2 Bitrate versus camera resolution . . . . . . . . . . . . . . . . . . . . 7

1.3 Bitrate increase in low light surveillance . . . . . . . . . . . . . . . . . . . . 8

1.3.1 Interplay between exposure, gain and noise . . . . . . . . . . . . . . 8

1.3.2 Bitrate versus noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.4 Challenges due to increased bitrate . . . . . . . . . . . . . . . . . . . . . . . 12

1.5 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.5.1 Proposed surveillance video encoder architecture . . . . . . . . . . . 13

1.5.2 Bitrate & computational complexity reduction . . . . . . . . . . . . . 14

1.6 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2 Background and Related Work 18

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2 A review of video encoding techniques . . . . . . . . . . . . . . . . . . . . . 18

2.3 H.264 basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3.1 Reference frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3.2 Macroblock Skip mode . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3.3 Macroblock QP signaling . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4 Bitrate & complexity reduction techniques for video surveillance . . . . . . . 26

2.4.1 Skip detection techniques . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.2 Background reference frame selection techniques . . . . . . . . . . . 29

2.4.3 ROI coding techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.4 Mode decision and motion estimation related techniques . . . . . . . 37

2.4.5 Hardware related advancements . . . . . . . . . . . . . . . . . . . . 38

2.4.6 Distributed video coding based techniques . . . . . . . . . . . . . . . 39

2.4.7 Wireless and/or Remote surveillance specific techniques . . . . . . . 40

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Speeded up GMM Algorithm for Background Subtraction 43

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2 Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2.1 Adaptive Mixture Learning with fast convergence . . . . . . . . . . . 44

3.2.2 Automatic selection of number of components . . . . . . . . . . . . . 46

3.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.4.1 Weight update interval Experiment . . . . . . . . . . . . . . . . . . . 52

3.4.2 Adaptive Mixture Learning Experiment . . . . . . . . . . . . . . . . . 53

3.4.3 Background subtraction experiment . . . . . . . . . . . . . . . . . . 54

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


4 Skip decision & Reference Frame Selection for H.264 Surveillance Coding 60

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.2 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.3 Sampling techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.3.1 Basic sampling techniques . . . . . . . . . . . . . . . . . . . . . . . . 63

4.3.2 Adaptive sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.4 Sampler based Background MB detection . . . . . . . . . . . . . . . . . . . . 68

4.4.1 GMM S-MD as a Stratified-Adaptive-Cluster sampler . . . . . . . . . 71

4.4.2 Spatio-temporal priors . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.4.3 Cache performance optimization . . . . . . . . . . . . . . . . . . . . 73

4.5 Reference frame selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.6 Macroblock Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.7 Skip Signalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.8 Optimum Reference Frame selection . . . . . . . . . . . . . . . . . . . . . . 80

4.8.1 Proposed Adaptive Reference Frame Selection Technique . . . . . . . 80

4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 Results: Skip Decision and Reference Frame Selection 83

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.3 Skip Selection using GMM S-MD . . . . . . . . . . . . . . . . . . . . . . . . 85

5.4 Analysis of GMM S-MD Performance . . . . . . . . . . . . . . . . . . . . . . 97

5.5 Background PSNR and its Impact . . . . . . . . . . . . . . . . . . . . . . . . 106

5.6 Analysis of the Proposed Adaptive Reference Frame Selection Technique . . 111

5.7 Performance of the Proposed Adaptive Reference Frame Selection Technique 113

5.8 RD performance results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6 ROI video coding for Pedestrian Surveillance 120

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120


6.2 Proposed architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

6.2.1 Low level inferencing . . . . . . . . . . . . . . . . . . . . . . . . . . 125

6.2.2 High level inferencing . . . . . . . . . . . . . . . . . . . . . . . . . . 125

6.3 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.4 Shadow detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6.4.1 Weak shadow detector . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6.4.2 Physics based shadow detection over super pixels . . . . . . . . . . . 129

6.4.3 Texture based shadow detection . . . . . . . . . . . . . . . . . . . . . 132

6.5 Skin detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.6 Pedestrian detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

6.6.1 DPM based pedestrian detection: A brief review . . . . . . . . . . . . 138

6.6.2 Proposed modifications to DPM . . . . . . . . . . . . . . . . . . . . . 141

6.7 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6.8 Detection by Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.8.1 Components of a tracker . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.8.2 FG blob based tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 149

6.8.3 Optic flow based tracker . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.9 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.9.1 Bilattice logic for ROI, RORI & RONI super pixel inference . . . . . . 156

6.10 Macroblock mode and quality parameter assignment . . . . . . . . . . . . . 161

6.11 ROI, RORI & RONI video compression results . . . . . . . . . . . . . . . . . 162

6.11.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

6.11.2 Bitrate reduction and accuracy . . . . . . . . . . . . . . . . . . . . . 163

6.11.3 Impact of detector errors on ROI encoder performance . . . . . . . . 169

6.11.4 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . 174

6.11.5 Complexity control for ROI encoding . . . . . . . . . . . . . . . . . . 175

6.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178


7 Conclusion 180

7.1 Future Challenges and Opportunities . . . . . . . . . . . . . . . . . . . . . . 183

7.1.1 Coding for surveillance cameras on drones . . . . . . . . . . . . . . . 183

7.1.2 Power-Rate-Distortion optimization of ROI encoders . . . . . . . . . 183

7.1.3 360° surveillance video coding . . . . . . . . . . . . . . . . . . . . . 184

7.1.4 HDR surveillance video coding . . . . . . . . . . . . . . . . . . . . . 185

A Alternate derivation of the Speeded up GMM update 186

B Sampler design 189

B.1 Analysis of a simple systematic sampler . . . . . . . . . . . . . . . . . . . . . 189

B.1.1 Uniform versus Non Uniform sampling patterns . . . . . . . . . . . . 190

B.1.2 Uniform systematic sampler accuracy . . . . . . . . . . . . . . . . . . 192

B.2 Analysis of the proposed sampler . . . . . . . . . . . . . . . . . . . . . . . . 197

C Bilattice logic based inference 199


List of Tables

1.1 BSIA standard [1] recommendations for image detail . . . . . . . . . . . . . 3

3.1 Average frame rate using proposed scheme, [2] & [3]. Average speedup

obtained using proposed scheme measured over [3] (Zivkovic) . . . . . . . 55

4.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.1 Average execution time reduction of encoder (QP set to 24) . . . . . . . . . 87

5.2 Performance comparison of proposed GMM S-MD on ‘No activity’ datasets . 88

5.3 Average execution time of Skip detection . . . . . . . . . . . . . . . . . . . . 89

5.4 Performance comparison of reference frame selection algorithms . . . . . . 116



List of Figures

1.1 Images with minimum image detail required to perform detection, observation, recognition & identification surveillance tasks . . . . . . . . . . . . . . 4

1.2 Pinhole camera model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Top view showing the coverage of a MOBOTIX surveillance camera with a

3.6mm lens [4] for recognition tasks . . . . . . . . . . . . . . . . . . . . . . 7

1.4 Average typical optimized bitrate of Bosch security cameras [5, 6] plotted

against the camera resolution . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.5 Low light scene with nominal exposure and gain settings . . . . . . . . . . . 9

1.6 Surveillance snapshots with different camera exposure settings (camera gain

has been increased to improve visibility and contrast) . . . . . . . . . . . . . 10

1.7 Surveillance snapshots with different camera gain settings . . . . . . . . . . 10

1.8 Bitrate versus pixel noise in a low light scene . . . . . . . . . . . . . . . . . 11

1.9 Architecture of the proposed surveillance video encoder . . . . . . . . . . . 14

2.1 Architecture of the H.264 encoder [7] . . . . . . . . . . . . . . . . . . . . . 22

2.2 Decoded Picture Buffer managed using MMCO commands . . . . . . . . . . 23

2.3 Reference frame list management in H.264 . . . . . . . . . . . . . . . . . . . 25

3.1 Weight update using proposed update for a monotonically (a) increasing and

(b) decreasing case. The weights are plotted on the y axis with respect to time 51


3.2 Weight update using (a) original GMM update equations and (b) proposed

weight update for a GMM with weights=[0.7, 0.25, 0.05] and Tw = 16. The

weights are plotted on the y axis with respect to time. Please note that the

graph shows that the proposed technique does not affect the learning rate. The

increase in frame rate provided by the proposed technique is illustrated in

Fig. 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.3 Frame rate (fps) and error % are plotted with respect to the weight update

interval Tw . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.4 (a) Synthetic distribution based on commonly observed surveillance videos

(b) KL divergence achieved by the proposed and the original method [2] . . 54

3.4 Instantaneous frame rates plotted against frame count using proposed scheme,

[2] & [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.5 Average precision-recall curves obtained using proposed scheme, [2] & [3]

for the 10 dataset videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.6 Detection results on the Hall [8] and video7 [9] dataset videos. (a) & (e) are original images from the Hall and video7 datasets respectively. (b) & (f) are the corresponding Ground truth images. (c) & (g) are the segmentation masks obtained using Lee [2]. (d) & (h) are the segmentation masks obtained using the proposed scheme . . . . . . . . . . . . . . . . . . . . . . 58

4.1 Proposed surveillance specific video coding architecture . . . . . . . . . . . 61

4.2 Basic sampling techniques (a) Random (b) Cluster (c) Stratified (d) Systematic 66

4.3 Stratified Adaptive Cluster Sampling (a) First stage (b) Second stage . . . . 67

4.4 GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-MD’) flow chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


4.5 Figure shows the sampling pattern of pixels in an image. The sampled pixels

are partitioned into 4 sparse sets A1, A2, A3, & A4. Also shown are the GMM

data structures of pixels mapped onto different cache lines to improve cache

locality. The models of the dominant modes are arranged in a contiguous

manner. Also, the data elements belonging to a single set of pixels are present

in a contiguous array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.6 Salient MB’s and Sampled pixel plot . . . . . . . . . . . . . . . . . . . . . . 71

4.7 Sequence of frames in display order . . . . . . . . . . . . . . . . . . . . . . . 75

4.8 Macroblock reference assignment . . . . . . . . . . . . . . . . . . . . . . . . 76

4.9 Skip Signalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.10 Pseudo Code of Proposed Reference Frame Selection Scheme . . . . . . . . 81

5.1 Snapshots from the video dataset (a) Entrance (b) Parking Lot (c) Access

Door (d) Backyard1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.2 RD data for the (a) Bridge & (b) Walkway video sequences . . . . . . . . . . 91

5.3 RD data for the (a) Access Door & (b) Entrance video sequences . . . . . . . 92

5.4 RD data for the (a) PETS-1 & (b) PETS-2 video sequences . . . . . . . . . . 93

5.5 Encoded frames of the (a) Light Switch (b) Bridge and (c) Low light video

sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.6 Encoded frames of the (a) CDW (b) PETS-2 and (c) PETS-3 video sequences 95

5.7 Figure on the left shows a poorly lit corridor scene with increased camera

gain settings. Also, on the right, 100 RGB sample values of a pixel (pixel P

in the image) from the video are plotted in the 3D RGB space. In the same

picture, the background GMM mode is shown, i.e. the points on the sphere

are at a distance of 2.5 σ (Mahalanobis distance) from the mean value of the

mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.8 Impact of varying Tsparse on GMM S-MD performance shown for the (a)

Walkway and (b) Backyard2 sequences . . . . . . . . . . . . . . . . . . . . . 102

5.9 Encoded frame from the ‘Parking lot’ video (a) Without Spatio-Temporal bias

(object is missed) (b) With Spatio-Temporal bias (object is detected) . . . . 103


5.10 Encoded frame from the ‘Parking lot’ sequence with (a) Ddense = 4 (b) Ddense = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.11 (a) Correctly detected foreground objects (marked in yellow) and (b) RD

data for the ‘Parking lot’ video . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.12 Impact of varying learning rate on GMM S-MD performance on (a) Bridge

and (b) Backyard2 sequences . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.13 Snaps of the encoded ‘Walkway’ dataset coded using (a) JM and (b) GMM

S-MD show that the proposed method does not produce any conspicuous

distortion in the background. The DPM detections (yellow rectangles) are

overlaid on the images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.14 Figure shows the PSNR plots for the ‘Walkway’ dataset frame in Fig. 5.13. The PSNR plots have been computed over the (a) entire frame and (b) Foreground regions. Although the proposed technique reduces the total PSNR, it significantly improves the RD performance for foreground image regions. . . 109

5.15 (a) and (b) show two encoded frames (with different sunlight intensities) in

the ‘Sunlight variation’ video. We observe that the proposed scheme does not

wrongly mark FG MB’s as ‘Skip’ under fast illumination changes. . . . . . . . 110

5.16 Slow reduction in illumination observed in the ‘Evening fade’ video. Encoded

frames captured (a) before and (b) after the reduction . . . . . . . . . . . . 110

5.17 No. of MB’s in the set B as a percentage of NMB (Total No. of MB’s in a

frame) for the ‘Entrance’ video . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.18 No. of FG→BG MB’s which require non-zero residual coding (i.e. No. of FG→BG 〈U〉 MB’s) in the ‘Entrance’ video . . . . . . . . . . . . . . . . . . . 113

6.1 Number of bits required to encode MB’s of a surveillance frame at uniform

quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.2 Skin pixel detection in a surveillance video frame . . . . . . . . . . . . . . . 122

6.3 Face region detection using the Viola Jones detector . . . . . . . . . . . . . . 123

6.4 Architecture of the proposed ROI, RORI and RONI detector . . . . . . . . . 126

6.5 Super pixels detected in a surveillance video frame . . . . . . . . . . . . . . 128


6.6 The shaded volume shown in the RGB color space is considered as shadow

pixel values by the weak shadow detector . . . . . . . . . . . . . . . . . . . 130

6.7 Pixel values of a surface are plotted from a video sequence. Intermittent foreground object motion causes shadows on the surface. . . . . . . . . . . . . . 131

6.8 Shadow scores of super pixels plotted for a surveillance video frame . . . . . 134

6.9 Skin scores of super pixels in a surveillance video frame . . . . . . . . . . . 136

6.10 HOG feature computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

6.11 DPM part filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.12 Edge enhancement of blob boundaries . . . . . . . . . . . . . . . . . . . . . 142

6.13 Proposed DPM cascade for pedestrian detection . . . . . . . . . . . . . . . . 143

6.14 Sample result of DPM cascade . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6.15 Geometry of the surveillance camera system showing ground planes at different elevations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

6.16 Sample surveillance video snapshots showing feasible and infeasible pedestrian hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.17 (a) Search region is initialized using the Kalman filter prediction. (b) Positive

and negative filters are applied on the FG blob to determine the left and right

bounds of the head region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.18 The five part-templates are shown here. Feature matching scores are accumulated over these part-templates. Correspondence vectors are computed for each part-template. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.19 The template and the current frames (separated in time by 10 frames) are shown. NCC scores of the five part-templates for the image in (a) are (0.77, 0.9, 0.87, 0.8, 0.81). The order of the scores is (left-upper-body, right-upper-body, head-shoulder, torso, upper body). NCC scores for the image in (b) are (0.93, 0.69, 0.88, 0.8, 0.83). Here, the score of the right-upper-body template is lower due to occlusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.20 Figure shows detector scores of a pedestrian on the Bilattice square . . . . . 158


6.21 (a) Figure shows a pedestrian detection and different super pixels in the

blob (b) The pedestrian bounding box is divided into face, torso and leg

rectangles. The super pixels in the leg region are assigned a prior RORI score

based on the distance ySP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.22 Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the ‘Entrance road’ video. The comparison results have been obtained using (I) Proposed method and (II) Only skip detection. The overall bitrate reduction using the proposed technique is 37.2%. The total face region distortion metrics using the proposed method and the FG skip detection encoder were both measured as 40.8 dB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

6.23 Figure shows frames from the ‘Entrance road’ video compressed using (a)

Only skip detection (b) Proposed ROI encoder. . . . . . . . . . . . . . . . . . 165

6.24 Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the

‘Porch’ video. The comparison results have been obtained using (I) Proposed

method and (II) Only skip detection. The overall bitrate reduction using the

proposed technique is 50.2%. The total face region distortion metrics using

the proposed method and the FG skip detection encoder were both measured

as 40.9dB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

6.25 Figure shows frames from the ‘Porch’ video compressed using (a) Only skip

detection (b) Proposed ROI encoder. . . . . . . . . . . . . . . . . . . . . . . 167

6.26 Figure shows that the proposed ROI encoder removes finer details in the

RORI MB’s but maintains image quality of the face region. . . . . . . . . . . 168

6.27 The table shows the different MB labeling errors and their consequences

(cells are color coded to signify the severity). Here, the rows correspond

to true MB labels and columns to the MB labels assigned by the ROI detector. 169


6.28 Figure shows that the DPM detector has failed to detect pedestrians A & B.

Pedestrian A is severely occluded by B. The head region of pedestrian B has

poor contrast. Accurate detection of pedestrian B would have reduced bit

cost of the frame by 15kbits. In contrast, detection of pedestrian A would

reduce bit cost by only 1kbit. . . . . . . . . . . . . . . . . . . . . . . . . . . 170

6.29 (a) Figure shows a few more DPM detector failures on small pedestrians (b)

Here, the tracker has tracked the pedestrian (bounded by the green box)

based on a previous detection. If only the DPM detector was applied on the

current frame, the pedestrian would have been missed. . . . . . . . . . . . . 171

6.30 (a) Figure shows localization error of the DPM detector. The detector has

included the shadow regions below the pedestrian (due to incorrect shadow

detection) in the bounding box (b) The torso has been detected as a head

shoulder region. Again, the shadow region has been included in the bound-

ing box due to incorrect detection. . . . . . . . . . . . . . . . . . . . . . . . 172

6.31 Figure shows three frames a, b & c (that are temporally ordered) in which the

DPM detector has detected the child before (i.e. in (a)) and after (i.e. in (c))

the occlusion. However, the detector has failed during the occlusion (i.e. in

(b)). If the child was not tracked by the tracker, his face image regions would

be encoded in low quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

6.32 Figure shows the tracker bounding box positions as the tracked pedestrian

gets occluded and reappears later. During the occlusion (i.e. in (b)), the

NCC score of the tracked pedestrian drops from 0.76 to 0.41. This would

trigger the execution of the DPM detector. However, since the template is not

updated, the tracker reassigns the correct bounding box when the pedestrian

reappears from occlusion in (c). The NCC score also increases to 0.7. . . . . 174

6.33 Bit count savings is plotted against the height of the pedestrian image in

pixels. QP = 24 for the video encoded without ROI coding. For the ROI

encoded video, QPROI = 24 and QPRORI = 32. . . . . . . . . . . . . . . . . 176


6.34 Scene shows multiple pedestrians in the scene. Pedestrians A & D cover a

large number of MB’s in the image. Hence, ROI detection on image regions

of these pedestrians provides higher bitrate savings. . . . . . . . . . . . . . . 177

B.1 GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-

MD’) flow chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

B.2 1D sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

B.3 ROC curve of the pixel level classifier for different values of v (or normalized

signal level) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

B.4 Sampler accuracy for different values of stride length (v = 3, L = 20, T = 2.5)196

B.5 Sampler accuracy for different values of pixel level classifier threshold (v =

3, L = 20, dsys = 10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

C.1 Double Hasse diagrams of different bilattices. In (c), a surveillance video

frame is shown. Also, the logic values of pedestrian and non pedestrian

image regions are shown in the double Hasse diagram. . . . . . . . . . . . . 207

C.2 Double Hasse diagrams show partial ordering based on belief and informa-

tion in bilattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

C.3 Construction of the square bilattice . . . . . . . . . . . . . . . . . . . . . . . 209


Chapter 1

Introduction

1.1 Recent Trends in Video Surveillance

The video surveillance market has undergone marked changes in the last decade. Early

surveillance camera networks were mostly installed and operated by government munici-

pal corporations, prisons, banks and casinos. However, since the beginning of this century,

we see an increased adoption of surveillance cameras in private premises including homes

and commercial buildings. Greater density of surveillance cameras has resulted in higher

coverage. This has significantly increased the use of surveillance footage in criminal investi-

gations and as evidence in judicial inquiries. Surveillance videos have been instrumental in

solving many crimes e.g. the murder of toddler James Bulger, the Boston marathon bomb-

ing & the London 7 July 2005 attacks. However, we frequently find instances where low

resolution surveillance footage has delayed or hindered investigation, for example in the

case of the Bangkok bombing & the Bodh Gaya bombings. Although consumer awareness

of the importance of high definition (HD) surveillance camera footage has increased, mar-

ket penetration of such cameras has been slow. This is primarily due to cost considerations

and insufficient security budgets.

Advances in VLSI manufacturing over the past decade have reduced the cost of HD cam-

eras. However, increasing the resolution of surveillance videos increases the bitrate of the

encoded streams. Also, low light conditions severely affect the compression performance

of video encoders. The increased bitrate due to these factors has resulted in higher data

communication and storage costs. Unlike camera and storage costs, data communication

expenses are recurring in nature. Hence they increase the operating costs of the surveil-

lance system. In addition, higher data communication bandwidth requirements of HD cam-

era streams would necessitate upgrading of the network infrastructure in some cases. Also,


multiple HD video streams routed to a central server would increase network congestion in

routers close to the server.

The aforementioned issues have hindered the adoption of high resolution surveillance

systems. Hence, it is very important to reduce the bitrate of HD camera videos to deliver the

operational requirement of the consumer. In this chapter, we discuss all the stated issues

in detail. We then describe the surveillance video encoder architecture that we propose to

address the challenges. We also enumerate the techniques proposed in this thesis to reduce

the bitrate and computational complexity of surveillance video encoders. The organization

of the thesis is provided in the concluding section of this chapter.

1.2 Bitrate increase in HD surveillance

1.2.1 How much resolution is ‘good enough’?

The BSIA (British Security Industry Association) code of practice for CCTV surveillance

systems document [1] classifies tasks based on the intended objectives as: Monitor, Detect,

Observe, Recognise & Identify. Table 1.1 lists the task and the image detail (equivalent pix-

els per meter) required at the target distance. It also lists the number of pixels that should

cover the face region in the image (The average width of the human face is 16 centimeters).

In this table, there is a subtle difference between recognition and identification tasks. The

recognition task requires that the observer has seen the pedestrian before. A typical use case

is to make the surveillance video public in media after the crime is committed (Many crimes

have been solved in which people recognise the accused in the publicly released videos and

report to the police). The identification task allows matching of the pedestrian image with

database records. Clearly, the identification task has higher utility in comparison with the

recognition task.

To gain a better understanding of the data in Table 1.1, we extract pedestrian image snapshots with image detail matching the resolution requirements. Fig. 1.1 shows these images for the Detect, Observe, Recognise & Identify tasks. We can easily appreciate the importance of capturing high resolution images to perform recognition and identification operations. Before deployment of a surveillance camera system, the task that the system is required to perform has to be decided. Based on this, the required resolution of the video can be determined using the parameters of the camera.

Table 1.1: BSIA standard [1] recommendations for image detail

| Task | Description | Pixels/m | Pixels/face (horizontal) |
|---|---|---|---|
| Monitor | View the number, direction and speed of movement of people (given that their presence is known) | 12.5 | - |
| Detect | Determine presence of any target (e.g. a person or vehicle) | 25 | 4 |
| Observe | View characteristic details of an individual (e.g. clothing) | 62.5 | 10 |
| Recognise | Identify individuals if operator has seen individual before | 125 | 20 |
| Identify | Identify individuals beyond reasonable doubt | 250 | 40 |


Figure 1.1: Images with minimum image detail required to perform (a) detection, (b) observation, (c) recognition & (d) identification surveillance tasks


Fig. 1.2 shows the model for a pinhole camera. Let the image detail (i.e. number of pixels that cover 1 meter length at the object distance) required for the surveillance task be $d^{task}_{I}$. The coverage of the camera is defined as the region on the ground plane where the image detail satisfies the requirements for the surveillance task (listed in Table 1.1). This region is specified by two parameters, $D^{task}_{max}$ and $S^{task}_{max}$ (see Figs. 1.2 and 1.3). $D^{task}_{max}$ is the maximum distance (from the camera) and $S^{task}_{max}$ is the maximum horizontal span within which the image detail requirements are met. Let $f$ be the focal length of the camera. Let $W$ & $H$ be the width and height of the image sensor respectively. Let $R_{horz}$ & $R_{vert}$ be the horizontal and vertical camera resolutions respectively. Let $\alpha_{horz}$ & $\alpha_{vert}$ be the horizontal and vertical angles of view respectively. $D^{task}_{max}$ & $S^{task}_{max}$ can be computed from the camera parameters using Eqns. 1.1, 1.2 & 1.3 (We note that the camera tilt has not been considered while deriving the equations. This would result in a small difference between the true coverage and coverage computed using Eqns. 1.1, 1.2 & 1.3).

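\begin{align}
D^{task}_{max} &= \min\left( \frac{R_{horz}\, f}{d^{task}_{I}\, W},\ \frac{R_{vert}\, f}{d^{task}_{I}\, H} \right) \tag{1.1} \\
&= \min\left( \frac{R_{horz}}{2\, d^{task}_{I} \tan(\alpha_{horz}/2)},\ \frac{R_{vert}}{2\, d^{task}_{I} \tan(\alpha_{vert}/2)} \right) \tag{1.2} \\
S^{task}_{max} &= \frac{R_{horz}}{d^{task}_{I}} \tag{1.3}
\end{align}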

Eqns. 1.1 & 1.2 show that for a given task and fixed camera lens parameters, the

coverage region increases linearly with the resolution. In Fig. 1.3, $D^{task}_{max}$ and $S^{task}_{max}$ (of

a MOBOTIX surveillance camera with a B036 lens [4]) required to perform recognition

operations are shown. The figure shows that the coverage region of low resolution cameras

is very small. For example, at VGA resolution, the maximum distance at which we can

recognize the subject is 2.4m. This clearly indicates that HD cameras are required to provide

reasonable coverage for recognition and identification tasks.
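The step from Eqn. 1.1 to Eqn. 1.2 is not spelled out in the text. The following is a short sketch of it for the horizontal term, assuming the ideal pinhole geometry of Fig. 1.2 with no camera tilt; the vertical term follows in the same way with $H$, $R_{vert}$ and $\alpha_{vert}$. The symbol $D$ (an arbitrary object distance) is introduced here only for the derivation.

```latex
\begin{align*}
\text{horizontal span imaged at distance } D &= \frac{D\,W}{f}, \\
\text{image detail at distance } D &= \frac{R_{horz}}{D\,W/f} = \frac{R_{horz}\, f}{D\, W} \;\ge\; d^{task}_{I}
  \;\Longrightarrow\; D \le \frac{R_{horz}\, f}{d^{task}_{I}\, W}, \\
\tan(\alpha_{horz}/2) = \frac{W}{2f}
  &\;\Longrightarrow\; \frac{f}{W} = \frac{1}{2\tan(\alpha_{horz}/2)}, \\
\text{span at } D = \frac{R_{horz}\, f}{d^{task}_{I}\, W}
  &\;\Longrightarrow\; \frac{D\,W}{f} = \frac{R_{horz}}{d^{task}_{I}}.
\end{align*}
```

The last line recovers Eqn. 1.3 for the case in which the horizontal term is the binding one in the minimum.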

Also, previously, when operators were compelled to use low resolution cameras (due to


cost considerations), surveillance monitoring over large regions was achieved by increasing

the number of cameras. However, it is preferred to have larger coverage regions using fewer

cameras since reducing the number of cameras avoids the installation and wiring of multi-

ple devices. This further motivates the adoption of high definition cameras to achieve the

surveillance task. Anticipating these market requirements, almost all the leading surveil-

lance camera manufacturers have introduced 12MP (4000 * 3000) resolution imagers in

their latest product offerings.

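Figure 1.2: Pinhole camera model. Labels in the figure: image sensor of width $W$ and height $H$, focal length $f$, horizontal angle of view $\alpha_{horz}$, maximum distance $D^{task}_{max}$, maximum span $S^{task}_{max}$ and image detail of the object $d^{task}_{I}$ pixels/meter.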


Figure 1.3: Top view showing the coverage of a MOBOTIX surveillance camera with a 3.6mm lens [4] for recognition tasks. Coverage regions are shown at 0.3MP (640x480), 3MP (1920x1536) and 6MP (3072x2048) resolutions; distance labels in the figure: 2.4m, 6m, 7.7m, 10.3m, 19.4m and 25.9m.

1.2.2 Bitrate versus camera resolution

Fig. 1.4 shows the average typical optimized bandwidth of Bosch security cameras [5, 6]

plotted against the image resolution. This data suggests network bandwidth requirements

to be in the range of 4 - 6Mbps for 12MP H.264 surveillance videos. Such a camera used

for identification tasks can provide coverage up to 11m (when horizontal angle of view is

set to 70°). In comparison, the bitrate of a VGA stream is ≈ 600kbps. However, coverage

distance achieved by the VGA camera for the identification task is only ≈ 1.8m. Hence, a

VGA camera would not be suitable for such tasks. The graph also shows that incrementing

the frame rate increases the bitrate only sublinearly. This is due to efficient removal of

temporal redundancy by the H.264 encoder when frames are more closely spaced in time.
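The coverage distances quoted above can be checked against Eqns. 1.2 & 1.3 with a few lines of Python. This is a minimal sketch, not the procedure used in the thesis: the function names are illustrative, and the vertical angle of view is derived from the horizontal one under the assumption of square pixels, since only the 70° horizontal angle is given in the text.

```python
import math

def vertical_aov_deg(aov_horz_deg, r_horz, r_vert):
    """Vertical angle of view, ASSUMING square pixels (not stated in the text)."""
    half_h = math.radians(aov_horz_deg) / 2
    return math.degrees(2 * math.atan(math.tan(half_h) * r_vert / r_horz))

def max_distance(r_horz, r_vert, detail_px_per_m, aov_horz_deg, aov_vert_deg):
    """Eqn. 1.2: largest distance (m) at which the required image detail is met."""
    horz = r_horz / (2 * detail_px_per_m * math.tan(math.radians(aov_horz_deg) / 2))
    vert = r_vert / (2 * detail_px_per_m * math.tan(math.radians(aov_vert_deg) / 2))
    return min(horz, vert)

def max_span(r_horz, detail_px_per_m):
    """Eqn. 1.3: largest horizontal span (m) at which the image detail is met."""
    return r_horz / detail_px_per_m

IDENTIFY = 250  # pixels/m for the Identify task (Table 1.1)
AOV_HORZ = 70   # horizontal angle of view in degrees, as in the text

d_12mp = max_distance(4000, 3000, IDENTIFY, AOV_HORZ,
                      vertical_aov_deg(AOV_HORZ, 4000, 3000))  # about 11.4 m
d_vga = max_distance(640, 480, IDENTIFY, AOV_HORZ,
                     vertical_aov_deg(AOV_HORZ, 640, 480))     # about 1.8 m
print(f"12MP: {d_12mp:.2f} m, VGA: {d_vga:.2f} m")
```

Under these assumptions the results are consistent with the roughly 11m and 1.8m coverage distances stated above; with square pixels the horizontal and vertical terms of Eqn. 1.2 coincide.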


Figure 1.4: Average typical optimized bitrate of Bosch security cameras [5, 6] plotted against the camera resolution

1.3 Bitrate increase in low light surveillance

In the previous section, the average bitrate of Bosch surveillance cameras was found to

increase up to 6Mbps when 12MP cameras are used. In low light conditions, the encoded

video bitrate increases further due to increased noise in the camera image. To gain insight

into this issue, we first need to understand the interplay between image noise, camera blur

and scene lighting.

1.3.1 Interplay between exposure, gain and noise

In Fig. 1.5, we show a surveillance camera image captured under low light conditions. Fig.

1.6 shows snapshots from the scene captured with high and low exposure settings. We can

observe that the facial details are blurred when exposure setting is high. Recognition tasks

will not be possible with such videos. We now set the exposure such that the blur is reduced

(like in Fig. 1.6b) and vary the gain of the camera. When the gain of the camera is low,

the visibility is very poor. Hence, the gain has to be increased to enable recognition of the


person in the video.

However, increasing the gain of the camera increases the noise in the image. This can

be clearly seen in Fig. 1.7b. This noise in the image is caused by amplification of the

photon and read noise of the imager [10]. Low light performance can be improved by

using bigger image sensors. However, the cost of cameras based on large image sensors is

significantly higher than that of small pixel size cameras. Hence, surveillance installations very

commonly utilize small image sensors.

Figure 1.5: Low light scene with nominal exposure and gain settings


(a) Long exposure time (b) Short exposure time

Figure 1.6: Surveillance snapshots with different camera exposure settings (camera gain

has been increased to improve visibility and contrast)

(a) Low gain (b) High gain

Figure 1.7: Surveillance snapshots with different camera gain settings


1.3.2 Bitrate versus noise

To understand the impact of noise on the bitrate of the compressed surveillance footage, we

have varied the camera gain and encoded the video using the x264 video encoder [11]. Fig.

1.8 shows the bitrate plotted against the noise. Here, noise is computed as the standard

deviation of a pixel time series (100 frames) averaged over the entire image. Only the gray

scale values of the pixels are considered. The scene did not have any foreground objects.

Fig. 1.8 shows that the bitrate increases drastically as the pixel noise increases. The high

frequency noise content in the image results in large residual content and hence severely

affects the compression performance (It is well known that explosions in gaming videos

result in very high bitrates).
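The noise measure described above (temporal standard deviation of each pixel, averaged over the image) can be sketched as follows. This is an illustrative implementation only: the function name is hypothetical, frames are assumed to be gray scale and stored as nested lists, and the population standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import pstdev

def temporal_noise(frames):
    """Mean over all pixels of the per-pixel standard deviation across frames.

    frames: list of T gray scale frames, each a list of rows of pixel values.
    """
    height = len(frames[0])
    width = len(frames[0][0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            series = [frame[y][x] for frame in frames]  # pixel time series
            total += pstdev(series)                     # population std (assumption)
    return total / (height * width)

# Synthetic 2x2 scene, 100 frames: one pixel alternates between 10 and 12,
# the other three pixels are constant.
frame_a = [[10, 20], [30, 40]]
frame_b = [[12, 20], [30, 40]]
frames = [frame_a, frame_b] * 50
noise = temporal_noise(frames)
print(noise)
```

In the synthetic example only one of the four pixels fluctuates, so the averaged value is a quarter of that pixel's standard deviation.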

Figure 1.8: Bitrate versus pixel noise in a low light scene


1.4 Challenges due to increased bitrate

From the discussion in Section 1.2, it is clear that higher resolutions (12MP) are required to

enable recognition and identification capabilities of the surveillance system. Unfortunately,

such high resolution video streams require large data communication bandwidths. In a

typical scenario where we need to support a large number of cameras (e.g. in a campus),

transporting such high bandwidth video streams necessitates high speed Ethernet LAN

infrastructure. Existing network systems which have been designed for VGA or slightly higher

resolutions might require upgrading. High bitrate video streams also increase the storage

cost. In the case of surveillance encoders that employ rate control, insufficient network

bandwidth would force the encoder to reduce the image quality or to reduce the frame

rate.
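To give a sense of scale for the storage and network load, the bitrates quoted in Section 1.2.2 can be converted into storage per day and streams per link. This is back-of-the-envelope arithmetic only; the 100 Mbps link capacity is a hypothetical value chosen for illustration, and container and protocol overheads are ignored.

```python
def storage_gb_per_day(bitrate_bps):
    """Storage (decimal GB) needed to record one stream continuously for 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return bitrate_bps * seconds_per_day / 8 / 1e9

HD_BPS = 6_000_000          # upper end of the 4 - 6Mbps range quoted for 12MP
VGA_BPS = 600_000           # approximate VGA bitrate quoted in Section 1.2.2
LINK_BPS = 100_000_000      # hypothetical link capacity (assumption)

hd_gb = storage_gb_per_day(HD_BPS)    # 64.8 GB per day
vga_gb = storage_gb_per_day(VGA_BPS)  # 6.48 GB per day
hd_streams = LINK_BPS // HD_BPS       # 16 streams fit on the link
vga_streams = LINK_BPS // VGA_BPS     # 166 streams fit on the link
print(hd_gb, vga_gb, hd_streams, vga_streams)
```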

Also, the discussion in Section 1.3 showed that low light conditions exacerbate the issues

of high bandwidth requirement. Surveillance videos captured in poor lighting conditions

are very common. Identification tasks are inherently difficult in such videos. High bitrate

would cause rate control based encoders to further reduce the quality. This would make

such videos unusable.

‘On demand’ video surveillance systems have been used increasingly for applications

such as temporary asset monitoring & crisis monitoring. Surveillance camera coverage

could be increased temporarily during specific events (predictable events, e.g. sporting

events or unpredictable events, e.g. street protests) by building ad hoc networks of low

power battery operated wireless cameras. Video surveillance data communication for such

applications in the market [12] currently uses 3G / 4G (LTE and HSPA+) cellular standards.

However, delivering high bandwidth video data over such wireless networks would severely

limit the number of cameras that can be deployed. Higher bitrate also increases operating

costs due to higher data communication costs and increased storage requirements.

In addition to bitrate reduction, it is also important to reduce the computational

complexity and power consumption of the camera platform. Computational complexity

reduction helps in reducing the cost of the camera. Also, reducing the complexity lowers the power


consumption. Although power is not a very big concern in fixed camera platforms (pow-

ered from a wall socket), it is very critical in On-Demand surveillance applications where

the camera is powered by a battery. The total power consumption of such platforms is equal

to the sum of the image sensor power, video encoder power and the network device (wired

/ wireless) power. The power consumption of the encoder and the network device dominates

the total platform power. For example, the Omnivision OV9712 720p resolution image sen-

sor consumes 110mW. In comparison, the 720p resolution Bosch TINYON IP 2000 camera

platform consumes 2.65W. Increasing the video resolution increases the power consumed

by the encoder and network device. For example, the 12MP resolution Bosch DINION IP

ultra 8000 MP camera consumes 9W of power. For On-Demand surveillance applications,

such high energy requirements would reduce the operational time of the camera.
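The effect of platform power on operational time can be illustrated with the figures quoted above. The battery capacity below is a hypothetical value used only for the example, and the calculation assumes constant power draw and ignores conversion losses.

```python
def runtime_hours(battery_wh, platform_power_w):
    """Idealized operating time: battery energy divided by constant power draw."""
    return battery_wh / platform_power_w

BATTERY_WH = 50.0     # hypothetical battery capacity (assumption)
SENSOR_W = 0.110      # Omnivision OV9712 image sensor, as quoted
PLATFORM_720P_W = 2.65  # Bosch TINYON IP 2000 platform, as quoted
PLATFORM_12MP_W = 9.0   # Bosch DINION IP ultra 8000 MP, as quoted

sensor_share = SENSOR_W / PLATFORM_720P_W              # about 4% of platform power
hours_720p = runtime_hours(BATTERY_WH, PLATFORM_720P_W)  # about 18.9 hours
hours_12mp = runtime_hours(BATTERY_WH, PLATFORM_12MP_W)  # about 5.6 hours
print(f"{sensor_share:.1%}, {hours_720p:.1f} h, {hours_12mp:.1f} h")
```

On the same hypothetical battery the 12MP platform would run for less than a third of the time of the 720p platform, which is the concern raised for On-Demand deployments.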

1.5 Thesis Contribution

In the previous section, we have described the challenges of increased bitrate & computa-

tional complexity in surveillance camera systems. In this thesis, we propose four techniques

to alleviate these challenges. We now provide a brief introduction of our contributions in

this section.

1.5.1 Proposed surveillance video encoder architecture

Fig. 1.9 shows the high level architecture of the surveillance encoder system. The input

video frames from the camera sensor are stored in an image buffer. The proposed tech-

niques analyze the frames and generate control parameters (e.g. MB QP’s, skip decision of

MB’s, index of the reference frame to be replaced) that are signaled to the H.264 encoder.

The frames that have been analyzed by the proposed techniques are compressed using the

H.264 video encoder. The output NAL unit stream is transmitted using a wired / wireless

communication module.


Figure 1.9: Architecture of the proposed surveillance video encoder. Blocks in the figure: camera video stream; frame buffer; proposed techniques for bitrate reduction (speeded up GMM + skip decision, reference frame selection, face ROI region detection); H.264/AVC encoder (motion estimation & compensation, mode decision, DCT and quantization, entropy coding, reconstruct); NAL buffer.

1.5.2 Bitrate & computational complexity reduction

The four techniques we introduce in this thesis to reduce bitrate and computational com-

plexity of surveillance encoders are: (I) Speeded up foreground segmentation (II) Skip de-

cision (III) Reference frame selection & (IV) Face Region-of-Interest (ROI) coding. To offer

better insights of our contributions, we partition the bitrate cost and describe how the pro-

posed techniques reduce each cost component. The bit cost of a static-camera surveillance

video stream can be partitioned as follows:

• Background image region coding cost

• Uncovered background image region coding cost

• Shadow image region coding cost

• Non face image region (clothing, arms) coding cost

• Face image region coding cost

Background, uncovered background and shadow image regions do not contain useful

information required to perform surveillance tasks. Hence, coding of these regions can be

skipped. Image regions of foreground objects such as cars and pedestrians in the scene are


required. However, high image fidelity on all the foreground image regions is not required.

For example, in pedestrian surveillance, only the face image regions need to be encoded

in high quality. Non face image regions, i.e. images capturing clothing and arms can be

encoded in lower quality. From this discussion, it is clear that optimal bit allocation based

on regions of interest (ROI) can help to reduce the bitrate without affecting the surveil-

lance task. However, such a ROI encoder system should accurately label image regions as

‘background’, ‘foreground’, ‘non face’ and ‘face’. Incorrect marking of ‘face’ regions as ‘non

face’/‘background’ will severely impact the utility of the encoded video. Also, as already

noted in the previous section, the computational complexity of the ROI detector should

be minimized. In this thesis, we introduce multiple techniques to accurately determine

regions of interest required to reduce the bitrate.

The uncovered background image regions cannot be marked as ‘Skip’ if the reference

frames do not contain the appropriate image content required for reconstruction. Hence,

we introduce a technique to optimally select the reference frames required to reconstruct

uncovered background image regions. We also describe the computation of the H.264 en-

coder parameters (skip mode, QP and reference frame index) required to implement the

proposed techniques in this thesis.

We now provide a synopsis of the proposed techniques:

1. Speeded up foreground segmentation: To label background image regions, we pro-

pose to use the GMM based segmentation algorithm. The variance parameter of the

GMM algorithm models the noise statistics in low light conditions and the multiple

modes of the GMM capture environment noise, hence reducing false positives. How-

ever, the GMM algorithm is compute intensive. Hence, we propose a modification to

the adaptive Gaussian Mixture Model (GMM) based foreground segmentation algo-

rithm to reduce computational complexity. This is achieved by replacing expensive

floating point computations with low cost integer operations. To maintain accuracy,

we compute periodic floating point updates for the GMM weight parameter using the

value of an integer counter. Experiments show speedups in the range of 1.33 - 1.44

on standard video datasets where a large fraction of pixels are multimodal.


2. Skip decision: As we have already seen, bit cost of background regions is very high

under low light conditions. Even under good lighting, bit cost of background regions

increases with increase in environmental noise (e.g. shaking tree leaves). To detect

such noisy background regions, we propose a spatial sampler based skip decision

technique. The spatially sampled pixels are segmented using the speeded up GMM

algorithm. The storage pattern of the GMM parameters in memory is also modified

to improve cache performance. Skip selection is performed using the segmentation

results of the sampled pixels. Using a two stage sampler reduces the computation

complexity without affecting the accuracy of the skip detector. Experimental results

show bit rate savings of up to 94.5% over methods proposed in literature on video

surveillance data sets. The proposed techniques also provide up to 74.5% reduction in

compression complexity without increasing the distortion over the foreground regions

in the video sequence.

3. Reference frame selection: A reference frame selection algorithm is proposed to

maximize the number of background Macroblocks (MB’s) (i.e. MB’s that contain

background image content) in the Decoded Picture Buffer. This reduces the cost of

coding uncovered background regions. Distortion over foreground pixels is measured

to quantify the performance of skip decision and reference frame selection techniques.

4. Face Region-of-Interest (ROI) coding: A face ROI encoding technique for pedestrian

surveillance is proposed. Face and shadow region detection is combined with the skip

decision algorithm to perform ROI coding for pedestrian surveillance videos. Since

person identification requires high quality face images (as shown earlier), MB’s containing face

image content are encoded with a low Quantization Parameter (QP) setting (i.e. high

quality). Other regions of the body in the image are considered as RORI (Regions of

reduced interest) and are encoded at low quality. The shadow regions are marked as

Skip. Techniques that use only facial features to detect the ROI MB’s (e.g. Viola Jones

face detector) are not robust in real world scenarios. Hence, to accurately determine

the ROI, RORI & RONI MB’s, we combine the outputs of multiple detectors. We pose


the MB labelling task as a super pixel classification problem. Shadow and skin detec-

tor scores of super pixels are computed. Pedestrians are detected using deformable

part models. The face region is determined using the deformed part locations. De-

tected pedestrians are tracked using an optical flow based tracker combined with a

Kalman filter. The tracker improves the accuracy and also avoids the need to run

the object detector on already detected pedestrians. Bilattice based logic inference is

used to combine multiple likelihood scores and determine the labels of the super pix-

els. The coding mode and QP values of the MB’s are computed using the super pixel

labels. Results show that the proposed face ROI coding technique provides a further

reduction in bitrate of up to 50.2%.
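The last step of the face ROI technique, turning region labels into per-MB encoder parameters, can be sketched as below. This is a hypothetical illustration and not the rule used in Chapter 6: the label names, the `MBParams` type and the priority ordering (face over non face over shadow/background, chosen because mislabelling a face MB is described above as the most damaging error) are assumptions. The QP values 24 and 32 are the QPROI and QPRORI settings quoted in the caption of Fig. 6.33.

```python
from dataclasses import dataclass
from typing import Optional

QP_ROI = 24    # face MBs (value quoted in the Fig. 6.33 caption)
QP_RORI = 32   # non face body MBs (value quoted in the Fig. 6.33 caption)

@dataclass(frozen=True)
class MBParams:
    skip: bool
    qp: Optional[int]  # None when the MB is skipped

# Assumed priority: an MB takes the most important label among the
# super pixels that overlap it, so face content is never downgraded.
PRIORITY = {"face": 2, "non_face": 1, "shadow": 0, "background": 0}

def mb_label(superpixel_labels):
    """Label of an MB from the labels of its overlapping super pixels."""
    return max(superpixel_labels, key=PRIORITY.__getitem__)

def mb_params(label):
    """Map an MB label to (skip, QP) control parameters for the encoder."""
    if label == "face":
        return MBParams(skip=False, qp=QP_ROI)
    if label == "non_face":
        return MBParams(skip=False, qp=QP_RORI)
    return MBParams(skip=True, qp=None)  # shadow and background MBs

params = mb_params(mb_label(["background", "non_face", "face"]))
print(params)
```

An MB that touches any face super pixel is therefore coded at the ROI quality in this sketch, at the price of spending a few extra bits on MBs that straddle the face boundary.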

1.6 Organization of the thesis

A review of state-of-the-art video coding techniques for surveillance is presented in Chapter

2. Chapter 3 describes the technique we propose to reduce computational complexity of

foreground segmentation. The results of the proposed algorithm are also included in the

same chapter. Chapter 4 introduces a skip decision technique for H.264 surveillance video

coding. A reference frame selection algorithm for H.264 surveillance video coding is also

described in Chapter 4. The results of the skip decision and reference frame selection

techniques are provided in Chapter 5. The Face ROI encoder that we propose is described

in Chapter 6. Chapter 7 concludes the thesis and discusses a few open problems for future

work.

Chapter 2

Background and Related Work

2.1 Introduction

In the previous chapter, we described the challenges faced by consumers due to increasing

bandwidth requirements of surveillance video cameras. We also listed the four techniques

we propose to reduce bitrate and computational complexity of surveillance encoders. We

achieve this by exploiting unique characteristics and requirements of static camera surveil-

lance video encoders. Before we delve into the details of these techniques in the successive

chapters, we review the state-of-the-art surveillance video compression algorithms here.

We begin by providing a concise summary of different video encoding techniques. We

briefly describe the H.264 standard and a few relevant aspects. Following this, we review

bitrate and computational complexity reduction techniques proposed by previous researchers.

2.2 A review of video encoding techniques

Video coding has evolved rapidly with the development of Motion JPEG, MPEG-1, MPEG-2,

MPEG-4 Part 2 (MPEG-4 Visual), H.264/MPEG-4 AVC, VP8, VP9 and HEVC video standards.

Along with the standardization of video coding techniques by committees, research groups

in industry and academia have explored various video coding ideas. All these techniques

can be broadly classified as lossy and lossless. Lossless compression permits the decoder

to reconstruct the original video from the compressed data. Lossy compression allows only

the reconstruction of an approximation of the original video. Well designed lossy com-

pression systems provide good bitrate savings without degrading the quality too much. All

surveillance systems use lossy compression techniques which we will now briefly review.


1. Discrete Cosine Transform (DCT) based intra frame coding: Each video frame of

the video sequence is compressed separately using DCT based image coding. The

quantized coefficients are reordered and losslessly packed into the output bit stream.

DCT intra frame encoders do not exploit the inter frame redundancy and hence have

low compression ratios. However, this coding technique is very robust against com-

munication channel errors. MJPEG is a popular DCT intra frame coding standard

which was used by early surveillance systems. Current surveillance systems continue

to offer capability of streaming MJPEG format videos [5].

2. Object based video coding: The scene image is considered to be a collection of

multiple video objects (VO). The background is also considered as a video object.

The VO’s are coded separately. The objects can have arbitrary shapes. This coding

technique also allows individual objects to be encoded with different quantization

parameters. It also allows different temporal resolutions to be specified for the objects.

Hence, object based video standards seamlessly allow encoders to implement ROI

compression.

MPEG-4 Part 2 (MPEG-4 Visual) is one of the popular standards that support object based video

coding. This standard specifies shape coding, motion compensation and texture cod-

ing of arbitrary-shaped video objects. In shape coding, the shape of the edge of the

object needs to be specified by the encoder. The mask which indicates the pixels that

are part of the VO is coded using Context based binary Arithmetic Encoding. Each

pixel is assigned a transparency parameter. Block DCT of the gray scale transparency

data is quantized, reordered, run-level and entropy coded. Richardson [13] provides

a detailed discussion of the MPEG-4 standard.

3. Block based DPCM/DCT coding: This has been one of the most popular coding

techniques adopted by a majority of the successful video compression standards, e.g.

MPEG-1, MPEG-2, H.261 and H.263, H.264/MPEG-4 AVC, VP8, VP9, HEVC. Encoders

supporting these standards consist of four stages, i.e. motion estimation & compensa-

tion, transform computation, entropy coding and reference frame reconstruction. The


H.264 standard which is used in this thesis is described in more detail in Section 2.3.

4. Distributed video coding: Motivated by theoretical results of distributed source cod-

ing by Slepian and Wolf [14, 15], there have been a lot of attempts to shift the

computational complexity from the encoder to the decoder [16, 17, 18]. Here, tem-

poral redundancy of the video sequence is exploited only at the decoder, i.e. motion

estimation is performed at the decoder to obtain estimates for the side information.

The complexity of the encoder is very low since it only uses intra frames. This has

special appeal in the case of battery operated wireless video surveillance systems in

which reducing the complexity of the encoder increases the life of the system. The

decoder would be part of the network transcoder or a server connected to a power

line. The robustness of this technique against communication errors is also higher

than inter frame encoders since errors do not accumulate over multiple frames.

Until recently, H.264 has been the best performing video coding standard widely used by

surveillance video encoders. It is a block based DPCM/DCT coding standard that specifies a

large set of encoding techniques for efficient video compression, e.g. variable block-size mo-

tion compensation, multiple reference frames, Bi-directional predicted frames, various intra

prediction modes, six tap filtering, an in-loop deblocking filter, CABAC. HEVC which is the

newly finalized standard has improved upon H.264 by introducing numerous advances, e.g.

large coding tree blocks, more intra prediction directions, improvements to the deblocking

filter, adaptive motion vector prediction. However, these techniques are computationally

very expensive and require dedicated hardware accelerators to implement the encoder. As

a result, surveillance camera manufacturers continue to mostly use the H.264 standard in

almost all their product offerings. Hence, in this thesis, all the proposed techniques have

been implemented and tested using the H.264 encoder. However, we note that the algo-

rithms introduced here can also be applied to the new DPCM/DCT based standards (i.e.

HEVC and VP9). In the next section, we describe the H.264 video encoding standard, with

greater emphasis provided on techniques relevant to this thesis.


2.3 H.264 basics

The architecture of the H.264 encoder [7] is shown in Fig. 2.1. Inter prediction (Motion

estimation & Motion compensation) and intra prediction are performed on the macroblocks

(MBs) in the input frame FC . Mode decision determines the best mode for each of the

macroblocks. The MB residuals for the best mode are computed and are further transform-

quantized. The quantized coefficient data is reordered in a zig zag sequence to ensure

that the low frequency DCT coefficients are clustered. The reordered residual data, motion

vector values and associated header information for each macroblock are entropy coded

and packed into NAL units.
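The zig zag reordering step mentioned above can be illustrated with a minimal sketch. It assumes a 4x4 block of quantized coefficients and the frame-mode zig zag scan order; the helper names and the sample block are hypothetical and are not taken from the thesis.

```python
# Zig zag scan order for a 4x4 block, written as raster indices (row * 4 + col).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_reorder(block):
    """Flatten a 4x4 block (list of 4 rows) into zig zag order."""
    flat = [value for row in block for value in row]
    return [flat[i] for i in ZIGZAG_4X4]


def trailing_zero_run(coeffs):
    """Count the zeros at the end of a reordered coefficient list."""
    run = 0
    for value in reversed(coeffs):
        if value != 0:
            break
        run += 1
    return run


# Hypothetical quantized block: energy concentrated in the low frequencies.
quantized = [
    [7, 3, 0, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
scanned = zigzag_reorder(quantized)
zeros_at_end = trailing_zero_run(scanned)
```

After reordering, the non-zero low frequency coefficients sit at the front of the list and the zeros form one long run at the end, which is what makes the subsequent entropy coding efficient.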

The encoder would require the decoded frame to perform Motion Estimation (ME) on

future frames. Hence, inverse transform & inverse quantize operations are performed on the

quantized data and the reconstructed frame is stored in the Decoded Picture Buffer (DPB).

An in-loop deblocking filter is also applied on the reconstructed frame before storing it into

the DPB.


[Figure 2.1 block diagram. Blocks shown: current frame FC, reference frames F'x and F'y (held in the DPB), ME, MC, choose intra prediction, intra prediction, mode decision, T, Q, reorder, entropy encoding, NAL (encoded video stream), rate control, Q-1, T-1, filter, and the reconstructed frame F'C.]

Abbreviations: ME: Motion estimation; MC: Motion compensation; T: Transform; Q: Quantize

Figure 2.1: Architecture of the H.264 encoder [7]


2.3.1 Reference frames

Inter predicted MB’s use image data in the reference frames (stored in the DPB) to reduce

the residual content. After the current frame is compressed, it is reconstructed and is stored

in the DPB for inter prediction by future frames. The H.264 standard allows storage of up

to 16 frames in the DPB. Along with this, the standard also allows the encoder to manage

the DPB, i.e. to insert or remove frames from the DPB. The standard specifies two ways to

achieve this: (I) Sliding window in which the oldest short term reference frame is removed

(II) Adaptive memory control which is supported through Memory Management Control

Operations or MMCO commands. Reference pictures in the DPB are marked as either short

term or long term. The oldest short term picture is removed from the DPB when the DPB

is full. Fig. 2.2 shows an example where MMCO commands have been used to manage

the DPB. At time instance n + 1, the reconstructed frame has been marked as a long term

reference picture and has been inserted into the DPB with long term frame index set to 4.

[Figure 2.2 panels, DPB contents listed as short term frames | long term frames:

(a) Frame n: 45, 48, 52, 54 | 1

(b) Frame n+1: 45, 48, 52, 54 | 1, 4

(c) Frame n+2: 48, 52, 54, 57 | 1, 4

(d) Frame n+3: 48, 52, 54, 57 | 4, 5]

Figure 2.2: Decoded Picture Buffer managed using MMCO commands
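The two DPB management modes described above can be sketched with a toy buffer. This is a simplified illustration and not the marking process of the standard: the class name `ToyDPB`, its methods, the capacity of 6 and the frame numbers are all assumptions chosen for the example.

```python
class ToyDPB:
    """Toy decoded picture buffer with sliding window and long term marking."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.short_term = []  # frame numbers, oldest first
        self.long_term = {}   # long term frame index -> frame number

    def _size(self):
        return len(self.short_term) + len(self.long_term)

    def insert_short_term(self, frame_num):
        # Sliding window: drop the oldest short term frame when the buffer is full.
        if self._size() >= self.capacity and self.short_term:
            self.short_term.pop(0)
        self.short_term.append(frame_num)

    def insert_long_term(self, frame_num, long_term_idx):
        # MMCO-like command: store the frame as long term under the given index.
        # A frame already stored under that index is replaced.
        is_new_slot = long_term_idx not in self.long_term
        if is_new_slot and self._size() >= self.capacity and self.short_term:
            self.short_term.pop(0)
        self.long_term[long_term_idx] = frame_num

    def remove_long_term(self, long_term_idx):
        # MMCO-like command: explicitly release a long term frame.
        self.long_term.pop(long_term_idx, None)


dpb = ToyDPB(capacity=6)
for n in (45, 48, 52, 54):
    dpb.insert_short_term(n)
dpb.insert_long_term(frame_num=30, long_term_idx=1)  # hypothetical earlier frame
dpb.insert_long_term(frame_num=55, long_term_idx=4)  # new long term picture, index 4
dpb.insert_short_term(57)                            # buffer full: frame 45 slides out
```

Long term frames are never evicted by the sliding window in this sketch; they leave the buffer only through an explicit command, which is the property that makes them useful as background references.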

Prediction for each MB can be obtained from either a single frame (in a P type slice)

or from two frames (in a B type slice). H.264 specifies two reference lists which contain

‘Picture Order Count’ (POC) values of frames in the DPB. For MB’s in a P frame, only a single

reference picture is used for prediction. Hence, only the index into reference picture List 0


is signaled in the bitstream. B frame MB’s utilize two frames for prediction and consequently

require the encoder to specify the indices of both the reference frame lists. B frames increase

the latency and the complexity of the encoder. In this thesis, we utilize only P slices. Hence,

we review reference frame list management details for only P frames.

Fig. 2.3 shows the List 0 and a sequence of video frames. Macroblock MB1 in the

current frame uses a previous picture (frame_num 45) for motion compensation. The

previous picture with frame_num equal to 45 is marked as a short term reference frame

in the DPB. Its position in List 0 (zero indexed from top) is 1. Hence, the encoder signals

List0(1) as the reference frame in the mb_pred syntax element (SE) for MB1.

The default reference picture list order places the short term frames on top of the list

in decreasing order of PicNum. The long term reference frames are placed below the short

term frames in increasing order of LongTermPicNum. The H.264 standard allows the en-

coder to change the order in the list using Reference Picture List Reordering (RPLR) com-

mands. In the example in Fig. 2.3, we can see that reordering has been performed to bring

the long term frame 4 onto the top of List 0.
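The default ordering and a single reordering step can be sketched as follows. The function names are hypothetical, `move_to_front` is only a simplified stand-in for an RPLR command, and the picture numbers are example values.

```python
def default_list0(short_term_pic_nums, long_term_pic_nums):
    """Default P slice order: short term descending, then long term ascending."""
    short_term = sorted(short_term_pic_nums, reverse=True)
    long_term = sorted(long_term_pic_nums)
    return [("ST", n) for n in short_term] + [("LT", n) for n in long_term]


def move_to_front(ref_list, entry):
    """Simplified reordering: place one entry at index 0, keep the rest in order."""
    if entry not in ref_list:
        raise ValueError("entry is not in the reference list")
    return [entry] + [e for e in ref_list if e != entry]


initial = default_list0([45, 42, 40, 39], [1, 4])
reordered = move_to_front(initial, ("LT", 4))
index_of_45 = reordered.index(("ST", 45))
```

With the long term frame moved to the top, short term frame 45 shifts from index 0 to index 1, so a macroblock predicted from it would signal List0(1).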

2.3.2 Macroblock Skip mode

Residual data in static scenes is very low. Since such static image blocks are commonly

found, the H.264 standard allows the encoder to mark these MB’s as skip. No transform

coefficient data or motion vector data is transmitted for a skipped MB. The decoder reconstructs

the MB image using motion vector prediction (MVP). The reference picture for a

skip MB is always the frame indexed at the top of List 0, i.e. List0(0).
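As a rough illustration of motion vector prediction, the sketch below takes the component-wise median of three neighbouring motion vectors, which is the predictor commonly described for H.264. It is an assumption-laden simplification: the special cases of the standard (unavailable neighbours, the zero motion inference for skipped MBs) are omitted, and the function name is hypothetical.

```python
def median_mv_prediction(mv_left, mv_top, mv_top_right):
    """Component-wise median of three neighbouring motion vectors (x, y)."""
    xs = sorted([mv_left[0], mv_top[0], mv_top_right[0]])
    ys = sorted([mv_left[1], mv_top[1], mv_top_right[1]])
    return (xs[1], ys[1])


# Hypothetical neighbouring motion vectors in quarter-pixel units.
predicted = median_mv_prediction((1, 2), (3, 0), (2, 5))
```

Because the decoder can compute the same prediction from already decoded neighbours, a skipped MB needs no motion vector in the bitstream.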

2.3.3 Macroblock QP signaling

The standard provides a combined definition for the transform, scaling and quantization

operations performed on the residual data of a MB (Richardson in [7] provides a very

good description of the transform and quantization computations in H.264). Let matrix X

represent the image content in a MB (under default settings, X is a square matrix with size

4 × 4). Let Y denote the output of the transform, scaling and quantization operations. Then,


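[Figure 2.3 diagram: the DPB holds short term frames 45, 42, 40, 39 and long term frames 4, 1. List 0 has been reordered; its entries as extracted are LT 4, ST 45, ST 40, ST 42, ST 39, LT 1. The current frame has frame_num = 46. MB1 signals List0(1) and MB2 signals List0(3) as the MB reference frame List 0 index. The frames in the video sequence (frame_num 42 to 45) are shown with reference frames shaded in gray.]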

Figure 2.3: Reference frame list management in H.264

$$Y = \mathrm{round}\left( [C_f] \cdot [X] \cdot [C_f^T] \circ m(QP \% 6) \cdot \frac{1}{2^{15 + \mathrm{floor}(QP/6)}} \right) \qquad (2.1)$$

Here Cf is the forward core transform matrix and is primarily responsible for energy

compaction. The scaling and the quantization processes are combined to obtain the re-

maining terms in Eqn. 2.1. m(QP%6) represents a matrix whose element values depend on

the value of QP (please see [7] for more details). Here, QP is called the ‘Quantization

Parameter’. Increasing the value of QP increases the quantization step size and hence reduces the

quality. As noted in [7], all the arithmetic operations in Eqn. 2.1 can be done using integer


arithmetic.
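The integer-only computation can be sketched for a single 4 × 4 block. The sketch assumes the usual H.264 relationship in which the step size doubles every 6 QP values, i.e. a right shift by 15 + floor(QP/6) bits, and the multiplier values commonly tabulated for the three coefficient position classes. The rounding offset of half a step and all function names are assumptions; practical encoders typically use a smaller offset.

```python
# Forward core transform matrix Cf.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

# Multipliers for QP % 6 = 0..5, one value per position class:
# (both indices even, both indices odd, all other positions).
M_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def multiplier(qp, i, j):
    even_even, odd_odd, other = M_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def transform_quantize(x, qp):
    """Integer transform, scaling and quantization of one 4x4 block."""
    w = matmul(matmul(CF, x), transpose(CF))
    qbits = 15 + qp // 6
    half = 1 << (qbits - 1)  # assumed round-to-nearest offset
    y = []
    for i in range(4):
        row = []
        for j in range(4):
            magnitude = (abs(w[i][j]) * multiplier(qp, i, j) + half) >> qbits
            row.append(magnitude if w[i][j] >= 0 else -magnitude)
        y.append(row)
    return y


# Hypothetical flat block: only the DC coefficient survives.
flat_block = [[10] * 4 for _ in range(4)]
y_qp24 = transform_quantize(flat_block, 24)
y_qp30 = transform_quantize(flat_block, 30)
```

Raising QP by 6 halves the quantized DC value in this example, which reflects the doubling of the quantization step size.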

The H.264 standard allows rate control at the MB level, i.e. the quantization parameter

of each MB can be set by the encoder [19, 20]. QP for each MB is specified in the MB layer

using the delta_qp SE. The element delta_qp signals a change in the QP from its previous

value. Here, the previous value refers to the QP of the previous macroblock in decoding

order in the current slice (Slice data consists of a series of macroblocks). If there is no

change in QP from the previous value, delta_qp is set to 0.
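The differential QP signaling described above can be sketched as a pair of helper functions. The names are hypothetical, and the sketch assumes that the first macroblock of a slice is coded relative to the slice QP; skipped macroblocks and the permitted delta range are ignored.

```python
def qp_deltas(mb_qps, slice_qp):
    """Convert per-MB QP values (decoding order) into signaled differences."""
    deltas = []
    previous = slice_qp
    for qp in mb_qps:
        deltas.append(qp - previous)
        previous = qp
    return deltas


def qps_from_deltas(deltas, slice_qp):
    """Decoder side: rebuild the per-MB QP values from the differences."""
    qps = []
    previous = slice_qp
    for delta in deltas:
        previous += delta
        qps.append(previous)
    return qps


# Hypothetical slice with four macroblocks.
encoder_qps = [26, 26, 30, 28]
signaled = qp_deltas(encoder_qps, slice_qp=26)
decoded = qps_from_deltas(signaled, slice_qp=26)
```

Runs of macroblocks with the same QP produce zero deltas, so a ROI encoder pays a signaling cost only where the quality level changes.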

2.4 Bitrate & complexity reduction techniques for video surveil-

lance

The increasing importance of high resolution surveillance camera footage has prompted

researchers to develop surveillance specific bitrate and complexity reduction techniques

[16, 21, 22, 23, 24, 25]. We review some of these methods in this section and briefly

describe advances we make to the existing techniques (detailed comparisons are provided

in later chapters). ‘Skip decision’, ‘Reference frame selection’ and ‘ROI coding’ based bitrate

reduction techniques are most relevant to this thesis. Hence, we present a more detailed

literature survey of these techniques.

2.4.1 Skip detection techniques

Although traditional compression techniques (e.g. DPCM/DCT based methods) remove

spatial and temporal redundancy in surveillance videos, a lot of unwanted information

continues to exist in the encoded bit stream. For example, the noise of the sensor in the

static background image regions leads to increased bitrates. To remove this redundancy,

researchers have proposed to encode only the foreground regions in high quality (i.e.

with low QP setting). Regions in the image which do not contain useful information are

either marked as Skip or are encoded in low quality. These methods can be classified as

follows:


• Segmentation based: Many researchers [26, 21, 22, 23, 24] use background subtraction

[27, 3, 28, 29] to segment the foreground objects. The background regions are

marked as skip or else encoded at low quality. A good comparison of popular and ef-

fective methods for Background subtraction can be found in [30]. In [22], Vetro et al.

propose a MPEG-4 based surveillance video coding scheme which utilizes a two stage seg-

mentation method to detect interesting objects in motion. The Gaussian Mixture Model

(GMM) based background subtraction algorithm followed by image correlation is used to

filter the video frame. However, image correlation computation costs are prohibitive to

implement on low power embedded platforms [31]. In [23], Chien et al. have proposed

a low complexity moving object detector for object based video encoders. The method

in [23] maintains a background frame in a buffer. The difference between all the pixels

in the current frame and the background frame is modeled using a Gaussian distribution.

A threshold is applied on the frame difference to classify a pixel as either foreground or

background. More recently, Jin et al. [24] have proposed a motion detection based ‘Skip’

scheme for H.264/AVC surveillance video coding. The method in [24] uses chrominance

features to decide whether a MB needs to be skipped or coded using mode decision. The

mean values of the pixel chroma components are initially used to detect foreground MB’s.

When the mean pixel chroma values are similar to those in the previous frame, individual

chroma components are compared and thresholds are applied to decide whether the MB

can be skipped. MB’s are skipped only when the coarse motion search vector is equal to

the PMV (predicted motion vector). In [32], Yang Yu et al. use a codebook based back-

ground segmentation algorithm to determine moving regions. MPEG-4 object coding is

used to encode the foreground objects. The background is encoded using MPEG-4 frame-

based coding. Shih-Chang Hsia et al. [33] propose a segmentation based technique to

perform MPEG-4 surveillance video encoding. The difference image between adaptively

chosen frames is used to determine the shape of the objects. Spatial processing is used to

refine the shape. The objects are encoded using MPEG-4 video object coding. Venkatesh

Babu et al. [34] determine foreground objects using background subtraction. Object-

based motion compensation is performed and the shape adaptive DCT coefficients of the


compensation error are computed. In [35], Hwangjun Song et al. perform frame differencing

followed by median & morphological filtering to determine the foreground regions.

In [36], Pierpaolo Baccichet et al. compute the Mean Absolute Difference (MAD) be-

tween the pixels in the filtered input frame and the previous encoded frame. The MAD

values are thresholded to determine the ROI MB’s. Ching-Yu Wu et al. [37] use back-

ground segmentation for traffic surveillance video encoding. In [38], Liu et al. use the

Mean Absolute Difference (MAD) (with MV set to zero) to determine the ROI. Thomas et

al. in [39] separately transmit the segmented background and foreground object images

along with watermarks. The receiver authenticates the images using the watermarked

data.

• RD cost based: Skip mode selection techniques proposed for generic video content have

been primarily based on thresholding of the RD (Rate-Distortion) cost [40, 41, 42]. In

[41], Zeng et al. determine the threshold using the value of QP. If the RD cost for Skip

mode is less than the value of the threshold, the MB is marked as Skip and other modes

are not processed. In [40], Kannangara et al. maintain a running estimate of the RD cost

for all the modes. If the RD cost estimate for the skip mode is lower than all the other

estimates, the MB is marked as Skip.

• Motion Vector based: In [43], Kannur et al. group regions based on the MVs into (I)

Moving regions and (II) Static regions. The moving regions are further classified into

multiple regions depending on the motion magnitude. In [44], Hang Li et al. determine

moving regions based on the motion vector values, i.e. MB’s whose motion vector magnitudes are

high are considered as foreground. The moving regions are encoded at higher quality.
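Several of the methods surveyed above reduce to thresholding a per-macroblock difference measure. A minimal sketch of this idea, using the Mean Absolute Difference with the motion vector set to zero, is given below. It is not the algorithm of any cited work: the function names, the threshold and the tiny frames are assumptions made for illustration.

```python
def macroblock_mad(cur, prev, mb_row, mb_col, mb_size=16):
    """Mean absolute difference between co-located MBs (zero motion)."""
    total = 0
    for r in range(mb_row * mb_size, (mb_row + 1) * mb_size):
        for c in range(mb_col * mb_size, (mb_col + 1) * mb_size):
            total += abs(cur[r][c] - prev[r][c])
    return total / (mb_size * mb_size)


def skip_candidates(cur, prev, threshold, mb_size=16):
    """Mark each MB as a skip candidate when its MAD is below the threshold."""
    rows = len(cur) // mb_size
    cols = len(cur[0]) // mb_size
    return [[macroblock_mad(cur, prev, i, j, mb_size) < threshold
             for j in range(cols)]
            for i in range(rows)]


# Hypothetical 4x8 luma frames split into two 4x4 blocks:
# the left block differs only by sensor noise, the right block by an object.
prev_frame = [[100] * 8 for _ in range(4)]
cur_frame = [[101] * 4 + [150] * 4 for _ in range(4)]
candidates = skip_candidates(cur_frame, prev_frame, threshold=5, mb_size=4)
```

A fixed threshold of this kind is sensitive to noise and illumination changes, which is one reason why model based background subtraction is preferred in the segmentation based methods.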

In this thesis, we use a segmentation based technique to determine the background

image regions. We advance the state of the art in two ways. We first develop a speeded up

GMM segmentation algorithm. Next, we combine this segmentation algorithm with a two

stage sampler to efficiently and accurately mark skip MB’s. We also propose to rearrange

data structures to improve cache performance. We test the speeded up GMM algorithm

on standard video datasets. The skip decision technique is tested using an exhaustive set


of surveillance videos that we have captured. The proposed speeded up GMM algorithm

provides speedup (over the GMM algorithm proposed by Zivkovic in [3]) in the range of

1.33 - 1.44 on the standard video datasets. The skip detection algorithm that we propose

provides bit rate savings of up to 94.5% and compression complexity reduction of up to

74.5% (over methods proposed in literature) without increasing the distortion over the

foreground regions.

2.4.2 Background reference frame selection techniques

• Standard non compliant techniques: In [45], Xianguo et al. divide the video sequence

into Super Group of Pictures (GOP). Mean shift is applied on a training set chosen from

the super GOP to generate a background frame. The difference between each input

frame and the background frame is computed and encoded using the H.264 standard.

The background frame for the super GOP is also encoded and transmitted. In [46], Xi-

anguo et al. extend this further by including a background difference prediction model.

They also derive criteria for a block to be predicted by either the short term reference

picture, the background picture or the background difference data. In [47], Manoranjan

Paul et al. determine the background frame using a Gaussian Mixture Model (they refer

to this image as the Most Common Frame in Scene or McFIS). McFIS is encoded as a con-

ventional I-frame. All the frames in the video sequence are encoded using inter coding.

The McFIS along with the previous frame is used as the reference set for inter coding.

In [48], Manoranjan Paul et al. avoid transmitting the McFIS to the decoder. The McFIS

is generated by the decoder independently using the same dynamic background model

used by the encoder. In [49], Totozafiny et al. propose a JPEG2000 standard based en-

coder for road surveillance. The encoder uses the static background as a reference frame.

Video frames from the camera are segmented to determine objects. The segmented data

and the reference frame are transmitted to the decoder. At the decoder, the ROI binary

mask is implicitly inferred using the Maxshift method which is part of the JPEG-2000

standard. The reference frame update procedure is performed across multiple frames.

In [50], Shumin Han proposes a background reconstruction based coding technique for


a moving camera. A panorama image of the background is generated. Feature point

pairs in the panorama and the current frame are detected. These point pairs are used

to estimate the global motion transformation matrix which is later used to reconstruct the

background image for the current frame. The background panorama is intra coded using

the MPEG-4 standard.

Although non compliant schemes simplify the encoder, they cannot be easily

adopted into commercial products. The surveillance system comprises the camera,

network devices, server side software (for visualization and analytics), server side hardware

(for decoding) and server side storage. Surveillance installations typically procure

these components from different vendors. Hence, interoperability is a very important

issue. Increasing vendor compliance to the recent ONVIF and PSIA standards [51, 52]

also clearly shows that successful market adoption depends on standard compliance.

• Standard compliant techniques: The MPEG-4 standard [32, 33] has been used by previous

researchers for ROI video compression. Here, the background image is transmitted

using MPEG-4 frame-based coding. The foreground regions are encoded using MPEG-4

Visual object coding. However, recent video standards such as H.264 and HEVC are block

based and do not support object based video coding. Instead, they allow motion compen-

sation using multiple long term and short term reference frames stored in the decoded

picture buffer.

Research on long term reference frame selection for H.264 has proposed to use high

quality long term reference frames (HQF’s) to improve RD performance on generic videos

[53, 54, 55]. Liu et al. [55] proposed a scheme to select the HQF’s for generic video

content based on the predicted error variance of the coded picture (with the HQF set as

the reference). Experimental results of all these methods showed that the PSNR (Peak

Signal to Noise Ratio) of the low quality frames improved by referencing to the long

term high quality frames. However, in static camera surveillance encoders, the position

of the objects in the coded frame would be very different from the position in the long

term reference frames. Hence, coding quality of foreground objects in motion does not


benefit from high quality long term reference frames. However, the selection of long term

reference frames influences the cost to encode uncovered background regions. This has

been recognized by researchers and a few techniques to optimally select the reference

frames have been proposed in [56, 57, 58]. Xianguo Zhang et al. [58] have proposed

a background model based technique for a HEVC surveillance video encoder. A running

average algorithm is used to generate the background frame. The background picture is

encoded using intra prediction. The ‘no display’ option provided by the HEVC standard

is used to transmit the background picture. However, the bitrate of the encoded video

would increase due to this intra coded background frame.

Li et al. in [56] proposed a technique to select reference frames for a High Efficiency

Video Coding (HEVC) video encoder. The method in [56] utilized cloud compute re-

sources to perform optimal reference frame selection for offline coding of generic video

content. Li et al. [57] extended the work in [56] by developing multiple, low-complexity

algorithms in addition to a quality-adjustment scheme for generic video content. The first

technique introduced in [57] is called the ‘r×’ algorithm which is essentially a greedy

strategy. It relies on the assumption that if a picture (if marked as reference) does not

provide benefit to the encoding process of the current frame, then it would not do so for

the following frames as well. Since the computation cost of the r× algorithm is r times

the cost of a normal encoder, they propose 2 lower complexity algorithms called ‘1×’ and

‘2×’.
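Several of the techniques above rely on a background frame, for example one generated with a running average. A minimal sketch of such an update is given below; the learning rate `alpha`, the function name and the pixel values are assumptions, and the sketch is not the background model of any cited work.

```python
def update_background(background, frame, alpha=0.05):
    """Running average update: B <- (1 - alpha) * B + alpha * I, per pixel."""
    return [[(1 - alpha) * b + alpha * p for b, p in zip(b_row, f_row)]
            for b_row, f_row in zip(background, frame)]


# Hypothetical 1x2 frames: the second pixel is briefly covered by an object.
background = [[100.0, 100.0]]
object_frame = [[100, 200]]
empty_frame = [[100, 100]]

after_object = update_background(background, object_frame, alpha=0.1)
after_recovery = after_object
for _ in range(50):
    after_recovery = update_background(after_recovery, empty_frame, alpha=0.1)
```

A small `alpha` keeps transient foreground objects from leaking into the background frame, at the cost of slower adaptation to genuine scene changes.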

In this thesis, we propose a standard compliant reference frame selection technique

for H.264 surveillance video coding. The ‘1×’ and ‘2×’ complexity algorithms in [57] are

most relevant to the technique that we propose in this thesis. Although the ‘2×’ algorithm

provides bitrate savings almost equal to that obtained by the proposed technique, its com-

putational complexity is significantly higher (it requires a second encode pass). The 1×

technique has reduced complexity (compared to the ‘2×’ algorithm) but it does not pro-

vide bitrate savings. In contrast, the proposed technique determines the optimum reference

frame with very low complexity (30 to 40 µs/frame). Also, since it avoids the coding of

uncovered background regions, additional compute savings are obtained. The proposed


technique reduces bit rate by up to 24.7% and execution time by up to 7.3%.

2.4.3 ROI coding techniques

QP’s for the MB’s inside the ROI can be determined using a static assignment procedure or

using a rate control model based algorithm. Researchers have used ROI coding for movie

content and video conferencing applications. A lot of these ideas are quite generic and

can also be applied to surveillance. Hence, we will also include such references which are

relevant in the context of surveillance. We group various ROI based coding techniques and

present them here. Grois et al. have provided a detailed overview of some of the recent

ROI coding techniques in [59].
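The static assignment procedure mentioned above can be sketched as a fixed QP offset per macroblock class. The offsets, the base QP and the function names are illustrative assumptions; the only fixed constraint used is the 0 to 51 QP range of 8-bit H.264.

```python
def clamp_qp(qp):
    """Keep QP inside the valid range for 8-bit H.264 video."""
    return max(0, min(51, qp))


def assign_static_qp(roi_mask, base_qp, roi_offset=-6, background_offset=8):
    """Return a per-MB QP map: lower QP inside the ROI, higher QP outside."""
    return [[clamp_qp(base_qp + (roi_offset if is_roi else background_offset))
             for is_roi in row]
            for row in roi_mask]


# Hypothetical 2x2 macroblock mask with one ROI macroblock.
roi_mask = [[True, False],
            [False, False]]
qp_map = assign_static_qp(roi_mask, base_qp=30)
```

A rate control model based algorithm would replace the fixed offsets with values chosen to meet a bit budget, but the resulting per-MB QP map is consumed by the encoder in the same way.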

• Object detection and/or tracking based methods: Pattern recognition techniques

have been used to perform MB/blob level object detection and tracking [26, 60]. One

of the early methods proposed for ROI coding [61] used block level frequency domain

features to classify ROI. In the first step, ROI region proposals were obtained. These

proposals were used to train a fine detail neural network classifier. A similar system

was also proposed in [62]. In [26], Lai-Tee Cheok et al. use a vehicle/person classi-

fier to detect pedestrians in the scene. The detected objects are tracked. The person

detector output is used to modulate the weights in a MB-level rate control equation.

Pedestrians are assigned higher weight and hence higher quality. In [60], Fernandez

et al. combine MB level background segmentation, temporal and spatial filtering,

MB clustering and tracking to determine ROI’s for surveillance videos. In [63, 64],

Christopher et al. use Viola Jones face detection [65] to detect faces in each frame. An

iterative mean shift based object tracker is initialized for each new detection. Detec-

tions which match state objects are used to update the object representations. In [66],

Ming-Chieh Chi et al. use face detection to mark ROI regions.

• Skin detection based techniques: These methods have been mostly proposed for

video conferencing applications. In [67], Yang Liu et al. use direct frame difference

and skin-tone classification to determine ROI’s for a video conferencing application.


They use a low pass filter to dilate the skin-tone area to accurately mark the ROI.

In [68], Shu-Fen Huang et al. propose a ROI video transcoder. ROI’s are deter-

mined based on the MV value and the skin pixel probability. Pixel level classification

of skin/non skin is done by thresholding CbCr values. In [69], Douglas Chai et al.

overcome limitations of color segmentation based skin detection by combining it with

probability based morphology and luminance regularization. The detected face re-

gions are encoded using the H.261 video encoder.

• Moving camera related techniques: Absence of a static background increases the

complexity of ROI detection. Researchers have proposed to determine the camera

motion and use it to improve the coding efficiency.

– Pan tilt cameras: In [70] Dalei Wu et al. jointly consider the video coding, trans-

mission and camera control tasks for a pan-tilt camera installed in a wireless

network. A network-delay aware Kalman filter based tracker is used to con-

trol the pan tilt camera. Different video coding parameters result in different

packet lengths and packet loss rates, which will lead to different amounts of

transmission-induced distortion. They determine the set of coding parameters that

optimizes the expected distortion.

– Aerial platforms [71, 72, 73]: In [72], Holger Meuel et al. perform global mo-

tion estimation using the Harris corner detector and KLT (Kanade-Lucas-Tomasi)

tracker. New areas are determined by global motion compensation. Projection

parameters are computed and used to align two frames. Regions in the current

frame, which are projected outside the previous frame, are detected as new ar-

eas (ROI-NA) and need to be encoded and transmitted. The difference image

between the current frame and the motion compensated frame is used to detect

moving objects in the scene.

• Surveillance Operator controlled methods: Several researchers propose to allow

the surveillance operator to control the ROI for video compression [74, 75, 76, 77]. In

[75], Mavlankar et al. propose an encoder which allows the user to define the region


of interest. The frame is divided into multiple slices hence allowing transmission

of only the ROI. A temporal median filter is used to obtain the background frame

which is later intra coded. The optimal slice size is also determined. ROI can also

be determined based on the actions of the operator. In [71], Hui Cheng proposes an

algorithm which analyzes the camera operations such as pan, tilt and zoom control

performed by the operator. Based on this the ROI regions are marked.

• Saliency based techniques:

– Frame center ROI [44, 38]: In [44], Hang Li et al. propose to enhance the perceptual quality

of a video conferencing system. The central region of the frame is encoded at

higher quality since it is more important than the marginal regions.

– Eye tracker based: Fadi Boulos et al. [78] determine ROI’s using an eye tracker.

Fixation duration and fixation velocity parameters are thresholded to determine

salient regions. Such regions which are viewed by multiple viewers are marked

as the ROI.

– Saliency model based: With the development of saliency models [79], researchers

have proposed to use them to determine regions of interest for bit allocation

[80, 81, 82]. In [81], Laurent Itti et al. use saliency based attention predic-

tion to detect interesting regions in the video. Saliency of the image region is

used to determine the bit allocation. They show an improvement of over 2 dB

(eye-tracking-weighted PSNR or EWPSNR measure of subjective quality). How-

ever, for surveillance specific coding, the salient regions are the pedestrian face &

the vehicle number plate. Hence, ROI detectors based on object representations

would perform better than saliency model based techniques.

• FMO based techniques: These methods [60, 43, 36] detect ROI objects and encode

them using different slice groups. Flexible Macroblock Ordering (FMO) supported by the

H.264 standard is utilized to achieve this. In [43], Kannur et al. utilize the ‘explicit

slice group ordering’ option in FMO to define slice groups (SG). Here, different slices


correspond to groups of MBs with different motion properties. These SG’s are coded

with different quality.

• Error resilience and encryption: Detection of ROI’s enables the encoder to increase

resilience to data communication errors. The encoder can provide unequal protection

to the MB’s depending on their importance (e.g. face image regions of a pedestrian

are most important in surveillance). In [78], Boulos et al. encode ROI MB’s using the

Intra mode to reduce propagation of errors. Andreas Unterweger et al. [83] study the

impact of slice group coding on post-compression encryption for surveillance appli-

cations. In [84], Sourabh Khire et al. propose to use multiple down-sampled repre-

sentations to improve burst error resiliency. The technique ensures that errors due to

a burst loss do not impair co-located frames of all the representations. This allows

the receiver to conceal the error and improve the picture fidelity.

• Rate control for ROI encoding: When transmission bandwidth of the wired / wire-

less network drops, the surveillance video encoder will need to either reduce the

frame rate or increase the quantization parameters of MB’s in the frame. Video en-

coders employ rate control algorithms [35, 37, 43, 85, 38, 86, 87, 88, 89] to determine

these parameters. In an ROI encoder, the rate control algorithm can preferentially al-

locate bitrate so that the most important regions in the image (e.g. faces in pedestrian

surveillance) are compressed with high fidelity. In [35], frame-layer and macroblock-

layer rate control is performed using a moving-region-weighted MSE based distortion

model. In [38], Liu et al. assign more bits to MB’s which have higher MAD values.

Yu Sun et al. [89] propose a joint source-channel region based MB level rate

control algorithm for wireless video transport. Bitrate allocated to ROI is higher than

that assigned to non-ROI MB’s. In [90], Chung-Ming et al. use motion detection

and tracking to determine the foreground objects. The background quality is set to

a low value. The neighbours of ROI MB’s are considered as ROI-contour extensions

and are coded at a slightly higher quality. ROI MB’s are given highest priority and

coded with highest quality. Appropriate QP values for the ROI, ROI contour extensions


and background are determined. A recently proposed rate control approach in [91],

although not directly related to ROI encoding, is very interesting in the context of

surveillance applications. Here, the rate control algorithm preserves image features

that are required to perform computer vision tasks such as image retrieval.

• Commercial systems: Having realized the immense benefits of ROI detection and

coding, commercial vendors have included similar techniques in their cameras. The

DINION and FLEXIDOME HDR cameras from Bosch [92] detect objects such as faces,

people and vehicles and control the imager settings (e.g. auto exposure) to ensure

that high picture quality of the objects is obtained. Sony cameras [93] allow operators

to select the portion of an image they want to monitor in 4K resolution. The rest of

the image is streamed in lower resolution. Axis Zipstream technology [94] proposes

to dynamically determine ROI’s to preserve forensic details such as faces and tattoos.

The VideoBANDIT suite [95] by General Dynamics uses Dynamic region-of-interest

coding to transmit video over ultra low bandwidth communication links.

In this thesis, we combine multiple, low and mid level detectors to compute regions

of interest. We also integrate a tracker to reduce the computational complexity. We show

that object detection using only skin and face detection does not provide good ROI seg-

mentation. Hence, the proposed technique uses multiple visual cues to accurately mark the

regions of interest. The proposed scheme provides bitrate reduction of up to 50.2% over the

x264 video encoder. In the context of bitrate reduction under limited compute capability,

the possibility of an optimal processing order of the blobs in a video sequence is suggested.

The complete implementation of a complexity control mechanism for pedestrian ROI video

coding is left to future work.

While the proposed system assumes a static camera to perform foreground segmen-

tation, the techniques can be combined with image registration and adopted in Pan-tilt

surveillance cameras. Surveillance cameras mounted on drones present several challenges

(e.g. rolling shutter correction, image stabilization, limited compute and energy resources)

which we leave to future work. Operator controlled techniques to determine ROI (e.g. faces


and vehicle number plates) are not scalable. Hence, we suggest using operator assistance

only to mark regions in the images which are guaranteed to have no interesting content.

Infrequent tasks such as marking of control points required to compute the scene geome-

try can be performed by the operator. Saliency models are more suited for movie content

than video surveillance and hence are not discussed further in this thesis. Also, we have

not explored using FMO since it is not well supported by commercially available decoders.

The proposed face ROI encoder can be used to improve error resilience. The proposed

techniques will also need to be combined with rate control algorithms to enable surveil-

lance video streaming over limited bandwidth networks. We briefly discuss these ideas in

Chapter 5.

2.4.4 Mode decision and motion estimation related techniques

Unique characteristics of surveillance videos can be used to reduce the mode decision and

motion estimation complexity of the video encoder. Tong Gan et al. [96] propose a fast

H.264/AVC mode decision scheme for tunnel traffic surveillance where flashing lights pose

significant challenges. Significant change of luminance levels and appearance of new ob-

jects in the scene cause large number of MB’s to be intra coded. Such MB’s are usually

clustered. Hence, if three or more neighbours of an MB are coded as Intra, then the mode

for the MB is also marked as Intra. This reduces the ME compute cost of such MB’s. In [26],

Cheok et al. use a person detector to detect pedestrians in the scene. When a scene change

occurs, MB’s which contain people are encoded using inter prediction.
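The neighbour-based Intra rule attributed to [96] above can be illustrated with a minimal sketch. The choice of neighbourhood (the four causal neighbours in raster-scan order) and the names `force_intra` and `intra_map` are assumptions made for illustration; the text only states that three or more Intra neighbours trigger the rule.

```python
def force_intra(intra_map, row, col, min_intra_neighbours=3):
    """Return True if enough already-coded neighbours of MB (row, col) are Intra.

    intra_map[r][c] is 1 if the MB at (r, c) was coded as Intra, else 0.
    Assumption: only the causal neighbours (left, top-left, top, top-right)
    are inspected, since they are already coded in raster-scan order.
    """
    offsets = [(0, -1), (-1, -1), (-1, 0), (-1, 1)]
    count = 0
    for dr, dc in offsets:
        r, c = row + dr, col + dc
        if 0 <= r < len(intra_map) and 0 <= c < len(intra_map[0]):
            count += intra_map[r][c]
    return count >= min_intra_neighbours


# Toy map: the first row and the first MB of the second row are Intra.
intra_map = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
skip_me_for_1_1 = force_intra(intra_map, 1, 1)  # four Intra neighbours
skip_me_for_1_3 = force_intra(intra_map, 1, 3)  # one Intra neighbour
```

When the function returns True, the encoder can mark the MB as Intra directly and avoid the motion search for that MB.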

Muhammad Akram et al. [97, 98, 99] propose three different Motion Estimation (ME)

techniques for surveillance encoders: (I) Selective ME - search is performed only on frames

which have some activity (II) Tracker based ME - Surveillance video tracker results are

used to perform ME (III) Multi frame ME - difference between current and previous refer-

ence frames is computed. Pixel locations where difference is non zero are considered as

candidate locations for matching blocks in the current reference frame.

In moving camera platforms, global motion estimation has been used to reduce ME


computation cost. In [100], Guili Xu et al. propose a block-based motion estimation tech-

nique for pan tilt cameras. Global motion is estimated and Kalman filtering is used to

determine the MV’s. Computation time reduction of about 95% is achieved. In [73, 101],

global motion estimation is applied to video captured by a quadrocopter mounted cam-

era. Global motion compensation based on the projective transform is performed. Only the

blocks which contain moving object detections are coded. At low altitudes, the projective

transform is replaced by a mesh-based global motion compensation technique.

In this thesis, background & uncovered background MB’s are reconstructed from image

content in the DPB. Motion estimation on these MB’s is not required. Hence, we obtain

significant computational savings of up to 74.5% without affecting the foreground image

quality.

2.4.5 Hardware related advancements

Stolberg et al. [102] have developed a single chip solution for surveillance applications. The

chip consists of three cores which can together perform MPEG-4 encoding and object track-

ing. The first core with a 16-datapath SIMD array is optimized for image and general digital

signal processing tasks. The second core is designed to perform macroblock processing for

the video encoder. The third core performs bit stream processing and combined with the

MB processor jointly implement the encoder. The object tracking algorithm is implemented

on the DSP core.

Researchers have integrated many of the necessary surveillance related data processing

elements into the focal plane. Pixel-level capacitors or photo-diode devices are used as

storage elements of the image. In [103], Chi et al. describe a capacitive motion detection

circuit which is built into the pixel. This work was extended further in [104] by using a

18.5 MHz micro-controller which computes the fast binary DCT of image blocks. With a

compression ratio of about 48:1, surveillance events of interest can be discerned. Bo Zhao

et al. [105] integrate on-chip moving object detection and localization capabilities into a

64 × 64 CMOS image sensor. A clustering algorithm is implemented in the image sensor

chip itself. The algorithm can localize up to three moving objects in the scene. Region of


interest picture capture is also supported. In [106], Mizuno et al. utilize the photo-diode

array itself as a frame memory. Ming Zhang et al. [107] propose two CMOS-based motion

detection circuits to perform ROI detection. Nicola Massari et al. [108] have demonstrated

edge detection, motion detection, image amplification, and dynamic-range boosting oper-

ations using pixel level analog processing. The imager uses switched capacitor techniques

to perform the image processing operations over a kernel of 3 × 3 pixels. Several research

groups [109, 110, 111, 112, 113] have integrated compression algorithms on the image

sensor. A comprehensive review of image sensors with on-chip image compression is avail-

able in [114].

Although this thesis does not implement specialized hardware elements, the proposed

techniques indicate several details which can be used to improve the performance of surveil-

lance specific hardware systems. In Chapter 3, we describe a speeded up GMM algorithm

that replaces expensive floating point computation with integer operations. The technique

provides speedup of up to 44% (over techniques proposed in literature) and also reduces

the memory bandwidth by a minimum of 16% for multimodal pixels. In Chapter 4, we

have optimized the cache performance of the sampler based skip detection algorithm. Results

show up to 12.3% reduction in execution time and 30.2% reduction in Last Level Cache

or LLC references.

2.4.6 Distributed video coding based techniques

Puri and Ramchandran [18] propose a distributed coding architecture called PRISM which uses

channel coding concepts to shift the motion estimation complexity from the encoder side

to the decoder. The quantized codeword space of the input data is partitioned and the

syndrome of the quantized data is transmitted. Motion estimation is not performed at the

encoder. At the decoder, motion search is performed to obtain candidate predictors. The

decoder recovers the data using the received syndrome and the candidate predictor as side-

information. Encoding performance is shown to be between that of inter and intra coding

modes of H.263+. Chuohao Yeo et al. [115] extend PRISM to support multi-view video

compression. The key idea is to use predictors from other views when few predictors are lost


due to packet drops. If the block to be reconstructed is visible in the other view, its predictor

is used. In this method, the cameras do not need to know about the geometry/positions of

any other image sensors in the scene.
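The syndrome idea described above can be illustrated with a deliberately simplified scalar toy. This is not the PRISM codec: the quantizer step, the number of cosets and the function names are all assumptions, and a real system partitions the codeword space with a channel code rather than a modulo operation. The sketch only shows how a decoder can recover a quantized value from a coset index plus a nearby predictor.

```python
def encode_syndrome(value, step=4, n_cosets=8):
    """Quantize `value` and transmit only its coset index (the 'syndrome')."""
    q = round(value / step)
    return q % n_cosets


def decode_with_side_info(syndrome, side_info, step=4, n_cosets=8):
    """Pick the member of the signalled coset closest to the side information."""
    q_side = round(side_info / step)
    # Largest coset member that is <= q_side, and the next member above it.
    base = q_side - ((q_side - syndrome) % n_cosets)
    candidates = (base, base + n_cosets)
    q = min(candidates, key=lambda c: abs(c - q_side))
    return q * step


syndrome = encode_syndrome(131)                  # encoder: no motion search
good = decode_with_side_info(syndrome, 125)      # predictor close to the source
bad = decode_with_side_info(syndrome, 91)        # predictor too far away
```

With a good predictor the decoder recovers the quantized source value; with a poor predictor it selects the wrong coset member, which is why the decoder side motion search for candidate predictors matters.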

Liu et al. [16] proposed a surveillance specific Wyner-Ziv encoder in which intra frames

used in traditional Wyner-Ziv coding were replaced by backward predictively coded frames

(BP frames). However, as observed by Girod et al. in [17], distributed video coding al-

gorithms continue to lag behind conventional video coding schemes in rate-distortion per-

formance. [16] also requires a backward channel which will prevent adoption in cameras

which store the video in a local memory device. Video compression standards committees

and the surveillance industry also have not adopted these techniques. Hence, we do not

apply distributed coding techniques in this thesis.

2.4.7 Wireless and/or Remote surveillance specific techniques

Wireless commercial systems for homes that run on batteries have become very popu-

lar [116, 117]. They are triggered by motion and offer a complete remote video monitoring

solution. Wireless and remote surveillance systems operate under severe energy and band-

width resource constraints. The communication channel is also error prone. Some extreme

examples of such remote deployments have been described in [118, 119]. Lijuan and Qiang

propose a video surveillance system to monitor a large scale wind farm [118]. Carl Hartung

et al. [119] use a web camera and satellite communication to monitor weather conditions in

rugged wildland fire environments. Yun Ye et al. [120] provide a detailed survey of wireless

surveillance research.

Such systems can be classified as (I) Event driven and (II) Continuous transmission.

Event driven systems use passive infrared (PIR) sensors or low level image processing

to detect motion. When an event is detected, the video encoder is woken up to

encode the video. Lee et al. [21] have used background subtraction as an event detector

for surveillance. A scheduler for the encoder configurations is also proposed to determine

the optimal settings based on the estimate of future events and remaining battery charge.

In [121], Jongpil Jung et al. do not turn off the image sensor. Instead they continuously


capture images, encode them using a JPEG encoder and store them in DRAM. The system

is designed to store up to 10 s of surveillance video. When an event is detected, the

stored sequence of JPEG frames is transcoded to H.264 and transmitted over the wireless

channel.

Optimal allocation of system resources, i.e. energy and communication bandwidth, is critical

in continuous transmission wireless surveillance systems. Hence, several researchers

have studied cross-layer control (radio, encoder, imager, PTZ actuator) of such systems

[122, 70, 123, 124, 125, 126]. In [70], Dalei Wu et al. jointly optimize the video coding

quality, transmission bandwidth and camera control for a resource constrained pan-tilt wire-

less surveillance system. Zhihai He et al. [125] develop an analytic Power-Rate-Distortion

(P-R-D) model of the video encoder. The model is used to study the optimum power al-

location between video encoding and wireless transmission. In [126], Malisa Marijan et

al. determine the optimal power allocation among the image sensor, compression, and

transmission modules. A sigma-delta image sensor that allows easy control of P-R-D perfor-

mance of the imager is used. The distortion of the video is minimized under power budget

constraints.

In this thesis, we propose a continuous transmission surveillance system which is more

suited for cities and towns. The proposed skip decision method marks background regions

as skip. The bit cost of marking MB’s as skip is very small and hence the proposed technique

provides bitrate reduction of up to 94.5% (over techniques proposed in literature). Further,

we also propose a face ROI video encoding technique that provides up to 50.2% bitrate

reduction in pedestrian surveillance videos.

2.5 Summary

We have described different video compression techniques used for surveillance. The H.264

standard, which we adopt to implement our proposed techniques, has been described. In

particular, reference frame, skip mode and QP related aspects which are relevant to this

thesis have been discussed in detail. Also, we have categorized and presented the various


techniques proposed in literature. Skip detection, reference frame selection and ROI video

coding techniques have been discussed in more detail since they are more relevant to the

thesis. We have also briefly described some of the mode decision, motion estimation, dis-

tributed coding, wireless surveillance and hardware related schemes in literature that are

specific to video surveillance. We now provide a brief summary of advancements that the

proposed techniques achieve over existing systems.

The proposed speeded up GMM algorithm uses windowed weight updates to reduce

floating point complexity. It provides speedup of 1.33 - 1.44 (over Zivkovic [3]) on standard

video datasets without affecting segmentation accuracy. The speeded up GMM algorithm

is combined with a low computational complexity, sampler based skip detection scheme

to accurately determine skip MB’s. The skip detector combines stratification and adaptive

sampling techniques to achieve up to 94.5% bit rate reduction and 74.5% computational

complexity reduction. We also propose a very low complexity reference frame selection

technique for H.264 video surveillance encoding. Results show that the proposed reference

frame selection method reduces bit rate by up to 24.7% and execution time by up to 7.3%

(compared to the 1× algorithm [57]). Finally, we have proposed a face ROI encoding tech-

nique for pedestrian video surveillance. We combine multiple, low & mid level visual cues

using Bilattice logic to accurately determine face ROI’s. Face image regions are encoded in

high quality. Non-face regions are encoded in lower quality and the shadow regions are

marked as skip. The proposed technique has been integrated into the x264 video encoder.

Experiments show bitrate savings of up to 50.2%.

Chapter 3

Speeded up GMM Algorithm for

Background Subtraction

3.1 Introduction

Background subtraction is often the first step in static camera video surveillance applica-

tions. It reduces the computation required by the downstream stages of the surveillance

pipeline which usually comprises video coding, object detection and tracking. Consequently,

it constitutes the most active/resource demanding stage of the surveillance pipeline

since it processes each incoming pixel in the video stream. In Chapter 4, we propose a skip

selection scheme for an H.264 surveillance video encoder. Here, MB’s which contain only

background image content are marked as Skip. This reduces the bitrate of the encoded

video stream. However, accurate detection of foreground regions is essential to prevent

Skip-coding of the objects in the scene.

Shaking trees, foliage, sunlight intensity changes due to active cloud motion, rain have

been the main sources for reduced accuracy of simple background subtraction algorithms.

The Gaussian mixture model (GMM) scheme proposed by Stauffer and Grimson [27] has

been one of the most successful techniques that works well in such uncontrolled outdoor

environments. However, the original GMM algorithm [27] suffered from slow learning

rates during the initial phase. KaewTraKulPong and Bowden [127] corrected this using a

two stage learning scheme where the GMM is updated initially using the sufficient statistics

based equations and is later switched to a ‘L-recent window’ version. Lee [2] further

improved upon this by using a modified schedule that gradually switches between the two

update modes.


The good accuracy of the GMM approach comes at the cost of significantly high compu-

tation and memory bandwidth requirements. Benezeth [30] found that the GMM algorithm

with 3 modes is about 3.7 times slower compared to the single Gaussian scheme. Zivkovic

[3] described a significant improvement to reduce the computation time and memory band-

width. He formulated a Bayesian approach to select the required number of Gaussian modes

for each pixel in the scene. In scenes with static background (traffic sequence in their pa-

per), this approach assigns a single mode Gaussian to model most of the pixels which helps

to reduce average processing time by 32%. However in the outdoor video (trees sequence),

results show only a 2% improvement since a significantly large portion of the scene requires

a multi-modal model.

Although real time performance of the adaptive GMM scheme has been demonstrated

on native PC’s, an increasing demand to move the analytics onto the camera itself requires

embedded platforms with low compute resources to support the algorithms. In this chapter,

we propose an orthogonal approach that provides computation time reduction by mini-

mizing floating point computations. We also combine the fast learning of [2] with the

automatic selection of number of modes in [3] to obtain a highly efficient and accurate

scheme.

In the next section, we review the modification proposed by D.S.Lee followed by a

very brief description of the improved AGMM algorithm proposed by Zivkovic. We refer

to [2] & [3] for a detailed discussion of the algorithms. In section 3.3, we present our

proposed modification to the GMM algorithm that significantly reduces the computation

time. Detailed experimental results of the proposed algorithm are discussed in Section 3.4.

3.2 Gaussian mixture model

3.2.1 Adaptive Mixture Learning with fast convergence

Each pixel in a frame is modelled using a Gaussian mixture model (GMM). The parame-

ters of the Gaussian mixture model (usually with 3 modes) are estimated using an online


version of the EM algorithm. To prevent the foreground pixels from corrupting the back-

ground model, Stauffer et al. [27] proposed to use modes with low weights to model the

foreground. The description of the algorithm is given below:

The modes of the GMM are arranged in decreasing order of their weights. A predefined

fraction of the weights is used to determine the modes that model the background. This

favors modes with higher weight to be selected as the background. A match of an incoming

pixel to any of the modes is defined to occur if the Mahalanobis distance from the pixel is

less than a predefined threshold Tσ. If the match occurs on one of the background modes,

the pixel is labelled as background, else it is classified as foreground. The following update

equations are applied for the parameters of the Gaussian mode ‘k’ with the highest weight

that matched the incoming pixel x(t):

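\[ w_k(t) = (1-\alpha)\, w_k(t-1) + \alpha \tag{3.1} \]

\[ \mu_k(t) = (1-\eta_k)\, \mu_k(t-1) + \eta_k\, x(t) \tag{3.2} \]

\[ \sigma_k^2(t) = (1-\eta_k)\, \sigma_k^2(t-1) + \eta_k \left( x(t) - \mu_k(t-1) \right)^2 \tag{3.3} \]

where ηk is the adaptive learning rate given by:

\[ \eta_k = \frac{1-\alpha}{c_k} + \alpha \tag{3.4} \]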

ck is a counter which is maintained independently for each mode. Its value is initialized

to 1 for a new mode and is incremented whenever a match with an incoming pixel occurs.

ηk is a parameter that controls the learning rate of the modes.

As can be observed from Eq. (3.4), the learning rate is initially set to match the sufficient

statistics based update. As time progresses, it converges to a L-recent window based update

mode with a fixed learning rate of α. The weight for the remaining modes is updated using

Eq. 3.5.


\[ w_k(t) = (1-\alpha)\, w_k(t-1) \tag{3.5} \]

If none of the modes match, then the mode with the least weight is replaced with a new

mode having low initial weight and large variance.
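The per-frame update of Section 3.2.1 can be sketched for a single grey-level pixel as follows. This is a minimal illustration under stated assumptions, not the reference implementation of [2]: the dictionary layout, the function names, the initial standard deviation `SIGMA_INIT` and the choice to give a new mode the weight α are all hypothetical. The counter is incremented before ηk is computed so that the first few updates behave like a running average.

```python
import math

ALPHA = 0.004       # learning rate alpha
T_SIGMA = 3.0       # match threshold in standard deviations
SIGMA_INIT = 30.0   # assumed initial standard deviation of a new mode
MAX_MODES = 3


def new_mode(x, weight=ALPHA):
    """Create a mode centred on x with low weight and large variance."""
    return {"w": weight, "mu": float(x), "var": SIGMA_INIT ** 2, "c": 1}


def update_pixel(modes, x):
    """Apply Eqs. (3.1)-(3.5) to the mode list of one pixel, in place.

    Returns the index of the matched mode, or None if no mode matched.
    """
    modes.sort(key=lambda m: m["w"], reverse=True)   # highest weight first
    matched = None
    for k, m in enumerate(modes):
        if abs(x - m["mu"]) < T_SIGMA * math.sqrt(m["var"]):
            matched = k
            break
    for k, m in enumerate(modes):
        if k == matched:
            m["c"] += 1
            eta = (1.0 - ALPHA) / m["c"] + ALPHA             # Eq. (3.4)
            d = x - m["mu"]                                  # uses mu(t-1)
            m["w"] = (1.0 - ALPHA) * m["w"] + ALPHA          # Eq. (3.1)
            m["mu"] = (1.0 - eta) * m["mu"] + eta * x        # Eq. (3.2)
            m["var"] = (1.0 - eta) * m["var"] + eta * d * d  # Eq. (3.3)
        else:
            m["w"] = (1.0 - ALPHA) * m["w"]                  # Eq. (3.5)
    if matched is None:
        if len(modes) < MAX_MODES:
            modes.append(new_mode(x))
        else:
            modes[-1] = new_mode(x)   # list is sorted: last has least weight
    return matched
```

Note that every frame performs floating point weight arithmetic for every mode, which is the cost that the proposed algorithm targets.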

3.2.2 Automatic selection of number of components

In the GMM algorithm described above, the weights of the Gaussian mixture represent

the fraction of the data samples ‘x(t)’ that belongs to the particular mode in the model.

Defining nm to represent the number of samples that belong to the mth mode, the weights

of the GMM can be considered to define a multinomial distribution for the nm’s. Instead

of using the ML estimate that results in the original GMM update equation, Zivkovic used

a Dirichlet prior with negative coefficients. This is done with an intention of accepting a

class only if there is enough evidence from the data samples for the existence of the class.

Solving for the MAP (maximum a posteriori) estimate, the final adaptive update

Eqs. (3.1) & (3.5) are modified as follows:

\[ w_k(t) = (1-\alpha)\, w_k(t-1) + \alpha - \alpha c_T \tag{3.6} \]

\[ w_k(t) = (1-\alpha)\, w_k(t-1) - \alpha c_T \tag{3.7} \]

cT is a parameter that represents the minimum fraction of samples required to support

the existence of a mode (set to be equal to 0.01 in [3]). We need to normalize the weights

after each update so that they add up to one. The modes whose weights become negative

are discarded. New modes are initialized with mean set to be equal to the pixel values that

didn’t match any of the existing modes. The variance is initialized to a large value. The

mean and the variance updates are similar to Eqs. (3.2) & (3.3) with ηk defined to be equal

to α/wk(t) instead of (3.4). This division by the weight significantly improves the learning


rate compared to [27], however as Lee mentions in [2], ηk is unbounded and hence might

lead to divergence.
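A minimal sketch of the weight handling described in this subsection is given below: the prior term αcT is subtracted from every weight, modes with negative weight are discarded and the remaining weights are renormalized. The function name and return convention are assumptions for illustration and are not taken from [3].

```python
def zivkovic_weight_update(weights, matched, alpha=0.004, c_t=0.01):
    """One per-frame weight update following Eqs. (3.6) and (3.7).

    weights: list of mode weights; matched: index of the matched mode or None.
    Returns (normalized weights of surviving modes, their original indices).
    """
    updated = []
    for k, w in enumerate(weights):
        w = (1.0 - alpha) * w - alpha * c_t   # Eq. (3.7): decay plus prior
        if k == matched:
            w += alpha                        # additional term of Eq. (3.6)
        updated.append(w)
    # Modes whose weight became negative are discarded.
    kept = [k for k, w in enumerate(updated) if w >= 0.0]
    total = sum(updated[k] for k in kept)
    if total <= 0.0:
        return [], []
    return [updated[k] / total for k in kept], kept


# The third mode has almost no support and is pruned by the prior term.
new_w, kept = zivkovic_weight_update([0.7, 0.25, 0.00001], matched=0)
```

In this example the weakly supported third mode is removed, so the pixel is subsequently modelled with two modes only.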

3.3 Proposed Algorithm

From the description provided in sections 3.2.1 & 3.2.2 we list the main steps involved in

the GMM algorithm as: (A) Sort the Gaussian modes (B) Match the pixel to the modes and

(C) Update the parameters of the modes.

We also note 4 observations in [2] and [3] that suggest our modification:

1. The weights of the Gaussian modes change slowly, with a time constant of roughly

1/α, which is typically of the order of a few hundred frames

2. Set of Background modes also doesn’t change rapidly since the weights change slowly

3. The mean and variance update Eqs. (3.2) & (3.3) are independent of the weight

values

4. A newly formed mode takes a minimum of a few tens of cycles to be removed. For

cT = 0.01 in [3], it takes on the order of 1/cT frames (100 frames) for a newly

formed mode to be removed in the case where none of the pixels match that mode.

Based on the above observations, we propose to update the weights only once in Tw

frames where Tw is a constant set to be equal to 16. The choice of Tw is

discussed in the results section. We refer to Tw as the ‘weight update interval’. The set of

modes that belong to the background are also determined only once in Tw frames. New

modes are allowed to be created for all the frames. However, we determine mode deletions

only once in Tw frames. We refer to the cycle when we perform the true weight update as

the ‘fine update cycle’. To ensure that learning is unaffected, we need to perform accurate

weight updates based on the values of the pixels in the past Tw frames. We enable this

by using a low resolution (4 bit) integer counter to count the number of matches to a

mode that occurs in the Tw frames. The counter values are used to perform an accurate


update during the ‘fine update cycle’ using modified equations derived below. Since we

now update the floating point weight values and determine the set of background modes

only during the ‘fine update cycle’ (once in Tw frames), the computational complexity is

reduced. Expensive floating point computations during the remaining Tw − 1 cycles are

replaced by simple integer increment operations. The derivation for the weight update is

provided now:

The Gaussian mixture distribution of the pixel x (we have dropped the time index t

here only for the sake of clarity) formulated in terms of discrete latent variables z is shown

in Eqn. 3.8 [128]. Here z is a K dimensional binary random variable having a 1-of-K

representation (z = [z1, z2, ....zK ]T ). zk = 1 indicates that the pixel x was generated from

the kth mode of the mixture model.

\[ p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left( x \mid \mu_k, \sigma_k^2 \right) \tag{3.8} \]

From the EM algorithm, the weight update at time instant t is given by Eqn. 3.9 [128].

γ (zk (i)) is the posterior probability of zk (i) = 1 (i is the time index) once we have observed

the incoming pixel x(i). γ (zk (i)) can also be interpreted intuitively as the responsibility that

the mode k takes to explain away the pixel data at time instant i.

\begin{align}
w_k(t) &= \frac{\sum_{i=1}^{t} \gamma(z_k(i))}{t} \tag{3.9} \\
       &= \frac{\sum_{i=1}^{t-T_w} \gamma(z_k(i)) + \sum_{i=t-T_w+1}^{t} \gamma(z_k(i))}{t} \tag{3.10}
\end{align}

Stauffer et al. [27] proposed to set γ (zk (i)) to 1 for the mode k which matched the
incoming pixel. The responsibility for other modes is set to 0. Let Nk denote the number of
times the incoming pixel matched the mode k in the most recent Tw frames. The second sum
in Eqn. 3.10 is then equal to Nk, and by Eqn. 3.9 the first sum is equal to
(t − Tw)wk(t − Tw). Hence, Eqn. 3.10 can be rewritten as shown in Eqn. 3.11.


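\begin{align}
w_k(t) &\approx \frac{(t-T_w)\, w_k(t-T_w) + N_k}{t} \tag{3.11} \\
       &\approx \left( 1 - \frac{T_w}{t} \right) w_k(t-T_w) + \frac{N_k}{t} \tag{3.12}
\end{align}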

(1/t) is set equal to α as proposed in [27]. We also add the weight decay term from [3]

to obtain the final weight update in Eqn. 3.13:

\[ w_k(t) = (1 - T_w\alpha)\, w_k(t-T_w) + N_k\alpha - T_w\alpha c_T \tag{3.13} \]
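A quick numerical check of Eq. (3.13) against the per-frame Eqs. (3.6) and (3.7) is sketched below. The function names and the particular match pattern are assumptions chosen for illustration. The two results agree to first order in α because (1 − α)^Tw ≈ 1 − Twα when Twα is small.

```python
def per_frame_update(w, matches, alpha=0.004, c_t=0.01):
    """Apply Eq. (3.6) on a match and Eq. (3.7) otherwise, once per frame."""
    for hit in matches:
        w = (1.0 - alpha) * w - alpha * c_t + (alpha if hit else 0.0)
    return w


def windowed_update(w, matches, alpha=0.004, c_t=0.01):
    """Apply Eq. (3.13) once for the whole window of len(matches) frames."""
    t_w = len(matches)
    n_k = sum(matches)   # value of the integer match counter N_k
    return (1.0 - t_w * alpha) * w + n_k * alpha - t_w * alpha * c_t


matches = [True] * 10 + [False] * 6   # N_k = 10 matches in T_w = 16 frames
w_exact = per_frame_update(0.7, matches)
w_fast = windowed_update(0.7, matches)
deviation = abs(w_exact - w_fast)     # small compared with the weight itself
```

The windowed form needs one multiply-add per mode every Tw frames plus an integer increment per match, instead of a floating point update per mode per frame.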

In Fig. 3.1, the weight update is plotted using the original GMM update equation and

the proposed method for the case where the weights are monotonically increasing and

decreasing with Tw = 16. The modified weight update is also shown applied to a realistic

scenario in Fig. 3.2. Here data points generated from a synthetic distribution with mass

function = [0.7, 0.25, 0.05] are used to update the weights using both the original GMM

Eqs. (3.6) & (3.7) and the proposed update Eq. (3.13). The initial weight is set arbitrarily

to [0.4, 0.4, 0.2]. We observe that the learning rate and weight values obtained using

the proposed technique match well with those obtained from the original GMM update

equations.

The complete pseudo code is described in Algorithm 1. Here maxModeFlag is used to

indicate that the mode with least weight was replaced during the Tw window. This is used

during the fine update cycle to reset the true weights of that mode. BG is a set that contains

the list of modes that belong to the Background model. This set is updated during the ‘fine

update cycle’. The integer weight counters are represented by weightCounti’s. The ‘fine

update cycle’ is staggered across pixels and in time such that only npixels/Tw receive the

fine update during each cycle (where npixels is the number of pixels in the frame). This

ensures that the processing time is uniform across all the frames.
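One simple way to stagger the fine update cycle is sketched below. The schedule itself, offsetting each pixel by its index modulo Tw, is an assumption for illustration; the text only requires that npixels/Tw pixels receive the fine update in each frame.

```python
def needs_fine_update(pixel_index, frame_index, t_w=16):
    """True when this pixel is scheduled for its fine update in this frame.

    Assumed schedule: pixel p is updated when (t + p) mod T_w == 0, so each
    pixel is still updated exactly once in every T_w consecutive frames.
    """
    return (frame_index + pixel_index) % t_w == 0


n_pixels, t_w = 64, 16
# Number of pixels that receive the fine update in each of 32 frames.
per_frame = [
    sum(needs_fine_update(p, t, t_w) for p in range(n_pixels))
    for t in range(2 * t_w)
]
```

With this schedule every frame carries the same number of fine updates, so the per-frame processing time does not spike once every Tw frames.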

An alternate derivation using a heuristic is provided in Appendix A.


Algorithm 1: Proposed scheme for a single pixel x

Init: BG = { }, numModes = 1, Reset maxModeFlag,
      ∀i ∈ {1..maxNumModes}: µi = ∞, σi = σinit, wi = α
Data: input pixel x(t)
while New Data x(t) do
    for i ∈ {1....numModes} do
        if x(t) matches mode i then
            if i ∈ BG then
                x(t) is Background
            else
                x(t) is Foreground
            ci ← ci + 1 (refers to ci from D.S.Lee)
            Update µi, σi using Eqs. (3.2), (3.3) & (3.4)
            weightCounti ← weightCounti + 1
    if ∀i ∈ {1..numModes}, x(t) doesn’t match mode i then
        x(t) is Foreground
        if numModes < maxNumModes then
            numModes ← numModes + 1
            Initialize new mode j
        else
            Replace mode j where j = arg min_i {wi}
        cj ← 1, weightCountj ← 1
        µj ← x(t), σj ← σinit
        if numModes = maxNumModes then
            Set maxModeFlag
    Once in Tw frames:
    if t is a multiple of Tw then
        if maxModeFlag is Set then
            wi ← α where i = arg min_i {wi}
            Reset maxModeFlag
        for i ∈ {1....numModes} do
            Update wi using Eq. (3.13)
            if wi < 0 then
                delete mode
                numModes ← numModes - 1
        Normalize w
        Arrange modes in decreasing order of wi’s
        Determine Set of Background Modes, BG
        BG = {1....nBG} where nBG = arg min_b { ∑_{k=1}^{b} wk > TBG }
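Algorithm 1 can be sketched in Python for a single grey-level pixel as shown below. This is an illustrative reading of the pseudo code, not the thesis implementation: the class and attribute names, the initial standard deviation and the use of a scalar intensity are assumptions. As a simplification, a newly created or replaced mode is given the weight α immediately, so the sketch does not need maxModeFlag.

```python
import math


class PixelGMM:
    """Single-pixel sketch of Algorithm 1 with windowed weight updates."""

    def __init__(self, alpha=0.004, c_t=0.01, t_w=16, t_bg=0.8,
                 t_sigma=3.0, sigma_init=30.0, max_modes=3):
        self.alpha, self.c_t, self.t_w, self.t_bg = alpha, c_t, t_w, t_bg
        self.t_sigma, self.sigma_init = t_sigma, sigma_init
        self.max_modes = max_modes
        self.modes = []   # each mode: w, mu, var, c (Lee counter), count
        self.n_bg = 0     # modes[0:n_bg] form the background set BG
        self.t = 0

    def _new_mode(self, x):
        return {"w": self.alpha, "mu": float(x),
                "var": self.sigma_init ** 2, "c": 1, "count": 1}

    def step(self, x):
        """Process one sample; return True if x is labelled foreground."""
        self.t += 1
        foreground = True
        for i, m in enumerate(self.modes):
            if abs(x - m["mu"]) < self.t_sigma * math.sqrt(m["var"]):
                foreground = i >= self.n_bg
                m["c"] += 1
                eta = (1.0 - self.alpha) / m["c"] + self.alpha   # Eq. (3.4)
                d = x - m["mu"]
                m["mu"] += eta * d                                # Eq. (3.2)
                m["var"] = (1.0 - eta) * m["var"] + eta * d * d   # Eq. (3.3)
                m["count"] += 1   # integer increment, no weight arithmetic
                break
        else:  # no mode matched: create a mode or replace the weakest one
            if len(self.modes) < self.max_modes:
                self.modes.append(self._new_mode(x))
            else:
                j = min(range(len(self.modes)),
                        key=lambda k: self.modes[k]["w"])
                self.modes[j] = self._new_mode(x)
        if self.t % self.t_w == 0:
            self._fine_update()
        return foreground

    def _fine_update(self):
        a, tw = self.alpha, self.t_w
        for m in self.modes:
            # Eq. (3.13): one floating point weight update per T_w frames.
            m["w"] = (1 - tw * a) * m["w"] + m["count"] * a - tw * a * self.c_t
            m["count"] = 0
        self.modes = [m for m in self.modes if m["w"] >= 0.0]
        total = sum(m["w"] for m in self.modes)
        if total > 0.0:
            for m in self.modes:
                m["w"] /= total
        self.modes.sort(key=lambda m: m["w"], reverse=True)
        # Smallest prefix of modes whose cumulative weight exceeds T_BG.
        acc, self.n_bg = 0.0, 0
        for m in self.modes:
            acc += m["w"]
            self.n_bg += 1
            if acc > self.t_bg:
                break
```

A typical use is to call `step` once per frame for each pixel; until the first fine update the background set is empty, so all samples are reported as foreground.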


Figure 3.1: Weight update using proposed update for a monotonically (a) increasing and
(b) decreasing case. The weights are plotted on the y axis with respect to time. Each panel
compares the original update (legend: GMM) with the proposed update (legend: proposed).

Figure 3.2: Weight update using (a) original GMM update equations and (b) proposed
weight update for a GMM with weights=[0.7, 0.25, 0.05] and Tw = 16. The weights w1, w2
and w3 are plotted on the y axis with respect to time. The graph shows that the proposed
technique does not affect the learning rate. The increase in frame rate provided by the
proposed technique is illustrated in Fig. 3.4


3.4 Experimental results

We initially describe experiments done to determine the optimum weight update inter-

val Tw. Later, we discuss results of experiments performed to obtain KL divergence on 1-

dimensional synthetic datasets showing good learning performance of the proposed scheme.

Next, we present a quantitative evaluation of the proposed algorithm on a set of 10 standard

videos (5 outdoor and 5 indoor): video4, 6 & 7 from [9], fountain, hall, lobby, shopping-

Mall, bootstrap & campus from [8] and HighwayI from [129]. Dataset [8] provided 20

frames of manually segmented foreground masks from each video set. VSSN 2006 provides

ground truth for all the frames. Ground truth was generated manually for 10 randomly

chosen frames in the HighwayI sequence. Precision-Recall curves for the proposed algorithm

are compared with those obtained using [2] and [3].

We measure the frame rate on a Core i5 processor running at 2.53 GHz with 4 GB of

system memory. All the programs are single threaded and have been compiled in Release

mode using Microsoft Visual C++. The following parameter values were found to work

well on all the videos: α = 0.004 & TBG = 0.8. The maxNumModes was set to 3 (two for

the background and one for the foreground).

3.4.1 Weight update interval Experiment

Fig. 3.3 shows the measured frame rate plotted as a function of Tw for the VSSN06 dataset

videos using the modified weight update technique. We observe that the speedup saturates

for Tw in the range of ≈ 12 - 18. The maximum error % of the proposed method is also

plotted where the error is defined as the maximum difference between the weights obtained

using ‘per cycle update’ Eqs. (3.6) & (3.7) and the weight obtained using the proposed

update Eq. 3.13 at the end of Tw frames. All possible combinations of matches during the

Tw frames are considered and the maximum deviation is plotted. The initial weight at the

beginning of the Tw frames is set to a realistic value winit = [0.7, 0.25, 0.05]. We can

observe that the error increases linearly as Tw is increased. Since speedup saturates for

Tw in the range of ≈ 12 - 18, we choose Tw to be 16. Choosing a higher Tw increases the
error without any benefit. On a dedicated hardware system, a weight update interval of 16
results in compact 4 bit counters for the coarse weight updates. We find that the small error
in weight doesn’t have any impact on the accuracy in real dataset videos. Detailed accuracy
data is described below for the chosen weight update interval of 16.

Figure 3.3: Frame rate (fps) and error % are plotted with respect to the weight update
interval Tw. (Plot axes: weight update interval Tw from 1 to 18; average frame rate in fps;
maximum error %.)

3.4.2 Adaptive Mixture Learning Experiment

The accuracy of the proposed method is first validated on one dimensional synthetic data.

Fig. 3.4a shows a typical pixel intensity distribution (plotted against frame count) observed

in surveillance videos. Here, a pixel which was initially unimodal (e.g. pixel belongs to the

‘sky’) changes to a multimodal process (wind causes tree leaves to vacillate on a static ‘sky’

background). Fig. 3.4b shows the KL divergence achieved by the original update equations

in [2] and by the proposed method during the phase where the model is learning the pa-

rameters for the new mode. We find the learning achieved by the proposed method to be

very similar to that obtained using the original update equations. The divergence has been


Figure 3.4: (a) Synthetic distribution (pixel intensity against frame count) based on commonly observed surveillance videos (b) KL divergence against frame count achieved by the proposed method and the original method of D. S. Lee [2]

computed using Monte Carlo sampling averaged over 5 datasets. Similar experiments per-

formed on slowly varying illumination models showed that the accuracies of the proposed

method matched well with the original scheme in [2].
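As an illustration of the divergence measurement, a Monte Carlo estimate of the KL divergence between two one dimensional Gaussian mixtures can be sketched as follows. This is a minimal sketch and not the code used to produce Fig. 3.4b: the function names, the mixture parameters, the sample count and the seed are all assumptions chosen for illustration.

```python
import math
import random

def gmm_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )

def gmm_sample(rng, weights, means, stds):
    """Draw one sample: pick a mode by weight, then sample that Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])

def kl_monte_carlo(p, q, n_samples=20000, seed=0):
    """Estimate KL(p || q) as the sample mean of log(p(x) / q(x)) with x ~ p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = gmm_sample(rng, *p)
        total += math.log(gmm_pdf(x, *p) / gmm_pdf(x, *q))
    return total / n_samples

# Hypothetical 'sky + leaves' pixel model and a slightly different learned model.
# Each model is (weights, means, standard deviations); the values are made up.
true_model = ([0.6, 0.4], [200.0, 120.0], [5.0, 15.0])
learned_model = ([0.7, 0.3], [200.0, 118.0], [5.0, 14.0])
print(kl_monte_carlo(true_model, learned_model))
```

The estimate approaches zero as the learned mixture approaches the true one, which is the behaviour tracked over frame count in a learning curve of this kind.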

3.4.3 Background subtraction experiment

The precision-recall curves for the 10 videos listed in section 5.2 have been determined. We

observed that the accuracy of the proposed algorithm matches the accuracy of the GMM

formulations of Lee [2] and Zivkovic [3] in all the videos. The average precision is plotted

against the recall rate in Fig. 3.5 showing no degradation of accuracy with the proposed

scheme. Since a false negative or a ‘miss’ is undesirable in surveillance applications, we

only show recall rates varied from 0.65 to 0.95. However, we verified that the proposed

scheme doesn’t impact accuracy at lower recall rates as well.

Fig. 3.4 shows the frame rate of the different methods averaged over 5 trials with

Tσ = 3. The average frame rate computed as the reciprocal of the average computation

time is listed in Table 3.1. We find that the proposed scheme provides significant speedup

for the case where there are multiple modes required to model a significant fraction of the

scene. In Figs. 3.4a, 3.4b & 3.4d (HighwayI, Shopping Mall and bootstrap), we observe high speedups since frequent foreground


Table 3.1: Average frame rate using proposed scheme, [2] & [3]. Average speedup obtained using proposed scheme measured over [3] (Zivkovic)

             Average frame rates (fps)
Dataset      D.S.Lee    Zivkovic    Proposed    Avg. Speedup
HighwayI     116        145         211         1.44
Campus       287        351         432         1.23
Hall         265        326         401         1.22
Lobby        307        375         440         1.17
Mall         115        153         204         1.33
Fountain     342        423         457         1.08
Bootstrap    291        333         458         1.37
video4       119        149         182         1.22
video6       116        143         185         1.28
video7       137        174         191         1.09

motion results in continuous creation of new modes. Similarly, large background motion in

the ‘campus’ sequence causes a significant fraction of pixels to require a multimodal model.

Hence, the proposed method provides speedup in this sequence as well. In Fig. 3.4c (video4), we

notice that the initial speedup is low. This is due to the relatively static scene during the

initial phase of video4. Since a significant fraction of the scene requires only a single mode,

Zivkovic’s scheme itself provides high speedup, leaving little additional gain for the proposed method. However,

we observe that the speedup of the proposed method over Zivkovic’s scheme improves

beyond frame 400 since the number of multimodal pixels increases (due to shaking leaves

and appearance of foreground objects).

The proposed technique also provides a minimum memory bandwidth reduction of 16%

for a pixel which has more than 1 mode. This is because the floating point weight variables

are fetched from the memory only once in 16 frames. Extra memory required (worst case)

to store weight counters, background set and a flag is equal to 11bits/pixel (0.42MB for a

VGA resolution input).
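The memory figure quoted above can be checked with a short calculation. This sketch assumes a VGA frame of 640 x 480 pixels and 1 MB = 10^6 bytes; the function name is ours and is used only for illustration.

```python
def extra_memory_mb(width, height, bits_per_pixel):
    """Extra model storage in megabytes (1 MB = 10**6 bytes)."""
    return width * height * bits_per_pixel / 8 / 1e6

# Worst case of 11 bits/pixel on a VGA (640 x 480) input.
vga_mb = extra_memory_mb(640, 480, 11)
print(round(vga_mb, 2))  # 0.42
```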


Figure 3.4: Instantaneous frame rates (fps) plotted against frame count using proposed scheme, [2] & [3]: (a) HighwayI (b) Shopping Mall (c) video4 (d) bootstrap

Figure 3.5: Average precision-recall curves (precision against recall) obtained using proposed scheme, [2] & [3] for the 10 dataset videos


Figure 3.6: Detection results on the Hall [8] and video7 [9] dataset videos. (a) & (e) are original images from the Hall and

video7 datasets respectively. (b) & (f) are the corresponding Ground truth images. (c) & (g) are the segmentation masks

obtained using Lee [2]. (d) & (h) are the segmentation masks obtained using the proposed scheme


3.5 Summary

The computational complexity of modelling the background pixels using adaptive GMM

proposed in [3] can be significantly reduced for highly active pixels by our proposed scheme

of windowed weight updates. This method reduces processing time without affecting segmentation

accuracy. Experimental results show a speedup of up to 44% in scenes where

a large fraction of the pixels require multimodal Gaussian models. The proposed modifica-

tions are also quite suitable for a hardware implementation. In the next chapter, we will

adopt the speeded up GMM algorithm to perform skip selection in the H.264 surveillance

video encoder.

Chapter 4

Skip decision & Reference Frame Selection for H.264 Surveillance Coding

4.1 Introduction

A substantial fraction of the Macroblocks (MB’s) in a static camera surveillance video stream

usually do not contain any objects of interest. Coding such MB’s using the Skip mode pro-

vided by the H.264/AVC standard provides a significant reduction in coding cost. Hence, it

is essential to accurately classify the MB’s into 2 sets: (1) MB’s which contain foreground

objects of interest (FG MB’s) and (2) MB’s which do not contain objects of interest (back-

ground MB’s or BG MB’s). Objects of interest in a surveillance scene typically are in a state

of motion (e.g. humans, cars, handbags). This naturally motivates the adoption of a mo-

tion detection algorithm to perform skip decision. However, the motion detection algorithm

needs to be computationally simple to operate on low power embedded camera platforms.

The method also needs to be accurate since skipping regions of interest will drastically

impact the utility of the encoded video stream. In this chapter, we describe a spatial sam-

pler based skip detection algorithm. Sampled pixels in the frame are segmented using a

speeded up GMM (Gaussian Mixture Model) algorithm that we proposed in the previous

chapter. The data structures of the GMM are rearranged to improve cache performance.

The remainder of the chapter is organized as follows: A high level overview of the pro-

posed sampler based skip decision technique is provided in Section 4.2. Different sampling


techniques used in practice are reviewed in Section 4.3. The proposed low cost skip de-

tection method is described in Section 4.4. MB classification & the H.264 skip signalling

procedure are provided in Sections 4.6 & 4.7 respectively. Experimental results of the pro-

posed technique are reported in Chapter 5.

4.2 Proposed Architecture

Figure 4.1: Proposed surveillance specific video coding architecture (blocks shown: video stream from camera, spatial sampler, GMM based motion detection with GMM parameters, skip detection producing Bcurrent, background model M = {B(n)}, reference frame selection & signalling producing nR, skip signalling {Si}, H.264/AVC encoder with Ref. Pic. Buffer RPB = {R(n)} ordered by Ref. Pic. List 0, transmit, and H.264/AVC decoder with Decoded Pic. Buffer DPB = {R(n)} ordered by Ref. Pic. List 0)

A block diagram of the proposed surveillance processing pipeline is shown in Fig. 4.1. A

spatial sampler is used to select pixels in the camera-captured image. Pixel level background

subtraction is performed at the sampled locations using the GMM algorithm. Motion detec-

tion is performed based on the segmented output from the GMM algorithm. Macroblocks

which do not have foreground pixels are considered as ‘Regions of no interest’ (RONI) and

are marked as BG MB’s. Accurate determination of blocks containing objects of interest

in the scene is essential to reduce bandwidth without distorting the foreground image.

However, the computational cost and power required to execute foreground segmentation

algorithms should be low enough to be feasible on embedded platforms. The proposed


method achieves this using a combination of 4 techniques: (A) Pixels in the frame are seg-

mented using a 2-Step adaptive sampling process (B) Spatio-Temporal priors are used to

bias the decision of marking a MB as BG or FG (C) The speeded up GMM based pixel level

background subtraction algorithm that we proposed in [130] is utilized to further reduce

the computational cost of motion detection (D) Data structures of the GMM are rearranged

to improve cache performance. A detailed description of the proposed Skip detection tech-

nique is provided in Section 4.4.

Let Bcurrent denote the set of indices of macroblocks that are marked as RONI (i.e. BG

MB’s) in the current frame. Si = {Mode, Ref, MV, Residual} is a set consisting of coding

decisions for the ith MB in the current frame. Here Mode ∈ {P16×16, SKIP} and MV

refers to the motion vector used to perform motion compensation utilizing the reference

frame indexed by Ref . The Residual is added to the motion compensated image to obtain

the reconstructed MB. Let N be the maximum number of reference frames allowed in the

reference picture buffer and RPB = {R(n) : 0 ≤ n < N} denote the ordered list of reference

frames R(n) present in the buffer. The ordering of R(n) in RPB is defined by the reference

picture ‘List 0’, i.e. R(0) denotes the frame in the buffer referenced by the top of ‘List 0’. The

set of indices of background MB’s in the nth reference frame of ‘List 0’ is denoted by B(n).

The ordered list M = {B(n) : 0 ≤ n < N} contains information of all the background MB’s

present in the reference buffer and is called the background model.

The proposed scheme takes the set of BG MB’s (determined using ‘Motion detection’) as

input and attempts to recreate the uncovered background regions using predicted data from

the frames in the Reference Picture Buffer. Since small residual errors in the background

regions do not impact surveillance systems, residual coding for BG MB’s is skipped. If the

reference frames in the Reference Picture Buffer contain the BG MB’s uncovered in the

current frame, the MB’s can be transmitted without residual coding and they incur a low

coding cost. Hence choosing the right pictures as reference is important to reduce bit rate.

An adaptive reference frame selection technique is proposed in Sec. 4.8 to optimally mark

encoded frames as Reference.


4.3 Sampling techniques

In our proposed technique, a sampler is used to determine foreground MB’s. Sampling

theory has been studied extensively in literature [131, 132, 133]. Different sampling tech-

niques have been used for various practical applications [133]. A few representative exam-

ples are (I) Environmental data collection which is further used to determine contamination

risks and (II) Estimation of oil reserves for which carefully chosen sample holes are drilled.

Increasing the accuracy of the estimate and reducing the cost to determine the estimate are

the main challenges of such sampling operations. For example, in the case of environmental

sampling, accurate contamination data is very critical. In the case of estimating oil reserves,

drilling holes is very expensive and hence optimal sampling is very important.

4.3.1 Basic sampling techniques

Basic sampling techniques that are commonly used are shown in Fig. 4.2. We now briefly

describe them here:

• Simple random sampling: As described by Steven K. Thompson in [132], distinct

units are selected from the population such that all possible combinations of the sam-

pled units are equally likely. This is effective when the population is homogeneous.

However, simple random sampling can be more expensive than other designs if the cost of

obtaining the sample is high in the randomly chosen locations (e.g. in the case of

estimating oil reserves).

• Cluster sampling: The population is split into primary and secondary units. Each

primary unit consists of one or more secondary units which are clustered in space

or time. Whenever a primary unit has been chosen as a sample, all its secondary units

are also included in the collection.

• Systematic sampling: Samples are chosen at regularly spaced intervals (in space

or time). This sampling pattern provides the largest coverage in a region for a fixed

number of units. Although systematic and cluster samplers appear to be very different,


they share an underlying design principle [132, 131]. The selection of the systematic

sampler can be considered to be the selection of a primary unit that constitutes the

whole sample. For example, we can divide the input 2D image data into four large

sets (or primary units) of pixels where each set consists of pixels sampled on a grid

which is offset from the other primary units. Sampling using the systematic sampler

involves the selection of one of these primary units. Systematic sampling has been

shown to be very effective in natural populations [131]. However, the accuracy of

this technique can reduce due to periodicity in the population.

We now consider the variance of the unbiased estimator for the population-total ob-

tained using a systematic sampler. The variance is related to sampled data parameters

as shown in Eqn. 4.1.

\[ \operatorname{var}(\tau) \propto \sigma^2 \left[ 1 + (M - 1)\rho \right] \tag{4.1} \]

Here ρ is the within-primary-unit correlation coefficient. M is the average number of

secondary units in a primary unit. σ² is the variance of the data (Please refer to [132]

for more details).

Eqn. 4.1 shows that in the case of estimation, it is optimal to sample such that the

within-primary-unit correlation coefficient is low. In natural populations, similar data

characteristics are found in samples which are clustered in space and/or time. Sys-

tematic sampling which spreads the secondary sample points apart causes the within-

primary-unit correlation coefficient to be low. Hence, the systematic sampler has been

found to be very effective in real world applications.

• Stratified sampling: Here, the population is partitioned into sets which are called

strata. The strata are chosen based on existing prior information about the process

or from domain experience (e.g. ecological surveys stratify based on soil type and

vegetation). The sampler design is chosen differently for each stratum.
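The four basic designs listed above can be sketched on a small one dimensional population of unit indices. This is only an illustrative sketch: the function names, the population size and the choice of simple random sampling inside each stratum are assumptions and are not taken from [132].

```python
import random

def simple_random(units, k, rng):
    """Choose k distinct units; every k-subset is equally likely."""
    return sorted(rng.sample(units, k))

def cluster(units, cluster_size, n_clusters, rng):
    """Choose whole primary units (runs of adjacent secondary units)."""
    primaries = [units[i:i + cluster_size]
                 for i in range(0, len(units), cluster_size)]
    chosen = rng.sample(primaries, n_clusters)
    return sorted(u for primary in chosen for u in primary)

def systematic(units, stride, offset):
    """Choose every stride-th unit; each offset defines one primary unit."""
    return units[offset::stride]

def stratified(strata, k_per_stratum, rng):
    """Sample each stratum separately (simple random sampling in each here)."""
    return [simple_random(s, k, rng) for s, k in zip(strata, k_per_stratum)]

rng = random.Random(0)
units = list(range(16))
print(simple_random(units, 4, rng))
print(cluster(units, 4, 1, rng))
print(systematic(units, 4, 1))  # [1, 5, 9, 13]
print(stratified([units[:8], units[8:]], [1, 3], rng))
```

Note that the four possible offsets of `systematic(units, 4, offset)` partition the population, which is the sense in which a systematic sample is the selection of a single primary unit.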


4.3.2 Adaptive sampling

Here, sampling pattern decisions are made based on the data that has been analyzed thus

far. Such designs can utilize the sampled dataset characteristics to refine future patterns.

They can be designed to very efficiently determine rare elements in the data. For example,

in pollutant estimation, the region is initially sampled sparsely. Units which are considered

interesting (i.e. pollution levels are greater than a threshold) will result in inclusion of

neighbouring points in the sample set. Natural populations are typically aggregated and

hence, such adaptive sampler designs significantly improve performance.

Different adaptive designs include (I) Adaptive cluster sampling (II) Systematic and

strip adaptive cluster sampling & (III) Stratified adaptive cluster sampling. We propose to

utilize a modified version of the stratified-adaptive-cluster sampler to perform skip detec-

tion. Hence, we describe only this technique here. The reader is referred to [132] for a

comprehensive description of other sampler designs.

• Stratified Adaptive Cluster Sampling

Stratified adaptive cluster sampling combines the ideas of adaptive and stratified

sampling [132]. Adaptive sampling utilizes sampled-unit characteristics to efficiently

choose future data points. In comparison, stratified sampling uses prior information

or domain knowledge to decide sampler patterns. Hence by combining both these

techniques, stratified adaptive cluster sampling improves the performance of the sys-

tem.

An example of this sampling technique is shown in Fig. 4.3. The population is first

sampled using the stratified design technique in Fig. 4.3a. When sampled units which

are of interest are found, additional samples from their neighbourhood are chosen.

This is shown in Fig. 4.3b where the added sample points are marked with a cross. In

the next section, we will describe the proposed skip detection algorithm based on the

Stratified adaptive cluster sampling technique.
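The adaptive step described above can be sketched on a small two dimensional grid. This is a minimal sketch under stated assumptions: the grid values, the threshold, the 4-neighbourhood and the function name are hypothetical, and the first-stage sample is passed in directly (it could come from a stratified design).

```python
def adaptive_cluster_sample(values, initial, threshold):
    """Grow an initial sample by adding 4-neighbours of units above threshold.

    values    : 2-D list of measurements (e.g. pollutant level per cell)
    initial   : iterable of (row, col) cells in the first-stage sample
    threshold : a unit is 'of interest' when its value exceeds this
    """
    rows, cols = len(values), len(values[0])
    sampled = set(initial)
    frontier = list(sampled)
    while frontier:
        r, c = frontier.pop()
        if values[r][c] <= threshold:
            continue  # not of interest: its neighbours are not added
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in sampled:
                sampled.add((nr, nc))
                frontier.append((nr, nc))
    return sampled

# Hypothetical population with one aggregated cluster of high values.
grid = [
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 0, 7, 0, 0],
    [0, 0, 0, 0, 0],
]
first_stage = [(0, 0), (1, 1), (3, 4)]
sample = adaptive_cluster_sample(grid, first_stage, threshold=5)
print(len(sample))  # 12
```

Only the first-stage unit at (1, 1) is of interest, yet the whole cluster is recovered because each newly added unit above the threshold pulls in its own neighbours.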


Figure 4.2: Basic sampling techniques (a) Random (b) Cluster (c) Systematic (d) Stratified


Figure 4.3: Stratified Adaptive Cluster Sampling (a) First stage (b) Second stage


4.4 Sampler based Background MB detection

The duration in which foreground objects appear in a surveillance scene is typically a small

fraction of the total time. As a consequence, the energy consumption of the system is

strongly related to the energy required to detect foreground objects in the video sequence.

We propose to combine stratification and adaptive sampling techniques described in the

previous section to efficiently detect the background MB’s.

The input image is initially sampled sparsely. The sampled pixels are classified as either

background or foreground using the GMM algorithm. The regions surrounding the foreground

pixels are considered to be salient. These salient regions are further sampled using

a dense sampler. The sampled pixels are segmented to verify the presence of foreground

objects. MB’s which do not contain foreground pixels are included in BCurrent (the set of in-

dices of MB’s which contain only background objects). We abbreviate the proposed sampler

based motion detection scheme used to determine BCurrent as ‘GMM S-MD’. A flowchart of

the proposed method is shown in Fig. 4.4. A more detailed explanation of the proposed

GMM S-MD algorithm is provided below.

The motion detection process can be considered to be a cascade of two stages: (1)

Salient MB detection and (2) Background MB detection.

(1) Salient MB detection: Fig. 4.5 shows a macroblock (marked on a foreground object)

and the expanded view of the pixel grid. A set of sparsely located pixels A1 is obtained by

uniformly sampling the input image with inter pixel spacing set to Dsparse. The pixel posi-

tions are offset by a distance equal to Ddense to obtain multiple sparse pixel sets A2, A3, ...

Fig. 4.5 shows one such sampling pattern with 4 sets of pixels A1, A2, A3 and A4 inter-

spersed on a regular grid. In the first stage, the image is sparsely sampled by selecting only

one set of pixels i.e. A1, A2, A3 or A4. The selection is performed in a sequential fashion i.e.

pixels belonging to set A1 are sampled at frame n, pixels belonging to set A2 are sampled at

frame n+1 and so on. Background subtraction is performed on the sparsely sampled set of

pixels Ai using the GMM algorithm that we proposed in [130]. Chang et al. [134] showed

that sampling does not impact the learning performance of the GMM model. We denote


Figure 4.4: GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-MD’) flow chart. Salient MB detection: Image → Sparse Sampler + BGS → Fsparse → union with Fprev = (Bprev)^C, where Bprev is BCurrent after a 1 frame delay → Fsal → Morphological Dilation using a 3x3 element → F’sal. BG MB detection: F’sal → Dense Sampler + BGS → IFG → Erosion using a 2x1 element → BCurrent.

Notations:
Fsparse: Set of indices of MB’s which were detected as foreground by the sparse sampler
Fsal: Set of indices of MB’s which are candidates for dense sampling (before dilation)
F’sal: Set of indices of MB’s which are candidates for dense sampling (after dilation)
Fprev: Set of indices of MB’s which were marked as foreground in the previous frame
IFG: Output binary image obtained by segmenting densely sampled pixels
BCurrent: Set of indices of MB’s in current frame which contain only background content

the threshold applied by the GMM algorithm on the sparse pixels as Tsparse. If one or more

pixels in MBi are classified as foreground, then the index i is included in the set Fsparse,

i.e. Fsparse ← Fsparse ∪ {i}. Fsparse contains the set of indices of MB’s which are detected

as FG MB’s by the sparse sampler (Fsparse is cleared before each frame in the input video

sequence is processed). The presence of foreground objects in the MB during the previous

frame increases the probability of the MB to contain foreground pixels in the current frame.

Hence, the union of Fsparse and Fprev (set of indices of FG MB’s in the previous frame)

is computed to obtain Fsal (set of indices of MB’s which have a large probability of con-

taining foreground objects). Likewise, the presence of foreground pixels in a MB increases

the probability of its neighbors to contain foreground objects. Hence, the final set of MB

candidates to be considered as salient, i.e. F’sal, is constructed by including the indices of

neighboring MB’s. This is performed by applying a morphological ‘dilation’ operator using

a 3x3 element.

(2) Background MB detection: F’sal contains the indices of the MB’s which are considered

to be salient. Background subtraction is performed on the pixels of all the four sparse


Figure 4.5: Sampling pattern of pixels in an image (Dsparse = 8, Ddense = 4; the MB boundary and an 8x8 block are marked). The sampled pixels are partitioned into 4 sparse sets A1, A2, A3, & A4. Also shown are the GMM data structures of pixels mapped onto different cache lines to improve cache locality (cache line 1: μ1, σ1², N1 and the weights w1, w2, w3; cache line 2: μ2, σ2², N2 and μ3, σ3², N3). The models of the dominant modes are arranged in a contiguous manner. Also, the data elements belonging to a single set of pixels are present in a contiguous array.


sets (A1, A2, A3 and A4) in the salient MB’s. We denote the GMM threshold used by the

dense sampler as Tdense. The classified output is represented using a binary image IFG

where IFG (x, y) = 1 indicates that the pixel at location (x×Ddense, y ×Ddense) in the in-

put image is a foreground pixel. A morphological ‘erosion’ operator is applied on the image

IFG using a 2x1 element to filter out the noise. MBi is marked as a BG MB if all the sam-

pled pixels in the filtered output which belong to MBi are 0. Fig. 4.6 shows a set of salient

MB’s detected on a human in the scene. The figure also shows the pixels that are sampled

to detect the set of background MB’s.

Figure 4.6: Salient MB’s and sampled pixel plot
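The two stage cascade described above can be summarized in a short sketch. It is not the thesis implementation: the pixel classifier `is_fg(x, y, threshold)` is a hypothetical stand-in for the GMM background subtraction, the 2x1 erosion element is assumed to be a horizontal pair of dense samples, samples outside the salient MB’s are treated as background, and all function and variable names are ours.

```python
MB = 16        # macroblock size in pixels
D_SPARSE = 8   # stride within one sparse set (Fig. 4.5)
D_DENSE = 4    # offset between sparse sets, i.e. the dense stride

# Pixel offsets of the four interleaved sparse sets A1..A4.
OFFSETS = [(0, 0), (D_DENSE, 0), (0, D_DENSE), (D_DENSE, D_DENSE)]

def gmm_smd(frame_idx, mb_cols, mb_rows, f_prev, is_fg, t_sparse, t_dense):
    """Return (b_current, f_current) as sets of (mb_x, mb_y) indices."""
    width, height = mb_cols * MB, mb_rows * MB

    # Stage 1: sparse sampling, one set A_i per frame, chosen cyclically.
    ox, oy = OFFSETS[frame_idx % 4]
    f_sparse = set()
    for y in range(oy, height, D_SPARSE):
        for x in range(ox, width, D_SPARSE):
            if is_fg(x, y, t_sparse):
                f_sparse.add((x // MB, y // MB))

    # Temporal prior: FG MB's of the previous frame stay salient.
    f_sal = f_sparse | f_prev

    # Spatial prior: 3x3 dilation on the MB grid.
    f_sal_dilated = set()
    for mx, my in f_sal:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                nx, ny = mx + dx, my + dy
                if 0 <= nx < mb_cols and 0 <= ny < mb_rows:
                    f_sal_dilated.add((nx, ny))

    # Stage 2: dense sampling (all four sets) inside the salient MB's.
    i_fg = set()  # dense-grid coordinates where I_FG = 1
    for mx, my in f_sal_dilated:
        for y in range(my * MB, (my + 1) * MB, D_DENSE):
            for x in range(mx * MB, (mx + 1) * MB, D_DENSE):
                if is_fg(x, y, t_dense):
                    i_fg.add((x // D_DENSE, y // D_DENSE))

    # Erosion with an (assumed horizontal) 2x1 element to filter noise.
    eroded = {(gx, gy) for gx, gy in i_fg if (gx + 1, gy) in i_fg}

    per_mb = MB // D_DENSE
    f_current = {(gx // per_mb, gy // per_mb) for gx, gy in eroded}
    all_mbs = {(mx, my) for my in range(mb_rows) for mx in range(mb_cols)}
    return all_mbs - f_current, f_current

# Hypothetical classifier: a 32x16 pixel object; the threshold is ignored.
def toy_is_fg(x, y, threshold):
    return 16 <= x < 48 and 16 <= y < 32

b_cur, f_cur = gmm_smd(0, 4, 3, set(), toy_is_fg, t_sparse=3.0, t_dense=2.5)
print(sorted(f_cur))  # [(1, 1), (2, 1)]
```

The returned `f_current` is fed back as `f_prev` for the next frame, which is how the one frame delay in Fig. 4.4 enters the computation.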

4.4.1 GMM S-MD as a Stratified-Adaptive-Cluster sampler

Although the proposed sampler would appear to be different from the Stratified-Adaptive-

Cluster sampler, the two techniques share the same set of underlying concepts. The

set of MB’s (in the current frame) that were marked as FG in the previous frame, and

their neighboring macroblocks constitute stratum 1. The remaining MB’s in the current

frame constitute stratum 2. Strata 1 & 2 have been sampled using a systematic sampler


with stride parameter set to Ddense & Dsparse respectively. Stratum 1 has been sampled

densely since the likelihood of MB’s in this image region containing foreground objects is high. Stratum 2 image regions

are sparsely sampled to detect new objects appearing in the scene. Sparsely sampling the

stratum 2 image regions reduces the computational complexity of the detector. Sampled

units in Stratum 2 which are marked as foreground result in further sampling of pixels in

their neighboring MB’s. This adaptive sampling technique helps to reduce the ‘miss rate’ of

foreground detection. The final set of pixels marked as foreground by the dense sampler is

filtered to reduce the false alarm rate.

Combining stratification and adaptive sampling helps to reduce the computational cost

of the skip detector without sacrificing accuracy. If we adopted only adaptive cluster sampling,

isolated FG MB’s (for example, FG MB’s of small objects) that are incorrectly marked

as BG by the sparse sampler would be skip coded. However, since GMM S-MD uses

stratification based on the results of the previous frame, such isolated MB’s that were de-

tected in the previous frame would be considered as salient. Also, using only stratification

based on previous frame results does not cause all the MB’s of newly entered objects to be

detected. Even if a few MB’s on the object are not detected by the sparse sampler, adaptive

sampling will include these MB’s in stratum 1 (Densely sampled set). This is because FG

MB’s detected by the sparse sampler will cause their neighboring MB’s also to be marked as

salient. In Chapter 5, we will present experimental results which show the effectiveness of

combining stratification and adaptive cluster sampling techniques for skip detection.

Along with the standard Stratified-Adaptive-Cluster sampling techniques adopted, GMM

S-MD also incorporates features specific to skip detection, namely (I) Spatio-temporal priors

and (II) Cache performance optimization, which we will discuss here.

4.4.2 Spatio-temporal priors

The accuracy of the proposed multi stage sampler based BG MB detector depends upon

the sampling parameters of Strata 1 & 2 and the accuracy of the pixel level classifiers (i.e.

sparse and dense pixel classifiers). The precision and recall values of the pixel level classifier

depend upon the GMM thresholds. Reducing the GMM threshold increases the false positive rate

and improves the recall rate. Similarly, increasing the sampling stride reduces

the recall rate. A detailed analysis of these relationships is provided in Appendix B. GMM

S-MD allows different thresholds for the sparse and dense samplers. As we will later show

in Chapter 5, to detect camouflaged small objects, we reduce the threshold of the dense

sampler. The spatial prior introduced by this configuration of the GMM S-MD skip detector

closely resembles the priors imposed by doubleton clique potentials (spatial smoothness

priors) in Markov Random Field based image segmentation tasks. When a pixel is marked

as a FG pixel by the sparse sampler, a lower GMM threshold is applied on the neighboring

pixels, hence biasing the pixel to be marked as a FG. If any of the neighboring pixels are

also marked as FG, the morphological filter output marks the MB as FG. Since GMM S-MD

considers FG MB’s detected in the previous frame when computing the set of salient MB’s,

it also incorporates temporal priors along with the spatial bias.

As explained above, we observe that the GMM S-MD algorithm incorporates Spatio-

Temporal priors to influence the final classification of a MB (as FG/BG), i.e. an MB which

contains foreground-pixel-detections biases its spatial and temporal neighbors to be marked

as FG. This 2-Step sampling process improves the accuracy of Skip detection. A large frac-

tion of the MB’s belonging to the background are filtered by the sparse sampler. The dense

sampler is applied only on the set of Salient MB’s. Hence, the inclusion of Spatio-Temporal

priors is achieved with low computational cost. The five parameters of GMM S-MD are

Tsparse, Tdense, GMM learning rate α, Ddense and Dsparse. We provide a detailed discussion

on the selection of these parameters in Chapter 5. We also present experimental results

which show that the Spatio-Temporal priors introduced by GMM S-MD help to detect small

objects in surveillance videos.

4.4.3 Cache performance optimization

In our application, we note that a large fraction of MB’s in a video sequence usually do not

contain foreground. Hence the only memory access involved in such cases is fetching the

parameters of the dominant modes which represent the background image (mode with the

highest weight). Hence we maintain the models of the dominant modes in a contiguous


manner (in memory) as shown in Fig. 4.5. The other observation is that the modes of only

one of the sets of pixels (A1, A2, A3 or A4) are fetched to perform skip decision of MB’s

which contain only background images. Hence, the modes of the pixels are reordered to

ensure that data elements belonging to a single set of pixels are present in a contiguous

array.
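One possible index mapping that realizes this reordering is sketched below. It is an assumption made for illustration, not the layout used in the thesis: the dominant-mode records of all samples are stored in one flat array, grouped first by sparse set (A1 to A4) and then in raster order within the set, and the dense sampling grid is assumed to have even dimensions. Python does not control cache placement, so the sketch shows only the address arithmetic.

```python
def set_id(gx, gy):
    """Sparse set (0..3 for A1..A4) of the sample at dense-grid position (gx, gy)."""
    return (gx % 2) + 2 * (gy % 2)

def dominant_mode_index(gx, gy, grid_cols, grid_rows):
    """Flat index of a sample's dominant-mode record, grouped by sparse set."""
    set_cols = grid_cols // 2
    set_size = set_cols * (grid_rows // 2)
    within_set = (gy // 2) * set_cols + (gx // 2)
    return set_id(gx, gy) * set_size + within_set

# Hypothetical 8 x 4 dense grid: the samples of A1 occupy indices 0..7.
a1 = [dominant_mode_index(gx, gy, 8, 4)
      for gy in range(0, 4, 2) for gx in range(0, 8, 2)]
print(a1)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With such a mapping, the sparse pass over a single set Ai walks a contiguous range of records, which is the access pattern the reordering is meant to favour.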

4.5 Reference frame selection

Consider a sequence of frames in display order as shown in Fig. 4.7. The first frame

in the sequence is an IDR frame. Let FC be the current frame being encoded. Let the

mth frame in the sequence be denoted by Fm. Let P denote the ‘key frame’ period in the

sequence. The set of reference frames available in the reference picture buffer when coding

the current frame are shaded in grey. In static camera surveillance videos, the previous

frame in display order typically provides the best prediction (least R-D cost) for foreground

MB’s. Hence every encoded frame is marked as reference for the successive frame and is

placed as the first entry of reference picture ‘List 0’ using Reference Picture List Reordering

(RPLR) commands. Let $F_m^{R(n)}$ indicate that the mth frame in the video sequence is the

nth reference frame in the reference picture buffer. As a consequence, the previous frame

is denoted by $F_{C-1}^{R(0)}$. After encoding the current frame, the encoder needs to replace a

reference frame in the buffer by the current frame. The index of the frame in the picture

buffer to be replaced is denoted by nR. Selection of the value of nR is discussed in Sec. 4.8.
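The buffer update implied by this description can be sketched as follows. This is a hedged illustration only: it assumes a full buffer of N entries, assumes that the remaining frames keep their relative order once entry nR is dropped, and uses hypothetical names; the actual reordering is signalled through RPLR commands and the choice of nR is the subject of Sec. 4.8.

```python
def update_reference_buffer(rpb, background_model, current_frame, b_current, n_r):
    """Drop entry n_r and place the just-encoded frame at the top of 'List 0'.

    rpb              : ordered list [R(0), ..., R(N-1)]
    background_model : ordered list M = [B(0), ..., B(N-1)]
    """
    assert 0 <= n_r < len(rpb)
    new_rpb = [current_frame] + rpb[:n_r] + rpb[n_r + 1:]
    new_model = [b_current] + background_model[:n_r] + background_model[n_r + 1:]
    return new_rpb, new_model

# Hypothetical buffer with N = 3; frame labels are placeholders.
rpb, model = update_reference_buffer(
    ["F9", "F5", "F2"], [{1}, {2}, {3}], "F10", {4}, n_r=1)
print(rpb)  # ['F10', 'F9', 'F2']
```

The background model M is updated in lockstep with the buffer so that B(n) always describes the frame R(n).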


Figure 4.7: Sequence of frames in display order (time axis from IDR frame 0 to IDR frame 1; the reference frames in the RPB and the current frame FC are marked; the previous frame is always a reference frame)

4.6 Macroblock Classification

Fig. 4.8 shows a pictorial description of a surveillance video sequence. Consider a Mac-

roblock MBi (ith MB in raster scan order) in the current frame which contains moving

foreground objects (e.g. MB1 in Fig. 4.8). The MB will not be skipped and the H.264/AVC

encoder will perform motion estimation (ME) and mode decision to determine the optimum

mode. Let $C_i^{FG}$ denote the number of bits required to code the FG macroblock MBi. Since

almost all FG MB’s obtain predicted data from the previous frame, $C_i^{FG}$ is only dependent

upon the encoder complexity parameters (e.g. motion search range) and not on the selection

of reference frames. MB2 contains only background objects and hence is marked as

‘Skip’. The coding cost to mark such MB’s as ‘Skip’ is very low and is denoted by $C_i^{BG \to BG}$

(for the ith MB). Macroblock MB3 contains only background objects in the current frame.

However, the encoder cannot mark it as ‘Skip’ since the collocated block in the previous

frame contains foreground objects. The encoder will need to refer to other frames in the


Figure 4.8: Macroblock reference assignment (macroblocks MB1 to MB4 are marked in the current frame FC; also shown are Ref. frame 0 (previous frame), Ref. frame 1 and Ref. frame 2)

RPB in which the collocated block is a background MB (i.e. the MB does not contain fore-

ground objects). If the background MB is available in the RPB, the current macroblock can

be coded using the inter mode with motion vector and residual set to 0. This would result

in a low coding cost $C^{FG \to BG\langle A \rangle}_i$ (for the ith MB). However, if none of the reference frames

contains a collocated background macroblock for MBi (e.g. MB4 in Fig. 4.8), then the

encoder will need to perform motion estimation and mode decision. It will also have to

encode the residual and will incur a bit cost denoted by $C^{FG \to BG\langle U \rangle}_i$. The notations used to

classify the macroblocks are summarized in Table 4.1.

The total coding cost (in bits) for the current frame is given by:


Table 4.1: Notations

| MBi type | Description |
|---|---|
| FG | MBi is a FG MB in the current frame |
| BG→BG | MBi is a BG MB in both the current and previous frames |
| FG→BG | MBi is a FG MB in the previous frame and a BG MB in the current frame |
| FG→BG〈A〉 | MBi is a FG→BG MB and ∃ n such that 0 < n < N & i ∈ B(n), i.e. a collocated BG MB is 〈A〉vailable in RPB |
| FG→BG〈U〉 | MBi is a FG→BG MB and i ∉ B(n) ∀n such that 0 < n < N, i.e. a collocated BG MB is 〈U〉navailable in RPB |

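$$
\text{cost} = \sum_{i \in I_{FG}} C^{FG}_i
+ \sum_{i \in I_{BG \to BG}} C^{BG \to BG}_i
+ \sum_{i \in I_{FG \to BG\langle A \rangle}} C^{FG \to BG\langle A \rangle}_i
+ \sum_{i \in I_{FG \to BG\langle U \rangle}} C^{FG \to BG\langle U \rangle}_i
\tag{4.2}
$$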

Here, the first summation represents the total bit cost of all foreground image regions

in the current frame. The second term represents the cost to mark BG → BG MB’s as

skip and is hence very small. The third term represents the total cost to encode uncovered

background MB’s which can be directly reconstructed (without residual) using reference

pictures in the DPB. Since such MB’s do not need coding of residual information, this cost

is very small. The last term represents the coding cost of uncovered background MB’s that

need residual coding (due to unavailability of image content in the reference pictures in the

DPB).

Let $C^{FG}$, $C^{BG \to BG}$, $C^{FG \to BG\langle A \rangle}$ and $C^{FG \to BG\langle U \rangle}$ denote the bit cost summations for

FG, BG→BG, FG→BG〈A〉 and FG→BG〈U〉 macroblocks respectively in Eq. 4.2.

The total cost of coding FG→BG MB’s, $C^{FG \to BG}$, is equal to the sum of $C^{FG \to BG\langle A \rangle}$ and

$C^{FG \to BG\langle U \rangle}$.
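The classification of Table 4.1 and the cost in Eq. 4.2 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the encoder implementation: the names `classify_mb` and `frame_cost` are hypothetical, foreground maps are given as sets of MB indices, and every class is assigned a single made-up bit cost, whereas in the encoder the cost varies from one macroblock to the next.

```python
def classify_mb(i, fg_current, fg_previous, background_sets):
    """Classify macroblock i using the notation of Table 4.1.

    fg_current, fg_previous: sets of FG MB indices in the current and
    previous frames (hypothetical representation).
    background_sets: list B, where B[n] is the set of BG MB indices of the
    n-th reference frame. Only 0 < n < N is searched, as in Table 4.1.
    """
    if i in fg_current:
        return "FG"
    if i not in fg_previous:
        return "BG->BG"
    for n in range(1, len(background_sets)):
        if i in background_sets[n]:
            return "FG->BG<A>"
    return "FG->BG<U>"


def frame_cost(num_mbs, fg_current, fg_previous, background_sets, bits):
    """Evaluate Eq. 4.2 with one illustrative bit cost per MB class."""
    totals = {"FG": 0, "BG->BG": 0, "FG->BG<A>": 0, "FG->BG<U>": 0}
    for i in range(num_mbs):
        mb_type = classify_mb(i, fg_current, fg_previous, background_sets)
        totals[mb_type] += bits[mb_type]
    return sum(totals.values()), totals


# Four macroblocks mirroring MB1..MB4 of Fig. 4.8 (indices 0..3 here).
# The bit costs are invented for illustration only.
bits = {"FG": 400, "BG->BG": 1, "FG->BG<A>": 4, "FG->BG<U>": 150}
total, per_class = frame_cost(
    num_mbs=4,
    fg_current={0},
    fg_previous={0, 2, 3},
    background_sets=[{1}, {1, 2}],  # N = 2: B(0) and B(1)
    bits=bits,
)
```

With these made-up numbers the FG→BG〈U〉 term dominates the non-FG part of the cost, which is the term the reference frame selection of Sec. 4.8 tries to shrink.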

4.7 Skip Signalling

MB’s which have been determined to contain only background image content (BG MB’s)

can be marked as Skip based on the availability of background MB’s in the model M . The

appropriate coding decisions Si for macroblock MBi are obtained as shown in the flowchart

in Fig. 4.9. Here, $I_{FG}$, $I_{BG \to BG}$, $I_{FG \to BG\langle A \rangle}$ and $I_{FG \to BG\langle U \rangle}$ denote the sets of all indices

of FG, BG→BG, FG→BG〈A〉 and FG→BG〈U〉 macroblocks in the current frame

respectively. MV and MVP represent the motion vector and motion vector predictor of the

MB respectively.


[Flowchart: skip signalling for MBi. Decision nodes: i ∈ I_FG?, i ∈ I_BG→BG?, MVP == 0?, i ∈ I_FG→BG〈A〉?. Outcomes: Mode = SKIP; Mode = P16x16 with MV = 0 and Residual = 0 (*), using Ref = 0 or Ref = r such that i ∈ B(r), 1 ≤ r < N; motion estimation and mode decision by the H.264/AVC encoder. Final blocks: entropy coding, End.]

* Residual and quantized coefficients for the entire 16x16 MB are set to 0

Figure 4.9: Skip Signalling
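The coding decision for a single macroblock can be sketched in Python as follows. This is one reading of the labels in Fig. 4.9, not a confirmed transcription of the flowchart: the function name `skip_signalling`, the string labels and the returned dictionary are hypothetical, and the treatment of BG→BG macroblocks with a non-zero MVP (explicit P16x16 with zero motion and Ref = 0) is an assumption. It rests on the fact that a P_Skip macroblock in H.264/AVC carries no motion vector, so the decoder infers the motion from the predictor.

```python
def skip_signalling(mb_type, mvp_is_zero=True, background_ref=None):
    """Return an illustrative coding decision S_i for one macroblock.

    mb_type: "FG", "BG->BG", "FG->BG<A>" or "FG->BG<U>" (Table 4.1).
    mvp_is_zero: True if the motion vector predictor of the MB is zero.
    background_ref: index r (1 <= r < N) of a reference frame with i in B(r);
    required only for FG->BG<A> macroblocks.
    """
    # Zero-motion inter coding with all residual coefficients set to 0.
    zero_motion = {"mode": "P16x16", "mv": (0, 0), "residual": 0}

    if mb_type == "BG->BG":
        if mvp_is_zero:
            return {"mode": "SKIP"}
        # Assumed branch: a skip would inherit the non-zero predictor,
        # so zero motion is signalled explicitly against the previous frame.
        return dict(zero_motion, ref=0)

    if mb_type == "FG->BG<A>":
        if background_ref is None:
            raise ValueError("FG->BG<A> needs a reference index r with i in B(r)")
        return dict(zero_motion, ref=background_ref)

    if mb_type in ("FG", "FG->BG<U>"):
        # Regular encoder path: motion estimation and mode decision.
        return {"mode": "ME_AND_MODE_DECISION"}

    raise ValueError("unknown macroblock type: " + str(mb_type))


decision_bg = skip_signalling("BG->BG", mvp_is_zero=True)
decision_uncovered = skip_signalling("FG->BG<A>", background_ref=2)
```

In every branch except the last, motion estimation is bypassed, which is where the encode time reduction for background macroblocks comes from.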


4.8 Optimum Reference Frame selection

In Chapter 2, we noted that high quality long term reference frames (HQF’s) do not reduce

the coding cost of FG objects (i.e. $C^{FG}$) in surveillance video encoders. However, the cost

to encode uncovered background regions $C^{FG \to BG\langle U \rangle}$ depends upon the set of reference

frames in the DPB. We now propose an H.264 standard compliant reference frame selection

technique to reduce the cost of coding uncovered background regions in surveillance videos.

Later, in Chapter 5, we implement different reference frame selection strategies in Matlab

and analyze the performance of the proposed scheme. We also compare the proposed

technique with the 1× and 2× algorithms using real world surveillance videos.

4.8.1 Proposed Adaptive Reference Frame Selection Technique

From the discussion in Sec. 4.6, we observe that the coding cost of uncovered background

regions, i.e. $C^{FG \to BG}$, is dependent on the reference frames present in the RPB. Also, we

noted in Sec. 4.5 that, after coding $F_C$ (the current frame), the encoder will need to replace

an existing picture in the RPB with the current frame. The choice of the frame in the RPB

to be replaced will decide the set of reference pictures available for future frames and will

hence determine $C^{FG \to BG}$. The optimal selection procedure will attempt to maximize the

number of FG→BG macroblocks which can be reconstructed from the collocated positions

in the reference frames without coding any residual. However, this would require the en-

coder to ‘look ahead’ and would also be computationally very expensive. We propose a low

computational cost reference frame selection algorithm to mark the reference picture in the

RPB which will be replaced by the current frame. We obtain the theoretical upper bound

in Section 5.6 and show that the performance of the algorithm is very close to the upper

bound.

Consider the state of the RPB after the current frame has been encoded. The complete

set of MB’s in the background model is $B = \bigcup_{n=0}^{N-1} B(n)$. B is referred to as the background

set. If the encoder marks the current picture as the $n_R$-th reference picture, then the updated

background set would be $\bigcup_{n=0,\, n \neq n_R}^{N-1} B(n) \cup B_{current}$. The marking decision is made so as


Init: C = 0, M = ∅, B(n) = ∅ ∀n ∈ {0...(N − 1)}
Data: Input frame F_C

while New frame F_C do
    GMM S-MD: Determine B_current
    if F_C is an IDR frame then
        Clear RPB
        M ←− ∅
        B(n) ←− ∅ ∀n ∈ {0...(N − 1)}
        n_R ←− 0
        I-frame coding of F_C
    else
        RPLR: Set previous frame F_{C−1} as R(0) (first entry in Ref. list 0). Apply same reordering to M
        for i ∈ (1...N_MB) do
            Skip Signalling for MB_i (See Fig. 4.9)
        end for
        n_R ←− argmax_n | ∪_{i=0, i≠n}^{N−1} B(i) ∪ B_current |
    end if
    B(n_R) ←− B_current (Update Background Model)
    R(n_R) ←− F_C (Insert current frame into RPB)
    C ←− C + 1
end while

Figure 4.10: Pseudo Code of Proposed Reference Frame Selection Scheme

to maximize the number of background macroblocks in the set B. The pseudo code for the

proposed method is shown in Fig. 4.10. Here, NMB denotes the number of MB’s in a single

frame. The background model, {B(i) : 0 ≤ i < N}, is initialized to the empty set ∅ at the start

of the video sequence. It is updated after every frame has been encoded and the reference

frame marking decision has been completed. If the current frame is marked as an IDR frame,

the background model is reset to ∅ (since the DPB is flushed by the decoder when an IDR

frame is received). The worst case computational cost of the proposed algorithm is equal to

$N(N-1)N_{MB}$ logical bitwise OR operations (required to compute ‘Set Unions’) followed by

conditional increment operations (required to compute the cardinality of the ‘Set Union’).


However, we show in Sec. 5.6 that the coding performance approaches the upper bound

when 2 reference frames (i.e. N = 2) are used. Hence, the proposed algorithm has a very

low computational cost and can be implemented on embedded platforms. The method also

provides computational cost reduction by avoiding the coding (Motion estimation, mode

decision and residual coding) of several FG→ BG MB’s.
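The selection of $n_R$ can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: each background set B(n) is stored as an integer bitmask with bit i set when MBi is a background MB in reference n, the function names are hypothetical, and ties are resolved in favour of the lowest index, which the text does not specify.

```python
def popcount(mask):
    """Cardinality of a set stored as an integer bitmask."""
    return bin(mask).count("1")


def select_replacement_index(background_masks, current_mask):
    """Return n_R, the RPB index whose replacement by the current frame
    leaves the largest background set.

    background_masks: list of N bitmasks, one per reference frame B(n).
    current_mask: bitmask B_current of the frame that was just encoded.
    """
    best_n, best_size = 0, -1
    for n in range(len(background_masks)):
        # Union of every B(i) with i != n, together with B_current.
        union = current_mask
        for i, mask in enumerate(background_masks):
            if i != n:
                union |= mask
        size = popcount(union)
        if size > best_size:  # strict '>' keeps the lowest index on ties
            best_n, best_size = n, size
    return best_n


# Three reference frames over four macroblocks (bit 0 is MB 0).
B = [0b0011, 0b0110, 0b0010]
B_current = 0b1000
n_R = select_replacement_index(B, B_current)

# Update step of Fig. 4.10: the chosen entry now describes the current frame.
B[n_R] = B_current
```

In this example B(2) is a subset of the other two sets, so replacing it loses no background macroblock and the sketch selects index 2. The nested loops perform N(N − 1) OR operations over masks of N_MB bits, in line with the worst case cost quoted above.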

4.9 Summary

In this chapter, a low computational complexity, sampler based architecture has been pro-

posed to detect foreground MB’s in static camera surveillance videos. A brief introduction

to relevant sampling techniques has been discussed. A multi stage sampler that combines

stratification and adaptive sampling techniques has been developed. The proposed scheme

reduces the complexity without affecting the accuracy of the detector. H.264/AVC standard

compliant skip signalling techniques for background MB’s have also been described. We

also proposed a reference frame selection technique for a static camera surveillance video

encoder. The proposed scheme maximizes the number of BG MB’s available in the DPB and

hence reduces the cost of coding uncovered background regions. In Chapter 5, we present

RD performance results of the proposed scheme on real world surveillance videos.

Chapter 5

Results: Skip Decision and Reference

Frame Selection

5.1 Introduction

In Chapter 4, we have described the skip decision and reference frame selection algorithms

that we propose to reduce the bitrate of surveillance videos. We now present experimental

results validating the performance of the proposed techniques. We initially describe the

experimental setup and the test video dataset. Next, the RD performance of the proposed

GMM S-MD algorithm is compared with other techniques in literature. Rate distortion

curves of 6 videos have been plotted. Bit rate reduction data obtained on ‘No activity’

datasets (videos which do not contain FG objects) is provided. Complexity reduction results

of the proposed encoder (i.e. encode time reduction) are also described. We also show

encoded output frames for a few videos in the dataset. Performance of the GMM S-MD algorithm

in challenging conditions, such as the presence of obscured/camouflaged small objects

and low lighting, is analyzed. The necessity to incorporate spatio-temporal bias in the skip

detector architecture is illustrated using sample test cases. The impact of varying the learning

rate and threshold parameters of the GMM S-MD algorithm on the bitrate and accuracy

is described. Next, we show that the proposed technique increases the distortion computed

over the entire image but does not affect the utility of the encoded surveillance video. We

also show that slow and fast varying lighting conditions do not cause any ‘false miss’ out-

puts in the skip detection algorithm. Next, we provide an analysis of the proposed adaptive

reference frame selection algorithm. We then compare it with a recently proposed reference

frame selection technique. Finally, a summary of the experimental results is provided.


5.2 Experimental Setup

Sixteen uncompressed 720p (1280x720 resolution) surveillance videos with a wide variety

of characteristics (indoor, outdoor, no foreground activity, fast motion, small objects, per-

sisting foreground, low lighting conditions, different white balance and exposure settings,

lighting change) have been collected at 10fps in 4:2:2 format. The entire dataset along with

the encoded videos have been published on the Internet1. The videos are down-sampled

to the 4:2:0 format and used as the test set. Sample snapshots of the dataset are shown

in Fig. 5.1. Along with this dataset, we also use three videos from the PETS 2009 video

dataset (PETS-1: View 001 sparse crowd, PETS-2: View 006 & PETS-3: View 001 dense

crowd) [135] and one video from the CDW dataset (wetSnow) [136] (we resize/crop the

images to PAL resolution i.e. 768 × 576 pixels). 100 frames of each video sequence are

encoded (larger number of frames are encoded for the ‘Parking lot’ and the ‘Evening fade’

datasets) and QP values are varied to obtain different sample points on the R-D (Rate-

Distortion) plane. Foreground pixels for 25 randomly selected frames (in each video) are

manually annotated for 10 videos and the distortion is computed over their Luma values

(We use the ground truth of all the 100 frames provided in the CDW dataset). The settings

used for the GMM S-MD parameters are discussed in Section 5.4.

In this work, we use adaptive memory control to manage the DPB. To evaluate the

benefits of multiple reference frames, we have integrated the proposed methods into the

H.264/AVC reference software JM 18.0 [137]. The proposed techniques have been implemented

in the C++ programming language. The Windows operating system has been

used to perform all the experiments. Rate distortion optimization (RDO) has been enabled.

Main profile with P slices and CABAC (Context-adaptive binary arithmetic coding) entropy

coding is used for all the experiments. To measure speedup, we use the highly optimized

x264 video encoder [11]. Single pass mode with IPPP coding structure is used for low de-

lay and low complexity encoding. RD mode decision for all frames and fast skip detection

on P-frames has been enabled. Single threaded mode is chosen and the computation time

is measured on a Core i5 processor (having 2x64KB L1, 2x256KB L2 and 3MB L3 caches)

running at 2.53GHz with 4GB of system memory.

1 http://chips.ece.iisc.ernet.in/index.php/Pushkar G

(a) (b)

(c) (d)

Figure 5.1: Snapshots from the video dataset (a) Entrance (b) Parking Lot (c) Access Door

(d) Backyard1

5.3 Skip Selection using GMM S-MD

Figs. 5.2 & 5.3 compare the RD performance of ‘Skip detection’ using the proposed GMM

S-MD technique with those in [24], [41] and JM [137]. GMM S-MD encodes a large number

of background MB’s as ‘Skip’ and hence provides a significant increase in R-D performance

of up to 2dB at high bitrates (‘Bridge’ dataset). However, at low bitrates, we find that the

average reduction in data rate across the video dataset is not high. This is because the R-D

cost of the skip mode (at low bitrates) is low and hence the RDO based encoder chooses


the skip mode for most of the background MB’s. However, GMM S-MD provides bitrate

reduction of 27.3% compared to [24] & [41] on the ‘No Activity1’ sequence (at low bitrate;

QP set to 32). On the same sequence, GMM S-MD also provides execution time reduction

of 40.8% compared to [24] & [41] (measured using x264 [11] with QP set to 32). Fig.

5.3 shows that the proposed technique reduces bitrate by 29.2% & 30.3% on the PETS-1 &

CDW videos (with QP set to 24). Experiments also show that GMM S-MD provides 11.4%

& 4.9% bitrate reduction on PETS-2 & PETS-3 videos respectively. Since R-D data indicates

that the methods in JM, [24] and [41] do not skip a significant number of BG MB’s, we

studied the impact of relaxing the thresholds for skip selection: the Tc and Te values in [24]

and the value of Tlow in [41] were increased. We found that this causes a few foreground

regions to be incorrectly marked as ‘Skip’, thereby reducing the foreground PSNR.

Table 5.1 lists the reduction in encoding time obtained by adopting the proposed

method. The proposed method provides up to 74.5% reduction in encoder execution time

over [41] & [24] (measured using x264 [11]). We observe that [41] provides good com-

putational complexity reduction on indoor scenes which contain objects with little texture.

However, it does not skip a large number of BG MB’s in scenes with rich texture (e.g. in

the ‘No activity1’ dataset). [24] provides nominal reduction in scenes which are brightly

illuminated. However, under relatively low lighting conditions, the increase in the pixel

noise causes a significant number of BG MB’s to be marked for mode decision. The variance

parameters of the GMM model used in the proposed approach track changes in the pixel

statistics and hence reduce the number of such false alarms across all the video datasets.

We note that a large number of surveillance cameras are monitoring scenes with little or

no foreground objects for a significant fraction of the time. Hence reducing the false alarm

rate is important to reduce the average bandwidth and the average power consumption.

Table 5.2 shows the bitrate reduction on two surveillance videos which have no foreground

activity. We find that the proposed skip decision method provides bitrate reduction of up to

94.5% (over [41] & [24]) by reducing the number of false alarms. The 2x1 morphological

erosion operation performed on IFG (to remove noise) was found to provide a 53.9% re-

duction in bitrate in the ‘No Activity1’ dataset. To quantify the benefit of the ‘cache aware


Table 5.1: Average execution time reduction of encoder (QP set to 24)

Note: ∆Encode time in % is with respect to x264 [11]

| Sequence | Zeng [41] ∆Encode time | Jin [24] ∆Encode time | GMM S-MD ∆Encode time |
|---|---|---|---|
| Entrance | 26.4% | 4.9% | 42.4% |
| Walkway | 37.4% | 11.3% | 51.7% |
| Access Door | 51.7% | 4.5% | 56.3% |
| BackYard1 | 11.4% | 14.4% | 31% |
| BackYard2 | 31.9% | 22.4% | 58.6% |
| Parking Lot | 21.8% | 0.1% | 77.9% |
| Bridge | 9.7% | 1.4% | 30.1% |
| No Activity1 | 18.9% | 23.1% | 80.4% |
| No Activity2 | 78.8% | 6.9% | 82.3% |
| CDW | 13.2% | 52.7% † | 36.5% |
| PETS-1 | 15.6% | 27.3% | 61.3% |
| PETS-2 | 40.4% | 17.7% | 44% |
| PETS-3 | 15.6% | 3.6% | 26.8% |

† Large number of FG MB’s in CDW are marked as ‘Skip’ by Jin [24]

placement’ of GMM parameters, we have coded GMM S-MD with cache optimization en-

abled and disabled. We have also measured the last level cache (LLC) references using CPU

counters. We find that cache optimization provides 12.3% reduction in execution time (in

the ‘No Activity2’ dataset) and 30.2% reduction in LLC references. We note that larger exe-

cution time savings would be obtained in embedded platforms (which typically do not have

L3 caches) since such LLC references would have to be serviced by the main memory.

Table 5.3 shows that the execution time of the proposed GMM S-MD method is in the

range of 1ms-3.6ms. The table also lists the computation time required when the sampler

is disabled i.e. pixel segmentation is performed using the modified GMM algorithm pro-

posed by Zivkovic [3] followed by a 2x2 morphological erosion operation. The proposed

GMM S-MD method provides speedup in the range of 22 - 33 over [3] for video datasets

containing foreground objects. The computation time of the proposed method measured


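Table 5.2: Performance comparison of proposed GMM S-MD on ‘No activity’ datasets

Note: Bit rate reduction in % computed with respect to JM

| Sequence | JM [137] Bitrate (kbps) | Jin [24] ∆Bitrate | Zeng [41] ∆Bitrate | GMM S-MD ∆Bitrate |
|---|---|---|---|---|
| No Activity1 | 2596 | 22% | 0.95% | 86% |
| No Activity1 | 776 | 14.4% | 0.5% | 72.9% |
| No Activity1 | 277 | 7.2% | 0.2% | 55.3% |
| No Activity1 | 139 | 3.4% | 1.1% | 29.8% |
| No Activity2 | 1458 | 6.1% | 35.4% | 96.4% |
| No Activity2 | 233 | 4.1% | 59.2% | 89.6% |
| No Activity2 | 34 | 1.4% | 55.9% | 58.2% |
| No Activity2 | 21 | 33.3% | 34% | 34% |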

on the ‘No Activity’ datasets is very low (1ms - 1.5ms) since most of the MB’s are classified

as ‘Non Salient’ and are hence sparsely sampled. The low computation cost of the GMM

S-MD algorithm enables execution on low power embedded camera platforms. Park et al.

proposed a random-sampler based method for background subtraction in [138]. A set of

sparsely sampled pixels are segmented as either foreground or background. The regions

surrounding the sampled pixels marked as foreground are further classified. Based on the

number of foreground pixels around the sampled locations, further spatial expansion is

performed. In [139], Lee et al. further improved upon [138] by classifying pixels in an

interleaved order. They showed a speedup in the range of 2.3-3.4 over [3]. In comparison,

GMM S-MD utilizes a fixed 2-Step sampling structure to efficiently incorporate MB-level

Spatio-Temporal priors for Skip decision. The 2-Step sampling structure of GMM S-MD al-

lows different GMM thresholds to be applied on pixels of ‘Salient’ and ‘Non Salient’ MB’s. In

Section 5.4, we show that this enables accurate detection of small obscured objects. GMM

S-MD also samples pixels over a regular grid and hence enables cache performance opti-

mizations through data-structure rearrangements. Chang et al. [134] proposed to modify

the sampler density based on the foreground probability. However, the computational cost

required to determine the foreground probability model and the sampler map limited the


speedup obtained (over [3]) to 3. Guo et al. [140] used a hierarchical, block & pixel level

segmentation technique to reduce computational complexity. Unlike in GMM S-MD, spatial

samplers are not adopted and hence every pixel in the frame is accessed to compute the

block level features. Spatio-Temporal priors are also not used to bias block-level decisions.

Hence, a low threshold is applied on the block level features to ensure that foreground

detections are not missed. They showed speedup of 5.7 over GMM.

Table 5.3: Average execution time of Skip detection

| Sequence | Zivkovic [3] Time (ms) | Jin [24] Time (ms) | GMM S-MD Time (ms) |
|---|---|---|---|
| Entrance | 78.1 | 3.9 | 3.4 |
| Walkway | 66.8 | 3.5 | 2.3 |
| Access Door | 70.8 | 4 | 2.6 |
| BackYard1 | 67.5 | 3.4 | 2.8 |
| BackYard2 | 53.8 | 3.4 | 1.6 |
| Parking Lot | 68.4 | 4.4 | 2.9 |
| Bridge | 75.7 | 4 | 3.6 |
| No Activity1 | 53.6 | 3.4 | 1.5 |
| No Activity2 | 52.4 | 3.9 | 1 |
| CDW | 24.4 | 1 | 1.56 |
| PETS-1 | 19.6 | 1 | 0.8 |
| PETS-2 | 22.5 | 1 | 1 |
| PETS-3 | 22.4 | 1.3 | 1.2 |
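The speedup of the skip detector over [3] is the ratio of the per-frame execution times listed in Table 5.3. The short Python sketch below evaluates this ratio for three rows of the table; the function name `speedup` is hypothetical, and the values are copied from the table rather than measured again.

```python
def speedup(baseline_ms, proposed_ms):
    """Ratio of baseline execution time to proposed execution time."""
    return baseline_ms / proposed_ms


# (Zivkovic [3] time, GMM S-MD time) in ms, taken from Table 5.3.
times_ms = {
    "Entrance": (78.1, 3.4),
    "BackYard2": (53.8, 1.6),
    "PETS-1": (19.6, 0.8),
}

speedups = {name: speedup(z, g) for name, (z, g) in times_ms.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

For these three sequences the ratios are roughly 23, 34 and 25 respectively.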

We now analyze the relation between noise, resolution and bitrate of the encoded video.

We mentioned in Chapter 1 that, for scenes which are not well illuminated, we need to

increase gain. Fig. 5.7 shows one such example where the gain is increased to improve

visibility of the corridor. Fig. 5.7 also shows 100 RGB sample values of a pixel from the video

plotted in the 3D RGB space. We can see that the background GMM mode has accurately

modelled the probability distribution of the pixel. Due to the increased gain, the noise is

high. The bitrate of the video encoded using JM is measured to be 220kbps. Using GMM

S-MD, the required bitrate reduces to 33kbps. When the resolution of the video is reduced to

800 × 480 pixels (using bilinear interpolation), the bitrate of the JM coded output bitstream drops

to 30kbps. At this reduced resolution, the GMM S-MD output video bitrate is measured to

be 14kbps. The bilinear interpolation operation filters the noise and causes the drastic

reduction in bitrate. A large number of background MB’s are marked as skip since the

residual after quantization reduces to 0. However, the proposed technique can encode

the full resolution (1280 × 720 pixels) video at almost the same bitrate (33kbps, compared

to 30kbps for the reduced resolution JM output). Also, at lower

resolution, i.e. 800 × 480 pixels, GMM S-MD provides bitrate reduction of 53.3% over JM.
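The percentages in this comparison follow directly from the four measured bitrates. The Python sketch below reproduces the arithmetic; the helper `bitrate_reduction_percent` is a hypothetical name introduced only for this illustration.

```python
def bitrate_reduction_percent(reference_kbps, proposed_kbps):
    """Bitrate saving of the proposed encoder relative to the reference, in %."""
    return 100.0 * (reference_kbps - proposed_kbps) / reference_kbps


# Bitrates quoted in the text for the poorly lit corridor video (kbps).
full_res = bitrate_reduction_percent(220, 33)  # 1280 x 720: JM vs GMM S-MD
low_res = bitrate_reduction_percent(30, 14)    # 800 x 480: JM vs GMM S-MD

print(f"Full resolution: {full_res:.1f}% reduction")
print(f"Reduced resolution: {low_res:.1f}% reduction")
```

The second value matches the 53.3% stated above, and the first shows that GMM S-MD removes about 85% of the JM bitrate at full resolution.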



Figure 5.2: RD data for the (a) Bridge & (b) Walkway video sequences



Figure 5.3: RD data for the (a) Access Door & (b) Entrance video sequences



Figure 5.4: RD data for the (a) PETS-1 & (b) PETS-2 video sequences


[Figure panels (a), (b) and (c); annotation in the figure: "Object can be noticed in the video"]

Figure 5.5: Encoded frames of the (a) Light Switch (b) Bridge and (c) Low light video

sequences


(a)

(b)

(c)

Figure 5.6: Encoded frames of the (a) CDW (b) PETS-2 and (c) PETS-3 video sequences


[Figure: corridor image with pixel P marked (left) and 3D scatter plot with axes R, G and B (right)]

Figure 5.7: Figure on the left shows a poorly lit corridor scene with increased camera gain

settings. Also, on the right, 100 RGB sample values of a pixel (pixel P in the image) from

the video are plotted in the 3D RGB space. In the same picture, the background GMM mode

is shown, i.e. the points on the sphere are at a distance of 2.5 σ (Mahalanobis distance)

from the mean value of the mode.


5.4 Analysis of GMM S-MD Performance

In this section we will initially discuss the nominal parameter-settings of the GMM S-MD

algorithm and their impact on its performance. These settings can be applied to videos in

which (a) Objects cover more than 3 - 4 MB’s and (b) Lighting conditions are good (i.e.

directly illuminated scenes unlike in Fig. 5.5c). Later, we discuss specific tuning of the

parameters required for videos which do not satisfy these constraints.

• Nominal parameter settings: Experiments show that setting Tsparse = Tdense = 2.5,

Dsparse = 8, Ddense = 4 and learning rate α = 0.004 works well. The GMM S-MD

classifier performance is not very sensitive to the precise values of Tsparse, Tdense & α and

works well within a reasonable range of settings. To verify the robustness of the sampler,

we have encoded the videos (sequences 1-5 in Table 5.3) with different values of Dsparse

and determined the number of FG MB’s that were incorrectly marked as BG. The value of

Ddense was set to 4. We observe only one mis-detection in the ‘Access Door’ video when

Dsparse was increased to 12 (and 3 mis-detections when Dsparse was set to 20). No mis-detections

were observed in any of the other video datasets even when Dsparse was set to a

high value of 20. This is due to the fact that even a single detection will trigger dense

sampling in the salient MB’s (around the foreground detected pixel).

Fig. 5.8 shows the impact of varying the sparse sampler threshold Tsparse on the perfor-

mance of the GMM S-MD algorithm for the ‘Walkway’ and ‘Backyard2’ video sequences.

In the case of the ‘Backyard2’ sequence, none of the foreground MB’s are wrongly marked

as ‘Skip’. In scenes with high environment noise, e.g. ‘Walkway’ sequence, increasing the

Tsparse threshold reduces the number of false alarms (i.e. background MB’s marked as

FG). However, it also causes a few missed detections. Hence, we recommend setting Tsparse

to a conservative value equal to 2.5.

• Obscured and camouflaged small objects (occupying an area of less than 3-4 MB’s and

having an appearance similar to the background): In such cases (e.g. ‘Parking lot’ &

‘Bridge’ videos), we reduce the GMM threshold of the dense sampler i.e. we set Tdense =

2. As a result, the probability that salient MB’s are marked as FG increases. Salient


MB’s classified as FG in the current frame further cause their neighbors to be marked as

‘Salient’ in the next frame. Hence, even a few sparse sampler detections on the object in

the current frame would ensure successful detection in succeeding frames. Reducing the

value of Tdense (and not Tsparse) does not result in a drastic increase in the bitrate since a

large fraction of background MB’s are filtered out by the sparse sampler. For example, on

the ‘Parking lot’ video, setting Tsparse = 2.5 & Tdense = 2 resulted in detection accuracy

equal to the case when Tsparse = Tdense = 2. However, the bitrate in the former case

(when only Tdense was reduced to 2) was 41.5% lower compared to the case when both

the thresholds, i.e. Tsparse & Tdense were set to 2. To demonstrate these findings, we have

executed GMM S-MD with different combinations of Tsparse and Tdense. Fig. ?? shows

the encoded frames. We observe that when both Tsparse and Tdense are set to a high value

(i.e. 2.5), some regions of the foreground object are missed. By reducing only Tdense, all

the foreground objects are correctly detected. Reducing both Tsparse & Tdense increases

bitrate significantly as mentioned above.

As described in Section 5.4, the Spatio-Temporal priors incorporated in the GMM S-MD

algorithm assist in continuous detection of small objects. To analyze the importance

of the Spatio-Temporal bias in the GMM S-MD design, we disable it and determine its

impact, i.e. we set F ,sal to be equal to Fsparse. Fig. 5.9 shows that without the Spatio-

Temporal bias, we miss detection of one foreground object. With the Spatio-Temporal

bias enabled, GMM S-MD detects all the foreground objects accurately.

The maximum speed of images of small objects ( < 50 pixels wide) in surveillance videos

(measured in pixels/time) is typically about (10 × 16 pixels/sec) i.e. 1 MB width/frame

at 10fps. Hence, the Spatio-Temporal priors incorporated by the GMM S-MD algorithm

are found to successfully assist in continuous detection of small objects. We do note that

the sampler occasionally misses FG MB’s when objects with area smaller than that of a

MB (16× 16 pixels) move amidst foliage (in the ‘Parking lot’ video sequence). Detection

capability of such small objects is not required by a large fraction of surveillance systems.

However, systems which require such detection capabilities will need to use a higher

sampling density (e.g. Ddense = 2 detects very small objects in the ‘Parking lot’ sequence),


albeit with higher computational complexity and memory requirements.

• Irregular environment noise: Fig. 5.2a & Fig. 5.4b show that the proposed GMM S-

MD skip detection provides 27.1% & 30.3% bitrate reduction on the ‘Bridge’ & ‘CDW’

videos (QP set to 24). However, further analysis of the MB skip maps shows that despite

the significant bitrate savings obtained, a large number of dynamic background image

regions are not marked as skip. This is due to the highly irregular motion of the background

objects. In the ‘Bridge’ video, the large motion of vegetation causes these MB’s

to be marked as FG. On the frame shown in Fig. 5.5b, GMM S-MD was found to provide

31.4% reduction of bit count compared to JM. To determine the best achievable bitrate

reduction, we manually annotated the frame and measured the bit cost of only the true

FG MB’s. These measurements show that the maximum achievable bitrate reduction is

89.3%. Similarly, in the ‘CDW’ video, irregular noise due to rain and snow is not modeled

by the GMM and is incorrectly marked as foreground. More elaborate techniques

can be adopted to improve the accuracy, albeit with greater computational cost.

• Persisting foreground: Continuous occlusions of the background scene due to objects

(as in the ‘Bridge’ sequence) cause false inclusions in the background modes of the GMM.

Similarly, slowly moving objects (as in the ‘Slow motion’ sequence) also introduce errors

in the background model. Fig. 5.12a shows that 3 FG MB’s are marked as Skip when α is

increased to 0.006 in the ‘Bridge’ sequence. Hence, to avoid wrongly marking foreground

regions as ‘Skip’ in such cases, we cannot set the learning rate α of the GMM algorithm

to a high value. However, increasing α for noisy background pixels (e.g. shaking foliage

in the ‘Bridge’ video) reduces the bitrate due to improved learning performance of the

GMM. On the ‘Bridge’ sequence, results show that the bitrate drops by 10% when α is

increased to 0.006 but 3 MB’s are wrongly marked as BG MB’s. From this discussion,

we clearly observe a tradeoff that exists in the choice of α. We observe that α set in the

range of 0.002 - 0.005 works well on all the videos (including ‘Slow Motion’ & ‘Bridge’).

We report results for the ‘Bridge’ video with α set to a conservative value of 0.002.

Fig. 5.12b shows that varying the learning rate on the ‘Backyard2’ sequence does not


have a large impact on the bitrate. It also does not cause any foreground MB’s to be

marked as BG. This is because of the absence of persisting foreground objects and low

noise characteristics of the ‘Backyard2’ sequence.

The tradeoff described above has been studied by multiple researchers in the past. Lin et

al. in [141] provide a comprehensive list of related work and also introduce an adaptive

learning-rate control scheme to resolve this tradeoff. The algorithm (in [141]) uses dif-

ferent learning rates for pixels at different locations. We note that this technique in [141]

can be easily adopted by GMM S-MD to improve performance.

• Dense foreground object presence: When the number of FG objects in the scene increases

and the noise in the background image regions is not high, the technique proposed by

Zeng et al. provides good bitrate reduction. Hence, the savings obtained over their

technique are reduced. As an example, on the PETS-3 video (in which a large number of

pedestrians walk across the scene), GMM S-MD provides 5% reduction in bitrate over the

RD cost based skip detection technique by Zeng et al. [41]. The computation required

to perform skip detection also increases as shown in Table 5.3 (PETS-3 is captured at

the same location as PETS-1 but with higher foreground motion and hence increased

execution time required for the skip detector). However, since GMM S-MD uses a 2 step

sampler based technique, the average execution time for skip detection (per frame) is

only 1.2ms. Interestingly, when the number of FG MB’s in motion is high, the savings

obtained by the proposed reference frame selection technique increase. We discuss this

in Sec. 5.7.

• Very low lighting: In such conditions, camera gain is usually increased to improve
perceptibility, so noise in low light surveillance videos is high. Hence, low light
conditions are particularly challenging. We need to lower Tsparse and Tdense to maintain
accuracy (Tsparse is set to 2 & Tdense is set to 1.5). The exposure control routine of
cameras can be easily adapted to lower the threshold values when illumination reduces. Fig.
5.6c shows an encoded frame from the ‘Low light’ sequence. We found that a significant
fraction of the background regions in the frames are not skipped. However, we note that
GMM S-MD provides 30% bitrate reduction compared to [24] and [41].
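The learning-rate tradeoff discussed earlier in this list can be made concrete with a small sketch. It assumes the textbook GMM weight update w ← (1 − α)w + α for a mode that matches in every frame, and a hypothetical weight threshold of 0.5 above which the mode is treated as background. Neither the threshold nor the function names are taken from the GMM S-MD implementation.

```python
import math


def frames_to_absorb(alpha, w_threshold=0.5, w_init=0.0):
    """Frames of continuous matching before a new mode's weight reaches w_threshold.

    Assumption: the matched mode follows w <- (1 - alpha) * w + alpha, so that
    w_n = 1 - (1 - w_init) * (1 - alpha) ** n after n frames.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if w_init >= w_threshold:
        return 0
    n = math.log((1.0 - w_threshold) / (1.0 - w_init)) / math.log(1.0 - alpha)
    return math.ceil(n)


def simulate(alpha, w_threshold=0.5, w_init=0.0):
    """Iterate the same update directly, as a cross-check of the closed form."""
    w, n = w_init, 0
    while w < w_threshold:
        w = (1.0 - alpha) * w + alpha
        n += 1
    return n


for alpha in (0.002, 0.005, 0.02):
    n = frames_to_absorb(alpha)
    # 10 fps is an assumed frame rate used only to convert frames to seconds.
    print(f"alpha={alpha}: {n} frames (about {n / 10:.1f} s at 10 fps)")
```

Under these assumptions, a smaller α keeps a stationary object in the foreground for longer (347 frames for α = 0.002) but also adapts more slowly to genuine background changes, which is the tradeoff described above.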


Figure 5.8: Impact of varying Tsparse on GMM S-MD performance shown for the (a) Walkway
and (b) Backyard2 sequences


Figure 5.9: Encoded frame from the ‘Parking lot’ video (a) without Spatio-Temporal bias
(object is missed) and (b) with Spatio-Temporal bias (object is detected)

Figure 5.10: Encoded frame from the ‘Parking lot’ sequence with (a) Ddense = 4 and
(b) Ddense = 2


Figure 5.11: (a) Correctly detected foreground objects (marked in yellow) and (b) RD data
for the ‘Parking lot’ video


Figure 5.12: Impact of varying learning rate on GMM S-MD performance on (a) Bridge and
(b) Backyard2 sequences (horizontal axis: learning rate ‘α’)


5.5 Background PSNR and its Impact

We note that the proposed approach achieves bitrate reduction by skipping MB’s which
belong to the background. While this results in a reduction of the total PSNR, we
have shown in Section 5.3 that the PSNR computed on the foreground objects remains
unaffected. An informal study was also performed to verify that the proposed method does
not cause any distraction/irritation to viewers. The encoded videos were presented to five
subjects who were informed that the task was to monitor the surveillance zones. The subjects
indicated that they did not observe any distraction/irritation in the videos encoded using
GMM S-MD. Visual attention, which has been very actively studied as a part of experimental
psychology, is known to be highly dependent upon the task and context [142]. Numerous
experiments have also concluded that attention is completely masked by foreground objects
in motion [143]. Hence, the context, task and motion in the video mask small distortions
introduced in the background.

The reduction of the total PSNR due to the proposed approach was found to be high in
the ‘Walkway’ dataset. Hence, we show 2 images in Fig. 5.13 from the ‘Walkway’ dataset,
one obtained by encoding using the JM encoder and the other using the proposed
GMM S-MD encoder. Fig. 5.14 shows the total PSNR and the foreground PSNR (of
the displayed frame) plotted against the bit rate. We clearly find that the proposed method
does not impact the utility of the encoded video. Small blocking artifacts can be noticed if
the images are viewed carefully. However, these artifacts are masked by the motion of
foreground objects, as already mentioned.
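To make the distinction between total PSNR and foreground PSNR concrete, the following sketch computes PSNR over an optional binary mask. The function name, the 8-bit peak value and the toy frame are illustrative assumptions; this is not the evaluation code used to produce the reported results.

```python
import math


def psnr(reference, decoded, mask=None, peak=255.0):
    """PSNR in dB over the pixels selected by mask (all pixels if mask is None).

    reference, decoded and mask are 2-D lists of equal size; a truthy mask
    entry marks a pixel that contributes to the error (e.g. foreground).
    """
    sq_err, count = 0.0, 0
    for y, (ref_row, dec_row) in enumerate(zip(reference, decoded)):
        for x, (r, d) in enumerate(zip(ref_row, dec_row)):
            if mask is None or mask[y][x]:
                sq_err += (r - d) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    if sq_err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak * count / sq_err)


# Toy frame: the two left columns are skipped background (larger error),
# the two right columns are foreground coded at high quality.
reference = [[100, 100, 100, 100], [100, 100, 100, 100]]
decoded = [[104, 104, 100, 100], [104, 104, 101, 100]]
fg_mask = [[0, 0, 1, 1], [0, 0, 1, 1]]

print("total PSNR (dB):", round(psnr(reference, decoded), 2))
print("foreground PSNR (dB):", round(psnr(reference, decoded, fg_mask), 2))
```

In this toy case the error introduced in the skipped background lowers the total PSNR while the foreground PSNR stays high, which mirrors the behaviour described for the ‘Walkway’ dataset.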

Since surveillance video footage is increasingly being monitored by automatic computer

vision based methods, it is important that the proposed encoding scheme does not reduce

the performance of such algorithms. Most of the successful object detection algorithms such

as ‘Discriminatively Trained Deformable Part Models’ or DPM [144] use gradient features

(e.g. HOG or ‘Histogram of Oriented Gradients’). To verify that the block artifacts around

the object do not impact the accuracy of algorithms such as DPM, we have performed object

detection tests on the videos compressed using JM and the proposed scheme. Results show


that the proposed scheme does not affect the accuracy of DPM. Fig. 5.13 shows sample

detections in the ‘Walkway’ dataset obtained using the ‘person final’ model.

We also note that lighting changes do not impact the quality of the foreground images.

Encoders place IDR (Instantaneous Decoder Refresh or key) frames typically once in 5-10s

(50-100 frames at 10fps) to provide fast-seek and error recovery capabilities. Slowly varying

light changes are updated by the IDR frames and hence the MB’s which are skipped contain

pixels which appear very similar to those in the current frame. The ‘Evening Fade’ video (in

which illumination gradually reduces over a duration of 3 minutes) has been used to verify

that no noticeable distortion is introduced in the encoded frames. Very fast lighting changes

(e.g. switching on a light) are not tracked by the GMM model and hence the MB’s will be

marked as FG. Fig. 5.6a shows the encoded video frame (of the ‘Light Switch’ dataset)

immediately after the light switch was turned ON. Fig. 5.15 shows the encoded frames
from the ‘Sunlight variation’ video captured at two different time instants. We can observe
that the foreground MB’s have been coded without any artifacts.


Figure 5.13: Snaps of the encoded ‘Walkway’ dataset coded using (a) JM and (b) GMM
S-MD show that the proposed method does not produce any conspicuous distortion in the
background. The DPM detections (yellow rectangles) are overlaid on the images.


Figure 5.14: PSNR plots for the ‘Walkway’ dataset frame in Fig. 5.13, computed over
(a) the entire frame and (b) the foreground regions. Although the proposed technique
reduces the total PSNR, it significantly improves the RD performance for foreground
image regions.


Figure 5.15: (a) and (b) show two encoded frames (with different sunlight intensities) in
the ‘Sunlight variation’ video. We observe that the proposed scheme does not wrongly mark
FG MB’s as ‘Skip’ under fast illumination changes.

Figure 5.16: Slow reduction in illumination observed in the ‘Evening fade’ video. Encoded
frames captured (a) before and (b) after the reduction


5.6 Analysis of the Proposed Adaptive Reference Frame Selection Technique

We first provide an analysis of the proposed reference frame selection scheme to gain some
insight into its performance. Next, we provide detailed RD data and compare it with one of
the recently proposed reference frame selection techniques.

The best possible reference frame replacement policy involves a combinatorial search
over all possible selection options in the video sequence and hence is computationally
infeasible even for an offline analysis. Instead, we obtain bounds and show that the
performance of the proposed reference frame selection algorithm (using 2 reference frames) is
very close to the optimum. We have obtained the foreground maps based on ‘GMM S-MD’
for all the frames in the video datasets. Different reference frame selection strategies have
been implemented in Matlab and analyzed as follows. Let us assume a hypothetical video
encoder which marks every frame after the IDR picture as a reference frame. The number of
FG → BG MB’s which can be skipped, i.e. the number of FG → BG 〈A〉 MB’s, is in this case
at least as large as the number of FG → BG 〈A〉 MB’s in any practical encoder (which only
marks a limited number of frames as reference). The hypothetical encoder would also
maximize the number of MB’s in the set B, i.e. MB’s containing only background pixels.
Hence, we can obtain the upper bound for the number of MB’s in set B (and likewise the
lower bound for FG → BG 〈U〉 MB’s) by maintaining every frame which can contribute to the
background model as a reference. Fig. 5.17 shows the number of background MB’s available
in the background set B for the ‘Entrance’ video. We observe that FG→BG MB’s for a large
fraction of the scene are not present in B when only 1 reference frame (i.e. the previous
frame) is used. The use of two previous frames for reference does not provide any
significant improvement in coverage. However, when the proposed algorithm is enabled to
select the second reference frame, it marks a frame which does not contain foreground
objects as a reference. This enables the encoder to skip coding for a large number of
FG→BG MB’s. Hence, bit rate savings of up to 16.3% (over the single reference frame
encoder) are obtained. Further increase in the number of reference frames does not
provide benefit as the number of background MB’s has already reached its upper bound.
Fig. 5.18 shows the number of FG→BG MB’s that could not refer to the reference frames
in the RPB due to unavailability. If the MB’s which were unavailable in the background set
contain rich texture (as in this video), coding them will require a large number of bits and
will result in increased bit rate.
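The bound analysis above can be illustrated with a short sketch. It assumes that the set B is the union of MB positions which are background in at least one reference frame, and that an FG → BG 〈U〉 MB is one that was foreground in the previous frame, is background in the current frame, and is not covered by B. The helper names and the toy one-dimensional foreground maps are hypothetical; the original analysis was implemented in Matlab.

```python
def background_set(fg_maps, ref_indices):
    """MB positions that are background in at least one of the reference frames."""
    covered = set()
    for i in ref_indices:
        covered |= {mb for mb, is_fg in enumerate(fg_maps[i]) if not is_fg}
    return covered


def uncovered_fg_to_bg(fg_maps, t, ref_indices):
    """FG->BG MB's of frame t (FG in t-1, BG in t) that no reference frame covers."""
    b = background_set(fg_maps, ref_indices)
    return [mb for mb in range(len(fg_maps[t]))
            if fg_maps[t - 1][mb] and not fg_maps[t][mb] and mb not in b]


# Toy sequence: 6 MB's per frame, a 2-MB object enters after an empty IDR frame
# and moves one MB to the right in every frame (1 = foreground MB).
fg_maps = [
    [0, 0, 0, 0, 0, 0],  # frame 0: IDR picture, no foreground
    [1, 1, 0, 0, 0, 0],  # frame 1
    [0, 1, 1, 0, 0, 0],  # frame 2
    [0, 0, 1, 1, 0, 0],  # frame 3
    [0, 0, 0, 1, 1, 0],  # frame 4: current frame
]
t = 4
print("1 ref (previous):     ", uncovered_fg_to_bg(fg_maps, t, [3]))
print("2 refs (previous):    ", uncovered_fg_to_bg(fg_maps, t, [3, 2]))
print("2 refs (object-free): ", uncovered_fg_to_bg(fg_maps, t, [3, 0]))
print("all frames (bound):   ", uncovered_fg_to_bg(fg_maps, t, [0, 1, 2, 3]))
```

In this toy case the two most recent frames leave the uncovered MB without a background reference, whereas pairing the previous frame with an object-free frame already matches the bound obtained by keeping every frame.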

Figure 5.17: No. of MB’s in the set B as a percentage of NMB (Total No. of MB’s in a frame)
for the ‘Entrance’ video. [Plot: number of MB’s in B (% of NMB) against frame number;
curves: 1 Ref. (Previous), 2 Ref. (Previous), 2 Ref. (Proposed), Upper bound]


Figure 5.18: No. of FG→BG MB’s which require non-zero residual coding (i.e. No. of
FG→BG 〈U〉 MB’s) in the ‘Entrance’ video. [Plot: number of FG→BG MB’s which require
non-zero residual coding against frame number; curves: 1 Ref. (Previous),
2 Ref. (Previous), 2 Ref. (Proposed), Lower bound]

5.7 Performance of the Proposed Adaptive Reference Frame Selection Technique

In this section, we compare the proposed technique with existing reference frame selection
algorithms. To the best of our knowledge, no surveillance-specific reference frame
selection algorithm for H.264 has been published in literature. The ‘1×’ and ‘2×’ complexity
algorithms in [57] are closest in relevance to our method. The authors have also
identified occlusion as an important aspect of reference frame selection. Hence, we compare
the proposed scheme with these methods. However, we acknowledge that the 1× and 2×
algorithms have been developed without assuming a static-camera setup and hence can be
used for generic video content as well.

Before we present the data, we briefly describe the 1× and 2× algorithms. More details
can be obtained from the original reference [57]. Following the notation we introduced in
Chapter 4, let FC denote the current picture and FR(n)m denote the nth reference frame in
the DPB. The 2× algorithm estimates the cost of discarding FR(n)m after the completion of
coding the previous frame FC−1. To obtain this estimate, the current picture FC is encoded
assuming that the previous frame FC−1 and all the reference frames of FC−1 are present
in the DPB for motion compensation. The percentage of blocks used in the reconstruction
of the current frame from each of these frames is noted. We can consider this value as the
utility of the reference frame (it is denoted by β in [57]). The reference frame that has the
least utility (i.e. has least utilization for motion compensation) is discarded. FC is again
encoded using the newly computed set of reference frames to obtain the final output. The
cost of the 2× algorithm is high since two pass coding is required to determine β. Hence,
the authors of [57] propose the 1× algorithm, which estimates the utility of the frames
using statistics of the previously encoded picture, i.e. FC−1. Strong correlation between
FC−1 & FC is assumed, and the values for β (utilization ratios for the current frame FC)
are approximated to be equal to the utilization ratios obtained when coding the previous
frame. As in the case of the 2× algorithm, the reference frame in the DPB with the least
utility is evicted from the DPB.
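The eviction step shared by the 1× and 2× algorithms, as described above, can be sketched as follows. This is only an illustration of the description in this section, not the implementation from [57]; the function names and frame identifiers are hypothetical. The two algorithms differ only in where the per-block reference choices come from: a trial encode of FC (2×) or the statistics of FC−1 (1×).

```python
from collections import Counter


def utilization_from_block_refs(block_refs):
    """Utility (beta) per frame: fraction of blocks that referenced each frame."""
    counts = Counter(block_refs)
    total = len(block_refs)
    return {frame: n / total for frame, n in counts.items()}


def evict_least_used(candidates, utilization):
    """Drop the candidate reference frame with the lowest utility.

    Frames that were never referenced have utility 0 and are evicted first.
    Returns (remaining frames in their original order, evicted frame).
    """
    victim = min(candidates, key=lambda frame: utilization.get(frame, 0.0))
    return [frame for frame in candidates if frame != victim], victim


# Hypothetical statistics: 100 blocks, each recording the frame it predicted from.
block_refs = ["F9"] * 90 + ["F3"] * 8 + ["F8"] * 2
beta = utilization_from_block_refs(block_refs)
kept, evicted = evict_least_used(["F9", "F3", "F8"], beta)
print("kept:", kept, "evicted:", evicted)
```

With temporally correlated content most blocks pick the most recent frame, so a utility based policy tends to retain consecutive previous frames.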

5.8 RD performance results

We now discuss the RD performance of the proposed technique and also determine the num-

ber of reference frames required for optimum performance. From Table 5.4, we find that

the proposed assignment procedure reduces bitrate by 13.1% - 24.7% (for the ‘Entrance’

and ‘Backyard1’ sequences) compared to the case when the 1× algorithm is used to select

the second reference frame. On the PETS-2 video, the bitrate reduction obtained by using

the proposed technique was measured as 3.6%. In comparison, bitrate reduction was 1.3%

when two previous frames were used as reference. The proposed assignment increases the

number of BG MB’s in set B and hence reduces CFG→BG. Since the number of background

MB’s in set B increases, the number of FG → BG MB’s which can be marked as ‘Skip’ also

increases. This also helps to reduce the computational complexity of the encoder. Using a

second reference frame marked by the 1× algorithm reduces bit rate by up to 2.3% (com-

pared to the single reference encoder). Analysis revealed that the 1× algorithm always


marked the previous frame FC−1 along with FC as the reference set for the next frame

FC+1. As also mentioned in [57], this is due to the strong temporal correlation between the

consecutive frames FC−1 and FC . Hence, the reference frame structure is identical to the

anchor in which consecutive previous frames are used as reference. Multiple consecutive

previous frames used as reference do not provide reduction in bit rate. This is due to the

fact that consecutive previous frames do not provide reduction in either CFG or CFG→BG.

The FG MB’s choose the previous frame over the other reference frames as a result of lower

R-D cost (due to smaller motion vectors and greater similarity in content). The cost to code

the FG → BG MB’s (CFG→BG) also does not reduce since consecutive previous frames

have foreground objects present in almost the same regions in the picture.

Using the 2× algorithm to select the second reference frame provided good bitrate re-

duction of up to 24.8% compared to the single reference frame encoder (in the Backyard1

sequence). However, the high computational complexity due to the second encode pass

(of the order of tens of milliseconds) prevents its applicability to low power embedded en-

coders. In comparison, the proposed method provides higher bitrate reduction of up to

25.9% and requires only 30 − 40µsec/frame to perform reference frame selection. Since

a larger number of uncovered BG MB’s can be skipped, the proposed method also achieves

computational cost reduction of up to 7.3% over the 1× algorithm.

We however note that the bit rate reduction obtained for the ‘Walkway’, ‘Access Door’

and ‘Backyard2’ sequences using the proposed scheme is not as significant as that obtained

for the ‘Entrance’ and ‘Backyard1’ sequences. We identify 3 reasons for this observation. A

detailed discussion of each of these is provided below:

1. GMM S-MD performance: We observe that in the case of the ‘Walkway’ dataset,
adding a second reference frame provides a very small improvement in performance (using
either the previous frames as reference or using the proposed algorithm to mark the
reference frames). The reduced accuracy of skip detection due to complex shadows and
shaking vegetation was determined as the cause for this reduction in gain. A large number
of macroblocks which contained background objects were marked as foreground by ‘GMM
S-MD’ and hence were coded.


Table 5.4: Performance comparison of reference frame selection algorithms

             | Baseline: 1 Ref. Frame | 2 Ref. Frames          | 2 Ref. Frames                      | 3 Ref. Frames
             | GMM S-MD               | GMM S-MD + 1× [57]§    | GMM S-MD + Proposed sel.           | GMM S-MD + Proposed sel.
Sequence     | Bitrate     FG PSNR    | ∆Bitrate†   FG PSNR    | ∆Bitrate†   FG PSNR    ∆Time¶      | ∆Bitrate†   FG PSNR
             | (kbps)      (dB)       |             (dB)       |             (dB)                   |             (dB)
-------------+------------------------+------------------------+------------------------------------+------------------------
Entrance     | 2421        47.08      | 0.9%        47.07      | 14%         47.03      4.8%        | 14.1%       47.02
             | 1339        44.83      | 0.6%        44.84      | 15.4%       44.78      5%          | 15.2%       44.8
             | 811         42.83      | 1.1%        42.81      | 15.9%       42.79      5.2%        | 15.7%       42.8
             | 500         40.26      | 0.8%        40.28      | 16.9%       40.21      5.3%        | 16.8%       40.21
Backyard1    | 2938        46.12      | 2%          46.11      | 21%         46.06      6.9%        | 21%         46.05
             | 1816        43.51      | 1.9%        43.49      | 22.5%       43.42      7%          | 22.3%       43.42
             | 1167        41.04      | 1.4%        41.04      | 22.8%       40.98      7%          | 23%         40.96
             | 739         38.13      | 1.5%        38.11      | 25.9%       38.03      7.3%        | 25.2%       38.06
Backyard2    | 468         43.44      | 2.3%        43.46      | 8%          43.45      4.4%        | 8.8%        43.44
             | 284         40.08      | 1.3%        40.07      | 6%          40.05      4%          | 6.5%        40.04
             | 182         37.05      | 1.1%        37.06      | 4.9%        37.03      3.7%        | 5.6%        37.02
             | 113         33.75      | 0.9%        33.78      | 6.5%        33.73      3.2%        | 6.6%        33.72
Access Door  | 1010        46.14      | 0.7%        46.14      | 2.3%        46.11      4.7%        | 3.6%        46.11
             | 523         43.75      | 0.9%        43.73      | 1.4%        43.72      4.4%        | 2.2%        43.73
             | 305         41.53      | -0.1%       41.53      | -0.1%       41.53      4.2%        | 0.4%        41.51
             | 183         38.78      | 0.2%        38.78      | 0.2%        38.76      3.9%        | 0.4%        38.77
Walkway      | 1628        44.93      | 1.1%        44.89      | 3.3%        44.88      3.6%        | 3.7%        44.87
             | 943         42.27      | 0.6%        42.22      | 2.5%        42.22      3.5%        | 2.5%        42.23
             | 567         39.72      | 0.6%        39.71      | 2%          39.71      3.2%        | 2.5%        39.71
             | 328         36.74      | 0.6%        36.72      | 2%          36.73      2.9%        | 1.7%        36.74

§ We obtain identical results when we use 2 Previous frames as Reference

† Bit rate reduction in % computed with respect to baseline: 1 Ref. Frame + GMM S-MD (measured using JM)

¶ Execution time reduction in % (measured using x264) computed with respect to 2 Ref. Frames + GMM S-MD + 1× [57]


2. Number of FG → BG MB’s: When the number of foreground objects moving across
the scene is high or when objects are close to the camera, a large number of FG → BG MB’s
are created. Since the proposed scheme reduces the RD cost of FG → BG MB’s, significant
savings are observed for such videos. As a consequence, the bit rate reduction due to the
proposed scheme for the ‘Entrance’ and ‘Backyard1’ datasets is higher compared to the
other sequences (e.g. ‘Backyard2’, in which the number of FG → BG MB’s is low).

3. Background texture of uncovered regions: Presence of complex texture in the
background results in increased bit rate if those regions are coded as FG → BG 〈U〉 MB’s.
Since the proposed method reduces the number of MB’s coded as FG → BG 〈U〉, greater bit
rate reduction is found in datasets with rich background texture (e.g. ‘Entrance’ and
‘Backyard1’). As a consequence of the relatively low quantity of background texture in the
‘Access Door’ video, adding a second reference frame for this sequence does not provide
bitrate reduction (compared to the encoder which uses previous frames as reference).
However, the proposed algorithm reduces execution time by up to 4.7% by avoiding mode
decision for a few FG → BG MB’s.

As noted earlier, the computational complexity of the reference frame selection
algorithm is dependent only on the maximum number of reference frames in the DPB. It
was measured to be 30-40 µsec/frame when 2 reference frames were used and
60-70 µsec/frame when the third frame was added. However, using 3 reference frames does
not provide a significant improvement on any of the 5 video sequences. This is in agreement
with the analysis performed earlier, which showed that 2 reference frames are sufficient to
maximize the number of background MB’s in the background set B. We also performed an
experiment in which the QP of the BG MB’s was assigned a large value of 40. However, this
did not provide any further bitrate reduction and instead caused severe blurring/washout
of the background picture.


5.9 Summary

A surveillance specific distortion metric was computed to quantify the performance of the
proposed skip decision and reference frame selection techniques. The proposed algorithms
have been compared with relevant methods in literature. Experimental data show that
the proposed skip selection technique reduces bit rate by up to 94.5% and computational
complexity by up to 74.5% without affecting the foreground image quality. The skip
detection algorithm requires 1-3.6ms on a single core and hence can be easily implemented on
embedded camera platforms. Data also show that the coding cost of uncovered background
MB’s in static camera surveillance videos is not insignificant and depends upon the selection
of reference frames. We have implemented different reference frame selection strategies in
Matlab. Results show that the number of BG MB’s in the DPB when the proposed technique
is adopted is close to the upper bound, and that the proposed reference
frame selection method reduces bit rate by up to 24.7% and execution time by up to 7.3%.

Chapter 6

ROI video coding for Pedestrian

Surveillance

6.1 Introduction

In Chapter 4, we proposed to use foreground segmentation to perform skip detection of
background MB’s. All the foreground macroblocks were encoded with uniform quality.
However, in pedestrian surveillance, the facial features are the most useful for recognition
and identification tasks. High image detail of non face regions is not as important as
the features of the face regions. Assigning equal quality parameter settings to all MB’s
results in sub-optimal bitrate allocation. To illustrate this point, we have encoded a test
video and measured the number of bits allocated to the different regions. In Fig. 6.1, this
data is shown for different regions of a single frame in the video. The number of bits
allocated to the shadow and non face regions (torso, arms and legs) is almost 17× the
number of bits utilized to encode the face region. Further analysis of the encoded bit
stream shows that the high bitrate of non face regions is primarily due to four reasons:

• High bit cost of ‘FG border’ MB’s: Due to the block based coding architecture of
the H.264 standard, the encoder cannot effectively combine background and inter
predicted foreground image content of FG border MB’s. For example, in Fig. 6.1,
MB2, which is an FG border MB, requires 187 bits. In comparison, MB3 (which is not on
the border) requires only 10 bits.

• Deformations of textured clothing: Deformations of clothing reduce similarity
between adjacent frames. Hence, such MB’s cannot effectively utilize inter prediction
(e.g. MB1 in Fig. 6.1).


• Shadows on highly textured background regions: High frequency content of the

image (due to the texture) increases the energy of the residual.

• Strong shadows on background regions: Efficient inter prediction is not possible

due to significant change in the image. Hence, such MB’s (especially those on the

border of the shadow region) cause increased bitrate.

Figure 6.1: Number of bits required to encode MB’s of a surveillance frame at uniform
quality. [Image annotations: Shadow MB’s ~ 9000 bits; Face MB’s ~ 1700 bits; Non Face
MB’s ~ 20000 bits; MB1 ~ 217 bits; MB2 ~ 187 bits; MB3 ~ 10 bits]

From this analysis, it is clear that we can significantly reduce the bitrate of the encoded
video by differentially assigning QP values to MB’s covering the face and non face regions
of the pedestrians. MB’s covering the face regions are encoded with a low QP (i.e. high
quality). A higher QP is assigned to non face FG MB’s. Shadows on background surfaces can
be marked as skip to further reduce the bitrate. In Chapter 2, we have already reviewed
techniques based on this idea that have been published in literature [26, 60, 61, 62,
63, 64, 66]. Most of the previously published methods [61, 62, 66] are targeted at video
telephony applications. For example, in [66], Ming-Chieh Chi et al. use skin color based face
detection to mark ROI regions in video teleconferencing videos. However, such techniques
do not work on real world surveillance videos. To gain a better understanding of these
challenges, we use the OpenCV adaptive skin detector to determine the regions of interest
in a surveillance video. The OpenCV algorithm is based on the technique proposed by
Farhad et al. in [145, 146]. Fig. 6.2 shows skin detections (pixels marked in yellow)
obtained on a sample frame in the video. We can clearly see that using only skin detection
to perform ROI marking would not be accurate. Also, variation of skin tone under different
lighting conditions reduces the accuracy of such techniques.

Figure 6.2: Skin pixel detection in a surveillance video frame

We now discuss two object detector based ROI coding techniques proposed in literature
(the first proposed by Christopher et al. [63, 64] and the second introduced
by Lai-Tee Cheok et al. [26]) and compare them with the proposed method. Christopher et al.
[63, 64] use the Viola Jones detector [65] to detect faces in each frame. An iterative mean
shift based object tracker is initialized for each new detection. Detections which match state
objects are used to update the object representations. Face ROI MB’s are encoded at lower
QP using the H.264 encoder. However, this work has been applied to video conference
applications. In surveillance videos, low resolution and poor lighting conditions prevent
the adoption of face-feature based detectors. To study the performance of ROI coding using
face detection for surveillance, we use the Viola Jones face detector in OpenCV. Fig. 6.3
shows that face detection based ROI marking is also not accurate on real world surveillance
videos. Also, running the detector and updating the object representations on each frame
is computationally very expensive. Instead, as we show in this chapter, we only need to run
a detector once in a few frames (we set the interval to 1 second). Also, we do not update
the tracker model since we do not need to maintain identities across severely occluded
sequences.

Figure 6.3: Face region detection using the Viola Jones detector

In [26], Lai-Tee Cheok et al. use the output of the video analytics module to modulate

the bit allocation to different image regions. Segmentation is used to determine foreground

blobs. A multi-class classifier is used to label the blobs as either pedestrian, vehicle or

animal. A tracker is used to track the blob labels. Blobs containing pedestrian images are

encoded at higher quality. However, as we have seen in Fig. 6.1, larger savings can be

obtained by using different QP’s for MB’s within a blob (i.e. larger QP for non face regions

and skip mode for shadow regions).


Low resolution, occlusion & poor lighting conditions pose significant challenges to
accurate ROI detection. Also, in surveillance applications, intruders typically attempt to avoid
appearing in the camera field of view. The number of frames in which the faces are
visible would be lower in such cases. Hence, the ‘miss’ probability of the ROI detector should
be low. Clearly, using simple object detectors does not give good performance in
unconstrained environments. Joint reasoning based on multiple cues is required. A large body
of work on detecting and segmenting skin regions, face regions and pedestrians exists
in the computer vision literature. However, these techniques have not been studied in the
context of ROI video coding. ROI video coding for block based encoders like H.264
does not require pixel level segmentation. In this chapter, we use mid level super pixel
segmentation representations to efficiently and accurately determine the ROI, RORI and
RONI regions. We propose to combine pedestrian detection with skin and shadow
detection to accurately mark ROI’s. We also integrate a tracker to reduce the ‘miss’ probability.
The tracker also serves to reduce the computational cost since ROI marking of successfully
tracked objects (objects with high association scores) does not require computation of
detector scores. Bilattice based logical reasoning is used to effectively combine all the detector
scores to accurately determine the ROI, RORI and RONI regions.

The remainder of the chapter is organized as follows: The architecture of the proposed
Region of Interest video encoder for pedestrian surveillance is described in Section 6.2. In
Section 6.3, we describe low and mid level segmentation. Computations of shadow and skin
scores on super pixels are described in Sections 6.4 & 6.5 respectively. Section 6.6 describes
the DPM pedestrian detector based score computation. Geometry is described in Section
6.7. Section 6.8 describes the proposed ‘detection by tracking’ technique. The technique
proposed to infer the face, non face and RONI regions is described in Section 6.9. Section
6.10 describes the ROI, RORI & RONI marking and QP signalling. Experimental results of
the proposed technique are provided in Section 6.11.


6.2 Proposed architecture

Fig. 6.4 shows the high level block diagram of the proposed technique. As already men-

tioned in the previous section, various visual cues are combined to perform inference. We

partition the system into low level and high level inference components.

6.2.1 Low level inferencing

The incoming image is first segmented using the sampler based technique proposed in
Chapter 4. Image regions tagged as RONI by users are not processed. Blob detection and
super pixel marking are performed on the foreground pixels. Shadow scores of super pixels,
i.e. the probabilities of super pixels covering shadow image regions, are computed.
Independent shadow probability scores are computed using physics based and texture based
features. The physics based features include the illumination attenuation and the angular
orientations between the pixel and the background cluster center in RGB color space (more
details are provided in Sec. 6.4.2). The skin probability map is used to determine skin
probability scores of all super pixels.
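As a rough illustration of the two physics based features named above, the sketch below computes an illumination attenuation and an angular difference between a pixel and its background cluster center in RGB space. The exact definitions used by the proposed method are those of Sec. 6.4.2; the formulas here (ratio of vector norms and angle between the RGB vectors) are a common formulation and are assumed purely for illustration.

```python
import math


def shadow_features(pixel_rgb, bg_rgb):
    """Return (attenuation, angle_deg) of a pixel against its background color.

    attenuation: |background| / |pixel|, larger than 1 when the pixel is darker.
    angle_deg:   angle between the two RGB vectors; a cast shadow is expected
                 to darken the surface without rotating its color much.
    Both definitions are assumptions made for this sketch.
    """
    p_norm = math.sqrt(sum(c * c for c in pixel_rgb))
    b_norm = math.sqrt(sum(c * c for c in bg_rgb))
    if p_norm == 0 or b_norm == 0:
        raise ValueError("zero-length color vector")
    attenuation = b_norm / p_norm
    cos_angle = sum(p * b for p, b in zip(pixel_rgb, bg_rgb)) / (p_norm * b_norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return attenuation, math.degrees(math.acos(cos_angle))


background = (200, 180, 160)
print("darker, same hue:", shadow_features((100, 90, 80), background))
print("different color: ", shadow_features((100, 20, 20), background))
```

A pixel that is a uniformly darkened copy of the background gives a large attenuation with an angle close to zero, whereas a pixel belonging to a differently colored object gives a clearly non-zero angle.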

6.2.2 High level inferencing

The Deformable Part Model or DPM is used to determine head-shoulder, torso and leg part-

scores of foreground image regions. Inconsistent pedestrian hypotheses based on geometry

are pruned. A tracker is initialized for successfully detected pedestrians. Isolated pedes-

trians are localized using a simple blob geometry based tracker. For interacting groups of

pedestrians, an optical flow based tracker is combined with a Kalman filter to determine

association. The tracker reduces the miss probability and the computational complexity of

the ROI detector. The ROI, RORI & RONI assignment task is formulated as a super pixel

labelling problem. Bilattice logic reasoning is used to determine the set of ROI, RORI &

RONI super pixels. This labelling of super pixels is used to assign QP values to macroblocks.
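The last step, mapping super pixel labels to macroblock QP values, can be sketched as follows. The rule that the most important label inside a macroblock wins, the QP values 24 and 36, and the use of None to denote a skip candidate are all hypothetical choices made for this illustration; the marking and QP signalling actually used are described in Section 6.10.

```python
MB_SIZE = 16
PRIORITY = {"RONI": 0, "RORI": 1, "ROI": 2}
# Hypothetical QP values; None marks a macroblock that is a candidate for skip.
QP_FOR_LABEL = {"ROI": 24, "RORI": 36, "RONI": None}


def mb_labels(pixel_labels, mb_size=MB_SIZE):
    """Collapse a per-pixel label map (taken from super pixel labels) to macroblocks.

    A macroblock takes the highest-priority label found among its pixels.
    """
    height, width = len(pixel_labels), len(pixel_labels[0])
    rows = (height + mb_size - 1) // mb_size
    cols = (width + mb_size - 1) // mb_size
    labels = [["RONI"] * cols for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            label = pixel_labels[y][x]
            my, mx = y // mb_size, x // mb_size
            if PRIORITY[label] > PRIORITY[labels[my][mx]]:
                labels[my][mx] = label
    return labels


def qp_map(pixel_labels, mb_size=MB_SIZE):
    """Per-macroblock QP map derived from the macroblock labels."""
    return [[QP_FOR_LABEL[label] for label in row]
            for row in mb_labels(pixel_labels, mb_size)]


# Toy 4x4 label image with 2x2 "macroblocks" to keep the example small.
pixels = [
    ["RONI", "RONI", "RORI", "RORI"],
    ["RONI", "RONI", "RORI", "ROI"],
    ["RONI", "RONI", "RORI", "RORI"],
    ["RONI", "RONI", "RORI", "RORI"],
]
print(qp_map(pixels, mb_size=2))
```

Letting the most important label win is a conservative choice: a macroblock that touches a face super pixel is never coded at the coarser quality.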


Figure 6.4: Architecture of the proposed ROI, RORI and RONI detector


6.3 Segmentation

Since we consider a static surveillance camera, we reduce search space for ROI detection

by detecting FG blobs. Also, the shape of the FG blob is used by the shape based pedestrian

detector which we describe later in Sec. 6.6.2. The final ROI, RORI & RONI inference is

performed on super pixels. Hence, we compute super pixels over the entire foreground

image regions. We discuss each of these operations in this section.

• Sampler based foreground segmentation: We use the sampler architecture that we

proposed in Chapter 4. We set Ddense as 2 to obtain finer FG masks. The pedestrian

detector and tracker outputs are used to prevent absorption of stationary foreground

object pixels into the background model. GMM parameter update of pixels on these

pedestrian detections is not performed.

• Blob detection: Connected components are used to extract foreground blobs in the

input image. The blob area is thresholded to remove very small blobs. The contours

of the blobs are extracted and are used for HOG feature edge enhancement (which is

explained later in Section 6.6).

• Super pixel marking: ‘Super pixels’ (a term introduced by Xiaofeng Ren and Jitendra Malik in [147]) are perceptually meaningful image regions obtained by grouping pixels. Super pixel representations are increasingly being used in very successful vision algorithms, e.g. in [148]. Recently, Meuel et al. [149] have also

used them in a ROI coding system for aerial vehicles. Super pixels provide compact

image region representations which we use to perform higher levels of inferencing.

We determine the super pixels on only the foreground image regions. We adopt the

efficient Simple Linear Iterative Clustering or SLIC algorithm developed by Achanta

et al. in [150]. Computational complexity of the SLIC algorithm depends upon the

search distance and the number of k-means iterations. As we show later, ROI & RORI

marking is performed on macroblocks and hence does not require pixel level accurate

segmentation. Hence, we reduce the number of iterations of the k-means clustering


procedure to minimize the computational complexity. Fig. 6.5 shows super pixels

computed on a sample image.

Figure 6.5: Super pixels detected in a surveillance video frame
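The blob detection step described above can be illustrated with a short Python sketch. It labels connected foreground components with a breadth-first flood fill and drops blobs below an area threshold. The function name `foreground_blobs`, the choice of 4-connectivity and the list-of-rows mask layout are assumptions made for this example; the connectivity and the threshold value used in this work are not stated here.

```python
from collections import deque


def foreground_blobs(mask, min_area):
    """Return the 4-connected foreground blobs of a binary mask.

    mask is a list of rows of 0/1 values. Each blob is a list of (x, y)
    pixel coordinates; blobs with fewer than min_area pixels are dropped.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            blob = []
            while queue:
                cx, cy = queue.popleft()
                blob.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # Area thresholding removes very small blobs.
            if len(blob) >= min_area:
                blobs.append(blob)
    return blobs
```

Super pixels would then be computed only over the pixels of the surviving blobs.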

6.4 Shadow detection

As we have already described, we mark face detections (computed using the DPM detec-

tor) as ROI. Non face regions of pedestrians are marked as RORI. Super pixels around the

pedestrian bounding box (obtained using the DPM detector) need to be classified as either

shadow / non shadow. MB’s that intersect with only shadow super pixels are marked as

skip. We now describe the procedure to determine these shadow super pixels.

Cast shadow detection in static camera videos has been extensively studied by researchers [151, 152, 153, 154, 155, 156]. In [157], Sanin et al. provide a recent review of shadow removal techniques. The techniques are classified as follows:

• Chromaticity based: assumes color constancy, i.e. shadows cause reduction in lumi-

nance but the chromaticity value is mostly invariant.

• Physical methods: use physical models of lighting (e.g. dichromatic reflection model)

to improve accuracy. Ambient light and multiple light sources are considered.


• Geometry based methods: use estimated geometric relations between the cast shad-

ows and the objects.

• Texture based methods: perform texture correlation between input image regions

and the background picture. Texture features are highly discriminative and are not

affected by shadows. However, this technique does not work on regions without

strong texture points.

In this thesis, we modify implementations of the physics based and texture based ap-

proaches (provided online by [157]) to generate shadow scores for each super pixel. A

weak shadow detector is used initially to generate shadow candidates. The physics and

texture based detectors are run on only these shadow candidate pixels.

6.4.1 Weak shadow detector

When light from a source incident on a surface is obstructed, the luminance of the pixel reduces. However, the color (or chromaticity) of the shadow pixel remains nearly the same as the original value. A simple filter based on these observations can be used to reject pixels which do not satisfy these conditions, i.e. pixels for which either the luminance increases or the chromaticity change is large.

Fig. 6.6 shows the RGB space and a cone constructed with base centered at BG pixel

and apex at the origin. Pixels whose values lie inside the shaded region are considered as

shadow candidates. Here d1 & d2 are set equal to λ1dBG & λ2dBG respectively. Constants

θmax, λ1 and λ2 are threshold parameters of the candidate shadow detector.
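The candidate test can be sketched in a few lines of Python. The sketch below is one plausible reading of the cone in Fig. 6.6 and not the exact rule used in this work: it assumes that the pixel must lie within an angle θmax of the line joining the origin and the background value, and that its distance from the origin must lie between λ1 dBG and λ2 dBG with λ1 < λ2 ≤ 1. The function name and argument order are hypothetical.

```python
import math


def is_shadow_candidate(pixel, bg, theta_max, lam1, lam2):
    """Return True if an RGB pixel lies inside the assumed shadow cone.

    pixel and bg are (R, G, B) tuples, theta_max is in radians.
    Assumed geometry: angle to the background direction <= theta_max and
    distance from the origin between lam1 * d_BG and lam2 * d_BG.
    """
    d_bg = math.sqrt(sum(b * b for b in bg))
    d_px = math.sqrt(sum(p * p for p in pixel))
    if d_bg == 0.0 or d_px == 0.0:
        return False  # direction is undefined for black pixels
    cos_angle = sum(p * b for p, b in zip(pixel, bg)) / (d_px * d_bg)
    # Clamp to guard against rounding slightly outside [-1, 1].
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    return angle <= theta_max and lam1 * d_bg <= d_px <= lam2 * d_bg
```

A pixel that is brighter than the background fails the distance test, and a pixel with a large chromaticity change fails the angle test, which matches the two rejection conditions given above.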

6.4.2 Physics based shadow detection over super pixels

To visualize the change in pixel values, we plot them in Fig. 6.7. The pixel values were

obtained from a video sequence in which objects occlude a light source. Early shadow

detection methods [156] have assumed that these shadow points lie on the line between

the origin and the illuminated pixel value. However, later approaches have incorporated physics based principles [153, 154, 158] capable of detecting shadows in scenes illuminated by multiple light sources. Features are extracted and parametric / non parametric statistical methods are used to classify pixels.

[Figure 6.6 diagram: RGB color space with axes R, G and B, showing the background pixel value, the shadow pixel value, the angle θmax and the distances dBG, d1 and d2]

Figure 6.6: The shaded volume shown in the RGB color space is considered as shadow pixel values by the weak shadow detector

We use the physics based implementation by Sanin et al. [157] which is based on the

work by Huang et al. [158]. Following the notation used in [157], we denote the vector

joining the shadow pixel and the center of the background cluster as vn. We write the

feature vector fn of the nth pixel in the image as:

$$f_n = [\alpha_n, \theta_n, \Phi_n] \tag{6.1}$$

Here αn denotes the illumination attenuation. θn & Φn represent the angular orienta-

tions of the vector vn in 3D space (in spherical coordinates).

A Gaussian mixture model is learnt for the feature vector of each pixel using an online, Winner-Takes-All version of the EM algorithm. The probability that the pixel value was generated by the shadow cluster c is computed using the GMM and is represented by $p^{phy}_{n,c}$. The probability of the pixel belonging to the shadow region is given by

$$p^{phy}_{n} = \max_{c = 1, \dots, C} p^{phy}_{n,c} \tag{6.2}$$

Here C is the number of shadow clusters.

[Figure 6.7 plot: 3D scatter of pixel values in RGB space (axes R, G and B) with two point sets labelled Background Pixels and Shadow Pixels]

Figure 6.7: Pixel values of a surface plotted from a video sequence. Intermittent foreground object motion causes shadows on the surface.

RONI marking of shadow regions does not require pixel level segmentation. We only

need to perform MB level classification. Shadow detection scores at the pixel level can be

aggregated to determine the probability that a macroblock contains only shadow pixels.

However, instead of assigning labels directly to MB’s, we aggregate shadow scores over

super pixels. Higher level reasoning of ROI, RORI & RONI regions is performed using these

scores computed over super pixels (this is described in Sections 6.9 & 6.10). We find that

the SLIC algorithm bins pixels of the shadow & feet boundary regions into separate super

pixels with good accuracy. Due to this, the number of true shadow pixels inside a shadow

super pixel is high. Hence, the aggregated shadow score of the super pixel provides good


discriminability for higher level inference.

The shadow score pphy for a super pixel is computed in Eqn. 6.3 as the average of the

scores of all its pixels. Here N is the number of pixels in the super pixel. Fig. 6.8b shows

the physics based super pixel scores obtained for a sample image.

$$p^{phy} = \frac{1}{N} \sum_{i=1}^{N} p^{phy}_{i} \tag{6.3}$$
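As an illustration of Eqns. 6.2 and 6.3, the following Python sketch takes the per-cluster shadow probabilities of each pixel, keeps the maximum over clusters and then averages the result over each super pixel. The flat label and score sequences and the function names are assumptions chosen to keep the example short; they do not describe the data structures used in this work.

```python
from collections import defaultdict


def pixel_shadow_score(cluster_probs):
    """Eqn. 6.2: highest probability over the C shadow clusters of a pixel."""
    return max(cluster_probs)


def superpixel_scores(labels, pixel_scores):
    """Eqn. 6.3: mean pixel score of every super pixel.

    labels[i] is the super pixel index of pixel i and pixel_scores[i] is
    its shadow score. Returns a dict mapping super pixel index to mean.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for sp, score in zip(labels, pixel_scores):
        total[sp] += score
        count[sp] += 1
    return {sp: total[sp] / count[sp] for sp in total}
```

The same averaging routine can be reused for any other pixel level score that is aggregated over super pixels.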

6.4.3 Texture based shadow detection

Texture features of surfaces are mostly invariant to shadows. Hence, they can serve as very

useful cues to perform shadow detection. Leone et al. [159] use Gabor features computed

on small image patches. Qin et al. [160] use the scale invariant local ternary pattern as

a texture descriptor. They also use a Markov Random Field to incorporate positive spatial

correlation. Sanin et al. [161, 157] showed that classifying large areas (as shadow / not

shadow) using gradient matching improves performance when rich texture is not present

in the entire background image. A weak shadow detector is used to propose these large

regions. In this thesis, we use the gradient matching based technique. However, we choose

the super pixels as the regions over which we aggregate gradient matching scores. The

fraction of the pixels with similar gradients is considered as the shadow score for the super

pixels.

Let the angle between the gradient vectors of the ith pixel (i is the index of the pixel in the super pixel data structure) in the current image and the background frame be represented by $\theta^{grad}_i$. Let N denote the number of pixels in the super pixel. The probability that the super pixel lies only in the shadow region is denoted by $p^{tex}$ and is computed as follows:

$$p^{tex} = \frac{1}{N_1} \sum_{i=1}^{N} I_i m_i \tag{6.4}$$

Here, $m_i$ is set to 1 if $\theta^{grad}_i$ is less than a threshold and to 0 otherwise. $I_i$ is set to 1 if the gradient magnitude of the pixel in the frame is greater than a threshold and to 0 otherwise. $N_1$ is the number of pixels in the super pixel whose gradient magnitude is greater than the threshold, i.e.

$$N_1 = \sum_{i=1}^{N} I_i \tag{6.5}$$

This gradient based technique works well when the background surface is textured.

When the image does not have texture (i.e. N1 is small), we ignore the texture based scores.

Figure 6.8c shows the texture shadow scores of SP’s. Texture and physics based shadow

features are complementary. Hence, integrating them improves the overall accuracy.
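The texture score of Eqns. 6.4 and 6.5 can be sketched as follows. The sketch assumes that the image gradients of the current frame and of the background frame are already available as (gx, gy) pairs for the pixels of one super pixel. The parameter names `mag_thresh`, `angle_thresh` and `min_textured` are hypothetical, and returning `None` is only one possible way of signalling that the texture score should be ignored when N1 is small.

```python
import math


def texture_shadow_score(grad_cur, grad_bg, mag_thresh, angle_thresh, min_textured=1):
    """Fraction of textured pixels whose gradient direction matches the background.

    grad_cur and grad_bg hold one (gx, gy) pair per pixel of the super pixel.
    Returns None when fewer than min_textured pixels have a strong gradient.
    """
    n1 = 0        # number of pixels with I_i = 1
    matches = 0   # number of pixels with I_i * m_i = 1
    for (cx, cy), (bx, by) in zip(grad_cur, grad_bg):
        mag_c = math.hypot(cx, cy)
        if mag_c <= mag_thresh:
            continue  # I_i = 0: weak gradient in the current frame
        n1 += 1
        mag_b = math.hypot(bx, by)
        if mag_b == 0.0:
            continue  # angle undefined, treated as m_i = 0
        cos_a = (cx * bx + cy * by) / (mag_c * mag_b)
        angle = math.acos(max(-1.0, min(1.0, cos_a)))
        if angle < angle_thresh:
            matches += 1  # m_i = 1
    if n1 < min_textured:
        return None
    return matches / n1
```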


(a) Surveillance video frame

(b) Shadow scores of super pixels obtained using the physics based detector

(c) Shadow scores of super pixels obtained using the texture based detector

Figure 6.8: Shadow scores of super pixels plotted for a surveillance video frame


6.5 Skin detection

The unique appearance of human skin is a very useful cue to improve face detection performance. The blending of the colors of blood and melanin decides the skin tone. As a result, the range of hues is restricted. The appearances of skin can be broadly

categorized based on ethnicity as Asian, African and Caucasian. In [162], Elgammal et al.

provide density plots of skin pixels for these categories and show that they form clusters in

various color spaces. However, color of skin is similar to that of a few naturally occurring

surfaces (e.g. sand). Also, the accuracy of appearance based skin detection depends upon

lighting conditions. Hence, skin based detectors need to be integrated with other inference

algorithms to improve reliability. In Section 6.9 later in this chapter, we will describe the

proposed technique to integrate skin probabilities with other scores. In this section, we will

describe computation of skin probabilities for foreground super pixels in the input image.

Skin detection has been studied extensively by researchers and successfully applied to

perform face detection, objectionable image filtering and gesture recognition tasks. In

[163], Jones et al. proposed a histogram based approach to compute the probability of

skin pixels. They generated a large skin dataset which was used to obtain the histogram.

Greenspan et al. [164] proposed a parametric GMM technique. In [165], Phung et al., after a detailed analysis, showed that the accuracy of the Bayesian classifier based on the histogram technique is higher than that of parametric methods using Gaussian models. They also showed that the accuracy does not vary with different choices of the color spaces.

We choose the histogram based Bayesian classifier to generate pixel level skin probabil-

ities since it is fast and accurate. Also, as we will see later in Section 6.9, the probability

scores that are generated by the Bayesian approach facilitate easier integration with other

detector scores.

Let the probability that a pixel with color ‘c’ belongs to a skin region be represented as p(skin/c).

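$$p(\text{skin}/c) = \frac{p(c/\text{skin})\, p(\text{skin})}{p(c/\text{skin})\, p(\text{skin}) + p(c/\text{nonSkin})\, p(\text{nonSkin})} \tag{6.6}$$

Here p (c/skin), p (c/nonSkin), p (skin) and p (nonSkin) are obtained from the histogram table as follows:

$$p(c/\text{skin}) = \frac{H_S(c)}{T_S} \tag{6.7}$$

$$p(c/\text{nonSkin}) = \frac{H_N(c)}{T_N} \tag{6.8}$$

$$p(\text{skin}) = \frac{T_S}{T_S + T_N} \tag{6.9}$$

$$p(\text{nonSkin}) = \frac{T_N}{T_S + T_N} \tag{6.10}$$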

Here HS (c) is the skin histogram bin count for color c and HN (c) is the non skin histogram bin count for color c. TS & TN represent the sum of all entries of the ‘skin’ & ‘non skin’ histograms respectively.
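Eqns. 6.6 to 6.10 can be checked with a small Python sketch. The histograms are represented here as plain lists of bin counts indexed by a color bin c, which is an assumption made for the example, as is the choice to return 0 for a bin that is empty in both histograms.

```python
def skin_posterior(h_skin, h_nonskin, c):
    """Posterior p(skin/c) from skin and non skin histogram counts.

    h_skin and h_nonskin are sequences of bin counts and c is a bin index.
    """
    t_s = sum(h_skin)       # T_S
    t_n = sum(h_nonskin)    # T_N
    p_c_skin = h_skin[c] / t_s          # Eqn. 6.7
    p_c_non = h_nonskin[c] / t_n        # Eqn. 6.8
    p_skin = t_s / (t_s + t_n)          # Eqn. 6.9
    p_non = t_n / (t_s + t_n)           # Eqn. 6.10
    numerator = p_c_skin * p_skin
    denominator = numerator + p_c_non * p_non
    # Assumed fallback: an unseen color is treated as non skin.
    return numerator / denominator if denominator > 0 else 0.0
```

Substituting Eqns. 6.7 to 6.10 into Eqn. 6.6 shows that the totals cancel, so the posterior reduces to HS(c) / (HS(c) + HN(c)). This is why the whole table can be precomputed once per bin.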

We use the skin probability map provided in the LTI-Lib library [166]. This map was generated using the Compaq Cambridge research lab image-database [163] with 32 bins per color channel. The probability values for each bin are precomputed and stored in memory. Hence, computation of the pixel skin probability only involves indexing into the 32 × 32 × 32 array of floating point numbers. Fig. 6.9 shows the super pixel skin scores obtained on the test image in Fig. 6.8a.

Figure 6.9: Skin scores of super pixels in a surveillance video frame


The super pixel skin score $p^{skin}$ is defined as the average skin probability of all the pixels it contains.

$$p^{skin} = \frac{1}{N} \sum_{i=1}^{N} p(\text{skin}/c_i) \tag{6.11}$$

Here ci represents the color of the ith pixel in the super pixel and N represents the total

number of pixels in the super pixel.
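The table lookup and the averaging of Eqn. 6.11 can be sketched as below. The sketch assumes 8-bit color channels, uniform binning into 32 bins per channel and a nested-list table indexed as table[r][g][b]; the actual memory layout and channel order of the LTI-Lib map are not described here and may differ.

```python
def bin_index(r, g, b, bins=32):
    """Map an 8-bit (r, g, b) color to its bin in a bins x bins x bins table."""
    step = 256 // bins
    return r // step, g // step, b // step


def superpixel_skin_score(pixels, table):
    """Eqn. 6.11: mean of the precomputed skin probabilities of the pixels.

    pixels is a non-empty list of (r, g, b) tuples of one super pixel and
    table[i][j][k] holds the precomputed p(skin/c) of bin (i, j, k).
    """
    total = 0.0
    for r, g, b in pixels:
        i, j, k = bin_index(r, g, b)
        total += table[i][j][k]
    return total / len(pixels)
```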

6.6 Pedestrian detection

Detection of pedestrians based on shape is more robust compared to techniques that only

use facial features. In this section, we will describe the proposed scheme to obtain pedes-

trian part scores. The high level inference engine places requests for these part scores on

foreground regions which do not have tracker associations. The details of the integration of the pedestrian detector with the tracker are discussed later in Section 6.9.

Motivated in part by the large number of important applications for pedestrian de-

tection, many methods have been proposed in literature. Techniques such as the use

of improved features [167, 168, 144, 169, 170], efficient classifier learning algorithms

[171, 172, 169, 170, 173], motion cues [174, 175] and deep learning [176, 177, 178, 179]

have steadily improved detection accuracy over the past decade. A detailed survey of the

state of the art pedestrian detection methods can be found in [180]. In this work, we

adopt a modified version of the DPM technique. DPM uses a part based model and hence

allows higher level inference techniques to perform occlusion reasoning. Also, it models deformations, which improves the accuracy of pedestrian detection. DPM was one of the best performing techniques before the emergence of deep learning based detectors. Recent

techniques of integrating deformation models into deep architectures (e.g. Deep ID net by

Ouyang et al. [178]) have considerably improved detection accuracy. However, the compu-

tational complexity of deep architectures is considerably higher than that of DPM. Reducing

complexity of these techniques is a very active area of research. Deep learning based de-

tectors can be integrated into the proposed ROI encoder when they become feasible (on

embedded platforms) in the future.


6.6.1 DPM based pedestrian detection: A brief review

The DPM technique models part positions as latent variables. The unknown part locations

are learnt using a supervised latent-SVM framework. Please refer to [144] for details about

latent-SVM based model training. Here, we will only briefly sketch details of using a trained

model to detect pedestrians. We use the notation of [144] with a few modifications.

DPM uses the sliding window technique where the base detector operates on an image

window which is scanned across the entire frame. Detection at multiple scales is performed

by rescaling the input image. Detection of a pedestrian in a window involves two stages:

(I) Feature computation and (II) Score computation.

(I) HOG feature generation:

The HOG features used in [144] are conceptually similar to the original method proposed in [168]. The image is partitioned into square cells of edge length equal to 8 pixels. A set of 2 × 2 adjacent cells is called a block. Fig. 6.10 shows this division of the image into cells and blocks.

[Figure 6.10 diagram: the image is divided into cells (8x8 pixels) and blocks (2x2 cells) with cell histograms h1, hi and hi+9; pixels in the yellow region of the center cell vote (weighted voting) into the cell histograms of the block; each cell yields a contrast insensitive histogram and a contrast sensitive histogram]

Figure 6.10: HOG feature computation

The 32 dimensional feature vector is composed of:

• 18 contrast sensitive histogram bins


• 9 contrast insensitive histogram bins

• 4 values capturing the overall gradient energy in the four blocks containing the cell

• 1 placeholder entry

Gradients of pixels are computed and they vote into histogram bins of neighbouring

cells, i.e. in Fig. 6.10, the pixels inside the yellow region vote into four cell histograms:

(I) The cell it belongs to and (II) The three adjacent cells of the yellow region. Gradient

energy of each block that contains the cell is also stored separately as 4 entries in the 32

dimensional HOG feature vector. These 32 dimensional features for the cells are computed

at multiple scales to form a multi scale feature map which we denote by H. In the original

DPM implementation, the dimensionality of the features was reduced using PCA. This is

particularly useful when a large number of classes need to be tested, i.e. when the filter

score computation cost is very large. We found that we do not require this in our application

since we only detect pedestrians.

(II) Score computation:

Pedestrians are detected by measuring the response of filters applied on the feature

maps. Felzenszwalb et al. use two sets of filters:

• Root filters: Root filters model the overall shape of the pedestrian. They operate

on a rectangular window of the 32 dimensional feature vectors in the feature map.

Let F0 denote the concatenated root filter vector (concatenated in row major order).

Response of the root filter at position (x, y) in pyramid level l is given by:

$$R_0(x, y, l) = F_0 \cdot \phi(H, (x, y, l)) \tag{6.12}$$

Here, φ(H, (x, y, l)) is the vector obtained by concatenating feature vectors of the

rectangular window (in row major order) in the feature map with top left corner at

(x, y, l).


• Part filters: Part filters capture the more detailed shapes of individual parts of the

pedestrian image. They are computed over the feature map at a higher resolution to

capture finer details. Response of the ith part filter at position (x, y) in pyramid level

l is given by:

$$R_i(x, y, l) = F_i \cdot \phi(H, (x, y, l)) \tag{6.13}$$

Along with part appearance modelled by the filters, DPM also takes into account the

feasible arrangements of parts of the pedestrian image. This geometric arrangement

is specified in the model using anchor locations of parts with respect to the root filter.

The anchor position for the ith part relative to the root position is denoted by the

vector vi = (vi,x, vi,y). DPM also allows the parts to be displaced from the anchor

positions. The cost of a deformation equal to (dx, dy) is denoted by φd(dx, dy) where

$$\phi_d(dx, dy) = (dx, dy, dx^2, dy^2) \tag{6.14}$$

A local search around the anchor position for the best location of the part is per-

formed. The cost of the part with index i is given as follows:

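$$D_i(x, y, l) = \max_{dx, dy} \left[ R_i(x + dx, y + dy, l) - d_i \cdot \phi_d(dx, dy) \right] \tag{6.15}$$

Here $d_i$ is the vector of deformation coefficients of the ith part [144].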

To model different poses, three components of root filters and the corresponding sets

of part filters are obtained using a latent SVM training framework [144]. Fig. 6.11

shows the part filters of the three components. The vertically mirrored counterparts

of these components are also included in the final model.

The total score of the pedestrian hypothesis at position (x,y,l) is given by the sum of the

root and part filter responses as

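$$\text{score}(x, y, l) = R_0(x, y, l) + \sum_{i=1}^{n} D_i(2x + v_{i,x}, 2y + v_{i,y}, l - \lambda) + b \tag{6.16}$$

Here n is the number of part filters, b is a bias term and λ is the number of pyramid levels per octave, so that the part filters are evaluated at twice the resolution of the root filter [144].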


(a) (b) (c)

Figure 6.11: DPM part filters

The score is thresholded to obtain the final set of detections in the image.
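The score computation of Eqns. 6.12 to 6.16 can be illustrated with a toy Python sketch that operates on precomputed response maps. It is a simplified illustration and not the detector used in this work: the deformation search is a brute-force scan over an assumed window of `radius` cells (the original DPM implementation uses distance transforms for this step), response maps are plain nested lists indexed as resp[y][x], and all function names are hypothetical.

```python
def filter_response(filt, window):
    """Eqns. 6.12 and 6.13: dot product of a flattened filter and window."""
    return sum(f * w for f, w in zip(filt, window))


def deformed_part_score(resp, x, y, d, radius):
    """Eqn. 6.15: best part response near (x, y) minus the deformation cost.

    resp[y][x] is the response map of one part at one pyramid level and
    d = (d1, d2, d3, d4) weights the features (dx, dy, dx^2, dy^2).
    Returns -inf if no position of the search window lies inside the map.
    """
    height, width = len(resp), len(resp[0])
    best = float("-inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            if 0 <= px < width and 0 <= py < height:
                cost = d[0] * dx + d[1] * dy + d[2] * dx * dx + d[3] * dy * dy
                best = max(best, resp[py][px] - cost)
    return best


def hypothesis_score(root_resp, part_resps, anchors, defs, x, y, bias, radius=2):
    """Eqn. 6.16 for a root placed at (x, y).

    part_resps are response maps at twice the root resolution, anchors are
    the (v_x, v_y) offsets and defs the deformation coefficients per part.
    """
    total = root_resp[y][x] + bias
    for resp, (vx, vy), d in zip(part_resps, anchors, defs):
        total += deformed_part_score(resp, 2 * x + vx, 2 * y + vy, d, radius)
    return total
```

In this sketch a part that is displaced by one cell from its anchor still contributes its response, reduced by the quadratic deformation penalty.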

6.6.2 Proposed modifications to DPM

To reduce computational cost, HOG features are computed on only the foreground blocks.

Also, the detector sliding window is placed only over the foreground blobs. Windows that

do not have sufficient foreground support are ignored. High scoring detections are marked as hypotheses for further inference (described later in Section 6.9).

Foreground edge enhancement:

We use the foreground blob edges to improve the performance of the HOG classifier. Let

the contour pixel coordinates of the blob be (x0, y0), (x1, y1), . . . , (xN−1, yN−1). The tangent

vector of the ith pixel in the contour is computed as (xi+L − xi−L, yi+L − yi−L) (modular

arithmetic is used on indices here, i.e. x0−1 = xN−1). The gradient vector is perpendicular

to the tangent. We skip (2L − 1) pixels in the contour sequence to reduce the effects of

noise. Fig. 6.12 shows the gradient vector for a contour with L = 2. Similar to image

gradients, the edge based gradient vectors also vote into the cell histograms. We found that

edge enhancement significantly improves detector performance.
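The contour based gradient computation can be sketched as follows. The sketch processes every 2L-th contour pixel (i.e. it skips 2L − 1 pixels), computes the tangent from the neighbours at offsets +L and −L with modular indexing and rotates it by 90 degrees. The sign of the rotation, and therefore whether the vector points into or out of the blob, depends on the traversal direction of the contour and is an assumption of this example.

```python
def contour_gradients(contour, L=2):
    """Gradient directions along a closed blob contour.

    contour is an ordered list of (x, y) pixel coordinates. Returns a list
    of ((x, y), (gx, gy)) pairs for every 2L-th contour pixel, where
    (gx, gy) is perpendicular to the local tangent.
    """
    n = len(contour)
    result = []
    for i in range(0, n, 2 * L):
        x_fwd, y_fwd = contour[(i + L) % n]
        x_bwd, y_bwd = contour[(i - L) % n]
        tx, ty = x_fwd - x_bwd, y_fwd - y_bwd   # tangent vector
        result.append((contour[i], (ty, -tx)))  # rotate by 90 degrees
    return result
```

Each returned vector would then vote into the HOG cell histogram of its pixel in the same way as an image gradient.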


[Figure 6.12 diagram: foreground blob with its contour pixels, the pixel under consideration, the offsets dx and dy and the resulting gradient vector]

Figure 6.12: Edge enhancement of blob boundaries

Cascade:

Instead of considering each sliding window as a candidate for part based inference, we

reduce the number of hypotheses by using a two stage cascade. In [181], Felzenszwalb et

al. designed a DPM cascade for object detection. However, we use part filter scores to later

perform occlusion reasoning. Hence, we do not adopt the full cascade of [181]. Instead,

we use a two stage cascade shown in Fig. 6.13. The first stage of the cascade is based on

the most significant filter in the ordering determined by [181]. The score of the first stage

is given as follows:

$$R(x, y, l) = F \cdot \phi(H, (x, y, l)) \tag{6.17}$$

Through visual inspection, we group the filters as ‘left head shoulder’, ‘right head shoul-

der’, ‘torso’ and ‘legs’ parts. We note that this grouping is only an approximate representa-

tion, i.e. the ‘left head shoulder’ filter group could also respond to gradients in the entire

head image. The part scores are determined using the deformation search procedure de-

scribed earlier in Eqn. 6.15. We repeat Eqn. 6.15 here for convenience.

$$D_i(x, y, l) = \max_{dx, dy} \left[ R_i(x + dx, y + dy, l) - d_i \cdot \phi_d(dx, dy) \right] \tag{6.18}$$

The scores of filter group parts can be written as:

$$s_{part} = \sum_{i \in G_{part}} D_i(x, y, l) \tag{6.19}$$


Here, ‘part’ can refer to (I) left head shoulder (II) right head shoulder (III) torso or (IV) legs. Gpart refers to the set of filters in a part.

[Figure 6.13 block diagram: cascade stages between the input image and the output scores; the block labels are not recoverable]

Figure 6.13: Proposed DPM cascade for pedestrian detection

The second stage is based on the responses of the filters that model the shape of the head and shoulder regions, i.e. the left head shoulder & right head shoulder part scores. Hypotheses whose part scores are less than a threshold are rejected. All the part scores are computed for hypotheses that pass through both stages. Platt scaling [182] of these scores is performed to obtain probability estimates p(part) as follows:

$$p(part) = \frac{1}{1 + e^{A s_{part} + B}} \tag{6.20}$$

Here, A & B are constants determined using the algorithm proposed by Lin et al. in

[182]. The ‘Head’ bounding box is determined using regression based on head, shoulder

and torso filter locations. Fig. 6.14 shows the part filter locations and the bounding box of

the head region for a pedestrian.
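The two stage cascade together with Eqns. 6.19 and 6.20 can be summarised in a short Python sketch. Several details are assumptions made for illustration: the group names, the thresholds `t1` and `t2`, the rule that both head shoulder group scores must reach `t2`, and the fact that all part scores are passed in at once (an actual cascade would compute the remaining part scores only after both stages are passed).

```python
import math


def platt_probability(score, A, B):
    """Eqn. 6.20: map a part score to a probability estimate."""
    return 1.0 / (1.0 + math.exp(A * score + B))


def group_score(part_scores, group):
    """Eqn. 6.19: sum of the deformed part scores D_i of one filter group."""
    return sum(part_scores[i] for i in group)


def run_cascade(first_stage_score, part_scores, groups, t1, t2, platt):
    """Return part probabilities of a hypothesis, or None if it is rejected.

    groups maps a part name to the indices of its filters and platt maps a
    part name to its (A, B) constants.
    """
    if first_stage_score < t1:
        return None  # stage 1: response of the most significant filter
    for name in ("left_head_shoulder", "right_head_shoulder"):
        if group_score(part_scores, groups[name]) < t2:
            return None  # stage 2: head shoulder evidence too weak
    return {
        name: platt_probability(group_score(part_scores, idx), *platt[name])
        for name, idx in groups.items()
    }
```

With a negative value of A, a larger part score maps to a larger probability, which is the usual outcome of fitting the sigmoid to detector scores.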


Figure 6.14: Sample result of DPM cascade

6.7 Geometry

Fig. 6.16 shows a few sample surveillance video snapshots. Pictorial representations of

pedestrian hypotheses are also overlaid on the images. We can clearly observe that the

hypothesis overlaid on the image in Fig. 6.16c is infeasible. Such infeasible candidates

can be rejected by considering constraints imposed by the ground planes. This reduces the

computational cost and also improves the accuracy of the pedestrian detector. For a given

pivot location of the bounding box in the image, we need to determine the set of discrete

scales (denoted by S) for which we need to run the DPM detector. If the ground is a planar

surface, then the set of scales for a pivot S would be equal to {Smin, . . . , Smax}. Here, Smax

depends on the geometry of the scene (camera tilt & height, ground plane), the optics of

the imaging system and the physical size of the pedestrian. Smin is set equal to 1 (i.e. index

corresponding to the unscaled image).

However, if there are surfaces at different elevations, we will need to specify multiple


ranges, i.e. one for each elevation. This can be seen clearly in Fig. 6.16 where for a given

pivot position in the video frame, the pedestrians in Fig. 6.16a & Fig. 6.16b are supported

by two different ground planes. Fig. 6.15 shows that the ground plane at an elevated position will require inclusion of scales that subtend angles in the range $[\theta^{min}_2, \theta^{max}_2]$. The set S can be determined using the camera intrinsic & extrinsic matrices and the parameters of the ground planes.

[Figure 6.15 diagram: camera C viewing ground planes at two elevations; labels: tallest pedestrian (6.5 feet), image of human maps to smallest pedestrian DPM filter, angles θ1max, θ2min and θ2max]

Figure 6.15: Geometry of the surveillance camera system showing ground planes at different elevations.

Automated camera calibration and ground plane estimation based on vanishing point

estimation have been studied by many researchers [183, 184, 185]. Sudowe et al. [185]

showed that the ground plane homography and normal vector projection are sufficient to

determine the set S. The intrinsic parameters of the camera are assumed to be known.

Commercial video analytics developers also have created simple calibration tools [186]

that help the user to calibrate the camera during setup. In this work, we divide the image

into non overlapping blocks of size equal to 32× 32 pixels. The set of scales for each block

is assumed to be known (determined using any of the techniques that we reviewed here).

Since the geometry of the scene is static, we store the set of scales S for each 32× 32 block


in a table which needs to be updated only when the camera location is changed. We run

the DPM detector at scales specified by the set S. It has to be noted however that only

bounding boxes inside foreground blob regions are considered as valid hypotheses. Hence,

a large number of scales would have been eliminated based on foreground segmentation.

The geometry based scale selection is useful only when there are large regions of foreground objects.

(a) Feasible hypothesis (feasible pivot)

(b) Feasible hypothesis of pedestrian at a higher ground plane (feasible pivot)

(c) Infeasible hypothesis (infeasible pivot)

Figure 6.16: Sample surveillance video snapshots showing feasible and infeasible pedestrian hypotheses
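The per-block scale table described in Section 6.7 can be sketched as a simple lookup structure. In the sketch below, `scales_for_block` is a hypothetical callback that stands in for whichever calibration technique supplies the feasible scales of a block; the function names and the use of frozensets are assumptions of this example.

```python
BLOCK_SIZE = 32  # non overlapping 32 x 32 pixel blocks


def build_scale_table(width, height, scales_for_block):
    """Precompute the set of feasible DPM scales S for every block.

    scales_for_block(bx, by) returns an iterable of scale indices for the
    block in column bx and row by. The table only needs to be rebuilt when
    the camera is moved.
    """
    cols = (width + BLOCK_SIZE - 1) // BLOCK_SIZE
    rows = (height + BLOCK_SIZE - 1) // BLOCK_SIZE
    return [
        [frozenset(scales_for_block(bx, by)) for bx in range(cols)]
        for by in range(rows)
    ]


def scales_at(table, x, y):
    """Return the set S for the block that contains the pivot pixel (x, y)."""
    return table[y // BLOCK_SIZE][x // BLOCK_SIZE]
```

At detection time, the DPM detector would be run for a pivot only at the scales returned by `scales_at`.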


6.8 Detection by Tracking

The accuracy of pedestrian detection from a single image has been steadily increasing over the past decade. However, significant improvement in performance can be obtained if the temporal associations across multiple frames are also exploited. This is particularly useful

in real world surveillance scenarios where state-of-the-art detectors (including DPM) fail to

detect some images of the same pedestrian in a video sequence. This is due to variations

in pose, occlusion and lighting as the pedestrian moves in the scene. A large number of

researchers have proposed tracker algorithms. An exhaustive survey of recent techniques

has been provided by Smeulders et al. in [187].

In this thesis, we use a tracker to make associations of the pedestrian images across

scenes. Detections obtained using the DPM part scores are stored in the state for tracking in

future frames. We find that this significantly reduces the miss rate. Also, as we discuss later,

the computational cost of the DPM detector is high. To reduce this cost, we avoid running

the DPM detector over image regions which are supported by tracked pedestrian detections.

Pedestrian detections present in the state are initially associated with image regions in the

current frame by the tracker. Regions which are not supported by tracked pedestrians are

considered as candidates for DPM based detection.
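The selection of regions for DPM based detection can be sketched as a simple overlap test. The notion of a region being ‘supported’ by a tracked pedestrian is interpreted here as a minimum fraction of the region area being covered by a single tracked bounding box; this interpretation, the parameter `min_support` and the box format (x0, y0, x1, y1) with exclusive upper bounds are assumptions of the example.

```python
def overlap_fraction(region, box):
    """Fraction of the area of region that is covered by box."""
    ix = max(0, min(region[2], box[2]) - max(region[0], box[0]))
    iy = max(0, min(region[3], box[3]) - max(region[1], box[1]))
    area = (region[2] - region[0]) * (region[3] - region[1])
    return (ix * iy) / area if area > 0 else 0.0


def dpm_candidates(fg_regions, tracked_boxes, min_support=0.5):
    """Foreground regions that no tracked pedestrian supports.

    Only these regions are passed on to the (expensive) DPM detector.
    """
    return [
        region
        for region in fg_regions
        if all(overlap_fraction(region, box) < min_support for box in tracked_boxes)
    ]
```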

6.8.1 Components of a tracker

The key components of a tracker typically include:

• Appearance model: Invariance of certain features in the image of the object across

frames is the key component of a tracker. Hence, a lot of attention has been given

to develop such invariant image representations. These image representations are

constructed and stored when the tracker is initialized using a detector. The appear-

ance models can be computed over different image structures such as blobs [188],

contours [189], patches [190] or super pixels [191]. Various features such as raw

intensity values, color histograms, HOG, 2D binary patterns, haar wavelets, SIFT &

SURF have been used as visual cues for tracking.


• Target search: To determine the position of objects in a new frame, a search using the appearance model is performed to determine the best match. A few popular techniques such as the Lucas-Kanade tracker [192, 193] and the mean shift tracker [194] pose the target search problem as an optimization task which is solved using gradient descent methods. Uniform search around the location of the object in the previous frame is also a popular technique (e.g. fragtrack [190]). A motion model based on the Kalman filter is also commonly used to reduce the search space [195, 196]. Due to scene clutter, occlusion & heavy tailed noise, real world tracking problems exhibit non-Gaussian and multi-modal posterior and filtering distributions. Particle filtering techniques have been adopted to handle this [197].

• Appearance model update: The appearance of objects changes over the video se-

quence due to variation in scale, pose, lighting and viewpoint. Hence, trackers update

the appearance model to avoid drift. The MIL (multiple instance learning) tracker by

Babenko et al. [198] updates the model with a bag of image patches. The ensemble

tracker by Avidan [199] uses a set of weak classifiers which is updated in an online

fashion.

6.8.2 FG blob based tracking

In the case of isolated pedestrians, we use the foreground blob geometry to determine the

tracked bounding box locations. The blob geometry based tracker has very low complexity

and accurately tracks isolated pedestrians. A Kalman filter is initialized for each pedestrian.

The Kalman filter prediction is used to mark a search region over the head shoulder image.

This search region is shown in Fig. 6.17a. The y coordinate of the top of the target ‘head

rectangle’ is determined by vertically scanning the FG blob for the presence of FG pixels.

The scanning procedure is performed in a top-down fashion and it terminates when n con-

secutive pixels are found in a row. We set n to 3. To determine the left and right bounds, we

use rectangular window filters similar to those used by Viola & Jones. Fig. 6.17b shows the

positive and negative filters applied on a sample FG blob. Let I+ & I− denote the number


of FG pixels in the positive & negative filter respectively. To determine the x coordinate of

the left edge of the ‘head rectangle’, the filters are moved over the FG image towards the

right. The x coordinate for which the difference I+ − λI− is maximum is considered as

the left boundary of the ‘head rectangle’. Here, λ is set to 10 to heavily penalize FG pixels

inside the negative filter. A similar procedure is followed on the right side of the pedestrian

search rectangle. I+ & I− are computed using the integral image of the FG mask. A fixed

aspect ratio of the head is used to determine the lower bound of the ‘head rectangle’. The

displacement vector of the ‘head rectangle’ is applied on the pedestrian bounding box to

mark the pedestrian in the frame. The percentage change in the width of the ‘head rectan-

gle’ is used by the inference engine to determine tracking errors. The difference between

the Kalman predicted displacement and that computed by the blob tracker is also used to

indicate errors.
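The head localization steps above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the filter geometry (a negative window immediately to the left of the candidate x and a positive window starting at x), the function names and the scan over the full mask width are assumptions made for the example. Only n = 3, λ = 10 and the use of an integral image come from the text.

```python
def integral_image(mask):
    """Summed-area table of a binary FG mask, with an extra zero row and column."""
    h, w = len(mask), len(mask[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += mask[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_count(ii, x0, y0, x1, y1):
    """Number of FG pixels in the half-open box [x0, x1) x [y0, y1)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def head_top(mask, n=3):
    """Top-down scan: first row that contains n consecutive FG pixels."""
    for y, row in enumerate(mask):
        run = 0
        for v in row:
            run = run + 1 if v else 0
            if run >= n:
                return y
    return None


def head_left_edge(mask, top, filt_w, filt_h, lam=10):
    """x that maximizes I+ - lam * I- (assumed layout: I- left of x, I+ from x)."""
    ii = integral_image(mask)
    h, w = len(mask), len(mask[0])
    y1 = min(top + filt_h, h)
    best_x, best_score = None, None
    for x in range(filt_w, w - filt_w + 1):
        i_pos = box_count(ii, x, top, x + filt_w, y1)
        i_neg = box_count(ii, x - filt_w, top, x, y1)
        score = i_pos - lam * i_neg
        if best_score is None or score > best_score:
            best_x, best_score = x, score
    return best_x
```

The right edge would follow from the mirrored filter pair, and the lower bound from the fixed head aspect ratio.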

[Figure 6.17 panel content: (a) Search region, with the search region and the target ‘head’ bounding box marked; (b) Positive (+) and negative (−) filters]

Figure 6.17: (a) Search region is initialized using the Kalman filter prediction. (b) Positive

and negative filters are applied on the FG blob to determine the left and right bounds of the

head region.

6.8.3 Optic flow based tracker

In the case of pedestrians whose bounding boxes overlap or are close to each other, we

cannot track them using only the blob geometry. Hence, we use a Lucas-Kanade optic

flow based tracker. Since the optic flow computation considers a patch around the tracked

positions, it can successfully handle image noise. Also, consecutive images of pedestrians in


surveillance videos typically satisfy the requirements of brightness constancy and constant

flow. The cost of computing Lucas-Kanade optic flow vectors is lower than the cost of

detecting and matching expensive feature points such as SIFT. We do not use multi modal

particle filtering since we do not need to maintain identity of pedestrians during severe

occlusion. When the pedestrian exits from the occlusion region, the detector would initialize

a new tracker. Also, we do not update the template model of the tracked pedestrian. Hence

the computational complexity of the proposed tracker is low.

Tracker initialization: Detections obtained by the DPM detector are used to initialize the

tracker. The tracker model of the pedestrian is composed of the image of the pedestrian and

its associated super pixels. The image pyramid required for optical flow is also computed

and stored. Uniformly sampled points inside the pedestrian bounding box are included

in the model. We denote this set of sampled points as S. The points are chosen only

in the regions that are not occluded by other pedestrians or objects. A Kalman filter is

also initialized using the first detection and its bounding box association in the consecutive

frame.

Target search:

We use a modified version of the median flow tracker (proposed by Kalal et al. in [200])

to associate each pedestrian image with its template. The iterative Lucas-Kanade algorithm with

four levels of image pyramid is used to detect optical flow of the points sampled in the

template. The Kalman predicted correspondence vectors are used as initial solutions for the

Lucas-Kanade algorithm. The normalized correlation coefficient (NCC) is computed on

image patches centered on the sampled points. The points are arranged in the increasing

order of their NCC values. Points in the lower half of this ordered set, i.e. points with

low NCC values are discarded. Fig. 6.19 shows the correspondences between points in the

template and the current image. Points marked in red are those that were discarded based

on the NCC score. Here, the current image and the template are spaced apart in time by 10

frames.
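The NCC based point filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the zero-mean form of the normalized correlation coefficient is used, patches are passed as flat lists of intensities, and the function names are hypothetical.

```python
import math


def ncc(patch_a, patch_b):
    """Zero-mean normalized correlation coefficient of two equal-size patches."""
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    return sum(x * y for x, y in zip(da, db)) / denom


def keep_upper_half(points, scores):
    """Sort points by NCC and discard the lower half of the ordered set."""
    order = sorted(range(len(points)), key=lambda i: scores[i])
    kept = order[len(order) // 2:]
    return [points[i] for i in sorted(kept)]
```

With an odd number of points, this sketch keeps the larger half; the text does not specify the tie-breaking.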

Five overlapping part templates T1, T2, T3, T4 & T5 are defined as shown in Fig. 6.18.

Let the set of sampled points inside the part-template Tp be represented by Sp. Each part


template Tp is associated with a correspondence vector vp. vp is computed as the median

of the optic flow vectors of the points in the set Sp. The median is computed independently

in both the x & y dimensions. The likelihood of a part template is obtained by combining

feature matching scores of tracked points inside the part. We use NCC values and super pixel

histogram difference as the two sets of matching scores. The average of the NCC values of

points in the set Sp is represented by p_p^NCC. Here, p_p^NCC represents the similarity between

the part-template images in the model and the current frame. Occlusion of a part-template

causes the NCC value of the part to reduce. This can be observed in Fig. 6.19b where the

score for the right-upper-body part template is low (in comparison with other scores).

(a) Left-upper-body (b) Right-upper-body (c) Head-shoulder (d) Torso (e) Upper-body

Figure 6.18: The five part-templates are shown here. Feature matching scores are accu-

mulated over these part-templates. Correspondence vectors are computed for each part-

template.

Let SP_i^temp be the super pixel (or SP) in the template associated with the sampled pixel

s_i. Similarly, let SP_i^curr be the super pixel in the current frame associated with the sampled

pixel s_i. The color histogram difference between SP_i^temp and SP_i^curr is computed for all SPs

associated with pixels in a part template. The average value of these histogram differences is

considered as the second feature matching score.
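The per-part correspondence vector and NCC score described above can be sketched as follows. This is a minimal illustration: the dictionary based data layout and the function name are assumptions, and the super pixel histogram score is omitted.

```python
from statistics import mean, median


def part_correspondence(flows, ncc_scores, part_points):
    """Median flow (per axis) and mean NCC for each part-template.

    flows:       {point_id: (dx, dy)} for points that survived NCC filtering
    ncc_scores:  {point_id: ncc value}
    part_points: {part_name: [point_id, ...]}; parts may overlap
    """
    result = {}
    for part, ids in part_points.items():
        ids = [i for i in ids if i in flows]
        if not ids:
            result[part] = None  # no tracked points left inside this part
            continue
        vx = median(flows[i][0] for i in ids)  # x and y medians are independent
        vy = median(flows[i][1] for i in ids)
        p_ncc = mean(ncc_scores[i] for i in ids)
        result[part] = ((vx, vy), p_ncc)
    return result
```

Taking the median per axis makes each part vector robust to a few outlier flow vectors, for instance points that land on an occluding pedestrian.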

Model update:

As we show in the next section, the inferencing procedure uses the NCC and SP based

feature matching scores of the part templates to select one vector v from the set of template

correspondence vectors vi. It could also reject all the vectors if the track is lost. If the

tracking is considered as successful, the DPM detector scores of the pedestrian image are


not computed. The model, i.e. the template, super pixel data and the sampled points are

not updated for successfully tracked pedestrians. However, to prevent tracker drift due to

scale and appearance change, the DPM detector is executed after 10 frames (even if the

feature matching scores indicate successful tracking). After associating the detection with

the tracked pedestrian, the existing model of the tracked pedestrian is discarded and the

new model is computed.
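The detector scheduling policy described above can be sketched as follows. This is a minimal illustration: the function and argument names are hypothetical, and only the 10 frame refresh interval comes from the text.

```python
REFRESH_INTERVAL = 10  # frames between forced DPM runs (value from the text)


def run_detector(track_ok, frames_since_detection, interval=REFRESH_INTERVAL):
    """Return True when the DPM detector should run for this pedestrian."""
    if not track_ok:
        # All correspondence vectors were rejected: the track is lost.
        return True
    # Periodic refresh limits drift due to scale and appearance change.
    return frames_since_detection >= interval
```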

We note that in this current application of pedestrian detection to ROI video coding,

identity switches do not affect the performance of the system. Hence, we do not attempt

to obtain accurate pedestrian correspondences between pedestrian images in the video se-

quence.


(a)

(b)

Figure 6.19: The template and the current frames (separated in time by 10 frames) are

shown. NCC scores of the five part-templates for the image in (a) are (0.77, 0.9, 0.87, 0.8,

0.81). The order of the scores is (left-upper-body, right-upper-body, head-shoulder, torso,

upper-body). NCC scores for the image in (b) are (0.93, 0.69, 0.88, 0.8, 0.83). Here, the score

of the right-upper-body template is lower due to occlusions.


6.9 Inference

Early work by John McCarthy and others attempted to use logic to solve artificial intelli-

gence tasks. However, logic based systems could not model the uncertainties of the real

world. Following the seminal book on ‘Probabilistic Reasoning in Intelligent Systems’ by

Judea Pearl, Bayesian methods were developed and were very successfully applied to solve

AI problems. Probability could conveniently handle uncertainties, which logic was incapable

of handling. To overcome the limitations of logic, symbolic approaches have been extended to

incorporate uncertainties.

Statistical relational learning (SRL) is one such example which addresses issues of rep-

resentation, inference and learning. SRL allows statistical analysis over a set of relations.

Complex relations are modelled using first order logic. For example, in Markov logic net-

works, the logic network serves as a template representing the relations. When the formulas

are grounded to form a Markov network, the distribution over the probabilities is defined

by the weights of the links. In [201], Antanas et al. apply SRL for hierarchical image under-

standing. They define a language that consists of (I) Visual entities e.g. window (II) Spatial

relations between visual entities, (III) Composite units that consist of a set of visual entities

(IV) Membership relations between visual and composite entities. Composite entity

selection is formulated as a maximum weighted independent set problem. They successfully

recognize higher-level structures in street view images.

Another technique to combine probability and logic was introduced by Ginsberg in

[202]. Algebraic structures called bilattices were defined and used to perform inference

under uncertainty. In this thesis, we use a bilattice based inferencing technique

to perform ROI detection. Bilattice logic based reasoning allows contradictory data.

For example, the shadow detector could wrongly assign a high score to a super pixel but

the tracker contradicts the hypothesis based on the pedestrian bounding box. The final in-

ference scores can be efficiently computed using the set of logic rules. Also, Bilattice logic

reasoning allows us to combine inference relations with rules specified by the surveillance

operator. When extended to activity recognition, Bilattice logic reasoning can be used to


mark non-face image regions as ROI. For example, when luggage is left unattended, the MB’s over the

object can be marked as ROI. Another example of a dangerous activity is when a car enters

a lane in the wrong direction. All MB’s over the car image can be encoded at high quality.

We provide a review of the bilattice logic approach in Appendix C.

6.9.1 Bilattice logic for ROI, RORI & RONI super pixel inference

The goal of the proposed region-of-interest encoder is to classify MB’s as ROI, RORI or

RONI. However, super pixels provide better representations compared to macroblocks since

they preserve natural boundaries. This property of super pixels has made them a popular

choice for segmentation applications like in [203]. In this section, we describe the proposed

inferencing technique to classify super pixels as ROI, RORI or RONI. In the next section, we

use the super pixel class labels to determine the coding mode and the QP parameter of the

macroblocks. The proposed super pixel labelling task is a three class classification problem

which we solve in a sequential manner. We first use the DPM part scores and the tracker

scores to infer pedestrian bounding boxes. Super pixels inside the ‘head’ regions in these

bounding boxes are marked as ROI. The detected & tracked pedestrian results are combined

with the skin & shadow scores to classify the remaining super pixels as RORI or RONI.

• Pedestrian bounding box inference: Shet et al. [173] proposed a partitioning of

object pattern grammar specifications into component based, geometry based and

context based rules. Following [173], we also adopt a similar design procedure.

However, we also include reasoning of shadows, skin detector and tracker outputs.

We now illustrate the reasoning process using a few representative rules.

For an isolated pedestrian, i.e. a pedestrian image which does not have any neigh-

bours, the filter scores are used to infer the score of the hypotheses. We show sample

rules for two filter scores and for the foreground support here.


φ(ped(X, Y, S) ← head_left(X, Y, S)) (6.21)

φ(ped(X, Y, S) ← torso(X, Y, S)) (6.22)

φ(ped(X, Y, S) ← FG_support(X, Y, S)) (6.23)

Here, ped(X,Y, S) denotes the existence of a pedestrian at position (X,Y) in the image.

S represents the scale of the pedestrian. These rules are combined with facts to

perform reasoning. Fig. 6.20 shows a representative example in which rules are

combined with the facts to obtain the final score of ped(X,Y, S). Here, the facts

represent the detector scores, for example, torso(X,Y, S) is equal to 〈p(torso), 1 −

p(torso)〉 where p(torso) has been obtained earlier using the DPM detector. torso(X,Y, S)

represents the torso part score computed at the anchor position associated with the

pedestrian sliding window at location (X,Y,S) in the image pyramid.

Along with the rules that validate the hypothesis, we also add terms that negate it.

For example, a low head part filter score will result in the rejection of the hypothesis.

The corresponding rule for this is

φ(¬ped(X, Y, S) ← ¬head_left(X, Y, S)) (6.24)

Geometry inconsistency rules are not required for DPM detected pedestrians since

they are applied before computing the part scores. Pedestrian hypotheses which are

not isolated require occlusion reasoning as shown in Eqn. 6.25. Here, the occlusion

term is computed as the overlap between the bounding boxes of the parts. Circular

dependencies are avoided by performing the inference of pedestrians in decreasing

order of their Y values (The top-left corner of the image is assumed to be the origin).

φ(ped(X, Y, S) ← not(head_left(X, Y, S)), head_left_occluded(X, Y, S)) (6.25)


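[Figure 6.20 content: the bilattice square with corners <0,0>, <1,0>, <0,1> and <1,1>, axes labelled k and t, the belief axis, and point labels t, f, p and q. Facts plotted: left head part score = <0.8, 0.2>; super pixel skin score = <0.65, 0.35>]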

Figure 6.20: Figure shows detector scores of a pedestrian on the Bilattice square

Fig. 6.20 shows a representative example of the reasoning procedure. Here, left-head

part score and super pixel skin score are shown for a pedestrian image. Let q represent

the hypothesis (i.e. ped(X,Y, S) where X, Y and S correspond to the location & scale

of the pedestrian shown in Fig. 6.20). Let us assume that the weight of the rule for the

left-head DPM score that entails q is 〈0.9, 0.1〉. Let the weight of the rule that indicates

the absence of the left-head part (i.e. φ(¬ped(X, Y, S) ← ¬head_left(X, Y, S))) also

be equal to 〈0.9, 0.1〉. Also, let the weight of the rule for the skin score be 〈0.7, 0.3〉. We

now show how these rules are combined using the logic rules to perform reasoning.

The contribution of these rules that entail q can be computed as

〈0, 0〉 ∨ [〈0.8, 0.2〉 ∧ 〈0.9, 0.1〉] ⊕ 〈0, 0〉 ∨ [〈0.65, 0.35〉 ∧ 〈0.7, 0.3〉] (6.26)

= 〈0.72, 0〉 ⊕ 〈0.455, 0〉 (6.27)

= 〈0.8474, 0〉 (6.28)

Similarly, the contribution of rules that entail ¬q is computed as


〈0, 0〉 ∨ [〈0.2, 0.8〉 ∧ 〈0.9, 0.1〉] (6.29)

= 〈0.18, 0〉 (6.30)

These scores are combined using Eqn. 6.31. The final bilattice score is thresholded to

obtain the set of pedestrian detections.

cl(φ)(q) = 〈0.8474, 0〉 ⊕ ¬〈0.18, 0〉 (6.31)

= 〈0.8474, 0.18〉 (6.32)

The DPM based inference is performed on newly detected pedestrians. Detections

obtained in past frames are tracked using the foreground blob and optic flow based

trackers that we described earlier. The feature matching scores of the tracker are

combined to obtain tracked pedestrian bounding boxes. Since the optic flow based

tracker computes 5 tracking vectors (based on the part templates), we need to select

one vector which is assigned to the tracked pedestrian. This is done by computing

the Bilattice values using the NCC and SP scores for all the 5 part-templates. The

vector associated with the highest inference score is considered as the pedestrian

displacement vector. Also, a threshold is applied on this score to determine lost tracks.

In the case of losing the track of a pedestrian, the DPM based detection is performed.

φ(ped(n, vp)← NCCscore(n, vp)) (6.33)

φ(ped(n, vp)← SPscore(n, vp)) (6.34)

Here, n represents the index of the pedestrian in the state.

• Super pixel inference:


Fig. 6.21a shows a sample pedestrian detection and super pixels in a blob. The super

pixels are labelled using a sequential decision procedure. Super pixels overlapping

with the ‘head’ rectangle of pedestrian detections are marked as ROI. Super pixels

inside the torso regions (that have not been marked as ROI by other pedestrian detec-

tions) of the pedestrian bounding box are marked as RORI. Discrimination between

the RORI and RONI super pixels in the leg region is slightly harder since the lower

bound of the pedestrian bounding box is not accurately marked by the DPM detector.

To accurately label these super pixels, we define a prior RORI term based on distance

from the torso. Fig. 6.21b shows one such super pixel which is assigned a prior RORI

score equal to max(0, c(1 − y_SP/h_leg)). Here, c is a constant that biases super pixels

closer to the torso to be marked as RORI. Rules based on the prior term, physics based

shadow score p_phy and texture based shadow score p_tex are defined as shown in Eqns.

6.35, 6.36 & 6.37. They are combined using the bilattice logic inference to determine

the set of RONI super pixels (i.e. shadow super pixels). Super pixels not marked as

shadow in the ‘leg’ rectangle are labelled as RORI.

φ(¬RONI_SP(x_SP, y_SP) ← prior(x_SP, y_SP, h_leg)) (6.35)

φ(RONI_SP(x_SP, y_SP) ← shadow_phy(x_SP, y_SP)) (6.36)

φ(RONI_SP(x_SP, y_SP) ← shadow_tex(x_SP, y_SP)) (6.37)

The unmarked super pixels close to a pedestrian detection that do not have sufficient

FG support are marked as RORI. This is particularly effective in accurately marking

MB’s that cover articulated parts of the human body i.e. the arms and the legs. The

super pixels which continue to remain unassigned are classified based on the skin

score, i.e. super pixels having very low probability of containing skin image regions

are marked as RORI. Also, super pixels which have a high probability of containing

shadow image regions are marked as RONI. All the remaining unmarked super pixels

in the image which have not been assigned a label are marked as ROI.
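The worked example of Eqns. 6.26 to 6.32 can be reproduced numerically. This is a minimal sketch that assumes the product and probabilistic-sum forms of the bilattice operators on 〈belief, disbelief〉 pairs. The thesis reviews its operators in Appendix C, which is not reproduced here, so the operator definitions and helper names are assumptions. They were chosen because they yield the values stated in the equations.

```python
ZERO = (0.0, 0.0)  # <0, 0>: no evidence for or against


def conj(a, b):
    """a ∧ b: belief product, disbelief probabilistic sum."""
    return (a[0] * b[0], a[1] + b[1] - a[1] * b[1])


def disj(a, b):
    """a ∨ b: belief probabilistic sum, disbelief product."""
    return (a[0] + b[0] - a[0] * b[0], a[1] * b[1])


def join(a, b):
    """a ⊕ b: evidence combination along the knowledge axis."""
    return (a[0] + b[0] - a[0] * b[0], a[1] + b[1] - a[1] * b[1])


def neg(a):
    """¬a: swap belief and disbelief."""
    return (a[1], a[0])


def rule(fact, weight):
    """Contribution of one rule: <0,0> ∨ [fact ∧ weight]."""
    return disj(ZERO, conj(fact, weight))


# Evidence for q: left-head part score and super pixel skin score.
pos = join(rule((0.8, 0.2), (0.9, 0.1)), rule((0.65, 0.35), (0.7, 0.3)))
# Evidence for ¬q: absence of the left-head part.
against = rule((0.2, 0.8), (0.9, 0.1))
closure = join(pos, neg(against))  # approximately (0.8474, 0.18)
```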


[Figure 6.21 content: (a) shadow super pixel, skin super pixel and pedestrian detection marked; (b) distance y_SP and leg height h_leg marked]

Figure 6.21: (a) Figure shows a pedestrian detection and different super pixels in the blob

(b) The pedestrian bounding box is divided into face, torso and leg rectangles. The super

pixels in the leg region are assigned a prior RORI score based on the distance y_SP.
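The prior RORI term max(0, c(1 − y_SP/h_leg)) can be written as a one-line function. This is a minimal sketch: the default value of c is a placeholder, since the text does not state the value used.

```python
def prior_rori(y_sp, h_leg, c=1.0):
    """Prior RORI score of a leg-region super pixel.

    y_sp:  vertical distance of the super pixel from the torso
    h_leg: height of the 'leg' rectangle
    c:     bias constant (placeholder value, not taken from the text)
    """
    return max(0.0, c * (1.0 - y_sp / h_leg))
```

The score decreases linearly from c at the torso boundary to 0 at the bottom of the leg rectangle and is clamped to 0 beyond it.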

6.10 Macroblock mode and quality parameter assignment

As mentioned in Chapter 2, video coding standards such as H.264 and HEVC allow the

encoder to perform block level ROI coding, i.e. the encoder can specify the slice level and

MB level QP parameters. In this thesis, we use a fixed QP assignment for ROI, RORI & RONI

MB’s. We denote the QP values assigned to ROI, RORI & RONI MB’s as QPROI , QPRORI

& QPRONI respectively. MB’s that overlap with super pixels marked as ROI are assigned

a QP equal to QPROI . Similarly, MB’s that overlap with super pixels marked as RORI are

assigned a QP equal to QPRORI . Coding of all the other MB’s needs to be skipped. This is

performed by following the Skip signalling procedure discussed in Chapter 4.

In [204, 205], Gao et al. analyzed the impact of increasing quantization in video coding

on feature analysis, object detection, and face recognition algorithms. They showed that

increasing QP up to 34 - 36 does not impact object recognition performance. However,

the face recognition task exhibits a continuous reduction in performance with increasing

QP. This agrees with the intuition that tasks such as face recognition require finer details

of image features. Unlike face recognition, object recognition algorithms are based on


object shape and hence are more resilient to video compression noise. Based on these

observations, we set QPROI to 20 and QPRORI to 34. We now discuss two important

components of practical surveillance encoders relevant in the present context of ROI video

coding:

• Rate control: Although we use a fixed assignment of QP values to ROI’s and RORI’s

in this thesis, we can easily incorporate rate control techniques into the proposed

architecture. As an example, when channel bandwidth reduces, QPRORI can be in-

creased. Under severe bandwidth loss, the RORI regions can be marked as skip.

• Error resilience: Along with reducing the bitrate, ROI inference can also be used to

increase network error protection to the regions of interest. Error protection requires

additional data to be embedded in the bit stream. This increases the bitrate of the

encoded surveillance video. Hence, providing protection to only regions of interest

improves coding efficiency of such error resilient encoder systems. For example, the

Intra mode can be chosen to code the ROI MB’s as proposed in [78].
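The macroblock QP assignment described in this section can be sketched as follows. This is a minimal illustration: the per-pixel label map, the function names and the precedence of ROI over RORI within one macroblock are assumptions. The QP values 20 and 34 are those chosen in the text. The `qp_rori` argument shows where a rate control unit could raise the RORI QP.

```python
QP_ROI, QP_RORI = 20, 34  # values chosen in the text
SKIP = None               # marker for macroblocks whose coding is skipped


def mb_qp(labels_in_mb, qp_rori=QP_RORI):
    """QP for one macroblock given the labels of the super pixels it overlaps."""
    if "ROI" in labels_in_mb:   # assumed precedence: ROI wins over RORI
        return QP_ROI
    if "RORI" in labels_in_mb:
        return qp_rori
    return SKIP


def assign_qp(label_map, mb_size=16, qp_rori=QP_RORI):
    """Per-macroblock QP map from a 2-D per-pixel label map."""
    h, w = len(label_map), len(label_map[0])
    qp = []
    for y0 in range(0, h, mb_size):
        row = []
        for x0 in range(0, w, mb_size):
            labels = {label_map[y][x]
                      for y in range(y0, min(y0 + mb_size, h))
                      for x in range(x0, min(x0 + mb_size, w))}
            row.append(mb_qp(labels, qp_rori))
        qp.append(row)
    return qp
```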

6.11 ROI, RORI & RONI video compression results

6.11.1 Experimental Setup

The proposed ROI detection system has been implemented in C++. We modify the SLIC

implementation provided by the authors [150] to generate super pixels on foreground re-

gions. We have ported the Matlab code of the DPM detector released by Felzenszwalb et al.

to C++ [144]. We have modified the implementation for physics based shadow detection

provided by Sanin et al. [157]. We have developed the implementation for the FG blob

and optic flow based trackers in C++. We have used the OpenCV library for optical flow

and other low level image processing tasks. We have integrated the proposed ROI detector

into the highly optimized x264 H.264/AVC encoder software [11]. The encoded videos


have been published on the Internet (see footnote 1). Main profile with P slices and context-adaptive

binary arithmetic coding (CABAC) entropy coding is used for all the experiments. Single pass

mode with IPPP coding structure is used for low delay and low-complexity encoding. RD

mode decision for all frames and fast skip detection on P-frames have been enabled. Single

threaded mode is chosen and the computation time is measured on a Core i7 processor

running at 2.4 GHz with 16 GB of system memory.

6.11.2 Bitrate reduction and accuracy

To validate the proposed technique, we have marked the face regions on 40 frames in

two videos, ‘Entrance road’ & ‘Porch’. Fig. 6.22 shows the bitcount and PSNR (computed

over the face image region) of the ‘Entrance road’ video compressed using the proposed

ROI encoder. For comparison, we also plot the data obtained using the FG skip detector

based encoder. The proposed technique reduces bitrate by 37.2% compared to the FG

skip detection based encoder. Also, the proposed technique accurately detects face image

regions. Hence, it maintains good quality of the face image regions. Fig. 6.23 shows the

frames encoded using the proposed ROI encoder and the FG skip detection based encoder.

We can clearly see that the image quality of the face region is unaffected. Similar results

have been obtained for the ‘Porch’ video and are shown in Figs. 6.24 & 6.25. The proposed

ROI encoder provides bitrate reduction of 50.2% on this video.
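The two metrics reported here, the bitrate reduction relative to the skip detection based encoder and the PSNR restricted to the face region, can be computed as sketched below. This is a minimal illustration: the function names and the half-open bounding box convention are assumptions, and any inputs are placeholders rather than thesis data.

```python
import math


def bitrate_reduction(ref_bits, test_bits):
    """Percentage reduction in total bits relative to the reference encoder."""
    return 100.0 * (sum(ref_bits) - sum(test_bits)) / sum(ref_bits)


def region_psnr(orig, recon, box, peak=255):
    """PSNR in dB over box = (x0, y0, x1, y1), half-open, on 2-D luma arrays."""
    x0, y0, x1, y1 = box
    sq_err = sum((orig[y][x] - recon[y][x]) ** 2
                 for y in range(y0, y1) for x in range(x0, x1))
    mse = sq_err / ((x1 - x0) * (y1 - y0))
    if mse == 0:
        return float("inf")  # identical regions
    return 10.0 * math.log10(peak ** 2 / mse)
```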

Fig. 6.26 shows the enlarged image of a pedestrian in the ‘Porch’ video. We can clearly

see that the fine texture and cloth deformation features in the image have been removed by

the proposed ROI encoder. This reduces the bitrate of the compressed video stream. The

figure also shows that the quality of the face region image remains unaffected. We also

manually verified that none of the true RORI MB’s were marked as RONI by the proposed

encoder.

1 http://chips.ece.iisc.ernet.in/index.php/Pushkar G


[Figure 6.22 plots: (a) Bitcount, (b) Face region PSNR]

Figure 6.22: Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the ‘En-

trance road’ video. The comparison results have been obtained using (I) Proposed method

and (II) Only skip detection. The overall bitrate reduction using the proposed technique is

37.2%. The total face region distortion metrics using the proposed method and the FG skip

detection encoder were both measured as 40.8dB


(a) x264 + skip det.

(b) Proposed

Figure 6.23: Figure shows frames from the ‘Entrance road’ video compressed using (a) Only

skip detection (b) Proposed ROI encoder.


[Figure 6.24 plots: (a) Bitcount, (b) Face region PSNR]

Figure 6.24: Figure shows the (a) Bitrate and (b) Face region PSNR of 40 frames in the

‘Porch’ video. The comparison results have been obtained using (I) Proposed method and

(II) Only skip detection. The overall bitrate reduction using the proposed technique is

50.2%. The total face region distortion metrics using the proposed method and the FG skip

detection encoder were both measured as 40.9dB


(a) x264 + skip det.

(b) Proposed

Figure 6.25: Figure shows frames from the ‘Porch’ video compressed using (a) Only skip

detection (b) Proposed ROI encoder.


[Figure 6.26 content: panels labelled ‘Proposed ROI encoder’ and ‘x264 + skip det.’; annotations: ‘Finer details of clothing are removed by the ROI encoder’ and ‘Face image features are retained’]

Figure 6.26: Figure shows that the proposed ROI encoder removes finer details in the RORI

MB’s but maintains image quality of the face region.


6.11.3 Impact of detector errors on ROI encoder performance

The MB labelling errors of the ROI encoder can be categorized based on their consequence

as follows:

• Errors resulting in quality degradation: As an example, MB’s over the face image

region could be incorrectly labeled as RORI. In such cases, identification of the person

in the encoded video would not be possible.

• Errors resulting in increased bitrate: An example of this is a RONI region incorrectly

marked as ROI. In such cases, the bitrate increases without any increase in the utility

of the encoded video. If the bandwidth of the communication channel is insufficient,

the rate control unit will reduce the bitrate by lowering the quality of the encoded

video.

Fig. 6.27 shows a graphical representation of the different errors and their

consequences (cells are color coded to signify the severity of the consequence). The accuracy of

the MB labeling scheme is directly related to the performance of the individual components,

i.e. DPM detector, shadow detector, tracker and the inference algorithm. We now discuss

this in detail.


Figure 6.27: The table shows the different MB labeling errors and their consequences (cells

are color coded to signify the severity). Here, the rows correspond to true MB labels and

columns to the MB labels assigned by the ROI detector.

Experimental results show that the DPM detector does not detect pedestrians on a few

images. Low resolution, low contrast and occlusion are the main reasons for such missed

detections. Fig. 6.28 shows two pedestrians (A & B) in a single blob that are not detected due


to occlusion and low contrast (of the head region of ped. B). If they were not being tracked

from previous frames, then the inference procedure will mark the RORI regions (associated

with these pedestrians) to be encoded at low QP, i.e. high quality. As a consequence, the

bitrate reduction achieved will decrease. For example, when the two pedestrians are not

detected, the image region occupied by pedestrian B requires 19.9 kbits in the compressed

video (ROI QP = 20). When the RORI MB’s are accurately detected, the bit consumption (for the

region covering the image of pedestrian B) reduces by 79% (RORI QP = 34). In contrast,

accurately detecting pedestrian A will provide a bit count saving of only 1 kbit. This

is because the number of RORI MB’s of pedestrian A that are visible is very small due to

occlusion by pedestrian B. These observations suggest that in a compute limited platform,

an optimum scheduling of the computing resources to different target image regions can

improve performance. We introduce these ideas in Sec. 6.11.5.
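The saving quoted for pedestrian B follows directly from the stated figures: 19.9 kbits × 0.79 ≈ 15.7 kbits. The sketch below reproduces this arithmetic and shows one possible greedy ordering of candidate regions by estimated saving per unit of detector time. The ranking rule, the field names and any cost values are hypothetical and are not the scheduling scheme of Sec. 6.11.5.

```python
def kbits_saved(roi_kbits, reduction_fraction):
    """Bits saved when a region coded at ROI quality is re-coded as RORI."""
    return roi_kbits * reduction_fraction


def rank_regions(regions):
    """Order regions by estimated saving per ms of detector time (greedy)."""
    return sorted(regions,
                  key=lambda r: r["saving_kbits"] / r["cost_ms"],
                  reverse=True)
```

For equal detector cost, pedestrian B (about 15.7 kbits) would be processed before pedestrian A (about 1 kbit).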

Ped A

Ped B

Figure 6.28: Figure shows that the DPM detector has failed to detect pedestrians A & B.

Pedestrian A is severely occluded by B. The head region of pedestrian B has poor contrast.

Accurate detection of pedestrian B would have reduced bit cost of the frame by 15kbits. In

contrast, detection of pedestrian A would reduce bit cost by only 1kbit.

Also, we find that small pedestrian images are not detected occasionally. Fig. 6.29a

shows one such example. Detection of these requires inclusion of scene context and time

sequence analysis in the inference algorithm. However, in the current context of pedestrian

detection for ROI video coding, we show later in Sec. 6.11.5 that such missed detections do

not have a large impact on the performance of the system. Also, the tracker helps to reduce


the miss detection rate. Once a pedestrian has been detected, the tracker provides labeling

data for subsequent frames (until the tracker model requires an update). Hence it reduces

the overall miss rate of the ROI encoder. Fig. 6.29b shows one such example in which the

pedestrian (enclosed in a green bounding box) has not been detected by the DPM detector

but the tracker has localized the position in the image (shown by the green bounding box).

(a) (b)

Figure 6.29: (a) Figure shows a few more DPM detector failures on small pedestrians (b)

Here, the tracker has tracked the pedestrian (bounded by the green box) based on a previ-

ous detection. If only the DPM detector was applied on the current frame, the pedestrian

would have been missed.

Along with missed detections, false positives also result in increased bitrate. However,

since the DPM scores are computed only over foreground regions, the false positive rate

is reduced. Incorrect detections mostly include two cases: (I) Torso regions being marked

as head shoulder and (II) False detections in large foreground blobs. Also, in some cases,

severe localization errors cause RORI regions to be marked as ROI. Fig. 6.30 shows a few

examples of such errors. Also, the false detections are tracked in future frames. In the case

where the torso image region is marked as the head-shoulder part, NCC scores will be high

in subsequent frames. Hence, the false detections will persist in future frames until the next

tracker model update. Periodic verifications of the tracked objects using the DPM detector

can be used to prune such false positives. However, this will increase the computational


complexity of the system. We leave the study of this tradeoff to future work.

(a) (b)

Figure 6.30: (a) Figure shows localization error of the DPM detector. The detector has

included the shadow regions below the pedestrian (due to incorrect shadow detection) in

the bounding box (b) The torso has been detected as a head shoulder region. Again, the

shadow region has been included in the bounding box due to incorrect detection.

The errors we have discussed so far cause an increased bitrate. However, a more severe error

is when a ROI MB is encoded as RORI/RONI or a RORI MB is encoded as RONI. Since

such errors affect the quality of ROI/RORI image regions in the encoded video, it is im-

portant to minimize them. In the proposed technique, we mark RORI and RONI MB’s only

when a pedestrian is detected. Hence, such errors (that affect quality) can occur only when

the ROI/RORI MB of an undetected pedestrian intersects with the RORI/RONI regions of

a detected pedestrian. Such errors are not common, but we show a few cases where they

appear. In Fig. 6.31b, the image of a small child appears adjacent to the image of a pedestrian.

The DPM detector has detected the larger pedestrian but missed detecting the child. Hence,

the head/face region of the child would otherwise be marked as RORI. In this case, however, the tracker

(that was initialized during a previous detection of the child) was tracking the image of the

child. If the tracker had not yet been initialized, the face region of the child would have been

encoded using a high QP and hence would not be recognizable. Clearly, these errors are

particularly high when small pedestrians emerge from occlusion. Detection of small and


occluded pedestrians continues to remain a challenge that needs to be addressed by the

computer vision community. Since surveillance images of children are particularly impor-

tant to encode with high resolution, a multi-camera network based system can be adopted

to ensure high accuracy. Multiple camera views can improve the detection performance of

occluded pedestrians. We leave this to future work.

(a) (b) (c)

Figure 6.31: Figure shows three frames (a), (b) & (c) (that are temporally ordered) in which the

DPM detector has detected the child before (i.e. in (a)) and after (i.e. in (c)) the occlusion.

However, the detector has failed during the occlusion (i.e. in (b)). If the child was not

tracked by the tracker, his face image regions would be encoded in low quality.

Also, the tracker can sometimes drift as shown in Fig. 6.32. However, in such cases, the

NCC scores drop, which triggers the execution of the detector on the foreground regions. Since

we do not update the template in the tracker, we find that the NCC scores reliably measure

the similarity of the pedestrian image with the template.
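The drift check described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `ncc` and `needs_redetection`, the flat-list patch representation and the threshold value of 0.5 are hypothetical and are not taken from the implementation used in this thesis.

```python
import math

def ncc(patch, template):
    """Normalized cross-correlation of two equally sized grayscale patches.

    Both arguments are flat sequences of pixel intensities. The result lies
    in [-1, 1]; 0.0 is returned when either patch has zero variance.
    """
    if len(patch) != len(template) or not patch:
        raise ValueError("patches must be non-empty and of equal size")
    mean_p = sum(patch) / len(patch)
    mean_t = sum(template) / len(template)
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = math.sqrt(sum((p - mean_p) ** 2 for p in patch)
                    * sum((t - mean_t) ** 2 for t in template))
    return num / den if den > 0 else 0.0

# Hypothetical threshold, chosen only for illustration.
NCC_THRESHOLD = 0.5

def needs_redetection(patch, template, threshold=NCC_THRESHOLD):
    """Return True when the tracked patch no longer resembles the template.

    The template is kept fixed (it is never updated), so a low score
    indicates drift or occlusion rather than a corrupted template.
    """
    return ncc(patch, template) < threshold
```

Because the template is never updated in this sketch, the score can recover on its own once the pedestrian reappears from occlusion.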


(a) (b) (c)

Figure 6.32: Tracker bounding box positions as the tracked pedestrian

gets occluded and reappears later. During the occlusion (i.e. in (b)), the NCC score of

the tracked pedestrian drops from 0.76 to 0.41. This would trigger the execution of the

DPM detector. However, since the template is not updated, the tracker reassigns the correct

bounding box when the pedestrian reappears from occlusion in (c). The NCC score also

increases to 0.7.

6.11.4 Computational complexity

Computational complexity of the proposed ROI, RORI & RONI detection technique depends

on the number of foreground objects and their image sizes. We have measured the ex-

ecution time of various processing components for the ‘Porch’ video. The sampler based

foreground segmentation and blob processing requires about 10 ms per frame. Super pixel

detection on FG image regions takes 5 - 6 ms per k-means iteration (We use 4 iterations).

We note that the super pixel algorithm is easily parallelizable. Also, Benesova et al. [206]

have proposed a speeded up super pixel detector based on morphological processing that

is 6.6× faster than SLIC on 1MP images. The physics & texture based shadow detectors

require 5 - 6 ms & 2 - 3 ms per frame respectively. The pixel level skin detector requires 1

- 2 ms per frame. The computational cost of the system is dominated by the DPM detector.

On the ‘Porch’ video, feature computation on FG image regions in a frame takes 150 - 200

ms. DPM Score computation also requires 150 - 200 ms. Since we run the DPM detector

only once in a few frames, its average cost per frame is lower. Also, several researchers

have significantly reduced the computational cost of DPM. Dollar et al. [207] propose


to generate HOG features at only octave spaced scale intervals. These features are used to

generate the HOG data at intermediate scales. Sadeghi et al. [208] combine hierarchical

vector quantization, hashing techniques, multi threading and cache optimization to obtain

a highly efficient DPM detector that runs at 30fps on a 6 core Intel Xeon processor. Optical

flow and super pixel matching, which are required to perform detection-by-tracking, take 15 - 20

ms per frame.

Along with the recent algorithmic techniques that have been developed to reduce de-

tector complexity, many hardware architectures are also actively being researched. These

developments clearly suggest that the proposed ROI coding techniques can be successfully

implemented on future camera platforms.

6.11.5 Complexity control for ROI encoding

Due to the cost sensitive nature of the surveillance camera market, manufacturers prefer

to minimize the computational capability of camera platforms. In this thesis, we have al-

ready proposed multiple computational complexity reduction techniques. We now propose

an orthogonal approach for ROI video encoders in this section.

Consider the image blob of a single pedestrian in a scene. The area of the FG blob

and the number of FG MB’s increases significantly as the pedestrian image height increases.

Hence, bitrate reduction obtained using ROI encoding is high when the pedestrian image

size is large, i.e. when the pedestrian is close to the camera. The bitrate reduction obtained

by marking shadow regions as RONI is dependent on light sources and the background

image texture. Fig. 6.33 shows the bit count savings ∆B (i.e. difference between (I) bit

count with ROI coding and (II) bit count without ROI coding) plotted against the height of

the pedestrian image (in pixels). This plot was obtained using a sample surveillance video

in which a pedestrian was walking towards the camera. QP has been set equal to 24 for

the video encoded without ROI coding. For the ROI encoded video, QPROI is set to 24 and

QPRORI is set to 32.



Figure 6.33: Bit count savings is plotted against the height of the pedestrian image in

pixels. QP = 24 for the video encoded without ROI coding. For the ROI encoded video,

QPROI = 24 and QPRORI = 32.

Along with the pedestrian height and distance from the camera, bitrate reduction ob-

tained using the proposed ROI encoding technique depends on the position of the pedes-

trian in the scene. This is better explained using Fig. 6.34, where multiple pedestrians

are present in the scene. Pedestrians A & D are close to the camera and are minimally

occluded. Pedestrians B & C are highly occluded. Hence, marking ROI, RORI & RONI MB’s

for pedestrians A & D provides higher bit rate reduction.

Based on these observations, we can write the total bitrate savings ∆B obtained by

accurately identifying ROI, RORI & RONI MB’s as

\[
\Delta B = \sum_{i=1}^{N} \phi(h_i, o_i, s_i) \tag{6.38}
\]

Here, φ(.) represents the bitrate model of a pedestrian image. N is the total number of
pedestrians in the scene. oi represents the occlusion pattern of the ith pedestrian, i.e. the set
of MB’s of the ith pedestrian image occluded by background objects or other foreground
objects. hi is the image height and si is the state of the ith pedestrian. The state si includes
lighting conditions, appearance & background texture data.

[Figure 6.34 labels: pedestrians A, B, C, D, E, F, G; Blob 1, Blob 2, Blob 3]

Figure 6.34: A scene containing multiple pedestrians. Pedestrians A & D cover a large
number of MB’s in the image. Hence, ROI detection on image regions of these pedestrians
provides higher bitrate savings.

Here, we have assumed that the ROI detections have been performed accurately on all the
pedestrians. However, under computational and detector accuracy constraints, the achievable
savings will be lower. This can be written as

\[
\Delta B_{ach} = \sum_{i=1}^{N} C_i(R)\, D_i\, \phi(h_i, o_i, s_i) \tag{6.39}
\]

where Ci(R) = 1 indicates that the ith pedestrian was included in the hypotheses test

set. Here, R represents the set of image regions which are processed by the ROI detector.

Let RFG represent the set of all foreground regions in the image. Di indicates whether the

detector successfully determined the ROI, RORI & RONI MB’s of the ith pedestrian. The

objective for a resource constrained ROI encoder is to determine the optimal set of image

regions R over which the detector searches for pedestrians. The objective function can be

written as follows:


\[
R^{*} = \arg\max_{R} \left( \sum_{i=1}^{N} C_i(R)\, D_i\, \phi(h_i, o_i, s_i) \right) \tag{6.40}
\]

We leave the detailed analysis and design of the optimal system as future work. Here,

we only describe a few considerations of such a ‘compression aware’ ROI detector. A simple

strategy would be to select the image regions to be processed in a sequential manner. For

example, in Fig. 6.34, blob 1 can be chosen first. After detection is complete on this blob,

blob 2 can be considered. Arrival of a new frame terminates the detection process on the

current image. ROI, RORI & RONI signalling is performed based on the available pedestrian

detections.

In such a system, the order in which the regions are processed determines ∆Bach. We

now illustrate this using the image shown in Fig. 6.34. Let us assume that the entire

foreground region RFG cannot be processed due to computational constraints. Depending

on the set of foreground image regions R chosen by the inference procedure, different sets

of pedestrians are detected. For example, if the blobs 2 & 3 are chosen, the pedestrians D,

E, F & G would be detected. However, if blobs 1 & 2 are chosen, pedestrians A, B, C & D

would be detected. Clearly, choosing blobs 1 & 2 results in greater bitrate reduction since

they contain a larger number of RORI & RONI MB’s. Similarly, pedestrian hypotheses that

are unoccluded (e.g. pedestrian D) can be prioritized over other regions. These observations

can be used to determine the optimal sequence of image regions to be processed by the ROI

detector.
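One possible reading of the sequential strategy above is a greedy schedule that ranks blobs by estimated bitrate savings per unit of detector time and stops when the frame period is exhausted. The sketch below is illustrative only: the `Blob` fields, the `schedule_blobs` function, the ranking rule and all numeric values are assumptions made for this example, not measured data or the design of the optimal system.

```python
from dataclasses import dataclass

@dataclass
class Blob:
    name: str
    est_savings_bits: float  # hypothetical estimate of the summed phi(h_i, o_i, s_i)
    est_detect_ms: float     # hypothetical estimate of detector run time (> 0)

def schedule_blobs(blobs, frame_budget_ms):
    """Greedily pick blobs for the ROI detector within one frame period.

    Blobs are ranked by estimated savings per millisecond of detection.
    Processing is sequential, so scheduling stops at the first blob that
    no longer fits, mimicking the arrival of a new frame.
    """
    order = sorted(blobs,
                   key=lambda b: b.est_savings_bits / b.est_detect_ms,
                   reverse=True)
    chosen, elapsed_ms = [], 0.0
    for blob in order:
        if elapsed_ms + blob.est_detect_ms > frame_budget_ms:
            break
        chosen.append(blob)
        elapsed_ms += blob.est_detect_ms
    return chosen

# Illustrative numbers only, loosely following the blobs of Fig. 6.34.
blobs = [
    Blob("Blob 1", est_savings_bits=9000.0, est_detect_ms=120.0),
    Blob("Blob 2", est_savings_bits=6000.0, est_detect_ms=60.0),
    Blob("Blob 3", est_savings_bits=2000.0, est_detect_ms=100.0),
]
chosen = schedule_blobs(blobs, frame_budget_ms=200.0)
```

With these made-up estimates, the schedule selects blobs 2 and 1 and leaves blob 3 unprocessed, which matches the qualitative argument above. A real system would need a calibrated model of φ(.) and of the detector run time.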

6.12 Summary

In this chapter, we proposed a Region-of-Interest video encoder for pedestrian surveillance.

We showed that neither the Viola Jones face detector nor the adaptive skin detector is

capable of accurately marking ROI’s in real world surveillance videos. Hence, we proposed

an architecture that combines multiple detector scores using bilattice logic reasoning. To

obtain compact representation of regions, super pixels were computed on the foreground

pixels. Shadow and skin probability scores were obtained for all the super pixels. We have


modified the DPM technique to obtain pedestrian part scores. Bilattice logic reasoning is

used to combine part scores and detect the pedestrians. Since the DPM based detector fails

occasionally, we use a tracker that uses optical flow and a Kalman filter to accurately detect

pedestrians in the video sequence. The tracker also reduced the computational complexity

by avoiding the need to run DPM on every video frame. We posed the ROI and RORI de-

tection task as a super pixel labelling problem. The bilattice reasoning framework was used

to mark ROI, RORI & RONI super pixels. QP assignment to the MB’s is performed using

the labels of the super pixels. The proposed techniques have been integrated into the x264

video encoder. Experiments show bitrate savings of up to 50.2%.

Chapter 7

Conclusion

High image detail is very critical to recognize and identify miscreants in surveillance footage.

As we have shown in the introduction, to obtain high image detail and large surveillance

coverage, we need to use high resolution surveillance cameras. However, the communication

bandwidth requirements of HD cameras are very high. For example, the typical optimized

bitrate of a 12MP H.264 surveillance video stream is about 4 - 6 Mbps [5, 6]. Such

a high bandwidth requirement increases the data communication and storage costs of the

system. Hence, it is very important to reduce the bitrate of HD camera videos to facilitate

faster market adoption of HD cameras. In this thesis, we have shown that this is achievable

by augmenting the H.264 video encoder with computer vision algorithms.

In Chapter 1, we partitioned the bit cost of a static-camera surveillance video as:

• Background MB cost

• Uncovered background MB cost

• Shadow MB cost

• Non face MB (clothing, arms) cost

• Face MB cost

In this thesis, we have addressed all these components. We proposed four techniques to

reduce the bitrate of surveillance videos:

1. Speeded up GMM based foreground segmentation: Reduces the computational com-

plexity of foreground segmentation which is required to perform skip detection.

2. Skip detection: Reduces the cost of coding Background MB’s by accurately detecting

and marking them as Skip.

3. Reference frame selection: Optimally selects reference frames to reduce the uncov-

ered background MB cost.

4. Face ROI coding for pedestrian surveillance: Detects shadow MB’s and marks them

as Skip. A detector and tracker framework has been developed to accurately detect

face and non-face regions. Non face MB’s are encoded in lower quality to reduce the

bitrate.

To perform accurate skip detection, we designed a multi stage sampler based back-

ground MB classifier. Stratification and adaptive sampling techniques have been combined

to reduce the complexity of the BG MB detector. The sampled pixels were classified using a

GMM based segmentation algorithm. We have proposed a modified weight update scheme

to reduce the computational complexity of the GMM based pixel level foreground segmenta-

tion algorithm. The proposed technique marks background MB’s as Skip and hence reduces

the bitrate and complexity of the encoder. Although foreground object detection might have

initially seemed to be very easily achievable, experimental results show that real world is-

sues such as environmental noise, poor lighting conditions and limited processing power

pose significant challenges. The proposed skip detector reduces bit rate by up to 94.5% and

computational complexity by up to 74.5% without affecting the foreground image quality. It

requires 1-3.6ms on a single core and hence can be easily implemented on embedded cam-

era platforms. Also, experimental results of the modified GMM algorithm show a speedup

of up to 44% in scenes where a large fraction of the pixels require multimodal Gaussian

models.

The skip detector based encoder uses image content in the DPB to reconstruct the back-

ground image content. However, skip signaling of uncovered background MB’s in the H.264

standard is not possible if the decoded picture buffer does not contain the corresponding

background image. This reduces the achievable bit rate savings. We have shown that the

optimal selection of reference frames can maximize the number of BG MB’s in the DPB


and hence reduce the cost of coding uncovered background regions in the video frame. A

very low complexity technique has been proposed to determine the optimal set of reference

frames that need to be stored in the DPB. Experiments on real world datasets show that the

proposed reference frame selection method reduces bit rate by up to 24.7% and execution

time by up to 7.3%.

In the specific application of pedestrian surveillance video coding, the face of the pedes-

trian is the most important region in the image. Hence, we proposed to detect & encode the

MB’s that cover the image of the face in high quality. The non-face MB’s of the pedestrian

are considered as ‘Regions of reduced interest’ and are encoded in reduced quality. Shadow

regions are marked as skip. Face detection (based on facial features alone) in controlled

conditions has been very successful. However, we have shown that the accuracy of such

detectors is poor in real world scenarios. Hence, to accurately determine the ROI, RORI &

RONI MB’s, we have combined the outputs of multiple detectors. We pose the MB labelling

task as a super pixel classification problem. Shadow and skin detector scores of super pixels

have been computed. Pedestrians are detected using deformable part models. The face

region is determined using the deformed part locations. Detected pedestrians are tracked

using an optical flow based tracker combined with a Kalman filter. The tracker improves

the accuracy and also avoids the need to run the object detector on already detected pedes-

trians. Bilattice based logic inference has been used to combine multiple likelihood scores

and determine the labels of the super pixels. The coding mode and QP values of the MB’s

have been computed using the super pixel labels. Results show that the proposed face ROI

coding technique provides a further reduction in bitrate of up to 50.2%.

Although the results that we have shown in the thesis have been obtained by modifying

the H.264 encoder, we do note that the proposed techniques can be applied to the recently

finalized HEVC standard as well. All the techniques presented in this thesis assume a static

camera setup. This is the most common use case in video surveillance installations. How-

ever, the proposed ROI coding ideas can be developed further to support pan tilt cameras.


7.1 Future Challenges and Opportunities

In this thesis, we have seen that applying computer vision algorithms as a preprocessing

step to compression significantly reduces the bit rate, especially when perceptual aspects

of the video sequence are considered. Meticulous design of vision algorithms surprisingly

reduces overall computation. However, many challenges still exist in designing effective

surveillance video encoding systems. We describe a few of them here.

7.1.1 Coding for surveillance cameras on drones

Over the past few years, unmanned aerial vehicles (UAVs) have become very popular

for asset monitoring, law enforcement, agriculture and also for recreation. Many futuris-

tic applications such as cargo transport are also being conceived. Video compression on

such airborne platforms poses enormous challenges. These systems have very tight power

budgets and computational capability constraints. Since data communication is over the

wireless channel, bandwidth is also limited. Region of interest video coding techniques on

such platforms could reduce the bitrate and hence the energy consumption of the radio

module. This would increase the battery life and hence the operational time of the UAV.

However, since the camera is in motion, foreground segmentation is not easily achievable.

Hence, the complexity of the computer vision algorithms required is also higher. Image

stabilization and rolling shutter correction are essential to reduce the residual energy.

ROI video coding for such low power, wireless platforms is a very exciting and challenging

research problem.

7.1.2 Power-Rate-Distortion optimization of ROI encoders

In Chapter 6, we have briefly discussed a technique to sequence the ROI detection opera-

tions for ROI coding on a compute-limited platform. We proposed to determine the order in

which FG blobs are processed based on the estimate of the bitrate reduction (which would

be obtained by processing the blobs). However, a full cross layer (i.e. video analytics engine,


compression engine, network layer & radio system) Power-Rate-Distortion optimization for-

mulation will improve the quality of service of wireless surveillance systems.

Also, with the emergence of wireless standards such as 5G, streaming encoded surveillance

videos over wireless networks will soon be realized. However, wireless networks are

unreliable and hence a good rate control mechanism is essential to avoid issues such

as buffer overload and frame drop. In Chapter 6, we briefly mentioned using ROI,

RORI and RONI MB labels to perform rate control. Extending this further, we could consider

completely skipping RORI regions during severe network packet loss events. Another

approach would be to compress face image regions of only a few frames of pedestrians as

they move across the scene. We plan to explore such multiple rate control schemes for ROI

surveillance coding in the future.

Multi resolution coding applied to ROI coding of surveillance videos is another interest-

ing approach that needs to be studied in detail. Although QP based ROI coding is supported

by standards, using lower resolution for regions of interest can provide better RD perfor-

mance at low bitrate. Encoding a high resolution video at high QP will cause artifacts such

as blocking, contouring and ringing. Instead, encoding a down-sampled video will result

only in a blurred output. Also, since the resolution is lower, the complexity of the encoder

is reduced.

7.1.3 360◦ surveillance video coding

Many surveillance camera vendors have started offering high resolution 360◦ cameras. Very

wide angle fish eye lenses are used to capture the scene. The 360◦ image content can

be represented in different layouts, for example, equirectangular, raw fisheye output or

cubemap representation. The choice of representation affects the compression performance

of the video encoder as well as the accuracy of the algorithms used to perform analytics. For

example, in the equirectangular layout (i.e. a world map layout), the image is distorted

near the poles. Hence, a thorough study of these issues will help in defining the entire

processing pipeline for such cameras.


7.1.4 HDR surveillance video coding

While high dynamic range (HDR) imaging of still scenes has existed for many years, commercial

HDR cameras (e.g. HC-WXF990 by Panasonic) that capture two images at different

exposure settings have appeared recently. Surveillance in particular can benefit immensely

from HDR, for example, in surveillance footage which has mixed lighting conditions (bright

sunlight on one side and dark shadows on another region in the same image). Compressing

HDR video content requires a larger number of bits. Also, as the cost of thermal imagers

falls, they will be adopted in commercial surveillance systems. These imagers commonly

output 14 bits of data per pixel. Registering color pictures with thermal imagery and ef-

ficiently coding them is a very interesting challenge which will emerge soon. Backward

compatibility (i.e. with existing codecs) is another key challenge that needs to be addressed

when developing video coding techniques for HDR.

Appendix A

Alternate derivation of the Speeded

up GMM update

The Gaussian mixture distribution of the pixel x (we have dropped the time index t here only

for the sake of clarity) formulated in terms of discrete latent variables z is shown in Eqn.

A.1 [128]. Here z is a K dimensional binary random variable having a 1-of-K representation

(z = [z1, z2, ....zK ]T ). zk = 1 indicates that the pixel x was generated from the kth mode of

the mixture model.

\[
p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} w_k\, \mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right) \tag{A.1}
\]

From the EM algorithm, the weight update at time instant t is given by Eqn. A.2 [128].

γ (zk (i)) is the posterior probability of zk (i) = 1 (i is the time index) once we have observed

the incoming pixel x(i). γ (zk (i)) can also be interpreted intuitively as the responsibility that

the mode k takes to explain away the pixel data at time instant i.

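\begin{align}
w_k(t) &= \frac{\sum_{i=1}^{t} \gamma(z_k(i))}{t} \tag{A.2} \\
       &= \frac{\sum_{i=1}^{t-1} \gamma(z_k(i))}{t} + \frac{\gamma(z_k(t))}{t} \tag{A.3} \\
       &= \frac{(t-1)\, w_k(t-1)}{t} + \frac{\gamma(z_k(t))}{t} \tag{A.4}
\end{align}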

Stauffer et al. [27] proposed to set γ (zk (t)) to 1 for the mode k which matched the

incoming pixel. The responsibility for other modes is set to 0. Also, (1/t) is set equal to

α as proposed in [27]. Hence, the weight update when pixel x(t) matched the mode ‘k’ is

obtained as shown in Eqn. A.5:

wk(t) = (1− α)wk(t− 1) + α (A.5)

Similarly, when pixel x(t) does not match the mode ‘k’, the weight update equation is:

wk(t) = (1− α)wk(t− 1) (A.6)

Assume that all the past Tw pixel samples i.e. x(t − Tw) . . .x(t − 1) matched the same

mode ‘k’. The cumulative weight update at the end of the current frame is:

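\begin{align}
w_k(t) &= [[[w_k(t - T_w)(1-\alpha) + \alpha](1-\alpha) + \alpha] \cdots ] \tag{A.7} \\
       &\approx w_k(t - T_w)(1-\alpha)^{T_w} + T_w \alpha \tag{A.8} \\
       &\approx w_k(t - T_w)(1 - T_w \alpha) + T_w \alpha \tag{A.9}
\end{align}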

Similarly for the case where none of the Tw pixel samples x(t−Tw) . . .x(t− 1) matched

the mode ‘k’, the cumulative weight update at the end of the current frame is:

\begin{align}
w_k(t) &= w_k(t - T_w)\,(1-\alpha)^{T_w} \tag{A.10} \\
       &\approx w_k(t - T_w)\,(1 - T_w \alpha) \tag{A.11}
\end{align}

Now, we propose to ignore the order in which the pixel samples x(t − Tw) . . .x(t − 1)

have arrived in the Tw frames. Hence for the case when Nk matches occur to a pixel mode

‘k’ in Tw frames, we can multiply the two Eqs. (A.9) & (A.11) with suitably modified count

values to obtain the final heuristic in Eq. (A.12). Here, Nk is equal to weightCountk which

is the number of pixels in the Tw window that matched the kth mode. We also append the

αcT term from [3] to enable dynamic selection of number of modes in the GMM.

wk(t) = [(1−Nkα)wk(t− Tw) +Nkα][1− (Tw −Nk)α]− TwαcT (A.12)

Ignoring higher powers of α in Eqn. A.12, we obtain Eqn. A.13 which is identical to the

update equation that was derived in Chapter 3.

wk(t) = (1− Twα)wk(t− Tw) +Nkα− TwαcT (A.13)
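As a rough numerical check of this approximation, the sketch below compares the per-frame rule (Eqns. A.5 & A.6) applied Tw times against a single application of Eqn. A.13. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the per-frame rule is assumed to subtract the αcT term once in every frame.

```python
def per_frame_update(w, matches, alpha, c_t=0.0):
    """Apply the per-frame weight rule once for every frame in the window.

    `matches` holds one boolean per frame: True if the pixel matched mode k.
    Subtracting alpha * c_t in every frame is an assumption of this sketch.
    """
    for matched in matches:
        w = (1.0 - alpha) * w + (alpha if matched else 0.0) - alpha * c_t
    return w

def batched_update(w, matches, alpha, c_t=0.0):
    """Apply Eqn. A.13 once for the whole window of T_w frames.

    Only the integer match count N_k is needed, so the order in which the
    matches arrived is ignored.
    """
    t_w = len(matches)
    n_k = sum(1 for m in matches if m)  # plays the role of weightCount_k
    return (1.0 - t_w * alpha) * w + n_k * alpha - t_w * alpha * c_t

# Illustrative values only.
matches = [True, False, True, True, False, True, True, True]
exact = per_frame_update(0.4, matches, alpha=0.002, c_t=0.05)
approx = batched_update(0.4, matches, alpha=0.002, c_t=0.05)
gap = abs(exact - approx)
```

The gap between the two results is of second order in Twα, so it stays small as long as the window length times the learning rate is much less than 1.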

Appendix B

Sampler design

Fig. B.1 shows the sampler architecture proposed in Chapter 4. The sparse and dense

samplers are simple systematic samplers [131, 132]. We first study general systematic

samplers used in the context of skip detection. We then use these results to further motivate

the architecture we proposed.

[Figure B.1 block labels: Sparse Sampler + BGS; Fsparse; morphological dilation using a 3x3 element; F’sal; Dense Sampler + BGS; IFG; erosion using a 2x1 element; Fsal; Image; BCurrent; Fprev = (Bprev)^C; 1 frame delay; Bprev; salient MB detection; BG MB detection; systematic samplers]

Figure B.1: GMM pixel level classifier + Sampler based Motion Detection (or ‘GMM S-MD’) flow chart

B.1 Analysis of a simple systematic sampler

For the sake of simplicity, let us consider a set of 1D pixels shown in Fig. B.2. The set of pixels

in the 1D image is divided into blocks of length B. The skip detector has to determine the set

of blocks that contain foreground objects. In the context of estimation, we have noted (in

Chapter 4) that sampled locations which are spread apart reduce the correlation between

the chosen data and hence reduce the variation of population averages. Natural populations

have highly correlated properties and hence uniform systematic sampling patterns which

spread out sample locations perform well. For the current task of skip detection, we now

show that such uniform samplers minimize the size of the largest object that is missed.

[Figure B.2 labels: stride length = dsys; object size = L; block size = B; block of pixels; sampled pixels]

Figure B.2: 1D sampler

B.1.1 Uniform versus Non Uniform sampling patterns

Let NB denote the total number of samples chosen in a block. For simplicity, let us assume

that the points at the boundaries of each block are always sampled and that the block length

is a multiple of the systematic sampler stride. Let dsys be the stride length of the uniform

systematic sampler.

B = NB dsys (B.1)

Let us now consider a non uniform sampling pattern. Let the inter pixel spacing of this

pattern be represented by di,i+1. Here di,i+1 is the distance between the ith pixel and its

next neighbour in raster scan order.


\[
B = \sum_{i=1}^{N_B - 1} d_{i,i+1} \tag{B.2}
\]

Let Dsys & DnonSys be the size of the largest object which can be missed by sampling

using the uniform systematic and non-uniform samplers respectively.

\begin{align}
D_{sys} &= d_{sys} - 2 \tag{B.3} \\
D_{nonSys} &= \max_{1 \le i \le N_B - 1} \left( d_{i,i+1} - 2 \right) \tag{B.4}
\end{align}

We need to minimize the maximum separation between two sampled points. We need

to show that DnonSys ≥ Dsys, or that

\[
\max_{1 \le i \le N_B - 1} \left( d_{i,i+1} \right) \ge d_{sys} \tag{B.5}
\]

We prove this by contradiction as follows. Assume that

\[
\max_{1 \le i \le N_B - 1} \left( d_{i,i+1} \right) < d_{sys} \tag{B.6}
\]

From Eqns. B.1 & B.6, we obtain

\begin{align}
\sum_{i=1}^{N_B - 1} d_{i,i+1} &< N_B\, d_{sys} \tag{B.7} \\
                               &< B \tag{B.8}
\end{align}

This is in contradiction with Eqn. B.2. Hence, uniform systematic samplers minimize

the size of the largest object that is missed.
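The result can also be checked by brute force on a small example. The sketch below enumerates every integer sampling pattern whose gaps sum to the block length and confirms that none has a smaller largest gap than the uniform pattern. It is an illustration under assumptions: the helper names are hypothetical, gaps are restricted to positive integers, and the uniform stride is assumed to divide the block length.

```python
def largest_missed_object(gaps):
    """Size of the largest object that fits between two sampled pixels (Eqn. B.4)."""
    return max(gaps) - 2

def compositions(total, parts):
    """Yield every way to write `total` as an ordered sum of `parts` positive integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

# Illustrative block: total span of 12 pixels split into 3 gaps.
span, num_gaps = 12, 3
uniform = (span // num_gaps,) * num_gaps
best_non_uniform = min(largest_missed_object(g)
                       for g in compositions(span, num_gaps))
uniform_is_optimal = best_non_uniform == largest_missed_object(uniform)
```

The enumeration covers all 55 patterns for this small case; for realistic block sizes the proof above is what guarantees the result.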


B.1.2 Uniform systematic sampler accuracy

Let hpix be the pixel level classifier used to determine presence of an object i.e. hpix = 1

if an object has been detected by the pixel and hpix = 0 otherwise. Let ypix denote the

pixel level, true class label. Similarly, let h & y be the classifier & true class label of a block

respectively. Let the false negative probability (or the ‘miss’ rate) of h be pmiss. Let the false

positive probability (or the ‘false alarm’ rate) of h be pFA.

Pixel level accuracy: In Chapter 4, the system noise (camera and environment noise)

is modelled using a Gaussian mixture distribution. The Mahalanobis distance between the

pixel value and the GMM mode is thresholded to classify the pixel as BG/FG (For simplicity,

we assume the background process to be unimodal). Let the true difference between the

foreground object and the background image be vdif . Here, we are assuming that the object

and the background are uniform. The imager output represented by vI is equal to the sum

of vdif and the system noise vnoise. vnoise is a Gaussian distributed random variable with

zero mean. We assume that the system noise is stationary and has been correctly estimated

by the GMM model using the EM algorithm. Let T be the threshold used by the GMM

algorithm to decide whether the pixel is FG/BG, i.e. a pixel is marked as FG if |vI| is greater

than Tσnoise. The ‘miss’ probability can be written as

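\begin{align}
p(h_{pix} = 0 \mid y_{pix} = 1) &= p\left( \left| \frac{v_I}{\sigma_{noise}} \right| < T \right) \tag{B.9} \\
 &= p\left( \left| \frac{v_{dif} + v_{noise}}{\sigma_{noise}} \right| < T \right) \tag{B.10} \\
 &= p\left( -T < \frac{v_{dif} + v_{noise}}{\sigma_{noise}} < T \right) \tag{B.11} \\
 &= p\left( -v - T < \frac{v_{noise}}{\sigma_{noise}} < -v + T \right) \tag{B.12} \\
 &= \Phi(-v + T) - \Phi(-v - T) \tag{B.13}
\end{align}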

Here, vdif has been normalized with respect to σnoise to obtain v. Φ() is the cumulative


distribution function of the standard normal distribution. Similarly, we can obtain the false

alarm probability as follows

\begin{align}
p(h_{pix} = 1 \mid y_{pix} = 0) &= p\left( \left| \frac{v_{noise}}{\sigma_{noise}} \right| > T \right) \tag{B.15} \\
 &= 2\,(1 - \Phi(T)) \tag{B.16}
\end{align}

By varying the threshold T , we can obtain the Receiver Operating Characteristic (ROC)

curve for the pixel level classifier hpix as shown in Fig. B.3.

[Figure B.3 plot: x-axis: False positive rate (0 to 1); y-axis: True positive rate (0 to 1); curves for v = 1, v = 2, v = 3, v = 4]

Figure B.3: ROC curve of the pixel level classifier for different values of v (or normalized
signal level)

From Fig. B.3, we can observe that the accuracy of the pixel level classifier reduces

drastically when the normalized signal level v is less than 2. Hence, in images with high

noise, i.e. high σnoise, the contrast between the foreground and the background, i.e. vdif


needs to be very high in order to obtain good pixel level classification accuracy.
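The pixel level miss and false alarm probabilities of Eqns. B.13 & B.16 can be evaluated directly, and sweeping T traces an ROC curve for a given v. The sketch below is a minimal illustration; the function names and the sweep range are assumptions made for this example.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pixel_miss_prob(v, T):
    """Miss probability of the pixel classifier, Eqn. B.13.

    v is the foreground/background difference normalized by sigma_noise,
    T is the classifier threshold.
    """
    return std_normal_cdf(-v + T) - std_normal_cdf(-v - T)

def pixel_false_alarm_prob(T):
    """False alarm probability of the pixel classifier, Eqn. B.16."""
    return 2.0 * (1.0 - std_normal_cdf(T))

def roc_points(v, thresholds):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    return [(pixel_false_alarm_prob(T), 1.0 - pixel_miss_prob(v, T))
            for T in thresholds]

# Sweep T from 0 to 5 in steps of 0.25 for a normalized signal level of 2.
curve_v2 = roc_points(2.0, [0.25 * i for i in range(21)])
```

At T = 0 every pixel is marked as FG, so both rates equal 1; raising T moves the operating point towards the origin of the ROC plot.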

Block level accuracy: In a uniform systematic grid, let the number of points sampled

inside the object be represented by N . Let the object size (in pixels) be denoted by L. N

can take either of the two values n1 or n2, based on the locations of the object relative to

the stride locations.

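\begin{align}
n_1 &= \left\lfloor \frac{L}{d_{sys}} \right\rfloor \tag{B.18} \\
n_2 &= \left\lfloor \frac{L}{d_{sys}} + 1 \right\rfloor \tag{B.19}
\end{align}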

The probability distribution of N is given by

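\[
p(N = n) =
\begin{cases}
\dfrac{d_{sys} - (L \,\%\, d_{sys})}{d_{sys}}, & \text{if } n = n_1 \\[2ex]
\dfrac{L \,\%\, d_{sys}}{d_{sys}}, & \text{if } n = n_2 \\[2ex]
0, & \text{otherwise}
\end{cases}
\tag{B.20}
\]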

Similarly, let the number of points sampled inside a block be represented by M . The

probability distribution of M is given by

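\[
p(M = m) =
\begin{cases}
\dfrac{d_{sys} - (B \,\%\, d_{sys})}{d_{sys}}, & \text{if } m = m_1 \\[2ex]
\dfrac{B \,\%\, d_{sys}}{d_{sys}}, & \text{if } m = m_2 \\[2ex]
0, & \text{otherwise}
\end{cases}
\tag{B.21}
\]

where m1 & m2 are

\begin{align}
m_1 &= \left\lfloor \frac{B}{d_{sys}} \right\rfloor \tag{B.22} \\
m_2 &= \left\lfloor \frac{B}{d_{sys}} + 1 \right\rfloor \tag{B.23}
\end{align}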

We can now compute the probability of the classifier ‘h’ missing a true FG block as

follows (we have assumed that the noise in different pixels is independent).

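\begin{align}
p_{miss} &= p(h = 0 \mid y = 1) \tag{B.24} \\
 &= \sum_{n=0}^{L} \left[ p(h_{pix} = 0 \mid y_{pix} = 1) \right]^{n} p(N = n) \tag{B.25} \\
 &= \left[ p(h_{pix} = 0 \mid y_{pix} = 1) \right]^{n_1} p(N = n_1) \tag{B.26} \\
 &\quad + \left[ p(h_{pix} = 0 \mid y_{pix} = 1) \right]^{n_2} p(N = n_2) \tag{B.27}
\end{align}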

The probability of the classifier ‘h’ marking a BG block as foreground is

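\begin{align}
p_{FA} &= p(h = 1 \mid y = 0) \tag{B.28} \\
 &= 1 - p(h = 0 \mid y = 0) \tag{B.29} \\
 &= 1 - \sum_{m=0}^{B} \left[ p(h_{pix} = 0 \mid y_{pix} = 0) \right]^{m} p(M = m) \tag{B.30} \\
 &= 1 - \left[ p(h_{pix} = 0 \mid y_{pix} = 0) \right]^{m_1} p(M = m_1) \tag{B.31} \\
 &\quad - \left[ p(h_{pix} = 0 \mid y_{pix} = 0) \right]^{m_2} p(M = m_2) \tag{B.32}
\end{align}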

The miss rate and the false alarm rate are plotted against the stride parameter dsys in

Fig. B.4. Similarly, we have also plotted the miss rate and the false alarm rate against the

pixel level classifier threshold T in Fig. B.5. From these experiments, we can observe that

increasing the stride or the threshold increases the miss rate of the skip detector. Setting


dsys or T to a very low value increases the false alarm rate. In the next section, we use these

observations to motivate the proposed multi stage FG MB detector.
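The block level trends can be reproduced with a short calculation that follows Eqns. B.18 to B.25 and B.30. The sketch below is illustrative only: the function names are hypothetical, and the block size of 16 pixels used in the example is an assumption, since the value of B used for the plots is not stated here.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def count_distribution(length, stride):
    """Distribution of the number of samples that land inside a span.

    Follows Eqns. B.18 to B.21: the count is floor(length / stride) or one
    more, with probabilities set by the remainder length % stride.
    """
    low = length // stride
    p_high = (length % stride) / stride
    return {low: 1.0 - p_high, low + 1: p_high}

def block_miss_prob(v, T, obj_len, stride):
    """Probability that every sample inside the object is missed (Eqn. B.25)."""
    p_pix_miss = std_normal_cdf(-v + T) - std_normal_cdf(-v - T)
    return sum(p_pix_miss ** n * p
               for n, p in count_distribution(obj_len, stride).items())

def block_false_alarm_prob(T, block_len, stride):
    """Probability that at least one sample in a BG block fires (Eqn. B.30)."""
    p_pix_quiet = 1.0 - 2.0 * (1.0 - std_normal_cdf(T))
    return 1.0 - sum(p_pix_quiet ** m * p
                     for m, p in count_distribution(block_len, stride).items())

# v = 3, L = 20 and T = 2.5 follow Fig. B.4; the block size of 16 is assumed.
miss_by_stride = {d: block_miss_prob(3.0, 2.5, 20, d) for d in (5, 10, 20, 40)}
fa_by_stride = {d: block_false_alarm_prob(2.5, 16, d) for d in (2, 4, 8, 16)}
```

When the stride exceeds the object length, the count can be zero, and the corresponding term contributes the full probability of the object falling between two samples.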

[Figure B.4 plot: x-axis: Stride (0 to 100); y-axis: rate (0 to 1); curves: Miss rate, False positive rate]

Figure B.4: Sampler accuracy for different values of stride length (v = 3, L = 20, T = 2.5)


[Figure B.5 plot: x-axis: T (1 to 5); y-axis: rate (0 to 1); curves: Miss rate, False positive rate]

Figure B.5: Sampler accuracy for different values of pixel level classifier threshold (v = 3,
L = 20, dsys = 10)

B.2 Analysis of the proposed sampler

Based on the analysis of the uniform systematic sampler in the previous section, we now

motivate the proposed sampler architecture shown in Fig. B.1. From Figs. B.4 &

B.5, we note that the tradeoff between the miss rate & the false alarm rate is determined

by the selection of the sampler stride & pixel classifier threshold parameters. Setting the

stride or threshold to a low value would reduce the miss rate but would also increase the

number of false alarms. This would result in an increase in the bitrate of the encoded video.

Reducing the stride value also increases the complexity of the detector since a larger number

of pixels will need to be classified.

To achieve high accuracy with low computational costs, we use a stratified adaptive

cluster sampling scheme. The sampler consists of two stages in which the stride of the

first stage is set to a relatively high value. Hence, a large number of BG MB’s would be

filtered out by the first stage. However, the increased stride setting would result in a few


missed FG blocks. To mitigate this issue, we stratify the image. Consider FG blocks and

their neighbours in the previous frame. Such blocks have a large probability to contain

foreground objects in the current frame. Neighbours of blocks which were detected as

FG in the current frame are also likely to contain foreground objects. Hence, such blocks

are sampled in the second stage with a low stride setting. Although the false alarm rate

of the second sampler is high, this does not affect the overall system performance. This is

because the first sampler would have already filtered a large fraction of the BG blocks.
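The two stage scheme can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation evaluated in this thesis: `is_fg_pixel` stands in for the pixel level classifier, a block is marked FG if any sampled pixel is FG, neighbours are taken to be the 8-connected blocks, and the block size and the two stride values are arbitrary. All function and parameter names are hypothetical.

```python
def block_has_fg(is_fg_pixel, bx, by, block, stride):
    """Mark a block as FG if any pixel sampled on a regular grid
    with the given stride is classified as foreground."""
    for y in range(by * block, (by + 1) * block, stride):
        for x in range(bx * block, (bx + 1) * block, stride):
            if is_fg_pixel(x, y):
                return True
    return False


def with_neighbours(blocks, nbx, nby):
    """Return the given blocks together with their 8-connected neighbours."""
    out = set()
    for bx, by in blocks:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if 0 <= bx + dx < nbx and 0 <= by + dy < nby:
                    out.add((bx + dx, by + dy))
    return out


def two_stage_detect(is_fg_pixel, nbx, nby, prev_fg, block=16, coarse=8, fine=2):
    """Return the set of (bx, by) blocks detected as FG in the current frame."""
    # Stage 1: sample every block with a high stride.
    stage1 = {(bx, by) for by in range(nby) for bx in range(nbx)
              if block_has_fg(is_fg_pixel, bx, by, block, coarse)}
    # Stratum of likely FG blocks: previous-frame FG blocks, current
    # stage-1 detections and the neighbours of both.
    likely = with_neighbours(set(prev_fg) | stage1, nbx, nby) - stage1
    # Stage 2: resample only the likely blocks with a low stride.
    stage2 = {b for b in likely
              if block_has_fg(is_fg_pixel, b[0], b[1], block, fine)}
    return stage1 | stage2


# Hypothetical 3x3 pixel object that falls between the coarse samples.
is_fg = lambda x, y: 17 <= x <= 19 and 17 <= y <= 19
print(two_stage_detect(is_fg, 4, 4, prev_fg=set()))     # set()
print(two_stage_detect(is_fg, 4, 4, prev_fg={(1, 1)}))  # {(1, 1)}
```

In the example, the small object is missed when no prior information is available, but it is recovered by the second stage when the block was FG in the previous frame. Only 9 of the 16 blocks are resampled at the low stride in that case.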

Appendix C

Bilattice logic based inference

In the influential paper [209] by Kripke, a partial truth assignment of atomic sentences is

performed. Kripke also introduced a method to extend partial truth assignments to non-

atomic formulas. A partial order on the valuations based on the information was also

defined. Fitting [210, 211] recognized that there are in fact two orderings, one involving

the information (described by Kripke) and the other for the truth. In the strong 3-Valued

logic (T, F,N) (where the third state N is considered as undefined) introduced by Kleene,

the ∧ & ∨ operators are definable using the ordering involving truth. As we show later in

this section, this idea of ordering involving truth and information is central to performing

inference using Bilattice logic.

Belnap in [212, 213] extended Kleene’s 3-Valued logic and introduced a four value logic

system. Ofer Arieli et al. argue that four valued logic is useful in inference problems. How-

ever, in the present context of applying multi valued logics to inference, the next important

development came when Ginsberg in [202] introduced the concept of Bilattices. Bilattices,

which are algebraic structures, generalize Belnap’s four valued logic. In fact, the lattice

formed by Belnap’s four valued logic is isomorphic to the simplest Bilattice. The theory of

applying bilattices to inference has been described in detail in [202, 173, 214]. Shet [215] in

his PhD thesis proposed to use Bilattice based reasoning to perform pedestrian detection,

aerial object detection and identity maintenance tasks. We sketch only the most important

parts of this inference technique and refer the reader to the original references for more

details.

Definition C.1. A poset (or partially ordered set) is an ordered pair P = (X,≤) where

X is a set and ≤ is a partial order (i.e. a binary relation which is reflexive, transitive &

antisymmetric).


Definition C.2. A lattice L is a poset in which every pair of elements x, y ∈ L has

• a least upper bound x ∨ y ∈ L (called join)

• a greatest lower bound x ∧ y ∈ L (called meet)

that is,

• x ∨ y ≤ z ⇐⇒ x ≤ z and y ≤ z

• z ≤ x ∧ y ⇐⇒ z ≤ x and z ≤ y

Here, ‘least upper bound’ and ‘greatest lower bound’ are idempotent, commutative and

associative binary operations.

Definition C.3. A lattice L is said to be complete iff there exists a unique lub and glb for

every nonempty subset M of L.

Definition C.4. A bilattice is a quadruple B = (B,≤t,≤k,¬) where B is a non empty set, ≤t

& ≤k are partial orderings and ¬ is a mapping from B to itself, such that:

• (B,≤t), (B,≤k) are complete lattices

• x ≤t y =⇒ ¬y ≤t ¬x

• x ≤k y =⇒ ¬x ≤k ¬y

• ¬¬x = x.

Before we provide a formal description of using bilattices for inference, we will first

provide some intuition. Fig. C.1 shows double Hasse diagrams of two valued, four valued

and continuous square bilattices. The y axis value represents the information available

about the formula and the x axis value represents the belief (i.e. whether the wff is true or

false). The two partial orders of the bilattice B, defined on elements in the set B, can be

interpreted as follows:


• The ≤t is a partial order based on belief, i.e. wff’s which have a larger probability of

being true are placed higher in the ordering.

• ≤k is a partial ordering on information content, i.e. logic values which contain more

information are placed higher in the ordering.

The logic represented by the trivial two valued bilattice shown in Fig. C.1a is identical to

that used in classical propositional calculus. Fig. C.1b shows Belnap’s four valued logic

bilattice, which adds ⊥ & ⊤ to the classical two valued logic. Here, ⊥ represents the truth

value of wff’s about which we do not have any information and ⊤ represents contradiction

in data. Although the four valued Belnap’s logic accommodates contradictory sources of

data, it does not allow representation of continuous uncertainty values or probabilities.

The continuous square bilattice [202] solves this problem by extending Belnap’s four

valued logic to include continuous truth value assignments. Each wff is assigned a truth

value equal to 〈p, q〉 where p, q ∈ [0, 1], i.e. the elements of the set B of the bilattice are

ordered pairs. Here p represents the probability or confidence that the wff is true and

q represents the probability that it is false. Logic values are assigned to detector scores

and inference rules. Since detector scores and rule weights are normalized to 1, p & q are

allowed to take values in the range [0, 1]. One important point to note here is that no logical

consistency is imposed on the logical value 〈p, q〉, i.e. q need not be equal to 1− p. Hence, q

here represents the evidence against the logical statement and not merely the lack of belief

of a proposition. The square bilattice is shown in Fig. C.1c. Here, ⊥ = 〈0, 0〉 represents no

information about the wff and ⊤ = 〈1, 1〉 represents contradiction in the data (i.e. some

data suggests the proposition to be true and the rest claims it to be false).

Each detector score (e.g. head part filter score, torso part filter score, super pixel skin

score) is mapped to a point in the bilattice. For example, head(X,Y, S) used to represent the

detection score of the head part filter (at location X, Y and scale S) takes a value 〈0.8, 0.2〉

for the pedestrian image in Fig. C.1c. Likewise, logical weights are assigned to rules that

are used to perform reasoning. For example, φ(pedestrian(X,Y, S) ← head(X,Y, S)) rep-

resents the sentence: ‘detection of a head part at location X, Y and scale S indicates the


existence of a pedestrian’. The weight for this rule is assigned a value 〈0.7, 0.3〉. As we show

later in this section, all these detector scores and reasoning rules are combined to obtain

the final inferred truth values of an image region. Fig. C.1c shows representative scores of

pedestrian and non pedestrian image regions marked inside the bilattice. We observe that

the logical assignment to the pedestrian image is closer to the ‘true’ (i.e. 〈1, 0〉) value. In

contrast, the assignment to the background image is closer to the ‘false’ (i.e. 〈0, 1〉) value.

As already discussed for the case of a regular bilattice, the x & y axes values impose an

ordering of the wff’s. As an example, we can write 〈0.8, 0.2〉 <k 〈0.87, 0.27〉 and 〈0.8, 0.2〉 <t

〈0.87, 0.13〉. This ordering is illustrated in Fig. C.2. Here v is the value of a wff. The shaded

rectangle in Fig. C.2b contains the logic values which are placed lower than v by the ‘information’

order. Similarly, the shaded rectangle in Fig. C.2a contains the logic values which are placed lower

than v by the ‘belief’ order.

As we have already seen from Defn. C.4, the negation operator flips the logic elements

around the truth axis without altering the information ordering, i.e. ¬〈p, q〉 = 〈q, p〉. Along

with the negation operator, the bilattice is also associated with another operator called

‘conflation’ which is denoted by −. The conflation operator flips the logic elements around

the information (or k) axis without altering the belief ordering.

We now discuss the formal construction of the continuous valued square bilattice used

to perform reasoning.

Definition C.5. A square bilattice is a quadruple L² = (L × L,≤t,≤k,¬) where for every

〈p1, q1〉, 〈p2, q2〉 in L²,

• L = 〈L,≤L〉 is a complete lattice

• ¬〈p1, q1〉 = 〈q1, p1〉

• 〈p1, q1〉 ≤t 〈p2, q2〉 ⇐⇒ p1 ≤L p2 and q2 ≤L q1

• 〈p1, q1〉 ≤k 〈p2, q2〉 ⇐⇒ p1 ≤L p2 and q1 ≤L q2
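The two orderings and the negation operator of Defn. C.5 can be checked on the example values quoted earlier. The sketch below assumes L = [0, 1] with the usual ordering of real numbers. The class and function names are illustrative.

```python
from typing import NamedTuple


class Value(NamedTuple):
    """A logic value <p, q> of the square bilattice on [0, 1]."""
    p: float  # evidence for the wff
    q: float  # evidence against the wff


def leq_t(a, b):
    """Belief ordering: a <=_t b iff a.p <= b.p and b.q <= a.q."""
    return a.p <= b.p and b.q <= a.q


def leq_k(a, b):
    """Information ordering: a <=_k b iff a.p <= b.p and a.q <= b.q."""
    return a.p <= b.p and a.q <= b.q


def neg(a):
    """Negation swaps the evidence for and against: not <p, q> = <q, p>."""
    return Value(a.q, a.p)


v = Value(0.8, 0.2)
print(leq_k(v, Value(0.87, 0.27)))   # True
print(leq_t(v, Value(0.87, 0.13)))   # True
# The orders are partial: v and <0.87, 0.27> are not comparable under <=_t.
print(leq_t(v, Value(0.87, 0.27)), leq_t(Value(0.87, 0.27), v))  # False False
```

The last line illustrates why the orderings are partial rather than total: a value with more evidence both for and against a wff carries more information, but is neither more nor less true.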


Fig. C.3 shows the construction of the square bilattice that we use for reasoning under uncertainty. The square bilattice L² = {[0, 1] × [0, 1],≤t,≤k} can be decomposed into two lattices: (A) lattice L1 = {[0, 1] × [0, 1],≤t} based on the belief ordering and (B) lattice L2 = {[0, 1] × [0, 1],≤k} based on the information content ordering. The lub & glb operators for the lattice L1, based on the belief ordering, are denoted by ∨ (disjunction) & ∧ (conjunction) respectively. The lub & glb operators for the lattice L2 (based on the information ordering) are represented by ⊕ & ⊗ respectively. The ⊗ is also called the consensus operator. Informally, we can consider p ⊗ q to be the extent to which p and q agree. Likewise, the ⊕ is called the gullibility operator. Again, informally, it represents the operator that combines any information from different sources.

Definition C.6. The lub & glb operators along the belief axis (∨, ∧) and the lub & glb operators along the information axis (⊕, ⊗) are defined as:

• 〈p1, q1〉 ∧ 〈p2, q2〉 = 〈p1 ∧L p2, q1 ∨L q2〉

• 〈p1, q1〉 ∨ 〈p2, q2〉 = 〈p1 ∨L p2, q1 ∧L q2〉

• 〈p1, q1〉 ⊗ 〈p2, q2〉 = 〈p1 ∧L p2, q1 ∧L q2〉

• 〈p1, q1〉 ⊕ 〈p2, q2〉 = 〈p1 ∨L p2, q1 ∨L q2〉

In Defn. C.6, the lub and glb of the square bilattice have been defined based on the glb and lub operators of L. We now define the lub and glb operators for the lattice L to complete the construction of the square bilattice. As noted by Shet et al. in [173], triangular norm and conorm functions introduced by Schweizer et al. [216] are popular for reasoning in many-valued logics. Shet et al. adopted these to construct the lub and glb operators for the lattice L. T-norms have also been used for rule weighting in fuzzy rule based methods.

Definition C.7. A function T : [0, 1] × [0, 1] → [0, 1] is a t-norm if it satisfies the following

properties:

• Commutativity: T (a, b) = T (b, a)

• Monotonicity: T (a, b) ≤ T (c, d) if a ≤ c and b ≤ d


• Associativity: T (a, T (b, c)) = T (T (a, b), c)

• Identity element: The number 1 is the identity element i.e. T (a, 1) = a

Definition C.8. A function S : [0, 1]× [0, 1]→ [0, 1] is a t-conorm if it satisfies the following

properties:

• Commutativity: S(a, b) = S(b, a)

• Monotonicity: S(a, b) ≤ S(c, d) if a ≤ c and b ≤ d

• Associativity: S(a,S(b, c)) = S(S(a, b), c)

• Identity element: The number 0 is the identity element i.e. S(a, 0) = a

Following [173], we use T (a, b) = ab & S(a, b) = a + b − ab as the glb & lub operators

for the lattice L respectively. With this, we have now completely specified the algebraic

structure of the square bilattice that we will use to perform inference.
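The complete construction can be sketched in a few lines. The code below checks the properties of Defn. C.7 and C.8 for T(a, b) = ab and S(a, b) = a + b − ab on a small grid of exact rational numbers, and then builds the four operators of Defn. C.6 from them. The names are illustrative, and combining the head score 〈0.8, 0.2〉 with the rule weight 〈0.7, 0.3〉 by a conjunction is shown only as an example of applying the operators.

```python
from fractions import Fraction
from itertools import product
from typing import NamedTuple


def t_norm(a, b):
    """Product t-norm, used as the glb of the lattice L."""
    return a * b


def t_conorm(a, b):
    """Probabilistic sum t-conorm, used as the lub of the lattice L."""
    return a + b - a * b


def satisfies_axioms(op, identity, grid):
    """Check commutativity, associativity, identity and monotonicity on a grid."""
    for a, b, c in product(grid, repeat=3):
        if op(a, b) != op(b, a) or op(a, op(b, c)) != op(op(a, b), c):
            return False
        if op(a, identity) != a:
            return False
    for a, b, c, d in product(grid, repeat=4):
        if a <= c and b <= d and op(a, b) > op(c, d):
            return False
    return True


class Value(NamedTuple):
    p: float  # evidence for
    q: float  # evidence against


def conj(x, y):         # glb along the belief axis
    return Value(t_norm(x.p, y.p), t_conorm(x.q, y.q))


def disj(x, y):         # lub along the belief axis
    return Value(t_conorm(x.p, y.p), t_norm(x.q, y.q))


def consensus(x, y):    # glb along the information axis
    return Value(t_norm(x.p, y.p), t_norm(x.q, y.q))


def gullibility(x, y):  # lub along the information axis
    return Value(t_conorm(x.p, y.p), t_conorm(x.q, y.q))


GRID = [Fraction(i, 4) for i in range(5)]  # exact arithmetic avoids rounding issues
print(satisfies_axioms(t_norm, 1, GRID), satisfies_axioms(t_conorm, 0, GRID))

head, rule = Value(0.8, 0.2), Value(0.7, 0.3)
print(conj(head, rule))  # approximately <0.56, 0.44>
```

The grid check is only a sanity test on a few sample points and does not replace a proof of the properties.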

We now show the application of the bilattice formulation to reasoning under uncertainty.

The input to the inference algorithm is a set of facts, which in our case are detector scores

(e.g. head shoulder part filter score, texture based shadow super pixel score). The query is

a wff that represents the label assigned to a bounding box or super pixel (e.g. presence of

a pedestrian in a bounding box or whether a super pixel is a shadow region). All the scores

are represented as logical values in the square bilattice.

Definition C.9. Let L be the formal language where the inference is carried out. Truth

assignment is a function that assigns some truth value to each wff in L. More formally, a

truth assignment is a function φ : L→ B where B is a Bilattice on truth values.

Definition C.10. Let KB = {s1, s2, . . . , sM} be the set of sentences in the knowledge base. Let φ be

a truth assignment that labels sentences i.e. if si is a sentence, φ(si) is the truth value of si

assigned by φ. Using this truth data, we can obtain information of other sentences logically


related to si. This logical consequence, also called entailment, is denoted by |=. The closure

cl(φ) is the truth assignment that labels sentences entailed by the knowledge base.

Let su be the sentence that has to be inferred from the knowledge base. Borrowing

notations from [173, 202], we write S as the subset of L from which it is possible to derive

su, i.e. S is a set of sentences that entail su. The conjunction of sentences in S is:

\begin{equation}
\bigwedge_{s \in S} cl(\phi)(s) \tag{C.1}
\end{equation}

There could be multiple such sets that entail su. For example, the presence of a pedes-

trian can be indicated by a high head-shoulder part filter score and a high skin super pixel

score. Let π+(su) denote the collection of such subsets of L that entail su. Similarly, let

π−(su) denote the collection of subsets of L that entail ¬su. Ginsberg [202] showed that

the closure can be written as follows:

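\begin{equation}
cl(\phi)(s_u) = \bigoplus_{S \in \pi^{+}(s_u)} \bot \vee \Big[ \bigwedge_{s \in S} cl(\phi)(s) \Big] \;\oplus\; \bigoplus_{S \in \pi^{-}(s_u)} \bot \wedge \Big[ \neg \bigwedge_{s \in S} cl(\phi)(s) \Big] \tag{C.2}
\end{equation}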

Using De Morgan’s law for bilattices [202], i.e. ¬(a ∧ b) = (¬a) ∨ (¬b), and since ¬⊥ = ⊥ and ¬(a ⊕ b) = (¬a) ⊕ (¬b), we can rewrite Eqn. C.2 as:

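\begin{equation}
cl(\phi)(s_u) = \bigoplus_{S \in \pi^{+}(s_u)} \bot \vee \Big[ \bigwedge_{s \in S} cl(\phi)(s) \Big] \;\oplus\; \neg \bigoplus_{S \in \pi^{-}(s_u)} \bot \vee \Big[ \bigwedge_{s \in S} cl(\phi)(s) \Big] \tag{C.3}
\end{equation}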

Eqn. C.3 is a disjunction of conjunctions of bilattice values. The closure operation

explicitly uses logic terms that entail su and ¬su in separate conjunction terms. This allows us

to ‘only accept’ or ‘only reject’ hypotheses based on certain facts. For example, inconsistent

geometry can cause a rejection of a pedestrian hypothesis. However, if the geometry is

consistent, it does not increase the truth value of the pedestrian detection. The formulation

we have described to determine closure is identical to that used by Shet in [173]. Reasoning


using other bilattice structures (e.g. bilattice for default logic) can be found in Shet’s thesis

[215].
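The closure in Eqn. C.3 can be evaluated directly on the square bilattice. The sketch below is a minimal illustration that assumes ⊥ = 〈0, 0〉, the product t-norm and probabilistic sum t-conorm for the lattice L, and that every entailing set S is given as a list of already inferred values (here a detector score together with the weight of the rule that uses it). The head score and its rule weight are the example values used earlier in this appendix; the geometry score and its rule weight are hypothetical.

```python
from functools import reduce
from typing import NamedTuple


class Value(NamedTuple):
    p: float  # evidence for
    q: float  # evidence against


BOTTOM = Value(0.0, 0.0)  # no information
TRUE = Value(1.0, 0.0)    # identity element of the conjunction


def t_norm(a, b):
    return a * b


def t_conorm(a, b):
    return a + b - a * b


def conj(x, y):
    return Value(t_norm(x.p, y.p), t_conorm(x.q, y.q))


def disj(x, y):
    return Value(t_conorm(x.p, y.p), t_norm(x.q, y.q))


def oplus(x, y):
    return Value(t_conorm(x.p, y.p), t_conorm(x.q, y.q))


def neg(x):
    return Value(x.q, x.p)


def support(entailing_sets):
    """Combine, over all sets S, the terms (bottom OR conjunction of S) with the oplus operator."""
    total = BOTTOM
    for s in entailing_sets:
        total = oplus(total, disj(BOTTOM, reduce(conj, s, TRUE)))
    return total


def closure(pos_sets, neg_sets):
    """Eqn. C.3: support for s_u combined with the negated support for its negation."""
    return oplus(support(pos_sets), neg(support(neg_sets)))


head, head_rule = Value(0.8, 0.2), Value(0.7, 0.3)
bad_geometry, geometry_rule = Value(0.9, 0.1), Value(0.6, 0.4)  # hypothetical values

print(closure([[head, head_rule]], []))
print(closure([[head, head_rule]], [[bad_geometry, geometry_rule]]))
```

With these operators, ⊥ ∨ 〈p, q〉 = 〈p, 0〉, so the sets in π+(su) contribute only to the evidence for su and the sets in π−(su) contribute only to the evidence against it. This is the ‘only accept’ or ‘only reject’ behaviour described above.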


[Diagrams: (a) Two valued logic bilattice, (b) Four valued Belnap’s bilattice, (c) Square Bilattice with corners <0,0>, <0,1>, <1,0> and <1,1>. The horizontal axis is the belief axis (≤t) and the vertical axis is the information axis (≤k).]

Figure C.1: Double Hasse diagrams of different bilattices. In (c), a surveillance video frame is shown. Also, the logic values of pedestrian and non pedestrian image regions are shown in the double Hasse diagram.


[Diagrams: (a) Ordering based on belief, (b) Ordering based on information. Each panel shows the square bilattice with corners <0,0>, <0,1>, <1,0> and <1,1>, a logic value v and a shaded rectangle.]

Figure C.2: Double Hasse diagrams show partial ordering based on belief and information in bilattices


[Diagram: the lattice { [0,1], ≤L }, with glb x ∧L y = xy and lub x ∨L y = x + y - xy, is used to build the square bilattice { [0,1]*[0,1], ≤t, ≤k }. The square bilattice decomposes into the underlying lattices { [0,1]*[0,1], ≤t }, with glb ∧ and lub ∨, and { [0,1]*[0,1], ≤k }, with glb ⊗ and lub ⊕. The orders and operators are those given in Defn. C.5 and Defn. C.6.]

Figure C.3: Construction of the square bilattice

Bibliography

[1] “Planning, design, installation and operation of CCTV surveillance systems: code of

practice and associated guidance,” British Security Industry Association, 2014.

[2] D.-S. Lee, “Effective gaussian mixture learning for video background subtraction,”

IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, pp. 827–832, 2005.

[3] Z. Zivkovic, “Improved adaptive gaussian mixture model for background subtrac-

tion,” Int’l Conf. on Pattern Recognition, vol. 2, pp. 28–31, 2004.

[4] “https://www.mobotix.com/eng au/support/planning-tools/mx-planning-tool-

optics,” MX Planning Tool Optics.

[5] “http://resource.boschsecurity.com/documents/nbn 80122 data sheet enus

14878683787.pdf,” DINION IP ultra 8000 MP datasheet.

[6] “http://resource.boschsecurity.com/documents/npc 2000 data sheet enus

11392811915.pdf,” TINYON IP 2000.

[7] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd ed. Wiley

Publishing, 2010.

[8] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian, “Statistical modeling of complex back-

grounds for foreground object detection,” IEEE Trans. on Image Processing, vol. 13,

no. 11, pp. 1459–1472, 2004.


[9] “http://imagelab.ing.unimore.it/vssn06/.”

[10] E. Martinec, “http://www.vuezone.com,” Noise, dynamic range and bit depth in digi-

tal SLRs, 2008.

[11] x264 encoder software. [Online]. Available:

http://www.videolan.org/developers/x264.html

[12] “http://www.proxicast.com/security/security-video.htm,” LAN-Cell 3G/4G Cellular

Router for Video Surveillance.

[13] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-

generation Multimedia. New York, NY, USA: John Wiley & Sons, Inc., 2003.

[14] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE

Trans. on Information Theory, vol. 19, no. 4, pp. 471–480, 1973.

[15] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side infor-

mation at the decoder,” IEEE Trans. on Information Theory, vol. 22, no. 1, pp. 1–10,

1976.

[16] L. Liu, Z. Li, and E. Delp, “Efficient and low-complexity surveillance video compres-

sion using backward-channel aware wyner-ziv video coding,” IEEE Trans. Circuits

Syst. Video Technol., vol. 19, no. 4, pp. 453 –465, April 2009.

[17] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,”

Proceedings of the IEEE, vol. 93, no. 1, pp. 71–83, 2005.

[18] R. Puri, A. Majumdar, and K. Ramchandran, “Prism: A video coding paradigm with

motion estimation at the decoder,” Image Processing, IEEE Transactions on, vol. 16,

no. 10, pp. 2436–2448, Oct 2007.

[19] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the h.264/avc

video coding standard,” IEEE Transactions on Circuits and Systems for Video Technol-

ogy, vol. 13, no. 7, pp. 560–576, July 2003.


[20] H.264 Advanced video coding for generic audiovisual services. [Online]. Available:

http://www.itu.int/rec/T-REC-H.264

[21] Y. Lee, J. Kim, and C.-M. Kyung, “Energy-aware video encoding for image quality

improvement in battery-operated surveillance camera,” IEEE Trans. Very Large Scale

Integr. (VLSI) Syst., vol. 20, no. 2, pp. 310 –318, Feb. 2012.

[22] A. Vetro, T. Haga, K. Sumi, and H. Sun, “Object-based coding for long-term archive

of surveillance video,” IEEE Int. Conf. on Multimedia and Expo, vol. 2, pp. 417–420,

2003.

[23] S.-Y. Chien, S.-Y. Ma, and L.-G. Chen, “Efficient moving object segmentation algo-

rithm using background registration technique,” IEEE Trans. Circuits Syst. Video Tech-

nol., vol. 12, no. 7, pp. 577 –586, Jul 2002.

[24] X. Jin and S. Goto, “Encoder adaptable difference detection for low power video

compression in surveillance system,” Image Commun., vol. 26, no. 3, pp. 130–142,

Mar. 2011.

[25] Z. He and D. Wu, “Resource allocation and performance analysis of wireless video

sensors,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16,

no. 5, pp. 590–599, May 2006.

[26] L.-T. Cheok and N. Gagvani, “Analytics-modulated coding of surveillance video,” in

Multimedia and Expo (ICME), 2010 IEEE International Conference on, July 2010, pp.

127–132.

[27] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time

tracking,” IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, p. 2246,

1999.

[28] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts,

and shadows in video streams,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25,

no. 10, pp. 1337–1342, 2003.


[29] K. Kim, T. Chalidabhongse, D. Harwood, and L. Davis, “Background modeling and

subtraction by codebook construction,” in Image Processing, 2004. ICIP ’04. 2004

International Conference on, vol. 5, Oct 2004, pp. 3061–3064 Vol. 5.

[30] Y. Benezeth, P. M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, “Comparative

study of background subtraction algorithms,” J. Elec. Imaging, vol. 19, no. 3, 2010.

[31] G. Guo and C. Dyer, “Patch-based image correlation with rapid filtering,” IEEE Conf.

on Comput. Vision and Pattern Recognition, 2007. CVPR ’07., pp. 1–6, 2007.

[32] Y. Yu and D. Doermann, “Model of object-based coding for surveillance video,” in

Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE Inter-

national Conference on, vol. 2, March 2005, pp. 693–696.

[33] S.-C. Hsia, C. H. Hsiao, and C.-Y. Huang, “Single-object-based segmentation and cod-

ing technique for video surveillance system,” Journal of Electronic Imaging, vol. 18,

no. 3, pp. 033 007–033 007–10, 2009.

[34] R. Venkatesh Babu and A. Makur, “Object-based surveillance video compression us-

ing foreground motion compensation,” in Control, Automation, Robotics and Vision,

2006. ICARCV ’06. 9th International Conference on, Dec 2006, pp. 1–6.

[35] H. Song and C.-C. Kuo, “A region-based h.263+ codec and its rate control for low

vbr video,” Multimedia, IEEE Transactions on, vol. 6, no. 3, pp. 489–500, June 2004.

[36] P. Baccichet, X. Zhu, and B. Girod, “Network-aware h.264/avc region-of-interest cod-

ing for a multi-camera wireless surveillance network,” in Picture Coding Symposium,

2006.

[37] C.-Y. Wu and P.-C. Su, “A region of interest rate-control scheme for encoding traffic

surveillance videos,” in Intelligent Information Hiding and Multimedia Signal Process-

ing, 2009. IIH-MSP ’09. Fifth International Conference on, Sept 2009, pp. 194–197.


[38] Y. Liu, Z. Li, Y. Soh, and M. Loke, “Conversational video communication of h.264/avc

with region-of-interest concern,” in Image Processing, 2006 IEEE International Con-

ference on, Oct 2006, pp. 3129–3132.

[39] T. Thomas, S. Emmanuel, P. Zhang, and M. Kankanhalli, “An authentication mecha-

nism using chinese remainder theorem for efficient surveillance video transmission,”

in Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE Interna-

tional Conference on, 2010, pp. 567–573.

[40] C. S. Kannangara, I. E. G. Richardson, M. Bystrom, J. R. Solera, Y. Zhao, A. Maclen-

nan, and R. Cooney, “Low-complexity skip prediction for H.264 through Lagrangian

cost estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 202–208,

2006.

[41] H. Zeng, C. Cai, and K.-K. Ma, “Fast mode decision for H.264/AVC based on mac-

roblock motion activity,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 4, pp.

491–499, 2009.

[42] A. Saha, K. Mallick, J. Mukherjee, and S. Sural, “Skip prediction for fast rate dis-

tortion optimization in H.264,” IEEE Trans. Consum. Electron., vol. 53, no. 3, pp.

1153–1160, Aug 2007.

[43] A. Kannur and B. Li, “An enhanced rate control scheme with motion assisted slice

grouping for low bit rate coding in h.264,” in Image Processing, 2008. ICIP 2008.

15th IEEE International Conference on, Oct 2008, pp. 2100–2103.

[44] H. Li, Z. Wang, H. Cui, and K. Tang, “An improved roi-based rate control algorithm

for h.264/avc,” in Signal Processing, 2006 8th International Conference on, vol. 2,

2006.

[45] X. Zhang, L. Liang, Q. Huang, Y. Liu, T. Huang, and W. Gao, “An efficient coding

scheme for surveillance videos captured by stationary cameras,” Visual Communica-

tions and Image Processing, 2010.


[46] X. Zhang, T. Huang, Y. Tian, and W. Gao, “Background-modeling-based adaptive

prediction for surveillance video coding,” Image Processing, IEEE Transactions on,

vol. 23, no. 2, pp. 769–784, Feb 2014.

[47] M. Paul, W. Lin, C.-T. Lau, and B.-S. Lee, “Explore and model better i-frames for video

coding,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 21,

no. 9, pp. 1242–1254, Sept 2011.

[48] M. Paul, W. Lin, C. Lau, and B.-S. Lee, “Video coding using the most common frame

in scene,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International

Conference on, March 2010, pp. 734–737.

[49] T. Totozafiny, O. Patrouix, F. Luthon, and J.-M. Coutellier, “Dynamic background seg-

mentation for remote reference image updating within motion detection jpeg2000,”

in Industrial Electronics, 2006 IEEE International Symposium on, vol. 1, July 2006,

pp. 505–510.

[50] S. Han, X. Zhang, Y. Tian, and T. Huang, “An efficient background reconstruction

based coding method for surveillance videos captured by moving camera,” in Ad-

vanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International

Conference on, Sept 2012, pp. 160–165.

[51] “http://www.onvif.org/,” ONVIF standards.

[52] “http://www.onvif.org/,” PSIA: Physical Security Interoperability Alliance.

[53] V. Chellappa, P. Cosman, and G. Voelker, “Dual frame motion compensation with

uneven quality assignment,” in Proc. Data Compression Conference, DCC 2004, pp.

262–271.

[54] M. Tiwari and P. Cosman, “Selection of long-term reference frames in dual-frame

video coding using simulated annealing,” IEEE Signal Process. Lett., vol. 15, pp. 249–

252, 2008.


[55] D. Liu, D. Zhao, X. Ji, and W. Gao, “Dual frame motion compensation with optimal

long-term reference frame selection and bit allocation,” IEEE Trans. Circuits Syst.

Video Technol., vol. 20, no. 3, pp. 325 –339, March 2010.

[56] B. Li, J. Xu, H. Li, and F. Wu, “Optimized reference frame selection for video coding

by cloud.” in IEEE Int. Workshop on Multimedia Signal Process. (MMSP). IEEE, 2011,

pp. 1–5.

[57] H. Li, B. Li, and J. Xu, “Rate-distortion optimized reference picture management for

high efficiency video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12,

pp. 1844–1857, 2012.

[58] X. Zhang, Y. Tian, T. Huang, S. Dong, and W. Gao, “Optimizing the hierarchical pre-

diction and coding in hevc for surveillance and conference videos with background

modeling,” Image Processing, IEEE Transactions on, vol. 23, no. 10, pp. 4511–4526,

Oct 2014.

[59] D. Grois and O. Hadar, Recent Advances on Video Coding, D. J. D. S. Lorente, Ed.

InTech, 2011.

[60] I. Fernandez, P. Rondao Alface, T. Gan, R. Lauwereins, and C. De Vleeschouwer,

“Integrated h.264 region-of-interest detection, tracking and compression for surveil-

lance scenes,” in Packet Video Workshop (PV), 2010 18th International, Dec 2010, pp.

17–24.

[61] N. Doulamis, A. Doulamis, D. Kalogeras, and S. Kollias, “Low bit-rate coding of image

sequences using adaptive regions of interest,” Circuits and Systems for Video Technol-

ogy, IEEE Transactions on, vol. 8, no. 8, pp. 928–934, Dec 1998.

[62] Z. Bojkovic and D. Milovanovic, “Multimedia coding using adaptive regions of inter-

est,” in Neural Network Applications in Electrical Engineering, 2004. NEUREL 2004.

2004 7th Seminar on, Sept 2004, pp. 67–71.


[63] C. Bulla, A. Steiger, and P. Hosten, “Realtime object detection & tracking for roi

encoding,” in International Workshop on Acoustic Signal Enhancement IWAENC’12,

Aachen, Germany, Sep. 2012.

[64] C. Bulla, C. Feldmann, and M. Schink, “Region of interest encoding in video confer-

ence systems,” in Proc. of International Conference on Advances in Multimedia MME-

DIA’13, Venice, Italy, Apr. 2013, pp. 119–124.

[65] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vision,

vol. 57, no. 2, pp. 137–154, May 2004.

[66] M.-C. Chi, M.-J. Chen, and C.-T. Hsu, “Region-of-interest video coding by fuzzy con-

trol for h.263+ standard,” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of

the 2004 International Symposium on, vol. 2, May 2004, pp. II–93–6 Vol.2.

[67] Y. Liu, Z. G. Li, and Y. C. Soh, “Region-of-interest based resource allocation for con-

versational video communication of h.264/avc,” Circuits and Systems for Video Tech-

nology, IEEE Transactions on, vol. 18, no. 1, pp. 134–139, Jan 2008.

[68] S.-F. Huang, M.-J. Chen, K.-H. Tai, and M.-S. Li, “Region-of-interest determination

and bit-rate conversion for h.264 video transcoding,” EURASIP Journal on

Advances in Signal Processing, vol. 2013, no. 1, 2013. [Online]. Available:

http://dx.doi.org/10.1186/1687-6180-2013-112

[69] D. Chai and K. Ngan, “Face segmentation using skin-color map in videophone appli-

cations,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 9, no. 4,

pp. 551–564, Jun 1999.

[70] D. Wu, S. Ci, H. Luo, Y. Ye, and H. Wang, “Video surveillance over wireless sensor

and actuator networks using active cameras,” Automatic Control, IEEE Transactions

on, vol. 56, no. 10, pp. 2467–2472, Oct 2011.


[71] H. Cheng and J. Wus, “Adaptive region of interest estimation for aerial surveillance

video,” in Image Processing, 2005. ICIP 2005. IEEE International Conference on, vol. 3,

Sept 2005, pp. III–860–3.

[72] H. Meuel, M. Munderloh, and J. Ostermann, “Low bit rate roi based video coding for

hdtv aerial surveillance video sequences,” in Computer Vision and Pattern Recognition

Workshops (CVPRW), 2011 IEEE Computer Society Conference on, June 2011, pp. 13–

20.

[73] M. M. Holger Meuel, Julia Schmidt and J. Ostermann, Advanced Video Coding for

Next-Generation Multimedia Services, P. Y.-S. Ho, Ed. InTech, 2013.

[74] A. Mavlankar and B. Girod, “Video streaming with interactive pan/tilt/zoom,”

in High-Quality Visual Experience, ser. Signals and Communication Technology,

M. Mrak, M. Grgic, and M. Kunt, Eds. Springer Berlin Heidelberg, 2010, pp. 431–

455.

[75] ——, “Spatial-random-access-enabled video coding for interactive virtual

pan/tilt/zoom functionality,” Circuits and Systems for Video Technology, IEEE

Transactions on, vol. 21, no. 5, pp. 577–588, May 2011.

[76] ——, “Background extraction and long-term memory motion-compensated predic-

tion for spatial-random-access-enabled video coding,” in Picture Coding Symposium,

2009. PCS 2009, May 2009, pp. 1–4.

[77] A. Mavlankar, P. Baccichet, D. Varodayan, and B. Girod, “Optimal slice size for

streaming regions of high resolution video with virtual pan/tilt/zoom functionality,”

in Proc. of 15th European Signal Processing Conference (EUSIPCO), 2007.

[78] F. Boulos, W. Chen, B. Parrein, and P. Le Callet, “A new h.264/avc error resilience

model based on regions of interest,” in Packet Video Workshop, 2009. PV 2009. 17th

International, May 2009, pp. 1–9.


[79] C. Koch and S. Ullman, Matters of Intelligence. Springer Netherlands, 1987, vol. 188,

ch. Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry, pp.

115–141.

[80] L. Itti, “Automatic foveation for video compression using a neurobiological model of

visual attention,” Image Processing, IEEE Transactions on, vol. 13, no. 10, pp. 1304–

1318, 2004.

[81] Z. Li, S. Qin, and L. Itti, “Visual attention guided bit allocation in video compression,”

Image and Vision Computing, vol. 29, no. 1, pp. 1 – 14, 2011.

[82] L. Itti, “Automatic attention-based prioritization of unconstrained video for compression,”

in Proc. SPIE Human Vision and Electronic Imaging IX (HVEI04), pp. 272–283.

[83] A. Unterweger and A. Uhl, “Slice groups for post-compression region of interest en-

cryption in h.264/avc and its scalable extension,” Signal Processing: Image Commu-

nication, vol. 29, no. 10, pp. 1158 – 1170, 2014.

[84] S. Khire, A. Rodriguez, S. Robertson, and N. Jayant, “Error-resilient delivery of re-

gion of interest video using multiple representation coding,” in Acoustics, Speech and

Signal Processing (ICASSP), 2013 IEEE International Conference on, May 2013, pp.

2055–2059.

[85] H.-M. Hu, B. Li, W. Lin, W. Li, and M.-T. Sun, “Region-based rate control for

H.264/AVC for low bit-rate applications,” Circuits and Systems for Video Technology,

IEEE Transactions on, vol. 22, no. 11, pp. 1564–1576, Nov 2012.

[86] X. Zhu, E. Setton, and B. Girod, “Content-adaptive coding and delay-aware rate

control for a multi-camera wireless surveillance network,” in Multimedia Signal Pro-

cessing, 2005 IEEE 7th Workshop on, Oct 2005, pp. 1–4.

[87] F. Licandro, A. Lombardo, and G. Schembra, “Multipath routing and rate-controlled

video encoding in wireless video surveillance networks,” Multimedia Systems, vol. 14,

no. 3, pp. 155–165, 2008.

[88] A. Zainaldin, I. Lambadaris, and B. Nandy, “Adaptive rate control low bit-rate video

transmission over wireless ZigBee networks,” in Communications, 2008. ICC ’08. IEEE

International Conference on, May 2008, pp. 52–58.

[89] Y. Sun, I. Ahmad, D. Li, and Y.-Q. Zhang, “Region-based rate control and bit alloca-

tion for wireless video transmission,” Multimedia, IEEE Transactions on, vol. 8, no. 1,

pp. 1–10, Feb 2006.

[90] C.-M. Huang and C.-W. Lin, “Multiple-priority region-of-interest H.264 video compression

using constraint variable bitrate control for video surveillance,” Optical Engineering,

vol. 48, no. 4, pp. 047004–047004-10, 2009.

[91] J. Chao, R. Huitl, E. Steinbach, and D. Schroeder, “A novel rate control framework for

SIFT/SURF feature preservation in H.264/AVC video compression,” Circuits and Systems

for Video Technology, IEEE Transactions on, vol. 25, no. 6, pp. 958–972, 2015.

[92] “http://resource.boschsecurity.com/documents/commercial brochure enus

9822241291.pdf,” DINION and FLEXIDOME HD 1080p High Dynamic Range cameras.

[93] “https://blog.sony.com/press/sonys-4k-security-camera-has-1-0-type-exmor-r-cmos-sensor-for-advanced-imaging-capabilities/,”

Sony’s 4K Security Camera: Advanced

Imaging Capabilities.

[94] “http://www.axis.com/files/whitepaper/wp zipstream 64253 en 1506 lo.pdf,” Axis

Zipstream technology.

[95] Videobandit™ suite. General Dynamics, C4 Systems. [Online]. Available:

http://www.gdc4s.com/video-bandit

[96] T. Gan and P. Rondao Alface, “Fast mode decision for H.264/AVC encoding of tunnel

surveillance video,” in Advances in Multimedia (MMEDIA), 2010 Second International

Conferences on, June 2010, pp. 7–12.

[97] M. Akram and E. Izquierdo, “Fast multiframe motion estimation for surveillance

videos,” in Image Processing (ICIP), 2010 17th IEEE International Conference on, Sept

2010, pp. 753–756.

[98] ——, “Fast motion estimation for surveillance video compression,” Signal, Image and

Video Processing, vol. 7, no. 6, pp. 1103–1112, 2013.

[99] M. Akram, “Surveillance centric coding,” Ph.D. dissertation, Queen Mary, University

of London, 2011.

[100] G. Xu, M. Ding, Y. Cheng, and Y. Tian, “Global motion estimation based on Kalman

predictor,” in Imaging Systems and Techniques, 2009. IST ’09. IEEE International Work-

shop on, May 2009, pp. 395–398.

[101] M. Munderloh, H. Meuel, and J. Ostermann, “Mesh-based global motion compensa-

tion for robust mosaicking and detection of moving objects in aerial surveillance,” in

Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer

Society Conference on, June 2011, pp. 1–6.

[102] H.-J. Stolberg, S. Moch, L. Friebe, A. Dehnhardt, M. Berekovic, and P. Pirsch, “An

SoC with two multimedia DSPs and a RISC core for video compression applications,”

in Solid-State Circuits Conference, 2004. Digest of Technical Papers. ISSCC. 2004 IEEE

International, Feb 2004, pp. 330–531 Vol.1.

[103] Y. Chi, R. Etienne-Cummings, and G. Cauwenberghs, “Image sensor with focal plane

change event driven video compression,” in Circuits and Systems, 2008. ISCAS 2008.

IEEE International Symposium on, May 2008, pp. 1862–1865.

[104] ——, “Image sensor with focal plane change event driven video compression,” in

Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, May 2008,

pp. 1862–1865.

[105] B. Zhao, X. Zhang, S. Chen, K.-S. Low, and H. Zhuang, “A 64 × 64 CMOS image sensor

with on-chip moving object detection and localization,” Circuits and Systems for Video

Technology, IEEE Transactions on, vol. 22, no. 4, pp. 581–588, April 2012.

[106] S. Mizuno, K. Fujita, H. Yamamoto, N. Mukozaka, and H. Toyoda, “A 256 × 256

compact CMOS image sensor with on-chip motion detection function,” Solid-State

Circuits, IEEE Journal of, vol. 38, no. 6, pp. 1072–1075, June 2003.

[107] M. Zhang, N. Llaser, H. Mathias, and A. Dupret, “Design and optimization of two mo-

tion detection circuits for video monitoring system,” in Circuits and Systems (ISCAS),

2012 IEEE International Symposium on, May 2012, pp. 1907–1910.

[108] N. Massari, M. Gottardi, L. Gonzo, D. Stoppa, and A. Simoni, “A CMOS image sensor

with programmable pixel-level analog processing,” Neural Networks, IEEE Transac-

tions on, vol. 16, no. 6, pp. 1673–1684, Nov 2005.

[109] W. Leon-Salas, S. Balkir, K. Sayood, N. Schemm, and M. Hoffman, “A CMOS imager

with focal plane compression using predictive coding,” Solid-State Circuits, IEEE Jour-

nal of, vol. 42, no. 11, pp. 2555–2572, Nov 2007.

[110] S. Kawahito, M. Yoshida, M. Sasaki, K. Umehara, D. Miyazaki, Y. Tadokoro, K. Murata,

S. Doushou, and A. Matsuzawa, “A CMOS image sensor with analog two-dimensional

DCT-based compression circuits for one-chip cameras,” Solid-State Circuits,

IEEE Journal of, vol. 32, no. 12, pp. 2030–2041, Dec 1997.

[111] Z. Lin, M. Hoffman, N. Schemm, W. Leon-Salas, and S. Balkir, “A CMOS image sensor

for multi-level focal plane image decomposition,” Circuits and Systems I: Regular

Papers, IEEE Transactions on, vol. 55, no. 9, pp. 2561–2572, Oct 2008.

[112] S. Chen, A. Bermak, and Y. Wang, “A CMOS image sensor with on-chip image compression

based on predictive boundary adaptation and memoryless QTD algorithm,”

Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 19, no. 4, pp.

538–547, April 2011.

[113] M. Zhang and A. Bermak, “Compressive acquisition CMOS image sensor: From the

algorithm to hardware implementation,” Very Large Scale Integration (VLSI) Systems,

IEEE Transactions on, vol. 18, no. 3, pp. 490–500, March 2010.

[114] ——, “CMOS image sensor with on-chip image compression: A review and performance

analysis,” Journal of Sensors, 2010.

[115] C. Yeo and K. Ramchandran, “Robust distributed multiview video compression for

wireless camera networks,” Image Processing, IEEE Transactions on, vol. 19, no. 4,

pp. 995–1008, April 2010.

[116] “http://blinkforhome.com,” Blink wireless surveillance system.

[117] “http://www.vuezone.com,” Netgear VueZone remote video system.

[118] L. J. Song and Q. Fan, “The design and implementation of a video surveillance system

for large scale wind farm,” Advanced Materials Research, vol. 361-363, pp. 1257–

1262, 2011.

[119] C. Hartung, R. Han, C. Seielstad, and S. Holbrook, “FireWxNet: A multi-tiered

portable wireless system for monitoring weather conditions in wildland fire envi-

ronments,” in Proceedings of the 4th International Conference on Mobile Systems, Ap-

plications and Services, ser. MobiSys ’06. New York, NY, USA: ACM, 2006, pp. 28–41.

[120] Y. Ye, S. Ci, A. Katsaggelos, Y. Liu, and Y. Qian, “Wireless video surveillance: A

survey,” Access, IEEE, vol. 1, pp. 646–660, 2013.

[121] J. Jung, J. Lim, S. Lee, J. Lee, J. Yang, and C.-M. Kyung, “A low-energy video event

data recorder using dual image/video codec,” in Advanced Video and Signal Based

Surveillance (AVSS), 2014 11th IEEE International Conference on, Aug 2014, pp. 277–

282.

[122] C. Li, D. Wu, and H. Xiong, “Power-rate-distortion model for wireless video commu-

nication under delay and energy constraints,” Circuits and Systems for Video Technol-

ogy, IEEE Transactions on, vol. 24, no. 7, pp. 1170–1183, July 2014.

[123] Z. He, W. Cheng, and X. Chen, “Energy minimization of portable video communica-

tion devices based on power-rate-distortion optimization,” Circuits and Systems for

Video Technology, IEEE Transactions on, vol. 18, no. 5, pp. 596–608, May 2008.

[124] Z. He, Y. Liang, L. Chen, I. Ahmad, and D. Wu, “Power-rate-distortion analysis for

wireless video communication under energy constraints,” Circuits and Systems for

Video Technology, IEEE Transactions on, vol. 15, no. 5, pp. 645–658, May 2005.

[125] Z. He and D. Wu, “Resource allocation and performance analysis of wireless video

sensors,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16,

no. 5, pp. 590–599, May 2006.

[126] M. Marijan, I. Demirkol, D. Maricic, G. Sharma, and Z. Ignjatovic, “Adaptive sensing

and optimal power allocation for wireless video sensors with sigma-delta imager,”

Image Processing, IEEE Transactions on, vol. 19, no. 10, pp. 2540–2550, Oct 2010.

[127] P. Kaewtrakulpong and R. Bowden, “An Improved Adaptive Background Mixture

Model for Realtime Tracking with Shadow Detection,” in Proc. 2nd European Work-

shop on Advanced Video Based Surveillance Systems. Kluwer Academic Publishers,

September 2001.

[128] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and

Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[129] A. Prati, I. Mikic, M. M. Trivedi, and R. Cucchiara, “Detecting moving shadows:

algorithms and evaluation,” IEEE Trans. on Pattern Analysis and Machine Intelligence,

vol. 25, no. 7, pp. 918–923, Jun. 2003.

[130] P. Gorur and B. Amrutur, “Speeded up Gaussian mixture model algorithm for background

subtraction,” IEEE Conf. on Advanced Video and Signal Based Surveillance

(AVSS), pp. 386–391, 2011.

[131] W. G. Cochran, Sampling Techniques, 3rd Edition. John Wiley, 1977.

[132] S. K. Thompson, Sampling. Wiley Series in Probability and Statistics, 2012.

[133] “Guidance on choosing a sampling design for environmental data collection,” Envi-

ronmental Protection Agency, United States, Tech. Rep., 2002.

[134] H. J. Chang, H. Jeong, and J. Y. Choi, “Active attentional sampling for speed-up

of background subtraction,” IEEE Conf. on Comput. Vision and Pattern Recognition

(CVPR), pp. 2088–2095, June 2012.

[135] J. Ferryman and A. Shahrokni, “PETS2009: Dataset and challenge,” in Performance

Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE Interna-

tional Workshop on, Dec 2009, pp. 1–6.

[136] N. Goyette, P. M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, “Changedetection.net: A

new change detection benchmark dataset,” in 2012 IEEE Computer Society Conference

on Computer Vision and Pattern Recognition Workshops, June 2012, pp. 1–8.

[137] JVT JM Reference Software. [Online]. Available:

http://iphome.hhi.de/suehring/tml/

[138] J. Park, A. Tabb, and A. Kak, “Hierarchical data structure for real-time background

subtraction,” in IEEE Int. Conf. on Image Process. (ICIP), 2006, pp. 1849–1852.

[139] D.-Y. Lee, J.-K. Ahn, and C.-S. Kim, “Fast background subtraction algorithm using

two-level sampling and silhouette detection,” in IEEE Int. Conf. on Image Process.

(ICIP), 2009, pp. 3177–3180.

[140] J. M. Guo, Y.-F. Liu, C.-H. Hsia, M.-H. Shih, and C.-S. Hsu, “Hierarchical method

for foreground detection using codebook model,” IEEE Trans. Circuits Syst. Video

Technol., vol. 21, no. 6, pp. 804–815, June 2011.

[141] H.-H. Lin, J.-H. Chuang, and T.-L. Liu, “Regularized background adaptation: A novel

learning rate control scheme for Gaussian mixture modeling,” IEEE Trans. Image Process.,

vol. 20, no. 3, pp. 822–836, 2011.

[142] C. A. Rothkopf, D. H. Ballard, and M. M. Hayhoe, “Task and context determine where

you look,” J. of Vision, vol. 7, no. 14, 2007.

[143] J. W. Suchow and G. A. Alvarez, “Motion silences awareness of visual change,” Current

Biology, vol. 21, pp. 140–143, 2011.

[144] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with

discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell.,

vol. 32, no. 9, pp. 1627–1645, 2010.

[145] F. Dadgostar, “Real-time vision-based hand and face tracking and recognition of ges-

ture,” Ph.D. dissertation, Massey University, 2006.

[146] F. Dadgostar and A. Sarrafzadeh, “An adaptive real-time skin detector based on hue

thresholding: A comparison on two motion tracking methods,” Pattern Recognition

Letters, vol. 27, no. 12, pp. 1342–1352, 2006.

[147] X. Ren and J. Malik, “Learning a classification model for segmentation,” in Computer

Vision, 2003. Proceedings. Ninth IEEE International Conference on, Oct 2003, pp. 10–

17 vol.1.

[148] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, “Layered object detection for

multi-class segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2010

IEEE Conference on, June 2010, pp. 3113–3120.

[149] H. Meuel, M. Reso, J. Jachalsky, and J. Ostermann, “Superpixel-based segmentation

of moving objects for low bitrate ROI coding systems,” in Advanced Video and Signal

Based Surveillance (AVSS), 2013 10th IEEE International Conference on, Aug 2013,

pp. 395–400.

[150] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC superpixels

compared to state-of-the-art superpixel methods,” Pattern Analysis and Machine

Intelligence, IEEE Transactions on, vol. 34, no. 11, pp. 2274–2282, Nov 2012.

[151] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts,

and shadows in video streams,” Pattern Analysis and Machine Intelligence, IEEE Trans-

actions on, vol. 25, no. 10, pp. 1337–1342, Oct 2003.

[152] A. Joshi and N. Papanikolopoulos, “Learning to detect moving shadows in dy-

namic environments,” Pattern Analysis and Machine Intelligence, IEEE Transactions

on, vol. 30, no. 11, pp. 2055–2063, Nov 2008.

[153] S. Nadimi and B. Bhanu, “Physical models for moving shadow and object detection

in video,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 26,

no. 8, pp. 1079–1087, Aug 2004.

[154] N. Martel-Brisson and A. Zaccarin, “Learning and removing cast shadows through a

multidistribution approach,” Pattern Analysis and Machine Intelligence, IEEE Transac-

tions on, vol. 29, no. 7, pp. 1133–1146, July 2007.

[155] F. Porikli and J. Thornton, “Shadow flow: a recursive method to learn moving cast

shadows,” in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference

on, vol. 1, Oct 2005, pp. 891–898 Vol. 1.

[156] T. Horprasert, D. Harwood, and L. S. Davis, “A statistical approach for real-time

robust background subtraction and shadow detection,” in Proc. IEEE ICCV, vol. 99,

pp. 1–19.

[157] A. Sanin, C. Sanderson, and B. C. Lovell, “Shadow detection: A survey and comparative

evaluation of recent methods,” Pattern Recognition, vol. 45, no. 4, pp. 1684–1695,

2012.

[158] J.-B. Huang and C.-S. Chen, “Moving cast shadow detection using physics-based fea-

tures,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference

on, June 2009, pp. 2310–2317.

[159] A. Leone, C. Distante, and F. Buccolieri, “A texture-based approach for shadow de-

tection,” in Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE


[160] R. Qin, S. Liao, Z. Lei, and S. Li, “Moving cast shadow removal based on local de-

scriptors,” in Pattern Recognition (ICPR), 2010 20th International Conference on, Aug

2010, pp. 1377–1380.

[161] A. Sanin, C. Sanderson, and B. Lovell, “Improved shadow removal for robust person

tracking in surveillance scenarios,” in Pattern Recognition (ICPR), 2010 20th Interna-

tional Conference on, Aug 2010, pp. 141–144.

[162] A. Elgammal, C. Muang, and D. Hu, “Skin detection - a short tutorial.”

[163] M. Jones and J. Rehg, “Statistical color models with application to skin detection,”

in Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference

on., vol. 1, 1999, p. 280 Vol. 1.

[164] H. Greenspan, J. Goldberger, and I. Eshet, “Mixture model for face-color modeling

and segmentation,” Pattern Recognition Letters, vol. 22, no. 14, pp. 1525 – 1536,

2001.

[165] S. Phung, A. Bouzerdoum, and D. Chai, “Skin segmentation using color pixel

classification: analysis and comparison,” Pattern Analysis and Machine Intelligence,

IEEE Transactions on, vol. 27, no. 1, pp. 148–154, Jan 2005.

[166] Lti-lib: Image processing and computer vision library. [Online]. Available:

http://ltilib.sourceforge.net/doc/homepage/index.shtml

[167] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” Interna-

tional Journal of Computer Vision, vol. 38, no. 1, pp. 15–33, 2000.

[168] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in

Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society

Conference on, June 2005, vol. 1, pp. 886–893 vol. 1.

[169] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” in BMVC.

British Machine Vision Association, 2009.

[170] B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single

image by Bayesian combination of edgelet part detectors,” in Computer Vision, 2005.

ICCV 2005. Tenth IEEE International Conference on, vol. 1, Oct 2005, pp. 90–97 Vol.

1.

[171] S. Maji, A. Berg, and J. Malik, “Classification using intersection kernel support vector

machines is efficient,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008.


[172] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, “Fast human detection using a cascade

of histograms of oriented gradients,” in Computer Vision and Pattern Recognition,

2006 IEEE Computer Society Conference on, vol. 2, 2006, pp. 1491–1498.

[173] V. Shet, M. Singh, C. Bahlmann, V. Ramesh, J. Neumann, and L. Davis, “Predicate

logic based image grammars for complex pattern recognition,” International Journal

of Computer Vision, vol. 93, no. 2, pp. 141–161, 2011.

[174] P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion

and appearance,” in Computer Vision, 2003. Proceedings. Ninth IEEE International

Conference on, Oct 2003, pp. 734–741 vol.2.

[175] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-

detection-by-tracking,” in Computer Vision and Pattern Recognition, 2008. CVPR

2008. IEEE Conference on, June 2008, pp. 1–8.

[176] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for ac-

curate object detection and semantic segmentation,” in Computer Vision and Pattern

Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 580–587.

[177] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,

V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol.

abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842

[178] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang,

C.-C. Loy, and X. Tang, “DeepID-Net: Deformable deep convolutional neural networks

for object detection,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE

Conference on, June 2015, pp. 2403–2412.

[179] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time

object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.

[Online]. Available: http://arxiv.org/abs/1506.01497

[180] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation

of the state of the art,” Pattern Analysis and Machine Intelligence, IEEE Transactions

on, vol. 34, no. 4, pp. 743–761, 2012.

[181] P. Felzenszwalb, R. Girshick, and D. McAllester, “Cascade object detection with de-

formable part models,” in Computer Vision and Pattern Recognition (CVPR), 2010


[182] H.-T. Lin, C.-J. Lin, and R. Weng, “A note on Platt’s probabilistic outputs for support

vector machines,” Machine Learning, vol. 68, no. 3, pp. 267–276, 2007.

[183] J. Liu, R. Collins, and Y. Liu, “Automatic surveillance camera calibration without

pedestrian tracking,” 2011, pp. 117.1–117.11.

[184] S. C. Lee and R. Nevatia, “Robust camera calibration tool for video surveillance cam-

era in urban environment,” in Computer Vision and Pattern Recognition Workshops

(CVPRW), 2011 IEEE Computer Society Conference on, June 2011, pp. 62–67.

[185] P. Sudowe and B. Leibe, “Efficient use of geometric constraints for sliding-window

object detection in video,” in Computer Vision Systems, ser. Lecture Notes in Com-

puter Science, J. Crowley, B. Draper, and M. Thonnat, Eds. Springer Berlin Heidel-

berg, 2011, vol. 6962, pp. 11–20.

[186] 3D calibration of RIVA IP cameras with integrated video analytics. [Online]. Available:

http://www.rivatech.de/en/vca/vca-installation

[187] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual

tracking: An experimental survey,” Pattern Analysis and Machine Intelligence, IEEE

Transactions on, vol. 36, no. 7, pp. 1442–1468, July 2014.

[188] M. Godec, P. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in

Computer Vision (ICCV), 2011 IEEE International Conference on, Nov 2011, pp. 81–

88.

[189] D. Mitzel, E. Horbert, A. Ess, and B. Leibe, “Multi-person tracking with sparse detection

and continuous segmentation,” in Computer Vision – ECCV 2010, ser. Lecture

Notes in Computer Science, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Springer

Berlin Heidelberg, 2010, vol. 6311, pp. 397–410.

[190] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the

integral histogram,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer

Society Conference on, vol. 1, June 2006, pp. 798–805.

[191] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, “Locally orderless tracking,” in Com-

puter Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, June 2012,

pp. 1940–1947.

[192] C. Tomasi and T. Kanade, “Detection and tracking of point features,” Carnegie

Mellon University, Tech. Rep. CMU-CS-91-132, 1991.

[193] J. Shi and C. Tomasi, “Good features to track,” in Computer Vision and Pattern Recog-

nition, 1994. Proceedings CVPR ’94., 1994 IEEE Computer Society Conference on, Jun

1994, pp. 593–600.

[194] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” Pattern Anal-

ysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 5, pp. 564–577, May

2003.

[195] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans

by Bayesian combination of edgelet based part detectors,” International Journal of

Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.

[196] T. Zhao, R. Nevatia, and B. Wu, “Segmentation and tracking of multiple humans in

crowded environments,” Pattern Analysis and Machine Intelligence, IEEE Transactions

on, vol. 30, no. 7, pp. 1198–1211, July 2008.

[197] K. Smith, D. Gatica-Perez, and J. Odobez, “Using particles to track varying numbers

of interacting people,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005.

IEEE Computer Society Conference on, vol. 1, June 2005, pp. 962–969 vol. 1.

[198] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online mul-

tiple instance learning,” Pattern Analysis and Machine Intelligence, IEEE Transactions

on, vol. 33, no. 8, pp. 1619–1632, Aug 2011.

[199] S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2,

pp. 261–271, Feb. 2007.

[200] Z. Kalal, K. Mikolajczyk, and J. Matas, “Forward-backward error: Automatic de-

tection of tracking failures,” in Pattern Recognition (ICPR), 2010 20th International

Conference on, Aug 2010, pp. 2756–2759.

[201] L. Antanas, M. van Otterlo, J. O. Mogrovejo, T. Tuytelaars, and L. D. Raedt, “There

are plenty of places like home: Using relational representations in hierarchies for

distance-based image understanding,” Neurocomputing, vol. 123, pp. 75–85, 2014,

contains Special issue articles: Advances in Pattern Recognition Applications and

Methods.

[202] M. Ginsberg, “Multivalued logics: A uniform approach to reasoning in AI,” Computational

Intelligence, vol. 4, no. 1, pp. 256–316, 1988.

[203] B. Fulkerson, A. Vedaldi, and S. Soatto, “Class segmentation and object localization

with superpixel neighborhoods,” in Computer Vision, 2009 IEEE 12th International


[204] W. Gao, Y. Tian, T. Huang, S. Ma, and X. Zhang, “The IEEE 1857 standard: Empow-

ering smart video surveillance systems,” Intelligent Systems, IEEE, vol. 29, no. 5, pp.

30–39, Sept 2014.

[205] W. Gao and S. Ma, Advanced Video Coding Systems. Springer International Publish-

ing, 2014.

[206] W. Benesova and M. Kottman, “Fast superpixel segmentation using morphological

processing,” in MVML, 2014, pp. 1–9.

[207] P. Dollar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object

detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36,

no. 8, pp. 1532–1545, 2014.

[208] M. Sadeghi and D. Forsyth, “30Hz object detection with DPM V5,” in Computer Vision

ECCV 2014, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele,

and T. Tuytelaars, Eds. Springer International Publishing, 2014, vol. 8689, pp.

65–79.

[209] S. A. Kripke, “Outline of a theory of truth,” Journal of Philosophy, vol. 72, no. 19, pp.

690–716, 1975.

[210] M. Fitting, “Notes on the mathematical aspects of Kripke’s theory of truth,” Notre

Dame J. Formal Logic, vol. 27, no. 1, pp. 75–88, Jan. 1986.

[211] ——, “Bilattices are nice things,” in Self-Reference, T. Bolander, V. Hendricks, and

S. A. Pedersen, Eds. CSLI Publications, 2006.

[212] N. Belnap, “How a computer should think,” in Contemporary Aspects of Philosophy,

G. Ryle, Ed. Oriel Press Ltd., 1977.

[213] N. D. Belnap, “A useful four-valued logic,” in Modern Uses of Multiple-Valued Logic,

J. M. Dunn and G. Epstein, Eds. D. Reidel, 1977.

[214] C. Cornelis, O. Arieli, G. Deschrijver, and E. Kerre, “Uncertainty modeling by bilattice-

based squares and triangles,” Fuzzy Systems, IEEE Transactions on, vol. 15, no. 2, pp.

161–175, April 2007.

[215] V. D. Shet, “Bilattice based logical reasoning for automated visual surveillance and

other applications,” Ph.D. dissertation, University of Maryland, College Park, 2007.

[216] B. Schweizer and A. Sklar, “Associative functions and abstract semi-groups,” 1963.
