Ren Haoyu 2009.10.30 ICCV 2009 Paper Reading


Page 1:

Ren Haoyu 2009.10.30

ICCV 2009 Paper Reading

Page 2:

Selected Paper

• Paper 1 (#187): LabelMe Video: Building a Video Database with Human Annotations – J. Yuen, B. Russell, C. Liu, and A. Torralba

• Paper 2 (#005): An HOG-LBP Human Detector with Partial Occlusion Handling – X. Wang, T. Han, and S. Yan

Page 3:

Paper 1 Overview

• Author information
• Abstract
• Paper content
  – To build a database
  – Database information
  – Database function
• Conclusion

Page 4:

Author information (1/4)

• Jenny Yuen
  – Education
    • BS in Computer Science from the University of Washington
    • Now a third-year PhD student at MIT CSAIL, advised by Professor Antonio Torralba
  – Research interests
    • Computer vision (object/action recognition, scene understanding, image/video databases)
  – Papers
    • 2 ECCV'08, 1 CVPR'09

Page 5:

Author information (2/4)

• Bryan Russell
  – Education
    • ...
    • Postdoctoral Fellow in the INRIA WILLOW team
  – Research interests
    • 3D object and scene modeling, analysis, and retrieval
    • Human activity capture and classification
    • Category-level object and scene recognition
  – Papers
    • 1 NIPS'09, 1 CVPR'09, 1 CVPR'08

Page 6:

Author information (3/4)

• Ce Liu
  – Education
    • BS, Department of Automation, Tsinghua University
    • MS, Department of Automation, Tsinghua University
    • PhD, Department of Electrical Engineering and Computer Science, MIT
  – Research interests
    • Computer vision, computer graphics, computational photography, applications of machine learning in vision and graphics
  – Publications
    • ECCV'08, CVPR'08, PAMI'08…

Page 7:

Author information (4/4)

• Antonio Torralba
  – Education
    • …
    • Associate Professor at MIT CSAIL
  – Research interests
    • Scene and object recognition
  – Papers
    • 3 CVPR'09, 2 ICCV'09…

Page 8:

Abstract (1/1)

• Problem
  – Currently, video analysis algorithms suffer from a lack of information regarding the objects present and their interactions, as well as from missing comprehensive annotated video databases for benchmarking.

• Main contributions
  – We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos.
  – The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information.
  – Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations.

• We complement this paper by demonstrating potential uses of this database, studying motion statistics as well as cause-effect motion relationships between objects.

Page 9:

To build a database (1/1)

• Existing databases take little account of prior knowledge of motion, location, and appearance at the object and object-interaction levels in real-world videos.

• The goal is to build a database that scales in quantity, variety, and quality like the currently available benchmark databases for both static images and videos.

• Diversity, accuracy, and openness
  – We want to collect a large and diverse database of videos that span many different scene, object, and action categories, and to accurately label the identity and location of objects and actions.
  – Furthermore, we wish to allow open and easy access to the data without copyright restrictions.

Page 10:

Database Information (1/1)

• 238 object classes, 70 action classes and 1,903 video sequences

Page 11:

Database function (1/1)

• Object annotation
• Event annotation
• Annotation interpolation
  – To fill in the missing polygons between key frames
  – 2D/3D interpolation
• Occlusion handling and depth ordering
• Cause-effect relations within moving objects

Page 12:

Annotation interpolation (1/3)

• 2D interpolation
  – Given an object's 2D positions p0 and p1 at two key frames t = 0 and t = 1, assume the points outlining the object are transformed by a 2D projection plus a residual term:

    p1 = S R p0 + T + r

    where S, R, and T are the scaling, rotation, and translation matrices encoding the projection from p0 to p1 that minimizes the residual term r
  – Any polygon at a frame t ∈ [0, 1] can then be linearly interpolated as:

    p_t = [S R]^t p0 + t T + t r
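The interpolation above can be sketched in a few lines of Python (an illustrative reconstruction, not the authors' code; it assumes the similarity transform is given as a scalar scale s, rotation angle theta, and translation T, and computes the residual r per point):

```python
import math

def interp_polygon_2d(p0, p1, s, theta, T, t):
    """Interpolate polygon control points between two key frames.

    p0, p1: lists of (x, y) points at t = 0 and t = 1.
    s, theta, T: scale, rotation angle, translation such that
    p1 ~ s*R(theta)*p0 + T; the per-point residual r absorbs what the
    similarity transform cannot explain.  Intermediate frames use
    p_t = [sR]^t p0 + t*T + t*r, where the fractional power of the
    similarity is scale s**t and angle t*theta.
    """
    def apply(s_, th_, pt):
        c, si = math.cos(th_), math.sin(th_)
        x, y = pt
        return (s_ * (c * x - si * y), s_ * (si * x + c * y))

    out = []
    for q0, q1 in zip(p0, p1):
        full = apply(s, theta, q0)                      # s*R*q0
        r = (q1[0] - full[0] - T[0], q1[1] - full[1] - T[1])
        part = apply(s ** t, t * theta, q0)             # [sR]^t q0
        out.append((part[0] + t * T[0] + t * r[0],
                    part[1] + t * T[1] + t * r[1]))
    return out
```

At t = 0 this returns p0 and at t = 1 it returns p1 exactly, so the interpolation stays consistent with the key-frame annotations.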

Page 13:

Annotation interpolation (2/3)

• 3D interpolation
  – Assuming that a given point on the object moves along a straight line in the 3D world, the motion of the point X(t) at time t in 3D can be written as

    X(t) = X0 + ω(t) D

    where X0 is the initial point, D is the 3D direction, and ω(t) is the displacement along the direction vector
  – The image coordinates of points on the object can then be written as

    ( x(t), y(t) ) = ( (x0 + ω(t) x_v) / (ω(t) + 1), (y0 + ω(t) y_v) / (ω(t) + 1) )

  – Assuming perspective projection and a stationary camera, the intrinsic and extrinsic parameters of the camera can be expressed as a 3×4 matrix P, and the points project onto the image plane as

    P X(t) = x0 + ω(t) x_v,  where x0 = P X0 and x_v = P D

Page 14:

Annotation interpolation (3/3)

• 3D interpolation
  – Assuming the point moves with constant velocity v, we have ω(t) = v t
  – Given a corresponding second point x(t) = (x, y, 1) along the path projected into another frame, we can recover the velocity as

    v = (x0 − x) / (t (x − x_v))

  – In summary, to find the image coordinates of points on the object at any time, we simply need to know the coordinates of a point at two different times.
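The projection and velocity-recovery formulas can be checked per image coordinate (an illustrative sketch, assuming scalar coordinates and the vanishing point x_v of the 3D line; the same formula applies to y):

```python
def project_point(x0, xv, v, t):
    """Image coordinate of a point moving along a straight 3D line under
    a stationary perspective camera:
        x(t) = (x0 + w(t) * xv) / (w(t) + 1),   w(t) = v * t
    where x0 is the projection at t = 0 and xv the vanishing point."""
    w = v * t
    return (x0 + w * xv) / (w + 1.0)

def recover_velocity(x0, xv, x_t, t):
    """Given a second observation x_t at time t, solve
    x_t * (v*t + 1) = x0 + v*t*xv for the velocity:
        v = (x0 - x_t) / (t * (x_t - xv))"""
    return (x0 - x_t) / (t * (x_t - xv))
```

Round-tripping a point through `project_point` and `recover_velocity` returns the original velocity, which is exactly the "two observations suffice" claim above.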

• Comparison of 2D and 3D interpolation
  – Pixel error per object class.

Page 15:

Occlusion handling and depth ordering (1/1)

• Occlusion handling and depth ordering
  – Method 1: does not work
    • Model the appearance of the object; wherever it overlaps with another object, infer which object owns the visible part based on matching appearance
  – Method 2: does not work
    • When two objects overlap, the polygon with more control points in the intersection region is in front
  – Method 3: works
    • Extract accurate depth information using the object labels and infer support relationships from a large database of annotated images
    • Defining a subset of objects as ground objects (e.g., road, sidewalk, etc.), infer the support relationship by counting how many times the bottom part of a polygon overlaps with the supporting object (e.g., a person + road)

Page 16:

Cause-effect relations within moving objects (1/1)

• Cause-effect relations
  – Define a measure of causality: the degree to which an object class C causes the motion in an object of class E:

    Causality(C, E) = p(E moves | C moves and C causes E) / p(E moves | C moves and C does not cause E)
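This measure can be estimated by counting over the annotated videos. A minimal sketch (illustrative only; the record fields c_moves, e_moves, and causes are hypothetical stand-ins for the database's annotations):

```python
def causality(events):
    """Causality(C, E) estimated as the ratio of two conditional
    frequencies: how often E moves when C moves and is annotated as
    causing E, versus when C moves but is not annotated as causing E."""
    def cond_prob(cause_val):
        rel = [e for e in events if e["c_moves"] and e["causes"] == cause_val]
        if not rel:
            return float("nan")
        return sum(1 for e in rel if e["e_moves"]) / len(rel)
    return cond_prob(True) / cond_prob(False)
```

A ratio well above 1 indicates that class C plausibly drives the motion of class E rather than merely co-occurring with it.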

Page 17:

Examples (1/1)

Page 18:

Conclusion (1/1)

• We designed an open, easily accessible, and scalable annotation system to allow online users to label a database of real-world videos.

• Using our labeling tool, we created a video database that is diverse in samples and accurate, with human-guided annotations.

• Based on this database, we studied motion statistics and cause-effect relationships between moving objects to demonstrate examples of the wide array of applications for our database.

• Furthermore, we enriched our annotations by propagating depth information from a static and densely annotated image database.

Page 19:

Paper 2 Overview

• Author information
• Abstract
• Paper content
  – HOG feature
  – LBP feature
  – Occlusion handling
• Experimental results
• Conclusion

Page 20:

Author information (1/2)

• Wang Xiaoyu: homepage not found
• Tony Xu Han
  – Education
    • PhD at University of Illinois at Urbana-Champaign (advisor: Prof. Thomas Huang)
    • MS at University of Rhode Island
    • MS at Beijing Jiaotong University
    • BS at Beijing Jiaotong University
  – Research interests
    • Computer vision, machine learning, human-computer interaction, elder care technology
  – Papers
    • 1 CSVT'08, 1 CSVT'09, 1 CVPR'09

Page 21:

Author information (2/2)

• Yan Shuicheng
  – Education
    • BS, MS, PhD at the mathematics department, PKU
    • Assistant Professor in the Department of Electrical and Computer Engineering at the National University of Singapore
    • Founding lead of the Learning and Vision Research Group
  – Research interests
    • Activity and event detection in images and videos, subspace learning and manifold learning, transfer learning…
  – Papers
    • 2 CVPR'09, 1 IP'09, 1 PAMI'09, 2 ACM'09…

Page 22:

Abstract (1/1)

• Problem
  – Performance; occlusion handling

• Main contributions
  – By combining Histograms of Oriented Gradients (HOG) and Local Binary Pattern (LBP) as the feature set, we propose a novel human detection approach capable of handling partial occlusion.
  – For each ambiguous scanning window, we construct an occlusion likelihood map using the response of each block of the HOG feature to the global detector.
  – If partial occlusion is indicated with high likelihood in a certain scanning window, part detectors are applied to the unoccluded regions to achieve the final classification on the current scanning window.

• We achieve a detection rate of 91.3% at FPPW = 10⁻⁶, 94.7% at FPPW = 10⁻⁵, and 97.9% at FPPW = 10⁻⁴ on the INRIA dataset, which, to the best of our knowledge, is the best human detection performance on the INRIA dataset.

Page 23:

HOG feature (1/4)

• HOG feature
  – Histogram of Oriented Gradients, the most famous and most successful feature in human detection
  – Calculates a gradient orientation histogram voted by the gradient magnitude
  – Performs well with linear SVM, kernel SVM (RBF, IK, quadratic…), LDA + AdaBoost, SVM + AdaBoost, Logistic Boost…
• HOG feature extraction

Page 24:

HOG feature (2/4)

• HOG feature extraction
  – For a 64×128 patch, the minimum cell size is 8×8
  – Quantize the gradient orientation into 9 bins; use tri-linear interpolation + Gaussian weighting to vote the gradient magnitude
  – A block consists of 2×2 cells, overlapping 50%, for a total of 105 blocks with 3,780 bins
  – We can use an integral image to speed up HOG feature extraction without tri-linear interpolation and Gaussian weighting

[Figure: a gradient vote is distributed over the 2×2 cells C0–C3 of a block with bilinear weights dx, 1−dx, dy, 1−dy; each cell contributes 9 orientation bins to the block's HOG histogram]
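The block counts above follow directly from the layout. A small sketch to verify the arithmetic (function and parameter names are illustrative):

```python
def hog_layout(win_w=64, win_h=128, cell=8, block_cells=2, bins=9):
    """Number of HOG blocks and descriptor dimensionality for a
    detection window, with blocks of block_cells x block_cells cells
    sliding at a one-cell (50%) overlap."""
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks_x = cells_x - (block_cells - 1)  # 7 for a 64-px-wide window
    blocks_y = cells_y - (block_cells - 1)  # 15 for a 128-px-tall window
    n_blocks = blocks_x * blocks_y
    dim = n_blocks * block_cells * block_cells * bins
    return n_blocks, dim
```

For the standard 64×128 window this gives 105 blocks and 105 × 4 × 9 = 3,780 bins, matching the figures above.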

Page 25:

HOG feature (3/4)

• Convoluted Tri-linear Interpolation (CTI)
  – Use CTI instead of tri-linear interpolation to fit the integral image
  – Vote a gradient with a real-valued direction between 0 and π into the 9 discrete bins according to its direction and magnitude
  – Use bilinear interpolation to distribute the magnitude of the gradient into the two adjacent bins
  – Design a 7×7 convolution kernel whose weights are distributed to the neighborhood linearly according to distance; convolve this kernel over the orientation-bin image to achieve the tri-linear interpolation
  – Build the integral image on the convolved bin image
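Such a kernel can be sketched as a separable "tent" whose weights fall off linearly with distance (an assumption on our part; the slide specifies only the 7×7 size and the linear distribution):

```python
def cti_kernel(size=7):
    """Separable tent kernel emulating the spatial part of tri-linear
    interpolation as a convolution over per-bin orientation images:
    weight = (1 - |dx|/h) * (1 - |dy|/h) with half-width h = size//2 + 1."""
    half = size // 2
    w1d = [1.0 - abs(d) / (half + 1) for d in range(-half, half + 1)]
    return [[wx * wy for wx in w1d] for wy in w1d]
```

The kernel peaks at 1.0 in the center and decays linearly toward the edges, which is the property that makes the convolution equivalent to spreading each vote bilinearly over neighboring cells.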

Page 26:

HOG feature (4/4)

• Convoluted Tri-linear Interpolation (CTI)
  – Using the FFT for the convolution is efficient, so CTI does not increase the space complexity of the integral image approach

Page 27:

LBP feature (1/1)

• LBP feature
  – Local Binary Pattern, an exceptional texture descriptor widely used in various applications, which has achieved very good results in face recognition
  – Build pattern histograms in cells
  – Use LBP_{n,r}^u to denote the LBP feature that takes n sample points at radius r, where the number of 0-1 transitions is no more than u
  – Use bilinear interpolation to locate the sample points; use Euclidean distance to measure the distance
  – Integral images for fast extraction
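A minimal sketch of computing one LBP_{n,r}^u code (illustrative only; the cell histogramming and the distance details are omitted):

```python
import math

def lbp_code(img, x, y, n=8, r=1.0, u=2):
    """LBP_{n,r}^u at pixel (x, y): sample n neighbors on a circle of
    radius r (bilinear interpolation for off-grid samples), threshold
    against the center value, and accept only "uniform" patterns with
    at most u 0-1 transitions; other patterns return None and would
    share a single histogram bin."""
    def bilinear(fx, fy):
        x0, y0 = int(math.floor(fx)), int(math.floor(fy))
        dx, dy = fx - x0, fy - y0
        return ((1 - dx) * (1 - dy) * img[y0][x0] + dx * (1 - dy) * img[y0][x0 + 1]
                + (1 - dx) * dy * img[y0 + 1][x0] + dx * dy * img[y0 + 1][x0 + 1])

    c = img[y][x]
    bits = [1 if bilinear(x + r * math.cos(2 * math.pi * k / n),
                          y - r * math.sin(2 * math.pi * k / n)) >= c else 0
            for k in range(n)]
    transitions = sum(bits[k] != bits[(k + 1) % n] for k in range(n))
    if transitions > u:
        return None
    return sum(b << k for k, b in enumerate(bits))
```

A flat patch yields the all-ones pattern, and a bright isolated center yields the all-zeros pattern; both are uniform (zero transitions) and therefore kept.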

Page 28:

Occlusion handling (1/5)

• Basic idea
  – If a portion of the pedestrian is occluded, the densely extracted blocks of features in that area uniformly respond to the linear SVM classifier with negative inner products
  – Use the classification score of each block to infer whether occlusion occurs and where it occurs
  – When occlusion occurs, the part-based detector is triggered to examine the unoccluded portion

Page 29:

Occlusion handling (2/5)

• The decision function of the linear SVM is

  f(x) = Σ_{k=1}^{l} α_k ⟨x_k, x⟩ + b = w^T x + b

  where w is the weighting vector of the linear SVM, split into its 105 block segments:

  w = Σ_{k=1}^{l} α_k x_k = [w_1^T, …, w_105^T]^T

  We distribute the constant bias b over the blocks B_i:

  f(x) = Σ_{i=1}^{105} (w_i^T x_i + β_i) = Σ_{i=1}^{105} f_i(B_i),  with Σ_{i=1}^{105} β_i = b

  The real contribution of a block can then be recovered by subtracting the corresponding bias β_i from f_i(B_i). So the key problem is how to learn β_i.
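The block-wise decomposition can be sketched directly (illustrative; w_blocks, x_blocks, and betas stand for the 105 block slices of w, the window's block features, and the learned bias shares):

```python
def block_scores(w_blocks, x_blocks, betas):
    """Split a linear SVM score f(x) = w.x + b into per-block
    contributions f_i(B_i) = w_i . x_i + beta_i, with sum(betas) == b.
    Consistently negative f_i over a region is the occlusion cue."""
    return [sum(wi * xi for wi, xi in zip(wb, xb)) + beta
            for wb, xb, beta in zip(w_blocks, x_blocks, betas)]
```

By construction, summing the per-block scores recovers the full decision value w·x + b.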

Page 30:

Occlusion handling (3/5)

• Learn β_i, i.e. each block's share of the constant bias, from the training part of the INRIA dataset by collecting the relative ratio of the bias constant in each block to the total bias constant.

  Denote the HOG features of the positive/negative training samples as {x_p}, p = 1, …, N⁺ and {x_q}, q = 1, …, N⁻ (N⁺/N⁻ is the number of positive/negative samples), and denote their i-th blocks as B_{p;i} and B_{q;i}. Summing the detector scores over each set gives

  Σ_{p=1}^{N⁺} f(x_p) = S⁺ = Σ_{i=1}^{105} ( Σ_{p=1}^{N⁺} w_i^T B_{p;i} + N⁺ β_i )

  Σ_{q=1}^{N⁻} f(x_q) = S⁻ = Σ_{i=1}^{105} ( Σ_{q=1}^{N⁻} w_i^T B_{q;i} + N⁻ β_i )

Page 31:

Occlusion handling (4/5)

• Denote A = S⁺/N⁺ − S⁻/N⁻. The β_i terms cancel in this difference, so

  A = Σ_{i=1}^{105} w_i^T ( (1/N⁺) Σ_{p=1}^{N⁺} B_{p;i} − (1/N⁻) Σ_{q=1}^{N⁻} B_{q;i} )

  Distributing the bias b to each block in proportion to its share of A,

  β_i = λ w_i^T ( (1/N⁺) Σ_{p=1}^{N⁺} B_{p;i} − (1/N⁻) Σ_{q=1}^{N⁻} B_{q;i} ),  where λ = b / A

  Then we have Σ_{i=1}^{105} β_i = λ A = b, as required.

Page 32:

Occlusion handling (5/5)

• Implementation
  – Construct the binary occlusion likelihood image according to the response of each block of the HOG feature; the intensity of the occlusion likelihood image is the sign of f_i(B_i)
  – Use mean shift to segment out the possible occlusion regions on the binary occlusion likelihood image, with |f_i(B_i)| as the weight
  – A segmented region of the window with an overall negative response is inferred as an occluded region, and the part detector is applied to it. But if all the segmented regions are consistently negative, we tend to treat the image as a negative image

[Figure: original patches and their occlusion likelihood images]
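Building the likelihood map itself is straightforward once the per-block scores f_i(B_i) are available (a sketch; the mean-shift segmentation step is omitted and we only flag whether any negative block exists):

```python
def occlusion_map(scores, cols=7):
    """Arrange per-block scores into a sign image (the binary occlusion
    likelihood image) and keep |f_i| as per-block weights for the
    subsequent segmentation step."""
    rows = len(scores) // cols
    signs = [[1 if s >= 0 else -1 for s in scores[r * cols:(r + 1) * cols]]
             for r in range(rows)]
    weights = [abs(s) for s in scores]
    any_negative = any(v < 0 for row in signs for v in row)
    return signs, weights, any_negative
```

The 7-column layout matches the 7×15 block grid of a 64×128 window; a window with no negative block skips the part detectors entirely.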

Page 33:

Experimental result (1/3)

• Experiment 1: cell-structured LBP feature on the INRIA dataset
  – LBP_{8,1}^2 with L1-norm at 16×16 cell size shows the best result, about 94.0% detection at FPPW = 10⁻⁴

Page 34:

Experimental result (2/3)

• Experiment 2: HOG-LBP feature on the INRIA dataset
  – HOG-LBP outperforms all state-of-the-art algorithms under both the FPPI and FPPW criteria

Page 35:

Experimental result (3/3)

• Experiment 3: occlusion handling
  – The occlusion handling strategy yields a significant improvement in detection performance
  – Tested by overlaying PASCAL-segmented objects onto the testing images of the INRIA dataset

Page 36:

Conclusion (1/1)

• We propose a human detection approach capable of handling partial occlusion, and a feature set that combines the tri-linear interpolated HOG with LBP in the integral image framework.

• It has been shown in our experiments that the HOG-LBP feature outperforms other state-of-the-art detectors on the INRIA dataset.

• However, our detector cannot handle the articulated deformation of people, which is the next problem to be tackled.

Page 37:

Thanks