

Hidden Part Models for Human Action Recognition: Probabilistic vs. Max Margin. - Wang, Mori - IEEE Transactions on Pattern Analysis and Machine Intelligence - 2010


Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin

Yang Wang and Greg Mori, Member, IEEE

Abstract—We present a discriminative part-based approach for human action recognition from video sequences using motion features. Our model is based on the recently proposed hidden conditional random field (HCRF) for object recognition. Similarly to HCRF for object recognition, we model a human action by a flexible constellation of parts conditioned on image observations. Differently from object recognition, our model combines both large-scale global features and local patch features to distinguish various actions. Our experimental results show that our model is comparable to other state-of-the-art approaches in action recognition. In particular, our experimental results demonstrate that combining large-scale global features and local patch features performs significantly better than directly applying HCRF on local patches alone. We also propose an alternative for learning the parameters of an HCRF model in a max-margin framework. We call this method the max-margin hidden conditional random field (MMHCRF). We demonstrate that MMHCRF outperforms HCRF in human action recognition. In addition, MMHCRF can handle a much broader range of complex hidden structures arising in various problems in computer vision.

Index Terms—Human action recognition, part-based model, discriminative learning, max margin, hidden conditional random field.


1 INTRODUCTION

A good image representation is the key to the solution of many recognition problems in computer vision. In the literature, there has been a lot of work in designing good image representations (i.e., features). Different image features typically capture different aspects of image statistics. Some features (e.g., GIST [1]) capture the global scene of an image, while others (e.g., SIFT [2]) capture the image statistics of local patches.

A primary goal of this work is to address the following question: Is there a principled way to combine both large-scale and local patch features? This is an important question in many visual recognition problems. The work in [3] has shown the benefit of combining large-scale features and local patch features for object detection. In this paper, we apply the same intuition to the problem of recognizing human actions from video sequences. In particular, we represent a human action by combining large-scale template features and part-based local features in a principled way; see Fig. 1.

Recognizing human actions from videos is a task of obvious scientific and practical importance. In this paper, we consider the problem of recognizing human actions from video sequences on a frame-by-frame basis. We develop a discriminatively trained hidden part model to represent human actions. Our model is inspired by the hidden conditional random field (HCRF) model [4] in object recognition.

In object recognition, there are three major representations: global template (rigid, e.g., [5], or deformable, e.g., [6]), bag-of-words [7], and part-based [3], [8]. All three representations have been shown to be effective on certain object recognition tasks. In particular, recent work [3] has shown that part-based models outperform global templates and bag-of-words on challenging object recognition tasks.

A lot of ideas used in object recognition can also be found in action recognition. For example, there is work [9] that treats actions as space-time shapes and reduces the problem of action recognition to 3D object recognition. In action recognition, both global template [10] and bag-of-words models [11], [12], [13], [14] have been shown to be effective on certain tasks. Although conceptually appealing and promising, the merit of part-based models has not yet been widely recognized in action recognition. One goal of this work is to address this issue and to show the benefits of combining large-scale global template features and part models based on local patches.

A major contribution of this work is that we combine the flexibility of part-based approaches with the global perspectives of large-scale template features in a discriminative model for action recognition. We show that the combination of part-based and large-scale template features improves the final results.

Another contribution of this paper is to introduce a new learning method for training HCRF models based on the max-margin criterion. The new learning method, which we call the Max-Margin Hidden Conditional Random Field (MMHCRF), is based on the idea of the latent SVM (LSVM) in [3]. The advantage of MMHCRF is that the model can be solved efficiently for a large variety of complex hidden structures. The difference between our approach and LSVM is that we directly deal with multiclass classification, while LSVM in [3] only deals with binary classification. It turns out that the multiclass case cannot be easily solved using the off-the-shelf SVM solver, unlike LSVM. Instead, we develop our own optimization technique for solving our problem. Furthermore, we directly compare probabilistic versus max-margin learning criteria on identical models. We provide both experimental and theoretical evidence for the effectiveness of max-margin learning.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 7, JULY 2011

Y. Wang is with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: [email protected].
G. Mori is with the School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada. E-mail: [email protected].
Manuscript received 15 July 2009; revised 19 Mar. 2010; accepted 14 Aug. 2010; published online 29 Nov. 2010. Recommended for acceptance by P. Perez. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2009-07-0453. Digital Object Identifier no. 10.1109/TPAMI.2010.214.

0162-8828/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

Previous versions of this paper were published in [15] and [16]. The rest of this paper is organized as follows: Section 2 reviews previous work. Section 3 introduces our hidden part model for human action recognition, based on the hidden conditional random field, and also gives the details of learning and inference in the HCRF. Section 4 presents a new approach, called the MMHCRF, for learning the parameters of the hidden part model proposed in Section 3. In Section 5, we give a detailed analysis and comparison of HCRF and MMHCRF from a theoretical point of view. We present our experimental results on two benchmark data sets in Section 6 and conclude in Section 7.

2 RELATED WORK

A lot of work has been done on recognizing actions from video sequences. Much of this work is focused on analyzing patterns of motion. For example, Cutler and Davis [17] and Polana and Nelson [18] detect and classify periodic motions. Little and Boyd [19] analyze the periodic structure of optical flow patterns for gait recognition. Rao et al. [20] describe a view-invariant representation for 2D trajectories of tracked skin blobs. There is also work using both motion and shape cues. For example, Bobick and Davis [21] use a representation known as "temporal templates" to capture both motion and shape, represented as evolving silhouettes. Nowozin et al. [13] find discriminative subsequences in videos for action recognition. Jhuang et al. [22] develop a biologically inspired system for action recognition.

Recently, space-time interest points [23] were introduced into the action recognition community [14]. Most of the approaches use a bag-of-words representation to model the space-time interest points [11], [12], [14]. This representation suffers from the same drawback as its 2D analogue in object recognition, i.e., spatial information is ignored. There is also some recent work [24] that tries to combine space-time interest points with a generative model similar to the constellation model [25] in object recognition.

Our work is partly inspired by recent work on part-based event detection [26]. In that work, template matching is combined with a pictorial structure model to detect and localize actions in crowded videos. One limitation of that work is that one has to manually specify the parts. Unlike Ke et al. [26], the parts in our model are initialized automatically.

Our work is also related to discriminative learning methods in machine learning, in particular, learning with structured outputs [27], [28], [29], [30] and latent variables [3], [4].

3 HIDDEN CONDITIONAL RANDOM FIELDS FOR HUMAN ACTIONS

Our part-based representation for human actions is inspired by the hidden conditional random field model [4], which was originally proposed for object recognition and has also been applied in sequence labeling. Objects are modeled as flexible constellations of parts conditioned on the appearances of local patches found by interest-point operators. The probability of the assignment of parts to local features is modeled by a conditional random field (CRF) [27]. The advantage of the HCRF is that it relaxes the conditional independence assumption commonly used in the bag-of-words approaches of object recognition. In standard bag-of-words approaches, all of the local patches in an image are independent of each other, given the class label. This assumption is somewhat restrictive. The HCRF relaxes this assumption by allowing nearby patches to interact with each other.

Similarly, local patches can also be used to distinguish actions. Fig. 6a shows some examples of human motion and local patches that can be used to distinguish them. A bag-of-words representation can be used to model these local patches for action recognition. However, it suffers from the same restrictive conditional independence assumption, which ignores the spatial structure of the parts. In this work, we use similar ideas to model the constellation of these local patches in order to alleviate this restriction.

There are also some important differences between objects and actions. For objects, local patches could carry enough information for recognition. But, for actions, we believe that local patches are not sufficiently informative. In our approach, we modify the HCRF model to combine local patches and large-scale global features. The large-scale global features are represented by a root model that takes the frame as a whole. Another important difference with [4] is that we use the learned root model to find discriminative local patches, rather than using a generic interest-point operator.

3.1 Motion Features

Our model is built upon the optical flow features in [10]. This motion descriptor has been shown to perform reliably on noisy image sequences and has been applied to various tasks, such as action classification and motion synthesis.

To calculate the motion descriptor, we first need to track and stabilize the people in a video sequence. Any reasonable tracking or human detection algorithm can be used, since the motion descriptor we use is very robust to jitters introduced by the tracking. Given a stabilized video sequence in which the person of interest appears in the center of the field of view, we compute the optical flow at each frame using the Lucas-Kanade [31] algorithm. The optical flow vector field $F$ is then split into two scalar fields, $F_x$ and $F_y$, corresponding to the $x$ and $y$ components of $F$. $F_x$ and $F_y$ are further half-wave rectified into four nonnegative channels, $F_x^+$, $F_x^-$, $F_y^+$, and $F_y^-$, so that $F_x = F_x^+ - F_x^-$ and $F_y = F_y^+ - F_y^-$. These four nonnegative

WANG AND MORI: HIDDEN PART MODELS FOR HUMAN ACTION RECOGNITION: PROBABILISTIC VERSUS MAX MARGIN 1311

Fig. 1. High-level overview of our model: A human action is represented by the combination of large-scale template features (we use optical flow in this work) and a collection of local "parts." Our goal is to unify both features in a principled way for action recognition.


channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels $F_x^{b+}$, $F_x^{b-}$, $F_y^{b+}$, and $F_y^{b-}$ (see Fig. 2).
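As a concrete sketch of this step, the half-wave rectification and Gaussian blurring can be written down directly. This is an illustration, not the authors' code: the paper does not specify the normalization scheme, so it is omitted here, and `sigma` and the kernel radius are illustrative choices.

```python
import numpy as np

def _gaussian_kernel1d(sigma, radius):
    # Normalized 1D Gaussian kernel of length 2*radius + 1.
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def _blur(a, sigma):
    # Separable Gaussian blur: convolve each row, then each column.
    k = _gaussian_kernel1d(sigma, radius=int(3 * sigma))
    a = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, a)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, a)

def half_wave_rectify(Fx, Fy):
    """Split the flow components into four nonnegative channels so that
    Fx = Fx_pos - Fx_neg and Fy = Fy_pos - Fy_neg."""
    return (np.maximum(Fx, 0.0), np.maximum(-Fx, 0.0),
            np.maximum(Fy, 0.0), np.maximum(-Fy, 0.0))

def motion_descriptor(Fx, Fy, sigma=1.0):
    """Blur each rectified channel with a Gaussian kernel, giving the four
    'blurry' channels of the descriptor (normalization omitted)."""
    return tuple(_blur(c, sigma) for c in half_wave_rectify(Fx, Fy))
```

Because the Gaussian kernel is nonnegative, the blurred channels remain nonnegative, as the descriptor requires.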

3.2 Hidden Part Model

Now, we describe how we model a frame $I$ in a video sequence. Let $x$ be the motion feature of this frame and $y$ be the corresponding class label of this frame, ranging over a finite label alphabet $\mathcal{Y}$. Our task is to learn a mapping from $x$ to $y$. We assume that each image $I$ contains a set of salient patches $\{I_1, I_2, \ldots, I_m\}$. We will describe how to find these salient patches in Section 3.3. Our training set consists of labeled images $\langle x^{(t)}, y^{(t)} \rangle$ (as a notation convention, we use superscripts in brackets to index training images and subscripts to index patches) for $t = 1, 2, \ldots, N$, where $y^{(t)} \in \mathcal{Y}$ and $x^{(t)} = (x^{(t)}_1, x^{(t)}_2, \ldots, x^{(t)}_m)$. Here $x^{(t)}_i = x^{(t)}(I^{(t)}_i)$ is the feature vector extracted from the global motion feature $x^{(t)}$ at the location of the patch $I^{(t)}_i$. For each image $I = \{I_1, I_2, \ldots, I_m\}$, we assume that there exists a vector of hidden "part" variables $h = \{h_1, h_2, \ldots, h_m\}$, where each $h_i$ takes values from a finite set $\mathcal{H}$ of possible parts. Intuitively, each $h_i$ assigns a part label to the patch $I_i$, where $i = 1, 2, \ldots, m$. For example, for the action "waving-two-hands," these parts may be used to characterize the movement patterns of the left and right arms. The values of $h$ are not observed in the training set and will become the hidden variables of the model.

We assume that there are certain constraints between some pairs $(h_j, h_k)$. For example, in the case of "waving-two-hands," two patches $h_j$ and $h_k$ on the left hand might have the constraint that they tend to have the same part label, since both of them are characterized by the movement of the left hand. If we consider the $h_i$ $(i = 1, 2, \ldots, m)$ to be vertices in a graph $G = (\mathcal{V}, \mathcal{E})$, the constraint between $h_j$ and $h_k$ is denoted by an edge $(j, k) \in \mathcal{E}$. See Fig. 3 for an illustration of our model. Note that the graph structure can be different for different images. We will describe how to find the graph structure $\mathcal{E}$ in Section 3.3.

Given the motion feature $x$ of an image $I$, its corresponding class label $y$, and part labels $h$, a hidden conditional random field is defined as

$$p(y, h \mid x; \theta) = \frac{\exp\big(\theta^\top \Phi(x, h, y)\big)}{\sum_{y' \in \mathcal{Y}} \sum_{h' \in \mathcal{H}^m} \exp\big(\theta^\top \Phi(x, h', y')\big)},$$

where $\theta$ is the model parameter and $\Phi(x, h, y)$ is a feature vector depending on the motion feature $x$, the class label $y$, and the part labels $h$. We use $\mathcal{H}^m$ to denote the set of all possible labelings of the $m$ hidden parts. It follows that

$$p(y \mid x; \theta) = \sum_{h \in \mathcal{H}^m} p(y, h \mid x; \theta) = \frac{\sum_{h \in \mathcal{H}^m} \exp\big(\theta^\top \Phi(x, h, y)\big)}{\sum_{y' \in \mathcal{Y}} \sum_{h \in \mathcal{H}^m} \exp\big(\theta^\top \Phi(x, h, y')\big)}.$$
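For intuition, these two sums can be computed by brute force when $m$ and $|\mathcal{H}|$ are tiny. The sketch below is illustrative only (the paper never enumerates $\mathcal{H}^m$; it uses belief propagation instead), and treats $\theta^\top \Phi(x, h, y)$ as an opaque `score` callable:

```python
import itertools
import math

def hcrf_posterior(score, x, labels, parts, m):
    """Brute-force p(y | x; theta): sum exp(score(x, h, y)) over every
    h in H^m for each label y, then normalize over all (h, y) pairs.
    Feasible only for tiny m and |H|; it exists to make the definition
    concrete, not as a practical inference routine."""
    unnorm = {
        y: sum(math.exp(score(x, h, y))
               for h in itertools.product(parts, repeat=m))
        for y in labels
    }
    Z = sum(unnorm.values())  # partition function over (h, y)
    return {y: v / Z for y, v in unnorm.items()}
```

A handy sanity check: with a score that is symmetric across labels, the posterior is uniform.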

We assume that $\theta^\top \Phi(x, h, y)$ has the following form:

$$\theta^\top \Phi(x, h, y) = \sum_{j \in \mathcal{V}} \alpha^\top \phi(x_j, h_j) + \sum_{j \in \mathcal{V}} \beta^\top \varphi(y, h_j) + \sum_{(j,k) \in \mathcal{E}} \gamma^\top \psi(y, h_j, h_k) + \eta^\top \omega(y, x), \qquad (1)$$

where $\phi(\cdot)$ and $\varphi(\cdot)$ are feature vectors depending on unary $h_j$'s, $\psi(\cdot)$ is a feature vector depending on pairs $(h_j, h_k)$, and $\omega(\cdot)$ is a feature vector that does not depend on the values of the hidden variables. The details of these feature vectors are described in the following.

Unary potential $\alpha^\top \phi(x_j, h_j)$: This potential function models the compatibility between $x_j$ and part label $h_j$, i.e., how likely it is that patch $x_j$ is labeled as part $h_j$. It is parameterized as

$$\alpha^\top \phi(x_j, h_j) = \sum_{c \in \mathcal{H}} \alpha_c^\top \cdot \mathbb{1}\{h_j = c\} \cdot \big[f_a(x_j)\; f_s(x_j)\big], \qquad (2)$$

where we use $[f_a(x_j)\; f_s(x_j)]$ to denote the concatenation of the two vectors $f_a(x_j)$ and $f_s(x_j)$. $f_a(x_j)$ is a feature vector describing the appearance of the patch $x_j$. In our case, $f_a(x_j)$ is simply the concatenation of the four channels of the motion features at patch $x_j$, i.e., $f_a(x_j) = [F_x^{b+}(x_j)\; F_x^{b-}(x_j)\; F_y^{b+}(x_j)\; F_y^{b-}(x_j)]$. $f_s(x_j)$ is a feature vector describing the spatial location of the patch $x_j$. We discretize the image locations into $l$ bins, and $f_s(x_j)$ is a length-$l$ vector of all zeros, with a single one for the bin occupied by $x_j$.

Fig. 2. Construction of the motion descriptor. (a) Original image. (b) Optical flow. (c) $x$ and $y$ components of the optical flow vectors, $F_x$ and $F_y$. (d) Half-wave rectification of the $x$ and $y$ components to obtain four separate channels, $F_x^+$, $F_x^-$, $F_y^+$, $F_y^-$. (e) Final blurry motion descriptors $F_x^{b+}$, $F_x^{b-}$, $F_y^{b+}$, $F_y^{b-}$.

Fig. 3. Illustration of the model. Each circle corresponds to a variable and each square corresponds to a factor in the model.

The parameter $\alpha_c$ can be interpreted as the measurement of compatibility between the feature vector $[f_a(x_j)\; f_s(x_j)]$ and the part label $h_j = c$. The parameter $\alpha$ is simply the concatenation of

$\alpha_c$ for all $c \in \mathcal{H}$.

Unary potential $\beta^\top \varphi(y, h_j)$: This potential function models the compatibility between class label $y$ and part label $h_j$, i.e., how likely it is that an image with class label $y$ contains a patch with part label $h_j$. It is parameterized as

$$\beta^\top \varphi(y, h_j) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}} \beta_{a,b} \cdot \mathbb{1}\{y = a\} \cdot \mathbb{1}\{h_j = b\}, \qquad (3)$$

where $\beta_{a,b}$ indicates the compatibility between $y = a$ and $h_j = b$.

Pairwise potential $\gamma^\top \psi(y, h_j, h_k)$: This pairwise potential function models the compatibility between class label $y$ and a pair of part labels $(h_j, h_k)$, i.e., how likely it is that an image with class label $y$ contains a pair of patches with part labels $h_j$ and $h_k$, where $(j, k) \in \mathcal{E}$ corresponds to an edge in the graph. It is parameterized as

$$\gamma^\top \psi(y, h_j, h_k) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{H}} \sum_{c \in \mathcal{H}} \gamma_{a,b,c} \cdot \mathbb{1}\{y = a\} \cdot \mathbb{1}\{h_j = b\} \cdot \mathbb{1}\{h_k = c\}, \qquad (4)$$

where $\gamma_{a,b,c}$ indicates the compatibility of $y = a$, $h_j = b$, and $h_k = c$ for the edge $(j, k) \in \mathcal{E}$.

Root model $\eta^\top \omega(y, x)$: The root model is a potential function that models the compatibility of class label $y$ and the large-scale global feature of the whole image. It is parameterized as

$$\eta^\top \omega(y, x) = \sum_{a \in \mathcal{Y}} \eta_a^\top \cdot \mathbb{1}\{y = a\} \cdot g(x), \qquad (5)$$

where $g(x)$ is a feature vector describing the appearance of the whole image. In our case, $g(x)$ is the concatenation of all four channels of the motion features in the image, i.e., $g(x) = [F_x^{b+}\; F_x^{b-}\; F_y^{b+}\; F_y^{b-}]$. $\eta_a$ can be interpreted as a root filter that measures the compatibility between the appearance $g(x)$ of an image and a class label $y = a$. And $\eta$ is simply the concatenation of $\eta_a$ for all $a \in \mathcal{Y}$.

The parametrization of $\theta^\top \Phi(x, h, y)$ is similar to that used

in object recognition [4]. But there are two important differences. First of all, our definition of the unary potential function $\phi(\cdot)$ encodes both appearance and spatial information of the patches. Second, we have a potential function $\omega(\cdot)$ describing the large-scale appearance of the whole image. The representation in Quattoni et al. [4] only models local patches extracted from the image. This may be appropriate for object recognition, but, for human action recognition, it is not clear that local patches can be sufficiently informative. We will demonstrate this experimentally in Section 6.

3.3 Learning and Inference

Let $D = (\langle x^{(1)}, y^{(1)} \rangle, \langle x^{(2)}, y^{(2)} \rangle, \ldots, \langle x^{(N)}, y^{(N)} \rangle)$ be a set of labeled training examples; the model parameters $\theta$ are learned by maximizing the conditional log-likelihood on the training images:

$$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{t=1}^{N} L_t(\theta) = \arg\max_\theta \sum_{t=1}^{N} \log p\big(y^{(t)} \mid x^{(t)}; \theta\big) = \arg\max_\theta \sum_{t=1}^{N} \log \bigg( \sum_{h} p\big(y^{(t)}, h \mid x^{(t)}; \theta\big) \bigg), \qquad (6)$$

where $L_t(\theta)$ denotes the conditional log-likelihood of the $t$th training example and $L(\theta)$ denotes the conditional log-likelihood of the whole training set $D$. Differently from the CRF [27], the objective function $L(\theta)$ of the HCRF is not concave, due to the hidden variables $h$. But we can still use gradient ascent to find a $\theta$ that is locally optimal. The gradient of the conditional log-likelihood $L_t(\theta)$ with respect to the $t$th training example $(x^{(t)}, y^{(t)})$ can be calculated as

$$\frac{\partial L_t(\theta)}{\partial \alpha} = \sum_{j \in \mathcal{V}} \Big[ \mathbb{E}_{p(h_j \mid y^{(t)}, x^{(t)}; \theta)}\, \phi\big(x^{(t)}_j, h_j\big) - \mathbb{E}_{p(h_j, y \mid x^{(t)}; \theta)}\, \phi\big(x^{(t)}_j, h_j\big) \Big]$$

$$\frac{\partial L_t(\theta)}{\partial \beta} = \sum_{j \in \mathcal{V}} \Big[ \mathbb{E}_{p(h_j \mid y^{(t)}, x^{(t)}; \theta)}\, \varphi\big(y^{(t)}, h_j\big) - \mathbb{E}_{p(h_j, y \mid x^{(t)}; \theta)}\, \varphi\big(y, h_j\big) \Big]$$

$$\frac{\partial L_t(\theta)}{\partial \gamma} = \sum_{(j,k) \in \mathcal{E}} \Big[ \mathbb{E}_{p(h_j, h_k \mid y^{(t)}, x^{(t)}; \theta)}\, \psi\big(y^{(t)}, h_j, h_k\big) - \mathbb{E}_{p(h_j, h_k, y \mid x^{(t)}; \theta)}\, \psi\big(y, h_j, h_k\big) \Big]$$

$$\frac{\partial L_t(\theta)}{\partial \eta} = \omega\big(y^{(t)}, x^{(t)}\big) - \mathbb{E}_{p(y \mid x^{(t)}; \theta)}\, \omega\big(y, x^{(t)}\big). \qquad (7)$$

The expectations needed for computing the gradient can be obtained by belief propagation [4].
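The expectation-difference form of the gradient can be sanity-checked numerically on a toy model. The sketch below uses hypothetical names and enumerates $\mathcal{H}^m$ exhaustively instead of running belief propagation; it computes $\log p(y^{(t)} \mid x^{(t)}; \theta)$ together with its gradient as a difference of conditional and marginal feature expectations, the same structure as (7):

```python
import itertools
import math

def loglik_and_grad(theta, phi, x, y_obs, labels, parts, m):
    """log p(y_obs | x; theta) for a tiny HCRF with linear scores
    theta . phi(x, h, y), plus its gradient
    E[phi | y_obs, x] - E[phi | x], by exhaustive enumeration of H^m."""
    hs = list(itertools.product(parts, repeat=m))
    weight = {(h, y): math.exp(sum(t * f for t, f in zip(theta, phi(x, h, y))))
              for y in labels for h in hs}
    Z = sum(weight.values())                      # sum over all (h, y)
    Z_obs = sum(weight[(h, y_obs)] for h in hs)   # sum over h for y_obs
    d = len(theta)
    cond = [0.0] * d   # E[phi | y_obs, x]
    marg = [0.0] * d   # E[phi | x]
    for (h, y), w in weight.items():
        f = phi(x, h, y)
        for i in range(d):
            marg[i] += (w / Z) * f[i]
            if y == y_obs:
                cond[i] += (w / Z_obs) * f[i]
    return math.log(Z_obs / Z), [c - g for c, g in zip(cond, marg)]
```

Differentiating $\log(Z_{y^{(t)}} / Z)$ with respect to each weight reproduces exactly this conditional-minus-marginal expectation, which a finite-difference check confirms.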

3.4 Implementation Details

Now, we describe several details about how the above ideas are implemented.

Learning the root filter $\eta$: Given a set of training images $\langle x^{(t)}, y^{(t)} \rangle$, we first learn the root filter $\eta$ by solving the following optimization problem:

$$\eta^* = \arg\max_\eta \sum_{t=1}^{N} \log L_{\mathrm{root}}\big(y^{(t)} \mid x^{(t)}; \eta\big) = \arg\max_\eta \sum_{t=1}^{N} \log \frac{\exp\big(\eta^\top \omega(y^{(t)}, x^{(t)})\big)}{\sum_{y} \exp\big(\eta^\top \omega(y, x^{(t)})\big)}. \qquad (8)$$

In other words, $\eta^*$ is learned by considering only the feature vector $\omega(\cdot)$. This reduces the number of parameters that need to be considered initially. Similar tricks have been used in [3], [32]. We then use $\eta^*$ as the starting point for $\eta$ in the gradient ascent (7). The other parameters $\alpha$, $\beta$, and $\gamma$ are initialized randomly.

Patch initialization: We use a simple heuristic, similar to that used in [3], to initialize 10 salient patches on every training image from the root filter $\eta^*$ trained above. For each training image $I$ with class label $a$, we apply the root filter $\eta_a$ on $I$, then select a rectangular region of size $5 \times 5$ in the image that has the most positive energy. We zero out the weights in this region and repeat until 10 patches are selected. See Fig. 6a for examples of the patches found in some images. The tree $G = (\mathcal{V}, \mathcal{E})$ is formed by running a minimum spanning tree algorithm over the 10 patches.

Inference: During testing, we do not know the class label of a given test image, so we cannot use the patch initialization described above, since we do not know which root filter to use. Instead, we run the root filters from all of the classes on a test image, then calculate the probabilities of all possible instantiations of patches under our learned model, and classify the image by picking the class label that gives the maximum of these probabilities.
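The greedy selection step can be made concrete in a few lines. This is a sketch under assumptions: the paper does not specify how the root-filter response becomes a per-pixel energy map, so `energy` is simply taken as given, and ties are broken by scan order.

```python
def init_patches(energy, num_patches=10, size=5):
    """Greedy patch initialization in the spirit of Section 3.4: repeatedly
    pick the size x size window with the largest summed energy, then zero
    that window out so subsequent picks land elsewhere.  `energy` is a 2D
    list of nonnegative per-pixel root-filter responses (an assumption of
    this sketch).  Returns the top-left corners of the chosen windows."""
    E = [row[:] for row in energy]          # work on a copy
    H, W = len(E), len(E[0])
    patches = []
    for _ in range(num_patches):
        best, best_pos = None, None
        for i in range(H - size + 1):       # scan all window positions
            for j in range(W - size + 1):
                s = sum(E[i + di][j + dj]
                        for di in range(size) for dj in range(size))
                if best is None or s > best:
                    best, best_pos = s, (i, j)
        i, j = best_pos
        patches.append((i, j))
        for di in range(size):              # suppress the chosen window
            for dj in range(size):
                E[i + di][j + dj] = 0.0
    return patches
```

The quadratic rescan per pick is fine at these sizes; a summed-area table would make each window sum O(1) if needed.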

4 MAX-MARGIN HIDDEN CONDITIONAL RANDOM FIELDS

In this section, we present an alternative training method for learning the model parameter $\theta$. Our learning method is inspired by the success of max-margin methods in machine learning [3], [28], [30], [33]. Given a learned model, classification is achieved by first finding the best labeling of the hidden parts for each action, then picking the action label with the highest score. The learning algorithm aims to set the model parameters so that the scores of correct action labels on the training data are higher than the scores of incorrect action labels by a large margin. We call our approach the MMHCRF.

4.1 Model Formulation

We assume that an $\langle x, y \rangle$ pair is scored by a function of the form

$$f_\theta(x, y) = \max_{h} \theta^\top \Phi(x, h, y). \qquad (9)$$

Similarly to the HCRF, $\theta$ is the model parameter and $h$ is a vector of hidden variables. Please refer to Section 3.2 for a detailed description of $\theta^\top \Phi(x, h, y)$. In this paper, we consider the case in which $h = (h_1, h_2, \ldots, h_m)$ forms a tree-structured undirected graphical model, but our proposed model is a rather general framework and can be applied to a wide variety of structures. We will briefly discuss them in Section 4.4. Similarly to latent SVMs, MMHCRFs are instances of the general class of energy-based models [34].

The goal of learning is to learn the model parameter $\theta$ so that, for a new example $x$, we can classify $x$ as class $y^*$, where $y^* = \arg\max_y f_\theta(x, y)$.
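Concretely, this decision rule scores every class with its best part labeling and keeps the winner. The sketch below does the inner maximization by exhaustive search over $\mathcal{H}^m$ (the paper instead exploits the tree structure; the names here are illustrative):

```python
import itertools

def predict(score, x, labels, parts, m):
    """MMHCRF decision rule y* = argmax_y max_h score(x, h, y), where
    score(x, h, y) stands in for theta^T Phi(x, h, y).  The inner max
    over h is brute force, so this is only for tiny m and |H|."""
    def f(y):
        return max(score(x, h, y)
                   for h in itertools.product(parts, repeat=m))
    return max(labels, key=f)
```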

In analogy to classical SVMs, we would like to train $\theta$ from the labeled examples $D$ by solving the following optimization problem:

$$\min_{\theta, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{N} \xi^{(t)}$$
$$\text{s.t.} \quad f_\theta\big(x^{(t)}, y^{(t)}\big) - f_\theta\big(x^{(t)}, y\big) \geq 1 - \xi^{(t)}, \;\; \forall t, \forall y \neq y^{(t)}; \qquad \xi^{(t)} \geq 0, \;\; \forall t, \qquad (10)$$

where $C$ is the trade-off parameter, similar to that in SVMs, and $\xi^{(t)}$ is the slack variable for the $t$th training example, introduced to handle the case of a soft margin.

The optimization problem in (10) is equivalent to the following optimization problem:

$$\min_{\theta, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{N} \xi^{(t)}$$
$$\text{s.t.} \; \max_{\mathbf{h}} \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}, y\big) - \max_{\mathbf{h}'} \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}', y^{(t)}\big) \le \xi^{(t)} - \Delta\big(y, y^{(t)}\big), \; \forall t, \; \forall y,$$

where

$$\Delta\big(y, y^{(t)}\big) = \begin{cases} 1, & \text{if } y \ne y^{(t)} \\ 0, & \text{otherwise.} \end{cases} \qquad (11)$$

An alternative to the formulation in (11) is to convert the multiclass classification problem into several binary classifications (e.g., one-against-all), then solve each of them using LSVM. Although this alternative is simple and powerful, it cannot capture correlations between different classes since those binary problems are independent [33]. We will demonstrate experimentally (Section 6) that this alternative does not perform as well as our proposed method.

4.2 Semiconvexity and Primal Optimization

Similarly to LSVMs, MMHCRFs have the property of semiconvexity. Note that $f(x, y)$ is a maximum of a set of functions, each of which is linear in $\theta$, so $f(x, y)$ is convex in $\theta$. If we restrict the domain of $\mathbf{h}'$ in (11) to a single choice, the optimization problem of (11) becomes convex [35]. This is analogous to restricting the domain of the latent variables for the positive examples to a single choice in LSVMs [3]. But here we are dealing with multiclass classification; our "positive examples" are those $\langle x^{(t)}, y \rangle$ pairs where $y = y^{(t)}$.

We can compute a local optimum of (11) using a coordinate descent algorithm similar to that of LSVMs [3]:

1. Holding $\theta, \xi$ fixed, optimize the latent variables $\mathbf{h}'$ for the $\langle x^{(t)}, y^{(t)} \rangle$ pair:

$$\mathbf{h}^{(t)}_{y^{(t)}} = \arg\max_{\mathbf{h}'} \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}', y^{(t)}\big).$$

2. Holding $\mathbf{h}^{(t)}_{y^{(t)}}$ fixed, optimize $\theta, \xi$ by solving the following optimization problem:

$$\min_{\theta, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{N} \xi^{(t)}$$
$$\text{s.t.} \; \max_{\mathbf{h}} \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}, y\big) - \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_{y^{(t)}}, y^{(t)}\big) \le \xi^{(t)} - \Delta\big(y, y^{(t)}\big), \; \forall t, \; \forall y. \qquad (12)$$

It can be shown that both steps always improve or maintain the objective [3]. The optimization problem in Step 1 can be solved efficiently for certain structures of $\mathbf{h}'$ (see Section 4.4 for details). The optimization problem in Step 2 involves solving a quadratic program (QP) with piecewise linear constraints. Although it is possible to solve it directly using barrier methods [35], we would not be able to take advantage of existing highly optimized solvers (e.g., CPLEX), which only accept linear constraints. It is desirable to convert (12) into a standard quadratic program with only linear constraints.

One possible way to convert (12) into a standard QP is to solve the following convex optimization problem:

1314 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 7, JULY 2011


$$\min_{\theta, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{N} \xi^{(t)}$$
$$\text{s.t.} \; \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}, y\big) - \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_{y^{(t)}}, y^{(t)}\big) \le \xi^{(t)} - \Delta\big(y, y^{(t)}\big), \; \forall t, \; \forall \mathbf{h}, \; \forall y. \qquad (13)$$

It is easy to see that (12) and (13) are equivalent, and all of the constraints in (13) are linear. Unfortunately, the optimization problem in (13) involves an exponential number of constraints: for each example $x^{(t)}$ and each possible labeling $y$, there are exponentially many possible $\mathbf{h}$s.

We would like to perform optimization over a much smaller set of constraints. One solution is to use a cutting plane algorithm similar to those used in structured SVMs [30] and CRFs [36]. In a nutshell, the algorithm starts with no constraints (which corresponds to a relaxed version of (13)), then iteratively finds the "most violated" constraints and adds them. It can be shown that this algorithm computes an arbitrarily close approximation to the original problem of (13) by evaluating only a polynomial number of constraints.

More importantly, the optimization problem in (13) has certain properties that allow us to find and add constraints in an efficient way. For a fixed example $x^{(t)}$ and a possible label $y$, define $\mathbf{h}^{(t)}_y$ as follows:

$$\mathbf{h}^{(t)}_y = \arg\max_{\mathbf{h}} \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}, y\big).$$

Consider the following two sets of constraints for the $\langle x^{(t)}, y \rangle$ pair:

$$\theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_y, y\big) - \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_{y^{(t)}}, y^{(t)}\big) \le \xi^{(t)} - \Delta\big(y, y^{(t)}\big), \qquad (14)$$

$$\theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}, y\big) - \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_{y^{(t)}}, y^{(t)}\big) \le \xi^{(t)} - \Delta\big(y, y^{(t)}\big), \; \forall \mathbf{h}. \qquad (15)$$

It is easy to see that within a local neighborhood of $\theta$, (14) and (15) define the same set of constraints, i.e., (14) implies (15) and vice versa. This suggests that for a fixed $\langle x^{(t)}, y \rangle$ pair, we only need to consider the constraint involving $\mathbf{h}^{(t)}_y$.

Putting everything together, we learn the model parameter $\theta$ by iterating the following two steps:

1. Fixing $\theta, \xi$, optimize the latent variable $\mathbf{h}$ for each pair $\langle x^{(t)}, y \rangle$ of an example $x^{(t)}$ and a possible labeling $y$:

$$\mathbf{h}^{(t)}_y = \arg\max_{\mathbf{h}} \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}, y\big).$$

2. Fixing $\mathbf{h}^{(t)}_y$, $\forall t, \forall y$, optimize $\theta, \xi$ by solving the following optimization problem:

$$\min_{\theta, \xi} \; \frac{1}{2}\|\theta\|^2 + C \sum_{t=1}^{N} \xi^{(t)}$$
$$\text{s.t.} \; \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_y, y\big) - \theta^{\top}\Phi\big(x^{(t)}, \mathbf{h}^{(t)}_{y^{(t)}}, y^{(t)}\big) \le \xi^{(t)} - \Delta\big(y, y^{(t)}\big), \; \forall t, \; \forall y. \qquad (16)$$

Step 1 in the above algorithm can be solved efficiently for certain structured $\mathbf{h}$ (Section 4.4). Step 2 involves solving a quadratic program with $N \times |\mathcal{Y}|$ constraints.
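A minimal sketch of this alternating scheme follows. It is our own simplification, not the paper's solver: inference is done by brute-force enumeration (tractable only at toy sizes), and the QP of Step 2 is replaced by sub-gradient steps on the equivalent regularized hinge objective; `phi` is an assumed user-supplied joint feature map.

```python
import itertools
import numpy as np

def best_h(theta, phi, x, y, num_parts, num_part_labels):
    """Step 1: h_y^(t) = argmax_h theta^T Phi(x, h, y) (enumeration, toy sizes)."""
    return max(itertools.product(range(num_part_labels), repeat=num_parts),
               key=lambda h: float(theta @ phi(x, h, y)))

def train(data, classes, phi, dim, num_parts, num_part_labels,
          C=1.0, outer_iters=5, inner_iters=60, lr=0.01):
    """Alternate between filling in hidden parts (Step 1) and decreasing the
    margin-violation objective (Step 2). Step 2 here uses sub-gradient descent
    instead of an exact QP solver, purely for illustration."""
    theta = np.zeros(dim)
    for _ in range(outer_iters):
        # Step 1: fix theta; pick the best h for every (example, label) pair
        H = {(t, y): best_h(theta, phi, x, y, num_parts, num_part_labels)
             for t, (x, _) in enumerate(data) for y in classes}
        # Step 2: fix the h's; sub-gradient steps on
        # 1/2 ||theta||^2 + C * sum_t ( max_y [score(y) + Delta(y, y_t)] - score(y_t) )
        for _ in range(inner_iters):
            grad = theta.copy()  # gradient of the regularizer
            for t, (x, y_true) in enumerate(data):
                y_hat = max(classes,
                            key=lambda y: float(theta @ phi(x, H[(t, y)], y))
                            + (0.0 if y == y_true else 1.0))
                if y_hat != y_true:  # margin violated: hinge term is active
                    grad += C * (phi(x, H[(t, y_hat)], y_hat)
                                 - phi(x, H[(t, y_true)], y_true))
            theta = theta - lr * grad
    return theta
```

On a tiny linearly separable toy problem this sketch recovers a parameter vector that classifies the training set correctly.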

The optimization in (16) is very similar to the primal problem of a standard multiclass SVM [33]. In fact, if $\mathbf{h}^{(t)}_y$ were the same for all $y$, it would be a standard SVM, and we could use an off-the-shelf SVM solver to optimize (16). Unfortunately, the fact that $\mathbf{h}^{(t)}_y$ can vary with $y$ means that we cannot directly use standard SVM packages. We instead develop our own optimization algorithm.

4.3 Dual Optimization

Analogously to classical SVMs, it is helpful to solve the problem in (16) by examining its dual. To simplify the notation, let us define $\Psi(x^{(t)}, y) = \Phi(x^{(t)}, \mathbf{h}^{(t)}_y, y) - \Phi(x^{(t)}, \mathbf{h}^{(t)}_{y^{(t)}}, y^{(t)})$. Then, the dual problem of (16) can be written as follows:

$$\max_{\alpha} \; \sum_{t=1}^{N} \sum_{y} \alpha_{t,y} \, \Delta\big(y, y^{(t)}\big) - \frac{1}{2} \Big\| \sum_{t=1}^{N} \sum_{y} \alpha_{t,y} \, \Psi\big(x^{(t)}, y\big) \Big\|^2$$
$$\text{s.t.} \; \sum_{y} \alpha_{t,y} = C, \; \forall t; \qquad \alpha_{t,y} \ge 0, \; \forall t, \; \forall y. \qquad (17)$$

The primal variable $\theta$ can be obtained from the dual variables as follows:

$$\theta = -\sum_{t=1}^{N} \sum_{y} \alpha_{t,y} \, \Psi\big(x^{(t)}, y\big).$$

Note that (17) is quite similar to the dual form of standard multiclass SVMs. In fact, if $\mathbf{h}^{(t)}_y$ is a deterministic function of $x^{(t)}$, (17) is just a standard dual form of SVMs. Similarly to classical SVMs, we can also obtain a kernelized version of the algorithm by defining a kernel matrix of size $N|\mathcal{Y}| \times N|\mathcal{Y}|$ in the following form:

$$K(t, y; s, y') = \Psi\big(x^{(t)}, y\big)^{\top} \Psi\big(x^{(s)}, y'\big).$$
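Given the $\Psi$ vectors, this kernel matrix is simply all pairwise dot products. A sketch, where the flat ordering of the $(t, y)$ pairs is our own choice:

```python
import numpy as np

def kernel_matrix(psi_vectors):
    """K[(t,y),(s,y')] = Psi(x^(t), y)^T Psi(x^(s), y'), with one row per
    (example, label) pair, so K is (N*|Y|) x (N*|Y|)."""
    P = np.stack(psi_vectors)  # shape (N*|Y|, d)
    return P @ P.T             # all pairwise dot products at once
```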

Let us define $\alpha$ as the concatenation of $\{\alpha_{t,y} : \forall t, \forall y\}$, so the length of $\alpha$ is $N|\mathcal{Y}|$. Define $\Delta$ as a vector of the same length, whose $(t, y)$-th entry is 1 if $y \ne y^{(t)}$ and 0 otherwise. Then, (17) can be written as

$$\max_{\alpha} \; \alpha^{\top}\Delta - \frac{1}{2} \alpha^{\top} K \alpha$$
$$\text{s.t.} \; \sum_{y} \alpha_{t,y} = C, \; \forall t; \qquad \alpha_{t,y} \ge 0, \; \forall t, \; \forall y. \qquad (18)$$

Note that the matrix $K$ in (18) depends only on the dot products between the feature vectors of different $\langle x^{(t)}, y \rangle$ pairs. So our model has a very intuitive and interesting interpretation: it defines a particular kernel function that respects the latent structures.

It is easy to show that the optimization problem in (17) is concave, so we can find its global optimum. But the number of variables is $N \times |\mathcal{Y}|$, where $N$ is the number of training examples and $|\mathcal{Y}|$ is the number of possible class labels, so it is infeasible to use a generic QP solver to optimize it.

WANG AND MORI: HIDDEN PART MODELS FOR HUMAN ACTION RECOGNITION: PROBABILISTIC VERSUS MAX MARGIN 1315


Instead, we decompose the optimization problem of (17) and solve a series of smaller QPs. This is similar to the sequential minimal optimization (SMO) used in SVMs [33], [37] and M3Ns [28]. The basic idea of this algorithm is to choose all of the $\{\alpha_{t,y} : \forall y \in \mathcal{Y}\}$ for a particular training example $x^{(t)}$ and fix all of the other variables $\{\alpha_{s,y'} : \forall s \ne t, \forall y' \in \mathcal{Y}\}$ that do not involve $x^{(t)}$. Then, instead of solving a QP involving all the variables $\{\alpha_{t,y} : \forall t, \forall y\}$, we can solve a much smaller QP involving only $\{\alpha_{t,y} : \forall y\}$. The number of variables of this smaller QP is $|\mathcal{Y}|$, which is much smaller than $N \times |\mathcal{Y}|$.

First, we write the objective of (17) in terms of $\{\alpha_{t,y} : \forall y\}$ as follows:

$$L(\{\alpha_{t,y} : \forall y\}) = \sum_{y} \alpha_{t,y} \, \Delta\big(y, y^{(t)}\big) - \frac{1}{2}\Bigg[ \Big\| \sum_{y} \alpha_{t,y} \Psi\big(x^{(t)}, y\big) \Big\|^2 + 2\Big( \sum_{y} \alpha_{t,y} \Psi\big(x^{(t)}, y\big) \Big)^{\!\top} \Big( \sum_{s: s \ne t} \sum_{y'} \alpha_{s,y'} \Psi\big(x^{(s)}, y'\big) \Big) \Bigg] + \text{terms not involving } \{\alpha_{t,y} : \forall y\}.$$

The smaller QP corresponding to $\langle x^{(t)}, y^{(t)} \rangle$ can be written as follows:

$$\max_{\alpha_{t,y}: \forall y} \; L(\{\alpha_{t,y} : \forall y\})$$
$$\text{s.t.} \; \sum_{y} \alpha_{t,y} = C; \qquad \alpha_{t,y} \ge 0, \; \forall y. \qquad (19)$$

Note that $\sum_{s: s \ne t} \sum_{y'} \alpha_{s,y'} \Psi(x^{(s)}, y')$ can be written as

$$-\theta - \sum_{y} \alpha_{t,y} \Psi\big(x^{(t)}, y\big).$$

So, as long as we maintain (and keep updating) the global parameter $\theta$ and keep track of $\alpha_{t,y}$ and $\Psi(x^{(t)}, y)$ for each example $\langle x^{(t)}, y^{(t)} \rangle$, we never need to actually compute the summation $\sum_{s: s \ne t} \sum_{y'}$ when optimizing (19). In addition, when we solve the QP involving $\alpha_{t,y}$ for a fixed $t$, the constraints involving $\alpha_{s,y}$ with $s \ne t$ are not affected. This is not the case if we try to optimize the primal problem in (16): if we optimize the primal variable $\theta$ by considering only the constraints involving the $t$-th example, the new $\theta$ obtained from the optimization might violate the constraints imposed by other examples. There is also work [38] showing that the dual optimization has a better convergence rate.

4.4 Finding the Optimal h

The alternating coordinate descent algorithm for learning the model parameter $\theta$ described in Section 4.2 assumes that we have an inference algorithm for finding the optimal $\mathbf{h}^*$ for a fixed $\langle x, y \rangle$ pair:

$$\mathbf{h}^* = \arg\max_{\mathbf{h}} \theta^{\top}\Phi(x, \mathbf{h}, y). \qquad (20)$$

In order to adapt our approach to problems involving different latent structures, this is the only component of the algorithm that needs to be changed.

If $\mathbf{h} = (h_1, h_2, \ldots, h_m)$ forms a tree-structured graphical model, the inference problem in (20) can be solved exactly, e.g., using the Viterbi dynamic programming algorithm for trees. We can also solve it using standard linear programming (LP) as follows [29], [39]: We introduce variables $\mu_{j;a}$ to denote the indicator $\mathbb{1}[h_j = a]$ for all vertices $j \in V$ and their values $a \in \mathcal{H}$. Similarly, we introduce variables $\mu_{jk;ab}$ to denote the indicator $\mathbb{1}[h_j = a, h_k = b]$ for all edges $(j, k) \in E$ and the values of their nodes, $a \in \mathcal{H}$, $b \in \mathcal{H}$. We use $\phi_j(h_j)$ to collectively represent the summation of all of the unary potential functions in (1) that involve the node $j \in V$, and $\phi_{jk}(h_j, h_k)$ to collectively represent the summation of all the pairwise potential functions in (1) that involve the edge $(j, k) \in E$. The problem of finding the optimal $\mathbf{h}^*$ can then be formulated as the following linear program:

$$\max_{0 \le \mu \le 1} \; \sum_{j \in V} \sum_{a \in \mathcal{H}} \mu_{j;a} \, \phi_j(a) + \sum_{(j,k) \in E} \sum_{a \in \mathcal{H}} \sum_{b \in \mathcal{H}} \mu_{jk;ab} \, \phi_{jk}(a, b)$$
$$\text{s.t.} \; \sum_{a \in \mathcal{H}} \mu_{j;a} = 1, \; \forall j \in V$$
$$\sum_{a \in \mathcal{H}} \sum_{b \in \mathcal{H}} \mu_{jk;ab} = 1, \; \forall (j,k) \in E$$
$$\sum_{a \in \mathcal{H}} \mu_{jk;ab} = \mu_{k;b}, \; \forall (j,k) \in E, \; \forall b \in \mathcal{H}$$
$$\sum_{b \in \mathcal{H}} \mu_{jk;ab} = \mu_{j;a}, \; \forall (j,k) \in E, \; \forall a \in \mathcal{H}. \qquad (21)$$

If the optimal solution of this LP is integral, we can recover $\mathbf{h}^*$ from $\mu^*$ very easily. It has been shown that if $E$ forms a forest, the optimal solution of this LP is guaranteed to be integral [29], [39]. For general graph topologies, the optimal solution of this LP can be fractional, which is not surprising since the problem in (20) is NP-hard for general graphs. Although the LP formulation does not seem particularly advantageous for tree-structured models, since they can be solved by Viterbi dynamic programming anyway, it provides a more general way of approaching other structures (e.g., Markov networks with submodular potentials, matching [29]).

5 DISCUSSION

HCRF and MMHCRF can be thought of as two different approaches for solving the general problem of classification with structured latent variables. Many problems in computer vision can be formulated in this general problem setting. Consider the following three vision tasks: 1) Pedestrian detection: It can be formulated as a binary classification problem that classifies an image patch $x$ as $+1$ (pedestrian) or $0$ (nonpedestrian). The locations of the body parts can be considered as latent variables in this case, since most of the existing training data sets for pedestrian detection do not provide this information. These latent variables are also structured; e.g., the location of the torso imposes certain constraints on the possible locations of other parts. Previous approaches [3], [40] usually use a tree-structured model to capture these constraints. 2) Object recognition: This problem is to assign a class label to an image if it contains the object of interest. If we consider the figure/ground labeling of the pixels of the image as latent variables, object recognition is also a problem of classification with latent variables. The latent


variables are also structured, typically represented by a grid-structured graph. 3) Object identification: Given two images, the task is to decide whether they are two images of the same object or not. If an image is represented by a set of patches found by interest-point operators, one particular way to solve this problem is to first find the correspondence between patches in the two images, then learn a binary classifier based on the result of the correspondence [41]. Of course, the correspondence information is "latent" (not available in the training data) and "structured" (assuming that one patch in one image matches one or zero patches in the other image, this creates a combinatorial structure [29]). This general problem setting occurs in other fields as well. For example, in natural language processing, Cherry and Quirk [42] use a similar idea for sentence classification by considering the parse tree of a sentence as the hidden variable. In bioinformatics, Yu and Joachims [43] solve the motif finding problem in yeast DNA by considering the positions of motifs as hidden variables.

One simplified approach to solving the above-mentioned problems is to ignore the latent structures and treat them as standard classification problems, as in Dalal and Triggs [5] for pedestrian detection. However, there is evidence [3], [4], [15], [40] showing that incorporating latent structures into the system can improve performance.

Learning with latent structures has a long history in machine learning. In visual recognition, probabilistic latent semantic analysis (pLSA) [44] and latent Dirichlet allocation (LDA) [45] are representative examples of generative classification models with latent variables that have been widely used. As far as we know, however, there has only been some recent effort [3], [4], [15], [16], [43], [46] on incorporating latent variables into discriminative models. It is widely believed in the machine learning literature that discriminative methods typically outperform generative methods¹ (e.g., [27]), so it is desirable to develop discriminative methods with structured latent variables.

MMHCRFs are closely related to HCRFs. The main difference lies in their learning criteria: maximizing the margin in MMHCRFs versus maximizing the conditional likelihood in HCRFs. As a result, the learning algorithms for MMHCRFs and HCRFs involve solving two different types of inference problems: maximizing over $\mathbf{h}$ versus summing over $\mathbf{h}$. In the following, we compare and analyze these two approaches from both computational and modeling perspectives, and argue why MMHCRFs are preferred over HCRFs in many applications.

5.1 Computational Perspective

HCRFs and MMHCRFs share some commonality in terms of their learning algorithms. Both are iterative methods, and during each iteration both require solving an inference problem on each training example. Since the learning objectives of both algorithms are nonconvex, both can only find a local optimum. The computational complexity of both learning algorithms involves three factors: 1) the number of iterations, 2) the number of training examples, and 3) the computation needed on each example during an iteration. Of course, the number of training examples remains the same in both algorithms. If we assume that both algorithms take the same number of iterations, the main difference in their computational complexity comes from the third factor, i.e., the computation needed on each example during an iteration. This is mainly dominated by the run time of the inference algorithm: summing over $\mathbf{h}$ in HCRFs and maximizing over $\mathbf{h}$ in MMHCRFs. In other words, the computational complexity of the learning problem for either HCRFs or MMHCRFs is determined by the complexity of the corresponding inference problem. So, in order to understand the difference between HCRFs and MMHCRFs in terms of their learning algorithm complexity, we only need to focus on the complexity of their inference algorithms.

From the computational perspective, if $\mathbf{h}$ has a tree structure and $|\mathcal{H}|$ is relatively small, both inference problems (max versus sum) can be solved exactly, using dynamic programming and belief propagation, respectively. But the inference problem (maximization) in MMHCRFs can deal with a much wider range of latent structures. Here are a few examples [29] from computer vision (although these problems are not addressed in this paper):

• Binary Markov networks with submodular potentials, commonly encountered in figure/ground segmentation [48]. MMHCRFs can use LP [29] or graph cuts [48] to solve the inference problem. For HCRFs, the inference problem can only be solved approximately, e.g., using loopy BP or mean-field variational methods.

• Matching/correspondence (see the object identification example mentioned above, or the examples in [29]). The inference for this structure can be solved in MMHCRFs using LP [29], [49]. It is not clear how HCRFs can be used in this situation, since it requires summing over all possible matchings.

• Tree structures in which each node of $\mathbf{h}$ can take a large number of possible labels (e.g., all of the possible pixel locations in an image), i.e., $|\mathcal{H}|$ is big. If the pairwise potentials have certain properties, the distance transform [8] can be applied in MMHCRFs to solve the inference problem; this is essentially what has been done in [3]. The corresponding inference problem for HCRFs can be solved using convolution. But the distance transform ($O(|\mathcal{H}|)$) is more efficient than convolution ($O(|\mathcal{H}| \log |\mathcal{H}|)$).

5.2 Modeling Perspective

From the modeling perspective, we believe that MMHCRFs are better suited for classification than HCRFs. This is because HCRFs require summing over exponentially many $\mathbf{h}$s. In order to maximize the conditional likelihood, the learning algorithm of HCRFs has to try very hard to push the probabilities of many "wrong" labelings of $\mathbf{h}$ close to zero. But in MMHCRFs, the learning algorithm only needs to push apart the "correct" labeling and its next best competitor. Conceptually, the modeling criterion of MMHCRFs is easier to achieve and more relevant to classification than that of HCRFs. In the following, we give


1. We acknowledge that this view is not unchallenged, e.g., [47].


a detailed analysis of the differences between HCRFs and MMHCRFs in order to gain these insights.

Max margin versus log likelihood: The first difference between MMHCRFs and HCRFs lies in their training criteria, i.e., maximizing the margin in MMHCRFs and maximizing the conditional log likelihood in HCRFs. Here, we explain why max margin is a better training criterion using the synthetic classification problem illustrated in Table 1. To simplify the discussion, we assume a regular classification problem (without hidden structures) with three possible class labels, i.e., $\mathcal{Y} = \{1, 2, 3\}$. For simplicity, let us assume that there is only one datum $x$ in the training data set, and that the ground-truth label for $x$ is $Y = 2$. Suppose we have two choices of model parameters, $\theta^{(1)}$ and $\theta^{(2)}$. The conditional probabilities $p(Y|x; \theta^{(1)})$ and $p(Y|x; \theta^{(2)})$ are shown in the two rows of Table 1. If we choose the model parameter by maximizing the conditional probability on the training data, we will choose $\theta^{(1)}$ over $\theta^{(2)}$, since $p(Y=2|x; \theta^{(1)}) > p(Y=2|x; \theta^{(2)})$. But the problem with $\theta^{(1)}$ is that $x$ will be classified as $Y=3$ if we use $\theta^{(1)}$ as the model parameter, since $p(Y=3|x; \theta^{(1)}) > p(Y=2|x; \theta^{(1)})$.

Let us use $y^*$ to denote the ground-truth label of $x$ and $y'$ to denote the "best competitor" of $y^*$. If we define $y^{\text{opt}} = \arg\max_y p(Y=y|x; \theta)$, then $y'$ can be formally defined as follows:

$$y' = \begin{cases} y^{\text{opt}}, & \text{if } y^* \ne y^{\text{opt}} \\ \arg\max_{y: y \ne y^*} p(Y=y|x; \theta), & \text{if } y^* = y^{\text{opt}}. \end{cases}$$

If we define $\text{margin}(\theta) = p(Y=y^*|x; \theta) - p(Y=y'|x; \theta)$, we can see that $\text{margin}(\theta^{(2)}) = 0.45 - 0.43 = 0.02$ and $\text{margin}(\theta^{(1)}) = 0.48 - 0.5 = -0.02$. So, the max-margin criterion will choose $\theta^{(2)}$ over $\theta^{(1)}$. If we use $\theta^{(2)}$ as the model parameter to classify $x$, we will get the correct label $Y = 2$. This simple example shows that the max-margin training criterion is more closely related to the classification problem we are trying to solve in the first place.

Maximization versus summation: Another difference between MMHCRFs and HCRFs is that the former requires maximizing over $\mathbf{h}$s, while the latter requires summing over $\mathbf{h}$s. In addition to the computational issues mentioned above, the summation over all of the possible $\mathbf{h}$s in HCRFs may cause other problems as well. To make the discussion more concrete, let us consider the pedestrian detection setting in [3]. In this setting, $x$ is an image patch and $y$ is a binary class label, $+1$ or $0$, indicating whether the image patch $x$ is a pedestrian or not. The hidden variable $\mathbf{h}$ represents the locations of body parts. Recall that HCRFs need to compute the summation over all possible $\mathbf{h}$s in the following form:

$$p(y|x; \theta) = \sum_{\mathbf{h}} p(y, \mathbf{h}|x; \theta). \qquad (22)$$

The intuition behind (22) is that a correct labeling of $\mathbf{h}$ will have a higher probability $p(y, \mathbf{h}|x; \theta)$, while an incorrect labeling of $\mathbf{h}$ will have a lower one; hence, correct part locations will contribute more to $p(y|x; \theta)$ in (22).

However, this is not necessarily the case. In an image, there are exponentially many possible placements of the part locations, which means that $\mathbf{h}$ has exponentially many possible configurations. But only a very small number of those configurations are "correct" ones, i.e., ones that roughly correspond to the correct locations of body parts. Ideally, a good model will put high probabilities on those "correct" $\mathbf{h}$s and probabilities close to 0 on the "incorrect" ones. An incorrect $\mathbf{h}$ carries only a small probability, but since there are exponentially many of them, the summation in (22) can still be dominated by the incorrect $\mathbf{h}$s. This surprising effect is exactly the opposite of how we expect the model to behave, and it is partly due to the high dimensionality of the latent variable $\mathbf{h}$. Many counterintuitive properties of high-dimensional spaces have also been observed in statistics and machine learning.
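A back-of-the-envelope illustration of this domination effect, with made-up numbers standing in for "exponentially many":

```python
# Illustrative (made-up) numbers: a single "correct" configuration with a
# modest joint probability is swamped in the sum of Eq. (22) by a huge
# number of "incorrect" configurations that are each individually tiny.
p_correct = 1e-2          # p(y, h_correct | x; theta)
n_incorrect = 10**6       # stand-in for "exponentially many" configurations
p_each = 1e-7             # each incorrect h is nearly negligible ...
p_incorrect_total = n_incorrect * p_each   # ... but together they sum to ~0.1

assert p_incorrect_total > p_correct       # the sum is dominated by them
```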

We would like to emphasize that we do not mean to discredit HCRFs. In fact, we believe that HCRFs have their own advantages. The main merit of HCRFs lies in their rigorous probabilistic semantics. A probabilistic model has two major advantages compared with nonprobabilistic alternatives (e.g., energy-based models like MMHCRFs). First, it is usually intuitive and straightforward to incorporate prior knowledge within a probabilistic model, e.g., by choosing appropriate prior distributions or by building hierarchical probabilistic models. Second, in addition to predicting the best labels, a probabilistic model also provides posterior probabilities and confidence scores for its predictions. This can be valuable in certain scenarios, e.g., in building cascades of classifiers. HCRFs and MMHCRFs are simply two different ways of solving problems; both have their own advantages and limitations, and which approach to use depends on the specific problem at hand.

6 EXPERIMENTS

We test our algorithm on two publicly available data sets that have been widely used in action recognition: the Weizmann human action data set [9] and the KTH human motion data set [14]. Performance on these benchmarks is saturating: state-of-the-art approaches achieve near-perfect results. We show that our method achieves results comparable to the state of the art and, more importantly, that our extended HCRF model significantly outperforms a direct application of the original HCRF model [4].

Weizmann data set: The Weizmann human action data set contains 83 video sequences showing nine different people, each performing nine different actions: running, walking, jumping jack, jumping forward on two legs,


TABLE 1
A Toy Example Illustrating Why the Learning Criterion of MMHCRF (i.e., Maximizing the Margin) Is More Relevant to Classification than that of HCRF (i.e., Maximizing the Log Likelihood)

We assume a classification problem with three class labels (i.e., $\mathcal{Y} = \{1, 2, 3\}$) and one training datum. The table shows the conditional probabilities $p(Y|x; \theta^{(1)})$ and $p(Y|x; \theta^{(2)})$, $Y = 1, 2, 3$, for the two choices of model parameters $\theta^{(1)}$ and $\theta^{(2)}$.


jumping in place on two legs, galloping sideways, waving two hands, waving one hand, and bending. We track and stabilize the figures using the background subtraction masks that come with this data set. See Fig. 6a for sample frames from this data set.

We randomly choose the videos of five subjects as the training set and the videos of the remaining four subjects as the test set. We learn three HCRF models with different numbers of possible part labels, $|\mathcal{H}| = 6, 10, 20$. Our model classifies every frame in a video sequence (i.e., per-frame classification), but we can also obtain the class label for a whole video sequence by majority voting over the labels of its frames (i.e., per-video classification). We show the confusion matrices of HCRFs and MMHCRFs with $|\mathcal{H}| = 10$ for both per-frame and per-video classification in Figs. 4 and 5.

We compare our system to two baseline methods. The first baseline (root model) uses only the root filter, which is simply a discriminative version of Efros et al. [10]. The second baseline (local HCRF) is a direct application of the original HCRF model [4]. It is similar to our model, but without the root filter in the final model; i.e., local HCRF uses the root filter only to initialize the salient patches. The comparative results are shown in Table 2. Our approach significantly outperforms the two baseline methods. We also compare our results (with $|\mathcal{H}| = 10$) with previous work in Table 3. Note that Blank et al. [9] classify space-time cubes; it is not clear how this can be compared with methods that classify frames or videos. Our result is significantly better than [24] and comparable to [22], although we accept that the comparison is not completely fair, since Niebles and Fei-Fei [24] do not use any tracking or background subtraction.

We visualize the parts learned by HCRFs in Fig. 6a. Each patch is represented by a color that corresponds to the most likely part label of that patch. We also visualize the root filters applied to these images in Fig. 6b. The visualization for MMHCRFs is similar and is omitted due to space constraints.

KTH data set: The KTH human motion data set contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. See Fig. 9a for sample frames. We first run an automatic preprocessing step to track and stabilize the video sequences so that all of the figures appear in the center of the field of view.

We split the videos roughly equally into training/test sets. The confusion matrices (with $|\mathcal{H}| = 10$) for both per-frame and per-video classification are shown in Figs. 7 and 8. The comparison with the two baseline algorithms is summarized in Table 2. Again, our approach outperforms the two baseline systems.

The comparison with other approaches is summarized in Table 4. We should emphasize that we do not attempt a direct comparison, since the methods listed in Table 4 vary in their experimental setups (e.g., different splits of training/test data, whether temporal smoothing is used, whether per-frame classification can be performed, whether tracking/background subtraction is used, whether the whole data set is used, etc.), which makes it impossible to compare them directly. We provide the results only to show that our approach is comparable to the state of the art. Similarly, we visualize the learned parts on the KTH data set in Fig. 9.

We perform additional diagnostic experiments to answer the following questions regarding MMHCRFs:


Fig. 4. Confusion matrices of classification results (per frame and per video) of HCRFs on the Weizmann data set. Rows are ground truths and columns are predictions. (a) Per frame. (b) Per video.

Fig. 5. Confusion matrices of classification results of MMHCRFs on the Weizmann data set. (a) Per frame. (b) Per video.

TABLE 2
Comparison of Two Baseline Systems with Our Approach (HCRF and MMHCRF) on the Weizmann and KTH Data Sets

TABLE 3
Comparison of Classification Accuracy with Previous Work on the Weizmann Data Set


With or without pairwise potentials? If we remove the pairwise potentials ψ(y, h_j, h_k) from the model, learning and inference become easier since each part label can be assigned independently. One interesting question is whether the pairwise potentials really help the recognition. To answer this question, we run MMHCRFs without the pairwise potentials on both the Weizmann and KTH data sets. The results are shown in Table 5 ("no pairwise"). Comparing with the results in Table 2, we can see that MMHCRF models without pairwise potentials do not perform as well as those with pairwise potentials.

Multiclass or one-against-all? MMHCRFs directly handle multiclass classification. An alternative is to convert the multiclass problem into several binary classification problems (e.g., one-against-all). The disadvantage of this alternative is discussed in [33]. Here, we demonstrate it experimentally in our application. For a K-class classification problem, we convert it into K binary classification (one-against-all) problems. Each binary classification problem is solved using the formulation of LSVMs in [3]. In the end, we get K binary LSVM models. For a test image, we assign its class label by choosing the LSVM model that gives the highest score on this image. As before, we perform per-frame classification and use majority voting to obtain per-video classification. The results are shown in Table 5 ("one-against-all"). We can see that the one-against-all strategy does not perform as well as our proposed method (see Table 2). We believe that it is because those binary classifiers are trained independently of each other, and their

1320 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 7, JULY 2011

Fig. 6. Visualization of the learned model on the Weizmann data set: (a) Visualization of the learned parts. Patches are colored according to their most likely part labels. Each color corresponds to a part label. Some interesting observations can be made. For example, the part label represented by red seems to correspond to the "moving down" patterns mostly observed in the "bending" action. The part label represented by green seems to correspond to the motion patterns distinctive of "hand-waving" actions. (b) Visualization of root filters applied on these images. For each image with class label c, we apply the root filter for class c. The results show the filter responses aggregated over four motion descriptor channels. Bright areas correspond to positive energies, i.e., areas that are discriminative for this class. This figure is best viewed in color with PDF magnification.

Fig. 7. Confusion matrices of classification results of HCRFs on the KTH data set. (a) Per frame. (b) Per video.

Fig. 8. Confusion matrices of classification results of MMHCRFs on the KTH data set. (a) Per frame. (b) Per video.

TABLE 4
Comparison of Per-Video Classification Accuracy with Previous Approaches on the KTH Data Set


scores are not calibrated to be comparable. To test this hypothesis, we evaluate another method (called "one-against-all + SVM" in Table 5). We take the output scores of the K binary LSVM models and stack them into a K-dimensional vector. Then, we train a multiclass SVM that takes this vector as its input and produces one of the K class labels. This multiclass SVM calibrates the scores of the different binary LSVM models. The results show that this calibration step significantly improves the performance of the one-against-all strategy, but the final results are still worse than those of MMHCRFs. In summary, our experimental results show that the one-against-all strategy does not perform as well as MMHCRFs.
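The two one-against-all variants discussed above can be sketched as follows. The scorers here are hypothetical stand-ins for trained binary LSVM models, and the second-stage multiclass SVM is replaced by a plain feature-stacking function; this is an illustrative sketch, not the authors' implementation:

```python
def predict_one_vs_all(x, scorers):
    """Plain one-against-all: pick the class whose binary model scores highest.
    Note the scores of independently trained models need not be comparable."""
    scores = [s(x) for s in scorers]
    return max(range(len(scores)), key=lambda k: scores[k])

def stacked_features(x, scorers):
    """'One-against-all + SVM' front end: stack the K binary scores into a
    K-dimensional vector; a second-stage multiclass SVM trained on these
    vectors would then calibrate the scores."""
    return [s(x) for s in scorers]

def predict_video(frames, scorers):
    """Per-video label by majority vote over per-frame predictions."""
    votes = [predict_one_vs_all(f, scorers) for f in frames]
    return max(set(votes), key=votes.count)

# Toy scorers for K = 3 classes (placeholders for trained LSVMs); each frame
# is represented by a hypothetical 3-dimensional feature vector.
scorers = [lambda x: x[0], lambda x: x[1], lambda x: x[2]]
frames = [[0.1, 0.9, 0.2], [0.8, 0.3, 0.1], [0.2, 0.7, 0.5]]
print(predict_video(frames, scorers))  # 1 (class 1 wins 2 of 3 frames)
```

The sketch makes the calibration issue concrete: `predict_one_vs_all` compares raw scores from models that never saw each other during training, whereas `stacked_features` exposes all K scores jointly to a downstream classifier that can learn how to weigh them against one another.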

7 CONCLUSION

We have presented discriminatively learned part models for human action recognition. Our model combines both the large-scale features used in global templates and the local patch features used in bag-of-words models. Our experimental results show that our model is quite effective in recognizing actions, with results comparable to state-of-the-art approaches. In particular, we show that the combination of large-scale features and local patch features performs significantly better than using either of them alone. We also presented a new max-margin method for learning the model parameters. Our experimental results show that max-margin learning outperforms probabilistic learning based on maximizing the conditional log-likelihood of the training data. More importantly, max-margin learning allows us to deal with a large spectrum of complex structures.

We have also given a detailed theoretical analysis of the pros and cons of HCRFs and MMHCRFs from both computational and modeling perspectives. Our analysis explains why the design choices of MMHCRFs (max margin instead of maximum likelihood, maximizing over h instead of summing over h) lead to better performance in our experiments.

The applicability of the proposed models and learning algorithms goes beyond human action recognition. In computer vision and many other fields, we constantly deal with data that have rich, interdependent structures. We believe that our work will pave a new avenue for constructing more powerful models that handle various forms of latent structures.

Our proposed models are still very simple. As future work, we would like to extend them to handle partial occlusion, viewpoint change, etc. We also plan to investigate more robust features. These extensions will be necessary to make our models applicable to more challenging videos.

REFERENCES

[1] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," Int'l J. Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.

[2] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[3] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[4] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell, "Hidden Conditional Random Fields," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1848-1852, Oct. 2007.


Fig. 9. Visualization of the learned model on the KTH data set. (a) Visualization of the learned parts. (b) Visualization of root filters applied on these images. See the caption of Fig. 6 for detailed explanations. Similarly, we can make some interesting observations. For example, parts colored in pink, red, and green are representatives of "boxing," "handclapping," and "hand-waving" actions, respectively. The part colored in light blue is shared among "jogging," "running," and "walking" actions. This figure is best viewed in color with PDF magnification.

TABLE 5
Results of Learning MMHCRF Models without Pairwise Potentials ("No Pairwise") and by Converting the Multiclass Problem into Several Binary Classification Problems ("One-Against-All"), Then Calibrating Scores of Those Binary Classifiers by Another SVM Model ("One-Against-All + SVM")


[5] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2005.

[6] A.C. Berg, T.L. Berg, and J. Malik, "Shape Matching and Object Recognition Using Low Distortion Correspondence," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2005.

[7] J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, and W.T. Freeman, "Discovering Objects and Their Location in Images," Proc. IEEE Int'l Conf. Computer Vision, vol. 1, pp. 370-377, 2005.

[8] P.F. Felzenszwalb and D.P. Huttenlocher, "Pictorial Structures for Object Recognition," Int'l J. Computer Vision, vol. 61, no. 1, pp. 55-79, Jan. 2005.

[9] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as Space-Time Shapes," Proc. IEEE Int'l Conf. Computer Vision, 2005.

[10] A.A. Efros, A.C. Berg, G. Mori, and J. Malik, "Recognizing Action at a Distance," Proc. IEEE Int'l Conf. Computer Vision, pp. 726-733, 2003.

[11] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," Proc. IEEE Int'l Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

[12] J.C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words," Proc. British Machine Vision Conf., 2006.

[13] S. Nowozin, G. Bakir, and K. Tsuda, "Discriminative Subsequence Mining for Action Classification," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[14] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Proc. Int'l Conf. Pattern Recognition, vol. 3, pp. 32-36, 2004.

[15] Y. Wang and G. Mori, "Learning a Discriminative Hidden Part Model for Human Action Recognition," Proc. Neural Information Processing Systems, 2009.

[16] Y. Wang and G. Mori, "Max-Margin Hidden Conditional Random Fields for Human Action Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.

[17] R. Cutler and L.S. Davis, "Robust Real-Time Periodic Motion Detection, Analysis, and Applications," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 781-796, Aug. 2000.

[18] R. Polana and R.C. Nelson, "Detection and Recognition of Periodic, Non-Rigid Motion," Int'l J. Computer Vision, vol. 23, no. 3, pp. 261-282, June 1997.

[19] J.J. Little and J.E. Boyd, "Recognizing People by Their Gait: The Shape of Motion," Videre, vol. 1, no. 2, pp. 1-32, 1998.

[20] C. Rao, A. Yilmaz, and M. Shah, "View-Invariant Representation and Recognition of Actions," Int'l J. Computer Vision, vol. 50, no. 2, pp. 203-226, 2002.

[21] A.F. Bobick and J.W. Davis, "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.

[22] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A Biologically Inspired System for Action Recognition," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[23] I. Laptev and T. Lindeberg, "Space-Time Interest Points," Proc. IEEE Int'l Conf. Computer Vision, 2003.

[24] J.C. Niebles and L. Fei-Fei, "A Hierarchical Model of Shape and Appearance for Human Action Classification," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.

[25] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2003.

[26] Y. Ke, R. Sukthankar, and M. Hebert, "Event Detection in Crowded Videos," Proc. IEEE Int'l Conf. Computer Vision, 2007.

[27] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. Int'l Conf. Machine Learning, 2001.

[28] B. Taskar, C. Guestrin, and D. Koller, "Max-Margin Markov Networks," Proc. Neural Information Processing Systems, 2004.

[29] B. Taskar, S. Lacoste-Julien, and M.I. Jordan, "Structured Prediction, Dual Extragradient and Bregman Projections," J. Machine Learning Research, vol. 7, pp. 1627-1653, 2006.

[30] Y. Altun, T. Hofmann, and I. Tsochantaridis, "SVM Learning for Interdependent and Structured Output Spaces," Machine Learning with Structured Outputs, G. Bakir, T. Hofmann, B. Scholkopf, A.J. Smola, B. Taskar, and S.V.N. Vishwanathan, eds., MIT Press, 2006.

[31] B.D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," Proc. DARPA Image Understanding Workshop, 1981.

[32] C. Desai, D. Ramanan, and C. Fowlkes, "Discriminative Models for Multi-Class Object Layout," Proc. IEEE Int'l Conf. Computer Vision, 2009.

[33] K. Crammer and Y. Singer, "On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines," J. Machine Learning Research, vol. 2, pp. 265-292, 2001.

[34] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A Tutorial on Energy-Based Learning," Predicting Structured Data, T. Hofmann, B. Scholkopf, A. Smola, and B. Taskar, eds., MIT Press, 2006.

[35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Univ. Press, 2004.

[36] M. Szummer, P. Kohli, and D. Hoiem, "Learning CRFs Using Graph Cuts," Proc. European Conf. Computer Vision, 2008.

[37] J.C. Platt, "Using Analytic QP and Sparseness to Speed Training of Support Vector Machines," Proc. Neural Information Processing Systems, 1999.

[38] M. Collins, A. Globerson, T. Koo, X. Carreras, and P.L. Bartlett, "Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks," J. Machine Learning Research, vol. 9, pp. 1757-1774, Aug. 2008.

[39] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky, "MAP Estimation via Agreement on Trees: Message-Passing and Linear Programming," IEEE Trans. Information Theory, vol. 51, no. 11, pp. 3697-3717, Nov. 2005.

[40] D. Tran and D. Forsyth, "Configuration Estimates Improve Pedestrian Finding," Proc. Neural Information Processing Systems, 2008.

[41] S. Belongie, G. Mori, and J. Malik, "Matching with Shape Contexts," Analysis and Statistics of Shapes, Birkhauser, 2005.

[42] C. Cherry and C. Quirk, "Discriminative, Syntactic Language Modeling through Latent SVMs," Proc. Assoc. for Machine Translation in the Americas, 2008.

[43] C.-N. Yu and T. Joachims, "Learning Structural SVMs with Latent Variables," Proc. Ann. Int'l Conf. Machine Learning, 2009.

[44] T. Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis," Machine Learning, vol. 42, pp. 177-196, 2001.

[45] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[46] P. Viola, J.C. Platt, and C. Zhang, "Multiple Instance Boosting for Object Recognition," Proc. Neural Information Processing Systems, 2006.

[47] A.Y. Ng and M.I. Jordan, "On Discriminative versus Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes," Proc. Neural Information Processing Systems, 2002.

[48] V. Kolmogorov and R. Zabih, "What Energy Functions Can Be Minimized via Graph Cuts?" IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147-159, Feb. 2004.

[49] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2003.

[50] J. Liu and M. Shah, "Learning Human Actions via Information Maximization," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[51] Y. Ke, R. Sukthankar, and M. Hebert, "Efficient Visual Event Detection Using Volumetric Features," Proc. IEEE Int'l Conf. Computer Vision, 2005.



Yang Wang received the BSc degree from the Harbin Institute of Technology, China, the MSc degree from the University of Alberta, Canada, and the PhD degree from Simon Fraser University, Canada, all in computer science. He is currently an NSERC postdoctoral fellow in the Department of Computer Science, University of Illinois at Urbana-Champaign. He was a research intern at Microsoft Research Cambridge in summer 2006. His research interests lie in high-level recognition problems in computer vision, in particular, human activity recognition, human pose estimation, object/scene recognition, etc. He also works on various topics in statistical machine learning, including structured prediction, probabilistic graphical models, semi-supervised learning, etc.

Greg Mori received the Hon BSc degree in computer science and mathematics with high distinction from the University of Toronto in 1999 and the PhD degree in computer science from the University of California, Berkeley, in 2004. He spent one year (1997-1998) as an intern at Advanced Telecommunications Research (ATR) in Kyoto, Japan. After graduating from Berkeley, he returned home to Vancouver and is currently an assistant professor in the School of Computing Science at Simon Fraser University. His research interests are in computer vision, and include object recognition, human activity recognition, and human body pose estimation. He serves on the program committees of major computer vision conferences (CVPR, ECCV, and ICCV), and was the program cochair of the Canadian Conference on Computer and Robot Vision (CRV) in 2006 and 2007. He received the Excellence in Undergraduate Teaching Award from the SFU Computing Science Student Society in 2006, and the Canadian Image Processing and Pattern Recognition Society (CIPPRS) Award for Research Excellence and Service in 2008. He received an NSERC Discovery Accelerator Supplement in 2008. He is a member of the IEEE.

