inferring human activities using robust privileged ... · inferring human activities using robust...

Inferring Human Activities Using Robust Privileged Probabilistic Learning

Michalis Vrigkas1 Evangelos Kazakos2 Christophoros Nikou2 Ioannis A. Kakadiaris11Computational Biomedicine Lab, University of Houston, Houston, TX, USA

2Dept. Computer Science & Engineering, University of Ioannina, Ioannina, Greece

Abstract

Classification models may often suffer from “structureimbalance” between training and testing data that may oc-cur due to the deficient data collection process. This im-balance can be represented by the learning using privilegedinformation (LUPI) paradigm. In this paper, we present asupervised probabilistic classification approach that inte-grates LUPI into a hidden conditional random field (HCRF)model. The proposed model is called LUPI-HCRF and isable to cope with additional information that is only avail-able during training. Moreover, the proposed method em-ployes Student’s t-distribution to provide robustness to out-liers by modeling the conditional distribution of the priv-ileged information. Experimental results in three publiclyavailable datasets demonstrate the effectiveness of the pro-posed approach and improve the state-of-the-art in theLUPI framework for recognizing human activities.

1. IntroductionThe rapid development of human activity recognition

systems for applications such as surveillance and human-machine interactions [5, 31] brings forth the need for devel-oping new learning techniques. Learning using privilegedinformation (LUPI) [18, 28, 34] has recently generated con-siderable research interest. The insight of LUPI is that onemay have access to additional information about the train-ing samples, which is not available during testing.

Despite the impressive progress that has been made inrecognizing human activities, the problem still remainschallenging. First, constructing a visual model for learn-ing and analyzing human movements is difficult. The largeintra-class variabilities or changes in appearance make therecognition problem difficult to address. Finally, the lack ofinformative data or the presence of misleading informationmay lead to ineffective approaches.

We address these issues by presenting a probabilistic ap-proach, which is able to learn human activities by exploitingadditional information about the input data, that may reflecton auxiliary properties about classes and members of the

Figure 1. Robust learning using privileged information. Given aset of training examples and a set of additional information aboutthe training samples (left) our system can successfully recognizethe class label of the underlying activity without having accessto the additional information during testing (right). We explorethree different forms of privileged information (e.g., audio signals,human poses, and attributes) by modeling them with a Student’st-distribution and incorporating them into the LUPI-HCRF model.

classes of the training data (Fig. 1). In this context, we em-ploy a new learning method based on hidden conditionalrandom fields (HCRFs) [24], called LUPI-HCRF, whichcan efficiently manage dissimilarities in input data, such asnoise, or missing data, using a Student’s t-distribution. Theuse of Student’s t-distribution is justified by the propertythat it has heavier tails than a standard Gaussian distribu-tion, thus providing robustness to outliers [23].

The main contributions of our work can be summarizedin the following points. First, we developed a probabilis-tic human activity recognition method that exploits privi-leged information based on HCRFs to deal with missing orincomplete data during testing. Second, contrary to previ-ous methods, which may be sensitive to outlying data mea-surements, we propose a robust framework by employing aStudent’s t-distribution to attain robustness against outliers.Finally, we emphasize the generic nature of our approach tocope with samples from different modalities.

2. Related workA major family of methods relies on learning human

activities by building visual models and assigning activityroles to people associated with an event [27, 36]. Earlierapproaches use different kinds of modalities, such as au-dio information, as additional information to construct bet-ter classification models for activity recognition [32].

A shared representation of human poses and visual infor-mation has also been explored [40]. Several kinematic con-straints for decomposing human poses into separate limbshave been explored to localize the human body [4]. How-ever, identifying which body parts are most significant forrecognizing complex human activities still remains a chal-lenging task [16]. Much focus has also been given in rec-ognizing human activities from movies or TV shows by ex-ploiting scene contexts to localize and understand humaninteractions [10, 22]. The recognition accuracy of suchcomplex videos can also be improved by relating textualdescriptions and visual context to a unified framework [26].

Recently, intermediate semantic features representationfor recognizing unseen actions during training has been pro-posed [17, 38]. These features are learned during trainingand enable parameter sharing between classes by capturingthe correlations between frequently occurring low-level fea-tures [1]. Instead of learning one classifier per attribute, atwo-step classification method has been proposed by Lam-pert et al. [14]. Specific attributes are predicted from pre-trained classifiers and mapped into a class-level score.

Recent methods that exploited deep neural networkshave demonstrated remarkable results in large-scaledatasets. Donahue et al. [6] proposed a recurrent con-volutional architecture, where recurrent long-term modelsare connected to convolutional neural networks (CNNs)that can be jointly trained to simultaneously learn spatio-temporal dynamics. Wang et al. [37] proposed a new videorepresentation that employs CNNs to learn multi-scale con-volutional feature maps. Tran et al. [33] introduced a 3DConvNet architecture that learns spatio-temporal featuresusing 3D convolutions. A novel video representation, thatcan summarize a video into a single image by applying rankpooling on the raw image pixels, was proposed by Bilen etal. [2]. Feichtenhofer et al. [7] introduced a novel architec-ture for two stream ConvNets and studied different ways forspatio-temporal fusion of the ConvNet towers. Zhu et al.[41] argued that videos contain one or more key volumesthat are discriminative and most volumes are irrelevant tothe recognition process.

The LUPI paradigm was first introduced by Vapnik andVashist [34] as a new classification setting to model basedon a max-margin framework, called SVM+. The choice ofdifferent types of privileged information in the context ofan object classification task implemented in a max-marginscheme was also discussed by Sharmanska et al. [30]. Wand

x1 x2 xTx∗1 x∗

2 x∗T· · ·

h1 h2 hT· · ·

y

Figure 2. Graphical representation of the chain structure model.The grey nodes are the observed features (xi), the privileged infor-mation (x∗

i ), and the unknown labels (y), respectively. The whitenodes are the unobserved hidden variables (h).

and Ji [39] proposed two different loss functions that exploitprivileged information and can be used with any classifier.Recently, a combination of the LUPI framework and activelearning has been explored by Vrigkas et al. [35] to classifyhuman activities in a semi-supervised scheme.

3. Robust privileged probabilistic learningOur method uses HCRFs, which are defined by a chained

structured undirected graph G = (V, E) (see Fig. 2), as theprobabilistic framework for modeling the activity of a sub-ject in a video. During training, a classifier and the mappingfrom observations to the label set are learned. In testing, aprobe sequence is classified into its respective state usingloopy belief propagation (LBP) [12].

3.1. LUPI-HCRF model formulation

We consider a labeled dataset with N video sequencesconsisting of triplets D = {(xi,j ,x∗i,j , yi)}Ni=1, wherexi,j ∈ RMx×T is an observation sequence of length T withj = 1 . . . T . For example, xi,j might correspond to thejth frame of the ith video sequence. Furthermore, yi corre-sponds to a class label defined in a finite label set Y . Also,the additional information about the observations xi is en-coded in a feature vector x∗i,j ∈ RMx∗×T . Such privilegedinformation is provided only at the training step and it isnot available during testing. Note that we do not make anyassumption about the form of the privileged data. In whatfollows, we omit indices i and j for simplicity.

The LUPI-HCRF model is a member of the exponentialfamily and the probability of the class label given an obser-vation sequence is given by:

p(y|x,x∗;w) =∑h

p(y,h|x,x∗;w)

=∑h

exp (E(y,h|x,x∗;w)−A(w)) ,

(1)

where w = [θ,ω] is a vector of model parameters, andh = {h1, h2, . . . , hT }, with hj ∈ H being a set of latent

variables. In particular, the number of latent variables maybe different from the number of samples, as hj may cor-respond to a substructure in an observation. Moreover, thefeatures follow the structure of the graph, in which no fea-ture may depend on more than two hidden states hj andhk [24]. This property not only captures the synchroniza-tion points between the different sets of information of thesame state, but also models the compatibility between pairsof consecutive states. We assume that our model followsthe first-order Markov chain structure (i.e., the current stateaffects the next state). Finally, E(y,h|x;w) is a vector ofsufficient statistics and A(w) is the log-partition functionensuring normalization:

A(w) = log∑y′

∑h

exp (E(y′,h|x,x∗;w)) . (2)

Different sufficient statistics E(y|x,x∗;w) in (1) definedifferent distributions. In the general case, sufficient statis-tics consist of indicator functions for each possible config-uration of unary and pairwise terms:

E(y,h|x,x∗;w) =∑j∈V

Φ(y, hj ,xj ,x∗j ;θ) +

∑j,k∈E

Ψ(y, hj , hk;ω) ,

(3)

where the parameters θ and ω are the unary and the pair-wise weights, respectively, that need to be learned. Theunary potential does not depend on more than two hiddenvariables hj and hk, and the pairwise potential may dependon hj and hk, which means that there must be an edge (j, k)in the graphical model. The unary potential is expressed by:

Φ(y, hj ,xj ,x∗j ;θ) =

∑`

φ1,`(y, hj ;θ1,`)

+ φ2(hj ,xj ;θ2) + φ3(hj ,x∗j ;θ3) ,

(4)

and it can be seen as a state function, which consists ofthree different feature functions. The label feature function,which models the relationship between the label y and thehidden variables hj , is expressed by:

φ1,`(y, hj ;θ1,`) =∑λ∈Y

∑a∈H

θ1,`1(y = λ)1(hj = a) , (5)

where 1(·) is the indicator function, which is equal to 1 if itsargument is true, and 0 otherwise. The observation featurefunction, which models the relationship between the hiddenvariables hj and the observations x, is defined by:

φ2(hj ,xj ;θ2) =∑a∈H

θ>2 1(hj = a)xj . (6)

Finally, the privileged feature function, which models therelationship between the hidden variables hj is defined by:

φ3(hj ,x∗j ;θ3) =

∑a∈H

θ>3 1(hj = a)x∗j . (7)

The pairwise potential is a transition function and rep-resents the association between a pair of connected hiddenstates hj and hk and the label y. It is expressed by:

Ψ(y, hj , hk;ω) =∑λ∈Ya,b∈H

∑`

ω`1(y = λ)1(hj = a)1(hk = b) .

(8)

3.2. Parameter learning and inference

In the training step the optimal parameters w∗ are esti-mated by maximizing the following loss function:

L(w) =

N∑i=1

log p(yi|xi,x∗i ;w)− 1

2σ2‖w‖2 . (9)

The first term is the log-likelihood of the posterior proba-bility p(y|x,x∗;w) and quantifies how well the distributionin Eq. (1) defined by the parameter vector w matches thelabels y. The second term is a Gaussian prior with varianceσ2 and works as a regularizer. The loss function is opti-mized using the limited-memory BFGS (LBFGS) method[20] to minimize the negative log-likelihood of the data.

Our goal is to estimate the optimal label configurationover the testing input, where the optimality is expressed interms of a cost function. To this end, we maximize the pos-terior probability and marginalize over the latent variablesh and the privileged information x∗:

y = arg maxy

p(y|x;w)

= arg maxy

∑h

∑x∗

p(y,h|x,x∗;w)p(x∗|x;w) .(10)

To efficiently cope with outlying measurements aboutthe training data, we consider that the training samples xand x∗ jointly follow a Student’s t-distribution. Therefore,the conditional distribution p(x∗|x;w) is also a Student’st-distribution St(x∗|x;µ∗,Σ∗, ν∗), where x∗ forms the firstMx∗ components of (x∗,x)

T , x comprises the remainingM − Mx∗ components, with mean vector µ∗, covariancematrix Σ∗ and ν∗ ∈ [0,∞) corresponds to the degrees offreedom of the distribution [13]. If the data contain out-liers, the degrees of freedom parameter ν∗ is weak and themean and covariance of the data are appropriately weightedin order not to take into account the outliers. An approxi-mate inference is employed for estimation of the marginalprobability (Eq. (10)) by applying the LBP algorithm [12].

4. Multimodal Feature FusionOne drawback of combining features of different modal-

ities is the different frame rate that each modality mayhave. Thus, instead of directly combining multimodal fea-tures together one may employ canonical correlation analy-sis (CCA) [9] to exploit the correlation between the differ-ent modalities by projecting them onto a common subspace

Algorithm 1 Robust privileged probabilistic leaningInput: Original data x, privileged data x∗, class labels y

1: Perform feature extraction from both x and x∗

2: Project x and x∗ onto a common space using Eq. (11)3: w∗ ← arg min

w(−L(w)) /* Train LUPI-HCRF on x

and x∗ using Eq. (9) */4: y ← arg max

yp(y|x;w) /* Marginalize over h and x∗

using Eq. (10) */Output: Predicted labels y

so that the correlation between the input vectors is maxi-mized in the projected space. In this paper, we followeda different approach. Our model is able to learn the rela-tionship between the input data and the privileged features.To this end, we jointly calibrate the different modalities bylearning a multiple output linear regression model [21]. Letx ∈ RM×d be the input raw data and a ∈ RM×p be theset of semantic attributes (privileged features). Our goal isto find a set of weights γ ∈ Rd×p, which relates the priv-ileged features to the regular features by minimizing a dis-tance function across the input samples and their attributes:

arg minγ

‖xγ − a‖2 + η‖γ‖2 , (11)

where ‖γ‖2 is a regularization term and η controls the de-gree of the regularization, which was chosen to give the bestsolution by using cross validation with η ∈ [10−4, 1]. Fol-lowing a constrained least squares (CLS) optimization prob-lem and minimizing ‖γ‖2 subject to xγ = a, Eq. (11) hasa closed form solution γ =

(xTx + ηI

)−1xTa, where I is

the identity matrix. Note that the minimization of Eq. (11)is fast since it needs to be solved only once during training.Finally, we obtain the prediction f of the privileged featuresby multiplying the regular features with the learned weightsf = x · γ. The main steps of the proposed method aresummarized in Algorithm 1.

5. ExperimentsDatasets: The TV human interaction (TVHI) [22]

dataset consists of 300 videos and contains four kinds of in-teractions. The SBU Kinect Interaction (SBU) [40] datasetis a collection of approximately 300 videos that containeight different interaction classes. Finally, the unstructuredsocial activity attribute (USAA) [8] dataset includes eightdifferent semantic class videos of social occasions and con-tains around 100 videos per class for training and testing.

5.1. Implementation details

Feature selection: For the evaluation of our method, weused spatio-temporal interest points (STIP) [15] as our base

Dataset Features (dimension) Regular Privileged

TVHI [22]STIP (162) 3Head orientations (2) 3MFCC (39) 3

SBU [40] STIP (162) 3Pose (15) 3

USAA [8]

STIP (162) 3SIFT (128) 3MFCC (39) 3Semantic attributes (69) 3

Table 1. Types of features used for human activity recognition foreach dataset. The numbers in parentheses indicate the dimensionof the features. The checkmark corresponds to the usage of thespecific information as regular or privileged. Privileged featuresare used only during training.

video representation. These features were selected becausethey can capture salient visual motion patterns in an effi-cient and compact way. In addition, for the TVHI dataset,we also used the provided annotations, which are relatedto the locations of the persons in each video clip. For thisdataset, we used audio features as privileged information.More specifically, we employed the mel-frequency cepstralcoefficients (MFCC) [25] features, resulting in a collectionof 13 MFCC coefficients, and their first and second orderderivatives forming a 39-dimensional feature vector.

Furthermore, for the SBU dataset, as privileged informa-tion, we used the positions of the locations of the joints foreach person in each frame, and six more feature types con-cerning joint distance, joint motion, plane, normal plane,velocity, and normal velocity as described by Yun et al.[40]. Finally, for the USAA dataset we used the providedattribute annotation as privileged information to character-ize each class with a feature vector of semantic attributes.As a representation of the video data, we used the providedSIFT [19], STIP, and MFCC features. Table 1 summarizesall forms of features used either as regular or privileged foreach dataset during training and testing.

Model selection: The model in Fig. 2 was trained byvarying the number of hidden states from 4 to 20, with amaximum of 400 iterations for the termination of LBFGS.The L2 regularization scale term σ was searched within{10−3, . . . , 103} and 5-fold cross validation was used tosplit the datasets into training and test sets, and the averageresults over all the examined configurations are reported.

5.2. Results and discussion

Classification with hand-crafted features: We com-pare the results of our LUPI-HCRF method with the state-of-the-art SVM+ method [34] and other baselines that in-corporate the LUPI paradigm. Also, to demonstrate the ef-ficacy of robust privileged information to the problem ofhuman activity recognition, we compared it with ordinary

(a)

(b) (c)

Figure 3. Confusion matrices for the classification results of theproposed LUPI-HCRF approach for (a) the TVHI [22], (b) theSBU [40], and (c) the USAA [8] datasets.

SVM and HCRF, as if they could access both the originaland the privileged information at test time. This means thatwe do not differentiate between regular and privileged infor-mation, but use both forms of information as regular to inferthe underlying class label instead. Also, for the SVM+ andSVM, we consider a one-versus-all decomposition of multi-class classification scheme and average the results for everypossible configurations. Finally, the optimal parameters forthe SVM and SVM+ were selected using cross validation.

The resulting confusion matrices for all datasets are de-picted in Fig. 3. For the TVHI and SBU datasets, the classi-fication errors between different classes are relatively small,as only a few classes are strongly confused with each other.For the USAA dataset, the different classes may be stronglyconfused (e.g., the class wedding ceremony is confused withthe class graduation party) as the dataset has large intra-class variabilities, while the different classes may share thesame attribute representation as different videos may havebeen captured under similar conditions.

The benefit of using robust privileged information alongwith conventional data instead of using each modality sep-arately or both modalities as regular information is shownin Table 2. The combination of regular and privileged fea-tures has considerably increased the recognition accuracy tomuch higher levels than using each modality separately. Ifonly privileged information is used as regular for classifica-tion, the recognition accuracy is notably lower than whenusing visual information for the classification task. Thus,we may come to the conclusion that finding proper privi-leged information is not always a straightforward problem.

Table 3 compares the proposed approach with state-of-the-art methods on the human activity classification task onthe same datasets. The results indicate that our approachimproved the classification accuracy. On TVHI, we sig-nificantly managed to increase the classification accuracy

Dataset Regular Privileged Accuracy (%)

TVHI [22]

visual 7 60.9audio 7 35.9visual+audio 7 81.3visual audio 84.9

SBU [40]

visual 7 69.8pose 7 62.5visual+pose 7 81.4visual pose 85.4

USAA [8]

visual 7 55.5sem. attributes 7 37.4visual+sem. attributes 7 54.0visual sem. attributes 58.1

Table 2. Comparison of feature combinations for classifying hu-man activities and events on TVHI [22], SBU [40] and USAA [8]datasets. Robust privileged information seems to work in favor ofthe classification task. The crossmark corresponds to the absenceof privileged information during training.

Hand-crafted featuresMethod TVHI SBU USAAWang and Schmid [36] 76.1± 0.4 79.6±0.4 55.5± 0.1Wang and Ji [39] 74.8± 0.2 62.4± 0.3 48.5± 0.2Sharmanska et al. [30] 65.2± 0.1 56.3± 0.2 56.3± 0.2SVM [3] 75.9± 0.6 79.4± 0.4 47.4± 0.1HCRF [24] 81.3± 0.7 81.4± 0.8 54.0± 0.8SVM+ [34] 75.0± 0.2 79.4± 0.3 48.5± 0.1LUPI-HCRF 84.9± 0.8 85.4± 0.4 58.1± 1.4

Table 3. Comparison of the classification accuracies (%) on TVHI,SBU and USAA datasets using hand-crafted features.

by approximately 10% with respect to the SVM+ approach.Also, the improvement of our method with respect to SVM+was about 6% and 10% for the SBU and USAA datasets, re-spectively. This indicates the strength of the LUPI paradigmand demonstrates the need for additional information.

Classification with CNN features: In our experiments,we used CNNs for both end-to-end classification and fea-ture extraction. In both cases, we employed the pre-trainedmodel of Tran et al. [33], which is a 3D ConvNet. Weselected this model because it was trained on a very largedataset, namely Sports 1M dataset [11], which providesvery good features for the activity recognition task, espe-cially in our case where the size of the training data is small.

For the SBU dataset, which is a fairly small dataset, onlya few parameters had to be trained to avoid overfitting. Wereplaced the fully-connected layer of the pre-trained modelwith a new fully-connected layer of size 1024 and retrainedthe model by adding a softmax layer on top of it. For theTVHI dataset, we fine-tuned the last group of convolutionallayers, while for USAA, we fine-tuned the last two groups.Each group has two convolutional layers, while we addeda new fully-connected layer of size 256. For the optimiza-tion process, we used mini-batch gradient descent (SGD)with momentum. The size of the mini-batch was set to 16with a constant momentum of 0.9. For the SBU dataset, the

CNN featuresMethod TVHI SBU USAACNN (end-to-end) [33] 60.5± 1.1 94.2± 0.8 67.4± 0.6SVM [3] 90.0± 0.3 92.8± 0.2 91.9± 0.3HCRF [24] 89.6± 0.5 91.1± 0.4 91.6± 0.8SVM+ [34] 92.5± 0.4 94.8± 0.3 92.3± 0.3LUPI-HCRF 93.2± 0.6 94.9± 0.7 93.9± 0.9

Table 4. Comparison of the classification accuracies (%) on TVHI,SBU and USAA datasets using CNN features.

learning rate was initialized to 0.01 and it was decayed by afactor of 0.1, while the total number of training epochs was1000. For the TVHI and USAA datasets, we used a con-stant learning rate of 10−4 and the total number of trainingepochs was 500 and 250, respectively. For all datasets, weadded a dropout layer after the new fully-connected layerwith probability 0.5. Also, we performed data augmenta-tion on each batch online and 16 consecutive frames wererandomly selected for each video. These frames were ran-domly cropped, resulting in frames of size 112 × 112 andthen flipped with probability 0.5. For classification, we usedthe centered 112 × 112 crop on the frames of each videosequence. Then, for each video, we extracted 10 randomclips of 16 frames and averaged their predictions. Finally,to avoid overfitting, we used early stopping and extractedCNN features from the newly added fully-connected layer.

Table 4 summarizes the comparison of LUPI-HCRF withstate-of-the-art methods using the features extracted fromthe CNN model, and end-to-end learning of the CNN modelusing softmax loss for the classification. The improvementof accuracy, compared to the classification based on thetraditional features (Table 3), indicates that CNNs may ef-ficiently extract informative features without any need tohand design them. We may observe that privileged in-formation works in favor of the classification task in allcases. LUPI-HCRF achieves notably higher recognition ac-curacy with respect to the HCRF model and the SVM+ ap-proaches. Moreover, for both TVHI and USAA datasets,when LUPI-HCRF is compared to the end-to-end CNNmodel, it achieved an improvement of approximately 33%and 27%, respectively. This huge improvement can be ex-plained by the fact that the CNN model uses a very simpleclassifier in the softmax layer, while LUPI-HCRF is a moresophisticated model that can efficiently handle sequentialdata in a more principled way.

The corresponding confusion matrices using the CNN-based features, are depicted in Fig. 4. The combinationof privileged information with the information learned fromthe CNN model feature representation resulted in very smallinter- and intra-class classification errors for all datasets.

Discussion: In general, our method is able to robustlyuse privileged information in a more efficient way than itsSVM+ counterpart by exploiting the hidden dynamics be-

(a)

(b) (c)

Figure 4. Confusion matrices for the classification results of theproposed LUPI-HCRF approach for (a) the TVHI [22], (b) theSBU [40], and (c) the USAA [8] datasets using the CNN features.

tween the videos and the privileged information. However,selecting which features can act as privileged information isnot straightforward. The performance of LUPI-based clas-sifiers relies on the delicate relationship between the regularand the privileged information. Tables 3 and 4 show thatSVM and HCRF perform worse than LUPI-HCRF. This isbecause at training time privileged information comes asground truth but at test time it is not. Also, privileged in-formation is costly or difficult to obtain with respect to pro-ducing additional regular training examples [29].

6. Conclusion

In this paper, we addressed the problem of human activ-ity recognition and proposed a novel probabilistic classifica-tion model based on robust learning by incorporating a Stu-dent’s t-distribution into the LUPI paradigm, called LUPI-HCRF. We evaluated the performance of our method onthree publicly available datasets and tested various forms ofdata that can be used as privileged. The experimental resultsindicated that robust privileged information ameliorates therecognition performance. We demonstrated improved re-sults with respect to the state-of-the-art approaches that mayor may not incorporate privileged information.

Acknowledgments

This work was funded in part by the UH Hugh Roy andLillie Cranz Cullen Endowment Fund and by the EuropeanCommission (H2020-MSCA-IF-2014), under grant agree-ment No. 656094. All statements of fact, opinion or conclu-sions contained herein are those of the authors and shouldnot be construed as representing the official views or poli-cies of the sponsors.

References[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-

embedding for attribute-based classification. In Proc. IEEEComputer Society Conference on Computer Vision and Pat-tern Recognition, pages 819–826, Portland, OR, June 2013.2

[2] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould.Dynamic image networks for action recognition. In Proc.IEEE Computer Society Conference on Computer Vision andPattern Recognition, pages 3034–3042, Las Vegas, NV, June2016. 2

[3] C. M. Bishop. Pattern Recognition and Machine Learning.Springer, 2006. 5, 6

[4] A. Cherian, J. Mairal, K. Alahari, and C. Schmid. Mixingbody-part sequences for human pose estimation. In Proc.IEEE Computer Society Conference on Computer Vision andPattern Recognition, pages 2361–2368, Columbus, OH, June2014. 2

[5] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S.Huang. Semisupervised learning of classifiers: theory, algo-rithms, and their application to human-computer interaction.IEEE Transactions on Pattern Analysis and Machine Intelli-gence, 26(12):1553–1566, December 2004. 1

[6] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach,S. Venugopalan, K. Saenko, and T. Darrell. Long-term re-current convolutional networks for visual recognition anddescription. In Proc. IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition, pages 2625–2634, Boston, MA, June 2015. 2

[7] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutionaltwo-stream network fusion for video action recognition. InProc. IEEE Computer Society Conference on Computer Vi-sion and Pattern Recognition, pages 1933–1941, Las Vegas,NV, June 2016. 2

[8] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Attributelearning for understanding unstructured social activity. InProc. 12th European Conference on Computer Vision, vol-ume 7575 of Lecture Notes in Computer Science, pages 530–543, Firenze, Italy, October 2012. 4, 5, 6

[9] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor.Canonical correlation analysis: an overview with applica-tion to learning methods. Neural Computation, 16(12):2639–2664, Dec. 2004. 3

[10] M. Hoai and A. Zisserman. Talking heads: Detecting hu-mans and recognizing their interactions. In Proc. IEEE Com-puter Society Conference on Computer Vision and PatternRecognition, pages 875–882, Columbus, OH, June 2014. 2

[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar,and L. Fei-Fei. Large-scale video classification with con-volutional neural networks. In Proc. IEEE Computer Soci-ety Conference on Computer Vision and Pattern Recognition,pages 1725–1732, Columbus, OH, June 2014. 5

[12] N. Komodakis and G. Tziritas. Image completion usingefficient belief propagation via priority scheduling and dy-namic pruning. IEEE Transactions on Image Processing,16(11):2649–2661, November 2007. 2, 3

[13] S. Kotz and S. Nadarajah. Multivariate t-distributions andtheir applications. Cambridge University Press, Cambridge,2004. 3

[14] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning todetect unseen object classes by between-class attribute trans-fer. In Proc. IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, pages 951–958, Mi-ami Beach, Florida, June 2009. 2

[15] I. Laptev. On space-time interest points. International Jour-nal of Computer Vision, 64(2-3):107–123, September 2005.4

[16] I. Lillo, A. Soto, and J. C. Niebles. Discriminative hier-archical modeling of spatio-temporally composable humanactivities. In Proc. IEEE Computer Society Conference onComputer Vision and Pattern Recognition, pages 812–819,Columbus, OH, June 2014. 2

[17] J. Liu, B. Kuipers, and S. Savarese. Recognizing human ac-tions by attributes. In Proc. IEEE Computer Society Con-ference on Computer Vision and Pattern Recognition, pages3337–3344, Colorado Springs, CO, June 2011. 2

[18] D. Lopez-Paz, L. Bottou, B. Scholkopf, and V. Vapnik. Uni-fying distillation and privileged information. In Proc. Inter-national Conference on Learning Representations, San Jose,Puerto Rico, May 2016. 1

[19] D. G. Lowe. Distinctive image features from scale-invariantkeypoints. International Journal on Computer Vision,60(2):91–110, Nov. 2004. 4

[20] J. Nocedal and S. J. Wright. Numerical optimization.Springer series in operations research and financial engineer-ing. Springer, New York, NY, 2nd edition, 2006. 3

[21] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M.Mitchell. Zero-shot learning with semantic output codes. InProc. Advances in Neural Information Processing Systems,pages 1410–1418, Vancouver, British Columbia, Canada,December 2009. 4

[22] A. Patron-Perez, M. Marszalek, I. Reid, and A. Zisser-man. Structured learning of human interactions in TV shows.IEEE Transactions on Pattern Analysis and Machine Intelli-gence, 34(12):2441–2453, Dec. 2012. 2, 4, 5, 6

[23] D. Peel and G. J. Mclachlan. Robust mixture modelling us-ing the t distribution. Statistics and Computing, 10:339–348,2000. 1

[24] A. Quattoni, S. Wang, L. P. Morency, M. Collins, and T. Dar-rell. Hidden conditional random fields. IEEE Transactionson Pattern Analysis and Machine Intelligence, 29(10):1848–1852, 2007. 1, 3, 5, 6

[25] L. Rabiner and B. H. Juang. Fundamentals of Speech Recog-nition. Prentice-Hall, Upper Saddle River, NJ, 1993. 4

[26] V. Ramanathan, P. Liang, and L. Fei-Fei. Video event under-standing using natural language descriptions. In Proc. IEEEInternational Conference on Computer Vision, pages 905–912, Sydney, Australia, December 2013. 2

[27] V. Ramanathan, B. Yao, and L. Fei-Fei. Social role discoveryin human events. In Proc. IEEE Computer Society Confer-ence on Computer Vision and Pattern Recognition, Portland,OR, June 2013. 2

[28] N. Sarafianos, C. Nikou, and I. A. Kakadiaris. Predictingprivileged information for height estimation. In Proc. Inter-national Conference on Pattern Recognition, Cancun, Mex-ico, December 2016. 1

[29] C. Serra-Toro, V. J. Traver, and F. Pla. Exploring some prac-tical issues of svm+: Is really privileged information thathelps? Pattern Recognition Letters, 42(0):40–46, 2014. 6

[30] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learn-ing to rank using privileged information. In Proc. IEEE In-ternational Conference on Computer Vision, pages 825–832,Sydney, Australia, December 2013. 2, 5

[31] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara,A. Dehghan, and M. Shah. Visual tracking: an experimentalsurvey. IEEE Transactions on Pattern Analysis and MachineIntelligence, 36(7):1442–1468, 2014. 1

[32] Y. Song, L. P. Morency, and R. Davis. Multi-view latent vari-able discriminative models for action recognition. In Proc.IEEE Computer Society Conference on Computer Vision andPattern Recognition, Providence, Rhode Island, June 2012. 2

[33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.Learning spatiotemporal features with 3D convolutional net-works. In Proc. IEEE International Conference on ComputerVision, pages 4489–4497, Santiago, Chile, December 2015.2, 5, 6

[34] V. Vapnik and A. Vashist. A new learning paradigm: Learn-ing using privileged information. Neural Networks, 22(5–6):544–557, 2009. 1, 2, 4, 5, 6

[35] M. Vrigkas, C. Nikou, and I. A. Kakadiaris. Active privi-leged learning of human activities from weakly labeled sam-ples. In Proc. 23rd IEEE International Conference on Im-age Processing, pages 3036–3040, Phoenix, AZ, September2016. 2

[36] H. Wang and C. Schmid. Action recognition with improvedtrajectories. In Proc. IEEE International Conference onComputer Vision, pages 3551–3558, Sydney, Australia, De-cember 2013. 2, 5

[37] L. Wang, Y. Qiao, and X. Tang. Action recognition withtrajectory-pooled deep-convolutional descriptors. In Proc.IEEE Computer Society Conference on Computer Vision andPattern Recognition, pages 4305–4314, Boston, MA, June2015. 2

[38] Y. Wang and G. Mori. A discriminative latent model of ob-ject classes and attributes. In Proc. 11th European Confer-ence on Computer Vision: Part V, pages 155–168, Heraklion,Crete, Greece, 2010. 2

[39] Z. Wang and Q. Ji. Classifier learning with hidden informa-tion. In Proc. IEEE Computer Society Conference on Com-puter Vision and Pattern Recognition, pages 4969–4977,Boston, MA, June 2015. 2, 5

[40] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, andD. Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In Proc.IEEE Computer Society Conference on Computer Vision andPattern Recognition Workshops, pages 28–35, Providence,Rhode Island, June 2012. 2, 4, 5, 6

[41] W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao. A key vol-ume mining deep framework for action recognition. In Proc.

IEEE Computer Society Conference on Computer Vision andPattern Recognition, pages 1991–1999, Las Vegas, NV, June2016. 2

inferring human activities using robust privileged ... · inferring human activities using robust...

Documents