
Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation

Wei Feng1, Wentao Liu2, 1, Tong Li1, Jing Peng1, Chen Qian1, Xiaolin Hu2

1 SenseTime Group Ltd. 2 Department of Computer Science and Technology, Tsinghua University

{fengwei, litong, pengjing1, qianchen}@sensetime.com, [email protected], [email protected]

Abstract

Human-object interactions (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing the interacted objects. Meanwhile, human action and their interacted objects' localizations provide guidance for pose estimation. In this paper, we propose a turbo learning framework to perform HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks, i.e. a pose aware HOI recognition module and a HOI guided pose estimation module. Then, these two modules form a closed loop to utilize the complementary information iteratively, which can be trained in an end-to-end manner. The proposed method achieves the state-of-the-art performance on two public benchmarks including Verbs in COCO (V-COCO) and HICO-DET datasets.

Introduction

Human-object interactions (HOI) recognition (Gkioxari et al. 2017; Gupta et al. 2009; Yao and Fei-Fei 2010; Chen and Grauman 2014) aims to detect and recognize triplets in the form < human, action, object > from a single image. It has attracted increasing attention due to its wide applications in image understanding and human-computer interfaces. Although previous methods have made significant progress, it is still a challenging task due to the diversity of human activities and the complexity of backgrounds.

Most existing methods perform HOI recognition in two steps. First, a detector is employed to detect all objects in the image. Then the action and the interacted objects' locations are obtained mainly based on the human appearance. However, these methods are easily influenced by changes in human appearance. In contrast to raw human appearance, human pose contains structural information, which is a more robust reference for both action recognition and object localization. Though pose information also comes from appearance features, from the viewpoint of the HOI recognition task it provides an indirect way to encode appearance features. Considering pose estimation in HOI recognition is thus a natural and intuitive idea. Gupta and Malik (2015) propose a dataset named Verbs in COCO (V-COCO), in which both kinds of information are labeled at the same time.


However, most existing methods treat HOI recognition and pose estimation as separate tasks, which ignores the information reciprocity between the two tasks.

Constructing a closed loop between HOI recognition and pose estimation may bring improvement to each task, as shown in Fig 1. On one hand, human pose can improve the robustness of HOI recognition. For action recognition, human pose offers an accurate analysis of human structure rather than coarse information of human body appearance, thus reducing the impact of changes in human appearance. For target localization, the locations of some keypoints can reveal the location and direction of the interacted objects. For example, the keypoint of the wrist is very close to the object in the action hold. Therefore, human pose is a powerful cue for recognizing actions and localizing the interacted objects. On the other hand, human action can in turn guide the distribution of keypoints. As a result, the relative positions between the keypoints become clearer. Meanwhile, the localization of the interacted objects can help localize specific keypoints. So human action and the localization of the interacted objects provide guidance for pose estimation.

In this paper, we propose a turbo learning method to simultaneously perform HOI recognition and pose estimation. Specifically, a pose aware HOI recognition module is used to perform HOI recognition based on both image and pose features, which can reduce the influence of changes in human appearance. Meanwhile, a HOI guided pose estimation module is used to perform pose estimation with a clear relative distribution of keypoints, in which HOI features are encoded into spatial location information. Then these two modules form a closed loop, in a similar way to an engine turbocharger, which feeds the output back to the input to reuse the exhaust gas for better engine efficiency. The feedback process can gradually improve the results of both tasks. To the best of our knowledge, this is the first unified end-to-end trainable network for simultaneous HOI recognition and pose estimation.

Our contributions are three-fold: (1) A pose aware HOI recognition module is proposed, in which the human pose can help to extract more accurate human structure information than the raw appearance features. (2) A HOI guided pose estimation module is introduced, where HOI recognition features and image features form an attention mask for pose estimation, which transforms the pattern of human action into the general distribution of keypoints. (3) A turbo learning framework is proposed to sequentially perform HOI recognition and pose estimation using the two modules, which can gradually improve the results on both tasks. The proposed method achieves the state-of-the-art results on two public benchmarks including V-COCO and HICO-DET (Chao et al. 2017).



Figure 1: HOI recognition and pose estimation can help each other. People doing the same action, like running, have similar keypoint distributions. When there is a prior action pattern, a wrong pose estimate might be corrected. Meanwhile, when there are several objects in the estimated action-type specific density area (red area), the model will be confused in finding the target object (green box). However, some keypoints like the wrist can refine the red area and help the model localize the interacted object successfully.


Related Work

HOI Recognition

HOI recognition is not a new problem in computer vision (Herath et al. 2017). Traditional methods (Hu et al. 2013; Delaitre et al. 2010; Pishchulin et al. 2014) usually use many contextual elements in images to predict action categories, including human pose, manipulated objects, scene, and other people in images (Gupta et al. 2009; Maji et al. 2011; Desai and Ramanan 2012). Likewise, deep methods also use one or several of these elements to recognize HOIs, but most of them only consider human appearance to be the key cue for HOI recognition. For example, Mallya and Lazebnik (2016) fuse CNN-based human appearance features and global context features to achieve the state-of-the-art performance on predicting HOI labels; Gkioxari et al. (2017) use merely human features to achieve the state-of-the-art performance in HOI recognition. Beyond that, Chao et al. (2017) use spatial relations between human and object positions to recognize HOIs. Shen et al. (2018) focus on the difficulty of obtaining all the possible HOI samples in reality, and propose a zero-shot learning method to tackle the lack-of-data problem.

Combination of Action Recognition and Pose Estimation

There exists an intrinsic connection between human actions and human poses (Wei et al. 2016; Liu et al. 2018; Cao et al. 2017; Ning et al. 2017; Luvizon et al. 2017; Xiaohan Nie et al. 2015). Different people may have different skin colors, appearances and clothes, but their poses are similar when they are doing the same action, due to the homogeneity of the human body. Intuitively, adding pose information should be beneficial for deep networks to recognize HOI categories, but relevant research is scarce, especially in deep learning. Early on, Yao and Fei-Fei (2010) use a random field model to encode the mutual context of human pose and objects, but their method needs to generate a set of models, each modeling one type of human pose in one action class. Desai and Ramanan (2012) propose a compositional model that uses human pose and interacting objects to predict human actions, but the visual phraselets and tree structure they use are too simple to capture sophisticated HOI relations in large datasets. In connection with neural networks, Shen et al. (2018) concatenate pre-computed pose features and appearance features to improve model performance on HOI recognition. Recently, Luvizon et al. (2018) design a single architecture for joint 2D and 3D pose estimation from still images and action recognition from video sequences. Similar to their idea, our method combines the two tasks in one end-to-end trainable framework. On top of that, we adopt a novel turbo learning framework that leads to better results on each sub-task.


Figure 2: (a) Overview of the proposed method. (b) Unfolded diagram of the turbo learning framework along time. (c) The architecture of the pose aware HOI recognition module. (d) The architecture of the HOI guided pose estimation module.

Method

An overview of our framework is illustrated in Fig 2(a). We propose an end-to-end turbo learning framework to integrate the two different tasks. After extracting the human appearance features with the stem network, the pose aware HOI recognition module is applied to predict HOI with both human appearance features and human pose features. Then, the proposed HOI guided pose estimation module updates the human pose result depending on the HOI recognition result. The two processes repeat several times. The two modules form a closed loop to gradually improve the results of both pose estimation and HOI recognition. The closed loop can be unfolded along the time sequence as shown in Fig 2(b).

Pose Aware HOI Recognition

Inspired by (Gkioxari et al. 2017), we decompose HOI recognition into two subtasks: action recognition and target localization. We treat the action recognition task as a multi-label task, since a person can simultaneously perform multiple actions (e.g., sit and hold). To avoid competition between classes, a set of binary sigmoid classifiers is used for multi-label action classification as in (Gkioxari et al. 2017). Hence, the action recognition model outputs a confidence score $s_a$ for action $a$. Normally, the action score $s_a$ is calculated only based on the human appearance features $F$, which can be written as $s_a = \mathrm{sigmoid}(I(F))$, where $I(\,\cdot\,)$ refers to the fully connected transformation of the feature maps.
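
As an illustration, the sketch below implements such a multi-label action head in PyTorch; the layer sizes, the class count and the name ActionHead are our own assumptions rather than the authors' released configuration.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Multi-label action classifier: one independent sigmoid per action.

    Hypothetical sketch; the 1024-d input and 26 V-COCO actions are assumptions.
    """
    def __init__(self, in_dim=1024, num_actions=26):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_actions)  # plays the role of I(.)

    def forward(self, human_features):
        # s_a = sigmoid(I(F)): independent binary classifiers avoid softmax
        # competition between co-occurring actions (e.g. sit and hold)
        return torch.sigmoid(self.fc(human_features))

# Training would use a per-action binary cross-entropy loss:
scores = ActionHead()(torch.randn(2, 1024))        # shape (2, 26), values in [0, 1]
labels = torch.randint(0, 2, (2, 26)).float()      # dummy multi-hot action labels
loss = nn.functional.binary_cross_entropy(scores, labels)
```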

Previous methods predict the two subtasks with appearance features extracted from the human bounding box. However, we notice that the human region contains a lot of background information, and such coarse human features make HOI recognition easily influenced by changes in appearance.

Instead of only considering the human appearance features, we argue that human pose estimation provides a detailed analysis of human structure and can offer a robust pose prior for HOI recognition.

Therefore, we design two types of pose aware HOI modules to integrate human pose constraints. The first one is simple multi-task training, in which we extend the HOI recognition branch with a pose estimation branch. Note that these two tasks only share the stem block. In the second design, instead of integrating human pose information implicitly, we further encode the human pose features as input to the HOI recognition task, as shown in Fig 2(c). In order to make full use of the pose information, the pose estimation features $f_k$ consist of two parts: the intermediate features $P_{mid}$ from the eighth convolution layer of the pose estimation branch, and the pose estimation branch's output $P_{out}$. The feature $P_{mid}$ contains rich information about human pose and is abstract enough to encode all body part locations. The estimation result $P_{out}$ provides the exact structure of the human body. Furthermore, the HOI features from the previous stage also contribute to the HOI recognition in this stage, which will be discussed later. Therefore, the concatenation of the pose estimation features, the human appearance features $F$ and the HOI features from the previous stage is used as the input $h$ of this module, which can be written as:

$h = I([f_k^{n-1}, F, H_{mid}^{n-1}, H_{out}^{n-1}])$,  (1)

where $[\,\cdot\,]$ refers to the concatenation of the feature maps. $H_{mid}^{n-1}$ and $H_{out}^{n-1}$ represent the intermediate features and recognition results of the previous HOI recognition module respectively.
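
The sketch below illustrates this concatenation for a single region proposal; all feature dimensions and the helper name pose_aware_hoi_input are hypothetical, chosen only to make the example self-contained.

```python
import torch
import torch.nn as nn

def pose_aware_hoi_input(F, f_k_prev, H_mid_prev, H_out_prev, fc):
    """Eq. (1): h = I([f_k^{n-1}, F, H_mid^{n-1}, H_out^{n-1}]).

    All inputs are assumed to be flattened per-ROI feature vectors with the
    same batch size; `fc` plays the role of the fully connected mapping I(.).
    """
    return fc(torch.cat([f_k_prev, F, H_mid_prev, H_out_prev], dim=1))

# Hypothetical dimensions for one region proposal:
F          = torch.randn(1, 1024)   # appearance features from ROIAlign
f_k_prev   = torch.randn(1, 512)    # pose features (P_mid and P_out) of the previous stage
H_mid_prev = torch.randn(1, 512)    # intermediate HOI features of the previous stage
H_out_prev = torch.randn(1, 29)     # HOI recognition output of the previous stage
fc = nn.Linear(1024 + 512 + 512 + 29, 1024)
h = pose_aware_hoi_input(F, f_k_prev, H_mid_prev, H_out_prev, fc)   # shape (1, 1024)
```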

For the target localization task, each action predicts the location of the associated object. To enforce the prior that the interacting objects are around the human body, the position of each object is represented relative to the human proposal as $\hat{b}_{o|h}$:

$\hat{b}_{o|h} = \left\{\frac{x_o - x_h}{w_h}, \frac{y_o - y_h}{h_h}, \log\frac{w_o}{w_h}, \log\frac{h_o}{h_h}\right\}$,  (2)

where $w_o$, $h_o$ indicate the width and height of the ground truth target object, and $w_h$, $h_h$ denote the width and height of the human region proposal (i.e., one whose IoU overlap with the ground-truth box is $\geq$ 0.5). However, the human region proposal and the target object may vary in size, so we use the smooth L1 loss, as it is less sensitive to outliers. Note that actions without interacting objects do not produce a loss for this task.

Predicting the precise location is a challenging task. For example, the target object usually does not appear in the region proposal $b_h$ for the action throw. Therefore, in the test phase, the target localization branch estimates a probability of the object location instead of the precise position. We combine the predicted probability $\mu_a$ and the detected object positions $b_{o|h}$ from the object detection branch to precisely localize the target. Specifically, given the human box $b_h$ and action $a$, the target localization branch outputs the relative position of the associated object $\mu_a$; then the target localization probability $g_{h,o}^a$ can be written as:

$g_{h,o}^a = \exp(-\|b_{o|h} - \mu_a\|^2 / 2\sigma^2)$,  (3)

where $\sigma$ is a hyperparameter to constrain the search region; we set it to 0.3 in the experiments. The detected object position $b_{o|h}$ is also encoded as in Equation 2. After that, the final localization of the object $\hat{b}_{o|h}$ is obtained by $\hat{b}_{o|h} = \arg\max_{b_{o|h}} g_{h,o}^a$.
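
The following sketch spells out how Equations 2 and 3 could be combined at test time; the box format (x, y, w, h) and the helper names are assumptions made purely for illustration.

```python
import math

def encode_relative(box_o, box_h):
    """Eq. (2): encode an object box relative to the human proposal.

    Boxes are assumed to be (x, y, w, h) tuples; not the authors' code.
    """
    xo, yo, wo, ho = box_o
    xh, yh, wh, hh = box_h
    return ((xo - xh) / wh, (yo - yh) / hh,
            math.log(wo / wh), math.log(ho / hh))

def localization_prob(b_rel, mu_a, sigma=0.3):
    """Eq. (3): g^a_{h,o} = exp(-||b_{o|h} - mu_a||^2 / (2 * sigma^2))."""
    sq_dist = sum((b - m) ** 2 for b, m in zip(b_rel, mu_a))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def localize_target(detected_boxes, box_h, mu_a):
    """Pick the detected object whose relative encoding best matches mu_a."""
    return max(detected_boxes,
               key=lambda box_o: localization_prob(encode_relative(box_o, box_h), mu_a))
```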

HOI Guided Pose Estimation

The previous module has predicted the action confidence and the position of the associated object; we then propose a HOI guided pose estimation module to improve the prediction of human pose. However, the results of action recognition and target localization are predicted by a fully connected network, and much of the spatial structure is lost. To encode the HOI features into a spatial constraint, which is important for human pose estimation, we propose a HOI attention mask to refine keypoint detection.

The architecture of the HOI guided pose estimation module is shown in Fig 2(d). In this module, an attention mask is generated from the HOI features, and then the attention mask is applied to the keypoint features. The modulated features are utilized to generate refined human pose estimation results. The HOI features $f_a$ include both the intermediate features $H_{mid}$ and the recognition output $H_{out}$, where $H_{out}$ consists of the results of action recognition and target localization, and $H_{mid}$ contains implicit information about human action patterns. More formally, we define the attention mask $Att_{action}$ as in Equation 4, and use $R(\,\cdot\,)$ to indicate the reshape operation:

$Att_{action} = R(\mathrm{sigmoid}(I([f_a])))$,  (4)

After that, the modulated keypoint features $p$ are obtained as the element-wise multiplication of the attention mask and the pose estimation features of the previous stage $f_k^{n-1}$. Specifically, given the attention mask $Att_{action}$, we generate the input feature maps $p$ as follows:

$p = Att_{action} \cdot f_k^{n-1}$.  (5)

Then we feed the modulated features $p$ into the keypoint detection block to predict the human pose estimation result in stage $n$. After the attention mask filtering, some falsely located keypoints are corrected, as shown in the experiments. The keypoint detection block consists of sequential 3×3 512-d convolution layers, one deconvolution layer, one upsampling layer and a final convolution layer which projects 512 channels of feature maps into K masks (K is the number of keypoints). We model the location of the ground truth as a one-hot mask as in (He et al. 2017). Specifically, the ground truth mask is a one-hot m × m binary mask where only the pixel at the ground truth location is foreground. Therefore, we regard the pose estimation task as an m × m-way classification task, so the training objective is to minimize the cross-entropy loss over an m × m-way softmax output. Similar to the HOI recognition module, the pose estimation module only calculates the loss for region proposals whose overlap with the ground truth human box $b_h$ exceeds the overlap threshold. Meanwhile, invisible keypoints do not contribute to the loss.
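
A minimal PyTorch-style sketch of this attention modulation is shown below; the channel count, the 14×14 spatial size and the class name HOIGuidedAttention are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class HOIGuidedAttention(nn.Module):
    """Eqs. (4)-(5): HOI features generate an attention mask that modulates
    the pose features of the previous stage.

    Illustrative sketch; feature sizes are assumptions.
    """
    def __init__(self, hoi_dim=1024, channels=512, size=14):
        super().__init__()
        self.channels, self.size = channels, size
        # I(.) followed by the reshape R(.) into a spatial mask
        self.fc = nn.Linear(hoi_dim, channels * size * size)

    def forward(self, f_a, f_k_prev):
        # Att_action = R(sigmoid(I([f_a])))
        att = torch.sigmoid(self.fc(f_a)).view(-1, self.channels, self.size, self.size)
        # p = Att_action * f_k^{n-1}: element-wise modulation of pose features
        return att * f_k_prev

# f_a: concatenated HOI features; f_k_prev: pose feature map of the previous stage
p = HOIGuidedAttention()(torch.randn(1, 1024), torch.randn(1, 512, 14, 14))
```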

Turbo Learning Framework

Better pose estimation results lead to better HOI recognition results, and better HOI recognition results lead to more robust pose estimation. To utilize this flow of complementary information iteratively, we introduce a turbo learning framework, which can be unfolded as a sequence of pose aware HOI recognition modules and HOI guided pose estimation modules along the time steps. Among them, a pose aware HOI recognition module and a HOI guided pose estimation module form a stage, as shown in Fig 2(b). On the one hand, the pose features provide the subsequent HOI recognition module an expressive non-parametric encoding of each keypoint, allowing the HOI recognition module to learn explicit human appearance and the locations of the keypoints that need to be focused on. On the other hand, the HOI features provide the subsequent pose estimation module a comprehensive encoding of an action pattern from all action classes, and the general scope of some keypoints. As a result, each stage of the turbo learning framework increasingly refines the action classification results, the locations of target objects and each keypoint. This framework is fully differentiable and can therefore be trained end-to-end using backpropagation. The whole loss function can be written as:

$L = \sum_{i=1}^{N} (\lambda_{pose}^i L_{pose}^i + \lambda_{HOI}^i L_{HOI}^i) + L_{det}$,  (6)

where $L_{det}$ is the loss function for the detection task, $L_{HOI}^i$ and $L_{pose}^i$ are the losses for HOI recognition and pose estimation in stage $i$, $\lambda_{pose}^i$ and $\lambda_{HOI}^i$ are the hyper-parameters to control the balance in stage $i$, and $N$ is the total number of stages.

In each stage of the turbo learning framework, the input consists of three parts: the human appearance features $F$, the HOI features $f_a^{n-1}$ and the pose features $f_k^{n-1}$. Among them, the latter two parts are produced in the previous stage. Specifically, the set of input feature maps $t_n$ in stage $n$ ($n < N - 1$) can be written as:

$t_n = \{F, f_k^{n-1}, f_a^{n-1}\}$,  (7)

where $f_k^{n-1}$ and $f_a^{n-1}$ denote the pose features and HOI features in the $(n-1)$-th stage respectively. The $n$-th stage not only outputs the recognition results $[s_a^n, \mu_a^n]$ and the human pose heatmap $k^n$, but also the feature maps for the $(n+1)$-th stage, which contain the pose features $f_k^n$ and the HOI features $f_a^n$.
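
To make the staging concrete, the sketch below unrolls the loop and accumulates the loss of Equation 6; hoi_module and pose_module are placeholders for the two proposed modules, and all names are hypothetical.

```python
def turbo_forward(F, hoi_module, pose_module, num_stages=3):
    """Unroll the turbo loop: each stage consumes t_n = {F, f_k^{n-1}, f_a^{n-1}}
    (Eq. 7) and emits refined HOI and pose predictions.

    `hoi_module` and `pose_module` stand in for the two proposed modules.
    """
    f_k, f_a = None, None            # no pose / HOI features before stage 1
    outputs = []
    for n in range(num_stages):
        s_a, mu_a, f_a = hoi_module(F, f_k, f_a)   # pose aware HOI recognition
        heatmap, f_k = pose_module(F, f_a, f_k)    # HOI guided pose estimation
        outputs.append((s_a, mu_a, heatmap))
    return outputs

def total_loss(stage_losses, det_loss, lambdas):
    """Eq. (6): L = sum_i (lambda_pose^i * L_pose^i + lambda_HOI^i * L_HOI^i) + L_det.

    `stage_losses` holds (L_pose^i, L_HOI^i) pairs and `lambdas` the matching weights.
    """
    return det_loss + sum(lp * Lp + lh * Lh
                          for (Lp, Lh), (lp, lh) in zip(stage_losses, lambdas))
```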

Experiment

Datasets and Metrics

V-COCO: V-COCO is a subset of COCO (Lin et al. 2014), which is commonly used for HOI recognition. This dataset includes ∼5k images in the trainval set and ∼5k images in the test set. It annotates 17 keypoints of the human body and 26 common action classes. Three actions (cut, hit, eat) are annotated with two types of targets: instrument and direct object, which means that the action is associated with two objects. For example, a man cuts the pizza with a knife. As accuracy is evaluated separately for the two types of targets, we extend the 26 action classes to 29 classes, the same as in (Gkioxari et al. 2017). For evaluation, we adopt the two Average Precision (AP) metrics as in (Gupta and Malik 2015). The 'agent AP' ($AP_{agent}$) evaluates the AP of the pair < human, action >. Note that $AP_{agent}$ does not require localizing the target, so we pay more attention to $AP_{role}$, which evaluates the AP of the triplet < human, verb, object >.

HICO-DET: the HICO-DET dataset is an extension of the "Humans Interacting with Common Objects" (HICO) dataset (Chao et al. 2015). In HICO-DET, the annotation of each HOI includes a human bounding box and an object bounding box with a class label. HICO-DET contains about 48k images, ∼38k for training and ∼9k for testing. It includes 600 interaction types, 80 unique object types that are identical to the COCO categories, and 117 unique verbs. The official evaluation code of HICO-DET reports the mean AP over the Full test set (all 600 HOI categories), the Rare test set (only HOIs with fewer than 10 training instances) and the Non-Rare test set (only HOIs with 10 or more training instances).

Implementation Details

Our implementation is based on Faster R-CNN (Ren et al. 2015) built on ResNet-101 (He et al. 2016), and the Region Proposal Network (RPN) is frozen and does not share features with our framework, for convenient ablation. We extract 7 × 7 features from region proposals by ROIAlign. The HOI recognition branch consists of two 1024-d fully connected layers followed by dropout layers. The object detection branch consists of 3 residual blocks followed by task-specific output layers, and the pose estimation branch consists of 8 convolution layers followed by 2 upsampling layers (deconv, interp, conv) and an output layer (keypoint). The weight decay is 0.0001 and the momentum is 0.9. We use synchronized SGD on 8 GPUs, with each GPU hosting 1 image (the effective mini-batch size per iteration is 8 images). We train the network for 10k iterations with a learning rate of 0.001 and an additional 3k iterations with a learning rate of 0.0001.
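
For reference, the reported optimization settings could be expressed as follows in PyTorch; this is only a sketch of the schedule under the assumption of a standard SGD setup, not the authors' training script.

```python
import torch

# Reported settings: synchronized SGD on 8 GPUs (1 image each), momentum 0.9,
# weight decay 1e-4, 10k iterations at lr 0.001 followed by 3k at lr 0.0001.
params = [torch.nn.Parameter(torch.randn(2, 2))]   # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10000], gamma=0.1)

for iteration in range(13000):
    # loss = model(batch); loss.backward()  (omitted in this sketch)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```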

For V-COCO, RPN and the object detection branch are initialized by a Faster R-CNN model pre-trained on MS-COCO; then we train on the V-COCO trainval (5k images) split and report results on the test (5k images) split. We fine-tune the object detection branch and train the other branches from scratch.

The objects in HICO-DET are not exhaustively annotated. We adopt the same measure as in (Gkioxari et al. 2017), using a ResNet50-FPN object detector pre-trained on COCO to detect objects in HICO-DET, and the object detection branch is kept frozen during training. Due to the lack of pose annotations in HICO-DET, the keypoint detection branch cannot be trained either. Thus, we implement two strategies to adapt to the situation where only the HOI annotation is available. The details of these two strategies will be discussed later.

Model Performance Analysis

In order to demonstrate the effectiveness of the turbo learning framework, we conduct some comparison experiments on the V-COCO dataset. We also compare two strategies on the HICO-DET dataset, where only the annotation of HOI is available.

HOI Recognition with vs. without Pose Estimation. The pose features are important cues for recognizing actions and localizing the interacted objects. Without pose features, action recognition may be misled by the human appearance. Meanwhile, target localization may be confused when there are nearby objects. To demonstrate the importance of pose estimation to HOI recognition, we evaluate a variant of our method which has only the HOI recognition module and a single stage. The results are shown in Fig 3(a) and Fig 3(b). Removing the pose estimation branch shows a degradation of 0.5% in $AP_{agent}$ and 1.1% in $AP_{role}$, which demonstrates the contribution of pose estimation to HOI recognition. To demonstrate the benefits of regarding pose features as input, we also report the results of simple multi-task training in Fig 3(a) and Fig 3(b). Multi-task training means that, when the number of stages is one, the pose estimation branch and the HOI recognition branch are jointly learned, but the pose features such as $P_{mid}$ and $P_{out}$ are not input to the HOI recognition branch. Using explicit input shows an improvement of 0.2% in $AP_{agent}$ and 0.8% in $AP_{role}$, which demonstrates the benefits of explicit input.

We also show some qualitative results in Fig 4(a) and Fig 4(b). In Fig 4(a), the action is easily classified as eating because of the presence of the pizza, but actually the pizza is far from the keypoint of the face. With the pose features, the action is correctly classified as cutting. In Fig 4(b), the estimated area is in front of the human body. However, with the pose features the estimated area is near the left wrist, which helps the model successfully locate the object.


Figure 3: Recognition results on the V-COCO test set. From left to right: $AP_{agent}$, $AP_{role}$ for HOI recognition and $AP_{50}^{kp}$ for pose estimation.


Figure 4: Examples showing that pose estimation helps HOI recognition (a, b) and that HOI recognition helps pose estimation (c, d). In each group of figures, from left to right: original recognition results, the other task's recognition results, and recognition results with the other task's features. (c, d) only show the keypoints to be improved.


Figure 5: HOI recognition and pose estimation results. (a) Each image shows one detected < human, verb, object > triplet. (b) One person can take several actions and have several interacted objects. (c) Our method can detect multiple persons who take several actions and have several interacted objects.

Table 1: Results on HICO-DET test set.

Method                                  Full    Rare    Non-Rare
Gupta & Malik (Gupta and Malik 2015)    7.81    5.37    8.54
InteractNet (Gkioxari et al. 2017)      9.94    7.16    10.77
Proposed (Pre-train)                    10.9    6.5     12.2
Proposed (Semiautomatic)                11.4    7.3     12.6

Table 2: Results on V-COCO test set.

Method                                  $AP_{agent}$    $AP_{role}$
Gupta & Malik (Gupta and Malik 2015)    65.1            31.8
InteractNet (Gkioxari et al. 2017)      69.2            40.0
Proposed                                70.3            42.0

Pose Estimation with vs. without HOI Recognition. To demonstrate that HOI can guide the keypoint distribution, we also evaluate a variant of our method which removes the HOI recognition branch, so the pose estimation task only relies on the image features extracted by the ROIAlign layer. The pose estimation branch is the same as the Mask R-CNN (He et al. 2017) architecture, so "w/o HOI recognition" is actually the result of the Mask R-CNN baseline, and "RMPE" is the result of the RMPE (Fang et al. 2017) baseline. Note that in order to perform HOI recognition, we need to detect not only the person but also other kinds of objects. Fig 3(c) shows that removing the HOI recognition task leads to a 1% drop in $AP_{50}^{kp}$, indicating that the HOI recognition features provide guidance on the relative distribution of keypoints.

We show two examples of improvement in keypoint detection in Fig 4. In Fig 4(c), the legs are easily missed when people ride a horse. However, people performing the action "ride" have similar keypoint distributions. With the HOI recognition features, the keypoints are correctly detected. In Fig 4(d), the location of the baseball provides guidance for the keypoints of the wrists, which improves the keypoint localization.

Benefits of Turbo Learning Framework. The turbo learning framework aims to exploit the cooperation of the two tasks and refine both results. To demonstrate its benefits, we show the recognition results from stage 1 to stage 3 in Fig 3. From stage 1 to stage 2, the improvements of $AP_{agent}$, $AP_{role}$ and $AP_{50}^{kp}$ are 1.7%, 0.2% and 0.4% respectively. From stage 2 to stage 3, the improvements of $AP_{agent}$, $AP_{role}$ and $AP_{50}^{kp}$ are 0.9%, 1.3% and 0.8% respectively. As the number of stages increases, the results of both HOI recognition and pose estimation are improved.

Pre-training vs. Semiautomatic Annotation. There are not many datasets where both annotations are available; the HICO-DET dataset, for example, lacks pose annotations. We implement two strategies to cope with this. The first strategy is pre-training, in which we pre-train the model on a dataset that has both annotations. Then we finetune the network from the model pre-trained on V-COCO and remove the keypoint loss. Still, we allow the weights in the keypoint detection branch to be updated following gradients produced by the HOI recognition loss, and we find this yields better results. The second strategy is semiautomatic annotation, in which we use the model proposed in (Newell et al. 2017) to semi-automatically annotate the keypoints on the HICO-DET dataset. The model used is trained on the COCO dataset; the strategy is readily applicable with other open source pose detectors such as Mask R-CNN. We then take the keypoint detection results as annotations for training our model on the HICO-DET dataset.
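
The semiautomatic annotation strategy is essentially a pseudo-labeling step; a minimal sketch is given below, where pose_detector is a placeholder for any COCO-trained keypoint model and the confidence threshold is our own assumption.

```python
def pseudo_label_keypoints(images, pose_detector, score_thresh=0.5):
    """Run a COCO-trained pose detector on HICO-DET images and keep its
    keypoint predictions as training targets for the keypoint branch.

    `images` is assumed to yield (image_id, image) pairs; `pose_detector`
    returns a list of (keypoints, score) per detected person. Hypothetical
    helper, not the authors' pipeline.
    """
    annotations = {}
    for image_id, image in images:
        people = pose_detector(image)
        annotations[image_id] = [kpts for kpts, score in people
                                 if score >= score_thresh]
    return annotations
```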

In Table 1, we show the comparison of the two strategies. The experiment on HICO-DET in (Gkioxari et al. 2017) has similar settings to ours, except that we do not use the Feature Pyramid Network (FPN) (Lin et al. 2017) and the interaction branch of (Gkioxari et al. 2017). So we adopt their results as a solid baseline. Table 1 shows that pre-training on a dataset having both annotations improves the HOI recognition results, mainly because the precise human body structure information learned on V-COCO helps the model to be more robust to changes in human appearance. The last row of Table 1 shows that semiautomatic annotation can further improve the HOI recognition results, which verifies that our model is also applicable to cases where HOI and keypoint annotations are not simultaneously available.

Comparison with the State-of-the-art Methods

As shown in Tables 1 and 2, the proposed method achieves the state-of-the-art performance on HICO-DET and V-COCO. As (Gupta and Malik 2015) only reported $AP_{role}$ on a subset that consists of 19 actions and only evaluated on the val set of V-COCO, we adopt the solid baseline implemented in (Gkioxari et al. 2017). It is worth noting that our method does not use the FPN and interaction branch of (Gkioxari et al. 2017), but the final $AP_{role}$ still exceeds theirs by 2% on V-COCO, which further proves the effectiveness of our method. Although the keypoints of HICO-DET are obtained by semiautomatic annotation rather than ground truth, our method still outperforms InteractNet by 1.5%.

We also show some HOI recognition and pose estimation results of the proposed model. In Fig 5(a), we only show one triplet < human, action, object > per image. The results of one person with several actions are shown in Fig 5(b), and results of multiple persons are shown in Fig 5(c). These results show that our model can handle different HOI cases.

Conclusion

We propose a turbo learning method to perform both HOI recognition and pose estimation. As these two tasks can provide guidance to each other, we introduce two novel modules: a pose aware HOI recognition module and a HOI guided pose estimation module, in which each task's features are also treated as part of the input to the other task. These two modules form a closed loop to utilize complementary information iteratively, and are trained end-to-end. As the number of iterations increases, both results are improved gradually. The proposed method has achieved the state-of-the-art performance on the V-COCO and HICO-DET datasets.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61332007 and 61621136008.

References

[Cao et al. 2017] Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, 7.

[Chao et al. 2015] Chao, Y.-W.; Wang, Z.; He, Y.; Wang, J.; and Deng, J. 2015. Hico: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, 1017–1025.

[Chao et al. 2017] Chao, Y.-W.; Liu, Y.; Liu, X.; Zeng, H.; and Deng, J. 2017. Learning to detect human-object interactions. arXiv preprint arXiv:1702.05448.

[Chen and Grauman 2014] Chen, C.-Y., and Grauman, K. 2014. Predicting the location of interactees in novel human-object interactions. In Proceedings of the Asian Conference on Computer Vision, 351–367. Springer.

[Delaitre et al. 2010] Delaitre, V.; Laptev, I.; and Sivic, J. 2010. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In Proceedings of the British Machine Vision Conference, 1–11.

[Desai and Ramanan 2012] Desai, C., and Ramanan, D. 2012. Detecting actions, poses, and objects with relational phraselets. In Proceedings of the European Conference on Computer Vision, 158–172. Springer.

[Fang et al. 2017] Fang, H.; Xie, S.; Tai, Y.-W.; and Lu, C. 2017. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, volume 2.

[Gkioxari et al. 2017] Gkioxari, G.; Girshick, R.; Dollar, P.; and He, K. 2017. Detecting and recognizing human-object interactions. arXiv preprint arXiv:1704.07333.

[Gupta and Malik 2015] Gupta, S., and Malik, J. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474.

[Gupta et al. 2009] Gupta, A.; Kembhavi, A.; and Davis, L. S. 2009. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 31, 1775–1789. IEEE.

[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

[He et al. 2017] He, K.; Gkioxari, G.; Dollar, P.; and Girshick, R. 2017. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988. IEEE.

[Herath et al. 2017] Herath, S.; Harandi, M.; and Porikli, F. 2017. Going deeper into action recognition: A survey. Image and Vision Computing, volume 60, 4–21. Elsevier.

[Hu et al. 2013] Hu, J.-F.; Zheng, W.-S.; Lai, J.; Gong, S.; and Xiang, T. 2013. Recognising human-object interaction via exemplar based modelling. In Proceedings of the IEEE International Conference on Computer Vision, 3144–3151. IEEE.

[Lin et al. 2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, 740–755. Springer.

[Lin et al. 2017] Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4.

[Liu et al. 2018] Liu, W.; Chen, J.; Li, C.; Qian, C.; Chu, X.; and Hu, X. 2018. A cascaded inception of inception network with attention modulated feature fusion for human pose estimation. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Luvizon et al. 2017] Luvizon, D. C.; Tabia, H.; and Picard, D. 2017. Human pose regression by combining indirect part detection and contextual information. arXiv preprint arXiv:1710.02322.

[Luvizon et al. 2018] Luvizon, D. C.; Picard, D.; and Tabia, H. 2018. 2d/3d pose estimation and action recognition using multitask deep learning. arXiv preprint arXiv:1802.09232.

[Maji et al. 2011] Maji, S.; Bourdev, L.; and Malik, J. 2011. Action recognition from a distributed representation of pose and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3177–3184. IEEE.

[Mallya and Lazebnik 2016] Mallya, A., and Lazebnik, S. 2016. Learning models for actions and person-object interactions with transfer to question answering. In Proceedings of the European Conference on Computer Vision, 414–428. Springer.

[Newell et al. 2017] Newell, A.; Huang, Z.; and Deng, J. 2017. Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, 2277–2287.

[Ning et al. 2017] Ning, G.; Zhang, Z.; and He, Z. 2017. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Transactions on Multimedia. IEEE.

[Pishchulin et al. 2014] Pishchulin, L.; Andriluka, M.; and Schiele, B. 2014. Fine-grained activity recognition with holistic and pose based features. In Proceedings of the German Conference on Pattern Recognition, 678–689. Springer.

[Ren et al. 2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.

[Shen et al. 2018] Shen, L.; Yeung, S.; Hoffman, J.; Mori, G.; and Fei-Fei, L. 2018. Scaling human-object interaction recognition through zero-shot learning. In 2018 IEEE Winter Conference on Applications of Computer Vision, 1568–1576. IEEE.

[Wei et al. 2016] Wei, S.-E.; Ramakrishna, V.; Kanade, T.; and Sheikh, Y. 2016. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4724–4732.

[Xiaohan Nie et al. 2015] Xiaohan Nie, B.; Xiong, C.; and Zhu, S.-C. 2015. Joint action recognition and pose estimation from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1293–1301.

[Yao and Fei-Fei 2010] Yao, B., and Fei-Fei, L. 2010. Modeling mutual context of object and human pose in human-object interaction activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 17–24. IEEE.