
Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions

Peratham Wiriyathammabhum♠♦, Abhinav Shrivastava♠♦, Vlad I. Morariu♦, Larry S. Davis♠♦

University of Maryland: Department of Computer Science♠, UMIACS♦
[email protected], [email protected], [email protected], [email protected]

Abstract

This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization that enables us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion. We show that the motion modules help to ground motion-related words and also help the appearance modules learn, because modular neural networks resolve task interference between modules. Finally, we identify a future challenge and the need for a robust system that arises when ground truth visual annotations are replaced with an automatic video object detector and temporal event localization.

1 Introduction

Localizing referring expressions in videos involves both static and dynamic information. A referring expression (Dale and Reiter, 1995; Roy and Reiter, 2005) is a linguistic expression that grounds its meaning to a specific referent object in the world. The input video can be very long, have unknown length, contain many objects from the same class, or contain similar actions and interactions throughout the video. A successful, grounded communication between a speaker and a listener must ensure that the sentence or discourse provides enough information such that the listener can eliminate all distractors and focus only on the referent object that acts in a specific time interval. That essential information varies with the diversity of events in the world. However, a speaker is likely to mention salient properties and also salient differences of the referent in comparison to other distractors. The differences can be about object category, attributes, poses, actions, changes in location, relationships, and contexts in the scene.

Figure 1: The first spatio-temporal identifying description, in the green box, grounds to the event that a panda goes down the slide. Another panda can be a context because they are interacting in the same scene. The second identifying description, in the blue box, grounds to the event that another panda climbs up the slide.

Existing image referring expression datasets (Mao et al., 2016; Johnson et al., 2015; Kazemzadeh et al., 2014; Plummer et al., 2017; Krishna et al., 2017b) do not contain referring expressions that refer to dynamic properties or movements of the referent. These datasets do not require temporal understanding: a system need not learn that "moving to the right" differs from "moving to the left" or that "getting up" differs from "lying down". Existing video referring expression datasets and approaches (Krishna et al., 2017a; Hendricks et al., 2017; Gao et al., 2017; Berzak et al., 2015; Li et al., 2017; Hendricks et al., 2018; Gavrilyuk et al., 2018) focus only on temporal localization or referent object localization. In other words, they do not ground events in both space and time. As emphasized by (Cirik et al., 2018), the data collection process for referring expressions should incorporate linguistic structure such that the model can learn more than shallow correlations between pairs of a sentence and visual features. That is, a dataset should not have a shortcut whereby only detecting nouns (object class) performs well. Our dataset mitigates this issue by forcing instance-level recognition: grounding a referring expression requires identifying the target object among many distractors from the contrast set (same-class distractors).

Figure 2: We add a motion stream to the modular attention network. Our motion modules take optical flow input and model motion information for the subject and its relationship.

The contributions of this paper are: (i) We propose a novel vision and language data collection scheme based on grammatical constraints for surface realization to ground video referring expressions in both space and time with lexical correlations between vision and language. We collected the Spatio-Temporal Video Identifying Description Localization (STV-IDL) dataset consisting of 199 video sequences from Youtube and 7,569 identifying descriptions. (ii) We propose an interpretable system based on a two-stream modular attention network that models both appearance and motion to ground referring expressions as instance-level video object detection and event localization. We also perform ablation studies to gain insights and identify potential challenges for the task.

2 Spatio-Temporal Localization

Given ground truth temporal intervals ($[\mathrm{start}, \mathrm{end}]$) and object tubelets (a sequence of object bounding box coordinates in a given temporal interval, $\{[x_0, y_0, x_1, y_1]_{\mathrm{start}}, \ldots, [x_0, y_0, x_1, y_1]_{\mathrm{end}}\}$), we want to localize an identifying expression $ie$ to the correct target tubelet $tb_{target}$, not the distractor tubelets $tb_{distractor}$, as our predicted tubelet $r$. We evaluate using the accuracy measure.

For automatic localization, tubelet IoU (Russakovsky et al., 2015) and temporal IoU are used to evaluate the bounding boxes and the temporal interval against the ground truth, respectively. Let $R_i$ be the region in frame $i$ to be detected; then

$$\text{tubelet IoU} = \frac{\sum_i \delta(\mathrm{IoU}(r_i, R_i) > 0.5)}{N}, \qquad (1)$$

where the numerator counts the detected frames, measured by the standard Intersection over Union (IoU) in an image, and $N$ denotes the number of union frames.

$$\text{temporal IoU} = \frac{\cap(\mathrm{interval}_i, \mathrm{interval}_j)}{\cup(\mathrm{interval}_i, \mathrm{interval}_j)}, \qquad (2)$$

where $\mathrm{interval}_i$ and $\mathrm{interval}_j$ are the input temporal intervals and the intersection and union functions are operations over 1-D intervals.
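For concreteness, a minimal Python sketch of these two metrics is given below. It assumes tubelets are stored as dictionaries mapping frame indices to (x0, y0, x1, y1) boxes and intervals as (start, end) pairs; the helper names are ours, not the paper's code.

    # Minimal sketch of Eqs. (1) and (2); data layout and names are assumptions.
    def box_iou(a, b):
        # Standard IoU of two (x0, y0, x1, y1) boxes.
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def tubelet_iou(pred_boxes, gt_boxes):
        # Eq. (1): fraction of union frames whose per-frame IoU exceeds 0.5.
        frames = set(pred_boxes) | set(gt_boxes)
        hits = sum(1 for f in frames
                   if f in pred_boxes and f in gt_boxes
                   and box_iou(pred_boxes[f], gt_boxes[f]) > 0.5)
        return hits / len(frames)

    def temporal_iou(a, b):
        # Eq. (2): 1-D intersection over union of two temporal intervals.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0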

3 Related Work

Spatio-Temporal Localization. Spatio-temporal localization (or action understanding) is a long-standing challenge in computer vision. Most existing datasets, like LIRIS-HARL (Wolf et al., 2014), J-HMDB (Jhuang et al., 2013), UCF-Sports (Rodriguez et al., 2008), UCF-101 (Soomro et al., 2012) or AVA (Gu et al., 2018), localize a spatio-temporal tubelet for human actions in trimmed videos, a simple visual setting, or a fixed lexicon. In contrast to action labels, our work accepts free-form referring expression annotations, which also contain a richer set of relations in the form of prepositions, adverbs, and conjunctions.

Referring Expression Comprehension. The goal of referring expression comprehension (Golland et al., 2010) is to ground phrases or sentences into the specific visual regions that they refer to. Prior works in the image domain have either focused on using a captioning module to generate the sentence (Mao et al., 2016; Nagaraja et al., 2016) or on learning a joint embedding to comprehend the sentence by modeling the corresponding region unambiguously and localizing the region (Rohrbach et al., 2016; Wang et al., 2016; Hu et al., 2017; Yu et al., 2018). For the video domain, (Yamaguchi et al., 2017) further annotated the ActivityNet dataset with one referring expression per video for video retrieval with a natural language query. (Li et al., 2017) uses referring expressions to help track a target object in a video sequence in a subset of OTB100 (Lu et al., 2014) and ImageNet VID (Russakovsky et al., 2015). DiDeMo (Hendricks et al., 2017) and TEMPO (Hendricks et al., 2018) focus on localizing an input sentence into the corresponding temporal interval out of a finite number of backgrounds. Importantly, these datasets do not consider distractor objects from the same class. While our work also focuses on the video domain, it localizes objects and events as spatio-temporal tubelets aligned with an input expression.

Figure 3: An overview of our system: an input video either uses ground truth annotations or is fed into both the tubelet object proposal and temporal interval proposal modules. The resulting tubelet and interval proposals are then fed into an appearance and a motion Faster-RCNN to extract the two-stream features. Then, a modular neural network ranks the tubelets given an input referring expression. The scores are average-pooled, and the system outputs the most likely tubelet that contains the reference object.

Surface Realization in Vision and Language. Surface realization is a process for generating surface forms, like natural language sentences, based on some underlying representation. For natural language generation, the underlying representation tends to be syntactic features. In vision and language, captioning systems can use a meaning representation like triplets as input for a surface realization module to generate a sentence. (Farhadi et al., 2010) uses <Objects, Actions, Scenes>. (Yang et al., 2011) uses part-of-speech as <Nouns, Verbs, Scenes, Prepositions>. (Li et al., 2011) uses <<adj1, obj1>, prep, <adj2, obj2>>, where adjectives are object attributes and prepositions are spatial relationships between objects. TEMPO (Hendricks et al., 2018) and TVQA (Lei et al., 2018) use a compositional format for words like before or after to specify temporal relationships between events during crowdsourcing.

We incorporate grammatical constraints (linguistic prescription) based on part-of-speech into our annotation pipeline so that we can crowdsource well-formed sentences that contain enough meaning representation for vision systems to locate the target object with visual context. Instead of manually writing sentences based on context-free grammars as in (Yu and Siskind, 2013), we ask annotators to write sentences where a valid sentence must contain at least a noun phrase (NP), a verb phrase (VP), and one of a prepositional phrase (PP), an adverb phrase (ADVP), or a conjunction phrase (CONJP). The rest of each sentence is free language variation, where we expect crowdsourcing to create more variation than manual annotation by a few annotators would. We want computer vision models to learn useful and interpretable features by correlating the expressions and videos. So, we want the learned visual semantics from grounding models to be similar to the structural inputs of surface realization systems: each part-of-speech correlates with a specific visual feature.

4 STV-IDL Dataset

4.1 Dataset Construction

We develop a new data collection schema that ensures rich correspondences between referring expressions and referred objects in a video using constraints. The spatio-temporal relations that we are interested in are about state transitions, that is, what happens before and after the action and how objects move. The state transitions should be relative to other objects and the background. For example, the sentence 'A man in a green uniform kicking the ball then running toward the net.' is a good video referring expression. This sentence is valid only in a spatial region that represents the noun phrase 'a man in a green uniform' and a time interval in which an action from the verb phrase 'kicking the ball then running toward the net' occurs. Also, the action 'kicking the ball' comes before his next action 'running toward the net', which shows the action steps of 'kicking' followed by 'running', and the action 'running' has a context object, 'the net.'

Table 1: STV-IDL dataset statistics.

    Info                          Statistics
    Number of Videos              199
    Number of Sentences           7,569
    Average objects per Video     2.85
    Average words per Sentence    22.65
    Sentences per Video           38.04

Table 2: STV-IDL part-of-speech statistics. (Please see the supplementary material for more details.)

    Part-of-Speech                                    Percent
    Noun, singular or mass (NN)                       28.1
    Determiner (DT)                                   15.3
    Preposition or subordinating conjunction (IN)     10.9
    Adjective (JJ)                                     9.7
    Possessive pronoun (PRP$)                          5.7
    Verb, 3rd person singular present (VBZ)            5.1
    Adverb (RB)                                        3.6
    Coordinating conjunction (CC)                      3.4

First, we ensure that all of our High Definition videos (720p) crawled from Youtube contain at least two objects, similar to (Mao et al., 2016), but each video focuses on just one object class to form a contrast set. This constraint prevents a simple video object detector from resolving referential ambiguity using only nouns, i.e., outputting a box based on class information alone, as in (Cirik et al., 2018): such a detector effectively picks one object at random from the union of the target and the contrast set, so its output is no better than chance because the object confidence scores do not correlate with the referring expression. We want more language cues to guide the system to seek additional visual context (Divvala et al., 2009) and output only one unambiguous object detection. The dataset contains 13 categories of videos, which are either multi-player sports or animals. Second, inspired by (Siskind, 1990; Yu and Siskind, 2013), a sentence, consisting of a subject and a predicate, can be viewed as a set of structured labels based on part-of-speech, and each label can be meaningfully grounded in a video. Besides, annotations can use grammars for lexical grounding and surface realization. Therefore, we ensure that every referring expression in our dataset provides grammatically relevant visual grounding based on part-of-speech: a valid sentence must contain at least a noun phrase (NP), a verb phrase (VP), and one of a prepositional phrase (PP), an adverb phrase (ADVP), or a conjunction phrase (CONJP). We also found that, without these constraints, annotators may write relevant sentences whose content is arbitrary and may not be visually grounded spatially, temporally, or both. Some example sentences written without the constraints are "The guy was lucky to save the tennis ball." and "The sun is blocking the ball for the back player.".
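As an illustration, the sketch below shows how such a phrase-level constraint could be checked automatically, assuming each sentence arrives with a Penn-style bracketed constituency parse from an external parser; the function name and the example parse are our own assumptions, not part of the annotation tool.

    # Minimal sketch: validate the NP/VP + (PP|ADVP|CONJP) constraint.
    from nltk import Tree

    REQUIRED = {"NP", "VP"}
    ONE_OF = {"PP", "ADVP", "CONJP"}

    def is_valid_identifying_description(bracketed_parse: str) -> bool:
        labels = {t.label() for t in Tree.fromstring(bracketed_parse).subtrees()}
        return REQUIRED.issubset(labels) and bool(labels & ONE_OF)

    # Accepted: contains an NP, a VP, and a PP.
    parse = ("(S (NP (DT A) (NN man) (PP (IN in) (NP (DT a) (JJ green) (NN uniform))))"
             " (VP (VBZ kicks) (NP (DT the) (NN ball))"
             " (PP (IN toward) (NP (DT the) (NN net)))))")
    print(is_valid_identifying_description(parse))  # True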

4.2 Video Tubelet, Temporal Interval, and Expression Annotations

We manually identify interesting events in each video and select a keyframe for that action in the presence of distractors. Then, we manually annotate the start and end of that event into an interval lasting around one second. For bounding box annotation, we use a javascript variant of Vatic (Vondrick et al., 2013; Bolkensteyn) to manually draw a tubelet of bounding boxes in all frames for each object of interest in every video. We crowdsource annotations of referring expressions from Amazon Mechanical Turk (AMT). We create a clip segment with a bounding box around the target object to fixate the annotator's attention.

Next, we manually verify the referring expressions using another web interface that helps us evaluate whether a sentence refers to the target object, is correct based on the video, is different from the sentences for the distractors, and is sufficient to distinguish the target object from the distractors and the background. The annotation interfaces, payment, and dataset statistics are shown in the supplementary material. We refer to the resulting referring expressions as identifying descriptions (Mitchell et al., 2013) because our expressions are referring expressions within the verified intervals, which may be overspecified, but are also descriptions of the whole videos, which may be underspecified. Our referring expressions are long because we want to make sure that they are clear enough to provide input cues for the system. However, even a long expression might not be enough to localize an event in the whole video, because the video contains many events and writing an expression specific to a single one can be exhausting.

Figure 4: Stacked two-stream modular attention network based on a stack of five optical flow images as input. We model the bounding box sequence with a moving location module and a relationship module. The motion Faster-RCNN is also trained using a stack of five flow images for frame index $f_i \in [t-2, t+2]$.

5 Approach: Two-stream Modular Attention Network

We start by employing a state-of-the-art image referring expression localization model, namely the Modular Attention Network (MAttNet) (Yu et al., 2018), for our tasks. This model fits our objective since it is a variant of modular neural networks (Auda and Kamel, 1999; Andreas et al., 2016) that is decomposed based on tasks according to Fodor's modularity of mind (Fodor, 1985). Therefore, we can interpret the model in an ablation study on each neural module for a specific vision subtask and input type. The model also provides linguistic interpretability through its language attention module, which can visualize the different bindings from word symbols in a referring expression to each visual module as attention $a_{m,t}$, where module $m \in \{subj, loc, rel\}$ (subject, location, relationship) and $t$ indexes the word whose Bi-LSTM-encoded hidden representation this attention weights.

The original MAttNet model (RGB) decomposes image referring expression grounding into three modules: a subject module, a location module, and a relationship module. The network output score for an object $o_i$ and an expression $r$ is

$$S(o_i|r) = \sum_{m \in \text{modules}} w_m S(o_i|q_m), \qquad (3)$$

where $w_m$ is the module weight from the language attention module for visual module $m$, $q_m$ is the sum of all word embeddings weighted by the attention $a_{m,t}$, and $S(o_i|q_m)$ is the module score from a cosine similarity in the joint embedding between the visual representation of $o_i$, denoted $v_i^m$, and $q_m$.

Given a positive pair $(o_i, r_i)$, the network is discriminatively trained by sampling two negative pairs $(o_i, r_j)$ and $(o_k, r_i)$, where $r_j$ is an expression for another contrast object and $o_k$ is a contrast object from the same frame. The combined hinge loss $L_r$ is

$$L_r = \sum_i \big[ \lambda_1 \max(0, \Delta + S(o_i, r_j) - S(o_i, r_i)) + \lambda_2 \max(0, \Delta + S(o_k, r_i) - S(o_i, r_i)) \big]. \qquad (4)$$

The loss is linearly combined with other loss terms, such as an attribute prediction cross-entropy loss $L_{att}$ from the subject module, in a multi-task learning setting.
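To make the scoring and training objective concrete, here is a minimal PyTorch sketch of Eq. (3) and Eq. (4). It assumes the per-module scores and language-attention weights are computed elsewhere; the names and default hyperparameters are ours, not the authors' released code.

    import torch

    def combined_score(module_scores, module_weights):
        # Eq. (3): S(o_i | r) = sum_m w_m * S(o_i | q_m);
        # both tensors have shape (num_modules,).
        return (module_weights * module_scores).sum()

    def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1, lam1=1.0, lam2=1.0):
        # Eq. (4): hinge terms against a mismatched expression (o_i, r_j) and a
        # contrast object (o_k, r_i), both relative to the positive pair (o_i, r_i).
        l_expr = torch.clamp(margin + s_neg_expr - s_pos, min=0)
        l_obj = torch.clamp(margin + s_neg_obj - s_pos, min=0)
        return lam1 * l_expr + lam2 * l_obj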

We extend MAttNet to the video domain with two changes. First, MAttNet uses Faster-RCNN (Girshick, 2015) for feature extraction, so we follow a well-established actor-action detection pipeline that extends image object detection to frame-based spatio-temporal action detection (Peng and Schmid, 2016). With this, we reframe the problem by replacing action labels with referring expressions and putting MAttNet on top of Faster-RCNN. Also, we use external object and interval proposals instead of the Region Proposal Network (RPN) in Faster-RCNN. Second, we add subject motion and relationship motion modules to capture temporal information in a two-stream setting (Simonyan and Zisserman, 2014). These modules have the same architecture as the subject and relationship modules but use optical flow as their input. We replace the three-channel RGB input with a stack of flow-x, flow-y, and flow magnitude from the flow image. The aim of these modifications, depicted in Figure 2, is to better model attributes, motion, movements, and dynamic context in a video.

Previous work (Simonyan and Zisserman, 2014) has shown that stacking many optical flow images can help recognition. So, we train another variant of the two-stream modular attention network using a stack of five optical flow frames, shown in Figure 4. In this setting, we train the stacked motion Faster-RCNN by stacking flow images $F_{idx}$ with frame index $idx \in [t-2, t+2]$; the input becomes a 15-channel stacked optical flow image. In addition, we add a moving location module to further model the movement of the location by stacking location features $l_i = [\frac{x_{min}}{W}, \frac{y_{min}}{H}, \frac{x_{max}}{W}, \frac{y_{max}}{H}, \frac{Area_{region}}{Area_{image}}]$, where $W$ and $H$ are the width and height of the image. The location features are then concatenated with the location difference features between the target object and up to five context objects from the same class, $\delta_{ij} = [\frac{\Delta x_{min}}{W}, \frac{\Delta y_{min}}{H}, \frac{\Delta x_{max}}{W}, \frac{\Delta y_{max}}{H}, \frac{\Delta Area_{region}}{Area_{image}}]$, so that we have a sequence $[l_i; \delta_{ij}]_{idx}$ with frame index $idx \in [t-2, t+2]$. We then place an LSTM on top of the sequence and forward the concatenation of all hidden states to a fully connected layer to output the final location features. We also build a location sequence and place an LSTM on top of it in the relationship motion module in this stacked optical flow setting.
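Below is a minimal PyTorch sketch of the moving location module just described. Per-frame boxes are assumed to be (xmin, ymin, xmax, ymax); the hidden and output sizes, and the class and function names, are our assumptions rather than the paper's implementation.

    import torch
    import torch.nn as nn

    def location_feature(box, W, H):
        # l_i = [xmin/W, ymin/H, xmax/W, ymax/H, area_region/area_image]
        xmin, ymin, xmax, ymax = box
        area_ratio = (xmax - xmin) * (ymax - ymin) / (W * H)
        return torch.tensor([xmin / W, ymin / H, xmax / W, ymax / H, area_ratio])

    class MovingLocationModule(nn.Module):
        def __init__(self, ctx_objects=5, hidden=128, out_dim=512):
            super().__init__()
            in_dim = 5 + 5 * ctx_objects              # [l_i ; delta_ij] per frame
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden * 5, out_dim)  # 5 frames: idx in [t-2, t+2]

        def forward(self, loc_seq):
            # loc_seq: (batch, 5, in_dim), the per-frame [l_i ; delta_ij] features.
            hidden_states, _ = self.lstm(loc_seq)
            # Concatenate all hidden states and map to the final location feature.
            return self.fc(hidden_states.flatten(start_dim=1))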

5.1 Tubelet and Temporal Interval Proposals

We employ a state-of-the-art video object detector, flow-guided feature aggregation (FGFA) (Zhu et al., 2017), finetuned on STV-IDL, to generate the tubelet proposals. The per-frame detections from FGFA are post-processed by linking them into tubelets using Seq-NMS (Han et al., 2016) based on the top 300 bounding boxes ranked by the confidence of the category scores.

Table 3: Identifying Description Localization: mAP for each model. (Values are in percents.) fused1 MAttNet is the proposed two-stream method and fused5 MAttNet is the stacked version of the proposed two-stream method.

    Model             mAP
    random            29.68
    RGB MAttNet       41.51
    flow MAttNet      39.02
    flow5 MAttNet     41.90
    fused1 MAttNet    44.66
    fused5 MAttNet    42.82

Table 4: Ablation study on fused1 MAttNet: mAP for each module combination. (Values are in percents.)

    Model                   mAP
    Subject+Location        44.46
    +Relationship           44.46
    +Subject Motion         44.46
    +Relationship Motion    44.66

For temporal proposals, we implemented a variant of Deep Action Proposals (DAPs) (Escorcia et al., 2016) based on multi-scale proposals. First, we use a temporal sliding window with a fixed length of $L$ frames and a stride of $s$ (8 in our case). This produces a set of intervals $(b_i, e_i)$, where $b_i$ and $e_i$ are the beginning and the end of the interval. Then, we extract C3D features (Tran et al., 2015) from the image frames in that interval using the activations of the 'fc7' layer, pretrained on the Sports-1M dataset (Karpathy et al., 2014). The feature set is $f = \mathrm{C3D}(t_i : t_i + \delta),\ t_i \in [b_i, e_i]$, with $\delta = 16$ as in the original pretrained model. The duration of each segment $L_k$ increases as a power of 2, that is, $L_{k+1} = 2L_k$. The features are fed to a 2-layer LSTM to perform {Event/Background} sequence classification.
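To make the multi-scale windowing concrete, here is a minimal sketch of the proposal generation. The stride of 8 and the length doubling follow the text; the base window length and the number of scales are our assumptions.

    def temporal_proposals(num_frames, base_length=16, stride=8, num_scales=4):
        proposals = []
        length = base_length
        for _ in range(num_scales):
            for begin in range(0, num_frames - length + 1, stride):
                proposals.append((begin, begin + length))  # (b_i, e_i)
            length *= 2                                     # L_{k+1} = 2 * L_k
        return proposals

    # Each interval is then encoded with C3D 'fc7' features and classified as
    # Event/Background by a 2-layer LSTM.
    print(temporal_proposals(200)[:3])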

6 Experiments and Analysis

We want to show how, and to what extent, modular attention networks ground input expressions with motion information in videos. So, we perform two sets of experiments: identifying description localization, and automatic video object detector and temporal event localization. Similar to (Gu et al., 2018), we split the dataset into training, validation, and test sets at the video level; that is, there are no overlapping video segments across splits. There are 159 training, 13 validation, and 27 test videos, a rough ratio of 12:1:2. Implementation details are in the supplementary material.

6.1 Identifying Description Localization

Table 5: Ablation study on fused5 MAttNet: mAP for each module combination. (Values are in percents.)

    Model                   mAP
    Subject+Location        33.97
    +Relationship           35.32
    +Subject Motion         35.41
    +Moving Location        42.84
    +Relationship Motion    42.82

Setup. We perform three experiments: localization with ground truth annotations, a module ablation study, and a word attention study. First, we evaluate our model by selecting the target from a pool of candidate targets plus distractors. We compare five models based on input and modules: (1) MAttNet with RGB input (RGB MAttNet, the original model and our baseline); (2) MAttNet with flow image input (flow MAttNet); (3) MAttNet with stacked five-flow-image input (flow5 MAttNet); (4) two-stream MAttNet with RGB and flow image input (fused1 MAttNet); and (5) two-stream MAttNet with RGB and stacked five-flow-image input (fused5 MAttNet). Second, we interpret the model by setting the module score weights from the language attention module to zero for the modules we want to turn off in our ablation study. Third, we collect statistics of the attention each word receives over the input expressions in the test set to explain how, and to which kinds of words, each module attends.
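The module ablation in the second experiment amounts to masking module weights before the combination in Eq. (3). A minimal sketch is shown below; the module list and tensor layout are our assumptions.

    import torch

    MODULES = ["subject", "location", "relationship",
               "subject_motion", "relationship_motion"]

    def ablated_score(module_scores, module_weights, keep):
        # Zero the language-attention weights of the modules being turned off,
        # then combine scores as in Eq. (3).
        mask = torch.tensor([1.0 if m in keep else 0.0 for m in MODULES])
        return (module_weights * mask * module_scores).sum()

    # Example: the "Subject+Location" row of the ablation tables.
    scores = torch.rand(len(MODULES))
    weights = torch.softmax(torch.rand(len(MODULES)), dim=0)
    print(ablated_score(scores, weights, keep={"subject", "location"}))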

Results. The accuracies in Table 3 show superior performance for the stacked flow5 and two-stream models. The stacked flow5 model improves over the RGB baseline by 0.39%, while the two-stream fused1 and fused5 models improve by 3.15% and 1.31%, respectively. Both variants of the two-stream model, fused1 and fused5, outperform all one-stream models (RGB, flow, and flow5). All models perform better than randomly selecting an object from the set of tubelets.

The accuracies in Table 4 show that each module in fused1 learns better, since the modules in the appearance stream alone provide a 2.95% improvement over the RGB-only baseline. We hypothesize that the motion stream takes care of motion grounding, so the appearance modules can learn better because unrelated information is separated into other modules. A modular neural network avoids internal interference between features by training each module independently, and each module masters its task more precisely (Auda and Kamel, 1999). The additional relationship motion module also provides complementary information for a further 0.20% improvement. The accuracies in Table 5 show that the stacked flow5 model focuses mostly on the moving location module, which causes the overall improvement over the RGB baseline. The moving location is a predictive feature for modeling motion and spatial location (Yin and Ordonez, 2017), but it prevents the other vision modules from becoming sufficiently tuned in this setting. We also tried combining the moving location module with the fused1 setting. The results degenerate further, and the overall accuracy is only 37.12%, even lower than the flow MAttNet model.

Figure 6 shows how the language attention network assigns weights to each module, aggregating the weights of each word by Penn part-of-speech tag during test set prediction of the fused1 model, to explain the performance gain. The aggregated statistics show that motion words like verbs, prepositions, and conjunctions are ranked higher for the flow modules on average, which means more attention to motion. We also aggregate over just the verbs in Figure 7 to further explain the modules. The statistics show that the flow and location modules focus more on verbs on average compared to their corresponding appearance-based modules.
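A minimal sketch of this aggregation, for one module, is shown below; the POS tagger and all names are our assumptions, not the paper's analysis code.

    from collections import defaultdict
    import nltk

    def mean_attention_by_pos(expressions, attention_weights):
        # expressions: list of token lists; attention_weights: matching lists of
        # per-word attention weights for a single module.
        totals, counts = defaultdict(float), defaultdict(int)
        for tokens, weights in zip(expressions, attention_weights):
            for (word, tag), w in zip(nltk.pos_tag(tokens), weights):
                totals[tag] += w
                counts[tag] += 1
        return {tag: totals[tag] / counts[tag] for tag in totals}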

6.2 Automatic Video Object Detector and Temporal Event Localization

Because spatio-temporal detection and localization is very challenging, we want to identify potential challenges for spatio-temporal grounding when automatic computer vision systems replace the ground truth annotations. So, we replace the tubelets with the top 8 detections from flow-guided feature aggregation (FGFA) (Zhu et al., 2017) and the temporal intervals with the proposal system described in Section 5.1. We create three scenarios: in each scenario, varying amounts of the problem are revealed via the ground truth, to separate each component and measure the hardness of each sub-problem and the impact of one on another.

6.2.1 Automatic Video Object Detector

Setup. We evaluate both the tubelet object proposals and the pretrained modular attention networks. We replace the ground truth tubelets with imperfect proposals, which contain bounding box perturbations, and observe how the model behaves.

Results. Since the modular attention networks are not trained on tubelet proposals, the results with the automatic video object detector in Table 6 show performance drops in all models, and the performance is even lower than the object detection baseline. The object detection baseline selects the tubelet with the highest confidence score from FGFA. We hypothesize that the drop comes from bounding box perturbations, which may affect both the Faster-RCNN features and the location features. The results also show that the performance drops are more severe in the two-stream models; we think this is due to an accumulation of errors from both streams.

Figure 5: A qualitative result: the first, middle and last frames from an interval in the STV-IDL dataset with the expression, 'The male tennis player in the near court moves from right to left in order to hit the ball but his teammate outside the court reaches the ball first and just hits it.' The fused1 MAttNet properly refers to the object highlighted in the red box, in contrast to the baseline.

Figure 6: Aggregations of output word attention weights for each module on the STV-IDL test set. Part-of-speech tags are CC, DT, IN, JJ, NN, NNS, PRP$, RB, TO and VBZ (left to right).

Figure 7: Aggregations on all verbs for each module. (From left to right: Relationship flow/RGB, Subject flow/RGB, Location.)

Table 6: Visual Object Detection: mAP at tubelet IoU@0.5 for each model. (Values are in percents.)

    Model              mAP@0.5
    RGB MAttNet        35.02
    flow MAttNet       22.63
    flow5 MAttNet      28.98
    fused1 MAttNet     23.93
    fused5 MAttNet     24.26
    FGFA most conf.    35.87
    FGFA 2nd conf.     34.16

Table 7: Event Localization: mAP at temporal IoU@0.5 for each model. (Values are in percents.)

    Model              mAP@0.5
    RGB MAttNet        8.72
    flow MAttNet       7.28
    flow5 MAttNet      8.79
    fused1 MAttNet     8.07
    fused5 MAttNet     7.02
    speaker LSTM       7.74
    speaker Bi-LSTM    10.10

6.2.2 Temporal Event Localization

Setup. We evaluate the event localization component by removing the ground truth temporal intervals. All previous settings operate on trimmed video segments and focus on 'where' the sentences refer to. We want to see how the model behaves on untrimmed videos, in which the system needs to answer 'when' the referred events occur. The system's task is to infer the temporal intervals $[t_k, t_k+40)$ which are likely to correspond to the input expressions. We evaluate the system via temporal mean Average Precision with temporal IoU, similar to (Krishna et al., 2017b). Since our identifying descriptions are sentences for the whole videos, we compare the modular attention networks to a video captioner, S2VT (Venugopalan et al., 2015), which is a speaker model (Mao et al., 2016) that outputs the probability of producing an expression given a video. The S2VT model is trained on a different feature set consisting of the image features from the last layer ('fc1000') of ResNet-50 (He et al., 2016), the interval $(b_i, e_i)$, and the current frame number. Because this S2VT model is trained on ground truth intervals and expressions, it is likely to produce expressions with high probabilities on the ground truth event intervals compared to the background intervals, which do not contain 'interesting' events.

Table 8: Spatio-temporal Localization: mAP at temporal IoU@0.5 then tubelet IoU@0.5 for each model. (Values are in percents.)

    Model             mAP
    RGB MAttNet       2.75
    flow MAttNet      2.04
    flow5 MAttNet     2.62
    fused1 MAttNet    1.70
    fused5 MAttNet    1.51

Results. The results in Table 7 show that the speaker Bi-LSTM performs the best, better than all modular attention networks. We suspect the reason is that the discriminative training scheme of the modular attention networks is not suitable for temporal localization. Training with only negative pairs from the same frame already takes a week, so it is computationally expensive to train with all negative pairs from all frames in the whole video. The top-5 prediction for the speaker Bi-LSTM increases to 26.23%, but it is still far from the upper bound of 71.02%, the recall of the proposal system.

6.2.3 Spatio-temporal Localization

Setup. We evaluate our event interval proposals, tubelet object proposals, and modular attention networks. We fix the tubelet Intersection over Union (tubelet IoU) threshold at 0.5. The evaluation is a two-step process: temporal IoU, then tubelet IoU. We compute the tubelet IoU over all frames of the proposal interval instead of the ground truth interval, to show that the system refers to the right object within an event interval, and so that the tubelet IoU does not depend on the temporal IoU.
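A minimal sketch of this two-step check, reusing the temporal_iou and tubelet_iou helpers sketched in Section 2, is shown below; the thresholds follow the text, and everything else is our assumption.

    def spatio_temporal_hit(pred_interval, gt_interval, pred_boxes, gt_boxes,
                            t_thresh=0.5, box_thresh=0.5):
        # Step 1: the proposed interval must overlap the ground truth event.
        if temporal_iou(pred_interval, gt_interval) < t_thresh:
            return False
        # Step 2: the predicted tubelet must match the target object over the
        # frames of the proposal interval.
        return tubelet_iou(pred_boxes, gt_boxes) >= box_thresh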

Results. The results in Table 8 show that performance decreases further compared to Table 7. We suspect the discriminative training scheme is again the reason, because the models are never trained on background frames.

7 Summary

We discussed the problem of grounding spatio-temporal identifying descriptions to spatio-temporal object-event tubelets in videos. The critical challenge in this dataset is to ground verbs and motion words in both space and time, and we show that this is possible with our proposed two-stream modular neural network models, which use complementary optical flow inputs to ground verbs and motion words. We validate this by collecting aggregated statistics on word attention and find that the two-stream models ground verbs better. The motion stream also helps the appearance stream learn better because it abstracts away motion noise from appearance. We further inspected the components of the system and revealed potential challenges. Future spatio-temporal grounding systems should consider better training schemes, such as improved loss functions or hard example mining, balancing both efficiency and effectiveness.

8 Acknowledgement

The authors thank the anonymous reviewers for their insightful comments and suggestions. We thank Dr. Hal Daume III, Nelson Padua-Perez and Dr. Ryan Farrell for very useful advice and discussions. We also thank members of the UMD Computer Vision Lab (CVL), the UMD Human-Computer Interaction Lab (HCIL) and the UMD Computational Linguistics and Information Processing Lab (CLIP) for useful insights and support. We also thank academic twitter users for useful knowledge and discussions.

References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.

Gasser Auda and Mohamed Kamel. 1999. Modular neural networks: a survey. International Journal of Neural Systems, 9(02):129–151.

Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. 2015. Do you see what I mean? Visual resolution of linguistic ambiguities. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1477–1487.

Dinesh Bolkensteyn. vatic.js: A pure javascript video annotation tool. https://dbolkensteyn.github.io/vatic.js/.

Volkan Cirik, Louis-Philippe Morency, and Taylor Berg-Kirkpatrick. 2018. Visual referring expression recognition: What do systems actually learn? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 781–787.

Robert Dale and Ehud Reiter. 1995. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2):233–263.

Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert. 2009. An empirical study of context in object detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1271–1278. IEEE.

Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. 2016. DAPs: Deep action proposals for action understanding. In European Conference on Computer Vision, pages 768–784. Springer.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer.

Jerry A Fodor. 1985. Precis of the modularity of mind. Behavioral and Brain Sciences, 8(1):1–5.

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275.

Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. 2018. Actor and action video segmentation from a sentence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5958–5966.

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics.

Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. 2018. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR 2018.

Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. 2016. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2018. Localizing moments in video with temporal language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1380–1390.

Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2017. Modeling relationships in referential expressions with compositional modular networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4418–4427. IEEE.

Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. 2013. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199.

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3668–3678.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR.

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798.

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017a. Dense-captioning events in videos. In International Conference on Computer Vision (ICCV).

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017b. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1369–1379.

Siming Li, Girish Kulkarni, Tamara L Berg, Alexander C Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 220–228. Association for Computational Linguistics.

Zhenyang Li, Ran Tao, Efstratios Gavves, Cees GM Snoek, Arnold WM Smeulders, et al. 2017. Tracking by natural language specification. In CVPR, volume 1, page 5.

Yang Lu, Tianfu Wu, and Song Chun Zhu. 2014. Online object tracking, learning and parsing with and-or graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3462–3469.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Margaret Mitchell, Kees Van Deemter, and Ehud Reiter. 2013. Generating expressions that refer to visible objects. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (ACL).

Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. 2016. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision, pages 792–807. Springer.

Xiaojiang Peng and Cordelia Schmid. 2016. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision, pages 744–759. Springer.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2017. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 123(1):74–93.

Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer.

Deb Roy and Ehud Reiter. 2005. Connecting language to the world. Artificial Intelligence, 167(1-2):1–12.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576.

Jeffrey Mark Siskind. 1990. Acquiring core meanings of words, represented as Jackendoff-style conceptual structures, from correlated streams of linguistic and non-linguistic input. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pages 143–156. Association for Computational Linguistics.

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497.

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Carl Vondrick, Donald Patterson, and Deva Ramanan. 2013. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision, 101(1):184–204.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013.

Christian Wolf, Eric Lombardi, Julien Mille, Oya Celiktutan, Mingyuan Jiu, Emre Dogan, Gonen Eren, Moez Baccouche, Emmanuel Dellandrea, Charles-Edmond Bichot, et al. 2014. Evaluation of video activity localizations integrating quality and quantity measurements. Computer Vision and Image Understanding, 127:14–30.

Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Spatio-temporal person retrieval via natural language queries. arXiv preprint arXiv:1704.07945.

Yezhou Yang, Ching Lik Teo, Hal Daume III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 444–454. Association for Computational Linguistics.

Xuwang Yin and Vicente Ordonez. 2017. Obj2Text: Generating visually descriptive language from object layouts. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 177–187.

Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 53–63.

Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. 2018. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. 2017. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, volume 3.