
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1477–1487, Lisbon, Portugal, 17-21 September 2015. © 2015 Association for Computational Linguistics.

Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Yevgeni Berzak, CSAIL MIT, [email protected]
Andrei Barbu, CSAIL MIT, [email protected]
Daniel Harari, CSAIL MIT, [email protected]
Boris Katz, CSAIL MIT, [email protected]
Shimon Ullman, Weizmann Institute of Science, [email protected]

Abstract

Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines whether a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing us to disambiguate sentences in a unified fashion across the different ambiguity types.

1 Introduction

Ambiguity is one of the defining characteristics of human languages, and language understanding crucially relies on the ability to obtain unambiguous representations of linguistic content. While some ambiguities can be resolved using intra-linguistic contextual cues, the disambiguation of many linguistic constructions requires integration of world knowledge and perceptual information obtained from other modalities.

In this work, we focus on the problem of grounding language in the visual modality, and introduce a novel task for language understanding which requires resolving linguistic ambiguities by utilizing the visual context in which the linguistic content is expressed. This type of inference is frequently called for in human communication that occurs in a visual environment, and is crucial for language acquisition, when much of the linguistic content refers to the visual surroundings of the child (Snow, 1972).

Our task is also fundamental to the problem of grounding vision in language, as it focuses on phenomena of linguistic ambiguity, which are prevalent in language but typically overlooked when using language as a medium for expressing understanding of visual content. Due to such ambiguities, a superficially appropriate description of a visual scene may in fact not be sufficient for demonstrating a correct understanding of the relevant visual content. Our task addresses this issue by introducing a deep validation protocol for visual understanding, requiring not only a surface description of a visual activity but also a demonstration of structural understanding at the levels of syntax, semantics and discourse.

To enable the systematic study of visually grounded processing of ambiguous language, we create a new corpus, LAVA (Language and Vision Ambiguities). This corpus contains sentences with linguistic ambiguities that can only be resolved using external information. The sentences are paired with short videos that visualize different interpretations of each sentence. Our sentences encompass a wide range of syntactic, semantic and discourse ambiguities, including ambiguous prepositional and verb phrase attachments, conjunctions, logical forms, anaphora and ellipsis. Overall, the corpus contains 237 sentences, with 2 to 3 interpretations per sentence, and an average of 3.37 videos that depict visual variations of each sentence interpretation, corresponding to a total of 1679 videos.

Using this corpus, we address the problem of selecting the interpretation of an ambiguous sentence that matches the content of a given video. Our approach for tackling this task extends the sentence tracker introduced in (Siddharth et al., 2014). The sentence tracker produces a score which determines if a sentence is depicted by a video. This earlier work had no concept of ambiguities; it assumed that every sentence had a single interpretation. We extend this approach to represent multiple interpretations of a sentence, enabling us to pick the interpretation that is most compatible with the video.

To summarize, the contributions of this paper are threefold. First, we introduce a new task for visually grounded language understanding, in which an ambiguous sentence has to be disambiguated using a visual depiction of the sentence's content. Second, we release a multimodal corpus of sentences coupled with videos which covers a wide range of linguistic ambiguities, and enables a systematic study of linguistic ambiguities in visual contexts. Finally, we present a computational model which disambiguates the sentences in our corpus with an accuracy of 75.36%.

2 Related Work

Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al., 2012; Socher et al., 2013; Silberer and Lapata, 2014; Gong et al., 2014; Lazaridou et al., 2015), as well as methods for describing images and videos in natural language (Farhadi et al., 2010; Kulkarni et al., 2011; Mitchell et al., 2012; Socher et al., 2014; Thomason et al., 2014; Karpathy and Fei-Fei, 2014; Siddharth et al., 2014; Venugopalan et al., 2015; Vinyals et al., 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.

Previous work relating ambiguity in language to the visual modality addressed the problem of word sense disambiguation (Barnard et al., 2003). However, this work is limited to context-independent interpretation of individual words, and does not consider structure-related ambiguities. Discourse ambiguities were previously studied in work on multimodal coreference resolution (Ramanathan et al., 2014; Kong et al., 2014). Our work expands this line of research, and addresses further discourse ambiguities in the interpretation of ellipsis. More importantly, to the best of our knowledge our study is the first to present a systematic treatment of syntactic and semantic sentence-level ambiguities in the context of language and vision.

The interactions between linguistic and visual information in human sentence processing have been extensively studied in psycholinguistics and cognitive psychology (Tanenhaus et al., 1995). A considerable fraction of this work focused on the processing of ambiguous language (Spivey et al., 2002; Coco and Keller, 2015), providing evidence for the importance of visual information for linguistic ambiguity resolution by humans. Such information is also vital during language acquisition, when much of the linguistic content perceived by the child refers to their immediate visual environment (Snow, 1972). Over time, children develop mechanisms for grounded disambiguation of language, manifested, among other things, by the use of iconic gestures when communicating ambiguous linguistic content (Kidd and Holler, 2009). Our study leverages such insights to develop a complementary framework that enables addressing the challenge of visually grounded disambiguation of language in the realm of artificial intelligence.

3 Task

In this work we provide a concrete framework for the study of language understanding with visual context by introducing the task of grounded language disambiguation. This task requires choosing the correct linguistic representation of a sentence given a visual context depicted in a video. Specifically, provided with a sentence, n candidate interpretations of that sentence and a video that depicts the content of the sentence, one needs to choose the interpretation that corresponds to the content of the video.

To illustrate this task, consider the example in figure 1, where we are given the sentence "Sam approached the chair with a bag" along with two different linguistic interpretations.


[Figure 1 shows the two parse trees for "Sam approached the chair with a bag": (a) a first interpretation in which the PP "with a bag" attaches to the verb phrase, (b) a second interpretation in which it attaches to the noun phrase "the chair", and (c) a frame from the visual context.]

Figure 1: An example of the visually grounded language disambiguation task. Given the sentence "Sam approached the chair with a bag", two potential parses, (a) and (b), correspond to two different semantic interpretations. In the first interpretation Sam has the bag, while in the second reading the bag is on the chair. The task is to select the correct interpretation given the visual context (c).

In the first interpretation, which corresponds to parse 1(a), Sam has the bag. In the second interpretation, associated with parse 1(b), the bag is on the chair rather than with Sam. Given the visual context from figure 1(c), the task is to choose which interpretation is most appropriate for the sentence.
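To make the setup concrete, a disambiguation instance can be thought of as a triple of a sentence, its candidate interpretations and a video. The sketch below is an illustrative Python encoding of the figure 1 example; the Instance and Interpretation containers, the predicate encoding and the video path are our own assumptions rather than the corpus format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A predicate applied to variables, e.g. ("approach", ("x", "y")).
Predicate = Tuple[str, Tuple[str, ...]]

@dataclass
class Interpretation:
    """One candidate reading of an ambiguous sentence, as a conjunction of predicates."""
    gloss: str
    predicates: List[Predicate]

@dataclass
class Instance:
    """A disambiguation instance: a sentence, its candidate readings, and a video."""
    sentence: str
    interpretations: List[Interpretation]
    video_path: str

# "Sam approached the chair with a bag" (figure 1), with illustrative predicate encodings.
instance = Instance(
    sentence="Sam approached the chair with a bag",
    interpretations=[
        Interpretation(
            gloss="Sam, carrying a bag, approached the chair",
            predicates=[("person", ("x",)), ("chair", ("y",)), ("bag", ("z",)),
                        ("approach", ("x", "y")), ("with", ("x", "z"))],
        ),
        Interpretation(
            gloss="Sam approached the chair that has a bag on it",
            predicates=[("person", ("x",)), ("chair", ("y",)), ("bag", ("z",)),
                        ("approach", ("x", "y")), ("with", ("y", "z"))],
        ),
    ],
    video_path="videos/sam_chair_bag_01.mp4",  # hypothetical path
)
```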

4 Approach Overview

To address the grounded language disambiguation task, we use a compositional approach for determining if a specific interpretation of a sentence is depicted by a video. In this framework, described in detail in section 6, a sentence and an accompanying interpretation encoded in first order logic give rise to a grounded model that matches a video against the provided sentence interpretation.

The model is comprised of Hidden Markov Models (HMMs) which encode the semantics of words, and trackers which locate objects in video frames. To represent an interpretation of a sentence, word models are combined with trackers through a cross-product which respects the semantic representation of the sentence, creating a single model which recognizes that interpretation.

Given a sentence, we construct an HMM-based representation for each interpretation of that sentence. We then detect candidate locations for objects in every frame of the video.

[Figure 2 shows the two parse trees for "Bill held the green chair and bag": (a) a first interpretation in which "green" scopes over the conjunction "chair and bag", (b) a second interpretation in which "green" modifies only "chair", along with (c) a frame from the first visual context and (d) a frame from the second visual context.]

Figure 2: Linguistic and visual interpretations of the sentence "Bill held the green chair and bag". In the first interpretation (a,c) both the chair and bag are green, while in the second interpretation (b,d) only the chair is green and the bag has a different color.

Together, the representation for the sentence and the candidate object locations are combined to form a model which can determine if a given interpretation is depicted by the video. We test each interpretation and report the interpretation with the highest likelihood.
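The selection step itself reduces to an argmax over per-interpretation scores. A minimal sketch, assuming hypothetical build_model and score_video helpers that wrap the grounded model described above:

```python
def disambiguate(instance, build_model, score_video):
    """Return the index of the interpretation whose grounded model best fits the video.

    build_model(interpretation) -> model    (HMMs x trackers cross-product; assumed helper)
    score_video(model, video_path) -> float (likelihood-style score; assumed helper)
    """
    scores = [score_video(build_model(interp), instance.video_path)
              for interp in instance.interpretations]
    return max(range(len(scores)), key=scores.__getitem__)
```

With the instance format sketched earlier, this returns 0 or 1 for a two-way sentence and 0, 1 or 2 for a three-way one.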

5 Corpus

To enable a systematic study of linguistic ambiguities that are grounded in vision, we compiled a corpus with ambiguous sentences describing visual actions. The sentences are formulated such that the correct linguistic interpretation of each sentence can only be determined using external, non-linguistic information about the depicted activity. For example, in the sentence "Bill held the green chair and bag", the correct scope of "green" can only be determined by integrating additional information about the color of the bag. This information is provided in the accompanying videos, which visualize the possible interpretations of each sentence. Figure 2 presents the syntactic parses for this example along with frames from the respective videos. Although our videos contain visual uncertainty, they are not ambiguous with respect to the linguistic interpretation they are presenting, and hence a video always corresponds to a single candidate representation of a sentence.

The corpus covers a wide range of well-known syntactic, semantic and discourse ambiguity classes. While the ambiguities are associated with various types, different sentence interpretations always represent distinct sentence meanings, and are hence encoded semantically using first order logic. For syntactic and discourse ambiguities we also provide an additional, ambiguity-type-specific encoding, as described below.

• Syntax Syntactic ambiguities include Prepositional Phrase (PP) attachments, Verb Phrase (VP) attachments, and ambiguities in the interpretation of conjunctions. In addition to logical forms, sentences with syntactic ambiguities are also accompanied with Context Free Grammar (CFG) parses of the candidate interpretations, generated from a deterministic CFG parser.

• Semantics The corpus addresses several classes of semantic quantification ambiguities, in which a syntactically unambiguous sentence may correspond to different logical forms. For each such sentence we provide the respective logical forms.

• Discourse The corpus contains two types of discourse ambiguities, Pronoun Anaphora and Ellipsis, offering examples comprising two sentences. In anaphora ambiguity cases, an ambiguous pronoun in the second sentence is given its candidate antecedents in the first sentence, as well as a corresponding logical form for the meaning of the second sentence. In ellipsis cases, a part of the second sentence, which can constitute either the subject and the verb, or the verb and the object, is omitted. We provide both interpretations of the omission in the form of a single unambiguous sentence and its logical form, which combines the meanings of the first and the second sentences.

Table 2 lists examples of the different ambiguity classes, along with the candidate interpretations of each example.

The corpus is generated using Part of Speech (POS) tag sequence templates. For each template, the POS tags are replaced with lexical items from the corpus lexicon, described in table 3, using all the visually applicable assignments. This generation process yields a total of 237 sentences.

Ambiguity | Templates | #

Syntax
  PP           | NNP V DT [JJ] NN1 IN DT [JJ] NN2.    | 48
  VP           | NNP1 V [IN] NNP2 V [JJ] NN.          | 60
  Conjunction  | NNP1 [and NNP2] V DT JJ NN1 and NN2. | 40
               | NNP V DT NN1 or DT NN2 and DT NN3.   |

Semantics
  Logical Form | NNP1 and NNP2 V a NN.                | 35
               | Someone V the NNS.                   |

Discourse
  Anaphora     | NNP V DT NN1 and DT NN2. It is JJ.   | 36
  Ellipsis     | NNP1 V NNP2. Also NNP3.              | 18

Table 1: POS templates for generating the sentences in our corpus. The rightmost column gives the number of sentences in each category. The sentences are produced by replacing the POS tags with all the visually applicable assignments of lexical items from the corpus lexicon shown in table 3.

Of these, 213 sentences have 2 candidate interpretations and 24 sentences have 3 interpretations. Table 1 presents the corpus templates for each ambiguity class, along with the number of sentences generated from each template.
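As a rough illustration of this generation process, the sketch below instantiates the PP template from table 1 against a fragment of the table 3 lexicon. The helper name and the lexicon subset are illustrative; in particular, the optional adjective slots and the check for visually applicable assignments are simplified away.

```python
from itertools import product

# Partial lexicon following table 3 (proper names stand in for NNP).
LEXICON = {
    "NNP": ["Sam", "Claire", "Bill"],
    "V":   ["approached", "left"],
    "DT":  ["the", "a"],
    "JJ":  ["green", "yellow"],
    "NN":  ["chair", "bag", "telescope"],
    "IN":  ["with"],
}

def instantiate_pp_template():
    """Enumerate sentences for the PP template: NNP V DT [JJ] NN1 IN DT [JJ] NN2."""
    sentences = []
    for nnp, v, dt1, nn1, in_, dt2, jj2, nn2 in product(
            LEXICON["NNP"], LEXICON["V"], LEXICON["DT"], LEXICON["NN"],
            LEXICON["IN"], LEXICON["DT"], LEXICON["JJ"], LEXICON["NN"]):
        if nn1 == nn2:
            continue  # skip degenerate sentences such as "the chair with the chair"
        sentences.append(f"{nnp} {v} {dt1} {nn1} {in_} {dt2} {jj2} {nn2}.")
    return sentences
```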

The corpus videos are filmed in an indoor environment containing background objects and pedestrians. To account for variation in the manner of performing actions, videos are shot twice with different actors. Whenever applicable, we also filmed the actions from two different directions (e.g. approach from the left, and approach from the right). Finally, all videos were shot with two cameras from two different viewpoints. Taking these variations into account, the resulting video corpus contains 7.1 videos per sentence and 3.37 videos per sentence interpretation, corresponding to a total of 1679 videos. The average video length is 3.02 seconds (90.78 frames), for an overall of 1.4 hours of footage (152,434 frames).
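These statistics are mutually consistent, as the quick check below illustrates (all numbers are taken from the text; small rounding differences are expected):

```python
sentences = 237
two_way, three_way = 213, 24
interpretations = two_way * 2 + three_way * 3   # 498 candidate readings in total
videos, frames = 1679, 152434

print(videos / sentences)        # ~7.08 videos per sentence (reported as 7.1)
print(videos / interpretations)  # ~3.37 videos per interpretation
print(frames / videos)           # ~90.8 frames per video (reported as 90.78)
print(frames / videos / 3.02)    # ~30 frames per second implied by the 3.02 s average
```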

A custom corpus is required for this task because no existing corpus, containing either videos or images, systematically covers multimodal ambiguities. Datasets such as UCF Sports (Rodriguez et al., 2008), YouTube (Liu et al., 2009), and HMDB (Kuehne et al., 2011), which come out of the activity recognition community, are accompanied by action labels, not sentences, and do not control for the content of the videos aside from the principal action being performed. Datasets for image and video captioning, such as MSCOCO (Lin et al., 2014) and TACOS (Regneri et al., 2013), aim to control for more aspects of the videos than just the main action being performed, but they do not provide the range of ambiguities discussed here.

Ambiguity | Example | Linguistic interpretations | Visual setups

Syntax
  PP          | Claire left the green chair with a yellow bag.      | Claire [left the green chair] [with a yellow bag]. | The bag is with Claire.
              |                                                     | Claire left [the green chair with a yellow bag].   | The bag is on the chair.
  VP          | Claire looked at Bill picking up a chair.           | Claire looked at [Bill [picking up a chair]].      | Bill picks up the chair.
              |                                                     | Claire [looked at Bill] [picking up a chair].      | Claire picks up the chair.
  Conjunction | Claire held a green bag and chair.                  | Claire held a [green [bag and chair]].             | The chair is green.
              |                                                     | Claire held a [[green bag] and [chair]].           | The chair is not green.
              | Claire held the chair or the bag and the telescope. | Claire held [[the chair] or [the bag and the telescope]]. | Claire holds the chair.
              |                                                     | Claire held [[the chair or the bag] and [the telescope]]. | Claire holds the chair and the telescope.

Semantics
  Logical Form | Claire and Bill moved a chair. | chair(x), move(Claire, x), move(Bill, x)                                        | Claire and Bill move the same chair.
               |                                | chair(x), chair(y), move(Claire, x), move(Bill, y), x ≠ y                       | Claire and Bill move different chairs.
               | Someone moved the two chairs.  | chair(x), chair(y), x ≠ y, person(u), move(u, x), move(u, y)                    | One person moves both chairs.
               |                                | chair(x), chair(y), x ≠ y, person(u), person(v), u ≠ v, move(u, x), move(v, y)  | Each chair is moved by a different person.

Discourse
  Anaphora | Claire held the bag and the chair. It is yellow. | It = bag                       | The bag is yellow.
           |                                                  | It = chair                     | The chair is yellow.
  Ellipsis | Claire looked at Bill. Also Sam.                 | Claire looked at Bill and Sam. | Claire looks at Bill and Sam.
           |                                                  | Claire and Sam looked at Bill. | Claire and Sam look at Bill.

Table 2: An overview of the different ambiguity types, along with examples of ambiguous sentences with their linguistic and visual interpretations. Note that, similarly to semantic ambiguities, syntactic and discourse ambiguities are also provided with first order logic formulas for the resulting sentence interpretations. Table 4 shows additional examples for each ambiguity type, with frames from sample videos corresponding to the different interpretations of each sentence.

Syntactic Category | Visual Category   | Words
Nouns              | Objects, People   | chair, bag, telescope, someone, proper names
Verbs              | Actions           | pick up, put down, hold, move (transitive), look at, approach, leave
Prepositions       | Spatial Relations | with, left of, right of, on
Adjectives         | Visual Properties | yellow, green

Table 3: The lexicon used to instantiate the templates in table 1 in order to generate the corpus.

The closest dataset is that of Siddharth et al. (2014), as it controls for object appearance, color, action, and direction of motion, making it more likely to be suitable for evaluating disambiguation tasks. Unfortunately, that dataset was designed to avoid ambiguities, and is therefore not suitable for evaluating the work described here.

6 Model

To perform the disambiguation task, we extend the sentence recognition model of Siddharth et al. (2014), which represents sentences as compositions of words. Given a sentence, its first order logic interpretation and a video, our model produces a score which determines if the sentence is depicted by the video. It simultaneously tracks the participants in the events described by the sentence while recognizing the events themselves. This allows it to be flexible in the presence of noise by integrating top-down information from the sentence with bottom-up information from object and property detectors. Each word in the query sentence is represented by an HMM (Baum et al., 1970), which recognizes tracks (i.e. paths of detections in a video for a specific object) that satisfy the semantics of the given word. In essence, this model can be described as having two layers: one in which object tracking occurs, and one in which words observe tracks and filter out tracks that do not satisfy the word constraints.

Given a sentence interpretation, we construct a sentence-specific model which recognizes if a video depicts the sentence as follows. Each predicate in the first order logic formula has a corresponding HMM, which can recognize if that predicate is true of a video given its arguments. Each variable has a corresponding tracker which attempts to physically locate the bounding box corresponding to that variable in each frame of a video.


PP Attachment | Sam looked at Bill with a telescope.
VP Attachment | Bill approached the person holding a green chair.
Conjunction   | Sam and Bill picked up the yellow bag and chair.
Logical Form  | Someone put down the bags.
Anaphora      | Sam picked up the bag and the chair. It is yellow.
Ellipsis      | Sam left Bill. Also Clark.

Table 4: Examples of the six ambiguity classes described in table 2. The example sentences have at least two interpretations, which are depicted by different videos. Three frames from each such video are shown on the left and on the right below each sentence.


[Figure 3 diagram: (left) tracker lattices for each sentence participant combined with predicate HMMs over frames t = 1, ..., T; (right) predicate-variable graphs for the two interpretations of "Claire and Bill moved a chair", with predicates person, moved and ≠ linked to agent and patient tracks.]

Figure 3: (left) Tracker lattices for every sentence participant are combined with predicate HMMs. The MAP estimate in the resulting cross-product lattice simultaneously finds the best tracks and the best state sequences for every predicate. (right) Two interpretations of the sentence "Claire and Bill moved a chair" having different first order logic formulas. The top interpretation corresponds to Bill and Claire moving the same chair, while the bottom one describes them moving different chairs. Predicates are highlighted in blue at the top and variables are highlighted in red at the bottom. Each predicate has a corresponding HMM which recognizes its presence in a video. Each variable has a corresponding tracker which locates it in a video. Lines connect predicates and the variables which fill their argument slots. Some predicates, such as move and ≠, take multiple arguments. Some predicates, such as move, are applied multiple times between different pairs of variables.

This creates a bipartite graph: HMMs that represent predicates are connected to trackers that represent variables. The trackers themselves are similar to the HMMs, in that they comprise a lattice of potential bounding boxes in every frame. To construct a joint model for a sentence interpretation, we take the cross product of HMMs and trackers, taking only those cross products dictated by the structure of the formula corresponding to the desired interpretation. Given a video, we employ an object detector to generate candidate detections in each frame, construct trackers which select one of these detections in each frame, and finally construct the overall model from HMMs and trackers.
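A schematic sketch of this construction, assuming hypothetical make_predicate_hmm and make_tracker factories; the actual cross-product construction of Siddharth et al. (2014) is considerably more involved:

```python
def build_interpretation_model(predicates, detections_per_frame,
                               make_predicate_hmm, make_tracker):
    """Wire predicate HMMs to variable trackers according to the formula structure.

    predicates: list of (name, argument_variables) pairs, e.g. ("move", ("u", "x")).
    detections_per_frame: per-frame candidate bounding boxes from the object detector.
    make_predicate_hmm, make_tracker: assumed factory functions.
    Returns the bipartite structure (theta) linking each predicate to its trackers.
    """
    variables = {v for _, args in predicates for v in args}
    trackers = {v: make_tracker(detections_per_frame) for v in variables}
    hmms = {}
    theta = {}  # argument mapping: predicate index -> the trackers filling its slots
    for i, (name, args) in enumerate(predicates):
        hmms[i] = make_predicate_hmm(name)
        theta[i] = [trackers[v] for v in args]
    return hmms, trackers, theta
```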

Provided an interpretation and its corresponding formula composed of P predicates and V variables, along with a collection of object detections b^t_j (the j-th candidate detection in frame t), in each frame of a video of length T, the model computes the score of the video-sentence pair by finding the optimal detection for each participant in every frame. This is in essence the Viterbi algorithm (Viterbi, 1971), the MAP algorithm for HMMs, applied to finding the optimal object detections j^t_v for each participant v, and the optimal state k^t_p for each predicate HMM p, in every frame. Each detection is scored by its confidence from the object detector f, and each object track is scored by a motion coherence metric g which determines if the motion of the track agrees with the underlying optical flow. Each predicate p is scored by the probability h_p of observing a particular detection in a given state, and by the probability a_p of transitioning between states. The structure of the formula, and the fact that multiple predicates often refer to the same variables, is recorded by θ, a mapping between predicates and their arguments. The model computes the MAP estimate as:

$$
\max_{\substack{j^1_1,\ldots,j^T_1\\ \vdots\\ j^1_V,\ldots,j^T_V}}
\;\max_{\substack{k^1_1,\ldots,k^T_1\\ \vdots\\ k^1_P,\ldots,k^T_P}}
\;\sum_{v=1}^{V}\left(\sum_{t=1}^{T} f\big(b^t_{j^t_v}\big)
+ \sum_{t=2}^{T} g\big(b^{t-1}_{j^{t-1}_v}, b^t_{j^t_v}\big)\right)
+ \sum_{p=1}^{P}\left(\sum_{t=1}^{T} h_p\big(k^t_p, b^t_{j^t_{\theta^1_p}}, b^t_{j^t_{\theta^2_p}}\big)
+ \sum_{t=2}^{T} a_p\big(k^{t-1}_p, k^t_p\big)\right)
$$

for sentences which have words that refer to at most two tracks (i.e. transitive verbs or binary predicates), but this is trivially extended to arbitrary arities. Figure 3 provides a visual overview of the model as a cross-product of tracker models and word models.
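For intuition, the sketch below evaluates this objective for one fixed assignment of tracks and HMM states; the model itself maximizes the same quantity jointly over all assignments with a Viterbi-style dynamic program in the cross-product lattice. The handling of unary predicates by repeating their single argument is our assumption.

```python
def objective(tracks, states, detections, f, g, h, a, theta):
    """Evaluate the video-sentence score above for one candidate assignment.

    tracks[v][t]  : index of the detection chosen for variable v in frame t
    states[p][t]  : HMM state of predicate p in frame t
    detections[t] : candidate bounding boxes in frame t
    theta[p]      : variable indices filling predicate p's (one or two) argument slots
    f, g, h[p], a[p] : detection, motion-coherence, observation and transition scores
    """
    T = len(detections)
    box = lambda v, t: detections[t][tracks[v][t]]

    score = 0.0
    for v in range(len(tracks)):                       # track terms
        score += sum(f(box(v, t)) for t in range(T))
        score += sum(g(box(v, t - 1), box(v, t)) for t in range(1, T))
    for p in range(len(states)):                       # predicate terms
        args = theta[p] if len(theta[p]) == 2 else (theta[p][0], theta[p][0])
        score += sum(h[p](states[p][t], box(args[0], t), box(args[1], t))
                     for t in range(T))
        score += sum(a[p](states[p][t - 1], states[p][t]) for t in range(1, T))
    return score
```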

Our model extends the approach of Siddharth et al. (2014) in several ways. First, we depart from the dependency-based representation used in that work, and recast the model to encode first order logic formulas. Note that some complex first order logic formulas cannot be directly encoded in the model and require additional inference steps. This extension enables us to represent ambiguities in which a given sentence has multiple logical interpretations for the same syntactic parse.


Second, we introduce several model components which are not specific to disambiguation, but are required to encode linguistic constructions that are present in our corpus and could not be handled by the model of Siddharth et al. (2014). These new components are the predicate "not equal", disjunction, and conjunction. The key addition among these components is support for the new predicate "not equal", which enforces that two tracks, i.e. objects, are distinct from each other. For example, in the sentence "Claire and Bill moved a chair" one would want to ensure that the two movers are distinct entities. In earlier work, this was not required because the sentences tested in that work were designed to distinguish objects based on constraints rather than identity. In other words, there might have been two different people but they were distinguished in the sentence by their actions or appearance. To faithfully recognize that two actors are moving the chair in the earlier example, we must ensure that they are disjoint from each other. In order to do this we create a new HMM for this predicate, which assigns low probability to tracks that heavily overlap, forcing the model to fit two different actors in the previous example. By combining the new first-order-logic-based semantic representation, in lieu of a syntactic representation, with a more expressive model, we can encode the sentence interpretations required to perform the disambiguation task.
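One plausible way to score such a "not equal" predicate is to penalize heavy spatial overlap between the detections of the two tracks, for instance via intersection-over-union; the threshold and penalty below are illustrative choices, not values reported in the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def not_equal_log_score(box_a, box_b, overlap_threshold=0.3, penalty=-1e3):
    """Near-zero score when the two boxes are distinct, a large penalty when they
    overlap heavily (i.e. likely the same underlying object)."""
    return penalty if iou(box_a, box_b) > overlap_threshold else 0.0
```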

Figure 3 (right) shows an example of two different interpretations of the sentence "Claire and Bill moved a chair" discussed above. Object trackers, which correspond to variables in the first order logic representation of the sentence interpretation, are shown in red. Predicates which constrain the possible bindings of the trackers, corresponding to predicates in the representation of the sentence, are shown in blue. Links represent the argument structure of the first order logic formula, and determine the cross products that are taken between the predicate HMMs and tracker lattices in order to form the joint model which recognizes the entire interpretation in a video.

The resulting model provides a single unified formalism for representing all the ambiguities in table 2. Moreover, this approach can be tuned to different levels of specificity. We can create models that are specific to one interpretation of a sentence, or models that are generic and accept multiple interpretations by eliding constraints that are not common between the different interpretations. This allows the model, like humans, to defer deciding on a particular interpretation or to infer that multiple interpretations of the sentence are plausible.

7 Experimental Results

We tested the performance of the model described in the previous section on the LAVA dataset presented in section 5. Each video in the dataset was pre-processed with object detectors for humans, bags, chairs, and telescopes. We employed a mixture of CNN (Krizhevsky et al., 2012) and DPM (Felzenszwalb et al., 2010) detectors, trained on held-out sections of our corpus. For each object class we generated proposals from both the CNN and the DPM detectors, and trained a scoring function to map both results into the same space. The scoring function consisted of a sigmoid over the confidence of the detectors, trained on the same held-out portion of the training set. As none of the disambiguation examples discussed here rely on the specific identity of the actors, we did not detect their identity. Instead, any sentence which contains names was automatically converted to one which contains arbitrary "person" labels.
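The mapping of heterogeneous detector confidences into a common space can be realized with a per-detector sigmoid (Platt-style) calibration. The sketch below uses scikit-learn for this purpose, which is an implementation choice of ours rather than a detail reported in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_confidence_calibration(raw_scores, is_correct):
    """Platt-style calibration: map one detector's raw confidences to probabilities.

    raw_scores : detector confidences on held-out data
    is_correct : 0/1 labels indicating whether each detection was correct
    """
    model = LogisticRegression()
    model.fit(np.asarray(raw_scores).reshape(-1, 1), np.asarray(is_correct))
    return lambda s: model.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Usage (hypothetical data): calibrate the CNN and DPM detectors separately so that
# their scores become comparable before they are fed to the sentence tracker.
# calibrate_cnn = fit_confidence_calibration(cnn_scores, cnn_labels)
# calibrate_dpm = fit_confidence_calibration(dpm_scores, dpm_labels)
```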

The sentences in our corpus have either two or three interpretations. Each interpretation has one or more associated videos, in which the scene was shot from a different angle, carried out by different actors, with different objects, or in different directions of motion. For each sentence-video pair, we performed a 1-out-of-2 or 1-out-of-3 classification task to determine which of the interpretations of the corresponding sentence best fits that video. Overall chance performance on our dataset is 49.04%, slightly lower than 50% due to the 1-out-of-3 classification examples.

The model presented here achieved an accuracy of 75.36% over the entire corpus, averaged across all ambiguity categories. This demonstrates that the model is largely capable of capturing the underlying task and that similar compositional cross-modal models may do the same. Across the 3 major ambiguity classes, accuracy was 84.26% for syntactic ambiguities, 72.28% for semantic ambiguities, and 64.44% for discourse ambiguities.

The most significant source of model failures is poor object detections. Objects are often rotated and presented at angles that are difficult to recognize. Certain object classes, like the telescope, are much more difficult to recognize due to their small size and the fact that hands tend to largely occlude them. This accounts for the degraded performance on the semantic ambiguities relative to the syntactic ambiguities, as many more semantic ambiguities involved the telescope. Object detector performance is similarly responsible for the lower performance on the discourse ambiguities, which relied much more on the accuracy of the person detector, as many sentences involve only people interacting with each other without any additional objects. This degrades performance by removing a helpful constraint for inference, according to which people tend to be close to the objects they are manipulating. In addition, these sentences introduced more visual uncertainty as they often involved three actors.

The remaining errors are due to the event models. HMMs can fixate on short sequences of events which seem as if they are part of an action, but are in fact just noise or the prefix of another action. Ideally, one would want an event model which has a global view of the action: if an object went up from the beginning to the end of the video while a person was holding it, it is likely that the object was being picked up. The event models used here cannot enforce this constraint; they merely assert that the object was moving up for some number of frames, an event which can happen due to noise in the object detectors. Enforcing such local constraints instead of the global constraint on the motion of the object over the video makes joint tracking and event recognition tractable in the framework presented here, but can lead to errors. Finding models which strike a better balance between local information and global constraints while maintaining tractable inference remains an area of future work.

8 Conclusion

We present a novel framework for studying ambiguous utterances expressed in a visual context. In particular, we formulate a new task for resolving structural ambiguities using the visual signal. This is a fundamental task for humans, involving complex cognitive processing, and is a key challenge for language acquisition during childhood. We release a multimodal corpus that makes it possible to address this task, as well as to support further investigation of ambiguity-related phenomena in visually grounded language processing. Finally, we present a unified approach for resolving ambiguous descriptions of videos, achieving good performance on our corpus.

While our current investigation focuses on structural inference, we intend to extend this line of work to learning scenarios, in which the agent has to deduce the meaning of words and sentences from structurally ambiguous input. Furthermore, our framework can be beneficial for image and video retrieval applications in which the query is expressed in natural language. Given an ambiguous query, our approach will enable matching and clustering the retrieved results according to the different query interpretations.

Acknowledgments

This material is based upon work supported by the Center for Brains, Minds, and Machines (CBMM), funded by NSF STC award CCF-1231216. SU was also supported by ERC Advanced Grant 269627 Digital Baby.


References

Kobus Barnard, Matthew Johnson, and David Forsyth. 2003. Word sense disambiguation with pictures. In Proceedings of the HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data - Volume 6, pages 1–5. Association for Computational Linguistics.

L. E. Baum, T. Petrie, G. Soules, and N. Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 136–145. Association for Computational Linguistics.

Moreno I. Coco and Frank Keller. 2015. The interaction of visual and linguistic saliency during syntactic ambiguity resolution. The Quarterly Journal of Experimental Psychology, 68(1):46–74.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision–ECCV 2010, pages 15–29. Springer.

Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645.

Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving image-sentence embeddings using large weakly annotated photo collections. In Computer Vision–ECCV 2014, pages 529–545. Springer.

Andrej Karpathy and Li Fei-Fei. 2014. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306.

Evan Kidd and Judith Holler. 2009. Children's use of gesture to resolve lexical ambiguity. Developmental Science, 12(6):903–913.

Chen Kong, Dahua Lin, Mayank Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? Text-to-image coreference. In Computer Vision and Pattern Recognition (CVPR), pages 3558–3565. IEEE.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. 2011. HMDB: A large video database for human motion recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2556–2563. IEEE.

G. Kulkarni, V. Premraj, S. Dhar, Siming Li, Yejin Choi, A. C. Berg, and T. L. Berg. 2011. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 1601–1608. IEEE Computer Society.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. CoRR, abs/1501.02598.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer.

Jingen Liu, Jiebo Luo, and Mubarak Shah. 2009. Recognizing realistic actions from videos in the wild. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1996–2003. IEEE.

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics.

Vignesh Ramanathan, Armand Joulin, Percy Liang, and Li Fei-Fei. 2014. Linking people in videos with their names using coreference resolution. In Computer Vision–ECCV 2014, pages 95–110. Springer.

Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36.

Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In Computer Vision and Pattern Recognition, pages 1–8.

Narayanaswamy Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2014. Seeing what you're told: Sentence-guided activity recognition in video. In Computer Vision and Pattern Recognition (CVPR), pages 732–739. IEEE.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of ACL, pages 721–732.


Catherine E. Snow. 1972. Mothers' speech to children learning language. Child Development, pages 549–565.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Michael J. Spivey, Michael K. Tanenhaus, Kathleen M. Eberhard, and Julie C. Sedivy. 2002. Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45(4):447–481.

Michael K. Tanenhaus, Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217):1632–1634.

Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August.

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR).

A. J. Viterbi. 1971. Convolutional codes and their performance in communication systems. Communications of the IEEE, 19:751–772, October.
