

Knowledge Acquisition for Visual Question Answering via Iterative Querying

Yuke Zhu, Joseph J. Lim, Li Fei-Fei
Computer Science Department, Stanford University

Introduction

•  Most of today's VQA methods make their predictions based on a predefined set of information.

•  These VQA models have been shown to be "myopic": they tend to fail on novel instances [1].

•  Current state-of-the-art VQA models indicate that they could benefit from better visually grounded evidence [4].

•  We push one step further by enabling a VQA model to ask for and collect "clues", in particular visually grounded evidence.

References

[1] A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. EMNLP, 2016.

[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. ICCV, 2015.

[3] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP, 2016.

[4] A. Jabri, A. Joulin, and L. van der Maaten. Revisiting visual question answering baselines. ECCV, 2016.

[5] R. Krishna et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2016.

[6] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. NIPS, 2016.

[7] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. CVPR, 2016.

Model Overview

[Figure: (a) standard VQA model (question encoder, core network, answer decoder); (b) our iterative VQA model, which adds a query generator, external knowledge sources, a memory encoder, and a memory bank; (c) flow of the iterative VQA model. Example: given the question "Who is under the umbrella?", the query scoring network fills the template "Is there __ in the image?" with "woman"; the response (woman, x, y, w, h) from the knowledge sources is encoded into the memory bank, and the answer decoder outputs "Two women."]

An example iterative query process

[Figure 3 example: for the question "When was this picture taken?" (A. During a wedding. B. During a bar mitzvah. C. During a funeral. D. During a Sunday church service.), the model first asks "What object can you see?" and stores the response (woman, 204, 35, 54, 103) as memory #1; it then asks "What is interacting with woman?" and stores (woman, wearing, wedding gown) as memory #2.]

Figure 3: An example iterative query process. At each time step, the model proposes a task-driven query (Q) to request useful evidence from knowledge sources. The response (R) is encoded into a memory and added to the memory bank.

3.3. Query Generator

The query generator bridges our model with the external knowledge sources. It proposes queries based on the memory state to obtain the most relevant evidence. While the most straightforward strategy would be to paraphrase the target question as a query to an omniscient source, no such oracle exists in practice. Hence, it is essential to define useful query types for communication and to devise an effective query strategy.

An error analysis in previous work [4] indicates that a lack of visual grounding, i.e., facts about the objects in an image, is a key problem in current VQA systems. Such visual grounding would help resolve the uncertainty arising from noisy vision models. Hence, we define four query types that the model can use to request visually grounded evidence. In Table 1, the bold words in the query templates are the free arguments used to construct the final queries. The evidence in the responses is sometimes referred to as episodic memories, as it is grounded in a specific image. These responses can be harvested from either human annotation or a pretrained predictive model.

Table 1: Query types and response formats

Query type and template                Response format
What object can you see?               (object, x, y, w, h)
Is there object in the image?          (object, x, y, w, h)
How does object look?                  (object, attribute)
What is interacting with object1?      (object2, relation)
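To make Table 1 concrete, here is a minimal sketch of the four query types as string templates with their free arguments. All names below (QUERY_TEMPLATES, build_query) are hypothetical illustrations, not the authors' released code.

from typing import Optional

# The four query types of Table 1; the {noun} slot is the free argument
# (the bold word in the template). Comments show the response format.
QUERY_TEMPLATES = {
    "enumerate": "What object can you see?",          # -> (object, x, y, w, h)
    "verify":    "Is there {noun} in the image?",     # -> (object, x, y, w, h)
    "attribute": "How does {noun} look?",             # -> (object, attribute)
    "relation":  "What is interacting with {noun}?",  # -> (object2, relation)
}

def build_query(query_type: str, noun: Optional[str] = None) -> str:
    """Instantiate a query template with its free argument, if any."""
    template = QUERY_TEMPLATES[query_type]
    return template.format(noun=noun) if noun is not None else template

# Example: build_query("verify", "woman") -> "Is there woman in the image?"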

Now, we need a strategy for generating the best query to ask at the current memory state. Reinforcement learning (RL) approaches are commonly used to learn such a querying policy. However, we found that standard deep RL methods such as DQN (Mnih et al., 2015) have convergence issues in our problem setting, which has a large discrete action space. To address this limitation, we instead use a tree expansion method with a greedy scoring function: a query scoring network, trained with supervised learning, evaluates the query candidates at the current state.

Our query scoring network is an MLP model, similar to the core network, followed by a two-level hierarchical softmax over the query types and the query objects (see Fig. 4(a)). It takes an image-question-memory triplet as input; in contrast to the core network, it does not take answer vectors as input. As we do not have ground-truth labels for the optimal query at each step, we automatically generate training samples by Monte-Carlo rollouts. Fig. 4(b) illustrates a rollout of the query tree expansion method. Each node in the tree represents a query candidate. At each step, we maintain a set of nouns that have been seen in the question and the responses, and branch out queries from this set. The noun set is initialized with the noun entities in the question. This set constrains the width of the search tree, making computation tractable. At test time, the query scoring network computes a score for each terminal node, and the model proposes the query with the highest score.
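The following sketch shows the greedy selection step of the tree expansion, under the assumption that score_query wraps the trained query scoring network and receives the image-question-memory triplet plus one candidate. The function names are placeholders for illustration.

def propose_next_query(image, question, memory_bank, noun_set, score_query):
    """Branch query candidates from the current noun set and return the
    (query_type, noun) candidate with the highest score (greedy, no RL)."""
    candidates = [("enumerate", None)]  # "What object can you see?" takes no noun
    for noun in sorted(noun_set):
        candidates += [("verify", noun), ("attribute", noun), ("relation", noun)]
    # The query scoring network evaluates each candidate at the current state.
    return max(candidates,
               key=lambda cand: score_query(image, question, memory_bank, cand))

After the chosen query is executed, nouns appearing in the response are added to noun_set, which is what expands the search tree at the next step.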

3.4. Learning

As described above, the core network can be trained end-to-end. However, at each step the query generator makes a hard decision about which query to propose, introducing a non-differentiable operation, and the core network and the query generator are interdependent. We therefore devise an EM-style training procedure, in which we freeze the core network while training the query scoring network, and vice versa (see Algorithm 1).

We bootstrap with a uniformly random strategy as the seed query generator, since we initially do not have a trained query scoring network. The initial core network is trained on random rollouts using backpropagation. In subsequent iterations, the core network is trained on rollouts generated by the query generator (i.e., tree expansion + query scoring network) from the previous step. Freezing the core network, we then train the query scoring network.

We train the query scoring network with image-question-memory triplets as input. The training set is generated automatically by the core network on Monte-Carlo rollouts, as depicted in Fig. 4(c). In each rollout, we add an input-label pair (label: query type and query object) to the training set if the newly added memory flips a previously incorrect prediction to the correct answer.
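A sketch of this flip-based sample generation is given below. The helper core_answer(image, question, memories) is an assumed stand-in for the frozen core network's prediction; all names are hypothetical.

def generate_scoring_samples(image, question, answer_gt, rollout, core_answer):
    """Label a query as a positive training sample when the memory it adds
    flips a previously wrong core-network prediction to the correct answer."""
    samples, memories = [], []
    for query, response_memory in rollout:     # one Monte-Carlo rollout
        was_correct = core_answer(image, question, memories) == answer_gt
        memories = memories + [response_memory]
        now_correct = core_answer(image, question, memories) == answer_gt
        if not was_correct and now_correct:    # incorrect -> correct flip
            samples.append({"input": (image, question, memories[:-1]),
                            "label": query})   # (query type, query object)
    return samples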

3.5. Implementation Details

We follow the same network setup and hyperparameters as [4]. Both the core network and the query scoring network have 8,192 hidden units. We use dropout (0.5) after the first layer, and ReLU as the non-linearity.


[Figure: example predictions over time (baseline, then after query #1, #2, #3).

Visual7W examples:
•  "Who is holding the orange?" Queries: "What object can you see?" → (man, x, y, w, h); "What object can you see?" → (woman, x, y, w, h); "Is there orange in the image?" → (orange, x, y, w, h). Prediction changes from "A man." to "A woman."
•  "Where is the plate?" Queries: "Is there plate in the image?" → (plate, x, y, w, h); "What new object can you see?" → (floor, x, y, w, h); "How does plate look?" → (plate, green). Prediction: "In the cabinet." → "On the table." → "On the floor."
•  "What color drink is in the nearest glass?" Queries: "Is there glass in the image?" (twice) → (glass, x, y, w, h); "What is interacting with glass?" → (water, of, glass). Prediction: "Brown." → "Blue." → "Clear."

VQA examples:
•  "What would happen if the driver pressed the accelerator?" Queries: "Is there chair in the image?" → (chair, x, y, w, h); "Is there sheep in the image?" (twice) → (sheep, x, y, w, h). Prediction: "Backwards." → "Movement." → "Dead sheep." (Note: false detection.)
•  "Is there anything in this scene used to put out fires?" Queries: "Is there bicycle in the image?" → (bicycle, x, y, w, h); "Is there person in the image?" → (person, x, y, w, h); "Is there fire hydrant in the image?" → (fire hydrant, x, y, w, h). Prediction: "No." → "No." → "Yes."
•  "How many trains are there?" Queries: "What object can you see?" → (clock, x, y, w, h); "Is there train in the image?" → (train, x, y, w, h); "What object can you see?" → (clock, x, y, w, h). Prediction: "0." → "0." → "1." (Note: false detection.)]

Datasets
•  Visual7W telling task [7]
•  VQA Real Multiple Choice challenge [2]

Results
•  New state-of-the-art on Visual7W
•  On par with the VQA challenge winning model

Method                        Visual7W
LSTM-Attention [6]            0.543
MCB [3]                       0.622
MLP [4]                       0.648
MLP + all knowledge           0.658
MLP + uniform sampling        0.653
MLP + query generator         0.679

Knowledge Sources
•  Visual Genome scene graphs [5]
•  Faster R-CNN detectors

Method                        VQA (dev)    VQA (standard)
Two-layer LSTM [2]            0.627        0.631
Co-Attention [6]              0.658        0.661
MCB + Att. + GloVe [3]        0.691        -
MCB Ensemble + Genome [3]     0.702        0.701
MLP [4]                       0.659        -
MLP + query generator         0.691        0.689

Iterative Training Algorithm


Algorithm 1: Training procedure for the iterative QA model

1:  procedure
2:    Generate random query rollouts R(0)
3:    Train initial core network C(0) with rollouts R(0)
4:    Generate training samples S(0) for the query scoring network with C(0)
5:    Train initial query scoring network G(0) with S(0)
6:    for t = 1, ..., N do            ▷ iterate N times
7:      Generate query rollouts R(t) with query scoring network G(t-1)
8:      Finetune core network C(t) from C(t-1) with rollouts R(t)
9:      Generate training samples S(t) for the query scoring network from C(t)
10:     Finetune query scoring network G(t) from G(t-1) with S(t)
11:   end for
12:   return {G(N), C(N)}
13: end procedure
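The control flow of Algorithm 1 can be summarized in the sketch below: one network is frozen while the other is trained, in EM-style alternation. All callables are placeholders passed in by the caller; this illustrates the loop structure under those assumptions, not the authors' implementation.

def train_iterative_qa(random_rollouts, rollouts_with_scorer,
                       train_core, finetune_core,
                       gen_samples, train_scorer, finetune_scorer, N=3):
    R = random_rollouts()                         # line 2: seed query rollouts
    C = train_core(R)                             # line 3
    S = gen_samples(C, R)                         # line 4
    G = train_scorer(S)                           # line 5
    for t in range(1, N + 1):                     # lines 6-11
        eps = 1.0 + (0.1 - 1.0) * t / N           # anneal epsilon: 1.0 -> 0.1
        R = rollouts_with_scorer(G, epsilon=eps)  # line 7: eps-greedy rollouts
        C = finetune_core(C, R)                   # line 8 (G frozen)
        S = gen_samples(C, R)                     # line 9
        G = finetune_scorer(G, S)                 # line 10 (C frozen)
    return G, C                                   # line 12: {G(N), C(N)}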

[Figure 4 illustration: (a) the query scoring network takes the image, the question, and the memory bank as input, and outputs a query type and a query object through a hierarchical softmax. (b) For the question "How is the child protecting her head?", the seed noun set is {child, head}; candidate queries such as "Is there dog in the image?", "How does child look?", and "What is interacting with child?" each receive a score (0.5, 0.1, 0.2 in the figure); the response (helmet, x, y, w, h) adds "helmet" to the set, giving {helmet, child, head}. (c) The answer loss is tracked over a Monte-Carlo rollout; when a newly added memory flips the answer correctness, the rollout yields a new training sample (input: image, question, memory; label: query type, query object), with queries labeled +1 or -1 accordingly.]

Figure 4: (a) Query scoring network. The network uses a hierarchical softmax to evaluate each query given the current memory state and the question. (b) Query search tree expansion. The model starts with a seed set of nouns, which is used to generate queries; new nouns from query responses are added to the set and used to expand the query tree further. (c) Generating training samples for the query scoring network. We perform Monte-Carlo rollouts on the query search tree, and use feedback from the core network as the labels.

Both networks are trained using SGD with momentum and a base learning rate of 0.01. We perform Monte-Carlo rollouts with the query generator using an ε-greedy strategy (Line 7 in Algorithm 1), where ε is annealed from 1.0 to 0.1 as the iterative training procedure proceeds.

4. Experiments

Throughout the experiments, our main goal is to examine how the evidence acquired by the iterative QA model affects question-answering performance on images. We investigate two major aspects of our model: 1) the impact of querying strategies on task-driven knowledge acquisition, and 2) the contributions and limitations of different knowledge sources. We first report quantitative results in Sec. 4.2 and then perform a detailed analysis in Sec. 4.3.

4.1. Experiment Setups

Datasets. Our experiments are conducted on the Visual7W telling task [7] and the VQA Real Multiple Choice challenge [2]. The Visual7W telling dataset includes 69,817 questions for training, 28,020 for validation, and 42,031 for testing. Performance is measured by the percentage of questions answered correctly. The VQA Real Multiple Choice challenge has 248,349 questions for training, 121,512 for validation, and 244,302 for testing. Performance is reported with the evaluation metric proposed by [2].

Knowledge Sources. As indicated by the error analysis in Jabri et al. [4], the key limitation of today's VQA models is a lack of visual grounding of objects and concepts. We thus design the query types and responses in Table 1 to acquire visually grounded evidence from the knowledge sources. Such knowledge can be obtained from human annotation (Visual Genome scene graphs [5]) or from pretrained vision models (Faster R-CNN detectors).

