


Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

Aishwarya Agrawal∗, Dhruv Batra‡, Devi Parikh‡, Aniruddha Kembhavi†
∗Virginia Tech, ‡Georgia Institute of Technology, †Allen Institute for Artificial Intelligence
[email protected], {dbatra, parikh}@gatech.edu, [email protected]

1. Introduction

Automatically answering questions about visual content is considered to be one of the holy grails of artificial intelligence. Visual Question Answering (VQA) poses a rich set of challenges spanning various domains such as computer vision, natural language processing, knowledge representation, and reasoning. In the last few years, VQA has received a lot of attention – a number of VQA datasets have been curated [1–9] and a variety of deep-learning models have been developed [1, 10–27].

However, a number of studies have found that despite recent progress, today’s VQA models are heavily driven by superficial correlations in the training data and lack sufficient visual grounding [8, 9, 28, 29]. Intuitively, it seems that when faced with a difficult learning problem, models resort to latching onto the language priors in the training datasets to the point of ignoring the image – e.g., overwhelmingly replying to ‘how many X?’ questions with ‘2’ (irrespective of X), ‘what color is . . . ?’ with ‘white’, and ‘is the . . . ?’ with ‘yes’. One reason for this unsatisfactory emergent behavior is the fundamentally problematic nature of IID train-test splits in the presence of strong priors. In existing datasets, strong priors present in the training data also carry over to the test data. As a result, models that rely on a sort of sophisticated memorization of the training data demonstrate acceptable performance on the test set. This is problematic for benchmarking progress in computer vision and AI because it becomes unclear what the source of the improvements is, and to what extent VQA models have learned to ground concepts in images or reason about novel compositions of known concepts.
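To make the role of such priors concrete, the following is a minimal sketch of a ‘blind’ per-question-type prior baseline (in the spirit of the ‘per Q-type prior’ row of Table 1): it answers every question with the most frequent training answer for its question type and never consults the image. The annotation keys ('question_type', 'multiple_choice_answer') follow the VQA annotation format; the helper names are ours.

```python
from collections import Counter, defaultdict

def fit_qtype_prior(train_annotations):
    """Most frequent training answer per question type.

    Each annotation is assumed to be a VQA-style dict with
    'question_type' and 'multiple_choice_answer' keys.
    """
    counts = defaultdict(Counter)
    for ann in train_annotations:
        counts[ann['question_type']][ann['multiple_choice_answer']] += 1
    # e.g. 'how many' -> '2', 'what color is' -> 'white'
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

def answer_blindly(prior, question_type, fallback='yes'):
    """Predict without ever looking at the image."""
    return prior.get(question_type, fallback)
```

Under an IID split such a baseline can look deceptively competent; the ‘per Q-type prior’ row of Table 1 shows how poorly it fares once the test priors shift.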

To help disentangle these factors, we present a new split of the VQA v1.0 dataset [1], called Visual Question Answering under Changing Priors (VQA-CP v1.0). This new dataset is created by re-organizing the train and val splits of the VQA v1.0 dataset in such a way that the distribution of answers per question type (‘how many’, ‘what color is’, etc.) is by design different in the train and test splits.
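This abstract does not spell out the exact re-splitting procedure, so the following is only an illustrative sketch, under our own assumptions, of how such a changing-priors split could be produced: whole (question type, answer) groups are assigned greedily to train or test so that the dominant answer per question type differs across the two splits.

```python
from collections import defaultdict

def changing_priors_split(annotations):
    """Illustrative greedy re-split (not necessarily the authors' exact
    algorithm): the majority answer for each question type is forced to
    differ between train and test by construction.
    """
    # Group annotations by (question type, answer); dict keys assumed.
    groups = defaultdict(list)
    for ann in annotations:
        groups[(ann['question_type'], ann['answer'])].append(ann)

    by_qtype = defaultdict(list)
    for (qtype, ans), anns in groups.items():
        by_qtype[qtype].append((len(anns), ans, anns))

    train, test = [], []
    for qtype, answer_groups in by_qtype.items():
        # Alternate answer groups (largest first) between the splits, so
        # e.g. for 'what sport': 'tennis' -> train, 'skiing' -> test.
        answer_groups.sort(key=lambda g: g[0], reverse=True)
        for i, (_, _, anns) in enumerate(answer_groups):
            (train if i % 2 == 0 else test).extend(anns)
    return train, test
```

Note that this simplified sketch sends each answer of a question type entirely to one split; a more careful split would likely also ensure that every answer still appears in both splits so that test answers are not simply unseen.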

[Figure 1: two qualitative examples. Example 1 – ‘What color is the dog?’: train answer ‘White’, test answer ‘Black’; the training prior favors ‘white’; SAN predicts ‘White’, GVQA predicts ‘Black’. Example 2 – ‘Is the person wearing shorts?’: train answer ‘No’, test answer ‘Yes’; the training prior favors ‘no’; SAN predicts ‘No’, GVQA predicts ‘Yes’. See caption below.]

Figure 1. We propose a novel split of the VQA v1.0 dataset [1] – VQA-CP v1.0 – that enables stress testing VQA models under mismatched priors between train and test. VQA-CP v1.0 is created such that the distribution of answers per question type (‘what color is the’, ‘is the person’) is by design different in the train split (majority answers: ‘white’, ‘no’) compared to the test split (majority answers: ‘black’, ‘yes’). Existing VQA models tend to rely largely on strong language priors in existing datasets and suffer significant performance degradation on VQA-CP v1.0. We propose a novel model (GVQA) that explicitly grounds visual concepts in images, and consequently significantly outperforms existing VQA models.

To demonstrate the difficulty of our VQA-CP v1.0 splits, we report the performance of several existing VQA models [11, 14, 17, 21, 30] on VQA-CP v1.0. Our key finding is that the performance of all tested existing models drops significantly when trained and evaluated on VQA-CP v1.0, compared to training and evaluation on the VQA v1.0 dataset.

Our primary technical contribution is a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent it from ‘cheating’ by primarily relying on priors in the training data. GVQA is motivated by the following intuition – questions in VQA provide two key pieces of information:
(1) What should be recognized? Or, what visual concepts in the image need to be reasoned about to answer the question (e.g., ‘What color is the plate?’ requires looking at the plate in the image).
(2) What can be said? Or, what the space of plausible answers is (e.g., ‘What color . . . ?’ questions need to be answered with names of colors).
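The full GVQA architecture is not described in this two-page abstract, so the sketch below only encodes the stated intuition as two hypothetical components: one that extracts what should be recognized and one that predicts what can be said, with the image deciding among the plausible answers. All class and function names here are illustrative, not the authors' API.

```python
class GroundedVQASketch:
    """Two-step decomposition mirroring the GVQA intuition; this is a
    conceptual sketch, not the published architecture."""

    def __init__(self, concept_extractor, answer_space_classifier,
                 visual_recognizer):
        self.concept_extractor = concept_extractor               # what to recognize
        self.answer_space_classifier = answer_space_classifier   # what can be said
        self.visual_recognizer = visual_recognizer               # grounded in the image

    def answer(self, image, question):
        # 1) From the question alone: which visual concept matters
        #    ('What color is the plate?' -> the plate).
        target_concept = self.concept_extractor(question)
        # 2) Also from the question alone: which answers are plausible
        #    ('What color ...?' -> names of colors).
        plausible_answers = self.answer_space_classifier(question)
        # 3) Only the image decides *which* plausible answer is right,
        #    so the language prior alone cannot produce the final answer.
        scores = self.visual_recognizer(image, target_concept, plausible_answers)
        return max(plausible_answers, key=lambda a: scores[a])
```

Separating these two roles is intended to keep the training answer distribution from leaking directly into the prediction.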



Model              CNN         Overall  Yes/No  Yes    No     Norm Y/N  Number  Other
per Q-type prior   -           08.39    14.70   15.94  11.00  13.47     08.34   02.14
d-LSTM [30]        -           23.51    34.53   27.77  59.02  43.39     11.40   17.42
d-LSTM+n-I [30]    VGG-16      23.51    34.53   25.14  62.09  43.61     11.40   17.42
SAN [11]           VGG-16      26.88    35.34   26.69  60.72  43.70     11.34   24.70
NMN [14]           VGG-16      29.64    38.85   33.83  53.55  43.69     11.23   27.88
MCB [21]           ResNet-152  34.39    37.96   29.66  62.32  45.99     11.80   39.90
GVQA (Ours)        VGG-16      39.23    64.72   63.98  66.65  65.32     11.87   24.86

Table 1. Accuracies of our model compared to existing VQA models on the VQA-CP v1.0 dataset.

2. Visual Question Answering under Changing Priors (VQA-CP v1.0)

In VQA-CP v1.0, the distribution of answers for a given question type is by design different in the train and test splits (Fig. 2), unlike the VQA v1.0 dataset where the distribution for a given question type is similar across the train and val splits [1]. For instance, in VQA-CP v1.0, ‘tennis’ is the most frequent answer for the question type ‘what sport’ in the train split, whereas ‘skiing’ is the most frequent answer for the same question type in the VQA-CP v1.0 test split. Similar differences can be seen for other question types as well – ‘what animal’, ‘what color’, ‘how many’, ‘what brand’. Please visit our poster to learn more about VQA-CP v1.0.
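The shift that Fig. 2 visualizes can also be quantified directly from the annotations. Below is a small sketch (again assuming VQA-style annotation dicts, with helper names of our choosing) that compares the per-question-type answer distributions of two splits via total variation distance, which stays small when the splits share priors and grows toward 1 when they favor different answers.

```python
from collections import Counter

def answer_dist(annotations, qtype):
    """Empirical answer distribution for one question type."""
    counts = Counter(a['multiple_choice_answer'] for a in annotations
                     if a['question_type'] == qtype)
    total = sum(counts.values())
    return {ans: n / total for ans, n in counts.items()} if total else {}

def total_variation(p, q):
    """0 for identical distributions, 1 for disjoint supports."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)

# e.g. total_variation(answer_dist(train_anns, 'what sport'),
#                      answer_dist(test_anns,  'what sport'))
```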

Figure 2. Distributions of answers per question type vary significantly between the VQA-CP v1.0 train (top) and test (bottom) splits. For instance, ‘white’ is the most common answer in train for ‘What color’, whereas ‘black’ is the most common answer in test.

To show the difficulty of VQA-CP v1.0, we train and evaluate the following VQA models on VQA-CP v1.0 and compare with their performance on VQA v1.0: 1) the model [30] from the VQA paper [1], which we refer to as d-LSTM+n-I, 2) Neural Module Networks (NMN) [14], which are designed to be compositional in nature, 3) Stacked Attention Networks (SAN) [11], which is one of the widely used models for VQA, and 4) Multimodal Compact Bilinear Pooling (MCB) [21], which won the VQA Challenge 2016. From Table 2, we can see that the performance of all the existing VQA models drops significantly in the VQA-CP v1.0 setting compared to the VQA v1.0 setting.

Model             Dataset  Overall  Yes/No  Number  Other
d-LSTM+n-I [30]   VQA      54.23    79.81   33.26   40.35
                  VQA-CP   23.51    34.53   11.40   17.42
NMN [14]          VQA      54.83    80.39   33.45   41.07
                  VQA-CP   29.64    38.85   11.23   27.88
SAN [11]          VQA      55.86    78.54   33.46   44.51
                  VQA-CP   26.88    35.34   11.34   24.70
MCB [21]          VQA      60.97    81.62   34.56   52.16
                  VQA-CP   34.39    37.96   11.80   39.90

Table 2. Accuracies of existing VQA models on the VQA v1.0 val split when trained on the VQA v1.0 train split, and those on the VQA-CP v1.0 test split when trained on the VQA-CP v1.0 train split.

3. Grounded Visual Question Answering (GVQA) Results

Table 1 shows the performance of the GVQA model in comparison to several existing VQA models on the VQA-CP v1.0 dataset, using the VQA evaluation metric [1]. It is insightful to look at the accuracies of ‘Yes’ and ‘No’ individually because the distribution of ‘Yes’ and ‘No’ in the test set is biased towards ‘Yes’ (74.64%). Hence, a model which says ‘Yes’ for all Yes/No questions would score 100% for Yes but perform poorly on No and on normalized Yes/No (Norm Y/N) accuracy.
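For context on how these numbers are computed: the VQA evaluation metric [1], in its commonly used simplified form, credits a predicted answer with min(#matching human answers / 3, 1), and the Norm Y/N column is consistent with simply averaging the per-class accuracies on ‘Yes’ and ‘No’ questions (e.g., (63.98 + 66.65) / 2 ≈ 65.32 for GVQA). A minimal sketch of both, with helper names of our own choosing:

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA metric [1]: full credit if at least 3 of the 10
    human annotators gave the predicted answer."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

def normalized_yes_no_accuracy(records):
    """Mean of per-class accuracies over Yes/No questions, so a model
    that always says 'yes' on a 74.64%-'yes' test set scores ~0.5
    rather than ~0.75. `records` is a list of (predicted, ground_truth)
    pairs with ground_truth in {'yes', 'no'}; multiply by 100 to match
    the tables."""
    per_class = []
    for label in ('yes', 'no'):
        subset = [pred == gt for pred, gt in records if gt == label]
        if subset:
            per_class.append(sum(subset) / len(subset))
    return sum(per_class) / len(per_class)
```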

As shown in Table 1, our proposed GVQA significantly outperforms past VQA models on the VQA-CP v1.0 dataset. In particular, GVQA does much better on Y/N questions. Interestingly, GVQA also outperforms MCB on the overall metric as well as the Yes/No category, in spite of the MCB model using the more powerful ResNet-152 architecture in its image pipeline, compared to VGG-16 in GVQA. Please visit our poster to learn about GVQA.



References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015.

[2] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.

[3] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.

[4] Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.

[5] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690, 2014.

[6] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.

[7] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.

[8] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. arXiv preprint arXiv:1612.00837, 2016.

[9] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and Yang: Balancing and answering binary visual questions. In CVPR, 2016.

[10] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. CoRR, abs/1511.05960, 2015.

[11] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. Stacked attention networks for image question answering. In CVPR, 2016.

[12] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.

[13] Aiwen Jiang, Fang Wang, Fatih Porikli, and Yi Li. Compositional memory for visual question answering. CoRR, abs/1511.05676, 2015.

[14] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Deep compositional question answering with neural module networks. In CVPR, 2016.

[15] Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony R. Dick. Explicit knowledge-based reasoning for visual question answering. CoRR, abs/1511.02570, 2015.

[16] Kushal Kafle and Christopher Kanan. Answer-type prediction for visual question answering. In CVPR, 2016.

[17] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.

[18] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In NAACL, 2016.

[19] Kevin J. Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.

[20] Jin-Hwa Kim, Sang-Woo Lee, Dong-Hyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual QA. In NIPS, 2016.

[21] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.

[22] Hyeonwoo Noh and Bohyung Han. Training recurrent answering units with joint loss minimization for VQA. CoRR, abs/1606.03647, 2016.

[23] Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. A focused dynamic attention model for visual question answering. CoRR, abs/1604.01485, 2016.

[24] Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, and Anthony R. Dick. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.

[25] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.

[26] Xiang Sean Zhou and Thomas S. Huang. Relevance feedback in image retrieval: A comprehensive review. Proceedings of ACM Multimedia Systems, 2003.

[27] Kuniaki Saito, Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. DualNet: Domain-invariant network for visual question answering. CoRR, abs/1606.06108, 2016.

[28] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.

[29] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. 2017.

[30] Jiasen Lu, Xiao Lin, Dhruv Batra, and Devi Parikh. Deeper LSTM and normalized CNN visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN, 2015.
