Multimodal Compact Bilinear Pooling for VQA Akira Fukui 1,2 , Dong Huk Park 1 , Daylen Yang 1 , Anna Rohrbach 1,3 , Trevor Darrell 1 , Marcus Rohrbach 1 1 UC Berkeley EECS, CA, United States 2 Sony Corp., Tokyo, Japan 3 Max Planck Institute for Informatics, Saarbrücken, Germany


Page 1

Multimodal Compact Bilinear Pooling for VQA

Akira Fukui 1,2, Dong Huk Park 1, Daylen Yang 1, Anna Rohrbach 1,3, Trevor Darrell 1, Marcus Rohrbach 1

1 UC Berkeley EECS, CA, United States
2 Sony Corp., Tokyo, Japan
3 Max Planck Institute for Informatics, Saarbrücken, Germany

Page 2

Multimodal language and visual understanding

A table full of food for a feast

Description

Page 3

Multimodal language and visual understanding

The bowl with the brown sauce

Grounding

Page 4

Multimodal language and visual understanding

Gravy

Visual Question Answering

What is the brown sauce?

Page 5

How to Combine Image Representation and Question Representation?

Diagram labels: CNN (image vector with elements spoon, plate, bowl, table, food, corn, …, person); LSTM (question "Is this going to be a feast?", vector with elements Is, ?, feast, going, to, be, …); output "Yes"

☐ All elements can interact
☐ Multiplicative interaction

Page 6

How to Combine Image Representation and Question Representation?

Concatenate

Diagram labels: CNN (image vector with elements spoon, plate, bowl, table, food, corn, …, person); LSTM (question "Is this going to be a feast?"); Concatenate; FC; ReLU; FC; output "Yes"

☑ All elements can interact
☐ Multiplicative interaction
- Difficult to learn output classification
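A minimal NumPy sketch of the concatenation baseline on this slide, assuming the layer order Concatenate, FC, ReLU, FC. The name `concat_fusion`, the toy dimensions, and the random weights are illustrative assumptions, not the authors' implementation. The last two lines check one reading of the unchecked "Multiplicative interaction" box: before the nonlinearity, the first FC layer is only a sum of one linear map per modality.

```python
import numpy as np

def concat_fusion(v, q, w1, b1, w2, b2):
    """Hypothetical baseline: concatenate, then FC -> ReLU -> FC."""
    z = np.concatenate([v, q])            # joint vector, length len(v) + len(q)
    h = np.maximum(w1 @ z + b1, 0.0)      # first FC layer followed by ReLU
    return w2 @ h + b2                    # second FC layer gives answer scores

rng = np.random.default_rng(0)
d_img, d_q, d_hidden, n_answers = 8, 6, 16, 5   # toy sizes (assumed)
v = rng.standard_normal(d_img)            # stand-in for CNN image features
q = rng.standard_normal(d_q)              # stand-in for LSTM question features
w1 = rng.standard_normal((d_hidden, d_img + d_q))
b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((n_answers, d_hidden))
b2 = np.zeros(n_answers)

scores = concat_fusion(v, q, w1, b1, w2, b2)

# The first layer's pre-activation splits into an image term plus a question term.
pre_act = w1 @ np.concatenate([v, q])
additive = w1[:, :d_img] @ v + w1[:, d_img:] @ q
```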

Page 7

How to Combine Image Representation and Question Representation?

Elementwise Multiplication

Diagram labels: CNN (image vector with elements spoon, plate, bowl, table, food, corn, …, person); LSTM (question "Is this going to be a feast?"); FC; FC; ReLU; ⨀; output "Yes"

☐ All elements can interact
☑ Multiplicative interaction
- Difficult to learn input embedding
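A small NumPy sketch of elementwise-multiplication fusion, assuming each modality is first embedded by its own FC + ReLU into a shared dimension. The function name `eltwise_fusion`, the sizes, and the weights are hypothetical. The final check illustrates one reading of the unchecked "All elements can interact" box: the elementwise product is only the diagonal of the outer product of the two embeddings, so dimension i of the image embedding meets only dimension i of the question embedding.

```python
import numpy as np

def eltwise_fusion(v, q, w_v, w_q):
    """Hypothetical baseline: embed each modality, then multiply elementwise."""
    v_emb = np.maximum(w_v @ v, 0.0)   # FC + ReLU on image features
    q_emb = np.maximum(w_q @ q, 0.0)   # FC + ReLU on question features
    return v_emb * q_emb, v_emb, q_emb

rng = np.random.default_rng(0)
d_img, d_q, d_joint = 8, 6, 4          # toy sizes (assumed)
v = rng.standard_normal(d_img)
q = rng.standard_normal(d_q)
w_v = rng.standard_normal((d_joint, d_img))
w_q = rng.standard_normal((d_joint, d_q))

fused, v_emb, q_emb = eltwise_fusion(v, q, w_v, w_q)

# Only matching indices are paired: this equals the diagonal of the outer product.
diag_of_outer = np.diag(np.outer(v_emb, q_emb))
```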

Page 8

How to Combine Image Representation and Question Representation?

Outer Product / Bilinear Pooling [Lin ICCV 2015]

Diagram labels: CNN (image vector with elements spoon, plate, bowl, table, food, corn, …, person); LSTM (question "Is this going to be a feast?"); outer product; FC; output "Yes"

☑ All elements can interact
☑ Multiplicative interaction

[Lin ICCV 2015] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. ICCV 2015.

Page 9

How to Combine Image Representation and Question Representation?

Outer Product / Bilinear Pooling [Lin ICCV 2015]

Diagram labels: CNN (2048); LSTM (question "Is this going to be a feast?", 2048); outer product (2048 x 2048, 4 million); FC (4 million x 1000); output "Yes"

☑ All elements can interact
☑ Multiplicative interaction
☐ High # activations & computation
☐ High # parameters
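A toy NumPy sketch of outer-product (bilinear) fusion. The name `bilinear_fusion` and the small dimensions are assumptions chosen so the example runs instantly; the slide's vectors are 2048-dimensional. The check shows that a linear classifier applied to the flattened outer product is a bilinear form in the two inputs, which is why every image element can interact multiplicatively with every question element.

```python
import numpy as np

def bilinear_fusion(v, q):
    """Flattened outer product: one entry per (image dim, question dim) pair."""
    return np.outer(v, q).ravel()

rng = np.random.default_rng(0)
d_img, d_q = 4, 3                        # toy sizes (assumed)
v = rng.standard_normal(d_img)
q = rng.standard_normal(d_q)
w = rng.standard_normal(d_img * d_q)     # one weight per pair of elements

fused = bilinear_fusion(v, q)
score_linear = w @ fused                            # linear layer on the outer product
score_bilinear = v @ w.reshape(d_img, d_q) @ q      # same score written as v^T W q
```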

Page 10

Multimodal Compact Bilinear Pooling

Compact Bilinear Pooling [Gao CVPR 16]

Diagram labels: CNN (2048); LSTM (question "Is this going to be a feast?", 2048); MCB; FC (16k x 1000); output "Yes"

☑ All elements can interact
☑ Multiplicative interaction
☑ Low # activations & computation
☑ Low # parameters

[Gao CVPR 16] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. CVPR 2016.
[ICLR Workshops 2016] N. Zhang, E. Shelhamer, Y. Gao, T. Darrell. Fine-grained pose prediction, normalization, and recognition.
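The sizes quoted on the slides can be checked with a few lines of arithmetic. This is only a worked calculation, assuming "16k" means 16,000 and that the FC layer has 1000 outputs as drawn; bias terms are ignored.

```python
d_img = 2048
d_q = 2048
n_out = 1000        # FC output width shown on the slides
d_mcb = 16_000      # assumption: "16k" read as 16,000

full_dim = d_img * d_q                 # size of the full outer product
full_fc_params = full_dim * n_out      # FC weights on the full outer product
mcb_fc_params = d_mcb * n_out          # FC weights on the compact representation
reduction = full_fc_params / mcb_fc_params
```

Under these assumptions the full outer product has 4,194,304 entries (the "4 million" on the slide), the FC layer on top of it needs about 4.2 billion weights, and the compact version needs 16 million, roughly 262 times fewer.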

Page 11

Multimodal Compact Bilinear Pooling

Random Projection: Count sketch Ψ

Pham & Pagh (2013): Ψ(x ⊗ y) = Ψ(x) ∗ Ψ(y)

Diagram labels: CNN; LSTM (question "Is this going to be a feast?"); FC (16k x 1000); output "Yes"

☑ All elements can interact
☑ Multiplicative interaction
☐ Low # activations & computation
☑ Low # parameters

[Count sketch] M. Charikar, K. Chen, M. Farach-Colton. Finding frequent items in data streams. Automata, Languages and Programming, 2002.
[Pham & Pagh 13] N. Pham, R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. KDD 2013.
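A minimal NumPy sketch of the Count sketch projection Ψ, assuming the usual construction: each input coordinate gets a random bucket index and a random sign, and colliding coordinates are summed. The helper names `make_sketch_params` and `count_sketch` are hypothetical.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Draw a bucket index in [0, d) and a sign in {-1, +1} per input coordinate."""
    h = rng.integers(0, d, size=n)
    s = rng.choice([-1.0, 1.0], size=n)
    return h, s

def count_sketch(x, h, s, d):
    """Project x (length n) to length d: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)   # unbuffered add, so collisions accumulate
    return out

# Hand-checkable example: coordinates 0 and 2 collide in bucket 0.
x = np.array([1.0, 2.0, 3.0])
h = np.array([0, 2, 0])
s = np.array([1.0, -1.0, -1.0])
small = count_sketch(x, h, s, d=4)     # -> [-2., 0., -2., 0.]

# Random parameters for a larger vector.
rng = np.random.default_rng(0)
h_big, s_big = make_sketch_params(n=2048, d=256, rng=rng)
big = count_sketch(rng.standard_normal(2048), h_big, s_big, d=256)
```

The parameters `h` and `s` are drawn once and then kept fixed, so the projection itself has nothing to learn.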

Page 12

Multimodal Compact Bilinear Pooling

Pham & Pagh (2013): Ψ(x ⊗ y) = Ψ(x) ∗ Ψ(y)

Diagram labels: CNN; LSTM (question "Is this going to be a feast?"); Ψ; Ψ; Convolution; FC (16k x 1000); output "Yes"

☑ All elements can interact
☑ Multiplicative interaction
☑ Low # activations & computation
☑ Low # parameters

Page 13

Multimodal Compact Bilinear Pooling

Pham & Pagh (2013): Ψ(x ⊗ y) = Ψ(x) ∗ Ψ(y)

Diagram labels: CNN; LSTM (question "Is this going to be a feast?"); Ψ; Ψ; FFT; FFT; FFT⁻¹ (Convolution); FC (16k x 1000); output "Yes"

☑ All elements can interact
☑ Multiplicative interaction
☑ Low # activations & computation
☑ Low # parameters
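The identity Ψ(x ⊗ y) = Ψ(x) ∗ Ψ(y) and the FFT shortcut can be checked numerically. The sketch below is a toy NumPy version with small, assumed dimensions and hypothetical function names. It assumes the pair (i, j) of the outer product is hashed to bucket (h_v[i] + h_q[j]) mod d with sign s_v[i] * s_q[j], and that ∗ is circular convolution, computed as an elementwise product in the frequency domain.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Count sketch: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

def mcb(v, q, h_v, s_v, h_q, s_q, d):
    """Sketch both inputs, then circularly convolve them via FFT."""
    fv = np.fft.rfft(count_sketch(v, h_v, s_v, d))
    fq = np.fft.rfft(count_sketch(q, h_q, s_q, d))
    return np.fft.irfft(fv * fq, n=d)

def sketch_of_outer(v, q, h_v, s_v, h_q, s_q, d):
    """Reference: count sketch applied directly to the full outer product."""
    out = np.zeros(d)
    for i in range(len(v)):
        for j in range(len(q)):
            out[(h_v[i] + h_q[j]) % d] += s_v[i] * s_q[j] * v[i] * q[j]
    return out

rng = np.random.default_rng(0)
n_v, n_q, d = 6, 5, 16                      # toy sizes (assumed)
v = rng.standard_normal(n_v)
q = rng.standard_normal(n_q)
h_v = rng.integers(0, d, size=n_v)
s_v = rng.choice([-1.0, 1.0], size=n_v)
h_q = rng.integers(0, d, size=n_q)
s_q = rng.choice([-1.0, 1.0], size=n_q)

fast = mcb(v, q, h_v, s_v, h_q, s_q, d)                 # never forms the outer product
direct = sketch_of_outer(v, q, h_v, s_v, h_q, s_q, d)   # loops over all n_v * n_q pairs
```

The two results agree up to floating-point error, while the FFT route only ever stores vectors of length d.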

Page 14

Related work

• Alternative approach to multiplicative interactions
  - DPPNet: Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

Excerpt from the DPPnet paper shown on the slide:

Figure 2. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classification network and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from the candidate weights obtained from the parameter prediction network.

by constructing multiple branches connected to a common CNN architecture. In this work, we hope to solve the heterogeneous recognition tasks using a single CNN by adapting the weights in the dynamic parameter layer. Since the task is defined by the question in ImageQA, the weights in the layer are determined depending on the question sentence. In addition, a hashing trick is employed to predict a large number of weights in the dynamic parameter layer and avoid parameter explosion.

3.2. Problem Formulation

ImageQA systems predict the best answer â given an image I and a question q. Conventional approaches [16, 23] typically construct a joint feature vector based on two inputs I and q and solve a classification problem for ImageQA using the following equation:

    â = argmax_{a∈Ω} p(a | I, q; θ)    (1)

where Ω is a set of all possible answers and θ is a vector for the parameters in the network. On the contrary, we use the question to predict weights in the classifier and solve the problem. We find the solution by

    â = argmax_{a∈Ω} p(a | I; θs, θd(q))    (2)

where θs and θd(q) denote static and dynamic parameters, respectively. Note that the values of θd(q) are determined by the question q.

4. Network Architecture

Figure 2 illustrates the overall architecture of the proposed algorithm. The network is composed of two sub-networks: classification network and parameter prediction network. The classification network is a CNN. One of the fully-connected layers in the CNN is the dynamic parameter layer, and the weights in the layer are determined adaptively by the parameter prediction network. The parameter prediction network has GRU cells and a fully-connected layer. It takes a question as its input, and generates a real-valued vector, which corresponds to candidate weights for the dynamic parameter layer in the classification network. Given an image and a question, our algorithm estimates the weights in the dynamic parameter layer through hashing with the candidate weights obtained from the parameter prediction network. Then, it feeds the input image to the classification network to obtain the final answer. More details of the proposed network are discussed in the following subsections.

4.1. Classification Network

The classification network is constructed based on VGG 16-layer net [24], which is pre-trained on ImageNet [6]. We remove the last layer in the network and attach three fully-connected layers. The second last fully-connected layer is the dynamic parameter layer whose weights are determined by the parameter prediction network, and the last fully-connected layer is the classification layer whose output dimensionality is equal to the number of possible answers. The probability for each answer is computed by applying a softmax function to the output vector of the final layer.

We put the dynamic parameter layer in the second last fully-connected layer instead of the classification layer because it involves the smallest number of parameters. As the number of parameters in the classification layer increases in proportion to the number of possible answers, predicting the weights for the classification layer may not be a good option to general ImageQA problems in terms of scalability. Our choice for the dynamic parameter layer can be interpreted as follows. By fixing the classification layer while

Page 15

Experimental setup (without Attention)

• Solver
  - Cross-entropy loss, Adam, learning rate 0.0007
• Feature Extraction
  - ResNet 152, image: 448x448
• Answers
  - 3000 most frequent on train
  - Sampling with probability of answers
• Trained on train / validated on val / tested on test-dev

Architecture diagram labels: question "Is this going to be a feast?"; Embed, Tanh (13k ~ 20k, 300); LSTM, drop (1024); LSTM, drop (1024); 2048; CNN (ResNet152); 2048; L2 Normalization; MCB (16k); Signed Sqrt; L2 Normalization (16k); Fully Connected (16k, 3000); Softmax; output "Yes"
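The "Signed Sqrt" and "L2 Normalization" boxes in the diagram can be sketched in NumPy as below. This assumes the common definitions (sign(x) * sqrt(|x|), then division by the Euclidean norm); the function names and the small `eps` guard are assumptions, not taken from the slides.

```python
import numpy as np

def signed_sqrt(x):
    """Elementwise sign(x) * sqrt(|x|); compresses large magnitudes, keeps signs."""
    return np.sign(x) * np.sqrt(np.abs(x))

def l2_normalize(x, eps=1e-12):
    """Scale x to unit Euclidean length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x) + eps)

x = np.array([4.0, -9.0, 0.0, 16.0])   # stand-in for an MCB output
rooted = signed_sqrt(x)                # -> [2., -3., 0., 4.]
y = l2_normalize(rooted)
```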

Page 16

Ablation: Comparison to other multimodal methods

Trained on train, test-dev Acc. [%]

Method                             Acc. [%]
MCB (128x128 -> 4k)                58.7
Full Bilinear (128x128 -> 16k)     58.5
MCB (2048x2048 -> 16k)             59.8
Eltwise Product + FC + FC          57.8
Eltwise Product + FC               56.4
Eltwise Product                    58.6
Concat + FC + FC                   57.1
Concat + FC                        58.4
Concat                             57.5
Eltwise Sum                        56.5

• MCB achieves highest accuracy

Page 17

Ablation: Comparison to other multimodal methods

Trained on train, test-dev Acc. [%]

Same chart as the previous slide; the two bars being compared:

Method                             Acc. [%]
MCB (128x128 -> 4k)                58.7
Full Bilinear (128x128 -> 16k)     58.5

• MCB comparable to Full Bilinear

Page 18

Dimensionality of MCB

• Dimensionality of MCB decides the performance of the outer product approximation

Diagram labels: 2048; 2048; MCB; Dim size

VQA Open-Ended test-dev accuracy

MCB dim size     test-dev Acc. [%]
1024             58.4
2048             58.8
4096             59.4
8192             59.7
16000            59.8
32000            59.7

Page 19

Multimodal language and visual understanding

Gravy

Visual Question Answering

What is the brown sauce?

Page 20

MCB with Attention

• Predict spatial attentions with MCB

Architecture diagram labels: question "What is the yellow food?"; WE, LSTM (2048); Tile (2048x14x14); CNN (ResNet152) (2048x14x14); MCB (16k x 14 x 14); Signed Sqrt; L2 Normalization; Conv, ReLU (512 x 14 x 14); Conv; Softmax (1 x 14 x 14); Weighted Sum (2048); MCB (16k); Signed Sqrt; L2 Normalization (16k); Fully Connected (3000); Softmax; output "Corn"

Attention for captioning:
- K. Xu, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Attention for VQA:
- H. Xu, K. Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
- J. Lu, Hierarchical Question-Image Co-Attention for Visual Question Answering
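The "Softmax 1 x 14 x 14" and "Weighted Sum" steps of the attention diagram can be sketched in NumPy as follows. This assumes a softmax over all 14 x 14 spatial locations followed by a weighted sum of the feature map; the function name, the random inputs, and the toy channel count (4 instead of 2048) are illustrative assumptions.

```python
import numpy as np

def spatial_attention(features, logits):
    """features: (C, H, W) feature map; logits: (H, W) attention scores.

    Returns the attended feature vector (C,) and the attention map (H, W).
    """
    c, h, w = features.shape
    flat = logits.reshape(-1)
    weights = np.exp(flat - flat.max())      # numerically stable softmax
    weights = weights / weights.sum()        # weights over all H*W locations
    attended = features.reshape(c, h * w) @ weights   # weighted sum per channel
    return attended, weights.reshape(h, w)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 14, 14))  # toy stand-in for 2048 x 14 x 14
logits = rng.standard_normal((14, 14))       # stand-in for the Conv output
attended, attn_map = spatial_attention(features, logits)

# With constant logits the attention is uniform and reduces to average pooling.
uniform, _ = spatial_attention(features, np.zeros((14, 14)))
```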

Page 21

Attention Visualizations

Is this person wearing a hat? Yes [Ground truth: Yes]

Page 22

Results on MCB with Attention

• MCB performs well with Attention

Performance of Attention Models

Method                       test-dev Acc. [%]
Concat + FC                  58.4
MCB                          59.8
Concat + FC + Attention      58.4
MCB + Attention              62.5

Page 23

Techniques to improve performance

• Data Augmentation
  - VQA data from Visual Genome Dataset
  - Additional 1M question and answer pairs
  - Removed articles, single word answer
• Ensembles
  - Average the output of Softmax over models

VQA Open-Ended accuracy for genome and ensemble

Model (training data)                                             test-dev Acc. [%]
MCB + Attention (train)                                           62.5
MCB + Attention (train+val)                                       64.2
MCB + Attention (train+val+genome)                                65.1
MCB + Attention + Ensemble (train+val+genome)                     66.7
MCB + Attention + Ensemble (train+val+genome), Multiple Choice    70.2

Visual genome: Connecting language and vision using crowdsourced dense image annotations.
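The ensembling rule on this slide, averaging the softmax outputs over models, can be sketched in NumPy as below. The function names and the example logits are made up for illustration; the number of models and their outputs are not taken from the slides.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ensemble_predict(logits_per_model):
    """Average per-model softmax probabilities and pick the best answer index."""
    probs = np.stack([softmax(z) for z in logits_per_model])
    mean_probs = probs.mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs

# Three hypothetical models scoring the same four candidate answers.
logits_per_model = [
    np.array([2.0, 1.0, 0.0, -1.0]),
    np.array([0.5, 1.5, 0.0, -0.5]),
    np.array([1.0, 0.8, 0.2, 0.0]),
]
answer_idx, mean_probs = ensemble_predict(logits_per_model)

# Sanity case: an ensemble of identical models equals a single model.
_, same = ensemble_predict([logits_per_model[0]] * 3)
```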

Page 24

MCB on other Datasets and Tasks

• Visual7w (Multiple Choice)

Our architecture for Visual 7w: MCB with Attention and Answer Encoding.

Accuracy on Visual7W

Method                    Acc. [%]
Zhu et al.                54.3
Concat + Attention        52.8
MCB + Attention           62.2

• Visual Grounding

Accuracy on Flickr30k Entities

Method                    Acc. [%]
Plummer et al.            43.8
Wang et al.               43.9
Rohrbach et al.           47.7
Concat                    46.5
Eltwise Prod              47.4
Eltwise Prod + Conv       47.9
MCB                       48.7

Visual7W: Grounded Question Answering in Images.
Grounding of textual phrases in images by reconstruction.

Page 25

Examples for VQA

Page 26

Attention Visualizations

What is the woman feeding the giraffe? Carrot [Ground truth: Carrot]

Page 27

Attention Visualizations

What color is her shirt? Purple [Ground truth: Purple]

Page 28

Attention Visualizations

What is her hairstyle for the picture? Ponytail [Ground truth: Ponytail]

Page 29

Attention Visualizations

What color is the chain on the red dress? Pink [Ground truth: Gold]

• Correct Attention, Incorrect Fine-grained Recognition

Page 30

Attention Visualizations

Is the man going to fall down? No [Ground truth: No]

Page 31

Attention Visualizations

What is the surface of the court made of? Clay [Ground truth: Clay]

Page 32

Attention Visualizations

What sport is being played? Tennis [Ground truth: Tennis]

Page 33

Attention Visualizations

What does the shop sell? Clocks [Ground truth: Hot Dogs]

• Incorrect Attention

Page 34

Attention Visualizations

What credit card company is on the banner in the background? Budweiser [Ground truth: Mastercard]

• Correct Attention, Incorrect Concept Association

Page 35

Conclusions

• Multimodal Compact Bilinear Pooling
  - All elements interact multiplicatively
  - Compact and efficient
• MCB with Attention
  - Successfully predicts spatial attention
• Generalization Capability
  - Performance improvement in other vision and language tasks
    - Visual7W, Visual Grounding
  - Compatible with other models
  - Applicable to general multimodal tasks, not only vision and language

Page 36

Thank you for your attention!

Demo: demo.berkeleyvision.org
Code: https://github.com/akirafukui/vqa-mcb/

Multimodal Compact Bilinear Pooling for Visual Question Answering and Grounding

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach

arXiv 2016