Multimodal Residual Learning for Visual QA
NamHyuk Ahn
Table of Contents
1. Visual QA
2. Stacked Attention Network (SAN)
3. Residual Learning
4. Multimodal Residual Network (MRN)
Visual QA: Evaluation Metric
- Robust to inter-human variability
- Human accuracy is almost 90%
- 248,349 Training questions (82,783 Images)
- 121,512 Validation questions (40,504 Images)
- 244,302 Testing questions (81,434 Images)
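The robustness to inter-human variability comes from the VQA accuracy metric of Antol et al.: each question has 10 human answers, and a predicted answer is considered fully correct if at least 3 humans gave it. A minimal sketch (the official evaluation additionally averages over subsets of 9 annotators and normalizes answer strings; that is omitted here):

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#matching human answers / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# 10 human annotators for one question
answers = ["dog"] * 4 + ["puppy"] * 3 + ["cat"] * 2 + ["animal"]
print(vqa_accuracy("dog", answers))            # 1.0  (4 >= 3 matches)
print(round(vqa_accuracy("cat", answers), 2))  # 0.67 (2 of 3 needed)
```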
Stacked Attention Network
Motivation
- Answering a question often requires multi-step reasoning
- e.g., an image with {bicycles, window, street, baskets, dogs} objects
- To answer the question well, the model must pinpoint the relevant region.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network (SAN)
- SAN allows multi-step reasoning for visual QA
- An extension of the attention mechanism, which has been successfully applied in captioning, translation, etc.
Q: what are sitting in the basket on a bicycle
Stacked Attention Network
- Image Model
• Extract image features using a CNN
- Question Model
• Extract a semantic vector using a CNN or LSTM
- Stacked Attention
• Multi-step reasoning with attention layers
Stacked Attention
- Multi-step reasoning using attention layers
Image / Question Model
- Image Model
• Get a feature map from the raw-pixel image
• Rescale the image to 448x448 and take features from pool5 of VGGNet (14x14x512)
• Add an extra layer to fit the question feature dimension
- Question Model
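The image-model steps above can be sketched with placeholder tensors: the 14x14x512 pool5 shape is from the slide, while the hidden size (1024) and the tanh projection are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the VGGNet pool5 feature map of a 448x448 image:
# a 14x14 spatial grid with 512 channels per region.
f_I = rng.standard_normal((14, 14, 512))

# Flatten the grid into 196 region vectors.
regions = f_I.reshape(196, 512)

# Extra linear layer (+tanh) to map image features into the same
# space as the question feature (hidden size d is an assumption).
d = 1024
W_I = rng.standard_normal((512, d)) * 0.01
V_I = np.tanh(regions @ W_I)  # (196, d): one feature vector per region

print(V_I.shape)  # (196, 1024)
```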
Stacked Attention Model
- A global image feature is suboptimal due to noise from irrelevant objects/regions.
- Instead, use SAM to pinpoint the relevant region.
- Given the image feature matrix and the question vector, compute a 14x14 attention distribution over regions.
- Take the weighted sum of the image vectors from each region.
- Combine it with the query to obtain a refined query vector for the next reasoning step.
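One attention step as described in the SAN paper can be sketched as follows; the dimensions and weight initializations are placeholders, and the weights would be learned in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 196, 512                      # 14x14 = 196 regions, feature dim d

V_I = rng.standard_normal((m, d))    # image feature matrix (one row per region)
v_Q = rng.standard_normal(d)         # question vector

W_I = rng.standard_normal((d, d)) * 0.01
W_Q = rng.standard_normal((d, d)) * 0.01
w_P = rng.standard_normal(d) * 0.01

# h_A = tanh(W_I V_I + W_Q v_Q): question broadcast to every region
h_A = np.tanh(V_I @ W_I + v_Q @ W_Q)

# Softmax over the 196 regions -> the 14x14 attention distribution
scores = h_A @ w_P
p_I = np.exp(scores - scores.max())
p_I /= p_I.sum()

# Weighted sum of region vectors, then refine the query for the next layer
v_tilde = p_I @ V_I
u = v_tilde + v_Q                    # refined query vector

print(p_I.shape, u.shape)  # (196,) (512,)
```

Stacking several such layers lets each step sharpen the attention map, which is the "multi-step reasoning" on the slide.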
Result
Residual Learning
Problem of Degradation
- More depth means more accuracy, but deep networks can suffer from vanishing/exploding gradients.
• BN, Xavier init, and dropout can handle this (~30 layers).
- With even more depth, the degradation problem occurs.
• Not just overfitting: training error also increases.
Residual Network (ResNet)
Residual Block
- To avoid the degradation problem, add a shortcut connection.
- Element-wise addition of F(x) and the shortcut connection, then pass through ReLU.
- Similar in spirit to LSTM
http://torch.ch/blog/2016/02/04/resnets.html
Shortcut connection
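The residual block computation, y = ReLU(F(x) + x), can be sketched with a two-layer residual mapping F (the layer sizes are placeholders):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = ReLU(F(x) + x): F is two linear layers with a ReLU in
    between, and the shortcut connection is the identity."""
    F_x = W2 @ relu(W1 @ x)   # residual mapping F(x)
    return relu(F_x + x)      # element-wise add with the shortcut, then ReLU

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)

# With near-zero weights F(x) is ~0, so the block reduces to ReLU(x):
# this is why deep residual nets avoid degradation -- each block can
# easily fall back to (near-)identity.
W1 = rng.standard_normal((d, d)) * 1e-3
W2 = rng.standard_normal((d, d)) * 1e-3
y = residual_block(x, W1, W2)
print(np.allclose(y, relu(x), atol=1e-3))  # True
```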
Multimodal Residual Network
Introduction
- Extends deep residual learning to visual QA
- Achieves state-of-the-art results on the visual QA dataset (not today :( )
- Introduces a method to visualize the spatial attention effect of joint residual mappings
Background
SAN
- Question information contributes only weakly, causing a bottleneck.
Baseline [Lu et al.]
- With just element-wise multiplication, visual and question features embed very well.
MRN
- Shortcut mapping and a stacking architecture
- No weighted sum over regions
- Instead uses global multiplication, as [Lu et al.] does.
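One MRN learning block can be sketched from the points above: a joint residual mapping that multiplies nonlinear projections of the question and visual features element-wise, plus a linear shortcut mapping of the question (a linear rather than identity shortcut, per observation (d) below). The tanh nonlinearity, layer counts, and dimensions here are assumptions for illustration.

```python
import numpy as np

def sigma(x):
    return np.tanh(x)  # nonlinearity choice is an assumption

def mrn_block(q, v, W1, W2, Wv, Wq):
    """One MRN learning block (sketch):
    F(q, v) = sigma(W2 @ sigma(W1 @ q)) * sigma(Wv @ v)  # joint residual
    H(q, v) = Wq @ q + F(q, v)                           # linear shortcut
    No attention/weighted sum over regions: the global visual
    feature v modulates the question state multiplicatively."""
    F = sigma(W2 @ sigma(W1 @ q)) * sigma(Wv @ v)
    return Wq @ q + F

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal(d)   # question state (e.g., from an LSTM)
v = rng.standard_normal(d)   # global visual feature (e.g., from a CNN)
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]

# Stack L=3 learning blocks, feeding each output in as the next
# question state (the stacking architecture on the slide).
for _ in range(3):
    q = mrn_block(q, v, *Ws)
print(q.shape)  # (16,)
```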
Quantitative Analysis
- (a) shows a large improvement over SAN; (b) is even better.
- (c): adding an extra embedding on the question side causes overfitting.
- (d): an identity shortcut causes degradation (an extra linear mapping is needed).
- (e) performs reasonably, but the extra shortcut is not essential.
Quantitative Analysis
# of Learning Blocks
- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)
Visual Features
- ResNet-152 is significantly better than VGGNet
- Even though ResNet has a smaller feature dimension (2048 vs. 4096)
# of Answer Classes
- Trade-off among answer types, but 2k is best
- Implicit attention via multiplication
- Gives a high-resolution attention map
Reference
- Yang, Zichao, et al. "Stacked attention networks for image question answering." arXiv preprint arXiv:1511.02274 (2015).
- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).
- Antol, Stanislaw, et al. "VQA: Visual Question Answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.