DeepFix: A Fully Convolutional Neural Network for Predicting Human Fixations (UPC Reading Group)
TRANSCRIPT
DeepFix: A Fully Convolutional Neural Network for Predicting Human Fixations
Srinivas S S Kruthiventi, Kumar Ayush, and R. Venkatesh Babu (arXiv October 2015) [URL]
Slides by Xavier Giró-i-Nieto, from the Computer Vision Reading Group (27/10/2015). https://imatge.upc.edu/web/teaching/computer-vision-reading-group
Introduction
Bottom-up attention
● Automatic
● Reflexive
● Stimulus-driven
Top-down attention
● Subject’s prior knowledge
● Expectations
● Task-oriented
● Memory
● Behavioral goals
Visual Attentional Mechanisms
● Bottom-up: automatic, reflexive, stimulus-driven.
● Top-down: subject’s prior knowledge, expectations, task-oriented, memory, behavioral goals.
DeepFix vs. classic method
The ingredients
Very deep network
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014)
● Inspired by Oxford’s VGG net (19 layers).
● 20 layers.
● Small kernel sizes.
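The appeal of small kernels can be checked with a little receptive-field arithmetic (a sketch; the slide does not give DeepFix's layer-by-layer receptive fields): stacking stride-1 3x3 convolutions grows the receptive field linearly while using fewer weights than one large kernel.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    rf = 1
    for _ in range(layers):
        rf += kernel - 1  # each stride-1 layer adds (k - 1)
    return rf

# Two stacked 3x3 convolutions see the same 5x5 window as one 5x5 kernel,
# but with 2 * (3*3) = 18 weights per channel pair instead of 25.
print(stacked_receptive_field(3, 2))  # -> 5
print(stacked_receptive_field(3, 3))  # -> 7
```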
Fully convolutional network (FCN)
● Fully connected layers at the end are replaced by convolutional layers with very large receptive fields.
● They capture the global context of the scene.
● End-to-end training
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440)
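A minimal numpy sketch of the "convolutionalization" idea (the sizes here are illustrative, not DeepFix's): a fully connected layer over an H x W feature map computes the same thing as a convolution whose kernel spans the whole map, but the convolutional form can then slide over larger inputs.

```python
import numpy as np

H, W, C_in, C_out = 4, 4, 3, 2  # illustrative sizes only
rng = np.random.default_rng(0)
x = rng.standard_normal((C_in, H, W))
fc_weights = rng.standard_normal((C_out, C_in * H * W))

# Fully connected layer: flatten the feature map, then matrix-multiply.
fc_out = fc_weights @ x.reshape(-1)

# Same weights viewed as C_out convolution kernels of size C_in x H x W,
# applied at the single valid position of an H x W input.
kernels = fc_weights.reshape(C_out, C_in, H, W)
conv_out = np.array([(k * x).sum() for k in kernels])

print(np.allclose(fc_out, conv_out))  # -> True
```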
Inception layers
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going Deeper With Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9)
● GoogLeNet
● Different kernel sizes operating in parallel.
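A toy numpy sketch of the parallel-kernel idea (the real inception module also has 1x1 channel reductions and a pooling branch; this keeps only "several kernel sizes in parallel, outputs stacked as channels", with random illustrative filters):

```python
import numpy as np

def conv_same(x, k):
    """Single-channel 2-D cross-correlation with zero 'same' padding."""
    kh, kw = k.shape
    p = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = x.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = (p[i:i + kh, j:j + kw] * k).sum()
    return out

def inception_block(x, kernel_sizes=(1, 3, 5)):
    """Run filters of several sizes in parallel, stack outputs as channels."""
    rng = np.random.default_rng(0)
    branches = [conv_same(x, rng.standard_normal((k, k))) for k in kernel_sizes]
    return np.stack(branches)  # same H x W, one channel per branch

x = np.ones((8, 8))
out = inception_block(x)
print(out.shape)  # -> (3, 8, 8)
```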
Location Biased Convolutional (LBC) layer
● Centre-bias
The network
Architecture
Small convolutional filters of 3x3 with stride of 1 to allow a large depth without increasing the memory requirement
Max pooling layers (in red) reduce computation.
Gradual increase in the number of channels to progressively learn richer semantic representations: 64, 128, 256, 512...
Weights initialized from VGG-16 net for stable and effective learning
Convolution kernels of 3x3 with hole size 2 have a receptive field of 5x5.
Capture multi-scale semantic structure using two inception-style convolutional modules.
Very large receptive fields of 25x25 by introducing holes of size 6 in kernels
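Both receptive-field figures on these slides follow from the standard dilated ("à trous") convolution formula: a kernel with k taps per side and hole size d spans d*(k-1)+1 input positions. (The 25x25 case is consistent with a 5x5 kernel; the kernel size is an inference, not stated on this slide.)

```python
def dilated_extent(kernel, hole):
    """Input positions spanned per side by a `kernel`-tap filter with `hole` spacing."""
    return hole * (kernel - 1) + 1

print(dilated_extent(3, 2))  # -> 5   (3x3 kernel, holes of size 2)
print(dilated_extent(5, 6))  # -> 25  (5x5 kernel, holes of size 6)
```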
Location Biased Convolutional (LBC) layers
[Equation annotated on slide: the weights from the c’th filter in a convolutional layer (learnt during training) act on the input blob together with location maps that are constant during training.]
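From the slide's annotations, the LBC mechanics can be sketched roughly as follows (a guess with illustrative shapes and a single Gaussian bias map; the paper's exact formulation and number of bias channels may differ): location maps that stay constant during training are appended to the input blob, and the filter weights that act on them are learnt like any other weights.

```python
import numpy as np

def centre_bias_map(h, w, sigma=0.5):
    """Constant 2-D Gaussian centred on the map (never updated in training)."""
    ys = np.linspace(-1, 1, h)[:, None]
    xs = np.linspace(-1, 1, w)[None, :]
    return np.exp(-(ys**2 + xs**2) / (2 * sigma**2))

def lbc_input(x):
    """Append the constant location map as an extra channel of the input blob."""
    c, h, w = x.shape
    bias = centre_bias_map(h, w)[None]  # shape (1, h, w)
    return np.concatenate([x, bias], axis=0)

x = np.zeros((16, 8, 8))  # illustrative input blob
blob = lbc_input(x)
print(blob.shape)  # -> (17, 8, 8): the filter weights then span 17 channels
```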
The final output, of size W/8 x H/8, is upsampled to the input resolution.
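A minimal sketch of taking the W/8 x H/8 output back to input resolution (nearest-neighbour here for brevity; the paper may well use a smoother interpolation such as bilinear):

```python
import numpy as np

def upsample_nearest(sal, factor=8):
    """Repeat each saliency cell `factor` times along both axes."""
    return np.repeat(np.repeat(sal, factor, axis=0), factor, axis=1)

sal = np.arange(6, dtype=float).reshape(2, 3)  # a tiny W/8 x H/8 map
full = upsample_nearest(sal)
print(full.shape)  # -> (16, 24)
```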
Experiments
Training
● 2nd stage: MIT 1003, CAT2000
● Mouse clicks from Microsoft COCO
It is not mentioned how to go from eye fixations to heat maps!
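A common convention in the saliency literature (not confirmed by this paper, hence the complaint above) is to place a Gaussian at each fixation point and sum; a self-contained sketch with an explicit Gaussian and an illustrative sigma:

```python
import numpy as np

def fixations_to_heatmap(fixations, h, w, sigma=3.0):
    """Sum a 2-D Gaussian centred on each (row, col) fixation point."""
    ys = np.arange(h)[:, None]
    xs = np.arange(w)[None, :]
    heat = np.zeros((h, w))
    for r, c in fixations:
        heat += np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma**2))
    return heat / heat.max()  # normalise to [0, 1]

heat = fixations_to_heatmap([(10, 10), (20, 30)], 32, 48)
print(heat.shape)  # -> (32, 48)
```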
● End-to-end (as JuntingNet)
● Caffe framework
● 1 day on a K40 GPU!
Results