deep learning for computer vision – iii
TRANSCRIPT
![Page 1: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/1.jpg)
IIIT Hyderabad
Deep Learning for Computer Vision – III
C. V. Jawahar
![Page 2: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/2.jpg)
1. “Deeper the better”
Are there deeper networks?
![Page 3: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/3.jpg)
AlexNet
5 Convolution + 3 Fully Connected Layers (the last one is the 1000-way output layer)
![Page 4: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/4.jpg)
Recent Success of “Deep Learning”: ImageNet Challenge
• Mostly deeper networks
• Smaller convolutions
• Many specific enhancements

| Method | Top-5 Error Rate |
|---|---|
| SIFT+FV [CVPR 2011] | ~25.7% |
| AlexNet [NIPS 2012] | ~15% |
| OverFeat [ICLR 2014] | ~13% |
| ZeilerNet [ImageNet 2013] | ~11% |
| Oxford-VGG [ICLR 2015] | ~7% |
| GoogLeNet [CVPR 2015] | ~6%, ~4.5% |
| MSRA [arXiv 2015] | ~3.5% (released on 10 December 2015!) |
| Human Performance | 3 to 5% |

Top-5 Error on ImageNet Classification Challenge (1000 classes)
![Page 5: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/5.jpg)
VGG-Net
• More layers lead to more nonlinearities
• Smaller receptive fields:
– fewer parameters; faster
– two stacked 3 x 3 convolutions cover the same receptive field as one 5 x 5
• No normalization
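The receptive-field and parameter claims above can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions, the same number of input and output channels in every layer, and no biases; the function names and the channel width are illustrative only.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by k - 1
    return rf


def conv_weights(kernel_sizes, channels):
    """Weight count (no biases) with `channels` inputs and outputs per layer."""
    return sum(k * k * channels * channels for k in kernel_sizes)


channels = 64  # assumed channel width, for illustration only
stacked = conv_weights([3, 3], channels)  # 2 * 9 * C^2 = 18 C^2
single = conv_weights([5], channels)      # 25 C^2
print(receptive_field([3, 3]), receptive_field([5]))
print(stacked, single)
```

The stacked version also applies two nonlinearities instead of one, which is the "more nonlinearities" point on the slide.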
![Page 6: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/6.jpg)
VGG-Net
![Page 7: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/7.jpg)
VGG-Net Results
![Page 8: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/8.jpg)
GoogLeNet etc.
• Deeper (22 layers)
• Smaller filters
• Computationally and parameter efficient
• Inception module
• OverFeat
– Winner of the ImageNet 2013 localization task
– Learns to predict object boundaries
![Page 9: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/9.jpg)
2. “Off the Shelf Features”
Are these features useful beyond the ImageNet task?
![Page 10: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/10.jpg)
CNN Features are Generic
CNN Features can be used for wider applications:
1. Train the CNN (deep network) on a very large database such as ImageNet.
2. Reuse the CNN to solve smaller problems:
   a. Remove the last layer (classification layer).
   b. The output is the code/feature representation.
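The recipe above can be sketched on a toy network. This is only an illustration: the weights are made-up numbers standing in for ImageNet-trained parameters, and the helper names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


# Toy "pretrained" network: two hidden layers plus a classification layer.
hidden1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
hidden2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
classifier = [[1.0, -1.0], [-1.0, 1.0]]


def extract_features(x):
    """Run every layer except the final classification layer."""
    h = relu(linear(hidden1, x))
    return relu(linear(hidden2, h))


def classify(x):
    """Full forward pass, kept only for comparison."""
    return linear(classifier, extract_features(x))


code = extract_features([1.0, 2.0])  # feature vector to feed a new classifier
```

The vector `code` is what a smaller task would consume, for example as input to a linear SVM trained on the new dataset.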
![Page 11: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/11.jpg)
Examples
![Page 12: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/12.jpg)
Off the shelf
• MIT 67 Indoor Scene Classification
– CNN features outperform hand-crafted features like Gist, SIFT and HOG.
Razavian et al., CVPRW'14
MIT 67 Scene dataset
![Page 13: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/13.jpg)
More ..
H3D Human Attributes
UIUC 64 Object Attributes
![Page 14: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/14.jpg)
3. “Fine tuning and Transfer Learning”
Can we further improve the features?
![Page 15: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/15.jpg)
Settings
• Extend to more classes
– Extend from 1000 classes to another new 100
• Extend to new tasks
– Extend from object classification to scene classification
• Extend to new data sets
– Extend from imagenet to PASCAL
![Page 16: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/16.jpg)
Transfer Learning
• A key observation from visualization:
[Figure: network with three CONV → POOL → NORM stages followed by FC and SOFTMAX, with input xn. Early layers learn general features (Gabor filters, color blobs); later layers learn specific features (e.g. a dog face).]
Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? NIPS '14
![Page 17: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/17.jpg)
Transfer Learning
• A key observation from visualization:
• Further questions:
– Can we quantify the layer generality/specificity?
– Where does the transition occur?
– Is the transition sudden or spread over layers?
Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? NIPS '14
[Figure: network with three CONV → POOL → NORM stages followed by FC and SOFTMAX, with input xn, annotated from General (early layers) to Specific (late layers).]
![Page 18: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/18.jpg)
Transfer Learning
• Transfer performance experiment
– Task A and B
– Types of networks
  • Selffer (BnB / BnB+)
  • Transfer (AnB+)
– Datasets
  • Random split
  • Dissimilar split
• Observations
– Higher-level neurons are more specialized.
– There exist co-adapted neurons between layers, which makes optimization difficult.
Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? NIPS '14
![Page 19: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/19.jpg)
Transfer Learning
• Take-away message
[Figure: the CONV → POOL → NORM (x3) → FC → SOFTMAX network with input xn, shown twice, once per note below.]
– If the dataset is small, retrain the softmax.
– If the dataset is reasonable, retrain a larger portion, with fine-tuning of the initial layers.
– Initializing a network with transferred features almost always gives better generalization
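One way to express the take-away as code is a per-layer learning-rate multiplier, where 0 means frozen. This is a sketch under stated assumptions: the dataset-size threshold, the 0.1 fine-tuning scale, the "first half of the layers" split and the function name are all arbitrary illustrative choices, not values from the slides or from Yosinski et al.

```python
def learning_rate_multipliers(layer_names, num_train_images,
                              small_dataset_limit=5000, fine_tune_scale=0.1):
    """Return {layer: multiplier}; 0.0 freezes a layer, 1.0 retrains it fully."""
    last = len(layer_names) - 1
    if num_train_images < small_dataset_limit:
        # Small dataset: retrain only the final (softmax) layer.
        return {name: (1.0 if i == last else 0.0)
                for i, name in enumerate(layer_names)}
    # Larger dataset: retrain the later layers, gently fine-tune the early ones.
    half = len(layer_names) // 2
    return {name: (fine_tune_scale if i < half else 1.0)
            for i, name in enumerate(layer_names)}


layers = ["conv1", "conv2", "conv3", "fc", "softmax"]
small = learning_rate_multipliers(layers, 1000)
large = learning_rate_multipliers(layers, 50000)
```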
![Page 20: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/20.jpg)
Transfer Learning
Razavian et al., CVPRW'2014
Chatfield et al., BMVC'2014
![Page 21: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/21.jpg)
4. “Classification Vs Detection”
Can we also use these features for localization?
![Page 22: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/22.jpg)
R-CNN: Regions with CNN Features
• Rich feature hierarchies for accurate object detection and semantic segmentation
Input image → Extract region proposals (~2k/image) → Compute CNN features → Classify regions (linear SVM)
Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014
![Page 23: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/23.jpg)
R-CNN: Training
![Page 24: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/24.jpg)
R-CNN: At test time – Step 1
• Proposal-method agnostic, many choices
– Selective Search [van de Sande, Uijlings et al.]
– MCG [Arbelaez et al.]
– BING [Cheng et al.]
– CPMC [Carreira & Sminchisescu]
Input image → Extract region proposals (~2k/image)
![Page 25: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/25.jpg)
R-CNN: At test time – Step 2
Input image → Extract region proposals (~2k/image) → Compute CNN features
a. Crop (extract and dilate the proposal)
b. Scale (anisotropic) to 227 x 227
![Page 26: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/26.jpg)
R-CNN: At test time – Step 3
Input image → Extract region proposals (~2k/image) → Compute CNN features → Classify regions (linear SVM)
• CNN features: 4096-dimensional fc7 feature vector
• Linear classifiers (SVM or softmax) score each region, e.g. Person? 1.6 … Horse? -0.3 …
![Page 27: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/27.jpg)
R-CNN: At test time – Step 4
• Object Proposal Refinement (Bounding box regression)
Original image → Linear regression on CNN features → Predicted object bounding box
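A minimal sketch of how regressed offsets refine a proposal, assuming the centre/size parameterisation described in Girshick et al. (shift the centre by a fraction of the box size, scale width and height exponentially). The function name is hypothetical, and in the real pipeline the four deltas come from the linear regressor on CNN features; here they are hard-coded.

```python
import math


def apply_box_deltas(proposal, deltas):
    """proposal = (cx, cy, w, h); deltas = (dx, dy, dw, dh)."""
    cx, cy, w, h = proposal
    dx, dy, dw, dh = deltas
    return (cx + w * dx,        # shift centre by a fraction of the width
            cy + h * dy,        # shift centre by a fraction of the height
            w * math.exp(dw),   # log-space scaling keeps width positive
            h * math.exp(dh))


refined = apply_box_deltas((100.0, 50.0, 40.0, 20.0), (0.5, -0.5, 0.0, 0.0))
```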
![Page 28: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/28.jpg)
R-CNN: Results
• Evaluation: mAP
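For reference, a small sketch of mean average precision over ranked detections. It uses the plain (non-interpolated) definition and assumes every ground-truth object appears somewhere in the ranked list; benchmark toolkits such as PASCAL VOC use interpolated variants, so treat this as illustrative only.

```python
def average_precision(ranked_is_positive):
    """AP of one ranked list; entries are True for correct detections."""
    hits = 0
    precisions = []
    for rank, is_positive in enumerate(ranked_is_positive, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class_rankings):
    """Average the per-class APs; keys are class names."""
    aps = [average_precision(r) for r in per_class_rankings.values()]
    return sum(aps) / len(aps)


example = {"person": [True], "horse": [False, True]}
example_map = mean_average_precision(example)
```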
![Page 29: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/29.jpg)
5. “Features have semantics”
Can we understand or interpret these CNN features?
![Page 30: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/30.jpg)
Visualizing CNNs
• CNNs are cool, but some of the questions below need answers before we move forward:
– How do I interpret the learned filters?
– What is it that stimulates/excites a neuron?
– How do I decide the architecture or improve existing ones?
• To answer these, we need to probe the learned models:
– Deconvolutional Networks [Zeiler et al. ICCV'11, ECCV'14]
– Synthesize images [Simonyan et al. ICLR'14, Mahendran et al. CVPR'15]
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. ECCV 2014
Visualizing the first conv. layer is possible, but how about the later layers?
Source: Krizhevsky et al. NIPS'12
![Page 31: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/31.jpg)
Visualizing CNN
• Map activations back to input pixel space
• Deconvnet – maps features back to pixels
• Occlusion sensitivity – revealing parts of the scene that are important for classification
[Figure: deconvnet diagram with labels Input image, Convolution (learned), Non-linearity, Unpooling, Feature maps.]
![Page 32: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/32.jpg)
Visualizing CNNs
• Deconvnets
– Non-parametric approach.
– Projects the feature activation back to input space.
– Analyses a trained model and uses validation data to interpret the feature activation.
– Visualizes a single activation and not the joint activity.
– Helps in understanding the generalizing ability of CNNs.
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. ECCV 2014
Source: Zeiler et al. ECCV'14
![Page 33: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/33.jpg)
Visualizing CNNs
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. ECCV 2014
Grass!
Source: Zeiler et al. ECCV'14
A. How do I interpret the learned filters?
![Page 34: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/34.jpg)
Visualizing CNNs
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. ECCV 2014
Source: Zeiler et al. ECCV'14
A. What is it that stimulates/excites a neuron?
A. How do I decide the architecture or improve existing ones?
[Figure: side-by-side comparisons with panels labeled Old and New.]
![Page 35: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/35.jpg)
Visualizing CNNs
• Class Model Visualization
• Image-Specific Class Saliency Visualization
[Figure: example labeled “Washing Machine”.]
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR 2014
![Page 36: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/36.jpg)
• Class Model Visualization
– Find an L2-regularized image which maximizes the score of class c:
$$\arg\max_I \; S_c(I) - \lambda \|I\|_2^2$$
Here $S_c(I)$ is the score of class c before softmax.
– Initialize with the mean image.
– Back-propagate to update the input pixels, keeping the weights of the intermediate layers fixed.
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, Deep Inside Convolutional Networks: Visualising
Image Classification Models and Saliency Maps. CoRR 2014
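The optimisation behind class model visualization can be sketched with a toy score function. The example assumes a linear class score S_c(I) = w · I so the gradient is available in closed form; in a real ConvNet that gradient comes from back-propagation. The regularisation weight, step size and iteration count are arbitrary illustrative values.

```python
def class_score_gradient_ascent(class_weights, reg=0.5, step=0.1, iterations=200):
    """Maximise S_c(I) - reg * ||I||^2 for a linear score S_c(I) = w . I."""
    # Start from the zero image, i.e. the mean image of zero-centred data.
    image = [0.0] * len(class_weights)
    for _ in range(iterations):
        # Gradient w.r.t. the pixels only; the model weights stay fixed.
        grad = [w - 2.0 * reg * p for w, p in zip(class_weights, image)]
        image = [p + step * g for p, g in zip(image, grad)]
    return image


weights = [1.0, -2.0, 0.5]  # hypothetical class weights
image = class_score_gradient_ascent(weights)
```

For this toy objective the maximiser is w / (2 * reg), which gives a quick sanity check on the update rule.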
Some more
results
Visualizing CNNs
![Page 37: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/37.jpg)
Visualizing CNNs
• Image-Specific Class Saliency Visualization
– Understanding the spatial support of a class in a specific
image.
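$S_c(I)$ is a nonlinear mapping, but it can be approximated by a first-order Taylor expansion around the image $I_0$: $S_c(I) \approx w^T I + b$, with $w = \partial S_c / \partial I \,|_{I_0}$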
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, Deep Inside Convolutional Networks: Visualising
Image Classification Models and Saliency Maps. CoRR 2014
[Figure panels: Original image; Spatial support; Object localization mask; GrabCut.]
![Page 38: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/38.jpg)
Image-Specific Class Saliency Visualization
• Rank the pixels of the image based on their influence on the class score function
• The maps were extracted using a single back-propagation pass through a classification ConvNet
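Ranking pixels by their influence on the class score amounts to sorting them by the magnitude of the score's gradient with respect to the input. The sketch below uses a made-up nonlinear score function and central finite differences in place of a ConvNet and back-propagation, purely to keep the example self-contained.

```python
def toy_class_score(pixels):
    """Hypothetical nonlinear class score standing in for a ConvNet."""
    return pixels[0] ** 2 + 3.0 * pixels[1] - 0.1 * pixels[2]


def saliency(score_fn, pixels, eps=1e-5):
    """|dS/dI| per pixel, estimated with central differences."""
    result = []
    for i in range(len(pixels)):
        up = list(pixels)
        down = list(pixels)
        up[i] += eps
        down[i] -= eps
        result.append(abs(score_fn(up) - score_fn(down)) / (2.0 * eps))
    return result


image = [0.5, 0.2, 0.9]
sal = saliency(toy_class_score, image)
# Pixel indices ordered from most to least influential.
ranking = sorted(range(len(image)), key=lambda i: sal[i], reverse=True)
```

With a real network a single backward pass yields all the derivatives at once, which is why the slides mention one back-propagation pass.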
![Page 39: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/39.jpg)
Understanding Deep Image Representations
Given an encoding of an image, is it possible to reconstruct the image?
• Inversion technique for analysis of deep CNNs [Mahendran et al. CVPR 2015]
• Find an image such that:
– Its code is similar to a given code
– It “looks natural” (image prior regularization)
• Layer after layer, a progressively more invariant and abstract notion of the image content is formed in the network
![Page 40: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/40.jpg)
Visualizing CNNs
• Another interesting question:
– Given a CNN code, is it possible to reconstruct the original image?
Aravindh Mahendran and Andrea Vedaldi, Understanding Deep Image Representations by Inverting Them, CVPR'15
Reconstructions from the 1000-d CNN code from the last layer before applying softmax
![Page 41: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/41.jpg)
Reconstructions from intermediate layers
Understanding Deep Image
Representations
![Page 42: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/42.jpg)
6. “More generic last layer”
Simple and instructional
transformation
![Page 43: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/43.jpg)
Two Stages
Image → Features → CLASSIFIER → LABEL
![Page 44: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/44.jpg)
Training with Hinge Loss
• Loss functions.
– Classification
• Hinge Loss
Hinge loss is a convex function. It is not differentiable everywhere, but a sub-gradient exists.
Sub-gradient w.r.t. xi
[Figure: CONV → POOL → NORM → CONV → POOL → NORM → FC → LOSS, with input xn and label yn.]
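A minimal sketch of the binary hinge loss and one valid sub-gradient with respect to the score fed into the loss layer. It assumes labels in {+1, -1} and a margin of 1; at the kink (label * score = 1) any value between -label and 0 is a valid sub-gradient, and this sketch picks 0.

```python
def hinge_loss(score, label):
    """label is +1 or -1; score is the classifier output."""
    return max(0.0, 1.0 - label * score)


def hinge_subgradient(score, label):
    """A valid sub-gradient of the loss with respect to the score."""
    if label * score < 1.0:
        return -float(label)  # margin violated: push the score towards the label
    return 0.0                # margin satisfied (or at the kink): no update
```

Because the sub-gradient is defined everywhere, the loss can be back-propagated through the network like any differentiable layer.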
![Page 45: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/45.jpg)
FC-Softmax replaced by SVM-HingeLoss
• Train SVM as a neural network.
• Multiclass as max over k SVMs
MNIST (error rates): Softmax 0.99, DLSVM 0.87
CIFAR-10
Expression Recognition
Y. Tang, “Deep Learning using Linear Support Vector Machines”, ICML 2013
• Possibly did not see much rigorous follow-up work immediately
![Page 46: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/46.jpg)
7. “Beyond 0/1 Loss”
Can the last layer do tasks beyond
simple classification?
![Page 47: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/47.jpg)
Beyond 0/1 Loss
• Embedding loss
– Contrastive loss
  • Discriminating between input (positive & negative) pairs.
– Ranking/Triplet loss
  • Defines a relative similarity ranking between input pairs.
– Useful for learning a similarity metric, with applications in:
  • Verification
  • Dimensionality reduction
  • Recognition
![Page 48: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/48.jpg)
Metric learning
• Learn a function that maps input patterns into a
target space such that the simple distance in the
target space (Euclidean) approximates the
“semantic distance” in the input space.
• Semantic distance defines invariance to
– illumination
– poses
– geometric variation
– … (Problem specific)
![Page 49: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/49.jpg)
Siamese Architecture
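• Given a family of functions $G_W(X)$ parameterized by $W$, find $W$ such that the similarity metric $D_W(X_1, X_2) = \|G_W(X_1) - G_W(X_2)\|_2$ is small for similar pairs and large for dissimilar pairs.
Loss function ($Y = 0$ for similar pairs, $Y = 1$ for dissimilar pairs):
$$L(W, Y, X_1, X_2) = (1 - Y)\, L_S(D_W) + Y\, L_D(D_W)$$
Loss function for similar pairs:
$$L_S(D_W) = \tfrac{1}{2} D_W^2$$
Loss function for dissimilar pairs:
$$L_D(D_W) = \tfrac{1}{2} \left(\max(0,\; m - D_W)\right)^2$$
Raia Hadsell, Sumit Chopra, Yann LeCun, Dimensionality Reduction by Learning an Invariant Mapping. CVPR 2006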
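A small sketch of the contrastive loss on a single pair of embeddings, following the form in Hadsell et al.: similar pairs are pulled together, dissimilar pairs are pushed apart until they are at least a margin away. The embeddings here are plain lists standing in for the outputs of the shared network, and the margin value is an arbitrary choice.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb1, emb2, dissimilar, margin=1.0):
    """Loss for one pair; `dissimilar` is True for a negative pair."""
    d = euclidean(emb1, emb2)
    if dissimilar:
        # Penalise dissimilar pairs only while they are inside the margin.
        return 0.5 * max(0.0, margin - d) ** 2
    # Penalise similar pairs by their squared distance.
    return 0.5 * d ** 2
```

In a Siamese setup both embeddings come from the same network with shared weights, so the gradient of this loss updates one set of parameters.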
![Page 50: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/50.jpg)
Face Verification
• Verification Metric
– weighted similarity
– Siamese network
Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf, DeepFace: Closing the Gap to Human-Level
Performance in Face Verification. CVPR 2014
120 M parameters
Most from locally connected layers
![Page 51: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/51.jpg)
Face and Human
• Face recognition [Taigman et al. 2014]: accuracy from 96.33% to 97.35% (human accuracy: 97.53%)
• Human pose estimation: accuracy from 62.0% to 69.0%
![Page 52: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/52.jpg)
Face Verification
ROC curve on LFW dataset
Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification. CVPR 2014
![Page 53: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/53.jpg)
Triplet Network
• Motivations from LMNN.
• A triplet (query, positive, negative) defines the notion of ranking between the samples.
• Useful for verification problems and fine-grained image similarity models.
[Figure: Triplet network architecture]
![Page 54: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/54.jpg)
Triplet Loss
Pair wise relevance score
Distance in Embedding Space
Triplet Loss
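A minimal sketch of a hinge-style triplet loss, assuming squared Euclidean distance in the embedding space and a gap parameter of 1.0; the names are illustrative. The loss is zero once the negative is farther from the query than the positive by at least the gap.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(query, positive, negative, gap=1.0):
    """Zero when the negative is at least `gap` farther than the positive."""
    return max(0.0,
               gap
               + squared_distance(query, positive)
               - squared_distance(query, negative))
```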
![Page 55: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/55.jpg)
Fine grained classification
Ranking Results
Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, Ying Wu, “Learning Fine-grained Image Similarity with Deep Ranking”, CVPR 2014
![Page 56: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/56.jpg)
Face Recognition and Clustering
• Deep architecture inspired by GoogLeNet and Zeiler&Fergus.
• Employs triplet loss for verification, recognition and clustering.
• Constrains the embedding to a d-dimensional hypersphere.
Florian Schroff, Dmitry Kalenichenko, James Philbin, FaceNet: A unified embedding for face
recognition and clustering. CVPR 2015
Clustering Results
![Page 57: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/57.jpg)
Mining Triplets
• Training from easy pairs would result in slow
convergence.
• Picking the hardest positive, negative samples is good but
can lead to outliers (also computationally expensive).
• Picking semi-hard examples is another alternative:
• These negative samples are farther from the anchor than the positive, but still lie inside the margin m
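The semi-hard condition can be written as a simple filter: keep negatives that are farther from the anchor than the positive, yet closer than the positive distance plus the margin. The sketch assumes squared Euclidean distances on already computed embeddings; the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def semi_hard_negatives(anchor, positive, negatives, margin):
    """Negatives n with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_pos = squared_distance(anchor, positive)
    return [n for n in negatives
            if d_pos < squared_distance(anchor, n) < d_pos + margin]


candidates = [[0.5], [1.2], [3.0]]  # hard, semi-hard and easy negatives
selected = semi_hard_negatives([0.0], [1.0], candidates, margin=1.0)
```

These negatives still produce a non-zero triplet loss, so they contribute gradient without being the extreme cases that tend to be outliers.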
![Page 58: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/58.jpg)
8. “Structured Prediction”
![Page 59: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/59.jpg)
Graph based Models for Semantic Segmentation
[Figure: pipeline with stages labeled Input image, Graph construction, Training of potentials (learning), MAP (inference), Final segmentation.]
![Page 60: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/60.jpg)
Structured Prediction
Ex: Semantic Segmentation
• Label every pixel in the image with the category of the object it belongs to
![Page 61: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/61.jpg)
Semantic Segmentation - Introduction
• Problem
– Labelling each pixel by looking only at a small region around it is difficult: the category of a pixel may depend on relatively short-range information, but may also depend on long-range information.
• Solution
– Use multi-scale convolutional networks, which can take into account a large input window while keeping the number of free parameters to a minimum [1].
1. Farabet, Clement, et al. "Learning hierarchical features for scene labeling." Pattern Analysis and Machine Intelligence, 2013.
![Page 62: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/62.jpg)
Semantic Segmentation - Architecture
• The architecture has two main components:
– Multi-scale convolutional representation
  • Convolutional networks provide a simple framework to learn hierarchies of features, composed of multiple stages
– Graph based classification
  • Superpixels, Conditional Random Fields, multilevel cut with class purity criterion
![Page 63: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/63.jpg)
Multi-scale Convolutional Network
• The outputs of the N networks are upsampled and concatenated so as to produce $F = [f_1, u(f_2), \ldots, u(f_N)]$, where u is an upsampling function
• This has the capability of modelling global relationships within a scene, but might still be prone to errors
![Page 64: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/64.jpg)
Graph Based Classification – Strategy 1 – CRF
• Classical CRF model is constructed on superpixels.
• Each pixel in the image is a vertex in the graph, edges are added between all neighbouring nodes, and an energy function is defined.
• CRF energy is minimized using alpha-expansions.
![Page 65: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/65.jpg)
Graph Based Classification – Strategy 2 – Multilevel Parsing
• Parameter-free multilevel parsing
– Method to analyse a family of segmentations and automatically discover the best observation level for each pixel in the image
• Optimal Purity Cover
– Optimization problem: search for the most adapted neighbourhood of a pixel
![Page 66: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/66.jpg)
Semantic Segmentation: Results
![Page 67: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/67.jpg)
9. “Applications in Action Recognition”
![Page 68: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/68.jpg)
Problem Space
• Video Surveillance
• Video classification and indexing
• Image Search
• Patient monitoring and assisted care
• Automatic description generation
![Page 69: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/69.jpg)
Popular Datasets
| Dataset | Number of Action Classes | Clips | Background | Camera Motion | Release Year | Resources |
|---|---|---|---|---|---|---|
| KTH | 6 | 600 | Static | Slight | 2004 | Actor Staged |
| Hollywood 2 | 12 | 1707 | Dynamic | Yes | 2009 | Movies |
| HMDB 51 | 51 | 6766 | Dynamic | Yes | 2011 | Movies, YouTube, Web |
| UCF 101 | 101 | 13320 | Dynamic | Yes | 2012 | YouTube |
| Sports 1M | 487 | 1,133,158 | Dynamic | Yes | 2014 | YouTube |
![Page 70: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/70.jpg)
Dense Trajectories
Visualization of dense
trajectories for a “kiss”
action. Red dots indicate
the point positions in
the current frame
Visualization of
improved dense
trajectories. White
trajectories are removed
due to camera motion.
The red dots are the
trajectory positions in
the current frame.
Dense Trajectories [ Wang et al, CVPR 2011], Improved Dense Trajectories [Wang et al. ICCV 2013]
![Page 71: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/71.jpg)
Two Streams
Simonyan et al, NIPS 2014
Limited Training Data (Videos)
Similar to AlexNet architecture for each stream
![Page 72: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/72.jpg)
Deep Video
Large-scale Video Classification with Convolutional Neural Networks [Karpathy et al, CVPR 2014]
Multiresolution CNN architecture.
Input frames are fed into two
separate streams of processing: a
context stream that models low-
resolution image and a fovea
stream that processes high-
resolution center crop. Both streams
consist of alternating convolution
(red), normalization
(green) and pooling (blue) layers.
Both streams converge to two fully
connected layers (yellow).
![Page 73: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/73.jpg)
TDD: Trajectory-Pooled Deep-Convolutional Descriptors
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors (Wang et al, CVPR 2015)
![Page 74: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/74.jpg)
C3D
Learning Spatiotemporal Features with 3D Convolutional Networks [Tran et al, ICCV 2015]
• C3D's 3D CNN architecture allows 3D convolution and 3D pooling. This preserves temporal information while computing features for video data.
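To make the idea of a 3D convolution concrete, here is a naive single-channel version over (time, height, width) with no padding and stride 1. It is a teaching sketch, not C3D itself: the real network uses many learned multi-channel kernels, whereas the kernel below is a hand-written temporal difference.

```python
def conv3d_valid(video, kernel):
    """Naive 3D cross-correlation over (time, height, width)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                row.append(sum(video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                               for dt in range(kt)
                               for dy in range(kh)
                               for dx in range(kw)))
            plane.append(row)
        out.append(plane)
    return out


# Four 3x3 frames whose brightness equals the frame index.
video = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# 2x1x1 kernel: next frame minus current frame (a temporal difference).
kernel = [[[-1.0]], [[1.0]]]
out = conv3d_valid(video, kernel)
```

Because the kernel spans two frames, the output keeps a time axis and responds to change over time, which a per-frame 2D convolution cannot capture.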
![Page 75: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/75.jpg)
Performance of various approaches
Method | Year | KTH | Hollywood 2 | HMDB 51 | UCF 101 | Sports 1M
STIP 2004 84.3% 20.2%
Laptev et al 2008 91.8%
Dense Trajectory 2011 94.2% 58.2% 46.6%
Improved Dense Trajectory 2013 64.3% 57.2% 85.9%
Two Stream CNN 2014 59.4% 88.0%
Deep video 2014 65.4% 63.9%
TDD* 2015 65.9% 91.5%
C3D 2015 90.4% 85.2%
Sparse spatio-temporal interest points
Features based on Point tracking
CNN based deep learned descriptors
• TDD combines a trajectory-based approach with deep learned descriptors
![Page 76: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/76.jpg)
10. Applications in Human Pose Estimation
![Page 77: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/77.jpg)
Pose Estimation
Goal: to recover the pose of an articulated object which consists of joints and rigid parts.
Slide taken from authors, Yang et al.
![Page 78: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/78.jpg)
Pose Estimation
Part based Models
Matching = Local part evidence + Global constraint
![Page 79: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/79.jpg)
Pose Estimation - Results
![Page 80: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/80.jpg)
Deep Poselets [FG 2015]
Deep poselets are repetitive atomic configurations
![Page 81: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/81.jpg)
Results: Deep Poselets
• Evaluation measure: average precision.
• Comparison: poselets trained using HOG features.

| Method | AP-test |
|---|---|
| HOG | 32.6 |
| CNN before fine-tuning | 48.6 |
| CNN after fine-tuning | 56.0 |
Nataraj Jammalamadaka et al. Face and Gesture, 2015
![Page 82: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/82.jpg)
Results: Deep Poselets
[Figure: ranked detections (Rank 1, 6, 11, ...) for three poselets:]
– AP 78.1, 1863 positives in train set (Rank 1 to Rank 36 shown)
– AP 40.4, 698 positives in train set (Rank 1 to Rank 36 shown)
– AP 29.2, 101 positives in train set (Rank 1 to Rank 26 shown)
Nataraj Jammalamadaka et al. Face and Gesture, 2015
![Page 83: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/83.jpg)
Deep Pose
• Pose estimation is formulated as a Deep Neural Network (DNN) based regression problem towards body joints.
• Presents a cascade of DNN-based pose predictors.
• The pipeline consists of:
– Pose estimation using a DNN-based regressor
– Refining pose estimates using a DNN-based refiner
Toshev, Alexander, and Christian Szegedy. "DeepPose: Human pose estimation
via deep neural networks." CVPR, 2014
![Page 84: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/84.jpg)
Deep Pose: DNN based Regressor
• Train a function ψ which for an image x regresses to
a normalized pose vector
• Estimates rough pose but insufficient to precisely
localize body joints.
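A minimal sketch of what a "normalized pose vector" can look like, assuming the box-relative normalization described in the cited DeepPose paper (joint coordinates are centred on the person bounding box and divided by its width and height). The function names and the `(cx, cy, w, h)` box layout are illustrative choices, not taken from the slides.

```python
import numpy as np

def normalize_pose(joints, box):
    """Map absolute joint coordinates (k, 2) into box-relative coordinates.

    box = (cx, cy, w, h): centre and size of the person bounding box.
    """
    cx, cy, w, h = box
    return (np.asarray(joints, dtype=float) - [cx, cy]) / [w, h]

def denormalize_pose(normalized, box):
    """Inverse mapping: box-relative coordinates back to image coordinates."""
    cx, cy, w, h = box
    return np.asarray(normalized, dtype=float) * [w, h] + [cx, cy]

def l2_pose_loss(predicted, target):
    """Squared L2 distance between predicted and target normalized poses."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(diff ** 2))
```

The regressor ψ is trained to minimize a loss of this kind on normalized targets; at test time its output is mapped back to image coordinates with the inverse transform.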
![Page 85: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/85.jpg)
Deep Pose: DNN based Refiner
• To achieve better precision for pose, a cascade of
pose regressors is trained.
• At each stage, DNN regressors are trained to predict
the displacement of the joint locations from the previous
stage to the true location
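The cascade can be sketched as a loop in which every stage adds a predicted displacement to the current estimate. This is a simplified illustration: the `stage_regressors` callables are hypothetical stand-ins for trained DNNs, and the sketch omits the per-joint sub-image cropping that the cited paper uses at each refinement stage.

```python
import numpy as np

def run_cascade(image, initial_pose, stage_regressors):
    """Refine a pose estimate with a sequence of displacement regressors.

    stage_regressors: list of callables (image, pose) -> displacement array
    of the same shape as pose (hypothetical placeholders for trained DNNs).
    """
    pose = np.asarray(initial_pose, dtype=float)
    history = [pose.copy()]
    for regress in stage_regressors:
        # Each stage predicts how far the current joints are from the truth.
        pose = pose + regress(image, pose)
        history.append(pose.copy())
    return pose, history
```

Returning the per-stage `history` makes it easy to visualize how the estimate moves towards the ground truth from one stage to the next.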
![Page 86: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/86.jpg)
Deep Pose: Results
Predicted poses in red and ground truth poses in green for the first
three stages of a cascade for three examples.
![Page 87: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/87.jpg)
11. Other Applications
![Page 88: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/88.jpg)
Scene Text Recognition: The Problem
CAPOGIRO
Recognize a cropped word
Lexicons = English dictionary
Lexicons = Grocery item list
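To illustrate why the lexicon matters, here is a minimal sketch that snaps a noisy recognizer output to the nearest lexicon word by edit distance. This is only an illustration of lexicon-constrained recognition, not the method used in the work presented on these slides; the function names and the example lexicon are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (ca != cb)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def snap_to_lexicon(raw_prediction, lexicon):
    """Return the lexicon word closest to a noisy recognizer output."""
    return min(lexicon,
               key=lambda word: edit_distance(raw_prediction.upper(), word.upper()))

# Hypothetical usage: a small grocery-style lexicon recovers the word
# even when two characters are misread.
best = snap_to_lexicon("CAP0GIR0", ["CAPOGIRO", "CAPUCCINO", "GIRO"])
```

A small lexicon leaves few competing words, which is why closed-vocabulary results are much higher than results with a full English dictionary.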
![Page 89: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/89.jpg)
IIIT 5K-word dataset
• The largest public dataset
• Large variations
• Character level annotation
• Used by several groups: Xerox Research - Europe, CVC - Spain,
HUST - China, Univ. of Maryland - USA, Univ. of Oxford - UK
Available at: http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html
![Page 90: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/90.jpg)
Quantitative Results: closed vocab.
Method                          SVT-Word   ICDAR(50)   IIIT-5K (small)
ABBYY 9.0                       35         56          24
PICT [ECCV'10]                  59         -           -
PLEX [ICCV'11]                  56         72          -
Ours [CVPR'12, BMVC'12]         78         88          78
Shi et al. [CVPR'13]            74         87          -
Label Embedding [BMVC'13]       -          -           76
Goel et al. [ICDAR'13]          78         90          77
PhotoOCR [ICCV'13]              90         -           -
Deep Features [ECCV'14]         86         96          -
Slide annotations: Deep learning; Energy min.; More suitable for small lexicon
Ours: [Mishra et al., CVPR'12, BMVC'12]. JV&Z achieve more than 90 on all the tasks with a CNN.
![Page 91: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/91.jpg)
Stereo and 3D
the Torch7 environment [1]. The hyperparameters of the
stereo method were:
N_lo = 4, η = 4, Π_1 = 1, σ = 5.656,
N_hi = 8, τ = 0.0442, Π_2 = 32, τ_BF = 5,
P_hi = 1, τ_SO = 0.0625.
5.3. Results
Our method achieves an error rate of 2.61% on the
KITTI stereo test set and is currently ranked first on the
online leaderboard. Table 1 compares the error rates of the
best performing stereo algorithms on this dataset.
Rank Method Error
1 MC-CNN This paper 2.61%
2 SPS-StFl Yamaguchi et al. [20] 2.83%
3 VC-SF Vogel et al. [16] 3.05%
4 CoP Anonymous submission 3.30%
5 SPS-St Yamaguchi et al. [20] 3.39%
6 PCBP-SS Yamaguchi et al. [19] 3.40%
7 DDS-SS Anonymous submission 3.83%
8 StereoSLIC Yamaguchi et al. [19] 3.92%
9 PR-Sf+E Vogel et al. [17] 4.02%
10 PCBP Yamaguchi et al. [18] 4.04%
Table 1. The KITTI stereo leaderboard as it stands in November
2014.
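The error rates in Table 1 are percentages of badly estimated pixels. A minimal sketch of such a metric, assuming the usual KITTI criterion that a pixel counts as wrong when its predicted disparity is off by more than 3 pixels, evaluated only where ground truth exists; the function name and the `valid_mask` argument are illustrative.

```python
import numpy as np

def disparity_error_rate(predicted, ground_truth, valid_mask, threshold=3.0):
    """Percentage of valid pixels whose disparity error exceeds `threshold`."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    valid = np.asarray(valid_mask, dtype=bool)
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")
    # Ground truth is sparse, so only pixels marked valid are scored.
    bad = np.abs(predicted - ground_truth)[valid] > threshold
    return 100.0 * float(bad.mean())
```

The mask matters because laser-scanned ground truth covers only part of each image; counting unlabelled pixels would distort the percentage.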
A selected set of examples, together with predictions
from our method, are shown in Figure 5.
5.4. Runtime
We measure the runtime of our implementation on a
computer with a Nvidia GeForce GTX Titan GPU. Training
takes 5 hours. Predicting a single image pair takes 100
seconds. It is evident from Table 2 that the majority of time
during prediction is spent in the forward pass of the
convolutional neural network.
Component Runtime
Convolutional neural network 95 s
Semiglobal matching 3 s
Cross-based cost aggregation 2 s
Everything else 0.03 s
Table 2. Time required for prediction of each component.
5.5. Training set size
We would like to know if more training data would lead
to a better stereo method. To answer this question, we train
our convolutional neural network on many instances of the
KITTI stereo dataset while varying the training set size. The
results of the experiment are depicted in Figure 4. We
observe an almost linear relationship between the training set
size and error on the test set. These results imply that our
method will improve as larger datasets become available in
the future.
[Figure 4 plot: error (3.25% to 3.65%) against number of training stereo pairs (20 to 160)]
Figure 4. The error on the test set as a function of the number of
stereo pairs in the training set.
6. Conclusion
Our result on the KITTI stereo dataset seems to suggest
that convolutional neural networks are a good fit for
computing the stereo matching cost. Training on bigger datasets
will reduce the error rate even further. Using supervised
learning in the stereo method itself could also be beneficial.
Our method is not yet suitable for real-time applications
such as robot navigation. Future work will focus on
improving the network's runtime performance.
References
[1] Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011).
Torch7: A matlab-like environment for machine learning.
In BigLearn, NIPS Workshop, number EPFL-CONF-192376.
[2] Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The KITTI dataset. International
Journal of Robotics Research (IJRR).
[3] Haeusler, R., Nair, R., and Kondermann, D. (2013).
Ensemble learning for confidence measures in stereo vision.
In Computer Vision and Pattern Recognition (CVPR),
2013 IEEE Conference on, pages 305–312. IEEE.
[4] Hirschmuller, H. (2008). Stereo processing by
semiglobal matching and mutual information. Pattern
Analysis and Machine Intelligence, IEEE Transactions
on, 30(2):328–341.
[5] Hirschmuller, H. and Scharstein, D. (2009).
Evaluation of stereo matching costs on images with radiometric
Zbontar and LeCun, "Computing the Stereo Matching Cost with a Convolutional Neural Network", CVPR 2015
![Page 92: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/92.jpg)
3D: Surface Normals
Wang and Gupta, arXiv 2015
![Page 93: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/93.jpg)
Summary
• Many developments over AlexNet
– Many problems now have stronger baselines
• Effective features
– For a variety of tasks
– Better understanding of what happens in the net.
• Final layer
– Classifier or regressor with different loss functions
– One can have a feature mapping (metric learning)
– One can use traditional structured prediction models
![Page 94: Deep Learning for Computer Vision – III](https://reader033.vdocuments.site/reader033/viewer/2022061101/629b355771a8a73da712c844/html5/thumbnails/94.jpg)
Thanks