Perceptual and Sensory Augmented Computing
Computer Vision WS 16/17
Computer Vision – Lecture 15
Deep Learning for Object Categorization
21.12.2016
Bastian Leibe
RWTH Aachen
http://www.vision.rwth-aachen.de
Course Outline
• Image Processing Basics
• Segmentation & Grouping
• Object Recognition
• Object Categorization I
Sliding Window based Object Detection
• Local Features & Matching
Local Features – Detection and Description
Recognition with Local Features
Indexing & Visual Vocabularies
• Object Categorization II
Bag-of-Words Approaches & Part-based Approaches
Deep Learning Methods
• 3D Reconstruction
Recap: Part-Based Models
• Fischler & Elschlager 1973
• Model has two components
Parts (2D image fragments)
Structure (configuration of parts)
Recap: Implicit Shape Model – Representation
• Learn appearance codebook
Extract local features at interest points
Cluster features to form an appearance codebook
• Learn spatial distributions
Match codebook to training images
Record matching positions on object
Training images
(+reference segmentation)
Appearance codebook
Spatial occurrence distributions (x, y, s)
+ local figure-ground labels
Recap: Deformable Part-Based Model
Root filters (coarse resolution)
Part filters (finer resolution)
Deformation models
Slide credit: Pedro Felzenszwalb
Recap: Object Hypothesis
• Multiscale model captures features at two resolutions
Score of an object hypothesis is the sum of filter scores minus deformation costs.
Score of a filter: dot product of the filter with the HOG features underneath it.
Slide credit: Pedro Felzenszwalb
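The scoring rule above can be sketched in a few lines; the arrays and values below are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def filter_score(filt, hog_patch):
    # Score of a filter: dot product of the filter with the
    # HOG features underneath it.
    return float(np.dot(filt.ravel(), hog_patch.ravel()))

def hypothesis_score(filter_scores, deformation_costs):
    # Score of an object hypothesis: sum of filter scores
    # minus deformation costs.
    return sum(filter_scores) - sum(deformation_costs)

# Toy example: a 6x6 root filter over 31 HOG orientation bins
rng = np.random.default_rng(0)
filt = rng.normal(size=(6, 6, 31))
patch = rng.normal(size=(6, 6, 31))
root = filter_score(filt, patch)
total = hypothesis_score([root, 1.2, 0.8], deformation_costs=[0.3, 0.1])
```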
Recap: Score of a Hypothesis
Slide credit: Pedro Felzenszwalb
Topics of This Lecture
• Deep Learning Motivation
• Convolutional Neural Networks
Convolutional Layers
Pooling Layers
Nonlinearities
• CNN Architectures
LeNet
AlexNet
VGGNet
GoogLeNet
• Applications
We’ve finally got there!
Deep Learning
Traditional Recognition Approach
• Characteristics
Features are not learned, but engineered
Trainable classifier is often generic (e.g., SVM)
Many successes in 2000-2010.
Slide credit: Svetlana Lazebnik
Traditional Recognition Approach
• Features are key to recent progress in recognition
Multitude of hand-designed features currently in use
SIFT, HOG, …
Where next? Better classifiers? Or keep building more features?
DPM
[Felzenszwalb
et al., PAMI’07]
Dense SIFT+LBP+HOG BOW Classifier
[Yan & Huan ‘10]
(Winner of PASCAL 2010 Challenge)
Slide credit: Svetlana Lazebnik
What About Learning the Features?
• Learn a feature hierarchy all the way from pixels to
classifier
Each layer extracts features from the output of previous layer
Train all layers jointly
Slide credit: Svetlana Lazebnik
“Shallow” vs. “Deep” Architectures
Slide credit: Svetlana Lazebnik
Background: Perceptrons
Slide credit: Svetlana Lazebnik
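As a concrete refresher, a minimal perceptron with the classic update rule can be sketched as follows; the AND task is just an illustrative choice:

```python
import numpy as np

def perceptron(x, w, b):
    # A perceptron thresholds a weighted sum of its inputs.
    return 1 if np.dot(w, x) + b > 0 else 0

def train(samples, labels, epochs=25, lr=1.0):
    # Classic perceptron learning rule: nudge the weights toward
    # each misclassified sample until the data is separated.
    w, b = np.zeros(len(samples[0])), 0.0
    for _ in range(epochs):
        for x, t in zip(samples, labels):
            err = t - perceptron(x, w, b)
            w += lr * err * np.asarray(x, dtype=float)
            b += lr * err
    return w, b

# Linearly separable example: logical AND
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train(X, y)
```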
Inspiration: Neuron Cells
Slide credit: Svetlana Lazebnik, Rob Fergus
Background: Multi-Layer Neural Networks
• Nonlinear classifier
Training: find network weights w that minimize the error between the true
training labels tn and the estimated labels fw(xn), e.g. E(w) = Σn (fw(xn) - tn)².
Minimization can be done by gradient descent, provided f is differentiable.
– Training method: error backpropagation.
Slide credit: Svetlana Lazebnik
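A numeric sketch of this objective, on hypothetical data; a single linear layer stands in for the network so the gradient stays one line, and backpropagation extends the same chain-rule idea through deeper networks:

```python
import numpy as np

# Minimize E(w) = sum_n (f_w(x_n) - t_n)^2 by gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                          # noiseless training targets

w = np.zeros(3)
for _ in range(500):
    residual = X @ w - t                # f_w(x_n) - t_n for all n
    grad = 2 * X.T @ residual / len(X)  # dE/dw (averaged over samples)
    w -= 0.1 * grad                     # one gradient-descent step

final_error = float(np.sum((X @ w - t) ** 2))
```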
Hubel/Wiesel Architecture
• D. Hubel, T. Wiesel (1959, 1962, Nobel Prize 1981)
Visual cortex consists of a hierarchy of simple, complex, and
hyper-complex cells
Slide credit: Svetlana Lazebnik, Rob Fergus
Convolutional Neural Networks (CNN, ConvNet)
• Neural network with specialized connectivity structure
Stack multiple stages of feature extractors
Higher stages compute more global, more invariant features
Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Slide credit: Svetlana Lazebnik
Topics of This Lecture
• Deep Learning Motivation
• Convolutional Neural Networks
Convolutional Layers
Pooling Layers
Nonlinearities
• CNN Architectures
LeNet
AlexNet
VGGNet
GoogLeNet
• Applications
Convolutional Networks: Structure
• Feed-forward feature extraction
1. Convolve input with learned filters
2. Non-linearity
3. Spatial pooling
4. (Normalization)
• Supervised training of convolutional
filters by back-propagating
classification error
Slide credit: Svetlana Lazebnik
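Steps 1-3 of this pipeline can be sketched for a single channel in NumPy; the filter values below are illustrative, not learned:

```python
import numpy as np

def conv2d(img, k):
    # 1. Convolve (valid mode): slide the filter over the image.
    H, W = img.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

def relu(x):
    # 2. Non-linearity.
    return np.maximum(0.0, x)

def max_pool(x, s=2):
    # 3. Spatial pooling: take the max of each s-by-s block.
    H, W = x.shape
    x = x[:H - H % s, :W - W % s]
    return x.reshape(x.shape[0] // s, s, x.shape[1] // s, s).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.array([[-1.0, 0.0], [0.0, 1.0]])   # simple diagonal-difference filter
feat = max_pool(relu(conv2d(img, k)))
```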
Convolutional Networks: Intuition
• Fully connected network
E.g. 1000£1000 image
1M hidden units
1T parameters!
• Ideas to improve this
Spatial correlation is local
Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
Convolutional Networks: Intuition
• Locally connected net
E.g., 1000×1000 image
1M hidden units, 10×10 receptive fields
→ 100M parameters!
• Ideas to improve this
Spatial correlation is local
Want translation invariance
Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
Convolutional Networks: Intuition
• Convolutional net
Share the same parameters
across different locations
Convolutions with learned
kernels
Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
Convolutional Networks: Intuition
• Convolutional net
Share the same parameters
across different locations
Convolutions with learned
kernels
• Learn multiple filters
E.g., 1000×1000 image
100 filters, 10×10 filter size
→ 10k parameters
• Result: response map of size 1000×1000×100
Only costs memory, not parameters!
Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
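The parameter counts on these intuition slides are worth checking explicitly:

```python
# Checking the parameter counts from the intuition slides above.
H = W = 1000                     # 1000x1000 input image
hidden = 10**6                   # 1M hidden units

full_params = H * W * hidden     # fully connected: one weight per
                                 # (pixel, unit) pair -> 10^12 (1T)
local_params = hidden * 10 * 10  # 10x10 receptive field per unit -> 100M
conv_params = 100 * 10 * 10      # 100 shared 10x10 filters -> 10k
response_map = H * W * 100       # output activations: costs memory,
                                 # not parameters
```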
Important Conceptual Shift
• Before: activations formed flat feature vectors
• Now: activations are arranged in 3D volumes (width × height × depth)
Slide credit: FeiFei Li, Andrej Karpathy
Convolution Layers
• Note: Connectivity is
Local in space (5×5 inside 32×32)
But full in depth (all 3 depth channels)
Slide adapted from FeiFei Li, Andrej Karpathy
Hidden neuron in next layer
Example image: 32×32×3 volume
Before: full connectivity → 32×32×3 weights
Now: local connectivity
One neuron connects to, e.g., a 5×5×3 region.
Only 5×5×3 shared weights.
Convolution Layers
• All Neural Net activations arranged in 3 dimensions
Multiple neurons all looking at the same input region,
stacked in depth
Slide adapted from FeiFei Li, Andrej Karpathy
Convolution Layers
• All Neural Net activations arranged in 3 dimensions
Multiple neurons all looking at the same input region,
stacked in depth
Form a single [1£1£depth] depth column in output volume.
Slide credit: FeiFei Li, Andrej Karpathy
Convolution Layers
• Replicate this column of hidden neurons across space, with some stride.
Example: 7×7 input, 3×3 connectivity, stride 1 → 5×5 output
What about stride 2? → 3×3 output
• In practice, it is common to zero-pad the border.
Preserves the size of the input spatially.
Slide credit: FeiFei Li, Andrej Karpathy
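The 7×7 examples above follow the standard output-size formula, which the slide implies but does not state explicitly:

```python
def conv_output_size(W, F, S=1, P=0):
    # Input width W, filter size F, stride S, zero-padding P.
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

stride1 = conv_output_size(7, 3, S=1)        # 5x5 output
stride2 = conv_output_size(7, 3, S=2)        # 3x3 output
padded = conv_output_size(7, 3, S=1, P=1)    # 7x7: padding preserves the size
```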
Activation Maps of Convolutional Filters
5×5 filters
Slide adapted from FeiFei Li, Andrej Karpathy
Activation maps
Each activation map is a depth
slice through the output volume.
Effect of Multiple Convolution Layers
Slide credit: Yann LeCun
Commonly Used Nonlinearities
• Sigmoid: g(a) = 1/(1 + e^(-a))
• Hyperbolic tangent: g(a) = tanh(a)
• Rectified linear unit (ReLU): g(a) = max(0, a)
ReLU is the preferred option for deep networks.
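The three options, written out in code (the formulas are standard):

```python
import numpy as np

def sigmoid(a):
    # Saturates toward 0/1, which can make gradients vanish.
    return 1.0 / (1.0 + np.exp(-a))

def tanh_unit(a):
    # Zero-centered variant of the sigmoid; still saturates.
    return np.tanh(a)

def relu(a):
    # max(0, a): cheap and non-saturating for a > 0, which is why
    # it became the preferred option for deep networks.
    return np.maximum(0.0, a)

a = np.array([-2.0, 0.0, 2.0])
```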
Convolutional Networks: Intuition
• Let’s assume the filter is
an eye detector
How can we make the
detection robust to the
exact location of the eye?
Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
Convolutional Networks: Intuition
• Let’s assume the filter is
an eye detector
How can we make the
detection robust to the
exact location of the eye?
• Solution:
By pooling (e.g., max or avg)
filter responses at different
spatial locations, we gain
robustness to the exact
spatial location of features.
Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
Max Pooling
• Effect:
Make the representation smaller without losing too much
information
Achieve robustness to translations
Slide adapted from FeiFei Li, Andrej Karpathy
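A 2×2 max pooling step with stride 2 halves the width and height while keeping the strongest responses; the input values are illustrative:

```python
import numpy as np

def max_pool_2x2(x):
    # Each output value is the maximum of a 2x2 block.
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
y = max_pool_2x2(x)   # [[6, 8], [3, 4]]
```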
Max Pooling
• Note
Pooling happens independently across each slice, preserving the
number of slices.
Slide adapted from FeiFei Li, Andrej Karpathy
Compare: SIFT Descriptor
Slide credit: Svetlana Lazebnik
Compare: Spatial Pyramid Matching
Slide credit: Svetlana Lazebnik
Topics of This Lecture
• Deep Learning Motivation
• Convolutional Neural Networks
Convolutional Layers
Pooling Layers
Nonlinearities
• CNN Architectures
LeNet
AlexNet
VGGNet
GoogLeNet
• Applications
CNN Architectures: LeNet (1998)
• Early convolutional architecture
2 Convolutional layers, 2 pooling layers
Fully-connected NN layers for classification
Successfully used for handwritten digit recognition (MNIST)
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Slide credit: Svetlana Lazebnik
ImageNet Challenge 2012
• ImageNet
~14M labeled internet images
20k classes
Human labels via Amazon
Mechanical Turk
• Challenge (ILSVRC)
1.2 million training images
1000 classes
Goal: Predict ground-truth
class within top-5 responses
Currently one of the top benchmarks in Computer Vision
[Deng et al., CVPR’09]
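The top-5 criterion is easy to state in code; the scores below are illustrative:

```python
import numpy as np

def top5_correct(scores, true_class):
    # A prediction counts as correct if the ground-truth class is
    # among the 5 highest-scoring of the 1000 classes.
    top5 = np.argsort(scores)[-5:]
    return true_class in top5

scores = np.zeros(1000)
scores[[3, 17, 42, 99, 512]] = [9, 8, 7, 6, 5]  # illustrative class scores
hit = top5_correct(scores, true_class=42)        # in the top 5
miss = top5_correct(scores, true_class=0)        # not in the top 5
```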
CNN Architectures: AlexNet (2012)
• Similar framework as LeNet, but
Bigger model (7 hidden layers, 650k units, 60M parameters)
More data (10^6 images instead of 10^3)
GPU implementation
Better regularization and up-to-date tricks for training (Dropout)
Image source: A. Krizhevsky, I. Sutskever and G.E. Hinton, NIPS 2012
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep
Convolutional Neural Networks, NIPS 2012.
ILSVRC 2012 Results
• AlexNet almost halved the error rate
16.4% error (top-5) vs. 26.2% for the next best approach
A revolution in Computer Vision
Acquired by Google in Jan ’13, deployed in Google+ in May ’13
AlexNet Results
Image source: A. Krizhevsky, I. Sutskever and G.E. Hinton, NIPS 2012
AlexNet Results
Image source: A. Krizhevsky, I. Sutskever and G.E. Hinton, NIPS 2012
Test image Retrieved images
CNN Architectures: VGGNet (2014/15)
Image source: Hirokatsu Kataoka
K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale
Image Recognition, ICLR 2015
CNN Architectures: VGGNet (2014/15)
• Main ideas
Deeper network
Stacked convolutional
layers with smaller
filters (+ nonlinearity)
Detailed evaluation
of all components
• Results
Improved ILSVRC top-5
error rate to 6.7%.
Image source: Simonyan & Zisserman
Comparison: AlexNet vs. VGGNet
• Receptive fields in the first layer
AlexNet: 11×11, stride 4
Zeiler & Fergus: 7×7, stride 2
VGGNet: 3×3, stride 1
• Why that?
If you stack a 3×3 layer on top of another 3×3 layer, you effectively get a 5×5 receptive field.
With three 3×3 layers, the receptive field is already 7×7.
But with far fewer parameters: 3·3² = 27 instead of 7² = 49.
In addition, the nonlinearities in between the 3×3 layers add discriminative power.
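The receptive-field and parameter arithmetic above, checked in code (counts are per filter, ignoring channels and biases):

```python
def stacked_receptive_field(n_layers, f=3):
    # Each stacked fxf layer grows the receptive field by f - 1.
    rf = 1
    for _ in range(n_layers):
        rf += f - 1
    return rf

rf2 = stacked_receptive_field(2)   # two 3x3 layers act like one 5x5
rf3 = stacked_receptive_field(3)   # three 3x3 layers act like one 7x7
params_stacked = 3 * 3 * 3         # 27 weights for three 3x3 filters
params_single = 7 * 7              # 49 weights for one 7x7 filter
```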
CNN Architectures: GoogLeNet (2014)
• Main ideas
“Inception” module as modular component
Learns filters at several scales within each module
C. Szegedy, W. Liu, Y. Jia, et al, Going Deeper with Convolutions,
arXiv:1409.4842, 2014.
GoogLeNet Visualization
Inception module (+ copies)
Auxiliary classification outputs for training the lower layers (deprecated)
Results on ILSVRC
• VGGNet and GoogLeNet perform at similar level
Comparison: human performance ~5% [Karpathy]
Image source: Simonyan & Zisserman
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Newest Development: Residual Networks
Newest Development: Residual Networks
• Core component
Skip connections
bypassing each layer
Better propagation of
gradients to the deeper
layers
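A minimal sketch of the skip-connection idea; dense matrices stand in for the convolutional layers here, so this is an illustration of the principle rather than the exact ResNet block:

```python
import numpy as np

def residual_block(x, W1, W2):
    # Compute F(x) + x: the transformation path plus the identity
    # skip path that lets gradients bypass the layers.
    h = np.maximum(0.0, W1 @ x)           # first layer + ReLU
    return np.maximum(0.0, W2 @ h + x)    # second layer, add skip, ReLU

x = np.ones(4)
W_zero = np.zeros((4, 4))
# With all-zero weights the block reduces to the identity mapping:
# F(x) = 0, so the input passes through unchanged.
out = residual_block(x, W_zero, W_zero)
```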
ImageNet Performance
Topics of This Lecture
• Deep Learning Motivation
• Convolutional Neural Networks
Convolutional Layers
Pooling Layers
Nonlinearities
• CNN Architectures
LeNet
AlexNet
VGGNet
GoogLeNet
• Applications
The Learned Features are Generic
• Experiment: feature transfer
Train network on ImageNet
Chop off last layer and train classification layer on CalTech256
State-of-the-art accuracy already with only 6 training images
Image source: M. Zeiler, R. Fergus
(The plot marks the pre-CNN state-of-the-art level for comparison.)
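The feature-transfer experiment can be sketched as follows; the tiny arrays and the frozen random "backbone" are stand-ins for the real ImageNet network and CalTech-256 data:

```python
import numpy as np

rng = np.random.default_rng(1)
W_frozen = rng.normal(size=(64, 16))   # stands in for the pretrained net

def pretrained_features(images):
    # The "chopped" network: a fixed, never-updated feature extractor.
    return images.reshape(len(images), -1) @ W_frozen

images = rng.normal(size=(30, 8, 8))   # 30 tiny stand-in images
labels = rng.integers(0, 3, size=30)   # 3 stand-in classes

feats = pretrained_features(images)    # (30, 16) frozen features
W = np.zeros((16, 3))                  # only the new classifier is trained

for _ in range(300):                   # softmax regression on frozen features
    logits = feats @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad_logits = p.copy()
    grad_logits[np.arange(len(labels)), labels] -= 1.0
    W -= 0.1 * (feats.T @ grad_logits) / len(labels)

train_acc = float((np.argmax(feats @ W, axis=1) == labels).mean())
```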
Other Tasks: Detection
• Results on PASCAL VOC Detection benchmark
Pre-CNN state of the art: 35.1% mAP [Uijlings et al., 2013]
DPM: 33.4% mAP
R-CNN: 53.7% mAP
R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for
Accurate Object Detection and Semantic Segmentation, CVPR 2014
Faster R-CNN (based on ResNets)
K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition,
CVPR 2016.
Other Tasks: Semantic Segmentation
[Farabet et al. ICML 2012, PAMI 2013]
Semantic Segmentation
• More recent results
Based on an extension of ResNets
[Pohlen, Hermans, Mathias, Leibe, arXiv 2016]
Other Tasks: Face Verification
Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-
Level Performance in Face Verification, CVPR 2014
Slide credit: Svetlana Lazebnik
Commercial Recognition Services
• E.g.,
• Be careful when taking test images from Google Search
Chances are they may have been seen in the training set...
Image source: clarifai.com
Commercial Recognition Services
Image source: clarifai.com
References and Further Reading
• LeNet
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based
learning applied to document recognition, Proceedings of the IEEE
86(11): 2278–2324, 1998.
• AlexNet
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification
with Deep Convolutional Neural Networks, NIPS 2012.
• VGGNet
K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for
Large-Scale Image Recognition, ICLR 2015
• GoogLeNet
C. Szegedy, W. Liu, Y. Jia, et al, Going Deeper with Convolutions,
arXiv:1409.4842, 2014.