

Computer Vision – Lecture 15

Deep Learning for Object Categorization

21.12.2016

Bastian Leibe

RWTH Aachen

http://www.vision.rwth-aachen.de

[email protected]


Course Outline

• Image Processing Basics

• Segmentation & Grouping

• Object Recognition

• Object Categorization I

Sliding Window based Object Detection

• Local Features & Matching

Local Features – Detection and Description

Recognition with Local Features

Indexing & Visual Vocabularies

• Object Categorization II

Bag-of-Words Approaches & Part-based Approaches

Deep Learning Methods

• 3D Reconstruction


Recap: Part-Based Models

• Fischler & Elschlager 1973

• Model has two components

parts (2D image fragments)

structure (configuration of parts)


Recap: Implicit Shape Model – Representation

• Learn appearance codebook

Extract local features at interest points

Cluster them into an appearance codebook

• Learn spatial distributions

Match codebook to training images

Record matching positions on object

(Figure: training images with reference segmentations; an appearance codebook; and spatial occurrence distributions over x, y, s for each codebook entry, with local figure-ground labels.)


Recap: Deformable Part-Based Model

Root filters (coarse resolution)

Part filters (finer resolution)

Deformation models

Slide credit: Pedro Felzenszwalb


Recap: Object Hypothesis

• Multiscale model captures features at two resolutions

Score of an object hypothesis is the sum of the filter scores minus the deformation costs.

Score of a filter: dot product of the filter with the HOG features underneath it.

Slide credit: Pedro Felzenszwalb


Recap: Score of a Hypothesis

Slide credit: Pedro Felzenszwalb
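The score formula itself was an image on this slide and did not survive extraction. For reference, the hypothesis score in the cited DPM papers (Felzenszwalb et al.) has the form

    score(p_0, \dots, p_n) = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i) \;-\; \sum_{i=1}^{n} d_i \cdot (dx_i, dy_i, dx_i^2, dy_i^2)

i.e., appearance terms (filter F_i applied to the HOG features \phi at placement p_i) minus a quadratic deformation cost on each part's displacement (dx_i, dy_i) from its anchor.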


Topics of This Lecture

• Deep Learning Motivation

• Convolutional Neural Networks

Convolutional Layers

Pooling Layers

Nonlinearities

• CNN Architectures

LeNet

AlexNet

VGGNet

GoogLeNet

• Applications


Deep Learning

We’ve finally got there!


Traditional Recognition Approach

• Characteristics

Features are not learned, but engineered

Trainable classifier is often generic (e.g., SVM)

Many successes in 2000-2010.

Slide credit: Svetlana Lazebnik


Traditional Recognition Approach

• Features are key to recent progress in recognition

Multitude of hand-designed features currently in use

SIFT, HOG, …

Where next? Better classifiers? Or keep building more features?

DPM [Felzenszwalb et al., PAMI'07]

Dense SIFT+LBP+HOG BOW Classifier [Yan & Huang '10]
(Winner of PASCAL 2010 Challenge)

Slide credit: Svetlana Lazebnik


What About Learning the Features?

• Learn a feature hierarchy all the way from pixels to classifier

Each layer extracts features from the output of the previous layer

Train all layers jointly

Slide credit: Svetlana Lazebnik


“Shallow” vs. “Deep” Architectures

Slide credit: Svetlana Lazebnik


Background: Perceptrons

Slide credit: Svetlana Lazebnik


Inspiration: Neuron Cells

Slide credit: Svetlana Lazebnik, Rob Fergus


Background: Multi-Layer Neural Networks

• Nonlinear classifier

Training: find network weights w that minimize the error between the true training labels t_n and the estimated labels f_w(x_n), e.g. the sum-of-squares error:

    E(w) = \sum_n \big( f_w(x_n) - t_n \big)^2

Minimization can be done by gradient descent, provided f is differentiable.

– Training method: error backpropagation.

Slide credit: Svetlana Lazebnik
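For reference (standard formulation, not recovered from the slide): gradient descent repeatedly updates the weights against the error gradient,

    w^{(t+1)} = w^{(t)} - \eta \, \nabla_w E(w^{(t)})

with learning rate \eta; backpropagation is simply the chain rule applied layer by layer to compute \nabla_w E efficiently.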


Hubel/Wiesel Architecture

• D. Hubel, T. Wiesel (1959, 1962, Nobel Prize 1981)

Visual cortex consists of a hierarchy of simple, complex, and hyper-complex cells

Slide credit: Svetlana Lazebnik, Rob Fergus


Convolutional Neural Networks (CNN, ConvNet)

• Neural network with specialized connectivity structure

Stack multiple stages of feature extractors

Higher stages compute more global, more invariant features

Classification layer at the end

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.

Slide credit: Svetlana Lazebnik


Topics of This Lecture

• Deep Learning Motivation

• Convolutional Neural Networks

Convolutional Layers

Pooling Layers

Nonlinearities

• CNN Architectures

LeNet

AlexNet

VGGNet

GoogLeNet

• Applications


Convolutional Networks: Structure

• Feed-forward feature extraction

1. Convolve input with learned filters

2. Non-linearity

3. Spatial pooling

4. (Normalization)

• Supervised training of the convolutional filters by back-propagating the classification error

Slide credit: Svetlana Lazebnik


Convolutional Networks: Intuition

• Fully connected network

E.g., 1000×1000 image, 1M hidden units → 10^6 × 10^6 = 10^12 (1T) parameters!

• Ideas to improve this

Spatial correlation is local

Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato


Convolutional Networks: Intuition

• Locally connected net

E.g., 1000×1000 image, 1M hidden units with 10×10 receptive fields → 10^6 × 100 = 10^8 (100M) parameters!

• Ideas to improve this

Spatial correlation is local

Want translation invariance

Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato


Convolutional Networks: Intuition

• Convolutional net

Share the same parameters across different locations

Convolutions with learned kernels

Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato


Convolutional Networks: Intuition

• Convolutional net

Share the same parameters across different locations

Convolutions with learned kernels

• Learn multiple filters

E.g., 1000×1000 image, 100 filters of size 10×10 → 100 × (10×10) = 10k parameters

• Result: Response map of size 1000×1000×100 — only memory, not parameters!

Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato
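To make the parameter-sharing idea concrete, here is a minimal NumPy sketch (illustrative, not from the lecture) of a convolution layer: one small filter bank is slid over every spatial location, so the parameter count depends only on filter size and count, never on the image size.

    import numpy as np

    def conv_layer(image, filters, stride=1):
        # image: (H, W, C); filters: (K, f, f, C) -- these are the only parameters.
        H, W, C = image.shape
        K, f, _, _ = filters.shape
        Ho, Wo = (H - f) // stride + 1, (W - f) // stride + 1
        out = np.zeros((Ho, Wo, K))           # response map: memory, not parameters
        for k in range(K):                    # each learned filter
            for i in range(Ho):
                for j in range(Wo):
                    patch = image[i*stride:i*stride+f, j*stride:j*stride+f, :]
                    out[i, j, k] = np.sum(patch * filters[k])  # same weights everywhere
        return out

    # 100 filters of size 10x10x3: 100*10*10*3 = 30k parameters, independent of image size
    x = np.random.rand(64, 64, 3)
    w = np.random.randn(100, 10, 10, 3)
    print(conv_layer(x, w).shape)  # (55, 55, 100)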


Important Conceptual Shift

• Before: hand-crafted feature extraction followed by a trainable classifier

• Now: the entire pipeline is trained end-to-end

Slide credit: Fei-Fei Li, Andrej Karpathy


Convolution Layers

• Note: Connectivity is

Local in space (5×5 inside 32×32)

But full in depth (all 3 depth channels)

Slide adapted from Fei-Fei Li, Andrej Karpathy

Hidden neuron in next layer

Example image: 32×32×3 volume

Before: Full connectivity — 32×32×3 weights

Now: Local connectivity — one neuron connects to, e.g., a 5×5×3 region. Only 5×5×3 shared weights.


Convolution Layers

• All Neural Net activations arranged in 3 dimensions

Multiple neurons all looking at the same input region, stacked in depth

Slide adapted from Fei-Fei Li, Andrej Karpathy


Convolution Layers

• All Neural Net activations arranged in 3 dimensions

Multiple neurons all looking at the same input region, stacked in depth.

They form a single [1×1×depth] depth column in the output volume.

Slide credit: Fei-Fei Li, Andrej Karpathy


Convolution Layers

• Replicate this column of hidden neurons across space, with some stride.

• In practice, it is common to zero-pad the border. This preserves the spatial size of the input.

Slide credit: Fei-Fei Li, Andrej Karpathy

Example: 7×7 input, 3×3 connectivity, stride 1 → 5×5 output.
What about stride 2? → 3×3 output.
With a 1-pixel zero-padded border and stride 1 → 7×7 output (size preserved).
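The spatial output size follows a simple formula (standard ConvNet bookkeeping, stated here for reference): with input width W, filter size F, padding P, and stride S, the output width is (W − F + 2P)/S + 1. A quick check against the example above:

    def conv_output_size(W, F, P=0, S=1):
        # (input - filter + 2*padding) / stride + 1
        return (W - F + 2 * P) // S + 1

    print(conv_output_size(7, 3, P=0, S=1))  # 5 -> 5x5 output
    print(conv_output_size(7, 3, P=0, S=2))  # 3 -> 3x3 output
    print(conv_output_size(7, 3, P=1, S=1))  # 7 -> zero-padding preserves the size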


Activation Maps of Convolutional Filters

5×5 filters

Slide adapted from Fei-Fei Li, Andrej Karpathy

Activation maps: each activation map is a depth slice through the output volume.


Effect of Multiple Convolution Layers

Slide credit: Yann LeCun


Commonly Used Nonlinearities

• Sigmoid: σ(x) = 1 / (1 + e^(−x))

• Hyperbolic tangent: tanh(x)

• Rectified linear unit (ReLU): max(0, x) — the preferred option for deep networks
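A tiny NumPy sketch of the three nonlinearities (illustrative; the saturation comments reflect the standard motivation for ReLU, not text from the slide):

    import numpy as np

    def sigmoid(x):
        # Saturates for large |x|, so gradients vanish in deep stacks.
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Zero-centered, but still saturating.
        return np.tanh(x)

    def relu(x):
        # Non-saturating for x > 0 and cheap to compute.
        return np.maximum(0.0, x)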



Convolutional Networks: Intuition

• Let’s assume the filter is an eye detector. How can we make the detection robust to the exact location of the eye?

• Solution: By pooling (e.g., max or avg) filter responses at different spatial locations, we gain robustness to the exact spatial location of features.

Image source: Yann LeCun. Slide adapted from Marc’Aurelio Ranzato


Max Pooling

• Effect:

Make the representation smaller without losing too much information

Achieve robustness to translations

Slide adapted from Fei-Fei Li, Andrej Karpathy
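A minimal NumPy sketch of 2×2 max pooling with stride 2 (the common setting; illustrative, not from the slide):

    import numpy as np

    def max_pool(x, size=2, stride=2):
        # x: (H, W, D); pools each depth slice independently, so D is preserved.
        H, W, D = x.shape
        Ho, Wo = (H - size) // stride + 1, (W - size) // stride + 1
        out = np.zeros((Ho, Wo, D))
        for i in range(Ho):
            for j in range(Wo):
                window = x[i*stride:i*stride+size, j*stride:j*stride+size, :]
                out[i, j] = window.max(axis=(0, 1))  # max over the spatial window
        return out

    print(max_pool(np.random.rand(224, 224, 64)).shape)  # (112, 112, 64)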


Max Pooling

• Note

Pooling happens independently across each slice, preserving the number of slices.

Slide adapted from Fei-Fei Li, Andrej Karpathy


Compare: SIFT Descriptor

Slide credit: Svetlana Lazebnik


Compare: Spatial Pyramid Matching

Slide credit: Svetlana Lazebnik


Topics of This Lecture

• Deep Learning Motivation

• Convolutional Neural Networks

Convolutional Layers

Pooling Layers

Nonlinearities

• CNN Architectures

LeNet

AlexNet

VGGNet

GoogLeNet

• Applications


CNN Architectures: LeNet (1998)

• Early convolutional architecture

2 Convolutional layers, 2 pooling layers

Fully-connected NN layers for classification

Successfully used for handwritten digit recognition (MNIST)

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.

Slide credit: Svetlana Lazebnik


ImageNet Challenge 2012

• ImageNet

~14M labeled internet images

20k classes

Human labels via Amazon Mechanical Turk

• Challenge (ILSVRC)

1.2 million training images

1000 classes

Goal: Predict the ground-truth class within the top-5 responses

Currently one of the top benchmarks in Computer Vision

[Deng et al., CVPR’09]


CNN Architectures: AlexNet (2012)

• Similar framework as LeNet, but

Bigger model (7 hidden layers, 650k units, 60M parameters)

More data (10^6 images instead of 10^3)

GPU implementation

Better regularization and up-to-date tricks for training (Dropout)

Image source: A. Krizhevsky, I. Sutskever and G.E. Hinton, NIPS 2012

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.


ILSVRC 2012 Results

• AlexNet almost halved the error rate

16.4% error (top-5) vs. 26.2% for the next best approach

A revolution in Computer Vision

Acquired by Google in Jan ‘13, deployed in Google+ in May ‘13


AlexNet Results


Image source: A. Krizhevsky, I. Sutskever and G.E. Hinton, NIPS 2012


AlexNet Results

Image source: A. Krizhevsky, I. Sutskever and G.E. Hinton, NIPS 2012

(Figure: test images alongside the training images retrieved as their nearest neighbors in feature space.)


CNN Architectures: VGGNet (2014/15)

Image source: Hirokatsu Kataoka

K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015


CNN Architectures: VGGNet (2014/15)

• Main ideas

Deeper network

Stacked convolutional layers with smaller filters (+ nonlinearity)

Detailed evaluation of all components

• Results

Improved ILSVRC top-5 error rate to 6.7%.

Image source: Simonyan & Zisserman


Comparison: AlexNet vs. VGGNet

• Receptive fields in the first layer

AlexNet: 11×11, stride 4

Zeiler & Fergus: 7×7, stride 2

VGGNet: 3×3, stride 1

• Why that?

If you stack one 3×3 layer on top of another 3×3 layer, you effectively get a 5×5 receptive field.

With three 3×3 layers, the receptive field is already 7×7.

But with far fewer parameters: 3·3² = 27 instead of 7² = 49 (per input-output channel pair).

In addition, the non-linearities in between the 3×3 layers add discriminative power. A small check of the receptive-field arithmetic follows below.
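A two-line sketch of the receptive-field growth for stacked 3×3, stride-1 convolutions (illustrative):

    def receptive_field(num_layers, kernel=3):
        # Each stride-1 layer adds (kernel - 1) pixels to the receptive field.
        return 1 + num_layers * (kernel - 1)

    print(receptive_field(2))  # 5 -> two 3x3 layers see a 5x5 region
    print(receptive_field(3))  # 7 -> three 3x3 layers match a single 7x7 filter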


CNN Architectures: GoogLeNet (2014)

• Main ideas

“Inception” module as modular component

Learns filters at several scales within each module

C. Szegedy, W. Liu, Y. Jia, et al., Going Deeper with Convolutions, arXiv:1409.4842, 2014.


GoogLeNet Visualization

(Figure: the full network, assembled from copies of the Inception module.)

Auxiliary classification outputs for training the lower layers (deprecated)


Results on ILSVRC

• VGGNet and GoogLeNet perform at similar level

Comparison: human performance ~5% [Karpathy]


Image source: Simonyan & Zisserman

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/



Newest Development: Residual Networks

• Core component

Skip connections bypassing each layer

Better propagation of gradients to the deeper layers
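For reference, the residual formulation from He et al. (the standard one, not recovered from the slide): each block computes

    y = F(x, \{W_i\}) + x

so the identity shortcut carries both the signal and, during backpropagation, the gradient directly past the stacked layers F.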


ImageNet Performance



Topics of This Lecture

• Deep Learning Motivation

• Convolutional Neural Networks

Convolutional Layers

Pooling Layers

Nonlinearities

• CNN Architectures

LeNet

AlexNet

VGGNet

GoogLeNet

• Applications


The Learned Features are Generic

• Experiment: feature transfer

Train network on ImageNet

Chop off last layer and train classification layer on CalTech256

State-of-the-art accuracy already with only 6 training images

Image source: M. Zeiler, R. Fergus

(Figure: Caltech-256 accuracy vs. number of training images per class, with the pre-CNN state-of-the-art level marked.)
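A sketch of this transfer recipe in modern terms, using PyTorch/torchvision (an assumption for illustration; Zeiler & Fergus used their own framework). The ImageNet-trained features are frozen and only a new last layer is trained:

    import torch.nn as nn
    from torchvision import models

    model = models.alexnet(pretrained=True)      # features learned on ImageNet
    for p in model.parameters():
        p.requires_grad = False                  # keep the generic features fixed
    model.classifier[6] = nn.Linear(4096, 256)   # new last layer for the 256 Caltech classes

    # Now train only model.classifier[6] on the (few) Caltech-256 examples.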


Other Tasks: Detection

• Results on PASCAL VOC Detection benchmark

Pre-CNN state of the art: 35.1% mAP [Uijlings et al., 2013]

33.4% mAP DPM

R-CNN: 53.7% mAP

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014


Faster R-CNN (based on ResNets)

K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR 2016.



Other Tasks: Semantic Segmentation


[Farabet et al. ICML 2012, PAMI 2013]


Semantic Segmentation

• More recent results

Based on an extension of ResNets

[Pohlen, Hermans, Mathias, Leibe, arXiv 2016]


Other Tasks: Face Verification

Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014

Slide credit: Svetlana Lazebnik


Commercial Recognition Services

• E.g., Clarifai (image source: clarifai.com)

• Be careful when taking test images from Google Search

Chances are they may have been seen in the training set...


Commercial Recognition Services


Image source: clarifai.com


References and Further Reading

• LeNet
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.

• AlexNet
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.

• VGGNet
K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015.

• GoogLeNet
C. Szegedy, W. Liu, Y. Jia, et al., Going Deeper with Convolutions, arXiv:1409.4842, 2014.