Devil in the Details: Analysing the Performance of ConvNet Features
TRANSCRIPT
Devil in the Details: Analysing the Performance of ConvNet Features
Ken Chatfield, University of Oxford
May 2015
The Devil is still in the Details: 2011 → 2014
Comparing Apples to Apples

• This work is about comparing the latest ConvNet-based feature representations on common ground
• We compare both different pre-trained network architectures and different learning heuristics

[Diagram: input dataset → feature extraction (CNN Arch 1, CNN Arch 2, …, IFV) → fixed learning → fixed evaluation protocol]
Performance Evolution over VOC2007

| Year | Method | Dim. | Aug. | mAP |
|------|--------|------|------|-----|
| 2008 | BOW | 32K | – | 54.48 |
| 2010 | IFV-BL | 327K | – | 61.69 |
| | IFV | 84K | – | 64.36 |
| | IFV | 84K | f s | 68.02 |
| 2013 | DeCAF | 4K | t t | 73.41 |
| 2014 | CNN-F | 4K | f s | 77.15 |
| | CNN-M 2K | 2K | f s | 80.13 |
| | CNN-S (TN) | 4K | f s | 82.42 |
| 2015 | VGG-D+E | 4K | S s | 89.70 |

(DeCAF onwards are CNN-based methods.)
Evaluation Setup

[Diagram: pre-trained net on 1,000 ImageNet classes → CNN feature extractor (4096-D feature vector out) → SVM classifier: train on training set, apply to test set → evaluate classifier output using mAP, accuracy, etc.]
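The evaluation step above scores each test image with a linear classifier on its CNN feature and reports mAP. A minimal numpy sketch of the average-precision computation (illustrative only; the toy labels and scores are stand-ins, not from the talk):

```python
import numpy as np

def average_precision(labels, scores):
    """AP: mean of the precision values at each positive, ranked by score."""
    order = np.argsort(-scores)              # sort test images by score, descending
    labels = labels[order]
    hits = np.cumsum(labels)                 # positives retrieved so far
    ranks = np.arange(1, len(labels) + 1)
    return (hits[labels == 1] / ranks[labels == 1]).mean()

# toy example: 4 test images scored by a hypothetical linear classifier w.phi(I)
labels = np.array([1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.1])
ap = average_precision(labels, scores)       # positives at ranks 1 and 3
```

mAP is this value averaged over all classes (e.g. the 20 VOC categories).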
Outline

1. Different pre-trained networks
2. Data augmentation (for both CNN and IFV)
3. Dataset fine-tuning
Network Architectures

• CNN-F Network
• CNN-M Network
• CNN-S Network
• VGG Very Deep Network
Network Architectures: CNN-F Network
Similar to Krizhevsky et al. (ILSVRC-2012 winner)

input image → conv1 64x11x11 stride 4 → conv2 256x5x5 stride 1 → conv3 256x3x3 stride 1 → conv4 512x3x3 → conv5 512x3x3 → fc6 d.o. 4096-D → fc7 d.o. 4096-D
Network Architectures: CNN-M Network
Similar to Zeiler & Fergus (ILSVRC-2013 winner)

input image → conv1 96x7x7 stride 2 → conv2 256x5x5 stride 2 → conv3 512x3x3 stride 1 → conv4 512x3x3 → conv5 512x3x3 → fc6 d.o. 4096-D → fc7 d.o. 4096-D

Smaller receptive window size + stride in conv1
Network Architectures: CNN-S Network
Similar to OverFeat ‘accurate’ network (ICLR 2014)

input image → conv1 96x7x7 stride 2 → conv2 256x5x5 stride 1 → conv3 512x3x3 stride 1 → conv4 512x3x3 → conv5 512x3x3 → fc6 d.o. 4096-D → fc7 d.o. 4096-D

Smaller stride in conv2
Network Architectures: VGG Very Deep Network
Simonyan & Zisserman (ICLR 2015)

input image → conv1a 64x3x3 stride 1 → conv1b 64x3x3 stride 1 → conv1c 64x3x3 stride 1 → conv2a 128x3x3 stride 1 → conv2b 128x3x3 stride 1 → conv2c 128x3x3 stride 1 → … → fc6 d.o. 4096-D → fc7 d.o. 4096-D

Smaller receptive window size + stride, and deeper. A stack of three 3x3 conv layers has 3(3²C²) = 27C² weights, versus 7²C² = 49C² for a single 7x7 layer.
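The 27C² vs 49C² weight comparison can be checked numerically (a quick arithmetic sketch; C = 64 is an arbitrary example channel count, biases ignored):

```python
# Weight count of a conv layer with a k x k kernel mapping c_in -> c_out channels.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

C = 64
stack_3x3 = 3 * conv_params(3, C, C)   # three stacked 3x3 layers: 3 * 9 * C^2 = 27 C^2
single_7x7 = conv_params(7, C, C)      # one 7x7 layer: 49 C^2
```

Both cover the same 7x7 receptive field, but the stack uses roughly half the weights and has three non-linearities instead of one.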
Pre-trained Networks

| Network | mAP (VOC07) |
|---------|-------------|
| DeCAF | 73.41 |
| CNN-F | 77.38 |
| CNN-M | 79.89 |
| CNN-S | 79.74 |
| VGG-VD | 89.3 |
Outline

1. Different pre-trained networks
2. Data augmentation (for both CNN and IFV)
3. Dataset fine-tuning
Data Augmentation

Given a pre-trained ConvNet, augmentation is applied at test time:
a. Extract crops
b. Pool features (average, max)

[Diagram: image crops → CNN feature extractor (pre-trained network) → pooled feature]
Data Augmentation

a. No augmentation (= 1 image, 224x224)
b. Flip augmentation (= 2 images)
c. Crop+Flip augmentation (= 10 images: 224x224 crops + flips)
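The crop+flip scheme above can be sketched in numpy (an illustrative implementation of the standard 10-crop recipe, assuming four corner crops plus a centre crop; the talk does not specify the crop positions):

```python
import numpy as np

def ten_crops(img, size=224):
    """Crop+Flip augmentation: 4 corner crops + centre crop,
    plus horizontal flips of each = 10 images."""
    h, w = img.shape[:2]
    s = size
    corners = [(0, 0), (0, w - s), (h - s, 0), (h - s, w - s),
               ((h - s) // 2, (w - s) // 2)]          # top-left of each crop
    crops = [img[y:y + s, x:x + s] for y, x in corners]
    crops += [c[:, ::-1] for c in crops]              # horizontal flips
    return crops

img = np.zeros((256, 256, 3))                          # stand-in for a resized image
crops = ten_crops(img)
```

Each crop is passed through the CNN separately and the resulting feature vectors are pooled (average or max) into a single descriptor.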
Data Augmentation

mAP (VOC07):

| Augmentation | IFV | CNN-M |
|--------------|-----|-------|
| None | 64.36 | 76.97 |
| Flip | 64.35 | 76.99 |
| Crop+Flip (train pooling: none, test pooling: sum) | 66.68 | 79.44 |
| Crop+Flip (train pooling: sum, test pooling: sum) | 67.17 | 79.89 |
Scale Augmentation

Training scale range (shorter image side): [S_min, S_max] = [256, 512], with 224x224 crops + flips.
Test scales: Q = {S_min, 0.5(S_min + S_max), S_max}
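The test-scale set Q follows directly from the training range, e.g. (a trivial check of the formula above):

```python
# Test-time scales Q derived from the training scale range [S_min, S_max].
# The shorter image side is resized to each scale in Q, then 224x224
# crops (+ flips) are taken at every scale.
S_min, S_max = 256, 512
Q = [S_min, int(0.5 * (S_min + S_max)), S_max]
```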
Fully Convolutional Net
Sermanet et al. 2014 (OverFeat)

• Convert final fc layers to convolutional layers
• Output is then an activation map which can be pooled

8.8% ⇒ 7.5% top-5 val. error on ILSVRC-2014
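The fc-to-conv conversion rests on a simple equivalence: an fc layer over a KxKxC window is a conv layer with a KxKxC kernel, so sliding it over a larger input yields an activation map. A naive numpy sketch (illustrative; loop-based rather than an efficient convolution, and the sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, D = 3, 2, 4                         # fc input window, channels, fc output dim
fc_w = rng.standard_normal((D, K * K * C))  # fc weights, reused as a conv kernel

def fc_as_conv(feat, w):
    """Apply the fc layer at every KxK position of feat (H, W, C)."""
    H, W, _ = feat.shape
    out = np.empty((H - K + 1, W - K + 1, D))
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            window = feat[y:y + K, x:x + K].reshape(-1)
            out[y, x] = w @ window        # identical to the fc layer on this window
    return out

feat = rng.standard_normal((5, 5, C))      # input larger than the fc layer expects
amap = fc_as_conv(feat, fc_w)              # 3x3xD activation map
pooled = amap.mean(axis=(0, 1))            # pool the map back to one D-vector
```

On an input of exactly KxK the map is 1x1 and reduces to the original fc layer; on larger inputs the pooled map aggregates predictions over positions.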
Outline

1. Different pre-trained networks
2. Data augmentation (for both CNN and IFV)
3. Dataset fine-tuning
Fine Tuning

Pre-trained network: conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 d.o. 4096-D → fc7 d.o. 4096-D → ILSVRC softmax
Fine Tuning

Replace the ILSVRC softmax with a VOC07 SVM loss and continue training on VOC 2007 train images: conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 d.o. 4096-D → fc7 d.o. 4096-D → VOC07 SVM loss
Fine Tuning

mAP (VOC07):

| Setting | mAP |
|---------|-----|
| No TN | 79.7 |
| TN-CLS | 82.2 |
| TN-RNK | 82.4 |

• TN-CLS – classification loss: max{ 0, 1 − y wᵀφ(I) }
• TN-RNK – ranking loss: max{ 0, 1 − wᵀ( φ(I_POS) − φ(I_NEG) ) }
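Both fine-tuning objectives are hinge losses and are easy to state in numpy (a toy sketch of the two formulas; w and the φ feature vectors are made-up 2-D stand-ins for the 4096-D fc7 features):

```python
import numpy as np

def cls_loss(w, phi, y):
    """TN-CLS: classification hinge loss, label y in {-1, +1}."""
    return max(0.0, 1.0 - y * (w @ phi))

def rnk_loss(w, phi_pos, phi_neg):
    """TN-RNK: ranking hinge loss; a positive image should outscore
    a negative one by a margin of 1."""
    return max(0.0, 1.0 - w @ (phi_pos - phi_neg))

w = np.array([1.0, -1.0])
phi_pos = np.array([2.0, 0.0])   # scores w @ phi_pos =  2.0
phi_neg = np.array([0.0, 1.0])   # scores w @ phi_neg = -1.0
loss_c = cls_loss(w, phi_pos, +1)        # margin satisfied -> 0
loss_r = rnk_loss(w, phi_pos, phi_neg)   # ranking margin satisfied -> 0
```

The ranking loss only constrains the relative order of positive and negative images, which matches the retrieval-style mAP metric more closely than per-image classification.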
Comparison with State of the Art

| Method | ILSVRC-2012 (top-5 err.) | VOC2007 (mAP) | VOC2012 (mAP) |
|--------|--------------------------|----------------|----------------|
| CNN-M 2048 | 13.5 | 80.1 | 82.4 |
| CNN-S | 13.1 | 79.7 | 82.9 |
| CNN-S TUNE-RNK | 13.1 | 82.4 | 83.2 |
| Zeiler & Fergus | 16.1 | – | 79.0 |
| Oquab et al. | 18.0 | 77.7 | 78.7 (82.8*) |
| Wei et al. | – | 81.5 (85.2*) | 81.7 (90.3*) |
| Clarifai (1 net) | 12.5 | – | – |
| GoogLeNet (1 net) | 7.9 | – | – |
| VGG Very Deep (1 net) | 7.0 | 89.3 | 89.0 |
Take-home Messages

If you get the details right, a relatively simple ConvNet-based pipeline can outperform much more complex architectures.

• Data augmentation helps a lot, both for deep and shallow features
• Fine tuning makes a difference, and should use a ranking loss where appropriate
• Smaller filters and deeper networks help, although feature computation is slower
There’s more…

• Presented here was just a subset of the full results from the paper
• Check out the paper for full results on: VOC 2007, VOC 2012, Caltech-101, Caltech-256, ILSVRC-2012
Source Code

• Caffe-compatible CNN models can be downloaded from the Caffe Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
• Matlab feature computation code is also available from the project website: http://www.robots.ox.ac.uk/~vgg/software/deep_eval
Related Publications
“Return of the Devil in the Details: Delving Deep into Convolutional Nets” BMVC 2014 Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (Best Paper Prize)
“The devil is in the details: an evaluation of recent feature encoding methods” BMVC 2011 Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Victor Lempitsky, Andrew Zisserman (Best Poster Prize Honourable Mention, 300+ citations)
http://www.robots.ox.ac.uk/~ken