convolutional neural networkslvelho.impa.br/ip16/proj/slides/imagenet_deepcnn.pdf · convolutional...


Convolutional Neural Networks

08, 10 & 17 Nov, 2016

J. Ezequiel Soto S.
Image Processing 2016

Prof. Luiz Velho


Summary & References

● 08/11: ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky et al., 2012.
https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

● 10/11: Going Deeper with Convolutions. Szegedy et al., 2015.
http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf

● 17/11: Painting Style Transfer for Head Portraits using Convolutional Neural Networks. Selim & Elgharib, 2016.
http://dl.acm.org/citation.cfm?id=2925968

+ CS231n: Convolutional Neural Networks for Visual Recognition. Stanford University course notes.
http://cs231n.github.io/

+ Very Deep Convolutional Networks for Large-Scale Image Recognition. Simonyan & Zisserman, 2015.
https://arxiv.org/abs/1409.1556


ImageNet Classification with Deep Convolutional Neural Networks

Krizhevsky et al., 2012


Outline
● Motivation
● Data
● Architecture
– ReLU Nonlinearity
– Parallel GPU training
– Local Response Normalization
– Overlapping Pooling
● Reducing Overfitting
– Data augmentation
– Dropout
● Learning details
● Results
● Discussion


Motivation
● Object recognition → machine learning methods

● Improved performance comes from:
– Larger datasets
– More powerful learning methods
– Better techniques against overfitting

● MNIST digit recognition: error < 0.3%, close to human performance

● Evolution of large labeled image datasets:
– NORB, CIFAR
– LabelMe: ~100k segmented & labeled images
– ImageNet: >15M labeled hi-res images in 22k categories

● Still not enough to specify such a complex problem: we need prior knowledge...

http://yann.lecun.com/exdb/mnist/

http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/

https://www.cs.toronto.edu/~kriz/cifar.html

http://labelme.csail.mit.edu/Release3.0/

http://image-net.org/


Motivation
● Models with large learning capacity → Convolutional Neural Networks.

● CNN assumptions (strong & mostly correct):
– Stationarity of statistics
– Locality of pixel dependencies

● CNN pros:
– Variable capacity (depth and breadth)
– Far fewer connections and parameters than a standard feed-forward network with similarly sized layers, but still a lot...
– Easier to train

● CNN cons:
– Prohibitively expensive to apply at large scale to high-resolution images

● Applicability:
→ GPUs with optimized 2D convolutions
→ Datasets like ImageNet, large enough to train without severe overfitting


Motivation
● It was one of the largest CNNs trained on the ImageNet dataset for the ILSVRC challenges, and its results set a new state of the art for the task.

● Highly optimized GPU implementation of 2D convolutions, with publicly available code.

● Reduction of training time and strategies to control overfitting.

● Specific architecture: 5 Conv + 3 FC layers.

● Network limits established by existing hardware.

https://github.com/akrizhevsky/cuda-convnet2


ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)

ILSVRC Challenges

2010
– Classification with 1000 categories

2011
– Classification
– Classification + localization

2012
– Classification
– Classification + localization
– Fine-grained classification (100+ dog categories)
* WINNER: Krizhevsky et al., 2012

2013
– PASCAL-style detection on fully labeled data for 200 categories
– Classification with 1000 categories
– Classification + localization with 1000 categories

2014
– PASCAL-style detection on fully labeled data for 200 categories
– Classification + localization with 1000 categories
* WINNERs: GoogLeNet (Szegedy et al., 2015), VGG (Simonyan & Zisserman, 2015)

2015
– Object detection for 200 fully labeled categories
– Object localization for 1000 categories
– Object detection from video for 30 fully labeled categories
– Scene classification for 401 categories

2016
– Object localization for 1000 categories
– Object detection for 200 fully labeled categories
– Object detection from video for 30 fully labeled categories
– Scene classification for 365 scene categories
– Scene parsing for 150 stuff and discrete object categories

Source: http://image-net.org/challenges/LSVRC/


Data
● ImageNet: >15M labeled hi-res images in ~22k categories.

● ILSVRC uses a subset of about 1.2M training images in 1000 categories.

● Used labels of the 2010 set for training.

● Error reporting: top-1 and top-5 (top-5 error: the correct label is not among the five labels the model considers most probable).

● Variable-resolution images → down-sampled to a fixed 256 × 256.

● Centered raw RGB pixel values (training-set mean subtracted).
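
As an illustration of the two error measures, here is a minimal sketch in plain Python. The scores and labels are made-up toy values over six hypothetical classes, not model output, and `top_k_error` is a name chosen for this example.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            wrong += 1
    return wrong / len(labels)


# Toy scores for 3 examples over 6 hypothetical classes (illustrative only).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.05, 0.05, 0.40, 0.10, 0.10],
    [0.02, 0.03, 0.05, 0.10, 0.20, 0.60],
]
labels = [1, 0, 0]

print("top-1 error:", top_k_error(scores, labels, k=1))
print("top-5 error:", top_k_error(scores, labels, k=5))
```

With only six classes, top-5 is very forgiving; with 1000 classes it mainly tolerates confusion between visually similar categories.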


Architecture
● 8 learned layers:
– 5 convolutional (Conv)
– 3 fully connected (FC)

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → FC1 → FC2 → FC3


ReLU Nonlinearity

[Figure: tanh(x) vs. max(0, x)]

● Non-saturating activation function: max(0, x)

● Neurons: Rectified Linear Units (ReLU)

● Faster training → Figure: a 4-layer CNN on CIFAR-10 reaches 25% training error about six times faster with ReLUs than with tanh (no regularization, learning rate chosen independently for each network).

● Is this the best? PReLU, ELU? Open debate...

https://arxiv.org/pdf/1511.07289v5.pdf
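
A small sketch of what "non-saturating" means in practice: the derivative of tanh shrinks towards zero for large inputs, while the derivative of max(0, x) stays at 1 for any positive input. The function names are chosen for this example only.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

Small gradients slow down gradient descent, which is one way to read the faster training reported for ReLUs.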


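Parallel GPU training
● GTX 580 GPU (3 GB) limits the size of the network that can be trained
● 1.2M images for training
● CNN: spread across 2 GPUs
● Communication only in certain layers: Conv3 (reads all Conv2 kernel maps) and the FC layers
– Easy with modern GPUs: they can read from and write to each other's memory directly
● This scheme reduces error by 1.7% (top-1) and 1.2% (top-5) compared with a net with half as many kernels per convolutional layer trained on one GPU

[Figure: the two columns of the network, one on GPU 1 and one on GPU 2]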


Local Response Normalization

● ReLUs don’t require input normalization to avoid saturation
● Local normalization → still helps generalization
● Each activation is divided by a term that sums the squared activations of n neighboring kernel maps at the same spatial position (x, y)
● Lateral inhibition, inspired by real neurons (biology)
● “Brightness normalization”
● Reduces error by 1.4% (top-1) and 1.2% (top-5)
● Hyper-parameters chosen using a validation set:

k = 2, n = 5, α = 10⁻⁴, β = 0.75
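
A minimal sketch of the normalization at a single spatial position, following the formula in Krizhevsky et al.: b_i = a_i / (k + α Σ_j a_j²)^β, where j runs over the n kernel maps around i. The window uses integer division `n // 2` on each side, which is an assumption about how n/2 is rounded; the function name and toy activations are illustrative.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations a[0..N-1] of N kernel maps at one position (x, y).

    Each a[i] is divided by (k + alpha * sum of squares over neighboring maps) ** beta.
    """
    num_maps = len(a)
    half = n // 2  # assumption: n/2 rounded down
    out = []
    for i in range(num_maps):
        lo = max(0, i - half)
        hi = min(num_maps - 1, i + half)
        sum_sq = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sum_sq) ** beta)
    return out


# Toy activations: the same unit, with and without a very active neighbor.
quiet = local_response_norm([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
loud = local_response_norm([1.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(quiet[0], loud[0])
```

The first unit's response is smaller when its neighbor is very active, which is the lateral inhibition effect.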


Overlapping Pooling
● Pooling summarizes the outputs of neighboring groups of neurons in the same kernel map

● Grid of pooling units spaced s pixels apart, each summarizing a z×z neighborhood

● Common pooling in CNNs: s = z

● Overlapping pooling: s < z

● This implementation has s = 2, z = 3

● Reduction of error by 0.4% (top-1) and 0.3% (top-5)

● Observed during training: models with overlapping pooling are slightly harder to overfit
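
To make s and z concrete, here is a one-dimensional sketch using max pooling (the network pools in 2D; one dimension keeps the example short). The input row is a made-up toy example.

```python
def max_pool_1d(row, z, s):
    """Max over windows of width z, with window starts spaced s apart."""
    return [max(row[i:i + z]) for i in range(0, len(row) - z + 1, s)]


row = [4, 1, 9, 2, 3, 1, 6]
print("s = z = 2 (non-overlapping):", max_pool_1d(row, z=2, s=2))
print("s = 2, z = 3 (overlapping): ", max_pool_1d(row, z=3, s=2))

# Output length follows (W - z) / s + 1; a 55-wide row gives 27 outputs for z=3, s=2.
print(len(max_pool_1d([0] * 55, z=3, s=2)))
```

With s < z, adjacent windows share an element, so a strong activation can appear in two neighboring outputs.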


[Figure: illustrations of LRN and pooling]


Architecture
Model:

Maximize the multinomial logistic regression objective, i.e. maximize the average across training cases of the log-probability of the correct label under the prediction distribution

~60 million parameters

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → FC1 → FC2 → FC3

Conv1: 96 kernels, 11×11×3 (stride 4), followed by LRN
Conv2: 256 kernels, 5×5×48, followed by LRN
Conv3: 384 kernels, 3×3×256
Conv4: 384 kernels, 3×3×192
Conv5: 256 kernels, 3×3×192
FC1, FC2: 4096 neurons each, with dropout
FC3: 1000-way softmax output
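
The "~60 million parameters" figure can be checked from the kernel sizes listed above. This is a sketch under two assumptions taken from the paper rather than from the slide: FC1 sees a 6×6×256 volume, and FC3 has 1000 outputs. Each kernel and neuron is assumed to have one bias.

```python
# (name, number of kernels, kernel height, kernel width, input channels seen by each kernel)
CONV_LAYERS = [
    ("Conv1", 96, 11, 11, 3),
    ("Conv2", 256, 5, 5, 48),
    ("Conv3", 384, 3, 3, 256),
    ("Conv4", 384, 3, 3, 192),
    ("Conv5", 256, 3, 3, 192),
]

# (name, inputs, outputs); the 6*6*256 input and the 1000 outputs are assumptions from the paper.
FC_LAYERS = [
    ("FC1", 6 * 6 * 256, 4096),
    ("FC2", 4096, 4096),
    ("FC3", 4096, 1000),
]


def count_parameters():
    """Weights plus one bias per kernel / neuron, layer by layer."""
    counts = {}
    for name, kernels, height, width, channels in CONV_LAYERS:
        counts[name] = kernels * (height * width * channels) + kernels
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out
    return counts


counts = count_parameters()
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Most of the parameters sit in the fully connected layers, FC1 in particular, even though the convolutional layers do most of the computation.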


Fitting filters and neurons
W: input volume size
F: receptive field size (filter / kernel)
S: stride
P: zero padding

Neurons = (W – F + 2P)/S + 1

In this CNN:
(224 – 11 + 0)/4 + 1 = 54.25 → not an integer!!!
(224 – 11 + 3)/4 + 1 = 55 → OK

● CS231n: claims an error in the paper or unreported zero-padding (3 extra pixels in total)
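
The formula can be evaluated exactly with fractions, which makes a non-integer result easy to spot. In this sketch `total_padding` stands for the 2P term (so that an odd total such as 3 can be expressed); that parameter name is a choice made for this example.

```python
from fractions import Fraction


def conv_output_size(w, f, s, total_padding=0):
    """Neurons along one dimension: (W - F + 2P) / S + 1, kept as an exact fraction."""
    return Fraction(w - f + total_padding, s) + 1


for w, pad in [(224, 0), (224, 3), (227, 0)]:
    size = conv_output_size(w, f=11, s=4, total_padding=pad)
    fits = "fits" if size.denominator == 1 else "does not fit"
    print(f"W={w}, 2P={pad}: {float(size)} ({fits})")
```

A 224-wide input with 3 extra pixels of padding and a 227-wide input with no padding give the same output width.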


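Reducing Overfitting
● 60 M parameters vs. 1.2 M training images; with 1000 classes, each training example imposes only about 10 bits (log2 1000) of constraint on the mapping from image to label

→ Not enough to prevent overfitting

● Data augmentation = artificially enlarge the training set with label-preserving transformations
– Image translation and horizontal reflection
Training on random 224 × 224 patches and their reflections → 2048× training set size
Testing on the four corner patches and the central patch, plus their reflections → 10 patches per test image, predictions averaged
– Changes in the intensity and color of illumination: alter color intensities with PCA of the 3×3 covariance matrix of the RGB pixel values

$I'_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T + [\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3]^T$

$\mathbf{p}_i$, $\lambda_i$: i-th eigenvector and eigenvalue of the covariance matrix. Each $\alpha_i$ is a Gaussian random variable (mean 0, standard deviation 0.1), drawn once each time the image is used for training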
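
The counts on this slide and the color shift can be sketched as follows. The eigenvectors and eigenvalues in the usage part are made-up placeholders (a real run would compute them from the RGB covariance of the training set), and all function names are chosen for this example.

```python
import random


def training_crop_factor(image_side=256, patch_side=224):
    """Enlargement factor: offsets along x, times offsets along y, times 2 for reflection."""
    shifts = image_side - patch_side  # 32 offsets per axis
    return shifts * shifts * 2


def test_patches(image_side=256, patch_side=224):
    """(row, col, flipped) for the four corner patches and the center patch, with reflections."""
    far = image_side - patch_side
    mid = far // 2
    corners = [(0, 0), (0, far), (far, 0), (far, far), (mid, mid)]
    return [(r, c, flipped) for (r, c) in corners for flipped in (False, True)]


def pca_color_shift(eigvecs, eigvals, rng, sigma=0.1):
    """RGB offset [p1, p2, p3] [a1*l1, a2*l2, a3*l3]^T with a_i ~ N(0, sigma).

    eigvecs[i] is the i-th eigenvector (3 numbers), eigvals[i] its eigenvalue.
    The same offset is added to every pixel of the image.
    """
    alphas = [rng.gauss(0.0, sigma) for _ in eigvals]
    return [
        sum(eigvecs[i][c] * alphas[i] * eigvals[i] for i in range(3))
        for c in range(3)
    ]


print(training_crop_factor())   # 2048
print(len(test_patches()))      # 10

# Placeholder eigenpairs, for illustration only.
eigvecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eigvals = [0.2, 0.02, 0.002]
print(pca_color_shift(eigvecs, eigvals, random.Random(0)))
```

Because the α values are redrawn every time an image is used, the network rarely sees exactly the same colors twice.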




Reducing Overfitting
● Dropout: zero the output of each hidden neuron during training with probability 0.5
(dropped neurons take no part in the forward pass or in back-propagation)

– Combining the predictions of many models is effective, but too expensive
– Dropout gives a similar effect at about 2× the training cost
– Reduces co-adaptation of neurons
– Neurons are forced to learn more robust features
– Test time: use all neurons and multiply their outputs by 0.5
– Dropout prevents substantial overfitting
– Roughly doubles the number of iterations needed to converge
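
A minimal sketch of the two regimes described on this slide: random zeroing during training and scaling by 0.5 at test time. Note that many current libraries instead rescale during training ("inverted dropout"); the version below follows the slide. Function names and toy activations are illustrative.

```python
import random


def dropout_train(activations, rng, p_drop=0.5):
    """Training: each output is set to zero independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron, scale outputs by the keep probability (0.5 here)."""
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)
activations = [2.0, 4.0, 6.0, 8.0]
print("train:", dropout_train(activations, rng))
print("test: ", dropout_test(activations))

# Averaged over many random masks, a training output approaches the test output.
samples = dropout_train([1.0] * 10000, random.Random(1))
print("mean of dropped unit outputs:", sum(samples) / len(samples))
```

The test-time scaling makes each neuron's output match its expected value under the random training masks.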


Learning details
● Stochastic gradient descent (L: loss function)
– Batch (Di) size: 128 images
– Momentum: 0.9
– Weight decay: 0.0005
– Learning rate: ϵ

● Initialization:
– Weights ~ zero-mean Gaussian with standard deviation 0.01
– Biases = 1 for Conv2, Conv4, Conv5 and the hidden FC layers; = 0 everywhere else
→ accelerates early learning by giving the ReLUs positive inputs

● Learning rate: starts at 0.01 and is divided by 10 when the validation error rate stops improving (3 times before termination)

● Training: 90 cycles through all 1.2 M images (5 to 6 days on 2 NVIDIA GTX 580 GPUs)
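
The update rule from Krizhevsky et al. with these hyper-parameters is v ← 0.9·v − 0.0005·ϵ·w − ϵ·g and w ← w + v, where g is the gradient of L averaged over the batch Di. A minimal sketch on plain lists of numbers; the function name and the toy values are chosen for this example.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad, then w <- w + v."""
    v_new = [
        momentum * vi - weight_decay * lr * wi - lr * gi
        for wi, vi, gi in zip(w, v, grad)
    ]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy example: one weight, zero initial velocity, batch-averaged gradient of 2.0.
w, v = sgd_momentum_step(w=[1.0], v=[0.0], grad=[2.0], lr=0.01)
print(w, v)
```

The weight decay term is scaled by the learning rate, so it shrinks together with the gradient step each time ϵ is divided by 10.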


Results

● ILSVRC 2010

● ILSVRC 2012 (*: models with an extra sixth convolutional layer, pre-trained on the ImageNet 2011 Fall release: 15M images in 22k categories)


Results
● Kernels learned by the data-connected layer, Conv1:
– Frequency- and orientation-selective filters
– GPU specialization (independent of initialization)


Results

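Examples of classified images: correct even when the object is off-center

Images are grouped as similar when the Euclidean distance between their 4096-dimensional feature vectors from the last hidden layer is small (not the same as L2 distance on pixels) → Use auto-encoders to compress these vectors?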
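
A small sketch of retrieval by Euclidean distance in feature space. The vectors here are 4-dimensional toy values standing in for the 4096-dimensional activations, and the image names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, gallery, top=2):
    """Names of the `top` gallery items whose features are closest to the query."""
    return sorted(gallery, key=lambda name: euclidean(query, gallery[name]))[:top]


# Toy feature vectors (illustrative only).
gallery = {
    "img_a": [1.0, 0.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0, 0.0],
    "img_c": [0.0, 0.0, 1.0, 1.0],
}
query = [1.0, 0.0, 0.0, 0.1]
print(nearest(query, gallery))
```

Two images can be far apart pixel by pixel and still be close in this space if the network assigns them similar high-level features.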


Discussion
● Depth is really important!!!

→ As we will see with GoogLeNet: 22 layers deep
→ Hyper-parameters: depth, breadth, filter size!

● How to increase the size of the network without needing much more data? Faster?

● Apply CNNs on video, use temporal structure to improve results!


To be continued...
