Deep Learning and Vision
Jon Shlens
Google Research 28 April 2017
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
The hubris of artificial intelligence
• For decades we tried to write down every possible rule for everyday tasks → impossible.
• Everyday tasks we consider blindingly obvious have been exceedingly difficult for computers. 'Simple' problems proved most difficult.
cat?
http://dspace.mit.edu/handle/1721.1/6125
Machine learning applied everywhere.
• The last decade has shown that if we teach computers to perform a task, rather than writing down the rules by hand, they perform far better.
machine translation • speech recognition • face recognition • time series analysis • molecular activity prediction • image recognition • road hazard detection • object detection • optical character recognition • motor planning • syntax parsing • language understanding • …
The computer vision competition:
Large-scale academic competition focused on predicting 1000 object classes (~1.2M images).
• electric ray
• barracuda
• coho salmon
• tench
• goldfish
• sawfish
• smalltooth sawfish
• guitarfish
• stingray
• roughtail stingray
• ...
ImageNet: A Large-Scale Hierarchical Image Database, J. Deng et al. (2009)
History of techniques in ImageNet Challenge

ImageNet 2010:
• Locality-constrained linear coding + SVM (NEC & UIUC)
• Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT features + LI2C (Nanyang Technological Institute)
• SIFT features + k-nearest neighbors (Laboratoire d'Informatique de Grenoble)
• Color features + canonical correlation analysis (National Institute of Informatics, Tokyo)

ImageNet 2011:
• Compressed Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT bag-of-words + VQ + SVM (University of Amsterdam & University of Trento)
• SIFT + ? (ISI Lab, Tokyo University)

ImageNet 2012:
• Deep convolutional neural network (University of Toronto)
• Discriminatively trained DPMs (University of Oxford)
• Fisher-based SIFT features + SVM (ISI Lab, Tokyo University)
Good fine-grained classification: hibiscus vs. dahlia.
Good generalization: visually different dishes are both recognized as "meal".
Sensible errors: a snake mistaken for a dog.

Examples of artificial vision in action
• fine-grained classification
• generalization
• sensible errors

** Trained a model for whole-image recognition using the Inception-v3 architecture.
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
History of techniques in ImageNet Challenge

ImageNet 2010:
• Locality-constrained linear coding + SVM (NEC & UIUC)
• Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT features + LI2C (Nanyang Technological Institute)
• SIFT features + k-nearest neighbors (Laboratoire d'Informatique de Grenoble)
• Color features + canonical correlation analysis (National Institute of Informatics, Tokyo)

ImageNet 2011:
• Compressed Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT bag-of-words + VQ + SVM (University of Amsterdam & University of Trento)
• SIFT + ? (ISI Lab, Tokyo University)

ImageNet 2012:
• Deep convolutional neural network (University of Toronto)
• Discriminatively trained DPMs (University of Oxford)
• Fisher-based SIFT features + SVM (ISI Lab, Tokyo University)
Deep convolutional neural networks
• Multi-layer perceptrons trained with back-propagation are ideas known since the 1980s.
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
Backpropagation Applied to Handwritten Zip Code Recognition, Y. LeCun et al. (1990)
Convolutional neural networks, revisited.
• The winning network contained 60M parameters.
• Achieving scale in compute and data is critical:
• large academic data sets
• SIMD hardware (e.g. GPUs, SSE instruction sets)
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
What is deep learning?
"Deep learning" = artificial neural networks
• Hierarchical composition of simple mathematical functions
• Loosely inspired by (what little) we know about the brain
"cat"
Untangling Invariant Object Recognition, J. DiCarlo and D. Cox (2007)
A toy model of a neuron: the "perceptron"
• no spikes
• no recurrence or feedback *
• no dynamics or state *
• no biophysics

Simplify the neuron to a sum over weighted inputs and a nonlinear activation function:

$y = f\Big(\sum_i w_i x_i + b\Big)$, with the rectified linear activation $f(z) = \max(0, z)$.

The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, F. Rosenblatt (1958)
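As a concrete illustration, here is a minimal NumPy sketch of this perceptron model (the variable names are illustrative, not from the talk):

```python
import numpy as np

def relu(z):
    # Rectified linear activation: f(z) = max(0, z).
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    # y = f(sum_i w_i x_i + b): a weighted sum of inputs plus a bias,
    # passed through the nonlinear activation function.
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, 0.3])    # weights
print(perceptron(x, w, b=0.2))   # relu(0.05 - 0.4 + 0.6 + 0.2) = 0.45
```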
Employing a network for a task.
• A network is a hierarchical composition of nonlinear functions: $y = f(f(\ldots))$.
• The output of the network is a real-valued vector $y$ with one entry per class label (cat, dog, car, truck, cow, bicycle); the predicted label is the largest entry, e.g. "dog".
Example: how to classify with a network
Step 1: Convert the network output to a probability distribution with the softmax function:

$P(j) = \frac{\exp(y_j)}{\sum_{j'} \exp(y_{j'})}$

where $j$ indexes the output nodes (cat, dog, car, truck, cow, bicycle).

[Figure: bar chart of P(j) over the six class labels, on a 0 to 1 scale.]
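A minimal sketch of the softmax in NumPy; subtracting the maximum output first is a standard numerical-stability trick that leaves P(j) unchanged:

```python
import numpy as np

def softmax(y):
    # P(j) = exp(y_j) / sum_j' exp(y_j'); shifting by max(y) avoids overflow.
    e = np.exp(y - np.max(y))
    return e / e.sum()

# Hypothetical network outputs for [cat, dog, car, truck, cow, bicycle].
y = np.array([1.0, 3.0, 0.2, -1.0, 0.0, 0.5])
p = softmax(y)
print(p, p.sum())  # a valid probability distribution; the entries sum to 1
```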
Example: how to classify with a network
Step 2: Minimize the cross-entropy loss between the predicted distribution and a one-hot target distribution.
• The cross-entropy loss is the KL divergence between the target distribution $p(x)$ and the predicted distribution $q(x)$:

$\text{loss} = \sum_x p(x) \log \frac{p(x)}{q(x)}$

[Figure: predicted distribution vs. one-hot target distribution over the six class labels.]
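A sketch of this loss in NumPy. For a one-hot target p, the KL divergence reduces to the cross-entropy -log q(correct class), since the target's own entropy is zero:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x); equals KL(p || q) for a one-hot p.
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])    # one-hot target: "dog"
q = np.array([0.1, 0.6, 0.05, 0.05, 0.1, 0.1])  # predicted distribution
print(cross_entropy(p, q))                      # -log(0.6) ≈ 0.51
```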
Gradient descent with back-propagation.
• Calculate the partial derivative of the loss with respect to each parameter, $\frac{\partial\,\text{loss}}{\partial w_i}$, to minimize the objective function via gradient descent.
• For weights buried inside the network, employ a clever factorization of the chain rule, i.e. back-propagation.
Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, P. Werbos (1974)
Learning Internal Representations by Error Propagation, D. Rumelhart, G. Hinton, R. Williams, J. McClelland et al. (1986)
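For the softmax-plus-cross-entropy loss above, the gradient at the output layer has the well-known closed form softmax(y) - target, which the chain rule then propagates to the weights. A minimal sketch for a single linear layer (illustrative, not the talk's code):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def sgd_step(W, b, x, target, lr=0.1):
    # Forward pass, then one gradient-descent update.
    y = W @ x + b
    grad_y = softmax(y) - target     # d loss / d y for cross-entropy
    W -= lr * np.outer(grad_y, x)    # d loss / d W = grad_y x^T (chain rule)
    b -= lr * grad_y                 # d loss / d b = grad_y
    return W, b

rng = np.random.default_rng(0)
W, b = 0.01 * rng.normal(size=(6, 4)), np.zeros(6)
x, t = rng.normal(size=4), np.eye(6)[1]     # one-hot target: "dog"
for _ in range(100):
    W, b = sgd_step(W, b, x, t)
print(softmax(W @ x + b))  # probability mass concentrates on class 1
```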
Optimization is highly non-convex.
• Note that deep networks operate in O(1M) dimensions.
[Figure: loss surface over two weights (weight 1, weight 2).]
playground.tensorflow.org
The E. coli of image recognition: MNIST
• Handwritten digit images are fed to a machine learning system (e.g. a neural network), which outputs a label such as "4".
Gradient-Based Learning Applied to Document Recognition, Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998)
http://yann.lecun.com/exdb/mnist/
Multi-layer perceptron on MNIST.
• Note that the number of weights grows as the square of the number of pixels.
• Architecture: handwritten zip codes (P = 28) → fully connected layer (N = 100 hidden units) → logistic classifier (M = 10 classes) → "4"
• # weights (input → fully connected) = N × P² = 78,400
• # weights (fully connected → classifier) = N × M = 1,000
• Consider that the iPhone camera uses P = 2000 (4 million pixels); the number of weights would then be N × P² = 400 million.
Natural image statistics obey invariances.
• translation, cropping, dilation, contrast, rotation, scale, brightness, …
• Translation invariance → convolutions
• Models of natural image statistics begin with a convolutional filter bank.
Statistics of Natural Images: Scaling in the Woods, D. Ruderman and W. Bialek (1994)
Natural Image Statistics and Neural Representation, E. Simoncelli and B. Olshausen (2001)
interlude for convolutions
• filter (3 × 3), identity:
    0 0 0
    0 1 0
    0 0 0
• filter (5 × 5): blur
• filter (5 × 5): sharpen
• filter (3 × 3): vertical edge detector
• filter (3 × 3): all-edge detector
Each filter is convolved with the original image; example kernels and images: https://docs.gimp.org/en/plug-in-convmatrix.html
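The kernels above are all instances of the same operation. Here is a minimal NumPy implementation of a 2-D convolution, with a generic vertical-edge kernel as an example (the GIMP page linked above lists the exact kernels used in the slides):

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2-D convolution: flip the kernel, slide it over the image,
    # and take a weighted sum of the overlapping pixels at every position.
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.random.rand(28, 28)
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])   # responds to vertical edges
print(conv2d(image, vertical_edge).shape)      # (26, 26)
```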
Multi-layer perceptron on MNIST.
• Note that the number of weights grows as the square of the number of pixels!
• handwritten zip codes (P = 28) → fully connected (N = 100) → logistic classifier (M = 10) → "4"
• # weights (input → fully connected) = N × P² = 78,400
• # weights (fully connected → classifier) = N × M = 1,000
Convolutional neural network on MNIST.
• Note that the number of model parameters is largely independent of image size.
• handwritten zip codes (P = 28) → convolutional layer (N = 100 filters, each F = 5 × 5) → logistic classifier (M = 10) → "4"
• # weights (convolutional) = N × F² = 2,500
• # weights (classifier) = N × M × K = 1000 K, where K is the number of spatial positions retained after convolution
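The weight counts above reduce to a few lines of arithmetic; a quick sketch of why weight sharing makes the convolutional layer's count independent of image size:

```python
# Parameter counts for the two MNIST architectures above.
P, N, M, F = 28, 100, 10, 5

fc_weights = N * P**2    # fully connected: every hidden unit sees every pixel
conv_weights = N * F**2  # convolutional: each filter reuses F x F weights everywhere
print(fc_weights, conv_weights)    # 78400, 2500

# At iPhone resolution (P = 2000), the fully connected layer explodes
# while the convolutional layer is unchanged.
print(N * 2000**2, conv_weights)   # 400000000, 2500
```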
Generalizing convolutions in depth.
• A filter bank maps input activations to output activations.
• Example: a grayscale image has input depth 1; an RGB image has input depth 3.
Generalizing convolutions in depth.
• Input depth and output depth are arbitrary parameters and need not be equal.
• Convolutional neural networks operate with depths up to 1024.
• Example: an edge-detector filter bank produces one output channel per filter; a convolutional network stacks many such banks in depth.
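A sketch of a depth-generalized convolution in NumPy, showing how input depth and output depth enter as independent shape parameters (illustrative; real libraries use the same shapes but faster kernels, and skip the kernel flip):

```python
import numpy as np

def conv2d_depth(x, filters):
    # x: (H, W, C_in); filters: (F, F, C_in, C_out) -> (H-F+1, W-F+1, C_out).
    H, W, C_in = x.shape
    F, _, _, C_out = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, C_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + F, j:j + F, :]          # (F, F, C_in)
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out

rgb = np.random.rand(32, 32, 3)      # input depth 3 (an RGB image)
bank = np.random.rand(5, 5, 3, 64)   # output depth 64: 64 filters
print(conv2d_depth(rgb, bank).shape) # (28, 28, 64)
```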
The first convolutional neural network.
• Architecture: handwritten zip codes → convolutional (N = 12) → convolutional (N = 12) → fully connected (N = 30) → logistic classifier (M = 10) → "4"
Backpropagation Applied to Handwritten Zip Code Recognition, Y. LeCun et al. (1989)
Convolutional neural networks, revisited
• Similar to the original CNN architecture, but deeper and larger (70K → 60M parameters).
• More nonlinearities and regularization.
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
Backpropagation Applied to Handwritten Zip Code Recognition, Y. LeCun et al. (1990)
Steady progress in network architectures.

year | model                          | place | top-5 error
2012 | Supervision                    | 1st   | 16.4%
2013 | Clarifai                       | 1st   | 11.5%
2014 | VGG                            | 2nd   | 7.3%
2014 | GoogLeNet / Inception          | 1st   | 6.6%
2014 | Andrej Karpathy (human)        | n/a   | 5.1%
2015 | Batch Normalization Inception  | n/a   | 4.8%
2015 | Inception v3                   | 2nd   | 3.6%
2015 | ResNet                         | 1st   | 3.6%
2016 | Inception-ResNet               | n/a   | 3.1%
Advances in network architectures
Animation by Dan Mané

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, C. Szegedy, S. Ioffe, V. Vanhoucke (2016)
Deep Residual Learning for Image Recognition, K. He, X. Zhang, S. Ren, J. Sun (2015)
Rethinking the Inception Architecture for Computer Vision, C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna (2015)
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe and C. Szegedy (2015)
What I Learned from Competing Against a ConvNet on ImageNet, A. Karpathy (2014)
Very Deep Convolutional Networks for Large-Scale Image Recognition, K. Simonyan and A. Zisserman (2015)
Going Deeper with Convolutions, C. Szegedy et al. (2014)
Visualizing and Understanding Convolutional Networks, M. Zeiler and R. Fergus (2013)
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
Scalable Multiclass Object Categorization with Fisher Based Features, N. Gunji et al. (2012)
Compressed Fisher Vectors for Large Scale Visual Recognition, F. Perronnin, J. Sanchez (2011)
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
Covariate shifts are problematic in machine learning
• Traditional machine learning must contend with covariate shift between data sets.
• Covariate shifts must be mitigated through domain adaptation.
blog.bigml.com
Covariate shifts occur between network layers.
• Covariate shifts occur across layers in a deep network: the distribution of activations entering layer i drifts between time = 1 and time = N of training.
• Performing domain adaptation or whitening is impractical in an online setting.
[Figure: distribution of a logistic unit's activation during MNIST training (15th, 50th, and 85th percentiles) over time.]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe and C. Szegedy (2015)
Previous methods for addressing covariate shifts
• Adagrad
• whitening input data
• building invariances through normalization
• regularizing the network (e.g. dropout, maxout)
I. Goodfellow et al. (2013); N. Srivastava et al. (2014)
Mitigate covariate shift via batch normalization.
1. Normalize the activations $\{x_i\}$ within a mini-batch:

$\mu = \frac{1}{n}\sum_i x_i, \quad \sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2, \quad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$

2. Learn the mean and variance of each layer as parameters $(\gamma, \beta)$:

$y_i = \gamma \hat{x}_i + \beta$
• The canonical module of a perceptron is updated from $y = f\big(\sum_i w_i x_i + b\big)$ to $y = f\big(\mathrm{BatchNorm}\big(\sum_i w_i x_i\big)\big)$.

Batch normalization stabilizes training.
• Activations are more stable over training.
[Figure: hidden-layer activations on MNIST over training (15th, 50th, and 85th percentiles).]
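A minimal NumPy sketch of the batch-normalization forward pass at training time (inference uses running averages of the mean and variance instead, which this sketch omits):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each unit over the mini-batch (axis 0), then apply the
    # learned scale gamma and shift beta:
    #   xhat_i = (x_i - mu) / sqrt(sigma^2 + eps);  y_i = gamma * xhat_i + beta
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta

batch = np.random.randn(64, 100) * 5 + 3       # 64 examples, 100 hidden units
y = batch_norm(batch, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # per-unit mean ≈ 0, std ≈ 1
```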
Batch normalization speeds up training enormously.
• CNNs train faster with fewer data samples (15×).
• Employ faster learning rates and less network regularization.
[Figure: precision @ 1 vs. number of mini-batches.]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe and C. Szegedy (2015)
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
Switching to other types of gradients
• For training a network, one focuses on how the loss changes with respect to the parameters: $\frac{\partial\,\text{loss}}{\partial w_i}$.
• The rest of this talk instead focuses on how an activation or the loss depends on the image: $\frac{\partial}{\partial\,\text{image}}$.
• An important distinction: the former provides an update that "lives" in weight space; the latter provides an update that "lives" in image space.
Gradient propagation to find responsible pixels
• Which pixels within an image elicit large activation values?
• Examine activations at middle layers (e.g. layer 3 and layer 5) of a trained network.
Visualizing and Understanding Convolutional Networks, M. Zeiler and R. Fergus (2013)
Gradient propagation for distorting images.
• Start with an image (e.g. a "dog" from http://mscoco.org) and a trained Inception-v3 network.
• What happens if we distort the original image to amplify the label using the gradient signal?
Inceptionism: Going Deeper into Neural Networks, A. Mordvintsev, C. Olah and M. Tyka (2015)
Gradient propagation for distorting images.
• What happens if we distort the original image to amplify the label ("dog") using the gradient signal… but we used the wrong image?
Gradient propagation for distorting images.
• Apply the gradient distortion, feed the distorted image back into the network, and iterate.
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
A. Mordvintsev, C. Olah and M. Tyka
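A hedged sketch of this iterate-and-distort loop in modern TensorFlow (the talk predates this API; `model` is assumed to be a trained Keras classifier and `image` a float tensor of shape (1, H, W, 3) in [0, 1]):

```python
import tensorflow as tf

def amplify_label(model, image, label_index, steps=50, step_size=0.01):
    # Repeatedly nudge the image in the direction that increases the
    # chosen label's score, then feed the distorted image back in.
    image = tf.Variable(image)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            score = model(image)[0, label_index]   # e.g. the "dog" output
        grad = tape.gradient(score, image)         # d score / d image
        grad /= tf.norm(grad) + 1e-8               # normalize the step
        image.assign_add(step_size * grad)         # gradient ascent
        image.assign(tf.clip_by_value(image, 0.0, 1.0))
    return image
```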
A Neural Algorithm of Artistic Style
A Neural Algorithm of Artistic Style, L. Gatys, A. Ecker, M. Bethge (2015)
https://github.com/kaishengtai/neuralart
Gradient propagation for breaking things.
• For a trained network (e.g. Inception-v3) and an image labeled "dog", compute $\frac{\partial\,\text{loss}}{\partial\,\text{image}}$.
• This gradient identifies which pixels are sensitive to the label, and hence how to change pixels to decrease the probability of the label.
• Constrained optimization finds an adversarial adjustment to the image (L1 norm).
• The attack is robust across trained networks, network architectures, and other machine learning systems.
Intriguing Properties of Neural Networks, C. Szegedy et al. (2014)
Explaining and Harnessing Adversarial Examples, I. Goodfellow, J. Shlens and C. Szegedy (2015)
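A hedged sketch of the fast gradient sign method from Goodfellow et al. (2015), one concrete instance of such an attack (`model` is assumed to be a trained Keras classifier; `true_label` an integer class index such as tf.constant([208])):

```python
import tensorflow as tf

def fgsm(model, image, true_label, epsilon=0.007):
    # Perturb every pixel by epsilon in the direction that increases the
    # loss, i.e. decreases the probability of the true label.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        loss = loss_fn(true_label, model(image))
    grad = tape.gradient(loss, image)              # d loss / d image
    adversarial = image + epsilon * tf.sign(grad)  # imperceptible change
    return tf.clip_by_value(adversarial, 0.0, 1.0)
```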
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
Quick Start Guide
1. Purchase a desktop with a fast GPU.
2. Download an open-source library for deep learning.
3. Download a pre-trained model for a similar vision task.
4. Retrain (fine-tune) the network for your particular data set (see the sketch below).
Online resources:
http://www.tensorflow.org
http://cs231n.github.io/convolutional-networks/
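Steps 3 and 4 in a hedged tf.keras sketch (this API postdates the talk; `train_ds` is a placeholder for your own tf.data.Dataset of (image, label) batches):

```python
import tensorflow as tf

NUM_CLASSES = 10  # placeholder: the number of classes in your data set

# Step 3: download a pre-trained model for a similar vision task.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained features at first

# Step 4: retrain (fine-tune) a new classifier head on your data set.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)  # train_ds: your (image, label) dataset
```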
Google Brain Residency Program
● One-year immersion program in deep learning research
● First class started six weeks ago; planning for next year's class is underway
● Learn to conduct deep learning research with experts on our team
● Fixed one-year employment with salary, benefits, …
● Goal after one year is to have conducted several research projects
● Interesting problems, TensorFlow, and access to computational resources
g.co/brainresidency
Google Brain Residency Program
Who should apply?
● people with a BSc, MSc, or PhD, ideally in CS, mathematics, or statistics
● completed coursework in calculus, linear algebra, and probability, or equivalent
● programming experience
● motivated, hard working, and with a strong interest in deep learning
g.co/brainresidency