Deep Learning and Vision
Jon Shlens
Google Research 28 April 2017
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
The hubris of artificial intelligence
• For decades we tried to write down every possible rule for everyday tasks → impossible.
• Everyday tasks we consider blindingly obvious have been exceedingly difficult for computers. 'Simple' problems proved most difficult.
cat?
http://dspace.mit.edu/handle/1721.1/6125
Machine learning applied everywhere.
• The last decade has shown that if we teach computers to perform a task, rather than writing down the rules by hand, they perform far better.
machine translation • speech recognition • face recognition • time series analysis • molecular activity prediction • image recognition • road hazard detection • object detection • optical character recognition • motor planning • syntax parsing • language understanding • …
The computer vision competition:
Large-scale academic competition focused on predicting 1000 object classes (~1.2M images).
• electric ray
• barracuda
• coho salmon
• tench
• goldfish
• sawfish
• smalltooth sawfish
• guitarfish
• stingray
• roughtail stingray
• ...
ImageNet: A Large-Scale Hierarchical Image Database, J. Deng et al. (2009)
History of techniques in ImageNet Challenge

ImageNet 2010:
• Locality-constrained linear coding + SVM (NEC & UIUC)
• Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT features + LI2C (Nanyang Technological Institute)
• SIFT features + k-nearest neighbors (Laboratoire d'Informatique de Grenoble)
• Color features + canonical correlation analysis (National Institute of Informatics, Tokyo)

ImageNet 2011:
• Compressed Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT bag-of-words + VQ + SVM (University of Amsterdam & University of Trento)
• SIFT + ? (ISI Lab, Tokyo University)

ImageNet 2012:
• Deep convolutional neural network (University of Toronto)
• Discriminatively trained DPMs (University of Oxford)
• Fisher-based SIFT features + SVM (ISI Lab, Tokyo University)
Good fine-grained classification: hibiscus vs. dahlia.
Good generalization: visually different dishes are both recognized as "meal".
Sensible errors: a snake mistaken for a dog.

Examples of artificial vision in action
• fine-grained classification
• generalization
• sensible errors

** Trained a model for whole-image recognition using the Inception-v3 architecture.
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
History of techniques in ImageNet Challenge

ImageNet 2010:
• Locality-constrained linear coding + SVM (NEC & UIUC)
• Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT features + LI2C (Nanyang Technological Institute)
• SIFT features + k-nearest neighbors (Laboratoire d'Informatique de Grenoble)
• Color features + canonical correlation analysis (National Institute of Informatics, Tokyo)

ImageNet 2011:
• Compressed Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT bag-of-words + VQ + SVM (University of Amsterdam & University of Trento)
• SIFT + ? (ISI Lab, Tokyo University)

ImageNet 2012:
• Deep convolutional neural network (University of Toronto)
• Discriminatively trained DPMs (University of Oxford)
• Fisher-based SIFT features + SVM (ISI Lab, Tokyo University)
Deep convolutional neural networks
• Multi-layer perceptrons trained with back-propagation are ideas known since the 1980s.
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
Backpropagation Applied to Handwritten Zip Code Recognition, Y. LeCun et al. (1990)
Convolutional neural networks, revisited.
• The winning network contained 60M parameters.
• Achieving scale in compute and data is critical:
• large academic data sets
• SIMD hardware (e.g. GPUs, SSE instruction sets)
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
What is deep learning?
"Deep learning" = artificial neural networks
• Hierarchical composition of simple mathematical functions
• Loosely inspired by (what little) we know about the brain
"cat"
Untangling Invariant Object Recognition, J. DiCarlo and D. Cox (2007)
A toy model of a neuron: the "perceptron"
• no spikes
• no recurrence or feedback *
• no dynamics or state *
• no biophysics

Simplify the neuron to a sum over weighted inputs and a nonlinear activation function:

$y = f\Big(\sum_i w_i x_i + b\Big)$, with the rectified linear activation $f(z) = \max(0, z)$.

The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, F. Rosenblatt (1958)
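As a concrete illustration, here is a minimal NumPy sketch of this perceptron model (the variable names are illustrative, not from the talk):

```python
import numpy as np

def relu(z):
    # Rectified linear activation: f(z) = max(0, z).
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    # y = f(sum_i w_i x_i + b): a weighted sum of inputs plus a bias,
    # passed through the nonlinear activation function.
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, 0.3])    # weights
print(perceptron(x, w, b=0.2))   # relu(0.05 - 0.4 + 0.6 + 0.2) = 0.45
```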
Employing a network for a task.
• A network is a hierarchical composition of nonlinear functions: $y = f(f(\ldots))$.
• The output of the network is a real-valued vector $y$ with one entry per class label (cat, dog, car, truck, cow, bicycle); the predicted label is the largest entry, e.g. "dog".
Example: how to classify with a network
Step 1: Convert the network output to a probability distribution with the softmax function:

$P(j) = \frac{\exp(y_j)}{\sum_{j'} \exp(y_{j'})}$

where $j$ indexes the output nodes (cat, dog, car, truck, cow, bicycle).

[Figure: bar chart of P(j) over the six class labels, on a 0 to 1 scale.]
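A minimal sketch of the softmax in NumPy; subtracting the maximum output first is a standard numerical-stability trick that leaves P(j) unchanged:

```python
import numpy as np

def softmax(y):
    # P(j) = exp(y_j) / sum_j' exp(y_j'); shifting by max(y) avoids overflow.
    e = np.exp(y - np.max(y))
    return e / e.sum()

# Hypothetical network outputs for [cat, dog, car, truck, cow, bicycle].
y = np.array([1.0, 3.0, 0.2, -1.0, 0.0, 0.5])
p = softmax(y)
print(p, p.sum())  # a valid probability distribution; the entries sum to 1
```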
Example: how to classify with a network
Step 2: Minimize the cross-entropy loss between the predicted distribution and a one-hot target distribution.
• The cross-entropy loss is the KL divergence between the target distribution $p(x)$ and the predicted distribution $q(x)$:

$\text{loss} = \sum_x p(x) \log \frac{p(x)}{q(x)}$

[Figure: predicted distribution vs. one-hot target distribution over the six class labels.]
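A sketch of this loss in NumPy. For a one-hot target p, the KL divergence reduces to the cross-entropy -log q(correct class), since the target's own entropy is zero:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x); equals KL(p || q) for a one-hot p.
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])    # one-hot target: "dog"
q = np.array([0.1, 0.6, 0.05, 0.05, 0.1, 0.1])  # predicted distribution
print(cross_entropy(p, q))                      # -log(0.6) ≈ 0.51
```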
Gradient descent with back-propagation.
• Calculate the partial derivative of the loss with respect to each parameter, $\frac{\partial\,\text{loss}}{\partial w_i}$, to minimize the objective function via gradient descent.
• For weights buried inside the network, employ a clever factorization of the chain rule, i.e. back-propagation.
Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, P. Werbos (1974)
Learning Internal Representations by Error Propagation, D. Rumelhart, G. Hinton, R. Williams, J. McClelland et al. (1986)
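For the softmax-plus-cross-entropy loss above, the gradient at the output layer has the well-known closed form softmax(y) - target, which the chain rule then propagates to the weights. A minimal sketch for a single linear layer (illustrative, not the talk's code):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def sgd_step(W, b, x, target, lr=0.1):
    # Forward pass, then one gradient-descent update.
    y = W @ x + b
    grad_y = softmax(y) - target     # d loss / d y for cross-entropy
    W -= lr * np.outer(grad_y, x)    # d loss / d W = grad_y x^T (chain rule)
    b -= lr * grad_y                 # d loss / d b = grad_y
    return W, b

rng = np.random.default_rng(0)
W, b = 0.01 * rng.normal(size=(6, 4)), np.zeros(6)
x, t = rng.normal(size=4), np.eye(6)[1]     # one-hot target: "dog"
for _ in range(100):
    W, b = sgd_step(W, b, x, t)
print(softmax(W @ x + b))  # probability mass concentrates on class 1
```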
Optimization is highly non-convex.
• Note that deep networks operate in O(1M) dimensions.
[Figure: loss surface over two weights (weight 1, weight 2).]
playground.tensorflow.org
The E. coli of image recognition: MNIST
• Handwritten digit images are fed to a machine learning system (e.g. a neural network), which outputs a label such as "4".
Gradient-Based Learning Applied to Document Recognition, Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998)
http://yann.lecun.com/exdb/mnist/
Multi-layer perceptron on MNIST.
• Note that the number of weights grows as the square of the number of pixels.
• Architecture: handwritten zip codes (P = 28) → fully connected layer (N = 100 hidden units) → logistic classifier (M = 10 classes) → "4"
• # weights (input → fully connected) = N × P² = 78,400
• # weights (fully connected → classifier) = N × M = 1,000
• Consider that the iPhone camera uses P = 2000 (4 million pixels); the number of weights would then be N × P² = 400 million.
Natural image statistics obey invariances.
• translation, cropping, dilation, contrast, rotation, scale, brightness, …
• Translation invariance → convolutions
• Models of natural image statistics begin with a convolutional filter bank.
Statistics of Natural Images: Scaling in the Woods, D. Ruderman and W. Bialek (1994)
Natural Image Statistics and Neural Representation, E. Simoncelli and B. Olshausen (2001)
interlude for convolutions
• filter (3 × 3), identity:
    0 0 0
    0 1 0
    0 0 0
• filter (5 × 5): blur
• filter (5 × 5): sharpen
• filter (3 × 3): vertical edge detector
• filter (3 × 3): all-edge detector
Each filter is convolved with the original image; example kernels and images: https://docs.gimp.org/en/plug-in-convmatrix.html
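The kernels above are all instances of the same operation. Here is a minimal NumPy implementation of a 2-D convolution, with a generic vertical-edge kernel as an example (the GIMP page linked above lists the exact kernels used in the slides):

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2-D convolution: flip the kernel, slide it over the image,
    # and take a weighted sum of the overlapping pixels at every position.
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.random.rand(28, 28)
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])   # responds to vertical edges
print(conv2d(image, vertical_edge).shape)      # (26, 26)
```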
Multi-layer perceptron on MNIST.
• Note that the number of weights grows as the square of the number of pixels!
• handwritten zip codes (P = 28) → fully connected (N = 100) → logistic classifier (M = 10) → "4"
• # weights (input → fully connected) = N × P² = 78,400
• # weights (fully connected → classifier) = N × M = 1,000
Convolutional neural network on MNIST.
• Note that the number of model parameters is largely independent of image size.
• handwritten zip codes (P = 28) → convolutional layer (N = 100 filters, each F = 5 × 5) → logistic classifier (M = 10) → "4"
• # weights (convolutional) = N × F² = 2,500
• # weights (classifier) = N × M × K = 1000 K, where K is the number of spatial positions retained after convolution
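The weight counts above reduce to a few lines of arithmetic; a quick sketch of why weight sharing makes the convolutional layer's count independent of image size:

```python
# Parameter counts for the two MNIST architectures above.
P, N, M, F = 28, 100, 10, 5

fc_weights = N * P**2    # fully connected: every hidden unit sees every pixel
conv_weights = N * F**2  # convolutional: each filter reuses F x F weights everywhere
print(fc_weights, conv_weights)    # 78400, 2500

# At iPhone resolution (P = 2000), the fully connected layer explodes
# while the convolutional layer is unchanged.
print(N * 2000**2, conv_weights)   # 400000000, 2500
```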
Generalizing convolutions in depth.
• A filter bank maps input activations to output activations.
• Example: a grayscale image has input depth 1; an RGB image has input depth 3.
Generalizing convolutions in depth.
• Input depth and output depth are arbitrary parameters and need not be equal.
• Convolutional neural networks operate with depths up to 1024.
• Example: an edge-detector filter bank produces one output channel per filter; a convolutional network stacks many such banks in depth.
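A sketch of a depth-generalized convolution in NumPy, showing how input depth and output depth enter as independent shape parameters (illustrative; real libraries use the same shapes but faster kernels, and skip the kernel flip):

```python
import numpy as np

def conv2d_depth(x, filters):
    # x: (H, W, C_in); filters: (F, F, C_in, C_out) -> (H-F+1, W-F+1, C_out).
    H, W, C_in = x.shape
    F, _, _, C_out = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, C_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + F, j:j + F, :]          # (F, F, C_in)
            out[i, j, :] = np.tensordot(patch, filters, axes=3)
    return out

rgb = np.random.rand(32, 32, 3)      # input depth 3 (an RGB image)
bank = np.random.rand(5, 5, 3, 64)   # output depth 64: 64 filters
print(conv2d_depth(rgb, bank).shape) # (28, 28, 64)
```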
The first convolutional neural network.
• Architecture: handwritten zip codes → convolutional (N = 12) → convolutional (N = 12) → fully connected (N = 30) → logistic classifier (M = 10) → "4"
Backpropagation Applied to Handwritten Zip Code Recognition, Y. LeCun et al. (1989)
Convolutional neural networks, revisited
• Similar to the original CNN architecture, but deeper and larger (70K → 60M parameters).
• More nonlinearities and regularization.
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
Backpropagation Applied to Handwritten Zip Code Recognition, Y. LeCun et al. (1990)
Steady progress in network architectures.

year | model                          | place | top-5 error
2012 | Supervision                    | 1st   | 16.4%
2013 | Clarifai                       | 1st   | 11.5%
2014 | VGG                            | 2nd   | 7.3%
2014 | GoogLeNet / Inception          | 1st   | 6.6%
2014 | Andrej Karpathy (human)        | n/a   | 5.1%
2015 | Batch Normalization Inception  | n/a   | 4.8%
2015 | Inception v3                   | 2nd   | 3.6%
2015 | ResNet                         | 1st   | 3.6%
2016 | Inception-ResNet               | n/a   | 3.1%
Advances in network architectures
Animation by Dan Mané

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, C. Szegedy, S. Ioffe, V. Vanhoucke (2016)
Deep Residual Learning for Image Recognition, K. He, X. Zhang, S. Ren, J. Sun (2015)
Rethinking the Inception Architecture for Computer Vision, C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna (2015)
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe and C. Szegedy (2015)
What I Learned from Competing Against a ConvNet on ImageNet, A. Karpathy (2014)
Very Deep Convolutional Networks for Large-Scale Image Recognition, K. Simonyan and A. Zisserman (2015)
Going Deeper with Convolutions, C. Szegedy et al. (2014)
Visualizing and Understanding Convolutional Networks, M. Zeiler and R. Fergus (2013)
ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, I. Sutskever, G. Hinton (2012)
Scalable Multiclass Object Categorization with Fisher Based Features, N. Gunji et al. (2012)
Compressed Fisher Vectors for Large Scale Visual Recognition, F. Perronnin, J. Sanchez (2011)
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
Covariate shifts are problematic in machine learning
• Traditional machine learning must contend with covariate shift between data sets.
• Covariate shifts must be mitigated through domain adaptation.
blog.bigml.com
Covariate shifts occur between network layers.
• Covariate shifts occur across layers in a deep network: the distribution of activations entering layer i drifts between time = 1 and time = N of training.
• Performing domain adaptation or whitening is impractical in an online setting.
[Figure: distribution of a logistic unit's activation during MNIST training (15th, 50th, and 85th percentiles) over time.]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe and C. Szegedy (2015)
Previous methods for addressing covariate shifts
• Adagrad
• whitening input data
• building invariances through normalization
• regularizing the network (e.g. dropout, maxout)
I. Goodfellow et al. (2013); N. Srivastava et al. (2014)
Mitigate covariate shift via batch normalization.
1. Normalize the activations $\{x_i\}$ within a mini-batch:

$\mu = \frac{1}{n}\sum_i x_i, \quad \sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2, \quad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$

2. Learn the mean and variance of each layer as parameters $(\gamma, \beta)$:

$y_i = \gamma \hat{x}_i + \beta$
• The canonical module of a perceptron is updated from $y = f\big(\sum_i w_i x_i + b\big)$ to $y = f\big(\mathrm{BatchNorm}\big(\sum_i w_i x_i\big)\big)$.

Batch normalization stabilizes training.
• Activations are more stable over training.
[Figure: hidden-layer activations on MNIST over training (15th, 50th, and 85th percentiles).]
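A minimal NumPy sketch of the batch-normalization forward pass at training time (inference uses running averages of the mean and variance instead, which this sketch omits):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each unit over the mini-batch (axis 0), then apply the
    # learned scale gamma and shift beta:
    #   xhat_i = (x_i - mu) / sqrt(sigma^2 + eps);  y_i = gamma * xhat_i + beta
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta

batch = np.random.randn(64, 100) * 5 + 3       # 64 examples, 100 hidden units
y = batch_norm(batch, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # per-unit mean ≈ 0, std ≈ 1
```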
Batch normalization speeds up training enormously.
• CNNs train faster with fewer data samples (15×).
• Employ faster learning rates and less network regularization.
[Figure: precision @ 1 vs. number of mini-batches.]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe and C. Szegedy (2015)
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
Switching to other types of gradients
• For training a network, one focuses on how the loss changes with respect to the parameters: $\frac{\partial\,\text{loss}}{\partial w_i}$.
• The rest of this talk instead focuses on how an activation or the loss depends on the image: $\frac{\partial}{\partial\,\text{image}}$.
• An important distinction: the former provides an update that "lives" in weight space; the latter provides an update that "lives" in image space.
Gradient propagation to find responsible pixels
• Which pixels within an image elicit large activation values?
• Examine activations at middle layers (e.g. layer 3 and layer 5) of a trained network.
Visualizing and Understanding Convolutional Networks, M. Zeiler and R. Fergus (2013)
Gradient propagation for distorting images.
• Start with an image (e.g. a "dog" from http://mscoco.org) and a trained Inception-v3 network.
• What happens if we distort the original image to amplify the label using the gradient signal?
Inceptionism: Going Deeper into Neural Networks, A. Mordvintsev, C. Olah and M. Tyka (2015)
Gradient propagation for distorting images.
• What happens if we distort the original image to amplify the label ("dog") using the gradient signal… but we used the wrong image?
Gradient propagation for distorting images.
• Apply the gradient distortion, feed the distorted image back into the network, and iterate.
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
A. Mordvintsev, C. Olah and M. Tyka
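A hedged sketch of this iterate-and-distort loop in modern TensorFlow (the talk predates this API; `model` is assumed to be a trained Keras classifier and `image` a float tensor of shape (1, H, W, 3) in [0, 1]):

```python
import tensorflow as tf

def amplify_label(model, image, label_index, steps=50, step_size=0.01):
    # Repeatedly nudge the image in the direction that increases the
    # chosen label's score, then feed the distorted image back in.
    image = tf.Variable(image)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            score = model(image)[0, label_index]   # e.g. the "dog" output
        grad = tape.gradient(score, image)         # d score / d image
        grad /= tf.norm(grad) + 1e-8               # normalize the step
        image.assign_add(step_size * grad)         # gradient ascent
        image.assign(tf.clip_by_value(image, 0.0, 1.0))
    return image
```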
A Neural Algorithm of Artistic Style
A Neural Algorithm of Artistic Style, L. Gatys, A. Ecker, M. Bethge (2015)
https://github.com/kaishengtai/neuralart
Gradient propagation for breaking things.
• For a trained network (e.g. Inception-v3) and an image labeled "dog", compute $\frac{\partial\,\text{loss}}{\partial\,\text{image}}$.
• This gradient identifies which pixels are sensitive to the label, and hence how to change pixels to decrease the probability of the label.
• Constrained optimization finds an adversarial adjustment to the image (L1 norm).
• The attack is robust across trained networks, network architectures, and other machine learning systems.
Intriguing Properties of Neural Networks, C. Szegedy et al. (2014)
Explaining and Harnessing Adversarial Examples, I. Goodfellow, J. Shlens and C. Szegedy (2015)
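A hedged sketch of the fast gradient sign method from Goodfellow et al. (2015), one concrete instance of such an attack (`model` is assumed to be a trained Keras classifier; `true_label` an integer class index such as tf.constant([208])):

```python
import tensorflow as tf

def fgsm(model, image, true_label, epsilon=0.007):
    # Perturb every pixel by epsilon in the direction that increases the
    # loss, i.e. decreases the probability of the true label.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        loss = loss_fn(true_label, model(image))
    grad = tape.gradient(loss, image)              # d loss / d image
    adversarial = image + epsilon * tf.sign(grad)  # imperceptible change
    return tf.clip_by_value(adversarial, 0.0, 1.0)
```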
Agenda
1. A brief history and motivation
2. Deep learning for vision
• What is deep learning?
• Convolutions and neural networks
3. Advances in neural networks
• Nonlinearities: example of batch normalization
• Understanding: example of gradient propagation
4. Conclusions
Quick Start Guide
1. Purchase a desktop with a fast GPU.
2. Download an open-source library for deep learning.
3. Download a pre-trained model for a similar vision task.
4. Retrain (fine-tune) the network for your particular data set (see the sketch below).
Online resources:
http://www.tensorflow.org
http://cs231n.github.io/convolutional-networks/
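Steps 3 and 4 in a hedged tf.keras sketch (this API postdates the talk; `train_ds` is a placeholder for your own tf.data.Dataset of (image, label) batches):

```python
import tensorflow as tf

NUM_CLASSES = 10  # placeholder: the number of classes in your data set

# Step 3: download a pre-trained model for a similar vision task.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the pre-trained features at first

# Step 4: retrain (fine-tune) a new classifier head on your data set.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)  # train_ds: your (image, label) dataset
```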
Google Brain Residency Program
● One-year immersion program in deep learning research
● First class started six weeks ago; planning for next year's class is underway
● Learn to conduct deep learning research with experts on our team
● Fixed one-year employment with salary, benefits, …
● Goal after one year is to have conducted several research projects
● Interesting problems, TensorFlow, and access to computational resources
g.co/brainresidency
Google Brain Residency Program
Who should apply?
● people with a BSc, MSc, or PhD, ideally in CS, mathematics, or statistics
● completed coursework in calculus, linear algebra, and probability, or equivalent
● programming experience
● motivated, hard working, and with a strong interest in deep learning
g.co/brainresidency