intel nervana artificial intelligence meetup 1/31/17
TRANSCRIPT
Proprietary and confidential. Do not distribute.
Introduction to deep learning with neon
MAKING MACHINES SMARTER.™
Nervana Systems Proprietary
• Intel Nervana overview
• Machine learning basics
• What is deep learning?
• Basic deep learning concepts
• Example: recognition of handwritten digits
• Model ingredients in-depth
• Deep learning with neon
Intel Nervana's deep learning solution stack
Images
Video
Text
Speech
Tabular
Time series
Solutions
Deep Dream
Autoencoders
Deep Speech 2
Skip-thought
SegNet
Fast-RCNN Object Localization
Deep Reinforcement Learning
imdb Sentiment Analysis
Video Activity Detection
Deep Residual Net
bAbI Q&A
AllCNN
AlexNet
GoogLeNet
VGG
https://github.com/NervanaSystems/ModelZoo
Intel Nervana in action
Healthcare: Tumor detection
Automotive: Speech interfaces
Finance: Time-series search engine
Positive:
Negative:
Agricultural Robotics
Oil & Gas
Positive:
Negative:
Proteomics: Sequence analysis
Query:
Results:
• Optimized AVX-2 and AVX-512 instructions
• Intel® Xeon® processors and Intel® Xeon Phi™ processors
• Optimized for common deep learning operations
  • GEMM (useful in RNNs and fully connected layers)
  • Convolutions
  • Pooling
  • ReLU
  • Batch normalization
• Coming soon: LSTM, GRU, Winograd-based convolutions
• Intel Nervana overview
• Machine learning basics
• What is deep learning?
• Basic deep learning concepts
• Example: recognition of handwritten digits
• Model ingredients in-depth
• Deep learning with neon
• SUPERVISED LEARNING
• DATA -> LABELS
• UNSUPERVISED LEARNING
• NO LABELS; CLUSTERING
• REDUCING DIMENSIONALITY
• REINFORCEMENT LEARNING
• REWARD ACTIONS (E.G., ROBOTICS)
Input: $N \times N$ image
Features: $(f_1, f_2, \ldots, f_K)$, with $K \ll N$
Classifiers: SVM, Random Forest, Naïve Bayes, Decision Trees, Logistic Regression, Ensemble methods
Output label: Arjun
Animals
Faces
Chairs
Fruits
Vehicles
[Figure: scatter plots labelled "Training error" and "Testing error"]
Bias-Variance Trade-off
[Figure: Error vs. Training Time; curves: Training Error, Testing/Validation Error; regions: Underfitting, Overfitting]
• Intel Nervana overview
• Machine learning basics
• What is deep learning?
• Basic deep learning concepts
• Example: recognition of handwritten digits
• Model ingredients in-depth
• Deep learning with neon
~60 million parameters
Arjun
But old practices apply: Data Cleaning, Underfit/Overfit, Data exploration, right cost function, hyperparameters, etc.
𝑁×𝑁
Bigger Data
Image: 1000 KB / picture
Audio: 5000 KB / song
Video: 5,000,000 KB / movie
Better Hardware
Transistor density doubles every 18 months
Cost / GB in 1995: $1000.00
Cost / GB in 2015: $0.03
Smarter Algorithms
Advances in algorithm innovation, including neural networks, leading to better accuracy in training models
• Intel Nervana overview
• Machine learning basics
• What is deep learning?
• Basic deep learning concepts
• Model ingredients in-depth
• Deep learning with neon
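[Figure: a single unit. Inputs $x_1, x_2, x_3$ (input from unit $j$) are multiplied by linear weights $w_1, w_2, w_3$ and summed together with a bias unit to give $a$; the activation function $g$ turns $a$ into the output of the unit, $y$.]
$a = \sum_j w_j x_j + b$, $y = g(a)$
Example activation functions: $g(a) = \max(a, 0)$, $g(a) = \tanh(a)$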
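The single unit described above can be sketched in a few lines of Python. This is only an illustration: the input values, weights, and bias are made up, and the helper names `relu` and `unit_output` are hypothetical rather than part of neon.

```python
import math


def relu(a):
    """Rectified linear activation: max(a, 0)."""
    return max(a, 0.0)


def unit_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(a)


# Hypothetical inputs, weights, and bias chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1

y_relu = unit_output(x, w, b, relu)       # activation max(a, 0)
y_tanh = unit_output(x, w, b, math.tanh)  # activation tanh(a)
print(y_relu, y_tanh)
```

The same weighted sum feeds both activations, so only the last step changes when the activation function is swapped.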
[Figure: network with Input, Hidden, and Output layers]
Affine layer: Linear + Bias + Activation
MNIST dataset: 70,000 images (28x28 pixels)
Goal: classify images into a digit 0-9
N = 28 x 28 pixels = 784 input units
N = 10 output units (one for each digit)
Each output unit i encodes the probability that the input image is the digit i
N = 100 hidden units (user-defined parameter)
[Figure: network with Input, Hidden, and Output layers]
Layer i: N=784, Layer j: N=100, Layer k: N=10
Total parameters:
$W_{i \to j}$: 784 x 100 = 78,400
$b_j$: 100
$W_{j \to k}$: 100 x 10 = 1,000
$b_k$: 10
Total = 78,400 + 100 + 1,000 + 10 = 79,510
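As a quick check of the parameter count, the four terms (two weight matrices and two bias vectors) can be summed directly. The helper name `affine_param_count` is hypothetical; the layer sizes are the ones from the MNIST example.

```python
def affine_param_count(n_in, n_out):
    """Weights (n_in x n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


layer_sizes = [784, 100, 10]  # input, hidden, and output units
per_layer = [affine_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```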
[Figure: network with Input, Hidden, and Output layers]
1. Randomly seed weights
2. Forward-pass
3. Cost
4. Backward-pass
5. Update weights
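The five steps can be sketched on a deliberately tiny model: one linear unit fitted to a handful of points. Everything here is an assumption made for illustration (the data, the squared-error cost, the learning rate, and the function name `train`); it is not the network from the slides.

```python
import random


def train(data, learning_rate=0.05, steps=200, seed=0):
    """Fit y = w * x + b to (x, target) pairs with plain gradient descent."""
    rng = random.Random(seed)
    # 1. Randomly seed weights
    w = rng.gauss(0.0, 1.0)
    b = rng.gauss(0.0, 1.0)
    costs = []
    n = len(data)
    for _ in range(steps):
        cost = grad_w = grad_b = 0.0
        for x, t in data:
            y = w * x + b                 # 2. Forward-pass
            cost += 0.5 * (y - t) ** 2    # 3. Cost (squared error, an assumption)
            grad_w += (y - t) * x         # 4. Backward-pass: dC/dw
            grad_b += (y - t)             #    and dC/db
        w -= learning_rate * grad_w / n   # 5. Update weights
        b -= learning_rate * grad_b / n
        costs.append(cost / n)
    return w, b, costs


# Hypothetical data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b, costs = train(data)
print(round(w, 3), round(b, 3), costs[0] > costs[-1])
```

A real network repeats the same loop with matrices in place of the scalars `w` and `b`.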
[Figure: network with Input, Hidden, and Output layers]
$W_{i \to j}, b_j \sim \text{Gaussian}(0, 1)$
$W_{j \to k}, b_k \sim \text{Gaussian}(0, 1)$
Input: 28x28 image
[Figure: network with Input, Hidden, and Output layers]
Output (10x1): [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
Input: 28x28 image
[Figure: network with Input, Hidden, and Output layers]
Output (10x1): [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
Ground Truth: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Cost function: $c(\text{output}, \text{truth})$
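[Figure: network with Input, Hidden, and Output layers]
Output (10x1): [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
Ground Truth: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Cost function: $c(\text{output}, \text{truth})$
Weight updates: $\Delta W_{i \to j}$, $\Delta W_{j \to k}$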
[Figure: network with Input, Hidden, and Output layers; cost $C(y, \text{truth})$; a selected weight $W^*$]
Compute $\frac{\partial C}{\partial W^*}$
[Figure: network with Input, Hidden, and Output layers; a selected weight $W^*$]
$C(y, \text{truth}) = C\left(g\left(\sum (W_{j \to k} x_k + b_k)\right)\right)$
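[Figure: network with Input, Hidden, and Output layers; the selected weight is $W^* = W^*_{j \to k}$]
$C(y, \text{truth}) = C\left(g\left(\sum (W_{j \to k} x_k + b_k)\right)\right) = C\left(g\left(a(W_{j \to k}, x_k)\right)\right)$, where $a(W_{j \to k}, x_k) = \sum (W_{j \to k} x_k + b_k)$
$\frac{\partial C}{\partial W^*} = \frac{\partial C}{\partial g} \cdot \frac{\partial g}{\partial a} \cdot \frac{\partial a}{\partial W^*}$
[Figure: plots of $g = \max(a, 0)$ and its derivative $g'(a)$ against $a$]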
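[Figure: network with Input, Hidden, and Output layers; the selected weight is $W^* = W^*_{i \to j}$, and the hidden output is $y_j$]
$C(y, \text{truth}) = C\left(g_k\left(a_k(W_{j \to k}, x_k = y_j)\right)\right)$
$C(y, \text{truth}) = C\left(g_k\left(a_k\left(W_{j \to k}, g_j\left(a_j(W_{i \to j}, x_j)\right)\right)\right)\right)$
$\frac{\partial C}{\partial W^*} = \frac{\partial C}{\partial g_k} \cdot \frac{\partial g_k}{\partial a_k} \cdot \frac{\partial a_k}{\partial g_j} \cdot \frac{\partial g_j}{\partial a_j} \cdot \frac{\partial a_j}{\partial W^*}$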
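The five-factor chain rule for a first-layer weight can be checked numerically on a scalar version of the network. This is a sketch under stated assumptions: one unit per layer, ReLU activations, a squared-error cost, and made-up values; the function names are hypothetical.

```python
def relu(a):
    return max(a, 0.0)


def relu_grad(a):
    """Derivative g'(a) of max(a, 0)."""
    return 1.0 if a > 0 else 0.0


def cost(w_ij, w_jk, b_j, b_k, x, t):
    g_j = relu(w_ij * x + b_j)        # hidden unit output y_j
    g_k = relu(w_jk * g_j + b_k)      # output unit
    return 0.5 * (g_k - t) ** 2       # squared error (an assumption)


def grad_w_ij(w_ij, w_jk, b_j, b_k, x, t):
    """dC/dW* for W* = w_ij, written as the product of five factors."""
    a_j = w_ij * x + b_j
    g_j = relu(a_j)
    a_k = w_jk * g_j + b_k
    g_k = relu(a_k)
    dC_dgk = g_k - t
    dgk_dak = relu_grad(a_k)
    dak_dgj = w_jk
    dgj_daj = relu_grad(a_j)
    daj_dw = x
    return dC_dgk * dgk_dak * dak_dgj * dgj_daj * daj_dw


# Hypothetical values chosen so that both ReLU units are active.
params = dict(w_ij=0.8, w_jk=1.2, b_j=0.1, b_k=-0.2, x=1.5, t=1.0)
analytic = grad_w_ij(**params)

# Central finite difference as an independent check.
eps = 1e-6
plus = dict(params, w_ij=params["w_ij"] + eps)
minus = dict(params, w_ij=params["w_ij"] - eps)
numeric = (cost(**plus) - cost(**minus)) / (2 * eps)
print(analytic, numeric)
```

The two printed values should agree closely, which is the usual sanity check for a hand-written backward pass.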
$J(w^{(0)}) = \sum_{i=1}^{N} \text{cost}(w^{(0)}, x_i)$
[Figure: $J$ plotted against $w$, with the starting weight $w^{(0)}$ marked]
Gradient at the starting weight: $\frac{dJ(w^{(0)})}{dw}$
Step against the gradient: $w^{(1)} = w^{(0)} - \frac{dJ(w^{(0)})}{dw}$
Scale the step by a learning rate $\alpha$: $w^{(1)} = w^{(0)} - \alpha \frac{dJ(w^{(0)})}{dw}$
[Figure: position of $w^{(1)}$ when the learning rate is too small, too large, and good enough]
$J(w^{(1)}) = \sum_{i=1}^{N} \text{cost}(w^{(1)}, x_i)$, $w^{(2)} = w^{(1)} - \alpha \frac{dJ(w^{(1)})}{dw}$
$J(w^{(2)}) = \sum_{i=1}^{N} \text{cost}(w^{(2)}, x_i)$, $w^{(3)} = w^{(2)} - \alpha \frac{dJ(w^{(2)})}{dw}$
$J(w^{(3)}) = \sum_{i=1}^{N} \text{cost}(w^{(3)}, x_i)$, $w^{(4)} = w^{(3)} - \alpha \frac{dJ(w^{(3)})}{dw}$
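The effect of the learning rate on the update rule can be seen on a toy cost. The sketch below assumes $J(w) = (w - 3)^2$, which is not from the slides, and runs four updates from $w^{(0)} = 0$ with three made-up learning rates.

```python
def dJ_dw(w):
    """Gradient of the assumed toy cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, alpha, steps):
    """Return [w(0), w(1), ..., w(steps)] using w <- w - alpha * dJ/dw."""
    ws = [w0]
    for _ in range(steps):
        ws.append(ws[-1] - alpha * dJ_dw(ws[-1]))
    return ws


too_small = gradient_descent(0.0, alpha=0.001, steps=4)   # barely moves
too_large = gradient_descent(0.0, alpha=1.1, steps=4)     # overshoots and diverges
good_enough = gradient_descent(0.0, alpha=0.3, steps=4)   # approaches the minimum at w = 3

for name, ws in [("too small", too_small), ("too large", too_large),
                 ("good enough", good_enough)]:
    print(name, [round(w, 3) for w in ws])
```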
For each training example: fprop → cost → bprop → $\delta W$
Update weights via: $\Delta W = \alpha \cdot \frac{1}{N} \sum \delta W$, where $\alpha$ is the learning rate
[Figure: examples grouped into minibatches; minibatch #1 weight update, then minibatch #2 weight update]
Epoch 0
Epoch 1
Sample numbers:
• Learning rate ~0.001
• Batch sizes of 32-128
• 50-90 epochs
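Minibatch training over several epochs can be sketched as below, using the sample numbers above (learning rate 0.001, batch size 32, 50 epochs). The model is a single weight fitted to made-up data, and the function names are hypothetical; each weight update averages the per-example gradients within one minibatch.

```python
import random


def minibatches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def sgd(examples, alpha=0.001, batch_size=32, epochs=50, seed=0):
    """Fit y = w * x with one weight update per minibatch."""
    rng = random.Random(seed)
    examples = list(examples)
    w = 0.0
    for _ in range(epochs):          # one epoch = one pass over all examples
        rng.shuffle(examples)
        for batch in minibatches(examples, batch_size):
            # Per-example gradient of 0.5 * (w * x - t)^2 with respect to w.
            deltas = [(w * x - t) * x for x, t in batch]
            w -= alpha * sum(deltas) / len(deltas)
    return w


# Hypothetical data: 128 points on the line y = 3x.
examples = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 129)]
w_fit = sgd(examples)
print(round(w_fit, 4))
```

With 128 examples and a batch size of 32, each epoch performs four weight updates.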
[Figure: SGD compared with Gradient Descent]
Krizhevsky, 2012: 60 million parameters
Taigman, 2014: 120 million parameters
• Intel Nervana overview
• Machine learning basics
• What is deep learning?
• Basic deep learning concepts
• Model ingredients in-depth
• Deep learning with neon
Dataset, Model/Layers, Activation, Optimizer, Cost $C(y, t)$
Filter + Non-Linearity
Pooling
Filter + Non-Linearity
Fully connected layers
…
“how can I help you?”
cat
Low level features
Mid level features
Object parts, phonemes
Objects, words
*Hinton et al., LeCun, Zeiler, Fergus
Filter + Non-Linearity
Pooling
Tanh (output between -1 and 1)
Logistic (output between 0 and 1)
Rectified Linear Unit
Softmax: $g(a_j) = \frac{e^{a_j}}{\sum_k e^{a_k}}$
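The four activation functions named above are short enough to write out. This is a plain-Python sketch with hypothetical function names, not neon's own implementation; the softmax subtracts the maximum before exponentiating, a common trick that leaves the result unchanged.

```python
import math


def tanh(a):
    return math.tanh(a)


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def rectlin(a):
    """Rectified linear unit: max(a, 0)."""
    return max(a, 0.0)


def softmax(activations):
    """exp(a_j) / sum_k exp(a_k) for every j."""
    shift = max(activations)  # for numerical stability only
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(tanh(0.0), logistic(0.0), rectlin(-2.0), probs)
```

The softmax outputs are positive and sum to one, which is why the output layer can be read as class probabilities.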
Gaussian: Gaussian(mean, sd)
GlorotUniform: Uniform(-k, k), with $k = \sqrt{\frac{6}{d_{in} + d_{out}}}$
Xavier: Uniform(-k, k), with $k = \sqrt{\frac{3}{d_{in}}}$
Kaiming: Gaussian(0, sigma), with $\sigma = \sqrt{\frac{2}{d_{in}}}$
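The scale factors in these initialization schemes depend only on the layer's fan-in and fan-out. The sketch below computes them for the 784-to-100 layer of the MNIST example and draws sample weights; the function names are hypothetical and this is not neon's initializer code.

```python
import math
import random


def glorot_uniform_k(d_in, d_out):
    return math.sqrt(6.0 / (d_in + d_out))


def xavier_k(d_in):
    return math.sqrt(3.0 / d_in)


def kaiming_sigma(d_in):
    return math.sqrt(2.0 / d_in)


def sample_uniform(d_in, d_out, k, rng):
    """d_in x d_out weights drawn from Uniform(-k, k)."""
    return [[rng.uniform(-k, k) for _ in range(d_out)] for _ in range(d_in)]


def sample_gaussian(d_in, d_out, sigma, rng):
    """d_in x d_out weights drawn from Gaussian(0, sigma)."""
    return [[rng.gauss(0.0, sigma) for _ in range(d_out)] for _ in range(d_in)]


rng = random.Random(0)
d_in, d_out = 784, 100
k_glorot = glorot_uniform_k(d_in, d_out)
w_glorot = sample_uniform(d_in, d_out, k_glorot, rng)
w_kaiming = sample_gaussian(d_in, d_out, kaiming_sigma(d_in), rng)
print(round(k_glorot, 4), round(xavier_k(d_in), 4), round(kaiming_sigma(d_in), 4))
```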
• Cross Entropy Loss
• Misclassification Rate
• Mean Squared Error
• L1 loss
Output (10x1): [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
Ground Truth: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
$-\sum_k t_k \times \log(y_k) = -\log(0.3)$
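The cross-entropy value for this example can be reproduced directly, reading the output as ten probabilities and the ground truth as a one-hot vector for the digit 3. The natural logarithm is assumed, and the function name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(outputs, targets):
    """-sum_k t_k * log(y_k); terms with t_k == 0 are skipped to avoid log(0)."""
    return -sum(t * math.log(y) for y, t in zip(outputs, targets) if t > 0)


output = [0.0, 0.1, 0.0, 0.3, 0.1, 0.1, 0.0, 0.0, 0.4, 0.0]
truth = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
loss = cross_entropy(output, truth)
print(loss)
```

Only the probability assigned to the true class contributes, so the loss shrinks as that probability approaches one.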
Outputs        Targets   Correct?
0.3 0.3 0.4    0 0 1     Y
0.3 0.4 0.3    0 1 0     Y
0.1 0.2 0.7    1 0 0     N
-(log(0.4) + log(0.4) + log(0.1))/3 = 1.38

Outputs        Targets   Correct?
0.1 0.2 0.7    0 0 1     Y
0.1 0.7 0.2    0 1 0     Y
0.3 0.4 0.3    1 0 0     N
-(log(0.7) + log(0.7) + log(0.3))/3 = 0.64
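Both sets of outputs get two of three examples right, so their misclassification rate is the same, yet their cross-entropy differs. The sketch below recomputes both metrics from the two tables; the natural logarithm is assumed and the function names are hypothetical.

```python
import math


def mean_cross_entropy(outputs, targets):
    """Average of -sum_k t_k * log(y_k) over the examples."""
    total = 0.0
    for y, t in zip(outputs, targets):
        total -= sum(tk * math.log(yk) for yk, tk in zip(y, t) if tk > 0)
    return total / len(outputs)


def misclassification_rate(outputs, targets):
    """Fraction of examples whose most probable class is not the target class."""
    wrong = sum(1 for y, t in zip(outputs, targets)
                if y.index(max(y)) != t.index(max(t)))
    return wrong / len(outputs)


targets = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
outputs_a = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
outputs_b = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]

ce_a = mean_cross_entropy(outputs_a, targets)
ce_b = mean_cross_entropy(outputs_b, targets)
err_a = misclassification_rate(outputs_a, targets)
err_b = misclassification_rate(outputs_b, targets)
print(round(ce_a, 2), round(ce_b, 2), err_a, err_b)
```

Cross-entropy rewards the second set for being more confident on the correct answers and less confident on the wrong one, which the misclassification rate cannot see.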
• SGD with Momentum
• RMS propagation
• Adagrad
• Adadelta
• Adam
[Figure: weight updates $\Delta W_1, \Delta W_2, \Delta W_3, \Delta W_4$ along the training-time axis]
$\alpha'_{t=T} = \frac{\alpha}{\sqrt{\sum_{t=0}^{T} (\Delta W_t)^2}}$
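[Figure: weight updates $\Delta W_1, \Delta W_2, \Delta W_3, \Delta W_4$ along the training-time axis]
$\alpha'_{t=4} = \frac{\alpha}{\sqrt{(\Delta W_2)^2 + (\Delta W_3)^2 + (\Delta W_4)^2}}$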
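The two learning-rate formulas differ only in which past updates enter the sum: all of them, or a recent window. The sketch below evaluates both for a single scalar weight with made-up update values; the function names are hypothetical, and practical optimizers typically add a small constant to the denominator and track the sum per parameter.

```python
import math


def scaled_rate_all_steps(alpha, updates):
    """alpha divided by the root of the sum of squares of every past update."""
    return alpha / math.sqrt(sum(dw ** 2 for dw in updates))


def scaled_rate_window(alpha, updates, window):
    """Same, but only the most recent `window` updates are counted."""
    return alpha / math.sqrt(sum(dw ** 2 for dw in updates[-window:]))


alpha = 0.1
updates = [0.5, 0.4, 0.2, 0.1]  # hypothetical values for the four updates

rate_all = scaled_rate_all_steps(alpha, updates)
rate_window = scaled_rate_window(alpha, updates, window=3)
print(round(rate_all, 4), round(rate_window, 4))
```

Because the full sum can only grow, the first rate keeps shrinking during training, while the windowed version can recover once large early updates drop out of the window.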
• Intel Nervana overview
• Machine learning basics
• What is deep learning?
• Basic deep learning concepts
• Model ingredients in-depth
• Deep learning with neon
• Popular, well established, developer familiarity
• Fast to prototype
• Rich ecosystem of existing packages
• Data Science: pandas, pycuda, ipython, matplotlib, h5py, …
• Good “glue” language: scriptable plus functional and OO support, plays well with other languages
Backend: NervanaGPU, NervanaCPU
Datasets: MNIST, CIFAR-10, Imagenet 1K, PASCAL VOC, Mini-Places2, IMDB, Penn Treebank, Shakespeare Text, bAbI, Hutter-prize, UCF101, flickr8k, flickr30k, COCO
Initializers: Constant, Uniform, Gaussian, Glorot Uniform, Xavier, Kaiming, IdentityInit, Orthonormal
Optimizers: Gradient Descent with Momentum, RMSProp, AdaDelta, Adam, Adagrad, MultiOptimizer
Activations: Rectified Linear, Softmax, Tanh, Logistic, Identity, ExpLin
Layers: Linear, Convolution, Pooling, Deconvolution, Dropout, Recurrent, Long Short-Term Memory, Gated Recurrent Unit, BatchNorm, LookupTable, Local Response Normalization, Bidirectional-RNN, Bidirectional-LSTM
Costs: Binary Cross Entropy, Multiclass Cross Entropy, Sum of Squares Error
Metrics: Misclassification (Top1, TopK), LogLoss, Accuracy, PrecisionRecall, ObjectDetection
1. Generate backend
2. Load data
3. Specify model architecture
4. Define training parameters
5. Train model
6. Evaluate
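The six steps might look roughly like the sketch below for the MNIST network with 100 hidden units. It is an assumption-laden outline modelled on neon's published MNIST MLP example, not code from this talk: it needs the neon package installed, the exact class names and arguments may differ between neon versions, and the hyperparameters are made up. The imports sit inside the function so that the file can be loaded without neon present.

```python
def train_mnist_mlp(num_epochs=10, batch_size=128, data_dir='.'):
    """Sketch of the six-step workflow; requires neon to be installed."""
    from neon.backends import gen_backend
    from neon.callbacks.callbacks import Callbacks
    from neon.data import MNIST
    from neon.initializers import Gaussian
    from neon.layers import Affine, GeneralizedCost
    from neon.models import Model
    from neon.optimizers import GradientDescentMomentum
    from neon.transforms import (CrossEntropyMulti, Misclassification,
                                 Rectlin, Softmax)

    # 1. Generate backend (CPU here; a GPU backend is the usual alternative)
    gen_backend(backend='cpu', batch_size=batch_size)

    # 2. Load data
    dataset = MNIST(path=data_dir)
    train_set = dataset.train_iter
    valid_set = dataset.valid_iter

    # 3. Specify model architecture: 784 inputs -> 100 hidden -> 10 outputs
    init = Gaussian(loc=0.0, scale=0.01)
    layers = [Affine(nout=100, init=init, activation=Rectlin()),
              Affine(nout=10, init=init, activation=Softmax())]
    model = Model(layers=layers)

    # 4. Define training parameters (values are illustrative assumptions)
    cost = GeneralizedCost(costfunc=CrossEntropyMulti())
    optimizer = GradientDescentMomentum(0.1, momentum_coef=0.9)
    callbacks = Callbacks(model, eval_set=valid_set)

    # 5. Train model
    model.fit(train_set, optimizer=optimizer, num_epochs=num_epochs,
              cost=cost, callbacks=callbacks)

    # 6. Evaluate: misclassification error on the validation set
    return model.eval(valid_set, metric=Misclassification())
```

Calling `train_mnist_mlp()` on a machine with neon installed would download MNIST to `data_dir` on first use and return the validation error.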
NERVANA