![Page 1: Deep Learning Theory and Practiceweb.cecs.pdx.edu/~willke/courses/EE510W20/lectures/lecture14.pdf · -A lesson in the value of hyperparameter optimization! ... Hypothesis: The problem](https://reader036.vdocuments.site/reader036/viewer/2022062919/5edf2b23ad6a402d666a84ae/html5/thumbnails/1.jpg)
Deep Learning Theory and Practice
Lecture 14
Modern CNN Architectures II
Dr. Ted Willke
Thursday, February 20, 2020
Review of Lecture 13
• LeNet-5 (1990s)
- One of the first convolutional neural networks used industrially (by banks for check reading)
• Pooling
- A form of non-linear downsampling of activation maps
- Max pooling outputs the maximum value from a cluster of neurons at the prior layer
https://computersciencewiki.org/index.php/Max-pooling_/_Pooling
• AlexNet (2012)
- First CNN winner of ImageNet contest
- A deeper model (8 layers)
- First use of ReLU
- Used SGD with Momentum
- 7 CNN ensemble
- Trained on 2 GPUs
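As a rough illustration of the max pooling bullet above, the sketch below applies 2x2 max pooling with stride 2 to a small activation map. The window size, stride, and the sample values are assumptions chosen for the example; the slide does not fix them.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D list (height and width assumed even)."""
    pooled = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            # The 'cluster of neurons' is the 2x2 window starting at (i, j).
            window = (fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 activation map.
activation_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
print(max_pool_2x2(activation_map))  # [[6, 2], [2, 9]]
```

Each output value keeps only the strongest response in its window, which is why the map shrinks from 4x4 to 2x2.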
Review of Lecture 13
• Rectified Linear Unit (ReLU)
- Effective and computationally advantageous
• Momentum
- Modifies weight updates to build up “velocity” in a gradient direction
$V_t = \beta V_{t-1} + (1 - \beta) \nabla_W L(W, X, y)$
$W = W - \eta V_t$
• Model ensembles
- Training multiple versions of a model (or independent models)
- Many approaches (different initializations, hyperparameters, checkpoints/timepoints)
- Only limited by complexity and computational cost!
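The momentum update from this review slide can be sketched on a toy problem. The quadratic loss L(w) = (w - 3)^2, the learning rate, and beta = 0.9 below are assumptions for illustration only, not values from the lecture.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One update: V_t = beta*V_{t-1} + (1-beta)*grad, then W = W - lr*V_t."""
    v = beta * v + (1.0 - beta) * grad
    w = w - lr * v
    return w, v


# Toy loss L(w) = (w - 3)^2 with gradient 2*(w - 3); minimum at w = 3.
w, v = 0.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, 2.0 * (w - 3.0))
print(w)  # close to 3.0
```

The velocity is an exponentially weighted average of past gradients, so consistent gradient directions accumulate while oscillating ones partly cancel.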
Review of Lecture 13
• ZFNet (2013), similar to AlexNet
- A lesson in the value of hyperparameter optimization!
• VGGNet (2014) (Simonyan and Zisserman 2014)
- Introduced smaller filters, deeper network (16-19 layers)
- Offered more non-linearities with fewer parameters (but still 138M parameters for VGG-16!)
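One way to see the "fewer parameters" claim for VGG: three stacked 3x3 conv layers (stride 1) cover the same 7x7 receptive field as a single 7x7 conv, but with fewer weights and three non-linearities instead of one. The sketch below counts weights only (no biases); the channel count C = 256 is a hypothetical value.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel conv layer, ignoring biases."""
    return kernel * kernel * c_in * c_out


C = 256  # hypothetical number of input and output channels
stack_3x3 = 3 * conv_weights(3, C, C)   # three stacked 3x3 layers
single_7x7 = conv_weights(7, C, C)      # one 7x7 layer, same receptive field
print(stack_3x3, single_7x7)  # 1769472 3211264
```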
Review of Lecture 13
• GoogLeNet (2014) (Szegedy et al. 2014)
- A deeper model with computational efficiency
- Introduced the ‘inception’ module
- Only 5M parameters! How did they do it?!
[Figure: Inception module]
Today’s Lecture
• The rest of the modern CNN classics
(Many slides adapted from Stanford’s excellent CS231n course. Thank you Fei-Fei Li, Justin Johnson, and Serena Yeung!)
GoogLeNet
Naive Inception module
Module input: 28x28x256
Branch outputs:
- 1x1 conv, 128 filters: 28x28x128
- 3x3 conv, 192 filters: 28x28x192
- 5x5 conv, 96 filters: 28x28x96
- Pooling: 28x28x256
Output after filter concatenation: 28x28x(128+192+96+256) = 28x28x672!
Conv ops:
- [1x1 conv, 128] 28x28x128x1x1x256
- [3x3 conv, 192] 28x28x192x3x3x256
- [5x5 conv, 96] 28x28x96x5x5x256
Total of 854M ops!
Very expensive compute!
The pooling layer also preserves feature depth, which means total depth after concatenation can only grow at every layer!
How did they deal with this challenge?
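The op count for the naive module can be reproduced with a few lines of arithmetic. This is only a sketch of the counting convention used on the slide (one multiply per output value, per kernel position, per input channel); the helper name `conv_ops` is made up for the example.

```python
def conv_ops(h, w, c_out, k, c_in):
    """Multiplies for a k x k conv producing an h x w x c_out output from c_in channels."""
    return h * w * c_out * k * k * c_in


# (filters, kernel size) for the three conv branches; every branch sees the 256-deep input.
branches = [(128, 1), (192, 3), (96, 5)]
naive_total = sum(conv_ops(28, 28, c_out, k, 256) for c_out, k in branches)
output_depth = sum(c_out for c_out, _ in branches) + 256  # pooling keeps all 256 maps

print(naive_total)   # 854196224, i.e. about 854M
print(output_depth)  # 672
```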
GoogLeNet
Naive Inception module
Module input: 28x28x256
Output after filter concatenation: 28x28x(128+192+96+256) = 28x28x672!
Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth
1x1 convolutions for projection
1x1 CONV, 32 filters
(Each filter is 1x1x64 and performs a 64-dim dot product.)
•Preserves spatial dims but reduces depth!
•Projects depth to lower dimension (essentially combining activation maps)
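A 1x1 convolution is just a dot product over the channel dimension at every pixel, which the following pure-Python sketch makes explicit. The 64 input channels and 32 filters come from the slide; the 4x4 spatial size and the random values are assumptions to keep the example small.

```python
import random


def conv1x1(x, filters):
    """x: H x W x C_in nested lists; filters: C_out lists of C_in weights each."""
    return [[[sum(wc * xc for wc, xc in zip(f, pixel)) for f in filters]
             for pixel in row]
            for row in x]


random.seed(0)
H, W, C_IN, C_OUT = 4, 4, 64, 32  # spatial size is hypothetical
x = [[[random.random() for _ in range(C_IN)] for _ in range(W)] for _ in range(H)]
filters = [[random.random() for _ in range(C_IN)] for _ in range(C_OUT)]

y = conv1x1(x, filters)
print(len(y), len(y[0]), len(y[0][0]))  # 4 4 32
```

The spatial dimensions are untouched while the depth drops from 64 to 32, with each output map being a weighted combination of all input maps.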
Inception module with dimensionality reduction
[Figure: naive Inception module shown beside the Inception module with dimensionality reduction]
Inception module with dimensionality reduction
Same design, but with ‘1x1 conv, 64 filter’ bottlenecks:
Module input: 28x28x256
Bottleneck outputs: 28x28x64 (before the 3x3 conv) and 28x28x64 (before the 5x5 conv); the pooling output (28x28x256) is reduced to 28x28x64 by a 1x1 conv
Branch outputs: 28x28x128, 28x28x192, 28x28x96, 28x28x64
Output after filter concatenation: 28x28x480
Conv ops:
- [1x1 conv, 64] 28x28x64x1x1x256
- [1x1 conv, 64] 28x28x64x1x1x256
- [1x1 conv, 128] 28x28x128x1x1x256
- [3x3 conv, 192] 28x28x192x3x3x64
- [5x5 conv, 96] 28x28x96x5x5x64
- [1x1 conv, 64] 28x28x64x1x1x256
These six terms sum to about 271M ops (vs 854M)!
Note that the bottleneck also reduces depth after the pooling layer.
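The effect of the bottlenecks can be checked by evaluating the six conv terms listed above exactly as written. This is a sketch using the same counting convention as the naive module; the helper name `conv_ops` is made up for the example, and the printed total is simply the sum of these six terms.

```python
def conv_ops(h, w, c_out, k, c_in):
    """Multiplies for a k x k conv producing an h x w x c_out output from c_in channels."""
    return h * w * c_out * k * k * c_in


# (filters, kernel size, input depth) for each conv in the module.
layers = [
    (64, 1, 256),   # bottleneck before the 3x3 conv
    (64, 1, 256),   # bottleneck before the 5x5 conv
    (128, 1, 256),  # 1x1 branch
    (192, 3, 64),   # 3x3 conv on the reduced 64 maps
    (96, 5, 64),    # 5x5 conv on the reduced 64 maps
    (64, 1, 256),   # 1x1 conv after pooling
]
bottleneck_total = sum(conv_ops(28, 28, c_out, k, c_in) for c_out, k, c_in in layers)
output_depth = 128 + 192 + 96 + 64

print(bottleneck_total)  # 271351808, i.e. about 271M
print(output_depth)      # 480
```

Most of the saving comes from the 3x3 and 5x5 convs, which now operate on 64 input maps instead of 256.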
GoogLeNet
Inception module
(Szegedy et al. 2014)
Stack Inception modules with dimensionality reduction.
Full GoogLeNet Architecture (Szegedy et al. 2014)
• ‘Stem network’: CONV, POOL, CONV, CONV, POOL
• Stacked Inception modules
• Classifier output: removed expensive FC layers!
• Auxiliary classification outputs (AvgPool, 1x1 CONV, FC, FC, Softmax): inject additional gradient at lower layers
• 22 layers with weights (including the parallel Inception layers)
GoogLeNet (Szegedy et al. 2014)
Deeper networks, with computational efficiency
•22 layers
•Efficient ‘inception’ module
•No expensive FC layers!
•Only 5 million parameters! (1/12th of AlexNet)
•ILSVRC’14 classification winner (6.7% top-5 error)
ImageNet (ILSVRC contest)
Image credit: Kaiming He
ResNet: Very deep networks
ResNet
Very deep networks using residual connections
•152-layer model for ImageNet
•ILSVRC’15 classification winner(3.57% top-5 error)
•Swept all classification and detection competitions in ILSVRC’15 and COCO’15
(He et al. 2015)
ResNet
What happens when we continue stacking deeper layers on a plain CNN?
What’s surprising about these training and test curves?
[Figure: training error and test error curves (He et al. 2015)]
ResNet
What happens when we continue stacking deeper layers on a plain CNN?
The 56-layer model performs worse on both training and test error.
Not caused by overfitting! (He et al. 2015)
ResNet
Hypothesis: The problem is an optimization problem, with deep models being harder to optimize.
•A deeper model should be able to perform at least as well as a shallower one (the shallower model’s solutions are a subset of the deeper model’s)
•Not likely to be caused by vanishing gradients: the networks use Batch Normalization, and neither the forward nor the backward signals were found to vanish
•The solver is verified to work to some extent. Exponentially low convergence rates??
A solution (by construction) is to copy the learned layers from the shallower model and set the additional layers to ‘identity mappings’. Why would this work and how would we do it?
Nested Function Classes (Hypothesis Sets)
Consider $\mathcal{F}$, the class of functions that a specific network architecture, etc., can reach.
That is, for all $f \in \mathcal{F}$, there exists some set of parameters $W$ that can be obtained through training.
Assume we want $f^*$. If $f^*$ is in $\mathcal{F}$, we are in good shape. Often this is not the case. So, find some $f^*_{\mathcal{F}}$:
$$f^*_{\mathcal{F}} := \arg\min_{f} L(X, Y, f) \quad \text{subject to } f \in \mathcal{F}.$$
Assume a more powerful architecture $\mathcal{F}'$ would deliver a better $f^*_{\mathcal{F}'}$, right? Not if $\mathcal{F} \not\subseteq \mathcal{F}'$!
[Figure: generic (non-nested) vs. nested function classes]
ResNet
Solution: Use network layers to fit a ‘residual mapping’ instead of directly trying to fit a desired underlying mapping.
[Figure: ‘plain’ layers vs. residual block]
Residual block output: H(x) = F(x) + x
Fit the residual F(x) = H(x) − x instead of fitting H(x) directly.
(He et al. 2015)
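A minimal sketch of the residual idea, assuming activations are plain lists of floats and the residual branch F is passed in as a function (both are simplifications for illustration, not the layers used in the paper). It shows why the identity mapping becomes easy: the block only has to drive F(x) to zero.

```python
def residual_block(x, residual_fn):
    """Compute H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Hypothetical branch whose weights have all been pushed to zero."""
    return [0.0 for _ in x]


def small_residual(x):
    """Hypothetical branch that learned a small correction to the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # identity mapping: [1.0, -2.0, 3.0]
print(residual_block(x, small_residual))  # input plus a small learned correction
```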
Full ResNet Architecture
•Stack residual blocks
•Each residual block has two 3x3 conv layers
•Periodically double number of filters and downsample spatially using stride of 2 (divide by 2 in each dim), e.g., from ‘3x3 conv, 64’ to ‘3x3 conv, 128, stride 2’
•Additional conv layer at beginning
•No hidden FC layers (only FC 1000 to output classes)
•Global average pooling after final conv
(He et al. 2015)
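The "double the filters, halve the resolution" pattern can be traced with the standard conv output-size formula. The starting size of 56x56 with 64 filters and the padding of 1 are assumptions for the example.

```python
def conv_out_size(n, kernel=3, stride=1, padding=1):
    """Spatial output size of a conv layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


size, channels = 56, 64  # hypothetical starting stage
stages = [(size, channels)]
for _ in range(3):
    size = conv_out_size(size, stride=2)  # stride 2 halves each spatial dim
    channels *= 2                         # filters double at the same time
    stages.append((size, channels))

print(stages)  # [(56, 64), (28, 128), (14, 256), (7, 512)]
```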
Varieties of ResNet
Total depths of 34, 50, 101, or 152 layers for ImageNet
(He et al. 2015)
Bottleneck layers in ResNet (He et al. 2015)
For deeper networks (50 layers and deeper) use a ‘bottleneck’ layer to improve efficiency, similar to GoogLeNet
- 1x1 conv, 64 projects to 28x28x64
- 3x3 conv operates over only 64 maps
- 1x1 conv, 256 projects back to 256 maps (28x28x256)
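To get a feel for the saving, the sketch below counts weights (biases ignored) for the 1x1, 3x3, 1x1 bottleneck described above. The comparison against two plain 3x3 convs operating directly on 256 maps is an assumption added for illustration, not a figure from the slides.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel conv layer, ignoring biases."""
    return kernel * kernel * c_in * c_out


bottleneck = (
    conv_weights(1, 256, 64)    # 1x1 conv projects 256 maps down to 64
    + conv_weights(3, 64, 64)   # 3x3 conv operates over only 64 maps
    + conv_weights(1, 64, 256)  # 1x1 conv projects back to 256 maps
)
plain = 2 * conv_weights(3, 256, 256)  # hypothetical block of two 3x3 convs at full depth

print(bottleneck, plain)  # 69632 1179648
```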
Adjusting the Residual’s Channels and Resolution
The 1x1 convolution can be used to select the number of filters and to apply various strides.
[Figure: residual block, and residual block with a 1x1 conv (He et al. 2015)]
Training ResNet in practice (He et al. 2015)
•Batch Normalization after every conv layer
•Xavier/2 initialization from He et al.
•SGD + Momentum (0.9)
•Learning rate: 0.1, divided by 10 when validation error plateaus
•Mini-batch size of 256
•Weight decay of 1e-4
•No dropout used
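The learning-rate rule in the list above can be sketched as a small plateau detector. The `patience` parameter (how many non-improving evaluations count as a plateau) and the validation errors are hypothetical; the slide only gives the starting rate of 0.1 and the factor of 1/10.

```python
def plateau_schedule(val_errors, lr=0.1, factor=0.1, patience=2):
    """Return the learning rate in effect after each validation measurement."""
    best = float("inf")
    stale = 0  # evaluations since the last improvement
    history = []
    for err in val_errors:
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # plateau detected: cut the learning rate
                stale = 0
        history.append(lr)
    return history


# Hypothetical validation errors over eight evaluations.
errors = [0.50, 0.40, 0.35, 0.36, 0.35, 0.30, 0.31, 0.32]
print(plateau_schedule(errors))
```

With these values the rate stays at 0.1 while the error improves, drops to roughly 0.01 after the first stall, and to roughly 0.001 after the second.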
ResNet Results (He et al. 2015)
•Able to train very deep networks without degradation (152 layers on ImageNet, 1202 (!) on CIFAR)
•Deeper networks now achieve lower training error, as expected
•Swept 1st place in all ILSVRC and MS COCO 2015 competitions
•ILSVRC 2015 classification: 3.6% top-5 error, better than some humans! (Russakovsky 2014)
ImageNet (ILSVRC contest)
Image credit: Kaiming He
Complexity Comparison
An Analysis of Deep Neural Network Models for Practical Applications (Canziani et al. 2017)
•Inception-v4: ResNet + Inception!
•VGG: Highest memory, most ops!
•GoogLeNet: Efficient!
•AlexNet: Smaller compute, lower accuracy, memory heavy!
•ResNet: Moderate efficiency, high accuracy
Inferencing Time and Power
An Analysis of Deep Neural Network Models for Practical Applications (Canziani et al. 2017)
Other architectures worth knowing…
Network in Network (NiN)
•Precursor to GoogLeNet and ResNet ‘bottleneck’ layers
•Philosophical inspiration for GoogLeNet
•Mlpconv layer with ‘micronetwork’ within each conv layer to compute more abstract features for local patches
•Uses MLP (FC, i.e., 1x1 conv layers)
•Stack of 3 mlpconv layers with one global average pooling layer
[Figure: comparison of a conventional conv layer (a) with an mlpconv layer]
Global Average Pooling
•Typically, CNNs use convolutions in lower layers, followed by FC layers and softmax logistic regression for classification
•FC layers are prone to overfitting and computationally expensive
•Global average pooling can replace the FC layers in CNNs
GAP idea:
1. Generate 1 activation map per class
2. Take the average of each map and feed this vector to softmax
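The two GAP steps can be sketched directly. The three 2x2 activation maps below (one per class) are made-up values; a real network would produce them with its final conv layer.

```python
import math


def global_average_pool(maps):
    """Average each 2D activation map down to a single score."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in maps]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Step 1: one (hypothetical) activation map per class.
class_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [1.0, 1.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
# Step 2: average each map and feed the resulting vector to softmax.
scores = global_average_pool(class_maps)
probs = softmax(scores)
print(scores)  # [2.5, 0.5, 5.0]
print(probs)
```

No weights are involved in the pooling step itself, which is why it avoids the parameter cost and overfitting risk of FC layers.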
Improving ResNets
Identity Mappings in Deep Residual Networks (He et al. 2016)
•Improved ResNet block design from creators of ResNet
•Creates a more direct path for propagating info throughout network (moves activations to residual mapping pathway)
•Provides better performance
Improving ResNets…
Wide Residual Networks (Zagoruyko et al. 2016)
•Argues that residuals are the important factor, not depth
•Uses wider residual blocks (F x k filters instead of F filters in each layer)
•50-layer ‘wide’ ResNet outperforms 152-layer original
•Increasing width instead of depth is more computationally efficient (parallelizable)
[Figure: basic residual block vs. wide residual block]
Improving ResNets…
Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) (Xie et al. 2016)
•Also from creators of ResNet
• Increases width of residual block through multiple parallel pathways (‘cardinality’)
•Similar in spirit to Inception module
Improving ResNets…
Deep Networks with Stochastic Depth (Huang et al. 2016)
•Motivation: Reduce vanishing gradients and training time through short networks (during training)
•Randomly drop a subset of layers during each training pass
•Bypass with identity function
•Use full deep network at test time
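A rough sketch of stochastic depth for a single residual block, assuming list-valued activations and a residual branch passed in as a function. The test-time scaling of the branch by its survival probability is an assumption based on the usual description of the method; the slide itself only says the full network is used at test time.

```python
import random


def stochastic_depth_block(x, residual_fn, survival_prob, training, rng=random):
    """Residual block that is randomly bypassed during training."""
    if training:
        if rng.random() < survival_prob:
            # Block survives this pass: ordinary residual computation.
            return [xi + fi for xi, fi in zip(x, residual_fn(x))]
        # Block dropped this pass: bypass with the identity function.
        return list(x)
    # Test time: always use the block, scaled by how often it was active (assumption).
    return [xi + survival_prob * fi for xi, fi in zip(x, residual_fn(x))]


def double(x):
    """Hypothetical residual branch used only for the demonstration."""
    return [2.0 * xi for xi in x]


x = [1.0, 2.0]
print(stochastic_depth_block(x, double, survival_prob=1.0, training=True))   # [3.0, 6.0]
print(stochastic_depth_block(x, double, survival_prob=0.0, training=True))   # [1.0, 2.0]
print(stochastic_depth_block(x, double, survival_prob=0.5, training=False))  # [2.0, 4.0]
```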
Beyond ResNets…
FractalNet: Ultra-Deep Networks without Residuals (Larsson et al. 2017)
•Argues that key is transitioning effectively from shallow to deep and residual representations are not necessary
•Fractal architecture with both shallow and deep paths to output
•Trained with dropping out sub-paths
•Full network at test time
Beyond ResNets…
Densely Connected Convolutional Networks (Huang et al. 2017)
•Dense blocks where each layer is connected to every other layer in a feedforward manner
•Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse
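Dense connectivity can be illustrated by tracking how many feature maps each layer in a dense block receives, assuming (as in DenseNet) that earlier outputs are concatenated along the channel dimension. The initial channel count and the growth rate below are hypothetical values.

```python
def dense_block_input_channels(c0, growth_rate, num_layers):
    """Input depth seen by each layer when all earlier outputs are concatenated."""
    return [c0 + i * growth_rate for i in range(num_layers)]


def num_direct_connections(num_layers):
    """Direct connections in a block where every layer feeds all later layers."""
    return num_layers * (num_layers + 1) // 2


# Hypothetical block: 64 input maps, each layer adds 32 new maps.
print(dense_block_input_channels(64, 32, 4))  # [64, 96, 128, 160]
print(num_direct_connections(4))              # 10
```

Because every layer sees all earlier feature maps, features are reused rather than recomputed, and gradients have short paths back to early layers.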
Efficient Networks
SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size (Iandola et al. 2017)
• ‘Fire modules’ consisting of a ‘squeeze’ layer with 1x1 filters feeding an ‘expand’ layer with 1x1 and 3x3 filters
• AlexNet-level accuracy on ImageNet with 50x fewer parameters
• Compresses to 510x smaller than AlexNet (0.5 MB) for FPGAs and embedded devices
[Figure: a Fire module]
Strategies:
1. Replace 3x3 with 1x1 filters (9x fewer parameters)
2. Decrease number of input channels to 3x3 filters
3. Downsample late so that conv layers have large activation maps
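Strategies 1 and 2 can be made concrete by counting weights (biases ignored) in a Fire module and comparing against a plain 3x3 conv with the same input and output depth. The channel sizes (96 in, 16 squeeze filters, 64 + 64 expand filters) are illustrative values chosen for this sketch.

```python
def fire_module_weights(c_in, squeeze, expand_1x1, expand_3x3):
    """Weights in a Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand filters."""
    squeeze_w = c_in * squeeze                 # 1x1 squeeze layer
    expand_1x1_w = squeeze * expand_1x1        # strategy 1: some expand filters are 1x1
    expand_3x3_w = 9 * squeeze * expand_3x3    # strategy 2: 3x3 filters see few input channels
    return squeeze_w + expand_1x1_w + expand_3x3_w


fire = fire_module_weights(c_in=96, squeeze=16, expand_1x1=64, expand_3x3=64)
plain_3x3 = 9 * 96 * (64 + 64)  # plain 3x3 conv from 96 to 128 channels

print(fire, plain_3x3)  # 11776 110592
```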
Summary
•VGG, GoogLeNet, ResNet all in wide use and widely available in repos
•ResNet currently best default (circa 2017)
•Trend toward extremely deep networks
•Significant research around design of layer/skip connections and improving gradient flow
•More recent trend towards examining necessity of depth vs width and residual connections
Further reading
• Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2020) Dive into Deep Learning, Release 0.7.1. https://d2l.ai/
• Stanford CS231n Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.
• Goodfellow et al. (2016) Deep Learning. https://www.deeplearningbook.org/
• Boyd, S., and Vandenberghe, L. (2018) Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares. http://vmls-book.stanford.edu/
• VanderPlas, J. (2016) Python Data Science Handbook. https://jakevdp.github.io/PythonDataScienceHandbook/