TRANSCRIPT
Jie Fang, Xipeng Li, Yong Wang. November 22nd
APPROACHING ULTIMATE SPEED FOR INFERENCE – NETWORK PRUNING AND BEYOND
OUTLINE
Introduction to Network Pruning
Introduction to TensorRT
Case study: Accelerating SSD using Network Pruning and TensorRT
INTRODUCTION TO NETWORK PRUNING
WHAT IS NETWORK PRUNING & WHY
Pruning removes parameters from a given network to reduce its computation cost and storage while affecting accuracy as little as possible.
Pruning -> smaller network -> less computation -> faster inference
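The basic idea can be sketched with a minimal magnitude-pruning example (NumPy; the function name, the weight values, and the 50% sparsity target are illustrative, not part of the talk):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

w = np.array([[0.9, -0.05, 0.4],
              [0.01, -0.7, 0.1]])
pruned = prune_by_magnitude(w, 0.5)  # zeros 0.01, -0.05 and 0.1; keeps 0.9, 0.4, -0.7
```

In practice such pruning is followed by fine-tuning to recover accuracy, which is what the next slide's comparison is about.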
WHAT IS NETWORK PRUNING & WHY
This plot shows that directly training a small network cannot reach the accuracy of a similarly sized network obtained through pruning. (MCR: misclassification rate)
Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured Pruning of Deep Convolutional Neural Networks, arXiv:1512.08571, 2015.
TWO DIRECTIONS: Reducing storage
On ImageNet, AlexNet is reduced by 9x and VGG-16 by 13x without loss of accuracy.
Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. arXiv:1506.02626, 2015.
TWO DIRECTIONS: Reducing computation
Ratio of GFLOPs

            Convolutional Layers   Fully Connected Layers
AlexNet     91.8%                  8.2%
VGG-16      99.2%                  0.8%
ResNet-50   > 99.9%                < 0.1%
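The imbalance follows directly from per-layer FLOP counts; a rough sketch (the layer shapes below are illustrative, not the exact AlexNet or VGG configurations):

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    # Each of the h_out*w_out*c_out outputs needs c_in*k*k multiply-adds (2 FLOPs each).
    return 2 * h_out * w_out * c_in * c_out * k * k

def fc_flops(n_in, n_out):
    # A fully connected layer is a single matrix-vector product.
    return 2 * n_in * n_out

conv = conv_flops(56, 56, 256, 256, 3)  # one mid-network 3x3 conv: ~3.7 GFLOPs
fc = fc_flops(4096, 4096)               # a large FC layer: ~0.03 GFLOPs
```

A single mid-network conv layer alone costs two orders of magnitude more than a 4096x4096 FC layer, so reducing computation means pruning the conv layers.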
TWO DIRECTIONS: Reducing computation

[Figure: a 6x6 input lowered by im2col into a patch matrix and multiplied by the flattened 3x3 filters F1-F9, showing how convolution becomes a GEMM]
In most cases, pruning parameters does not by itself reduce computation.
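The point the figure makes can be reproduced directly: in the im2col/GEMM formulation, zeroing individual weights leaves the matrix shapes (and hence the dense GEMM cost) unchanged, while removing whole filters shrinks a matrix dimension. A minimal NumPy sketch (shapes and the 0.7 threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.standard_normal((9, 36))    # 9 filters, each flattened to 36 weights
patches = rng.standard_normal((36, 100))  # 100 im2col column patches

# Unstructured pruning: zeroing individual weights keeps the GEMM at (9x36)@(36x100).
sparse = filters.copy()
sparse[np.abs(sparse) < 0.7] = 0.0
out_sparse = sparse @ patches             # same shapes, so same dense GEMM cost

# Structured pruning: removing 3 whole filters shrinks the GEMM to (6x36)@(36x100).
out_small = filters[:6] @ patches         # one-third fewer rows, one-third fewer FLOPs
```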
TWO DIRECTIONS: Reducing computation

Shape-wise pruning (cuBLAS-friendly)
Remove entire filters (cuDNN-friendly)
Remove the same kernel in all filters
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning Convolutional Neural Networks for Resource Efficient Inference. ICLR, 2017.
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
Huan Wang, et al. Structured Deep Neural Network Pruning by Varying Regularization Parameters. arXiv:1804.09461, 2018.
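Under the usual (out_channels, in_channels, kH, kW) weight layout, the three granularities correspond to different sparsity patterns; a NumPy sketch (the tensor shape and the pruned indices are illustrative assumptions):

```python
import numpy as np

w = np.ones((9, 4, 3, 3))  # (filters, input channels, kH, kW)

# Filter-wise: drop entire filters, shrinking the output channels (cuDNN-friendly).
w_filter = np.delete(w, [2, 5], axis=0)           # shape (7, 4, 3, 3)

# Kernel-wise: drop the same input-channel kernel from every filter.
w_kernel = np.delete(w, 1, axis=1)                # shape (9, 3, 3, 3)

# Shape-wise: drop the same weight positions in all filters, so the lowered
# weight matrix used by GEMM loses whole columns (cuBLAS-friendly).
lowered = w.reshape(9, -1)                        # (9, 36) lowered weight matrix
w_shape = np.delete(lowered, [0, 7, 19], axis=1)  # shape (9, 33)
```

All three patterns shrink a real tensor dimension, which is why they translate into actual speedups on dense cuDNN/cuBLAS kernels.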
INTRODUCTION TO TENSORRT
NVIDIA TensorRT: From Every Framework, Optimized for Each Target Platform
https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
NVIDIA TensorRT
TensorRT provides model importers for Caffe and TensorFlow. Other framework models can be imported using the Network Definition API.
https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
NVIDIA TensorRT: Deploying a TensorFlow Model with TensorRT
https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
CASE STUDY: ACCELERATING SSD USING NETWORK PRUNING AND TENSORRT
SSD: Single Shot MultiBox Detector
Wei Liu, et al. SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, Springer, Cham, 2016.
[Figure: SSD pipeline: backbone network with Prior Box and Concat layers, followed by NMS]
Pruning Method

[Figure: the SSD VGG16 backbone: stacks of 3x3 conv layers (64, 128, 256, and 512 channels) separated by pool/2 layers, followed by FC 4096 layers]
Only the VGG16 part is pruned.
One-shot filter pruning using the L1 norm:
s_j = Σ_{l=1}^{n_i} ||K_l||_1 (the sum of absolute kernel weights of filter j over its n_i input channels)
Predefine the number of filters to be pruned
conv1: 32
conv2: 32
conv3: 64
conv4: 64
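The selection rule above can be sketched as follows (NumPy; the weight shapes are illustrative, with conv1's count of 32 pruned filters taken from the slide):

```python
import numpy as np

def filters_to_prune(weights, n_prune):
    """weights: (n_filters, n_in, kH, kW). Score filter j by the L1 norm of its
    kernels, s_j = sum over input channels of |K_l|, and pick the lowest scores."""
    scores = np.abs(weights).sum(axis=(1, 2, 3))
    return np.argsort(scores)[:n_prune]

rng = np.random.default_rng(0)
conv1 = rng.standard_normal((64, 3, 3, 3))    # 64 filters over 3 input channels
idx = filters_to_prune(conv1, 32)             # the 32 lowest-scoring filters
conv1_pruned = np.delete(conv1, idx, axis=0)  # shape (32, 3, 3, 3)
```

Because this is one-shot pruning, all selected filters are removed at once and the network is then fine-tuned, rather than pruning iteratively.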
Integrating with TensorRT
[Figure: the SSD pipeline (Prior Box, Concat, NMS) with the layers the UFF parser cannot handle implemented as TensorRT plugins]

TENSORRTAPI INvPlugin * createSSDPriorBoxPlugin(PriorBoxParameters param);
TENSORRTAPI INvPlugin * createConcatPlugin(int concatAxis, bool ignoreBatch);
TENSORRTAPI INvPlugin * createSSDDetectionOutputPlugin(DetectionOutputParameters param);
Result: Pruning results

Model       Filters   Accuracy
VGG16       16896     77.2%
VGG16-0.3   11874     76.31%
VGG16-0.5    8448     74.44%
FPS without applying TensorRT (GPU: Tesla P4, TensorRT 4)

Batch size    1    2    4    8   16
VGG16        20   24   26   27   31
VGG16-0.3    25   28   33   38   39
VGG16-0.5    28   36   36   41   48
FPS applying TensorRT (GPU: Tesla P4, TensorRT 4)

Batch size    1    2    4    8   16
VGG16        63   60   74   72   67
VGG16-0.3    91  112  119  120  118
VGG16-0.5   132  171  187  199  205
FPS applying TensorRT INT8 (GPU: Tesla P4, TensorRT 4)

Batch size    1    2    4    8   16
VGG16       129  175  199  212  219
VGG16-0.3   166  236  277  296  309
VGG16-0.5   200  327  407  448  474
Summary
• Network pruning alone speeds up inference by 1.26x with 0.9% accuracy loss (VGG16-0.3) and by 1.55x with 2.76% accuracy loss (VGG16-0.5).
• TensorRT drastically accelerates SSD inference.
• Combining network pruning with TensorRT INT8 gives the highest speed, about 15x that of the original SSD.
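These factors can be checked against the batch-16 FPS numbers reported above:

```python
# Batch-16 FPS from the charts (Tesla P4, TensorRT 4).
fps_baseline = {"VGG16": 31, "VGG16-0.3": 39, "VGG16-0.5": 48}  # no TensorRT
fps_int8_pruned = 474                                           # VGG16-0.5 + TensorRT INT8

speedup_03 = fps_baseline["VGG16-0.3"] / fps_baseline["VGG16"]  # ~1.26x from pruning alone
speedup_05 = fps_baseline["VGG16-0.5"] / fps_baseline["VGG16"]  # ~1.55x from pruning alone
overall = fps_int8_pruned / fps_baseline["VGG16"]               # ~15.3x combined
```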