TRANSCRIPT
Jie Fang, Xipeng Li, Yong Wang. November 22nd
APPROACHING ULTIMATE SPEED FOR INFERENCE – NETWORK PRUNING AND BEYOND
OUTLINE
Introduction to Network Pruning
Introduction to TensorRT
Case study: Accelerating SSD using Network Pruning and TensorRT
INTRODUCTION TO NETWORK PRUNING
WHAT IS NETWORK PRUNING & WHY
Pruning removes parameters from a given network to reduce its computation cost and storage while affecting accuracy as little as possible.
Pruning -> smaller network -> less computation -> faster inference
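The basic idea can be sketched with a minimal magnitude-pruning example (NumPy; the function name, the weight values, and the 50% sparsity target are illustrative, not part of the talk):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

w = np.array([[0.9, -0.05, 0.4],
              [0.01, -0.7, 0.1]])
pruned = prune_by_magnitude(w, 0.5)  # zeros 0.01, -0.05 and 0.1; keeps 0.9, 0.4, -0.7
```

In practice such pruning is followed by fine-tuning to recover accuracy, which is what the next slide's comparison is about.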
WHAT IS NETWORK PRUNING & WHY
This plot shows that directly training a small network cannot reach the accuracy of a similarly sized network obtained through pruning. (MCR: misclassification rate)
Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured Pruning of Deep Convolutional Neural Networks, arXiv:1512.08571, 2015.
TWO DIRECTIONS: Reducing storage
On ImageNet, AlexNet is reduced by 9x and VGG-16 by 13x without loss of accuracy.
Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. arXiv:1506.02626, 2015.
TWO DIRECTIONS: Reducing computation
Ratio of GFLOPs

            Convolutional Layers   Fully Connected Layers
AlexNet     91.8%                  8.2%
VGG-16      99.2%                  0.8%
ResNet-50   > 99.9%                < 0.1%
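The imbalance follows directly from per-layer FLOP counts; a rough sketch (the layer shapes below are illustrative, not the exact AlexNet or VGG configurations):

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    # Each of the h_out*w_out*c_out outputs needs c_in*k*k multiply-adds (2 FLOPs each).
    return 2 * h_out * w_out * c_in * c_out * k * k

def fc_flops(n_in, n_out):
    # A fully connected layer is a single matrix-vector product.
    return 2 * n_in * n_out

conv = conv_flops(56, 56, 256, 256, 3)  # one mid-network 3x3 conv: ~3.7 GFLOPs
fc = fc_flops(4096, 4096)               # a large FC layer: ~0.03 GFLOPs
```

A single mid-network conv layer alone costs two orders of magnitude more than a 4096x4096 FC layer, so reducing computation means pruning the conv layers.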
TWO DIRECTIONS: Reducing computation

[Figure: a 6x6 input lowered by im2col into a patch matrix and multiplied by the flattened 3x3 filters F1-F9, showing how convolution becomes a GEMM]
In most cases, pruning parameters does not by itself reduce computation.
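The point the figure makes can be reproduced directly: in the im2col/GEMM formulation, zeroing individual weights leaves the matrix shapes (and hence the dense GEMM cost) unchanged, while removing whole filters shrinks a matrix dimension. A minimal NumPy sketch (shapes and the 0.7 threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.standard_normal((9, 36))    # 9 filters, each flattened to 36 weights
patches = rng.standard_normal((36, 100))  # 100 im2col column patches

# Unstructured pruning: zeroing individual weights keeps the GEMM at (9x36)@(36x100).
sparse = filters.copy()
sparse[np.abs(sparse) < 0.7] = 0.0
out_sparse = sparse @ patches             # same shapes, so same dense GEMM cost

# Structured pruning: removing 3 whole filters shrinks the GEMM to (6x36)@(36x100).
out_small = filters[:6] @ patches         # one-third fewer rows, one-third fewer FLOPs
```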
TWO DIRECTIONS: Reducing computation

Shape-wise pruning (cuBLAS-friendly)
Remove entire filters (cuDNN-friendly)
Remove the same kernel in all filters
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning Convolutional Neural Networks for Resource Efficient Inference. ICLR, 2017.
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
Huan Wang, et al. Structured Deep Neural Network Pruning by Varying Regularization Parameters. arXiv:1804.09461, 2018.
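Under the usual (out_channels, in_channels, kH, kW) weight layout, the three granularities correspond to different sparsity patterns; a NumPy sketch (the tensor shape and the pruned indices are illustrative assumptions):

```python
import numpy as np

w = np.ones((9, 4, 3, 3))  # (filters, input channels, kH, kW)

# Filter-wise: drop entire filters, shrinking the output channels (cuDNN-friendly).
w_filter = np.delete(w, [2, 5], axis=0)           # shape (7, 4, 3, 3)

# Kernel-wise: drop the same input-channel kernel from every filter.
w_kernel = np.delete(w, 1, axis=1)                # shape (9, 3, 3, 3)

# Shape-wise: drop the same weight positions in all filters, so the lowered
# weight matrix used by GEMM loses whole columns (cuBLAS-friendly).
lowered = w.reshape(9, -1)                        # (9, 36) lowered weight matrix
w_shape = np.delete(lowered, [0, 7, 19], axis=1)  # shape (9, 33)
```

All three patterns shrink a real tensor dimension, which is why they translate into actual speedups on dense cuDNN/cuBLAS kernels.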
INTRODUCTION TO TENSORRT
NVIDIA TensorRT: From Every Framework, Optimized for Each Target Platform
https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
NVIDIA TensorRT
TensorRT provides model importers for Caffe and TensorFlow. Other framework models can be imported using the Network Definition API.
https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
NVIDIA TensorRT: Deploying a TensorFlow Model with TensorRT
https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
CASE STUDY: ACCELERATING SSD USING NETWORK PRUNING AND TENSORRT
SSD: Single Shot MultiBox Detector
Wei Liu, et al. SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, Springer, Cham, 2016.
[Figure: SSD pipeline: backbone network with Prior Box and Concat layers, followed by NMS]
Pruning Method

[Figure: the SSD VGG16 backbone: stacks of 3x3 conv layers (64, 128, 256, and 512 channels) separated by pool/2 layers, followed by FC 4096 layers]
Only the VGG16 part is pruned.
One-shot filter pruning using the L1 norm:
s_j = Σ_{l=1}^{n_i} ||K_l||_1 (the sum of absolute kernel weights of filter j over its n_i input channels)
Predefine the number of filters to be pruned
conv1: 32
conv2: 32
conv3: 64
conv4: 64
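The selection rule above can be sketched as follows (NumPy; the weight shapes are illustrative, with conv1's count of 32 pruned filters taken from the slide):

```python
import numpy as np

def filters_to_prune(weights, n_prune):
    """weights: (n_filters, n_in, kH, kW). Score filter j by the L1 norm of its
    kernels, s_j = sum over input channels of |K_l|, and pick the lowest scores."""
    scores = np.abs(weights).sum(axis=(1, 2, 3))
    return np.argsort(scores)[:n_prune]

rng = np.random.default_rng(0)
conv1 = rng.standard_normal((64, 3, 3, 3))    # 64 filters over 3 input channels
idx = filters_to_prune(conv1, 32)             # the 32 lowest-scoring filters
conv1_pruned = np.delete(conv1, idx, axis=0)  # shape (32, 3, 3, 3)
```

Because this is one-shot pruning, all selected filters are removed at once and the network is then fine-tuned, rather than pruning iteratively.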
Integrating with TensorRT
[Figure: the SSD pipeline (Prior Box, Concat, NMS) with the layers the UFF parser cannot handle implemented as TensorRT plugins]

TENSORRTAPI INvPlugin * createSSDPriorBoxPlugin(PriorBoxParameters param);
TENSORRTAPI INvPlugin * createConcatPlugin(int concatAxis, bool ignoreBatch);
TENSORRTAPI INvPlugin * createSSDDetectionOutputPlugin(DetectionOutputParameters param);
Result: Pruning results

Model       Filters   Accuracy
VGG16       16896     77.2%
VGG16-0.3   11874     76.31%
VGG16-0.5    8448     74.44%
FPS without applying TensorRT (GPU: Tesla P4, TensorRT 4)

Batch size    1    2    4    8   16
VGG16        20   24   26   27   31
VGG16-0.3    25   28   33   38   39
VGG16-0.5    28   36   36   41   48
FPS applying TensorRT (GPU: Tesla P4, TensorRT 4)

Batch size    1    2    4    8   16
VGG16        63   60   74   72   67
VGG16-0.3    91  112  119  120  118
VGG16-0.5   132  171  187  199  205
FPS applying TensorRT INT8 (GPU: Tesla P4, TensorRT 4)

Batch size    1    2    4    8   16
VGG16       129  175  199  212  219
VGG16-0.3   166  236  277  296  309
VGG16-0.5   200  327  407  448  474
Summary
• Network pruning alone speeds up inference by 1.26x with 0.9% accuracy loss (VGG16-0.3) and by 1.55x with 2.76% accuracy loss (VGG16-0.5).
• TensorRT drastically accelerates SSD inference.
• Combining network pruning with TensorRT INT8 gives the highest speed, about 15x that of the original SSD.
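These factors can be checked against the batch-16 FPS numbers reported above:

```python
# Batch-16 FPS from the charts (Tesla P4, TensorRT 4).
fps_baseline = {"VGG16": 31, "VGG16-0.3": 39, "VGG16-0.5": 48}  # no TensorRT
fps_int8_pruned = 474                                           # VGG16-0.5 + TensorRT INT8

speedup_03 = fps_baseline["VGG16-0.3"] / fps_baseline["VGG16"]  # ~1.26x from pruning alone
speedup_05 = fps_baseline["VGG16-0.5"] / fps_baseline["VGG16"]  # ~1.55x from pruning alone
overall = fps_int8_pruned / fps_baseline["VGG16"]               # ~15.3x combined
```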