APPROACHING TO ULTIMATE SPEED FOR INFERENCE – NETWORK PRUNING AND BEYOND
Jie Fang, Xipeng Li, Yong Wang. November 22


Page 1: APPROACHING TO ULTIMATE SPEED FOR INFERENCE (on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8308.pdf)

Jie Fang, Xipeng Li, Yong Wang. November 22

APPROACHING TO ULTIMATE SPEED FOR INFERENCE – NETWORK PRUNING AND BEYOND

Page 2:

OUTLINE

Introduction to Network Pruning

Introduction to TensorRT

Case study: Accelerating SSD using Network Pruning and TensorRT

Page 3:

INTRODUCTION TO NETWORK PRUNING

Page 4:

WHAT IS NETWORK PRUNING & WHY

Pruning removes parameters from a given network to reduce computation and storage cost while affecting accuracy as little as possible.

Pruning -> smaller network -> less computation -> faster inference

Page 5:

WHAT IS NETWORK PRUNING & WHY

The plot shows that directly training a small network cannot reach the accuracy of a similarly sized network obtained by pruning a larger one. (MCR: misclassification rate)

Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured Pruning of Deep Convolutional Neural Networks, arXiv:1512.08571, 2015.

Page 6:

TWO DIRECTIONS: Reducing storage

On ImageNet, pruning reduces AlexNet by 9x and VGG-16 by 13x without loss of accuracy.

Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. arXiv:1506.02626, 2015.

Page 7:

TWO DIRECTIONS: Reducing computation

Ratio of GFLOPs:

            Convolutional layers   Fully connected layers
AlexNet     91.8%                  8.2%
VGG-16      99.2%                  0.8%
ResNet-50   > 99.9%                < 0.1%
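A rough count makes the table's point concrete. The sketch below uses illustrative AlexNet-style layer shapes (an assumption, not an exact reproduction of AlexNet, and the exact split depends on how FLOPs are counted), yet the convolutional layers still dominate:

```python
# Rough FLOP count (2 ops per multiply-accumulate) for AlexNet-style layers.
# Layer shapes here are illustrative assumptions, not the exact architecture.

def conv_flops(h_out, w_out, c_in, c_out, k):
    # Each of h_out*w_out*c_out outputs needs k*k*c_in multiply-adds.
    return 2 * h_out * w_out * c_out * k * k * c_in

def fc_flops(n_in, n_out):
    return 2 * n_in * n_out

conv = (conv_flops(55, 55, 3, 96, 11) + conv_flops(27, 27, 96, 256, 5)
        + conv_flops(13, 13, 256, 384, 3) + conv_flops(13, 13, 384, 384, 3)
        + conv_flops(13, 13, 384, 256, 3))
fc = fc_flops(9216, 4096) + fc_flops(4096, 4096) + fc_flops(4096, 1000)

print(f"conv share: {conv / (conv + fc):.1%}")  # convolutions dominate
```

So to reduce computation, pruning must target the convolutional layers, which is the direction the following slides take.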

Page 8:

TWO DIRECTIONS: Reducing computation

[Figure: a 6x6 input (l1 ... l36) convolved with a 3x3 filter (F1 ... F9), drawn as the im2col lowering: input patches become columns of a matrix that is multiplied by the flattened filter.]

In most cases, pruning parameters does not by itself reduce computation.
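The point can be sketched in NumPy with the im2col lowering that cuBLAS-backed convolution uses (a minimal sketch, assuming a single 3x3 filter on a 6x6 input): zeroing individual weights leaves the matrix shapes, and hence the GEMM work, unchanged; only removing whole rows or columns shrinks the multiplication.

```python
import numpy as np

def im2col(x, k):
    # Gather every k*k patch of x as one column of a matrix.
    h, w = x.shape
    cols = [x[i:i+k, j:j+k].ravel()
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(cols, axis=1)               # shape (k*k, n_positions)

x = np.arange(36.0).reshape(6, 6)               # the 6x6 input l1 ... l36
F = np.random.default_rng(0).standard_normal((1, 9))  # one 3x3 filter F1 ... F9

cols = im2col(x, 3)                             # (9, 16)
dense_work = F.shape[1] * cols.shape[1]         # multiply-adds in the GEMM

F_sparse = F.copy()
F_sparse[0, ::2] = 0.0                          # unstructured zeros: shapes unchanged
assert (F_sparse @ cols).shape == (F @ cols).shape  # same GEMM cost

F_struct = F[:, :6]                             # drop a whole kernel row instead
struct_work = F_struct.shape[1] * cols[:6].shape[1]
assert struct_work < dense_work                 # structured pruning shrinks the GEMM
```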

Page 9:

TWO DIRECTIONS: Reducing computation

In most cases, pruning parameters does not by itself reduce computation.

[Figure: the same im2col diagram as on the previous page.]

Page 10:

TWO DIRECTIONS: Reducing computation

- Shape-wise pruning (cuBLAS-friendly)

- Remove entire filter (cuDNN-friendly)

- Remove the same kernel in all filters

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning Convolutional Neural Networks for Resource Efficient Inference. ICLR, 2017.
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
Huan Wang, et al. Structured Deep Neural Network Pruning by Varying Regularization Parameters. arXiv:1804.09461, 2018.
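The "remove entire filter" direction has a useful side effect worth spelling out: deleting output filters of one conv layer also deletes the matching input channels of the next layer, so both layers genuinely shrink. A minimal NumPy sketch (the layer shapes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 3, 3, 3))     # layer 1 weights (out_ch, in_ch, kh, kw)
W2 = rng.standard_normal((128, 64, 3, 3))   # layer 2 consumes layer 1's 64 channels

keep = np.arange(64) != 5                   # prune filter 5 of layer 1
W1_pruned = W1[keep]                        # layer 1 loses one output channel...
W2_pruned = W2[:, keep]                     # ...and layer 2 loses the matching input

assert W1_pruned.shape == (63, 3, 3, 3)
assert W2_pruned.shape == (128, 63, 3, 3)
```

Because the result is simply a smaller dense convolution, it runs on unmodified cuDNN kernels, which is what makes this granularity attractive for inference.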

Page 11:

INTRODUCTION TO TENSORRT

Page 12:

NVIDIA TensorRT: From Every Framework, Optimized for Each Target Platform

https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/

Page 13:

NVIDIA TensorRT

TensorRT provides model importers for Caffe and TensorFlow. Other framework models can be imported using the Network Definition API.

https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/

Page 14:

NVIDIA TensorRT: Deploying a TensorFlow Model with TensorRT

https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/

Page 15:

Case study: Accelerating SSD using Network Pruning and TensorRT

Page 16:

SSD: Single Shot MultiBox Detector

Wei Liu, et al. SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, Springer, 2016.

[Figure: the SSD pipeline: backbone network feeding Prior Box, Concat, and NMS stages.]
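The NMS stage at the end of the pipeline can be sketched as follows. This is a minimal class-agnostic version in NumPy, not the actual SSD implementation; boxes are (x1, y1, x2, y2) and the IoU threshold is a hypothetical default:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Greedily keep the highest-scoring box, drop boxes overlapping it too much.
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i+1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # → [0, 2]
```

Here the second box overlaps the first heavily (IoU = 0.81) and is suppressed, while the distant third box survives.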

Page 17:

Pruning Method

[Figure: the VGG16 backbone: stacks of 3x3 conv layers (64, 64, 128, 128, 256, 256, 256, then 512-channel stages) separated by pool/2 layers, followed by FC 4096 layers.]

Only prune VGG16 part

One-shot filter pruning using L1 Norm

s_j = Σ_{l=1}^{n_i} ||K_l||_1

(the score of filter j is the sum of the L1 norms of its n_i input-channel kernels K_l)

Predefine the number of filters to be pruned in each layer:

conv1: 32

conv2: 32

conv3: 64

conv4: 64
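The one-shot criterion above can be sketched in a few lines of NumPy (random weights as a stand-in for the trained model; the function names are illustrative): score every filter by the L1 norm of its weights, then drop the lowest-scoring ones, e.g. 32 filters from conv1.

```python
import numpy as np

def l1_filter_scores(W):
    # W: (out_ch, in_ch, kh, kw); s_j = sum of |K_l| over filter j's kernels.
    return np.abs(W).reshape(W.shape[0], -1).sum(axis=1)

def prune_lowest(W, n_prune):
    s = l1_filter_scores(W)
    keep = np.argsort(s)[n_prune:]       # drop the n_prune smallest scores
    return W[np.sort(keep)]              # preserve the original filter order

rng = np.random.default_rng(0)
conv1 = rng.standard_normal((64, 3, 3, 3))
pruned = prune_lowest(conv1, 32)         # conv1: prune 32 of 64 filters
assert pruned.shape == (32, 3, 3, 3)
```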

Page 18:

Integrating with TensorRT

The pruned backbone is imported through the UFF parser; the SSD-specific stages (Prior Box, Concat, NMS) are created through TensorRT's plugin API:

TENSORRTAPI INvPlugin * createSSDPriorBoxPlugin(PriorBoxParameters param);
TENSORRTAPI INvPlugin * createConcatPlugin(int concatAxis, bool ignoreBatch);
TENSORRTAPI INvPlugin * createSSDDetectionOutputPlugin(DetectionOutputParameters param);

Page 19:

Results: Pruning

            Filters   Accuracy
VGG16       16896     77.2%
VGG16-0.3   11874     76.31%
VGG16-0.5   8448      74.44%

Page 20:

FPS without applying TensorRT (GPU: Tesla P4; TensorRT 4)

Batch size    1    2    4    8    16
VGG16        20   24   26   27   31
VGG16-0.3    25   28   33   38   39
VGG16-0.5    28   36   36   41   48
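Taking the batch-16 figures of the three models (31, 39, and 48 FPS), pruning alone reproduces the speedups quoted in the summary:

```python
# Batch-16 throughput without TensorRT; speedups relative to the unpruned model.
fps = {"VGG16": 31, "VGG16-0.3": 39, "VGG16-0.5": 48}
for name, f in fps.items():
    print(f"{name}: {f / fps['VGG16']:.2f}x")
# → VGG16: 1.00x, VGG16-0.3: 1.26x, VGG16-0.5: 1.55x
```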

Page 21:

FPS applying TensorRT (GPU: Tesla P4; TensorRT 4)

Batch size    1    2    4    8    16
VGG16        63   60   74   72   67
VGG16-0.3    91  112  119  120  118
VGG16-0.5   132  171  187  199  205

Page 22:

FPS applying TensorRT INT8 (GPU: Tesla P4; TensorRT 4)

Batch size    1    2    4    8    16
VGG16       129  175  199  212  219
VGG16-0.3   166  236  277  296  309
VGG16-0.5   200  327  407  448  474

Page 23:

Summary

• Network pruning alone speeds up inference by 1.26x with a 0.9% accuracy loss (VGG16-0.3) and by 1.55x with a 2.76% accuracy loss (VGG16-0.5).

• TensorRT can drastically accelerate the inference of SSD.

• Combining network pruning with TensorRT INT8 gives the highest speed, about 15x faster than the original SSD.
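The 15x figure follows directly from the two charts: VGG16-0.5 with TensorRT INT8 reaches 474 FPS at batch 16, versus 31 FPS for the unpruned model without TensorRT.

```python
# End-to-end speedup: pruned VGG16-0.5 + TensorRT INT8 vs. unpruned baseline.
baseline_fps, best_fps = 31, 474
print(f"{best_fps / baseline_fps:.1f}x")   # → 15.3x
```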

Page 24: