
Energy-Efficient Face Detection Using Andes RISC-V Processor

Presenter: Chien-Hao Chen

Advisor: Prof. Chen-Yi Lee

Date: 2018/03/12

1

Outline

• Introduction

• Face Detector on Andes Processor

• Experiment Result

• Conclusion

• Reference

2


• Motivation

• Face Detection Model



• Conclusion

• Reference

3

Motivation

• Cloud computing

– Image uploaded to the cloud → processing → result returned

• Edge computing

– Image computed directly on the device → processing → result returned

Face Detection Model: MTCNN, 2016 [1]

1. Resize the image and sample it with a sliding window

2. P-Net (Proposal): find candidate bounding boxes

3. R-Net (Refine): reject wrong candidates from P-Net

4. O-Net (Output): starting from the R-Net results, find more accurate face regions

P-Net → R-Net → O-Net

Image from "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks," IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1499-1503, 2016

Face Detection Model

• P-Net (Proposal):

  • Fully convolutional, with 3 convolution layers and 1 max pooling layer

  • Rough proposal

• R-Net (Refine):

  • 3 convolution layers, 2 max pooling layers, and 1 fully connected layer

  • Rejects false proposals from P-Net

• O-Net (Output):

  • 4 convolution layers, 3 max pooling layers, and 1 fully connected layer

  • More complicated model

    → Rejects false results from R-Net

    → Better face bounding box position

Image from "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks," IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1499-1503, 2016


• Face Detector on Andes Processor − Hardware environment

− Model Simplification and Acceleration


• Conclusion

• Reference

7

8

Hardware environment

Andes RISC-V:

− Processor: 60 MHz, 64-bit AndesCore

− FPGA: Xilinx Kintex-7 XC7K410T

− DRAM: 1 GB

− Flash: 64 MB



Depth-wise separable convolution [3]

Model Simplification and Acceleration

Model simplification: depth-wise (DW) MTCNN

• P-Net (Proposal): fully convolutional

  • 1 convolution layer: stride = 2 (channel: 10)

  • 2 DW convolution layers: stride = 1 (channels: 16, 32)

• R-Net (Refine):

  • 1 convolution layer: stride = 2

  • 1 DW convolution layer: stride = 2

  • 1 DW convolution layer: stride = 1

  • 1 fully connected layer

• O-Net (Output):

  • 1 convolution layer: stride = 2

  • 2 DW convolution layers: stride = 2

  • 2 convolution layers: stride = 1 (channels: 128, 128)

  • 1 fully connected layer
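To show why depth-wise separable layers cut the operation count, here is a minimal sketch of the MAC formulas from MobileNets [3]. The layer dimensions below are hypothetical, chosen only for illustration, and are not taken from the slides.

```python
def standard_conv_macs(k, c_in, c_out, h_out, w_out):
    """MACs of a standard k x k convolution producing an h_out x w_out map."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_macs(k, c_in, c_out, h_out, w_out):
    """MACs of a k x k depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in * h_out * w_out
    pointwise = c_in * c_out * h_out * w_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 16 -> 32 channels, 10x10 output map.
std = standard_conv_macs(3, 16, 32, 10, 10)
dws = depthwise_separable_macs(3, 16, 32, 10, 10)
ratio = dws / std  # equals 1/c_out + 1/k**2
```

For this hypothetical layer the separable version needs about 14% of the MACs of the standard convolution.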

11


Motivation

• Ex: if the P-Net input size is 240 × 320, then output1 has size 115 × 155 × 2 and output2 has size 115 × 155 × 4, because

$H_{out} = \frac{H_{in} - H_{filter} + Padding}{Stride} + 1 = \frac{240 - 12 + 0}{2} + 1 = 115$

• Soft-max:

$\sigma(x, y) = \left[ \frac{e^x}{e^x + e^y},\ \frac{e^y}{e^x + e^y} \right]$

→ 6 exponentials and 2 divisions

• For the output1 soft-max:

→ 115 × 155 × 6 ≈ 107k exponentials

→ 115 × 155 × 2 ≈ 35.7k divisions
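The counts above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `conv_out_size` is an illustrative assumption, and the per-position cost (6 exponentials, 2 divisions) is the one stated on the slide.

```python
def conv_out_size(size_in, size_filter, padding, stride):
    """Output size along one axis: (in - filter + padding) // stride + 1."""
    return (size_in - size_filter + padding) // stride + 1


# P-Net seen as an effective 12x12 window with stride 2 and no padding, as in the formula above.
h_out = conv_out_size(240, 12, 0, 2)
w_out = conv_out_size(320, 12, 0, 2)

positions = h_out * w_out
naive_exps = positions * 6  # exponentials for a naive soft-max at every position
naive_divs = positions * 2  # divisions for a naive soft-max at every position
```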

Soft-max approximation

• $\sigma(x, y) = \left[ \frac{e^x}{e^x + e^y},\ \frac{e^y}{e^x + e^y} \right]$

• Compare the face score with the threshold $P$:

$\frac{e^x}{e^x + e^y} > P$

$e^x > P e^x + P e^y$

$(1 - P) e^x > P e^y$

$\ln(1 - P) + x > \ln P + y$

$x > \ln\left(\frac{P}{1 - P}\right) + y$

where $\ln\left(\frac{P}{1 - P}\right)$ is a constant.

• Example: $\frac{e^x}{e^x + e^y} = 0.7$ gives $x = \ln\left(\frac{0.7}{1 - 0.7}\right) + y$
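The derivation can be sanity-checked numerically. The sketch below compares the exponential-free test against a reference soft-max; the function names and the logit pairs are hypothetical, chosen for illustration, and the test assumes 0 < P < 1.

```python
import math


def softmax_face_prob(x, y):
    """Reference two-class soft-max probability of logit x against logit y."""
    m = max(x, y)  # subtract the max for numerical stability
    ex, ey = math.exp(x - m), math.exp(y - m)
    return ex / (ex + ey)


def passes_threshold(x, y, p):
    """Exponential-free test: x > ln(p / (1 - p)) + y, valid for 0 < p < 1."""
    offset = math.log(p / (1.0 - p))  # constant, can be precomputed once per threshold
    return x > offset + y


# Hypothetical logit pairs, for illustration only.
pairs = [(2.0, 0.5), (0.3, 0.1), (-1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
P = 0.7
fast = [passes_threshold(x, y, P) for x, y in pairs]
ref = [softmax_face_prob(x, y) > P for x, y in pairs]
```

Both lists agree, so the per-position cost drops to one addition and one comparison.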



Experiment Result

• On FDDB [4] database:

  • P-Net, R-Net threshold = 0.6, 0.7; min-face = 25x25

Method   Accuracy @ FPPI 0.01   Accuracy @ FPPI 0.1   Accuracy @ FPPI 1.0   Speedup @ Andes RISC-V Processor
MTCNN    84.95%                 92.40%                94.66%                -
Ours     82.59%                 88.15%                90.68%                106x

• FPPI: False Positives Per Image

Experiment Result

• On FDDB database:

• FPPI: False Positives Per Image

Method   Accuracy @ FPPI 1.0   Speedup @ Andes RISC-V Processor
MTCNN    94.66%                -
Ours     90.68%                106x

Method       Accuracy @ FPPI 0.1   Accuracy @ FPPI 0.01   FPS (Titan X GPU / 1080-Ti)
Brodmann17   89.25%                81.88%                 200 / 90
DeepIR       88.45%                82.16%                 <=1
Xiaomi       87.82%                77.99%                 2?
Faceness     86.04%                79.67%                 1
Hyperface    85.63%                80.68%                 0.33
DP2MFD       85.57%                76.73%                 <0.05
Ours         88.15%                82.59%                 54


• Performance without considering face size under 48x48

• P-Net, R-Net threshold = 0.9, 0.85; min-face = 48x48

• P-Net, R-Net threshold = 0.6, 0.7; min-face = 48x48

25

Method Accuracy @


FPPI 0.1

Ours 86.64% 87.7%

Method Accuracy @


FPPI 0.1

Ours 90.53% 93.81%

Experiment Result





Conclusion

• Proposed face detection model

Model size                   3.6x smaller
Speedup @ Andes processor    106x faster
Accuracy @ FPPI 1.0          90.68%

Reference

[1] Zhang, Kaipeng, et al. "Joint face detection and alignment using multitask cascaded convolutional networks." IEEE Signal Processing Letters 23.10 (2016): 1499-1503.

[2] Li, Haoxiang, et al. "A convolutional neural network cascade for face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[3] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).

[4] Jain, Vidit, and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Vol. 2. No. 4. UMass Amherst Technical Report, 2010.


[5] Sun, Xudong, Pengcheng Wu, and Steven CH Hoi. "Face detection using deep learning: An improved faster rcnn approach." Neurocomputing 299 (2018): 42-50.

[6] Jiang, Huaizu, and Erik Learned-Miller. "Face detection with the faster R-CNN." 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017.

[7] Yang, Shuo, et al. "Faceness-net: Face detection through deep facial part responses." IEEE transactions on pattern analysis and machine intelligence 40.8 (2018): 1845-1859.

[8] Ranjan, Rajeev, Vishal M. Patel, and Rama Chellappa. "Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 41.1 (2019): 121-135.


[9] Ranjan, Rajeev, Vishal M. Patel, and Rama Chellappa. "A deep pyramid deformable part model for face detection." 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2015.

Thank you for listening!

Soft-max approximation with NMS

• Thresholding: $\frac{e^x}{e^x + e^y} > P \rightarrow x > \ln\left(\frac{P}{1 - P}\right) + y$

• NMS: keep the candidate with the highest score. Comparing two candidates:

$\frac{e^{x_1}}{e^{x_1} + e^{y_1}} > \frac{e^{x_2}}{e^{x_2} + e^{y_2}}$

$\rightarrow e^{x_1}(e^{x_2} + e^{y_2}) > e^{x_2}(e^{x_1} + e^{y_1})$

$\rightarrow e^{x_1} \cdot e^{x_2} + e^{x_1} \cdot e^{y_2} > e^{x_2} \cdot e^{x_1} + e^{x_2} \cdot e^{y_1}$

$\rightarrow e^{x_1 + y_2} > e^{x_2 + y_1}$

$\rightarrow x_1 + y_2 > x_2 + y_1$

$\rightarrow x_1 - y_1 > x_2 - y_2$

• Speedup: 1.43x faster
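Since only the ordering of scores matters for NMS, candidates can be ranked by the margin x − y without evaluating any exponential. The following is a minimal sketch of that equivalence; the candidate logits and function names are hypothetical.

```python
import math


def softmax_score(x, y):
    """Reference two-class soft-max score."""
    return math.exp(x) / (math.exp(x) + math.exp(y))


def margin(x, y):
    """Exponential-free ranking key: ordering by x - y matches ordering by soft-max score."""
    return x - y


# Hypothetical (x, y) logit pairs for four overlapping candidate boxes.
candidates = [(1.2, 0.4), (3.0, 2.9), (0.5, -1.5), (2.0, 1.0)]
indices = range(len(candidates))

order_by_score = sorted(indices, key=lambda i: softmax_score(*candidates[i]), reverse=True)
order_by_margin = sorted(indices, key=lambda i: margin(*candidates[i]), reverse=True)
```

Both orderings put the same candidate first, so NMS keeps the same box either way.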

Computational Complexity

Experiment Result: model operation complexity comparison

Original MTCNN

Network   Input size   MAC number
P-Net     12x12        44.76K
P-Net*    120x160      55x75x44.76K = 184.6M
R-Net     24x24        1.531M
O-Net     48x48        12.91M

Ours

Network   Input size   MAC number
P-Net     12x12        7.872K
P-Net*    120x160      55x75x7.872K = 32.47M
R-Net     24x24        319.3K
O-Net     48x48        2.267M

*: P-Net's input is taken to be a full 120x160 image rather than a single 12x12 block.
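The P-Net* rows multiply the per-window cost by the number of window positions in the image. Here is a short sketch of that arithmetic, assuming a 12x12 window with stride 2 (which reproduces the 55x75 grid in the tables); the helper names are illustrative.

```python
def sliding_positions(size_in, window=12, stride=2):
    """Number of window positions along one axis."""
    return (size_in - window) // stride + 1


def full_image_macs(height, width, macs_per_window):
    """Total MACs when the per-window cost is paid at every window position."""
    return sliding_positions(height) * sliding_positions(width) * macs_per_window


orig_pnet = full_image_macs(120, 160, 44760)  # 44.76K MACs per window (original MTCNN)
ours_pnet = full_image_macs(120, 160, 7872)   # 7.872K MACs per window (ours)
reduction = orig_pnet / ours_pnet
```

On these figures the simplified P-Net needs roughly 5.7x fewer MACs per image.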

Quantization

Experiment Result: model size comparison

Original MTCNN

Network   Data type   Model size (bytes)
P-Net     float32     26.04K
R-Net     float32     398.5K
O-Net     float32     1.542M
Total                 1.966M

Ours

Network   Data type   Model size (bytes)
P-Net     int8        1.088K
R-Net     int8        137.4K
O-Net     int8        402.6K
Total                 541.2K


Quantization Result (ANDES DSP)

Word length       Accuracy @ FPPI 0.1
Original MTCNN    92.40%
Ours (float32)    88.20%
Ours (int8)       88.15%

• FPPI: False Positives Per Image

Quantization Method (ANDES DSP)

• Weight quantization

shift number = 7 − ceil(log2(max(abs(weight min), abs(weight max))))

shifted weights = round down(old weights × 2^(shift number))

shifted weights[shifted weights > 126] = 127

shifted weights[shifted weights < −127] = −128

final weights = shifted weights ÷ 2^(shift number)

48

Quantization Method (ANDES DSP)

• Layer output quantization

shift number = 7 − ceil(log2(max(abs(layer output min), abs(layer output max))))

while (shift_start):

    output = round down(output × 2^(shift number))

    output[output > 126] = 127

    output[output < −127] = −128

    final output = output ÷ 2^(shift number)

    shift number += 1

49

Quantization Method (ANDES DSP): Example

• output = [−4, −0.24, −0.20, …, 0.19, 0.23, 4]

• shift number = 7 − log2(4) = 5

• original quantization (shift 5) = [−4, −0.25, −0.1875, …, 0.1875, 0.21875, 3.96875]

• corrected quantization (shift 6) = [−2, −0.234375, −0.203125, …, 0.1875, 0.234375, 1.984375]

→ More precise
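The example can be reproduced with a small sketch of the shift-based quantization. Two assumptions are made here: the function names are hypothetical, and rounding is to the nearest integer. The slides say "round down", but the listed example values match round-to-nearest, so that is the default used below; only the six values written out on the slide are used.

```python
import math


def shift_number(values):
    """7 - ceil(log2(max |value|)), as on the quantization slides."""
    return 7 - math.ceil(math.log2(max(abs(v) for v in values)))


def quantize(values, shift, rounding=round):
    """Scale by 2**shift, round, saturate to [-128, 127], and scale back."""
    scale = 2 ** shift
    result = []
    for v in values:
        q = max(-128, min(127, rounding(v * scale)))  # int8 saturation
        result.append(q / scale)
    return result


# The six values written out in the example (the elided middle is omitted).
output = [-4, -0.24, -0.20, 0.19, 0.23, 4]

base_shift = shift_number(output)             # 7 - log2(4)
q5 = quantize(output, base_shift)             # original quantization
q6 = quantize(output, base_shift + 1)         # one extra fractional bit
q5_floor = quantize(output, base_shift, rounding=math.floor)  # literal "round down"
```

With shift 6 the two extreme values saturate, while the small values that make up most of the output land closer to their original values. With a literal floor, −0.20 would map to −0.21875 instead of the −0.1875 shown in the example.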

ANDES DSP

1 3

Speed-up each step

55


56

Experiment Result

• FPPI: False Positives Per Image

Method      Accuracy @ FPPI 1.0   Speedup @ Andes RISC-V Processor
Ori-MTCNN   94.66%                -
Ours        90.68%                106x

Method       Accuracy @ FPPI 0.1   Accuracy @ FPPI 0.01   FPS (Titan X GPU / 1080-Ti)
Brodmann17   89.25%                81.88%                 200 / 90
DeepIR       88.45%                82.16%                 <=1
Xiaomi       87.82%                77.99%                 2?
Faceness     86.04%                79.67%                 1
Hyperface    85.63%                80.68%                 0.33
DP2MFD       85.57%                76.73%                 <0.05
MTCNN        92.40%                84.95%                 51
Ours         88.15%                82.59%                 54

57

Step                      Baseline   Sim#1     Fast soft-max   DSP-Sim#1   DSP-Sim#2
Overall                   294.0129   99.81     53.69           3.88        2.78
Overall speedup           -          2.95      1.86            13.84       1.397
FPS                       0.0034     0.01002   0.01863         0.25776     0.3601
P-Net overall time        97.25      77.2      31.2            1.54        1.18
P-Net overall speedup     -          1.26      2.47            20.30       1.30
R-Net overall time        59.08      6.158     6.028           0.989       0.628
R-Net trigger times       46         22        22              32          29
R-Net normalize           1.28       0.28      0.274           0.0309      0.022
R-Net normalize speedup   -          4.59      1.02            8.87        1.43
O-Net overall time        132.19     15.034    15.004          1.35        0.96
O-Net trigger times       14         9         9               8           9
O-Net normalize           9.44       1.67      1.67            0.17        0.107
O-Net normalize speedup   -          5.65      1.002           9.9         1.57
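The rows of this table are related by simple ratios, sketched below. Two readings are assumed rather than stated on the slide: the "Overall" values are seconds per frame (consistent with the FPS row being their reciprocal), and the "normalize" rows are the overall time divided by the trigger times.

```python
stages = ["Baseline", "Sim#1", "Fast soft-max", "DSP-Sim#1", "DSP-Sim#2"]
overall_time = [294.0129, 99.81, 53.69, 3.88, 2.78]  # "Overall" row, rounded values

# Speedup of each stage over the previous one ("Overall speedup" row).
step_speedup = [prev / cur for prev, cur in zip(overall_time, overall_time[1:])]

# Speedup of the final stage over the baseline.
cumulative_speedup = overall_time[0] / overall_time[-1]

# Frames per second, assuming the times are seconds per frame.
fps = [1.0 / t for t in overall_time]

# Baseline R-Net time per trigger ("R-Net normalize" row).
r_net_normalized = 59.08 / 46
```

The cumulative ratio rounds to the 106x speedup quoted in the results.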

58

Step                      Baseline               Sim#1                  Fast soft-max          DSP-Sim#1   DSP-Sim#2
Overall                   294.0129               99.8111858             53.687959              3.879538    2.777296
Overall speedup           -                      2.94569088             1.8590982              13.8388     1.396875954
FPS                       0.003401210740668638   0.010018917134550104   0.018626150268485554   0.25776     0.360062449
P-Net overall time        97.248312473297119     77.170741379261017     31.195177435874939     1.536423    1.180413
P-Net overall speedup     -                      1.26017077             2.4738036              20.3038     1.301597831
R-Net overall time        59.077883005142212     6.1582962274551392     6.0284666419029236     0.988531    0.627762
R-Net trigger times       46                     22                     22                     32          29
R-Net normalize           1.284302               0.27992256             0.2740212              0.03089     0.021646966
R-Net normalize speedup   -                      4.58806178             1.0215361              8.87087     1.426989815
O-Net overall time        132.18732833862305     15.033685207366943     15.003592789173126     1.345193    0.961341
O-Net trigger times       14                     9                      9                      8           9
O-Net normalize           9.441952               1.67040947             1.6670659              0.16815     0.106815667
O-Net normalize speedup   -                      5.65247753             1.0020057              9.91416     1.574207274
