
Page 1:

OPTIMIZING FPGA-BASED ACCELERATOR DESIGN FOR DEEP CONVOLUTIONAL NEURAL NETWORKS

Surabhi Singh (ss3659), Mohammad Dohadwala (md874), Godly K. Thomas (gkt27)

Page 2:

YOLO: REAL-TIME OBJECT DETECTION SYSTEM

• YOLO..? You Only Live Once?

• YOU ONLY LOOK ONCE – a state-of-the-art real-time object detection system

• Can process up to 200 frames per second

Page 3:

HOW IT WORKS

• YOLO does the following:

• Applies a single neural network to the resized full image (448×448)

• Divides the image into regions and predicts probabilities for each region

• Bounding boxes are weighted by the predicted probabilities

Image Courtesy: http://pjreddie.com/darknet/yolo/

Page 4:

NETWORK DESIGN

• 24 convolutional layers followed by two fully connected layers

• Each convolutional layer has input and output feature maps

Image Courtesy: See Reference[1]

Page 5:

WHERE IS HLDDA?

• Optimize the execution of a single convolutional layer in hardware (FPGA) while executing the other convolutional layers in software (ARM)

• Use pragmas learnt in class to increase throughput

• Loop unrolling, loop pipelining, and hardware tiling for the function executed in hardware (a rough sketch of these directives follows)
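As a rough illustration only (the loop structure, tile size, and names below are ours, not the project's actual code), the three directives applied to the column loop of a single output row look like this in Vivado HLS:

#define C  448   /* output feature-map columns (illustrative)  */
#define K  3     /* kernel size                                 */
#define TC 7     /* column tile size (illustrative, divides C)  */

/* One output row of a 1-D convolution: the column loop is split into
 * fixed-size tiles (in a full design each tile's data would sit in
 * on-chip buffers), the per-tile loop is pipelined, and the small
 * kernel loop is fully unrolled. */
void row_compute(float out_row[C], const float in_row[C + K - 1],
                 const float w[K])
{
    for (int cc = 0; cc < C; cc += TC) {          /* hardware tiling  */
        for (int c = cc; c < cc + TC; c++) {
#pragma HLS PIPELINE II=1                         /* loop pipelining  */
            float acc = 0.0f;
            for (int j = 0; j < K; j++) {
#pragma HLS UNROLL                                /* loop unrolling   */
                acc += w[j] * in_row[c + j];
            }
            out_row[c] = acc;
        }
    }
}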

Page 6:

SINGLE CONVOLUTIONAL LAYER

[Figure: weight kernels (w11, w12, w13, w14, ...) are applied to the input feature maps to produce the output feature maps]

Page 7:

KERNEL COMPUTATION

• Pseudo Code

KERNEL (WEIGHT) MULTIPLICATION

Image Courtesy: See Reference[3]
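The pseudo-code itself appears only as an image; the loop nest it refers to is the standard convolutional-layer computation, sketched below in C with the layer parameters N, M, R, C, K, S from the next slide (the array names and the small dimensions are our own, chosen for illustration):

#define M 4   /* output feature maps (illustrative) */
#define N 3   /* input feature maps  (illustrative) */
#define R 8   /* output rows         (illustrative) */
#define C 8   /* output columns      (illustrative) */
#define K 3   /* kernel size                        */
#define S 1   /* stride                             */

/* One convolutional layer: each output pixel is a K x K dot product over
 * all N input feature maps, with one K x K weight kernel per
 * (output map, input map) pair. The input is assumed pre-padded to
 * (R-1)*S + K rows and (C-1)*S + K columns. */
void conv_layer(float output_fm[M][R][C],
                const float input_fm[N][(R - 1) * S + K][(C - 1) * S + K],
                const float weights[M][N][K][K])
{
    for (int m = 0; m < M; m++)                     /* output feature map */
        for (int r = 0; r < R; r++)                 /* output row         */
            for (int c = 0; c < C; c++) {           /* output column      */
                float acc = 0.0f;
                for (int n = 0; n < N; n++)         /* input feature map  */
                    for (int i = 0; i < K; i++)     /* kernel row         */
                        for (int j = 0; j < K; j++) /* kernel column      */
                            acc += weights[m][n][i][j] *
                                   input_fm[n][r * S + i][c * S + j];
                output_fm[m][r][c] = acc;
            }
}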

Page 8:

IMPLEMENTATION: FEATURE MAP CALCULATION

Convolutional layer           1        2        3        4        5        6        7        8
input_fm (N)                  3        16       32       64       128      256      512      1024
output_fm (M)                 16       32       64       128      256      512      1024     256
fm row (R)                    448      224      112      56       28       14       7        7
fm col (C)                    448      224      112      56       28       14       7        7
kernel (K)                    3        3        3        3        3        3        3        3
stride (S)                    1        1        1        1        1        1        1        1
set #                         1        1        1        1        1        1        1        1
Weights (MBytes)              0.0033   0.0352   0.1406   0.5625   2.25     9        36       18
Input feature map (MBytes)    41.3438  55.125   27.5625  13.7813  6.8906   3.4453   1.7227   3.4453
Output feature map (MBytes)   24.5     12.25    6.125    3.0625   1.5313   0.7656   0.3828   0.0957
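For what it is worth, every number in the table is reproduced by assuming 8-byte (double-precision) elements, weight buffers of M·N·K² elements, output feature maps of M·R·C elements, and input buffers of N·(K·R)·(K·C) elements (an im2col-style expansion); this is our inference from the numbers, not something stated on the slide. A small self-check:

#include <stdio.h>

/* Per-layer parameters copied from the table above. */
typedef struct { int N, M, R, C, K, S; } layer_t;

static const layer_t layers[8] = {
    {    3,   16, 448, 448, 3, 1 },
    {   16,   32, 224, 224, 3, 1 },
    {   32,   64, 112, 112, 3, 1 },
    {   64,  128,  56,  56, 3, 1 },
    {  128,  256,  28,  28, 3, 1 },
    {  256,  512,  14,  14, 3, 1 },
    {  512, 1024,   7,   7, 3, 1 },
    { 1024,  256,   7,   7, 3, 1 },
};

int main(void)
{
    const double MB   = 1024.0 * 1024.0;
    const double elem = 8.0;   /* assumed 8-byte (double) elements */

    for (int i = 0; i < 8; i++) {
        layer_t l = layers[i];
        double w   = (double)l.M * l.N * l.K * l.K * elem;            /* weights             */
        double in  = (double)l.N * (l.K * l.R) * (l.K * l.C) * elem;  /* im2col-style input  */
        double out = (double)l.M * l.R * l.C * elem;                  /* output feature map  */
        printf("layer %d: weights %.4f MB, input fm %.4f MB, output fm %.4f MB\n",
               i + 1, w / MB, in / MB, out / MB);
    }
    return 0;
}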

Page 9:

IMPLEMENTATION: OPTIMIZATION

• Hardware optimization: achieve II = 1 (any ideas?)

• Communication overhead: reduce latency

Page 10:

CODE OPTIMIZATION

• Code Snippet

Image Courtesy: See Reference[3]

Page 11:

HARDWARE OPTIMIZATION STEPS TO ACHIEVE II = 1

• Stream data into local RAM for the inputs (multiple accesses required)

• Pipeline the dot-product loop to fully unroll it

• Separate the multiply and accumulate in the inner loop to force two floating-point operators

• Actual II = 14 – resource constraint: limited memory ports

• Array partition (guesses?)

• Partition the local RAMs into N/2 sub-arrays for fully parallel access (dual-port read)

• Expected II = 1 (a minimal sketch of these steps follows)
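A minimal sketch of those steps, under our own assumptions (dot-product length N = 16, L dot products per call, illustrative names; this is not the project's actual kernel):

#define N 16    /* dot-product length (illustrative)              */
#define L 448   /* dot products computed per call (illustrative)  */

/* Inputs are first streamed into local RAMs because each element is read
 * many times; the local arrays are partitioned into N/2 = 8 sub-arrays so
 * that dual-port reads deliver all N operands in one cycle; multiply and
 * accumulate are written as separate statements so HLS infers two
 * floating-point operators. */
void dot_engine(float out[L], const float vec[L][N], const float w[N])
{
    float vec_buf[L][N];
    float w_buf[N];
#pragma HLS ARRAY_PARTITION variable=vec_buf cyclic factor=8 dim=2
#pragma HLS ARRAY_PARTITION variable=w_buf   cyclic factor=8

    /* Stream data into local RAM. */
    for (int i = 0; i < L; i++)
        for (int n = 0; n < N; n++) {
#pragma HLS PIPELINE II=1
            vec_buf[i][n] = vec[i][n];
            if (i == 0) w_buf[n] = w[n];
        }

    /* Pipelining this loop with II = 1 fully unrolls the inner dot-product
     * loop, which is what demands N parallel reads per cycle and hence the
     * array partitioning above. */
    for (int i = 0; i < L; i++) {
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
        for (int n = 0; n < N; n++) {
            float prod = vec_buf[i][n] * w_buf[n];  /* separate multiply   */
            acc = acc + prod;                       /* ... and accumulate  */
        }
        out[i] = acc;
    }
}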

Page 12:

RESULTS & CONCLUSION

Design                        Average function runtime   Total 1st-layer runtime   Speedup   Total runtime (s)
                              (CPU cycles)               (CPU cycles)
SW/HW BASELINE (ARM+FPGA)     22,135,862                 7,415,513,736             1X        30.92
SW BASELINE (ARM)             2,763,683                  925,833,786               8X        22.81
SW/HW ALTERNATE (ARM+FPGA)    323,961                    108,526,936               68X       21.67
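For reference, the reported speedups match the ratio of total first-layer cycles against the SW/HW baseline: 7,415,513,736 / 925,833,786 ≈ 8.0 and 7,415,513,736 / 108,526,936 ≈ 68.3, so the optimized hardware/software design executes the first layer roughly 68 times faster than the unoptimized hardware offload.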

Page 13:

FINAL PROJECT LINK

• https://youtu.be/v9Bt19h6OUM

Page 14:

ACKNOWLEDGEMENT

• Special thanks to Professor Zhiru Zhang for helping us throughout the project and guiding us when we faced roadblocks.

• We would also like to acknowledge suggestions from Steve Dai and Hsiang Yi, PhD students under Professor Zhiru Zhang.

Page 15:

REFERENCES

• [1] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks." FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015.

• [2] YOLO: Real-Time Object Detection System. http://pjreddie.com/darknet/yolo/

• [3] YOLO (darknet) GitHub repository. https://github.com/pjreddie/darknet