


A Demonstration of FPGA-based You Only Look Once version2 (YOLOv2)

Hiroki Nakahara
Tokyo Institute of Technology, Japan

Masayuki Shimoda
Tokyo Institute of Technology, Japan

Shimpei Sato
Tokyo Institute of Technology, Japan

Fig. 1. Proposed Mixed-precision CNN for YOLOv2. (Figure: a binary CNN performs feature extraction, and its feature maps feed a half-precision CNN that performs localization and classification.)

Abstract—We implement the YOLO (You Only Look Once) object detector on an FPGA with both high speed and high accuracy. The detector is based on a deep convolutional neural network (CNN), which dominates both the performance and the area of the design. Object detectors are widely used in embedded systems, such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. A frame object detection problem consists of two sub-problems: one is a regression problem to spatially separated bounding boxes, and the other is the associated classification of the objects, both within a real-time frame rate. We use a binary (1-bit) precision CNN for feature extraction and a half-precision (16-bit) CNN for both classification and localization. We implement a pipelined architecture for the mixed-precision YOLOv2 on the Xilinx ZCU102 board, which carries a Xilinx Zynq UltraScale+ MPSoC. The implemented object detector achieved 35.71 frames per second (FPS), which is faster than the standard video rate (29.97 FPS). Compared with a CPU and a GPU, the FPGA-based accelerator was superior in performance-per-power efficiency. Our method is suitable as a frame object detector for an embedded vision system.

I. INTRODUCTION

Convolutional neural networks (CNNs) are essentially a cascaded set of pattern recognition filters that are trained with big data [1]. They enable us to solve complex problems for a wide range of computer vision applications. This demonstration shows an FPGA implementation of a frame object detector, YOLO (You Only Look Once) [3], which is used in embedded vision systems such as robots, automobiles, security cameras, and drones. Such systems require high performance-per-power detection on an inexpensive device.

II. MIXED-PRECISION CNN

In past work, we showed that a mixed-precision CNN (binary (1-bit) and half (16-bit) precision) is suitable for a complex problem, YOLO object detection [2]. In that implementation, we proposed a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and parallel support vector regression (SVR) for both classification and localization.

Fig. 2. Overall architecture for YOLOv2.

In this demonstration, we implement a mixed-precision CNN, which consists of a binarized CNN for the former part and a half-precision (16-bit) CNN for the latter part. Since the mixed-precision CNN consists entirely of neural network layers, we can apply the standard back-propagation training algorithm; thus, training is more straightforward than for the SVR-CNN mixed version. Fig. 1 shows the proposed mixed-precision CNN. Other machine-learning regressors could also be substituted in the localization layer, where highly accurate regression is required. In our design, a half-precision CNN is used in parallel for both localization and classification, while a binary-precision CNN is used for feature extraction. The existing YOLOv2 adopts an FCN (fully convolutional network) structure, so the convolution operation is executed in all layers.
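To make the data-type split concrete, the following is a minimal, framework-free Python sketch of one binarized layer feeding a half-precision layer; all shapes and names are toy illustrations, not our actual network or circuit.

    import numpy as np

    def binarize(x):
        # Quantize real values to {-1, +1}, as in a binary (1-bit) layer.
        return np.where(x >= 0, 1.0, -1.0).astype(np.float32)

    def conv3x3(fmaps, weights, dtype=np.float32):
        # Naive 3x3 same-padding convolution. fmaps: (C_in, H, W),
        # weights: (C_out, C_in, 3, 3). For illustration only.
        c_in, h, w = fmaps.shape
        c_out = weights.shape[0]
        padded = np.pad(fmaps.astype(dtype), ((0, 0), (1, 1), (1, 1)))
        out = np.zeros((c_out, h, w), dtype=dtype)
        for co in range(c_out):
            for ci in range(c_in):
                for kh in range(3):
                    for kw in range(3):
                        out[co] += weights[co, ci, kh, kw] * \
                                   padded[ci, kh:kh + h, kw:kw + w]
        return out

    # Former part: binarized activations and binarized weights (1 bit).
    image = np.random.randn(3, 32, 32).astype(np.float32)
    w_bin = binarize(np.random.randn(16, 3, 3, 3))
    features = binarize(conv3x3(binarize(image), w_bin))

    # Latter part: half-precision (float16) weights and activations.
    w_half = np.random.randn(8, 16, 3, 3).astype(np.float16)
    head_out = conv3x3(features, w_half, dtype=np.float16)
    print(head_out.shape, head_out.dtype)  # (8, 32, 32) float16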

A. YOLOv2 Accelerator Implementation on an FPGA

Fig. 2 shows the overall architecture of the proposed YOLOv2. Our architecture has two weight caches: a binarized cache used by the 2D binarized convolutional circuit, and a half-precision cache used by the latter convolutional circuit. The off-chip DDR memory stores all the weights. Since the weight-load operations are not dominant for either the binarized or the half-precision computations, the proposed architecture achieves a high-performance object detector. The result is sent to the host ARM processor, which then performs the post-processing, since that processing is light. Also, our implementation achieves a higher computation speed than a conventional one, since it performs the convolutional operation on many feature maps at a time.
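On the hardware side, a binarized multiply-accumulate is typically reduced to an XNOR followed by a popcount; this is the standard technique for binarized CNN circuits, and the following Python sketch shows the equivalence (an illustration, not our actual RTL).

    import numpy as np

    def dot_reference(a, w):
        # Reference dot product with elements in {-1, +1}.
        return int(np.dot(a, w))

    def dot_xnor_popcount(a_bits, w_bits, n):
        # Encode -1 as bit 0 and +1 as bit 1; then for n-bit vectors
        # dot(a, w) = 2 * popcount(XNOR(a, w)) - n.
        xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)
        return 2 * bin(xnor).count("1") - n

    n = 16
    a = np.random.choice([-1, 1], n)
    w = np.random.choice([-1, 1], n)
    # Pack each {-1, +1} vector into an n-bit integer.
    a_bits = sum(1 << i for i in range(n) if a[i] > 0)
    w_bits = sum(1 << i for i in range(n) if w[i] > 0)
    assert dot_reference(a, w) == dot_xnor_popcount(a_bits, w_bits, n)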


TABLE I. PARAMETERS FOR THE IMPLEMENTED CNN OF THE YOLOV2.

Layer        # In. Fmaps   # Out. Fmaps   Kernel Size   Out. Fmap Size
(Feature Extraction)
Bin Conv1    3             128            3 × 3         128 × 128
Bin Conv2    128           128            3 × 3         128 × 128
Max Pool     128           128            2 × 2         64 × 64
Bin Conv1    128           256            3 × 3         64 × 64
Bin Conv2    256           256            3 × 3         64 × 64
Bin Conv2    256           256            3 × 3         64 × 64
Max Pool     256           256            2 × 2         32 × 32
(Localization + Classification)
Half Conv2   256           256            3 × 3         32 × 32
Max Pool     256           256            2 × 2         16 × 16
Half Conv2   256           256            3 × 3         16 × 16
Max Pool     256           256            2 × 2         8 × 8
Half Conv2   256           40             1 × 1         8 × 8
Accuracy (mAP): 64.6
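As a sanity check, the output feature-map sizes in Table I can be reproduced by walking the layer stack, assuming stride-1, pad-1 3 × 3 convolutions and stride-2 2 × 2 pooling:

    # Walk the Table I layer stack and confirm the output sizes.
    # Assumption: 3x3 convs use stride 1 / pad 1 and the 1x1 conv uses
    # stride 1, so convs preserve H x W, while 2x2 stride-2 pooling
    # halves it.
    def conv(x, c_out):            # (c, h, w) -> (c_out, h, w)
        _, h, w = x
        return (c_out, h, w)

    def pool(x):                   # (c, h, w) -> (c, h // 2, w // 2)
        c, h, w = x
        return (c, h // 2, w // 2)

    x = (3, 128, 128)                                  # input image
    x = conv(x, 128); x = conv(x, 128); x = pool(x)    # -> (128, 64, 64)
    x = conv(x, 256); x = conv(x, 256); x = conv(x, 256)
    x = pool(x)
    assert x == (256, 32, 32)      # end of the binarized feature extractor
    x = conv(x, 256); x = pool(x)                      # -> (256, 16, 16)
    x = conv(x, 256); x = pool(x)                      # -> (256, 8, 8)
    x = conv(x, 40)                # final 1x1 conv -> detection tensor
    assert x == (40, 8, 8)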

Fig. 3. System diagram. (Figure: the host PC sends images over an RJ45 Ethernet connection to the Xilinx ZCU102 evaluation board, which carries the Xilinx Zynq UltraScale+ MPSoC ZU9EG; the YOLO CNN and the processor on the board return the detection result, i.e., the category (car, person) and the location (x, y, h, w), to the host PC.)

III. IMPLEMENTATION

We implemented the proposed mixed-precision YOLOv2 on the Xilinx Zynq UltraScale+ MPSoC ZCU102 evaluation board, which carries the Xilinx Zynq UltraScale+ MPSoC FPGA (ZU9EG: 274,080 LUTs, 548,160 FFs, 1,824 18-Kb BRAMs, 2,520 DSP48Es). Fig. 3 shows a system diagram of the demonstration. First, the host PC sends an image to the FPGA board via Ethernet. For the demonstration, we captured a driving video and stored it on the host PC. Next, the YOLO CNN on the FPGA detects objects and sends their categories and locations to the host PC. Finally, the host PC displays the detection result. The architecture processed an image in 28.0 msec, so the number of frames per second (FPS) was 35.71. We measured the dynamic board power consumption: it was 4.5 W. Thus, the performance-per-power efficiency was 7.93 FPS/W.
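These figures follow directly from the measured latency and power (the 7.93 in the text reflects slightly different intermediate rounding):

    latency_ms = 28.0              # measured time per image
    power_w = 4.5                  # measured dynamic board power
    fps = 1000.0 / latency_ms      # 35.71 frames per second
    print(fps, fps / power_w)      # 35.71..., about 7.94 FPS/W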

We compared our mixed-precision YOLOv2 on an FPGA with other embedded platforms. We used the NVidia Jetson TX2 board, which has both an embedded CPU (ARM Cortex-A57) and an embedded GPU (Pascal GPU). For both embedded platforms, we used the original YOLO version 2 from the Darknet deep learning framework [3]. Table II compares our FPGA implementation with the other platforms. Compared with the ARM Cortex-A57, it was 155.2 times faster, dissipated 1.1 times more power, and its performance-per-power efficiency was 141.1 times better. Also, compared with the Pascal GPU, it was 24.1 times faster, dissipated 1.5 times less power, and its performance-per-power efficiency was 36.1 times better. Thus, an FPGA-based node is suitable as the frame object detector for a ROS node of a robot system.

TABLE II. COMPARISON WITH EMBEDDED PLATFORMS WITH RESPECT TO THE YOLOV2 DETECTION (BATCH SIZE IS 1).

Platform            Embedded CPU               Embedded GPU          FPGA
Device              Quad-core ARM Cortex-A57   256-core Pascal GPU   Zynq UltraScale+ MPSoC
Clock Freq.         1.9 GHz                    1.3 GHz               0.3 GHz
Memory              32 GB eMMC Flash           8 GB LPDDR4           32.1 Mb BRAM
Time [msec]         4210.0                     715.9                 28.0
(FPS) [sec^-1]      (0.23)                     (1.48)                (35.71)
Power [W]           4.0                        7.0                   4.5
Efficiency [FPS/W]  0.057                      0.211                 7.93

Fig. 4. Photograph of a demonstration.

IV. DEMONSTRATION

We clipped the driving video frames from YouTube. Then we assigned annotations to each frame. Next, we trained the mixed-precision YOLOv2 with our training system, which is built on the Chainer deep learning framework. Its recognition accuracy (mAP) was 85.2%. Fig. 4 shows a photograph of the demonstration, which detects objects (persons and vehicles) in a driving scene. It satisfies the requirements of a driving-support system. Also, our demo can be watched on YouTube [4].
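For reference, a single Chainer training step for such a model would look like the following minimal sketch, where MixedPrecisionYOLOv2, yolo_loss, and train_iterator are hypothetical stand-ins for our training code, which is not shown here.

    import chainer
    from chainer import optimizers

    # Hypothetical names standing in for our model and detection loss.
    model = MixedPrecisionYOLOv2()
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    for images, annotations in train_iterator:   # annotated video frames
        pred = model(images)                     # forward pass
        loss = yolo_loss(pred, annotations)      # localization + classification
        model.cleargrads()
        loss.backward()                          # standard back-propagation
        optimizer.update()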

V. ACKNOWLEDGEMENTS

This research is supported in part by the Grants-in-Aid for Scientific Research of JSPS, and by the New Energy and Industrial Technology Development Organization (NEDO). We also thank the Xilinx University Program (XUP), the Intel University Program, and NVidia Corp. for their support.

REFERENCES

[1] Y. LeCun, Y. Bengio and G. Hinton, "Deep Learning," Nature, Vol. 521, 2015, pp. 436-444.

[2] H. Nakahara, H. Yonekawa, T. Fujii and S. Sato, "A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA," FPGA, 2018, pp. 31-40.

[3] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” arXivpreprint arXiv:1612.08242, 2016.

[4] H. Nakahara, "FPGA YOLOv2 on the Xilinx ZCU102 Zynq UltraScale+ MPSoC Board," https://www.youtube.com/watch?v= iMboyu8iWc