


A Demonstration of FPGA-based You Only Look Once version2 (YOLOv2)

Hiroki Nakahara
Tokyo Institute of Technology, Japan

Masayuki Shimoda
Tokyo Institute of Technology, Japan

Shimpei Sato
Tokyo Institute of Technology, Japan

Fig. 1. Proposed Mixed-precision CNN for YOLOv2. (Figure: a binary CNN performs feature extraction, and its feature maps feed a half-precision CNN that performs localization and classification.)

Abstract—We implement the YOLO (You Only Look Once) object detector on an FPGA with both high speed and high accuracy. The detector is based on a deep convolutional neural network (CNN), which dominates both the performance and the area of the design. Object detectors are widely used in embedded systems, such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. A frame object detection problem consists of two sub-problems: one is a regression problem to spatially separated bounding boxes, and the other is the associated classification of the objects, both within a real-time frame rate. We use a binary (1-bit) precision CNN for feature extraction and a half-precision (16-bit) CNN for both classification and localization. We implement a pipelined architecture for the mixed-precision YOLOv2 on the Xilinx ZCU102 board, which carries a Xilinx Zynq UltraScale+ MPSoC. The implemented object detector achieved 35.71 frames per second (FPS), which is faster than the standard video rate (29.97 FPS). Compared with a CPU and a GPU, the FPGA-based accelerator was superior in performance-per-power efficiency. Our method is suitable as a frame object detector for an embedded vision system.

I. INTRODUCTION

Convolutional neural networks (CNNs) are essentially a cascaded set of pattern recognition filters that are trained with big data [1]. They enable us to solve complex problems for a wide range of computer vision applications. This demonstration shows an FPGA implementation of a frame object detector, YOLO (You Only Look Once) [3], which is used in embedded vision systems such as robots, automobiles, security cameras, and drones. Such systems require high performance-per-power detection on an inexpensive device.

II. MIXED-PRECISION CNN

In past work, we showed that a mixed-precision CNN (binary (1-bit) and half (16-bit) precision) is suitable for a complex problem, YOLO object detection [2]. In that implementation, we proposed a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and parallel support vector regression (SVR) for both classification and localization.

Fig. 2. Overall architecture for YOLOv2.

In this demonstration, we implement a mixed-precision CNN, which consists of a binarized CNN for the former part and a half-precision (16-bit) CNN for the latter part. Since the mixed-precision CNN consists entirely of neural network layers, we can apply the standard back-propagation training algorithm; thus, training is more straightforward than for the SVR-CNN mixed version. Fig. 1 shows the proposed mixed-precision CNN. Other machine-learning regressors could also be substituted in the localization layer, where highly accurate regression is required. In our design, a half-precision CNN is used in parallel for both localization and classification, while a binary-precision CNN is used for feature extraction. The existing YOLOv2 adopts an FCN (fully convolutional network) structure, so the convolution operation is executed in all layers.
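To make the data-type split concrete, the following is a minimal, framework-free Python sketch of one binarized layer feeding a half-precision layer; all shapes and names are toy illustrations, not our actual network or circuit.

    import numpy as np

    def binarize(x):
        # Quantize real values to {-1, +1}, as in a binary (1-bit) layer.
        return np.where(x >= 0, 1.0, -1.0).astype(np.float32)

    def conv3x3(fmaps, weights, dtype=np.float32):
        # Naive 3x3 same-padding convolution. fmaps: (C_in, H, W),
        # weights: (C_out, C_in, 3, 3). For illustration only.
        c_in, h, w = fmaps.shape
        c_out = weights.shape[0]
        padded = np.pad(fmaps.astype(dtype), ((0, 0), (1, 1), (1, 1)))
        out = np.zeros((c_out, h, w), dtype=dtype)
        for co in range(c_out):
            for ci in range(c_in):
                for kh in range(3):
                    for kw in range(3):
                        out[co] += weights[co, ci, kh, kw] * \
                                   padded[ci, kh:kh + h, kw:kw + w]
        return out

    # Former part: binarized activations and binarized weights (1 bit).
    image = np.random.randn(3, 32, 32).astype(np.float32)
    w_bin = binarize(np.random.randn(16, 3, 3, 3))
    features = binarize(conv3x3(binarize(image), w_bin))

    # Latter part: half-precision (float16) weights and activations.
    w_half = np.random.randn(8, 16, 3, 3).astype(np.float16)
    head_out = conv3x3(features, w_half, dtype=np.float16)
    print(head_out.shape, head_out.dtype)  # (8, 32, 32) float16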

A. YOLOv2 Accelerator Implementation on an FPGA

Fig. 2 shows the overall architecture of the proposed YOLOv2. Our architecture has two weight caches: a binarized cache used by the 2D binarized convolutional circuit, and a half-precision cache used by the latter convolutional circuit. The off-chip DDR memory stores all the weights. Since the weight-load operations are not dominant for either the binarized or the half-precision computations, the proposed architecture achieves a high-performance object detector. The result is sent to the host ARM processor, which then performs the post-processing, since that processing is light. Also, our implementation achieves a higher computation speed than a conventional one, since it performs the convolutional operation on many feature maps at a time.
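On the hardware side, a binarized multiply-accumulate is typically reduced to an XNOR followed by a popcount; this is the standard technique for binarized CNN circuits, and the following Python sketch shows the equivalence (an illustration, not our actual RTL).

    import numpy as np

    def dot_reference(a, w):
        # Reference dot product with elements in {-1, +1}.
        return int(np.dot(a, w))

    def dot_xnor_popcount(a_bits, w_bits, n):
        # Encode -1 as bit 0 and +1 as bit 1; then for n-bit vectors
        # dot(a, w) = 2 * popcount(XNOR(a, w)) - n.
        xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)
        return 2 * bin(xnor).count("1") - n

    n = 16
    a = np.random.choice([-1, 1], n)
    w = np.random.choice([-1, 1], n)
    # Pack each {-1, +1} vector into an n-bit integer.
    a_bits = sum(1 << i for i in range(n) if a[i] > 0)
    w_bits = sum(1 << i for i in range(n) if w[i] > 0)
    assert dot_reference(a, w) == dot_xnor_popcount(a_bits, w_bits, n)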


TABLE I. PARAMETERS FOR THE IMPLEMENTED CNN OF THE YOLOV2.

Layer        # In. Fmaps   # Out. Fmaps   Kernel Size   Out. Fmap Size
(Feature Extraction)
Bin Conv1    3             128            3 × 3         128 × 128
Bin Conv2    128           128            3 × 3         128 × 128
Max Pool     128           128            2 × 2         64 × 64
Bin Conv1    128           256            3 × 3         64 × 64
Bin Conv2    256           256            3 × 3         64 × 64
Bin Conv2    256           256            3 × 3         64 × 64
Max Pool     256           256            2 × 2         32 × 32
(Localization + Classification)
Half Conv2   256           256            3 × 3         32 × 32
Max Pool     256           256            2 × 2         16 × 16
Half Conv2   256           256            3 × 3         16 × 16
Max Pool     256           256            2 × 2         8 × 8
Half Conv2   256           40             1 × 1         8 × 8
Accuracy (mAP): 64.6
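As a sanity check, the output feature-map sizes in Table I can be reproduced by walking the layer stack, assuming stride-1, pad-1 3 × 3 convolutions and stride-2 2 × 2 pooling:

    # Walk the Table I layer stack and confirm the output sizes.
    # Assumption: 3x3 convs use stride 1 / pad 1 and the 1x1 conv uses
    # stride 1, so convs preserve H x W, while 2x2 stride-2 pooling
    # halves it.
    def conv(x, c_out):            # (c, h, w) -> (c_out, h, w)
        _, h, w = x
        return (c_out, h, w)

    def pool(x):                   # (c, h, w) -> (c, h // 2, w // 2)
        c, h, w = x
        return (c, h // 2, w // 2)

    x = (3, 128, 128)                                  # input image
    x = conv(x, 128); x = conv(x, 128); x = pool(x)    # -> (128, 64, 64)
    x = conv(x, 256); x = conv(x, 256); x = conv(x, 256)
    x = pool(x)
    assert x == (256, 32, 32)      # end of the binarized feature extractor
    x = conv(x, 256); x = pool(x)                      # -> (256, 16, 16)
    x = conv(x, 256); x = pool(x)                      # -> (256, 8, 8)
    x = conv(x, 40)                # final 1x1 conv -> detection tensor
    assert x == (40, 8, 8)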

Fig. 3. System diagram. (Figure: the host PC sends images over an RJ45 Ethernet connection to the Xilinx ZCU102 evaluation board, which carries the Xilinx Zynq UltraScale+ MPSoC ZU9EG; the YOLO CNN and the processor on the board return the detection result, i.e., the category (car, person) and the location (x, y, h, w), to the host PC.)

III. IMPLEMENTATION

We implemented the proposed mixed-precision YOLOv2 on the Xilinx Zynq UltraScale+ MPSoC ZCU102 evaluation board, which carries the Xilinx Zynq UltraScale+ MPSoC FPGA (ZU9EG: 274,080 LUTs, 548,160 FFs, 1,824 18-Kb BRAMs, 2,520 DSP48Es). Fig. 3 shows a system diagram of the demonstration. First, the host PC sends an image to the FPGA board via Ethernet. For the demonstration, we captured a driving video and stored it on the host PC. Next, the YOLO CNN on the FPGA detects objects and sends their categories and locations to the host PC. Finally, the host PC displays the detection result. The architecture processed an image in 28.0 msec, so the number of frames per second (FPS) was 35.71. We measured the dynamic board power consumption: it was 4.5 W. Thus, the performance-per-power efficiency was 7.93 FPS/W.
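These figures follow directly from the measured latency and power (the 7.93 in the text reflects slightly different intermediate rounding):

    latency_ms = 28.0              # measured time per image
    power_w = 4.5                  # measured dynamic board power
    fps = 1000.0 / latency_ms      # 35.71 frames per second
    print(fps, fps / power_w)      # 35.71..., about 7.94 FPS/W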

We compared our mixed-precision YOLOv2 on an FPGA with other embedded platforms. We used the NVidia Jetson TX2 board, which has both an embedded CPU (ARM Cortex-A57) and an embedded GPU (Pascal GPU). For both embedded platforms, we used the original YOLO version 2 from the Darknet deep learning framework [3]. Table II compares our FPGA implementation with the other platforms. Compared with the ARM Cortex-A57, it was 155.2 times faster, dissipated 1.1 times more power, and its performance-per-power efficiency was 141.1 times better. Also, compared with the Pascal GPU, it was 24.1 times faster, dissipated 1.5 times less power, and its performance-per-power efficiency was 36.1 times better. Thus, an FPGA-based node is suitable as the frame object detector for a ROS node of a robot system.

TABLE II. COMPARISON WITH EMBEDDED PLATFORMS WITH RESPECT TO THE YOLOV2 DETECTION (BATCH SIZE IS 1).

Platform            Embedded CPU               Embedded GPU          FPGA
Device              Quad-core ARM Cortex-A57   256-core Pascal GPU   Zynq UltraScale+ MPSoC
Clock Freq.         1.9 GHz                    1.3 GHz               0.3 GHz
Memory              32 GB eMMC Flash           8 GB LPDDR4           32.1 Mb BRAM
Time [msec]         4210.0                     715.9                 28.0
(FPS) [sec^-1]      (0.23)                     (1.48)                (35.71)
Power [W]           4.0                        7.0                   4.5
Efficiency [FPS/W]  0.057                      0.211                 7.93

Fig. 4. Photograph of a demonstration.

IV. DEMONSTRATION

We clipped the driving video frames from YouTube. Then we assigned annotations to each frame. Next, we trained the mixed-precision YOLOv2 with our training system, which is built on the Chainer deep learning framework. Its recognition accuracy (mAP) was 85.2%. Fig. 4 shows a photograph of the demonstration, which detects objects (persons and vehicles) in a driving scene. It satisfies the requirements of a driving-support system. Also, our demo can be watched on YouTube [4].
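For reference, a single Chainer training step for such a model would look like the following minimal sketch, where MixedPrecisionYOLOv2, yolo_loss, and train_iterator are hypothetical stand-ins for our training code, which is not shown here.

    import chainer
    from chainer import optimizers

    # Hypothetical names standing in for our model and detection loss.
    model = MixedPrecisionYOLOv2()
    optimizer = optimizers.Adam()
    optimizer.setup(model)

    for images, annotations in train_iterator:   # annotated video frames
        pred = model(images)                     # forward pass
        loss = yolo_loss(pred, annotations)      # localization + classification
        model.cleargrads()
        loss.backward()                          # standard back-propagation
        optimizer.update()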

V. ACKNOWLEDGEMENTS

This research is supported in part by the Grants-in-Aid for Scientific Research of JSPS, and by the New Energy and Industrial Technology Development Organization (NEDO). We also thank the Xilinx University Program (XUP), the Intel University Program, and NVidia Corp. for their support.

REFERENCES

[1] Y. LeCun, Y. Bengio and G. Hinton, "Deep Learning," Nature, Vol. 521, 2015, pp. 436-444.

[2] H. Nakahara, H. Yonekawa, T. Fujii and S. Sato, "A Lightweight YOLOv2: A Binarized CNN with a Parallel Support Vector Regression for an FPGA," FPGA, 2018, pp. 31-40.

[3] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” arXivpreprint arXiv:1612.08242, 2016.

[4] H. Nakahara, "FPGA YOLOv2 on the Xilinx ZCU102 Zynq UltraScale+ MPSoC Board," https://www.youtube.com/watch?v= iMboyu8iWc