Neural Networks on FPGAs
Digital System Integration and Programming
Alexander Deibel
December 16, 2020
1
Content
• Introduction
• Neural Networks
• Why use FPGAs?
• Challenges and Application Areas
• Xilinx Deep Neural Network (xDNN)
• ZynqNet
2
Introduction
• Image classification, speech recognition and object detection
become increasingly important
• Computation of unstructured data is difficult
• Machine learning concepts (Neural Networks) replace
hard-coded algorithms
• Enormous computational complexity
• Hardware Accelerators to increase execution speed
3
Neural Networks i
• Computation Algorithm
inspired by nervous system
• Artificial neurons are the
basic building blocks
• Learnable weights: define
reaction to given input
signal
• Trained using thousands of
examples
Figure 1: Artificial Neuron [1]
4
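The artificial neuron described above can be sketched in a few lines: the output is an activation function applied to the weighted sum of the inputs. The weight, bias, and activation values below are illustrative, not taken from any trained network.

```python
def neuron(inputs, weights, bias):
    # Weighted sum of the input signal, shifted by the bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ReLU activation: the neuron outputs only positive sums
    return max(0.0, z)

# 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1
out = neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

The learnable weights are exactly the values training adjusts; the structure of the computation stays fixed.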
Neural Networks ii
• Interconnection of artificial
neurons
• Connections only between
neurons of adjacent layers
• Two Phases:
• Training
• Inference
Figure 2: Neural Network [1]
5
Convolutional Neural Networks
• NNs specialized for
Image Data
• 2D convolutions
• Can easily be
parallelized
• First layer detects low-level features like edges and curves, the second layer circles, ...
Figure 3: Illustration of CNN [1]
6
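A minimal sketch of the 2D convolution that CNN layers are built on: each output value is the sum of element-wise products between a small filter and the image patch under it. The image and filter values are illustrative (a simple horizontal edge detector), not from a trained network.

```python
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):            # every output position is independent,
        for j in range(ow):        # which is why convolutions parallelize so well
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 1]]              # 1x2 filter: responds at left-to-right transitions
result = conv2d(image, edge)  # nonzero only where the 0 -> 1 edge is
```

Because no output position depends on any other, a hardware accelerator can compute many of them at once.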
HW Accelerators for NNs
NNs have a high computational complexity (Giga to Tera-Op/s):
• Speed up through parallelization possible
• CPUs not suitable for highly parallel workloads (low
performance per watt)
• GPU
• High performance
• High power consumption
• ASIC
• Best performance per watt
• High price
• Fixed designs
7
Why use FPGAs?
• FPGAs combine the best of both worlds:
• reasonable flexibility
• energy efficient
• more scalable than GPUs
• longer lifetime
• low latency
• Usage in Embedded Systems
• Very specific requirements
• Require high efficiency (power and space)
• Training can be done offline
8
Latency: FPGA vs GPU
Latency is very important in real-time applications:
Figure 4: Comparison of latency and power consumption [3].
9
Challenges of FPGA-based implementation
• Larger NNs have millions of parameters
• External memory access is slow
• Solution: use specialized NNs (reduce parameters)
• FPGAs typically have no floating-point HW
• 16-bit float
• Fixed-point schemes
• Binary weights
10
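The fixed-point schemes mentioned above can be sketched as follows: weights are stored as signed 8-bit integers with an implicit scale factor, so the FPGA only needs integer arithmetic. The scale (2**6, i.e. 6 fractional bits) is an illustrative choice, not a value from any specific design.

```python
SCALE = 2 ** 6  # 6 fractional bits (illustrative)

def quantize(w):
    # Round to the nearest representable value, clamp to the signed 8-bit range
    q = int(round(w * SCALE))
    return max(-128, min(127, q))

def dequantize(q):
    return q / SCALE

w = 0.7071
q = quantize(w)         # stored on-chip as a small integer
approx = dequantize(q)  # 0.703125 -- a small quantization error remains
```

The trade-off is precision for area and speed: training or fine-tuning usually recovers most of the accuracy lost to the coarser representation.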
Application Areas
Areas where low latency and high efficiency are required:
• Robots
• Drones
• Automotive area
• Autonomous driving (Tesla FSD Chip)
• Datacenters
• Bing search (Microsoft)
• Language Recognition (Baidu)
• Xilinx Alveo AI accelerator cards
11
Xilinx Deep Neural Network (xDNN)
• Programmable inference
processor
• Xilinx Alveo accelerator
cards
• Optimized for CNNs
• Performance: 30-35 FPS/Watt
Figure 5: Xilinx Alveo card [7]
12
Xilinx ML Design Suite: xfDNN
Automatic generation and optimization of NN for FPGAs:
Figure 6: xfDNN design flow [7]
13
Example of NN on FPGA: ZynqNet
FPGA-based Convolutional Neural Network for Image
Classification:
• Runs on Xilinx Zynq FPGAs
• Optimized co-operation of HW and CNN
• Two main components:
• ZynqNet CNN
• ZynqNet FPGA Accelerator
14
ZynqNet CNN
Customized CNN to fit on FPGA:
• based on SqueezeNet
• -22% Error
• -38% MACC operations
• +100% Parameters (low compared to others)
• Very regular (99% convolutional layers)
• Layers with dimensions of 2^N enable multiplications and
divisions with shift operations
• Fits into on-chip Caches
15
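The benefit of power-of-two layer dimensions can be illustrated in a short sketch: multiplying an index by 2^N reduces to a left shift, which is essentially free in FPGA logic. The width and indices below are illustrative; ZynqNet's actual address generation is done in hardware.

```python
width = 256            # a 2^N feature-map width (illustrative)
row, col = 3, 17

# Flat memory address of element (row, col): row * width + col
addr_mul = row * width + col

# With width = 2^8, the multiply becomes a shift by log2(256) = 8 bits
addr_shift = (row << 8) + col

assert addr_mul == addr_shift
```

On an FPGA this removes the need for a hardware multiplier in the address path entirely; the shift is just wiring.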
ZynqNet CNN
• Trained using the Caffe Framework (offline)
• No fixed-point approximations (32-bit floating-point weights)
Figure 7: High-Level Abstraction of ZynqNet CNN [1]
16
ZynqNet FPGA Accelerator
Specialized FPGA architecture for CNN Acceleration:
• 2D convolution (99% of operations)
• parallelization
• across output channels
• across 3x3 multiply-add
• data reuse (caching)
• filters
• line buffers for input regions
• accumulation of output channels
17
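The accelerator's inner computation can be sketched as follows: for one output position, the nine multiply-adds of a 3x3 filter are summed, and one result is accumulated per output channel. On the FPGA the nine MACCs and the output channels run in parallel; this sequential version only shows the arithmetic. All sizes and filter values are illustrative, not taken from the ZynqNet design.

```python
def conv3x3_point(patch, filters):
    """patch: 3x3 input window; filters: one 3x3 filter per output channel."""
    out_channels = []
    for f in filters:             # parallel across output channels in hardware
        acc = 0.0
        for di in range(3):       # the 3x3 multiply-adds are also
            for dj in range(3):   # unrolled in parallel on the FPGA
                acc += patch[di][dj] * f[di][dj]
        out_channels.append(acc)
    return out_channels

patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes the center pixel through
sum_all  = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # sums the whole window
outs = conv3x3_point(patch, [identity, sum_all])
```

The caching strategies on the slide (filter reuse, line buffers) exist to keep these parallel units fed without repeated external memory accesses.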
ZynqNet: Evaluation and Comparison
• FPGA utilization: 80-90%
• Performance: approx. 1 FPS
Figure 9: Comparison of ZynqNet with other CNNs [1]
19
ZynqNet: Modification
ZynqNet was adapted for a gesture recognition system:
• Optimizations to the FPGA Accelerator:
• 8-bit fixed-point scheme
• No off-chip memory usage
• Fine-tuning of the NN leads to almost the same accuracy
• Performance: 23.5 FPS
20
References i
[1] David Gschwend. ZynqNet: An FPGA-Accelerated Embedded
Convolutional Neural Network.
https://arxiv.org/abs/2005.06892. Online; accessed 14
December 2020.
[2] A. Shawahna, S. M. Sait and A. El-Maleh. FPGA-Based
Accelerators of Deep Learning Networks for Learning and
Classification: A Review.
https://ieeexplore.ieee.org/document/8594633.
Online; accessed 14 December 2020.
[3] Jyrki Leskela. FPGA VS GPU.
https://haltian.com/resource/fpga-vs-gpu. Online;
accessed 14 December 2020.
21
References ii
[4] R. Nunez-Prieto, P. C. Gomez and L. Liu. A Real-Time Gesture
Recognition System with FPGA Accelerated ZynqNet
Classification.
https://ieeexplore.ieee.org/document/8906956.
Online; accessed 14 December 2020.
[5] Y. Yao et al. A FPGA-based Hardware Accelerator for Multiple
Convolutional Neural Networks.
https://ieeexplore.ieee.org/document/8565657.
Online; accessed 14 December 2020.
22
References iii
[6] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G.
Venkatesh and D. Marr. Accelerating Binarized Neural
Networks: Comparison of FPGA, CPU, GPU, and ASIC.
https://ieeexplore.ieee.org/document/7929192.
Online; accessed 14 December 2020.
[7] Xilinx. Accelerating DNNs with Xilinx Alveo Accelerator Cards.
https://www.xilinx.com/support/documentation/
white_papers/wp504-accel-dnns.pdf. Online; accessed 14
December 2020.
23