Neural Networks on FPGAs
Digital System Integration and Programming
Alexander Deibel
December 16, 2020
1
Content
• Introduction
• Neural Networks
• Why use FPGAs?
• Challenges and Application Areas
• Xilinx Deep Neural Network (xDNN)
• ZynqNet
2
Introduction
• Image classification, speech recognition and object detection
become increasingly important
• Computation of unstructured data is difficult
• Machine learning concepts (Neural Networks) replace
hard-coded algorithms
• Enormous computational complexity
• Hardware Accelerators to increase execution speed
3
Neural Networks i
• Computation Algorithm
inspired by nervous system
• Artificial neurons are the
basic building blocks
• Learnable weights: define
reaction to given input
signal
• Trained using thousands of
examples
Figure 1: Artificial Neuron [1]
4
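The artificial neuron described above can be sketched in a few lines: the output is an activation function applied to the weighted sum of the inputs. The weight, bias, and activation values below are illustrative, not taken from any trained network.

```python
def neuron(inputs, weights, bias):
    # Weighted sum of the input signal, shifted by the bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ReLU activation: the neuron outputs only positive sums
    return max(0.0, z)

# 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1
out = neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

The learnable weights are exactly the values training adjusts; the structure of the computation stays fixed.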
Neural Networks ii
• Interconnection of artificial
neurons
• Connections only between
neurons of adjacent layers
• Two Phases:
• Training
• Inference
Figure 2: Neural Network [1]
5
Convolutional Neural Networks
• NNs specialized for
Image Data
• 2D convolutions
• Can easily be
parallelized
• First layer detects low-level features like edges and curves, the second layer circles, ...
Figure 3: Illustration of CNN [1]
6
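A minimal sketch of the 2D convolution that CNN layers are built on: each output value is the sum of element-wise products between a small filter and the image patch under it. The image and filter values are illustrative (a simple horizontal edge detector), not from a trained network.

```python
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):            # every output position is independent,
        for j in range(ow):        # which is why convolutions parallelize so well
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 1]]              # 1x2 filter: responds at left-to-right transitions
result = conv2d(image, edge)  # nonzero only where the 0 -> 1 edge is
```

Because no output position depends on any other, a hardware accelerator can compute many of them at once.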
HW Accelerators for NNs
NNs have a high computational complexity (Giga to Tera-Op/s):
• Speed up through parallelization possible
• CPUs not suitable for highly parallel workloads (low
performance per watt)
• GPU
• High performance
• High power consumption
• ASIC
• Best performance per watt
• High price
• Fixed designs
7
Why use FPGAs?
• FPGAs combine the best of both worlds:
• reasonable flexibility
• energy efficient
• more scalable than GPUs
• longer lifetime
• low latency
• Usage in Embedded Systems
• Very specific requirements
• Require high efficiency (power and space)
• Training can be done offline
8
Latency: FPGA vs GPU
Latency is very important in real-time applications:
Figure 4: Comparison of latency and power consumption [3].
9
Challenges of FPGA-based implementation
• Larger NNs have millions of parameters
• External memory access is slow
• Solution: use specialized NNs (reduce parameters)
• FPGAs typically have no floating-point HW
• 16-bit float
• Fixed-point schemes
• Binary weights
10
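The fixed-point schemes mentioned above can be sketched as follows: weights are stored as signed 8-bit integers with an implicit scale factor, so the FPGA only needs integer arithmetic. The scale (2**6, i.e. 6 fractional bits) is an illustrative choice, not a value from any specific design.

```python
SCALE = 2 ** 6  # 6 fractional bits (illustrative)

def quantize(w):
    # Round to the nearest representable value, clamp to the signed 8-bit range
    q = int(round(w * SCALE))
    return max(-128, min(127, q))

def dequantize(q):
    return q / SCALE

w = 0.7071
q = quantize(w)         # stored on-chip as a small integer
approx = dequantize(q)  # 0.703125 -- a small quantization error remains
```

The trade-off is precision for area and speed: training or fine-tuning usually recovers most of the accuracy lost to the coarser representation.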
Application Areas
Areas where low latency and high efficiency are required:
• Robots
• Drones
• Automotive area
• Autonomous driving (Tesla FSD Chip)
• Datacenters
• Bing search (Microsoft)
• Language Recognition (Baidu)
• Xilinx Alveo AI accelerator cards
11
Xilinx Deep Neural Network (xDNN)
• Programmable inference
processor
• Xilinx Alveo accelerator
cards
• Optimized for CNNs
• Performance: 30-35 FPS/Watt
Figure 5: Xilinx Alveo card [7]
12
Xilinx ML Design Suite: xfDNN
Automatic generation and optimization of NN for FPGAs:
Figure 6: xfDNN design flow [7]
13
Example of NN on FPGA: ZynqNet
FPGA-based Convolutional Neural Network for Image
Classification:
• Runs on Xilinx Zynq FPGAs
• Optimized co-operation of HW and CNN
• Two main components:
• ZynqNet CNN
• ZynqNet FPGA Accelerator
14
ZynqNet CNN
Customized CNN to fit on FPGA:
• based on SqueezeNet
• -22% Error
• -38% MACC operations
• +100% Parameters (low compared to others)
• Very regular (99% convolutional layers)
• Layers with dimensions of 2^N enable multiplications and
divisions with shift operations
• Fits into on-chip Caches
15
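The benefit of power-of-two layer dimensions can be illustrated in a short sketch: multiplying an index by 2^N reduces to a left shift, which is essentially free in FPGA logic. The width and indices below are illustrative; ZynqNet's actual address generation is done in hardware.

```python
width = 256            # a 2^N feature-map width (illustrative)
row, col = 3, 17

# Flat memory address of element (row, col): row * width + col
addr_mul = row * width + col

# With width = 2^8, the multiply becomes a shift by log2(256) = 8 bits
addr_shift = (row << 8) + col

assert addr_mul == addr_shift
```

On an FPGA this removes the need for a hardware multiplier in the address path entirely; the shift is just wiring.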
ZynqNet CNN
• Trained using the Caffe Framework (offline)
• No fixed-point approximations (32-bit floating-point weights)
Figure 7: High-Level Abstraction of ZynqNet CNN [1]
16
ZynqNet FPGA Accelerator
Specialized FPGA architecture for CNN Acceleration:
• 2D convolution (99% of operations)
• parallelization
• across output channels
• across 3x3 multiply-add
• data reuse (caching)
• filters
• line buffers for input regions
• accumulation of output channels
17
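The accelerator's inner computation can be sketched as follows: for one output position, the nine multiply-adds of a 3x3 filter are summed, and one result is accumulated per output channel. On the FPGA the nine MACCs and the output channels run in parallel; this sequential version only shows the arithmetic. All sizes and filter values are illustrative, not taken from the ZynqNet design.

```python
def conv3x3_point(patch, filters):
    """patch: 3x3 input window; filters: one 3x3 filter per output channel."""
    out_channels = []
    for f in filters:             # parallel across output channels in hardware
        acc = 0.0
        for di in range(3):       # the 3x3 multiply-adds are also
            for dj in range(3):   # unrolled in parallel on the FPGA
                acc += patch[di][dj] * f[di][dj]
        out_channels.append(acc)
    return out_channels

patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes the center pixel through
sum_all  = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # sums the whole window
outs = conv3x3_point(patch, [identity, sum_all])
```

The caching strategies on the slide (filter reuse, line buffers) exist to keep these parallel units fed without repeated external memory accesses.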
ZynqNet: Evaluation and Comparison
• FPGA utilization: 80-90%
• Performance: approx. 1 FPS
Figure 9: Comparison of ZynqNet with other CNNs [1]
19
ZynqNet: Modification
ZynqNet was adapted for a gesture recognition system:
• Optimizations to the FPGA Accelerator:
• 8-bit fixed-point scheme
• No off-chip memory usage
• Fine-tuning of the NN leads to almost the same accuracy
• Performance: 23.5 FPS
20
References i
[1] David Gschwend. ZynqNet: An FPGA-Accelerated Embedded
Convolutional Neural Network.
https://arxiv.org/abs/2005.06892. Online; accessed 14
December 2020.
[2] A. Shawahna, S. M. Sait and A. El-Maleh. FPGA-Based
Accelerators of Deep Learning Networks for Learning and
Classification: A Review.
https://ieeexplore.ieee.org/document/8594633.
Online; accessed 14 December 2020.
[3] Jyrki Leskela. FPGA VS GPU.
https://haltian.com/resource/fpga-vs-gpu. Online;
accessed 14 December 2020.
21
References ii
[4] R. Nunez-Prieto, P. C. Gomez and L. Liu. A Real-Time Gesture
Recognition System with FPGA Accelerated ZynqNet
Classification.
https://ieeexplore.ieee.org/document/8906956.
Online; accessed 14 December 2020.
[5] Y. Yao et al. A FPGA-based Hardware Accelerator for Multiple
Convolutional Neural Networks.
https://ieeexplore.ieee.org/document/8565657.
Online; accessed 14 December 2020.
22
References iii
[6] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G.
Venkatesh and D. Marr. Accelerating Binarized Neural
Networks: Comparison of FPGA, CPU, GPU, and ASIC.
https://ieeexplore.ieee.org/document/7929192.
Online; accessed 14 December 2020.
[7] Xilinx. Accelerating DNNs with Xilinx Alveo Accelerator Cards.
https://www.xilinx.com/support/documentation/
white_papers/wp504-accel-dnns.pdf. Online; accessed 14
December 2020.
23