
Page 1:

DNN Acceleration on FPGAs

ECE 5775 High-Level Digital Design Automation

Fall 2018

Page 2:

▸ HW 2 due Friday at 11:59pm

▸ First student-led seminar is on Tuesday 10/16
– Format: 18-min talk plus 2-min Q&A
• Q = Questions & Quiz
– The group winning the most audience votes gets a bonus point

▸ In-class midterm next Thursday on 10/18
– 75 mins
– Open book, open notes, closed Internet

1

Announcements

Page 3:

2

Outline

▸ Midterm coverage

▸ CNN acceleration on FPGAs

Page 4:

▸ Topics covered
– Specialized computing
– FPGAs
– C-based synthesis
– Control flow graph and SSA
– Scheduling
– Resource sharing
– Pipelining

3

Midterm (20%)

Page 5:

▸ Algorithm basics
– Time complexity, esp. big-O notation
– Graphs
• Trees, DAGs, topological sort
• BDDs, timing analysis

▸ FPGAs
– LUTs and LUT mapping

▸ C-based synthesis
– Arbitrary precision and fixed-point types
– Basic C-to-RTL mapping
– Key HLS optimizations to improve design performance

4

Key Topics (1)

Page 6:

▸ Compiler analysis
– Control data flow graph
– Dominance relation
– SSA

▸ Scheduling
– TCS & RCS algorithms: ILP, list scheduling, SDC
• Operation chaining, frequency/latency/resource constraints
– Ability to devise a simple scheduling algorithm to optimize a certain design metric

5

Key Topics (2)

Page 7:

▸ Resource sharing
– Conflict and compatibility graphs
– Left edge algorithm
– Ability to determine minimum resource usage in # of functional units and/or registers, given a fixed schedule

▸ Pipelining
– Dependence types
– Ability to determine the minimum II given a code snippet
• Modulo scheduling concepts: MII, RecMII, ResMII

6

Key Topics (3)

Page 8:

7

Example: What’s the MII?

[Figure: data flow graph with six multiply, two add, and two subtract operations; one edge is annotated with dependence distance [1]]

§ Three AddSub units available
§ Two Multipliers available
§ Assume 1-cycle operations, no chaining

What is the minimum II (MII) for pipelining the above data flow graph? Write down both ResMII and RecMII.
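The following is a small illustrative sketch (not part of the slides) of how ResMII and RecMII are typically computed; the operation counts and the single recurrence cycle are hypothetical placeholders, not the answer to the quiz above.

  #include <algorithm>
  #include <cstdio>

  // Sketch: computing the minimum initiation interval (MII) for modulo scheduling.
  int main() {
    // ResMII: for each resource type, ceil(#ops of that type / #units); take the max.
    int n_mul = 6, n_addsub = 4;          // hypothetical operation counts
    int mul_units = 2, addsub_units = 3;  // resource constraints from the slide
    int res_mii = std::max((n_mul + mul_units - 1) / mul_units,
                           (n_addsub + addsub_units - 1) / addsub_units);

    // RecMII: for each dependence cycle, ceil(total latency / total distance); take the max.
    // Here: one hypothetical cycle of three 1-cycle ops with loop-carried distance 1.
    int cycle_latency = 3, cycle_distance = 1;
    int rec_mii = (cycle_latency + cycle_distance - 1) / cycle_distance;

    std::printf("ResMII=%d RecMII=%d MII=%d\n",
                res_mii, rec_mii, std::max(res_mii, rec_mii));
    return 0;
  }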

Page 9:

▸ DNNs require an enormous amount of compute
– For example, ResNet50 (70 layers) performs 7.7 billion operations to classify one image

8

Modern DNNs are Computationally Expensive

X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi. Scaling for Edge Inference of Neural Networks. Nature Electronics, vol 1, Apr 2018.


[Figure from the paper cited above; caption:]
Fig. 3 | Gap between required number of operations and performance density. a, Number of operations versus top-five error rate for leading DNN designs from the ImageNet classification competition. b, Performance density (PD) of leading GPU, ASIC and FPGA platforms. To catch up with the required number of operations, simply increasing the chip area is not feasible. Only the leading DNNs are labelled with a year. The y axis is in log scale. Data taken from refs 1–31,119,120.

Compute Density

Page 10:

9

DNNs are Moving to Dedicated Hardware

Blue Chips: Apple, Google, Intel, Microsoft, …

Startups: Cambricon, Cerebras, Deephi (→ Xilinx), Graphcore, …

Academia: EIE/ESE [Han ISCA’16, FPGA’17], Eyeriss [Chen ISCA’16], FINN [Umuroglu FPGA’17], Minerva [Reagen ISCA’16], …

Page 11:

10

A “Hyper”-Active Body of Research

[Bar chart: recent publications on "Neural Networks on Silicon" in premier computer hardware conferences, by venue, 2015–2017 (>150 papers since 2015). Data source: https://github.com/fengbintu/Neural-Networks-on-Silicon]

This lecture focuses on FPGA-based inference accelerators for convolutional neural networks (CNNs)

Page 12:

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks
Chen Zhang1, Peng Li2, Guangyu Sun1,3, Yijin Guan1, Bingjun Xiao2, Jason Cong2,3,1

1Center for Energy-Efficient Computing and Applications, Peking University
2Computer Science Department, University of California, Los Angeles
3PKU/UCLA Joint Research Institute in Science and Engineering
FPGA’15, Feb 2015

11

Page 13:

1. Analysis of the different sources of parallelism in the convolution kernel of a CNN

2. Quantitative performance modeling of the hardware design space using the Roofline method

3. Design and implementation of a CNN accelerator for FPGA using Vivado HLS, evaluated on AlexNet

12

Main Contributions

Page 14:

▸ An output pixel is connected to its neighboring region on each input feature map

▸ All pixels on an output feature map use the same filter weights

13

Convolutional Layer

[Figure: a convolutional layer: I input feature maps are convolved with K×K filters to produce O output feature maps, each of size R×C]

Page 15:

▸ Four main sources of parallelism
1. Across input feature maps

14

Parallelism in the Convolutional Layer


Page 16:

▸ Four main sources of parallelism
1. Across input feature maps
2. Across output feature maps

15

Parallelism in the Convolutional Layer


Page 17:

▸ Four main sources of parallelism
1. Across input feature maps
2. Across output feature maps
3. Across different output pixels (i.e. filter positions)

16

Parallelism in the Convolutional Layer


Page 18:

▸ Four main sources of parallelism
1. Across input feature maps
2. Across output feature maps
3. Across different output pixels (i.e. filter positions)
4. Across filter pixels

17

Parallelism in the Convolutional Layer


Page 19:

18

Parallelism in the Code

1  for (row = 0; row < R; row++) {            // loop over output pixels (rows)
2    for (col = 0; col < C; col++) {          //   ... and columns
3      for (to = 0; to < O; to++) {           // loop over output feature maps
4        for (ti = 0; ti < I; ti++) {         // loop over input feature maps
5          for (ki = 0; ki < K; ki++) {       // loop over filter pixels
6            for (kj = 0; kj < K; kj++) {
               output_fm[to][row][col] +=
                 weights[to][ti][ki][kj] * input_fm[ti][S*row+ki][S*col+kj];
   }}}}}}


S is the stride

Page 20:

19

Challenges for FPGA

[Figure: generic FPGA accelerator template: computation resource with processing elements PE-1 … PE-n, an interconnect, on-chip memory (buffer1, buffer2), an off-chip bus, and external memory]

Challenges:
– Limited on-chip storage
– Limited off-chip bandwidth
– Limited compute and routing resources

▸ We can’t just unroll all the loops due to limited FPGA resources
▸ Must choose the right code transformations to exploit the parallelism in a resource-efficient way

Techniques:
– Loop unrolling, pipelining, interchange
– Loop tiling
– Data reuse, compute-memory balance

Page 21:

20

Loop Tiling

1  for (row = 0; row < R; row++) {            // loop over pixels in an output map
2    for (col = 0; col < C; col++) {
3      for (to = 0; to < O; to++) {
4        for (ti = 0; ti < I; ti++) {
5          for (ki = 0; ki < K; ki++) {
6            for (kj = 0; kj < K; kj++) {
               output_fm[to][row][col] +=
                 weights[to][ti][ki][kj] * input_fm[ti][S*row+ki][S*col+kj];
   }}}}}}

Page 22:

1  for (row = 0; row < R; row += Tr) {                  // loop over different tiles
2    for (col = 0; col < C; col += Tc) {
3      for (to = 0; to < O; to++) {
4        for (ti = 0; ti < I; ti++) {
5          for (trr = row; trr < min(row+Tr, R); trr++) {   // loops over pixels
6            for (tcc = col; tcc < min(col+Tc, C); tcc++) { //   in each tile
7              for (ki = 0; ki < K; ki++) {
8                for (kj = 0; kj < K; kj++) {
                   output_fm[to][trr][tcc] +=
                     weights[to][ti][ki][kj] * input_fm[ti][S*trr+ki][S*tcc+kj];
   }}}}}}}}

21

Loop Tiling

[Figure: the output feature map is partitioned into Tr × Tc tiles; the outer loops visit one tile at a time while the inner loops visit the pixels within the current tile]

Offloading just the inner loops requires only a small portion of the data to be stored on the FPGA chip

Page 23:

1   for (row = 0; row < R; row += Tr) {
2     for (col = 0; col < C; col += Tc) {
3       for (to = 0; to < O; to += To) {
          // software (CPU): write output feature map tile
4         for (ti = 0; ti < I; ti += Ti) {               // loop over input feature maps
            // software (CPU): write input feature map tile + filters
5           for (trr = row; trr < min(row+Tr, R); trr++) {
6             for (tcc = col; tcc < min(col+Tc, C); tcc++) {
7               for (too = to; too < min(to+To, O); too++) {
8                 for (tii = ti; tii < min(ti+Ti, I); tii++) {
9                   for (ki = 0; ki < K; ki++) {
10                    for (kj = 0; kj < K; kj++) {
                        output_fm[too][trr][tcc] +=
                          weights[too][tii][ki][kj] *
                          input_fm[tii][S*trr+ki][S*tcc+kj];
            }}}}}}
        }
        // software (CPU): read output feature map tile
  }}}

CPU portion (outer tile loops): maximize memory reuse
FPGA portion (inner intra-tile loops, lines 5–10): maximize computational performance

22

Code with Loop Tiling

Page 24:

5   for (trr = row; trr < min(row+Tr, R); trr++) {
6     for (tcc = col; tcc < min(col+Tc, C); tcc++) {
7       for (too = to; too < min(to+To, O); too++) {
8         for (tii = ti; tii < min(ti+Ti, I); tii++) {
9           for (ki = 0; ki < K; ki++) {
10            for (kj = 0; kj < K; kj++) {
                output_fm[too][trr][tcc] +=
                  weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
    }}}}}}

23

Optimizing for On-Chip Performance

Page 25:

9   for (ki = 0; ki < K; ki++) {
10    for (kj = 0; kj < K; kj++) {
5       for (trr = row; trr < min(row+Tr, R); trr++) {
6         for (tcc = col; tcc < min(col+Tc, C); tcc++) {
7           for (too = to; too < min(to+To, O); too++) {
8             for (tii = ti; tii < min(ti+Ti, I); tii++) {
                output_fm[too][trr][tcc] +=
                  weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
    }}}}}}

Reorder these two loops (9 and 10) to the top level

24

Optimizing for On-Chip Performance

Page 26:

Reorder these two loops to the top level, then pipeline the tcc loop:

9   for (ki = 0; ki < K; ki++) {
10    for (kj = 0; kj < K; kj++) {
5       for (trr = row; trr < min(row+Tr, R); trr++) {
6         for (tcc = col; tcc < min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7           for (too = to; too < min(to+To, O); too++) {
8             for (tii = ti; tii < min(ti+Ti, I); tii++) {
                output_fm[too][trr][tcc] +=
                  weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
    }}}}}}

25

Optimizing for On-Chip Performance

Page 27:

Next, unroll the too loop inside the pipelined region:

9   for (ki = 0; ki < K; ki++) {
10    for (kj = 0; kj < K; kj++) {
5       for (trr = row; trr < min(row+Tr, R); trr++) {
6         for (tcc = col; tcc < min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7           for (too = to; too < min(to+To, O); too++) {
#pragma HLS unroll
8             for (tii = ti; tii < min(ti+Ti, I); tii++) {
                output_fm[too][trr][tcc] +=
                  weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
    }}}}}}

26

Optimizing for On-Chip Performance

Page 28:

Finally, unroll both the too and tii loops inside the pipelined region:

9   for (ki = 0; ki < K; ki++) {
10    for (kj = 0; kj < K; kj++) {
5       for (trr = row; trr < min(row+Tr, R); trr++) {
6         for (tcc = col; tcc < min(col+Tc, C); tcc++) {
#pragma HLS pipeline
7           for (too = to; too < min(to+To, O); too++) {
#pragma HLS unroll
8             for (tii = ti; tii < min(ti+Ti, I); tii++) {
#pragma HLS unroll
                output_fm[too][trr][tcc] +=
                  weights[too][tii][ki][kj] * input_fm[tii][S*trr+ki][S*tcc+kj];
    }}}}}}

27

Optimizing for On-Chip Performance

Number of cycles to execute the above loop nest ≈ K × K × Tr × Tc + L ≈ Tr × Tc × K²

L is the pipeline depth (# of pipeline stages); the pipeline achieves II = 1
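For a concrete feel (illustrative numbers, not from the paper): with K = 3, Tr = Tc = 14, and a pipeline depth of L = 10, the nest takes roughly 3 × 3 × 14 × 14 + 10 = 1,774 cycles, i.e. about Tr × Tc × K² = 1,764 cycles once the pipeline fill time is amortized.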

Page 29:


28

Optimizing for On-Chip Performance

Generated Hardware
[Figure: the generated compute engine, with dimensions labelled To and Ti]
Performance and size of each PE are determined by tile factors Ti and To
Number of data transfers is determined by Tr and Tc

Page 30:

▸ Challenge: the available optimizations present a huge space of possible designs
– What is the optimal loop order?
– What tile size to use for each loop?

▸ Implementing and testing each design by hand will be slow and error-prone

– Some designs will exceed the on-chip compute/memory capacity

▸ Solution: Performance modeling + automated design space exploration

29

Design Space Complexity

Page 31:

▸ We calculate the following design metrics:
– Total number of operations (FLOP)
• Depends on the CNN model parameters
– Total external memory access (Byte)
• Depends on the CNN weight and activation size
– Total execution time (Sec)
• Depends on the hardware architecture (e.g., tile factors To and Ti)
• Ignore resource constraints for now

30

Performance Modeling

Page 32:

▸ Total operations (FLOP) ≈ 2 × O × I × R × C × K²

▸ Execution time = Number of cycles × Clock period
– Number of cycles ≈ (O/To) × (I/Ti) × (R/Tr) × (C/Tc) × (Tr × Tc × K²)
                   ≈ (O/To) × (I/Ti) × R × C × K²

▸ External memory accesses = ai × Bi + aw × Bw + ao × Bo
– Size of input fmap buffer:  Bi = Ti × (Tr + K − 1) × (Tc + K − 1), with stride = 1
– Size of output fmap buffer: Bo = To × Tr × Tc
– Size of weight buffer:      Bw = To × Ti × K²
– External access counts: ao = (O/To) × (R/Tr) × (C/Tc),  ai = aw = (I/Ti) × ao

31

Performance Modeling
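The model above can be evaluated directly in software. The sketch below is not from the paper; the layer shape, tile factors, data width, and clock frequency are illustrative values chosen only to show how throughput, CTC ratio, and required bandwidth fall out of the formulas.

  #include <cmath>
  #include <cstdio>

  // Sketch of the analytical performance model for one conv layer.
  int main() {
    double O = 128, I = 64, R = 56, C = 56, K = 3;   // layer shape (illustrative)
    double To = 32, Ti = 16, Tr = 14, Tc = 14;       // tile factors (illustrative)
    double bytes_per_word = 4.0, freq_hz = 100e6;    // 32-bit data, 100 MHz

    double ops = 2 * O * I * R * C * K * K;          // total FLOP
    double cycles = std::ceil(O/To) * std::ceil(I/Ti)
                  * std::ceil(R/Tr) * std::ceil(C/Tc) * (Tr * Tc * K * K);

    // On-chip buffer sizes (words) and how many times each is (re)loaded.
    double Bi = Ti * (Tr + K - 1) * (Tc + K - 1);    // input tile (stride 1)
    double Bo = To * Tr * Tc;                        // output tile
    double Bw = To * Ti * K * K;                     // weight tile
    double ao = std::ceil(O/To) * std::ceil(R/Tr) * std::ceil(C/Tc);
    double ai = std::ceil(I/Ti) * ao, aw = ai;

    double bytes  = (ai * Bi + aw * Bw + ao * Bo) * bytes_per_word;
    double time_s = cycles / freq_hz;
    std::printf("GFLOP/s=%.2f  CTC=%.2f FLOP/Byte  req. BW=%.2f GB/s\n",
                ops / time_s / 1e9, ops / bytes, (bytes / time_s) / 1e9);
    return 0;
  }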

Page 33:

32

Performance Modeling

[Figure: the FPGA accelerator template from the Challenges slide: PEs, interconnect, on-chip buffers, off-chip bus, external memory]

Computational Throughput (GFLOP/Sec) = Total number of operations / Total execution time

Computation To Communication (CTC) Ratio (FLOP / Byte Accessed) = Total number of operations / Total external memory access

Required Bandwidth (GByte/Sec) = Computational Throughput / CTC Ratio

Page 34:

33

Roofline Method [1]

[Roofline plot: Computational Throughput (GFLOP/Sec) on the y axis versus Computation To Communication (CTC) Ratio (FLOP/Byte) on the x axis; a candidate Design is a point, and its Required Bandwidth (GByte/Sec) is its throughput divided by its CTC ratio]

[1] S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, CACM, 2009.

Page 35:

34

Roofline Method [1]

[Roofline plot: the Bandwidth Roof (device memory bandwidth) and the Computational Roof (max device throughput) together form “the Roofline”; a design whose Theoretical Throughput lies above the roofline either exceeds on-chip resources or is bandwidth-bottlenecked, so its Attainable Throughput (GFLOP/Sec) is clipped to the roofline]

[1] S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, CACM, 2009.
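A minimal sketch of how a design point is clipped against the roofline; the roof values and the design's numbers below are illustrative, not taken from any real device.

  #include <algorithm>
  #include <cstdio>

  // Sketch: attainable throughput = min(theoretical, computational roof, CTC * bandwidth roof).
  int main() {
    double comp_roof_gflops = 100.0;   // max device throughput (illustrative)
    double bw_roof_gbytes   = 4.5;     // device memory bandwidth (illustrative)
    double ctc              = 10.0;    // design's FLOP per byte accessed
    double theoretical      = 80.0;    // design's unconstrained throughput

    double attainable = std::min({theoretical, comp_roof_gflops, ctc * bw_roof_gbytes});
    std::printf("Attainable throughput = %.1f GFLOP/s\n", attainable);
    return 0;
  }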

Page 36:

35

Design Space Exploration with Roofline

[Roofline plot: Computational Throughput versus Computation To Communication (CTC) Ratio, with the Computational Roof and Bandwidth Roof drawn and two candidate designs labelled A and B]

Page 37:

▸ All values in floating-point

▸ Only handles Conv layers, not dense or pooling

▸ Virtex7-485t
▸ 100 MHz
▸ 61.6 GOPS
▸ 18.6 W power

36

Hardware Implementation

Page 38:

37

Experimental Results

[Bar charts comparing 1 thread -O3, 16 threads -O3, and FPGA. Energy chart annotations: 90x, 20x. Speedup chart annotations: 18x, 5x.]

CPU: Xeon E5-2430 (32nm), 16 cores, 2.2 GHz, gcc 4.7.2 -O3, OpenMP 3.0

FPGA: Virtex7-485t (28nm), 448 PEs, 100 MHz, Vivado 2013.4, Vivado HLS 2013.4

Page 39:

An OpenCL Deep Learning Accelerator on Arria 10
Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C. Ling, Gordon R. Chiu
Intel Corporation (formerly Altera), Toronto, Canada
FPGA’17, Feb 2017

38

Page 40:

▸ Reducing external memory bandwidth usage by:
1. Storing all intermediate feature maps in on-chip buffers
2. Image batching for dense layers

▸ Optimizing the convolution arithmetic using the Winograd transformation

▸ A compute-bound implementation on Arria 10 whose energy efficiency matches the Titan X GPU

39

Main Contributions

Page 41:

▸ The FPGA’15 paper used C++ with Xilinx Vivado HLS to generate RTL

▸ Sequential programming model using loops

▸ Inter-loop-iteration parallelism is implicit (requires unrolling)

40

OpenCL Programming

Conventional C++:

  void sum(float* a, float* b, float* c) {
    for (int i = 0; i < 100; i++)
      c[i] = a[i] * b[i];
  }

OpenCL:

  kernel
  void sum(global float* a, global float* b, global float* c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] * b[gid];
  }

▸ This work used OpenCL and Intel FPGA SDK to generate RTL

▸ Parallel programming model using multithreaded kernels, where inter-iteration parallelism is explicit – each thread obtains an independent ID

Page 42:

▸ Previous papers stored layer data off-chip

▸ Insufficient on-chip storage to hold all data for a layer (input+output)

▸ This work uses an Arria 10 FPGA device

▸ Enough storage to keep data on-chip (for conv layers in AlexNet)

▸ Use double-buffering to store input+output

41

Data Placement

[Figure: the previous approach streams each layer's data (1, 2, 3, …) between off-chip storage and the PE; this work keeps layer data in double-buffered on-chip storage on the FPGA]

Page 43:

▸ On FPGA, DSP blocks (used for fixed-point multiplies) are typically the bottleneck resource

▸ Consider a 1-dimensional convolution with output length 2 and filter length 3, denoted F(2,3)

42

Arithmetic Optimizations

▸ In the 70s, Shmuel Winograd proved that F(m,r) can be computed with only m+r−1 multiplies, which is the lower bound [1]

F(2,3): 6 multiplies
[Figure: a 1-D convolution of input (i0, i1, i2, i3) with filter (f0, f1, f2) producing outputs (o0, o1); the direct method performs 6 multiplies]

[1] S. Winograd, Arithmetic Complexity of Computations, SIAM, Jan 1, 1980

Page 44:

43

Winograd Transform

[Figure: the same 1-D convolution, input (i0, i1, i2, i3) ∗ filter (f0, f1, f2) = outputs (o0, o1)]

Naive approach (6 unique multiplies):

  o0 = i0·f0 + i1·f1 + i2·f2
  o1 = i1·f0 + i2·f1 + i3·f2

  In matrix form:
  [o0]   [i0 i1 i2]   [f0]
  [o1] = [i1 i2 i3] × [f1]
                      [f2]

Winograd approach (4 unique multiplies):

  d0 = i0 − i2        g0 = f0
  d1 = i1 + i2        g1 = (f0 + f1 + f2) / 2
  d2 = i2 − i1        g2 = (f0 − f1 + f2) / 2
  d3 = i1 − i3        g3 = f2

  Each yi = di·gi  (1 multiply per yi)

  o0 = y0 + y1 + y2
  o1 = y1 − y2 − y3

  The di and gi are linearly mapped from the ii and fi.
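The derivation above can be checked with a few lines of code. The sketch below (not from the paper) implements F(2,3) exactly as written and compares it against the direct 6-multiply convolution.

  #include <cstdio>

  // Sketch: 1-D Winograd F(2,3) as derived above: 4 multiplies
  // instead of 6 for two outputs of a 3-tap convolution.
  void winograd_f23(const float i[4], const float f[3], float o[2]) {
    float d0 = i[0] - i[2], d1 = i[1] + i[2];
    float d2 = i[2] - i[1], d3 = i[1] - i[3];
    float g0 = f[0];
    float g1 = (f[0] + f[1] + f[2]) / 2;
    float g2 = (f[0] - f[1] + f[2]) / 2;
    float g3 = f[2];
    float y0 = d0 * g0, y1 = d1 * g1, y2 = d2 * g2, y3 = d3 * g3;  // the 4 multiplies
    o[0] = y0 + y1 + y2;
    o[1] = y1 - y2 - y3;
  }

  int main() {
    float in[4] = {1, 2, 3, 4}, flt[3] = {0.5f, -1, 2}, out[2];
    winograd_f23(in, flt, out);
    // Direct check: o0 = i0*f0 + i1*f1 + i2*f2, o1 = i1*f0 + i2*f1 + i3*f2
    std::printf("winograd: %g %g  direct: %g %g\n", out[0], out[1],
                in[0]*flt[0] + in[1]*flt[1] + in[2]*flt[2],
                in[1]*flt[0] + in[2]*flt[1] + in[3]*flt[2]);
    return 0;
  }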

Page 45:

▸ Massive increase in performance due to Winograd Transform and storing all features on-chip

▸ Among the first to break the TeraOP/s barrier on FPGA

44

Comparison to Previous Papers

                             Zhang 2015        Qiu 2016       This Paper
Platform                     Virtex-7 VX485T   Zynq XC7Z045   Arria 10 1150
Clock (MHz)                  100               150            303
Quantization                 32-bit float      16-bit fixed   16-bit fixed
Performance (GOP/s)          61.6              137.0          1382
Power (W)                    18.6              9.6            45
Energy Efficiency (GOP/J)    3.3               14.2           30.7

Page 46:

▸ Results show FPGAs can compete with GPUs in energy efficiency

▸ Titan X numbers ignore communication overhead and use random data instead of real images (highly optimistic)

45

Experimental Evaluation

Platform                 img/s   Power (W)   Energy Efficiency (img/s/W)
Arria 10 DLA (20nm)      1020    45          22.7
Nvidia Titan X (28nm)    5120    227         22.6
Nvidia M4 (28nm)         1150    58          19.8

Benchmark app is AlexNet

Page 47:

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Ritchie Zhao1, Weinan Song2, Wentao Zhang2, Tianwei Xing3, Jeng-Hau Lin4, Mani Srivastava3, Rajesh Gupta4, Zhiru Zhang1

1Electrical and Computer Engineering, Cornell University
2Electronics Engineering and Computer Science, Peking University
3Electrical Engineering, University of California Los Angeles
4Computer Science and Engineering, University of California San Diego
FPGA’17, Feb 2017

46

Page 48:

▸ Hardware architects widely apply fixed-point optimization for CNN acceleration
– Motivation: both neural nets and image/video apps naturally tolerate small amounts of noise
– Approach: take a trained floating-point model and apply quantization
• 16- or 8-bit fixed-point has been shown to be practical

▸ Can we go even lower by training a reduced-numerical-precision CNN from the ground up?

47

CNNs with Reduced Numerical Precision

Page 49:

48

Aggressively Quantized CNNs

▸ ML research papers:
– BinaryConnect [NIPS] Dec 2015
– BNN [arXiv] Mar 2016
– Ternary-Net [arXiv] May 2016
– XNOR-Net [ECCV] Oct 2016
– HWGQ [CVPR] Jul 2017
– LR-Net [arXiv] Oct 2018
– Many more!

Near state-of-the-art on MNIST, CIFAR-10 and SVHN at the time of publication

[1] Matthew Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830, Feb 2016.

Page 50:

49

CNN vs. BNN

CNN:
  Input Map (e.g. 2.4, 6.2, …, 3.3, 1.8)  ∗  Weights (0.8, 0.1, 0.3, 0.8)  =  Output Map (5.0, 9.1, …, 4.3, 7.8)

BNN:
  Input Map, binary (1, −1, …, 1, 1)  ∗  Weights, binary (1, −1, 1, −1)  =  xij, integer (1, −3, …, 3, −7)

  Batch normalization:  yij = (xij − μ) / sqrt(σ² + ε) · γ + β
  Binarization:         zij = +1 if yij ≥ 0, −1 otherwise
  Output Map, binary (1, −1, …, 1, −1)

Key Differences
1. Inputs are binarized (−1 or +1)
2. Weights are binarized (−1 or +1)
3. Results are binarized after batch normalization
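A small sketch (not from the paper) of the BNN activation path shown above: the integer convolution result is batch-normalized and then binarized by its sign. The constants are illustrative.

  #include <cmath>
  #include <cstdio>

  // Sketch: integer conv result -> batch normalization -> sign binarization.
  int binarize_after_bn(int x, float mu, float var, float gamma, float beta) {
    const float eps = 1e-5f;
    float y = (x - mu) / std::sqrt(var + eps) * gamma + beta;
    return (y >= 0) ? +1 : -1;
  }

  int main() {
    // Illustrative constants: mu=0.5, var=4, gamma=1, beta=0.2
    std::printf("%d %d\n",
                binarize_after_bn(-3, 0.5f, 4.0f, 1.0f, 0.2f),   // -> -1
                binarize_after_bn( 3, 0.5f, 4.0f, 1.0f, 0.2f));  // -> +1
    return 0;
  }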

Page 51:

1. Floating point ops replaced with binary logic ops
– Encode {+1,−1} as {0,1} → multiplies become XORs
– Conv/dense layers do dot products → XOR and popcount
– Operations can map to LUT fabric as opposed to DSPs

2. Binarized weights may reduce total model size
– But note that fewer bits per weight may be offset by having more weights

50

Advantages of BNN

b1   b2   b1 × b2          b1   b2   b1 XOR b2
+1   +1   +1                0    0    0
+1   −1   −1                0    1    1
−1   +1   −1                1    0    1
−1   −1   +1                1    1    0
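A minimal sketch of the XOR-and-popcount dot product described above; the bit packing, function name, and use of a compiler popcount builtin are my own illustration, not the paper's implementation.

  #include <cstdint>
  #include <cstdio>

  // Sketch: binarized dot product via XOR + popcount. Bits encode {+1,-1} as {0,1},
  // so for n packed values: dot = n - 2*popcount(a XOR b).
  int binary_dot(uint64_t a, uint64_t b, int n) {
    uint64_t mask = (n == 64) ? ~0ull : ((1ull << n) - 1);
    uint64_t x = (a ^ b) & mask;
    return n - 2 * __builtin_popcountll(x);   // GCC/Clang builtin
  }

  int main() {
    // Four values each (bit i = element i): a = (+1,-1,+1,+1), b = (+1,+1,-1,+1)
    uint64_t a = 0b0010, b = 0b0100;
    // Expected: (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1) = 0
    std::printf("dot = %d\n", binary_dot(a, b, 4));
    return 0;
  }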

Page 52:

▸ 6 conv layers, 3 dense layers, 3 max pooling layers
▸ All conv filters are 3x3
▸ First conv layer takes in floating-point input
▸ 13.4 Mbits total model size (after hardware optimizations)

51

BNN CIFAR-10 Architecture [2]

[2] M. Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830, Feb 2016.

[Figure: the BNN architecture; feature map dimensions shrink from 32x32 to 16x16, 8x8 and 4x4, while the number of feature maps grows through 3, 128, 128, 256, 256, 512, 512, followed by dense layers of 1024, 1024 and 10 outputs]

Page 53:

52

BNN Accelerator Design Goals

▸ Target low-power embedded applications
– Design must be resource efficient to fit a small device
– Execute layers sequentially on a single module

▸ Optimize for batch size of 1
– Store all feature maps on-chip
• Binarization makes feature maps smaller
– Weights are streamed from off-chip storage

▸ Synthesize RTL from C++ source

Page 54:

53

BNN Accelerator Architecture

[Figure: BNN accelerator datapath; fin input word streams pass through BitSel units into a variable-width line buffer, an array of convolvers combines them with conv weights, partial sums accumulate in an integer buffer, and a pooling / batch norm / binarize stage produces fout output word streams]

Challenges → Our Solutions
– Many diverse sources of parallelism (across/within images, feature maps, subword) → Highly parallel and pipelined architecture with a parameterized number of convolvers
– Design must handle various layer types with different sized feature maps → Novel variable-width line buffer ensures the pipeline is fully utilized
– Slow interface between accelerator and general-purpose memory system → Careful memory layout and BitSel unit enable word-by-word data processing, instead of pixel-by-pixel

Page 55:

54

BNN HLS Design

VariableLineBuffer linebuf;
ConvWeights wts;
IntegerBuffer outbuf;

for (i = 0; i < n_input_words; i++) {
#pragma HLS pipeline
  // read input word, update linebuffer
  WordType word = input_data[i];
  BitSel(linebuf, word, input_width);

  // update the weights each time we
  // begin to process a new fmap
  if (i % words_per_fmap == 0)
    wts = weights[i / words_per_fmap];

  // perform conv across linebuffer
  for (c = 0; c < LINE_BUF_COLS; c++) {
#pragma HLS unroll
    outbuf[i % words_per_fmap][c] +=
      conv(c, linebuf, wts);
  }
}

Figure 4: HLS pseudocode for part of the Bin-Conv unit – the pseudocode implements a pipeline which reads and performs convolution on one input word each cycle. Many details are left out; the goal is to illustrate how our design can be expressed in high-level code.


▸ User writes and tests in C++
– CPU-FPGA interface automatically synthesized (by Xilinx SDSoC)
– Significant reduction in verification time
• BNN RTL takes days to simulate

HLS code for part of the convolver unit: https://github.com/cornell-zhang/bnn-fpga

Page 56:

                          mGPU   CPU    GPU    FPGA
Runtime per Image (ms)    90     14.8   0.73   5.94
Speedup                   1x     6x     120x   15x
Power (W)                 3.6*   95     235    4.7
Image/sec/Watt            3.1    0.71   5.8    36

55

FPGA Implementation

FPGA: ZedBoard with Xilinx Zynq-7000
mGPU: Jetson TK1 embedded GPU board
CPU: Intel Xeon E5-2640 multicore processor
GPU: NVIDIA Tesla K40 GPU

Misc. HW Optimizations
1. Quantized the input image and batch norm params
2. Removed additive biases
3. Simplified batch norm computation

BNN Model                    Test Error
Claimed in paper [2]         11.40%
Python out-of-the-box [2]    11.58%
C++ optimized model          11.19%
Accelerator                  11.19%

R. Zhao et al., Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. International Symposium on Field-Programmable Gate Arrays, Feb. 2017.

Page 57:

▸ Recent papers on neural networks on silicon
– https://github.com/fengbintu/Neural-Networks-on-Silicon

▸ Tutorial on hardware architectures for DNNs
– http://eyeriss.mit.edu/tutorial.html

▸ Landscape of neural network inference accelerators
– https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator

▸ Most cited deep learning papers (since 2012)
– https://github.com/terryum/awesome-deep-learning-papers

56

Additional Useful Resources

Page 58:

▸ This tutorial contains/adapts materials developed by
– Ritchie Zhao (PhD student at Cornell)
– Authors of the following papers

• Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks (FPGA’15, PKU-UCLA)

• An OpenCL Deep Learning Accelerator on Arria 10 (FPGA’17, Intel)

57

Acknowledgements