
HAL Id: hal-03094811
https://hal.inria.fr/hal-03094811

Submitted on 23 Jun 2021


To cite this version: Thibaut Marty, Tomofumi Yuki, Steven Derrien. Safe Overclocking for CNN Accelerators through Algorithm-Level Error Detection. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, IEEE, 2020, 39 (12), pp. 4777-4790. DOI: 10.1109/TCAD.2020.2981056. hal-03094811.


Safe Overclocking for CNN Accelerators through Algorithm-Level Error Detection

Thibaut Marty, Tomofumi Yuki, Steven Derrien

Abstract—In this paper, we propose a technique for improving the efficiency of CNN hardware accelerators based on timing speculation (overclocking) and fault tolerance. We augment the accelerator with a lightweight error detection mechanism to protect against timing errors in convolution layers, enabling aggressive timing speculation. The error detection mechanism we have developed works at the algorithm level, utilizing algebraic properties of the computation, allowing the full implementation to be realized using High-Level Synthesis tools. Our prototype on a ZC706 board demonstrated up to 60% higher throughput with negligible area overhead for various wordlength implementations.

I. INTRODUCTION

Timing speculation, also known as overclocking, is a well-known approach to increase the computational throughput of programmable processors and hardware accelerators. When used aggressively, timing speculation can lead to incorrect results due to timing anomalies that typically occur within long combinational paths. As reported in the literature [1]-[3], timing errors can cause large numerical errors in the computation. Although many applications are robust to low-amplitude errors (e.g., errors due to quantization), occasional large errors can have devastating effects even for such applications.

The frequency of such errors depends on a number of factors, including the intensity of overclocking, operating temperature, voltage drops, variability within and across boards, input data, and so on. This makes it extremely difficult to determine a "safe" overclocking speed analytically or empirically, as acknowledged in prior work. This has led to the use of online monitoring to avoid having excessive slacks based on static timing analysis [4], [5].

In this work, we propose to combine timing speculation with algorithm-level error detection to make overclocking a viable option for Convolutional Neural Networks (CNNs). A CNN is a variant of multi-layered neural networks that constructs features from local information through convolutions. CNN models used in state-of-the-art applications are computation-intensive and often need to be accelerated on GPUs or FPGAs to achieve high performance and/or better energy efficiency [6], [7]. The core computations of CNNs have abundant parallelism, both task-level and fine-grained, that can be efficiently mapped to these accelerators. Furthermore, CNNs are known to be tolerant to noise. The reasons above make CNNs an interesting class of computations to target.

Online error detection is necessary to prevent errors from affecting the final output, and it must be lightweight so that the gains from overclocking are not nullified. Although many error detection techniques exist, they are implemented at the circuit level and are tied to the exact design at hand [4], [5], [8]. This limits the ease of design space exploration for a given algorithm, especially in the context of High-Level Synthesis (HLS), where these techniques cannot be directly expressed.

We therefore propose a higher-level error detection scheme similar to Algorithm-Based Fault Tolerance (ABFT) [9], which is used in High Performance Computing as a lightweight protection from both soft and hard errors [10], [11].

Specifically, our contributions are the following:

• A lightweight error detection for convolution layers based on algorithmic properties. The developed method gives two-degree savings in complexity with respect to the full convolution layer.

• An open-source prototype implementation¹ of a timing-speculative CNN accelerator based on the proposed error detection algorithm. This FPGA prototype is fully implemented in C++ with HLS tools and targets Zynq-based systems.

• An analysis of how our approach fits into the CNN accelerator design space, and how timing speculation improves throughput in spite of the misspeculation cost.

• An experimental validation of the approach showing significant throughput and energy efficiency improvements with negligible area overhead.

The remainder of this paper is organized as follows. We provide background on timing speculation techniques and CNNs in Section II. We describe our proposed approach with lightweight error detection in Sections III and IV. We demonstrate the approach with a prototype implementation in Section V, and then discuss related work in Section VI. We conclude and give directions for future work in Section VII.

II. BACKGROUND

In this section, we introduce convolutional neural networks and discuss existing implementation strategies for CNN accelerators. We then explain timing speculation and show the need for online error detection through empirical observation of its impact on application-level accuracy.

A. Convolutional Neural Networks

In this paper, we are interested in the forward pass of CNNs, i.e., when a trained network is applied to new inputs. Typically, a forward pass of a CNN processes three-dimensional matrices in a pipelined manner through multiple layers. The input is usually an image (with color depth), and the final output is a vector of length M indicating the likelihood of each category, which can be viewed as a 1 × 1 × M matrix.

¹https://github.com/ThibautMarty/conv-hls-overclocking/

Fig. 1. Single Computation Engine: convolution hardware (K × K) with RELU, pooling, and control units, connected to off-chip memory for data and weights.

We are interested in the convolution layer of CNNs, which is known to be the main bottleneck. Given a P × Q × N input matrix x, it computes an R × C × M output y. Each of the M two-dimensional outputs is computed as a three-dimensional convolution over x with a kernel (weights) of size K × K × N. The convolution may be strided by some factor S. The R and C dimensions of the output matrix are usually smaller than those of the input, depending on the stride and the padding at boundaries. Different layers take different values of the parameters described; the output R × C × M matrix, called feature maps, can be viewed as an "image" whose depth is the extracted features.

Given a P × Q × N input matrix x and a K × K × N × M matrix holding the weights w, a convolution layer outputs an R × C × M matrix y through the following equation:

y_{r,c,m} = \sum_{n=0}^{N-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} x_{n,Sr+i,Sc+j} \cdot w_{m,n,i,j}    (1)
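For concreteness, a minimal software reference implementation of Eq. 1 might look as follows. This is a sketch, not the accelerator code: the array layouts, fixed-point types, and function name are our own placeholders.

#include <cstdint>
#include <vector>

// Reference implementation of Eq. 1 (a sketch; 16-bit fixed-point data,
// wide accumulator). x: N x P x Q, w: M x N x K x K, y: R x C x M,
// all stored row-major in flat vectors.
void conv_layer(const std::vector<int16_t>& x, const std::vector<int16_t>& w,
                std::vector<int64_t>& y,
                int N, int P, int Q, int M, int R, int C, int K, int S) {
  for (int m = 0; m < M; m++)
    for (int r = 0; r < R; r++)
      for (int c = 0; c < C; c++) {
        int64_t acc = 0;  // exact accumulation, truncated later if needed
        for (int n = 0; n < N; n++)
          for (int i = 0; i < K; i++)
            for (int j = 0; j < K; j++)
              acc += (int64_t)x[(n * P + (S * r + i)) * Q + (S * c + j)] *
                     w[((m * N + n) * K + i) * K + j];
        y[(r * C + c) * M + m] = acc;
      }
}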

Fully-connected layers are usually employed as the last stages of the pipeline, where the size of the input matrix has become significantly smaller than the original image. They perform a collection of dot products with weights and produce a one-dimensional vector as output. They are identical to the hidden layers in regular artificial neural networks and can be computed as matrix-matrix products.

Additionally, activation functions (e.g., rectifier) or sub-sampling (e.g., max-pooling) may be considered as layers in a CNN. Since these layers are inexpensive and can be merged with the preceding layer, we do not discuss them in this paper.

State-of-the-art CNN models are known to be computationally demanding (a single run of a VGG16 model requires more than 30 G operations), with more than 90% of the computational workload spent on the convolution layers. Our approach therefore aims at accelerating the convolution layer.

B. Accelerating CNNs with FPGA

There has been a large body of work on accelerating CNNs on FPGAs and other platforms. In this section, we briefly describe the most common system-level architectures, which can be classified into three families [12].

• Single Computation Engine (SCE) architecture, depicted in Figure 1, offloads convolution kernels to the accelerator and processes layers in a sequential manner. SCEs generally support all types of convolution layers, and can also execute RELU or max-pooling layers. This accelerator style typically employs a form of decoupled access/execute model to hide memory access latency. Whenever a convolution is to be executed by the SCE, both its inputs and weights are fetched from external memory. This often results in memory bandwidth bottlenecks. To alleviate this issue, it is possible to take advantage of batching, where a given layer configuration is reused to process several data-sets. However, batching improves overall throughput at the cost of latency.

Fig. 2. Streaming Accelerator: per-layer convolution hardware instances (each with its own weights) pipelined through FIFOs, followed by RELU, pooling, and a feature classifier.

• Streaming Accelerators (SA), as shown in Figure 2, use multiple convolution accelerator instances, typically one per layer, that are pipelined to provide high throughput. In this approach, each convolution accelerator must store all its weights on the chip, which can be challenging given the relative size of CNN models. Another difficulty is to balance processing resources among layers to maximize the overall throughput.

• Fused Accelerators (FA) use a combination of the previous approaches, where the hardware accelerator executes a sequence of convolution layers. Such an approach offers a good trade-off between on-chip memory cost, off-chip memory bandwidth requirements, and inference latency.

Our approach augments the convolution hardware depicted in Figures 1 and 2 with error detection to enable safe overclocking. The implementation of the convolution hardware is discussed in Section IV. Our approach fits all the architectures described above; our only requirement is that the inputs/outputs of the convolution accelerator are streamed through FIFOs.

C. Timing Speculation through Overclocking in FPGAs

The minimum clock period at which a given FPGA design is expected to work is obtained from Static Timing Analysis (STA). This analysis assumes the worst-case scenario, and hence the design may operate at higher frequencies without exhibiting incorrect behavior. However, this technique, widely known as overclocking or timing speculation, also has many pitfalls:

• Variability among chips and operating conditions makes it difficult to determine how much overclocking can be tolerated. The fact that errors do not manifest often (or the inability to observe errors in a given setup) does not mean that errors do not happen.

• The impact of timing errors on the circuit output is difficult to evaluate a priori. Unlike errors arising from truncation/quantization, the impact is not limited to Least Significant Bits (LSBs). Thus, timing errors may result in large numerical errors, compromising the design functionality.

Fig. 3. Influence of overclocking AlexNet layer 5 on classification (classification error in % as a function of frequency in MHz); a classification error is when the output class differs from that of an error-free execution, regardless of the ground-truth class. The data points are the highest/lowest error rates observed for a given frequency, and the shaded regions show the range of possible error rates influenced by various sources of variability.

D. Impact of Overclocking on Classification Accuracy

We performed a series of experiments to assess the impact of overclocking on final outputs. In this section, we report some of the results obtained from classifying 5000 images (the first 10% of the ILSVRC 2012 validation dataset) using AlexNet. We used our accelerator prototype, a 16-bit fixed-point design, to execute the last convolutional layer. The last convolutional layer was selected for these experiments because the last layers are known to be more sensitive to errors than the first layers. We obtained the input data for the accelerator by quantizing the previous layer's error-free floating-point output, in order to isolate the effect on accuracy. Using low-precision fixed-point requires special attention to training and quantization to avoid accuracy loss, which is beyond the scope of this work.

Figure 3 depicts the impact of overclocking on classification accuracy. We performed this experiment over a set of Zybo platforms to capture inter-board variability, using different data-sets. Our results show that there is significant variation in when the accuracy degradation starts to happen (note that we obtained similar results for the ZC706 system).

It is acknowledged by prior work that a "safe" frequency is impossible to determine statically due to multiple sources of variability [5]. For example, when considering only a subset of the variability (inter-board variability), there is a gap as wide as 10 MHz in the frequency at which accuracy starts to drop. In other words, it is difficult to ensure that overclocked accelerators do not significantly alter the application outputs.

E. Online Error Detection for Convolution

The combination of variability and high impact on outputs makes overclocking impractical without some form of online error detection/correction. This has led to circuit-level techniques for dynamically checking for incorrect behaviors, including the Razor technique [4], [5], [13], [14]. In particular, Nunez-Yanez [8] has used this technique in the context of timing-speculative Binary Neural Network accelerators. These circuit-level approaches do not support designs with DSP blocks, and often need semi-manual design steps to be implemented on FPGAs (see Section VI for further discussion).

III. ALGORITHM-LEVEL ERROR DETECTION

In this section, we describe our lightweight error detection for convolutions. Our approach operates at the algorithm level, in contrast to circuit (register) level techniques.

A. Overview of the Approach

Our approach builds on a lightweight mechanism for detecting errors based on algorithmic properties. This approach shares some ideas with earlier work on Algorithm-Based Fault Tolerance (ABFT) for matrix operations [9].

This error detection mechanism uses two checksums: one computed from the outputs, and another computed directly from the inputs, which we refer to as the output-checksum and the input-checksum, respectively. We flag the computation as erroneous whenever the checksums do not match.

The key intuition is that the computation of the input-checksum can be simplified by taking advantage of algebraic properties, leading to a low-complexity error detector.

B. ABFT for 2D Convolutions

The 2D case is missing the depth dimension in the processed matrices, but the main ideas carry over to the 3D case. Given the 2D output y, the output-checksum is:

σ = \sum_{r=0}^{R} \sum_{c=0}^{C} y_{r,c}

Substituting the definition of y (Eq. 1 without m, n and with S = 1) gives the direct computation from the inputs:

ρ = \sum_{r=0}^{R} \sum_{c=0}^{C} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} x_{r+i,c+j} \cdot w_{i,j}

The goal is to simplify the computation of ρ so that the checksum comparison can be performed at reduced cost.

The additional two-dimensional summation provides two sources for simplification. The combination of the two simplifications described below reduces the cost of computing the input-checksum for the 2D case from O(RCK²) to K² multiplications and O(RC) additions.

1) Factorization: Multiplications can be factored out to eliminate RC multiplications. This may be viewed as a reordering of the summations followed by factorization:

ρ = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} w_{i,j} \left( \sum_{r=0}^{R} \sum_{c=0}^{C} x_{r+i,c+j} \right)

A graphical illustration is given in Figure 4. This reduces the number of multiplications from K²RC to K².

Fig. 4. Illustration of the factorization for the 2D case: (a) 2D convolution without zero-padding; (b) contribution of an input value by weights; (c) input groups by weights. As depicted in Figure 4a, 2D convolution can be viewed as a dot product between the weights and the neighboring inputs with a sliding window. An alternative view, shown in Figure 4b, is that an input value is used to compute K² output values, contributing to each of them through multiplication by a different weight. We can thus group the input data into (overlapping) subsets based on the weights, as shown for three weight values in Figure 4c. Since the sum of convolution outputs is completely linear, the multiplication can be factorized to save work.

Fig. 5. The reuse between two input groups corresponding to weights w_{0,0} and w_{0,1}: (a) two neighboring groups; (b) reuse pattern. The sum of all elements in group w_{0,1} can be computed by addition/subtraction of columns from that of w_{0,0}, instead of repeating R × C additions.

2) Reuse in Summations: The groups of summations after factorization have significant overlaps, which can be used to reduce the number of additions. This simplification concerns the computation of the inner two summations, or input groups:

X_{i,j} = \sum_{r=0}^{R} \sum_{c=0}^{C} x_{r+i,c+j}

Note that each input group X_{i,j} is a summation over a slightly different region of x due to the offsets by i and j. Computing these values of X naïvely takes K²RC additions. However, once the value of X for a specific instance of (i, j) is computed, the remaining instances can be computed with only O(C) or O(R) additions, as explained in Figure 5.

There are multiple ways to rewrite the definition of X to take advantage of this reuse. One example is as follows:

X_{i,j} =
  \sum_{r=0}^{R} \sum_{c=0}^{C} x_{r,c}                                             if i = 0, j = 0
  X_{i,j-1} + \sum_{r=0}^{R} x_{r+i,C-1+j} - \sum_{r=0}^{R} x_{r+i,j-1}             if i = 0, j > 0
  X_{i-1,j} + \sum_{c=0}^{C} x_{R-1+i,c+j} - \sum_{c=0}^{C} x_{i-1,c+j}             if i > 0

In the above, the R × C summation for X_{0,0} is performed first. Then the remaining instances of X_{i,j} are computed by addition and subtraction of one row/column. The R × C summation is not repeated for each X_{i,j} (K × K instances), reducing the complexity by O(K²) in exchange for 2R or 2C additions.
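Putting the factorization and the reuse together, a sketch of the 2D input-checksum computation is given below (plain C++ for clarity rather than hardware efficiency; the sums run over the output extent r < R, c < C, and all names are illustrative).

#include <cstdint>
#include <vector>

// Sketch: 2D input-checksum rho with factorization and row/column reuse.
// x is (R+K-1) x (C+K-1) (unit stride, no padding); w is K x K.
int64_t input_checksum_2d(const std::vector<std::vector<int16_t>>& x,
                          const std::vector<std::vector<int16_t>>& w,
                          int R, int C, int K) {
  // X[i][j] = sum over r < R, c < C of x[r+i][c+j], built incrementally.
  std::vector<std::vector<int64_t>> X(K, std::vector<int64_t>(K, 0));
  for (int r = 0; r < R; r++)            // initial R x C summation (X[0][0])
    for (int c = 0; c < C; c++)
      X[0][0] += x[r][c];
  for (int i = 0; i < K; i++) {
    if (i > 0) {                         // shift down: add one row, drop one row
      X[i][0] = X[i - 1][0];
      for (int c = 0; c < C; c++)
        X[i][0] += x[R - 1 + i][c] - x[i - 1][c];
    }
    for (int j = 1; j < K; j++) {        // shift right: add/drop one column
      X[i][j] = X[i][j - 1];
      for (int r = 0; r < R; r++)
        X[i][j] += x[r + i][C - 1 + j] - x[r + i][j - 1];
    }
  }
  int64_t rho = 0;                       // only K*K multiplications in total
  for (int i = 0; i < K; i++)
    for (int j = 0; j < K; j++)
      rho += X[i][j] * (int64_t)w[i][j];
  return rho;
}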

C. Lightweight Checksum for 3D Convolution Layers

For the 3D case, the output y has a third dimension corresponding to the different kernels applied to the input. The checksum is over all three dimensions of the output:

σ = \sum_{m=0}^{M-1} \sum_{r=0}^{R-1} \sum_{c=0}^{C-1} y_{m,r,c}    (2)

Substituting Eq. 1 (again, assuming unit stride) gives the direct checksum computation from the inputs:

ρ = \sum_{m=0}^{M-1} \sum_{r=0}^{R-1} \sum_{c=0}^{C-1} \sum_{n=0}^{N-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} x_{n,r+i,c+j} \cdot w_{m,n,i,j}

Reordering the summations permits factoring out the multiplication by the weights:

ρ = \sum_{n=0}^{N-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} \left( \left( \sum_{r=0}^{R-1} \sum_{c=0}^{C-1} x_{n,r+i,c+j} \right) \cdot \sum_{m=0}^{M-1} w_{m,n,i,j} \right)

Note that in the 3D case, the different convolution kernels applied to the same input (the m dimension) can be added together first, since m is invariant to the expression involving x.

Taking advantage of the reuse in summations yields:

ρ = \sum_{n=0}^{N-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} \left( X_{n,i,j} \cdot \sum_{m=0}^{M-1} w_{m,n,i,j} \right)    (3)

where X takes the same form as in the 2D case, except for the additional n dimension, which does not have any reuse.
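A sketch of Eq. 3 is given below, assuming the input groups X have already been computed as in the 2D case; the point is that the M kernels collapse into one summed kernel, so only NK² multiplications remain (names and types are illustrative).

#include <cstdint>
#include <vector>

// Sketch of Eq. 3: X[n][i][j] = sum_{r,c} x[n][r+i][c+j], computed as in
// the 2D case; w is M x N x K x K. The M kernels are folded into a
// single summed kernel before the multiplication.
int64_t input_checksum_3d(
    const std::vector<std::vector<std::vector<int64_t>>>& X,
    const std::vector<std::vector<std::vector<std::vector<int16_t>>>>& w,
    int M, int N, int K) {
  int64_t rho = 0;
  for (int n = 0; n < N; n++)
    for (int i = 0; i < K; i++)
      for (int j = 0; j < K; j++) {
        int64_t wsum = 0;                // sum over m of w[m][n][i][j]
        for (int m = 0; m < M; m++)
          wsum += w[m][n][i][j];
        rho += X[n][i][j] * wsum;        // one multiplication per (n, i, j)
      }
  return rho;
}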

D. Managing Strided Convolutions

Some CNNs use strided convolutions in their first layers, especially for large convolution kernels. The striding factor is usually proportional to the kernel size. For example, the typical stride is 2 for 5 × 5 kernels, whereas a striding factor of 4 is used for 11 × 11 kernels. In the following, we show how our checksums support strides, at the price of a slight increase in algorithmic complexity.


Strided convolution adds a parameter S, as shown in Eq. 1. This changes the convolution behavior: the dot product is now applied on a sliding window over the input with a step of S in both directions. This further reduces the size of the outputs and the number of multiply-accumulate operations by a factor of S².

Non-unit strides introduce sparsity to how each input is multiplied by the weights. Not all weights are used for each input value, but only one out of S in each of the two dimensions. This results in S² partitions of inputs and weights, where each partition may be viewed as a separate non-strided convolution using subsets of the inputs and weights, sampling every S elements in both dimensions. The factorization of multiplications when computing the input-checksum is unaffected by strides, since the total number of input groups remains the same; each partition contributes to disjoint subsets of the input groups. The input-checksum can then be computed with Eq. 3 using a strided definition of X:

X_{n,i,j} = \sum_{r=0}^{R} \sum_{c=0}^{C} x_{n,Sr+i,Sc+j}

The reuse in summations now happens within each partition when the inputs are used for multiple weights, that is, when S < K. Exploiting the reuse in each partition gives:

X_{n,i,j} =
  \sum_{r=0}^{R} \sum_{c=0}^{C} x_{n,Sr+i,Sc+j}                                              if i < S, j < S
  X_{n,i,j-S} + \sum_{r=0}^{R} x_{n,Sr+i,S(C-1)+j} - \sum_{r=0}^{R} x_{n,Sr+i,j-S}            if i < S, j ≥ S
  X_{n,i-S,j} + \sum_{c=0}^{C} x_{n,S(R-1)+i,Sc+j} - \sum_{c=0}^{C} x_{n,i-S,Sc+j}            if i ≥ S

There are two differences from the non-strided case: it requires separate initial computations for each partition (S² in total), and the difference computations are strided. Note that S = 1 is simply a special case of the above.

The computation of the output-checksum remains the same as in the non-strided case.

We note that layers with non-unit strides typically have larger values of R, C, and K, increasing the reuse. For AlexNet, the first layer uses S = 4, R = C = 55, and K = 11, which corresponds to a factor of 189 savings on additions and a factor of 8 savings on multiplications. These savings are similar to those for the last layers with S = 1, R = C = 13, and K = 11, which corresponds to savings on additions and multiplications by factors of 169 and 9, respectively. Thus, we expect similar levels of overhead for strided and non-strided layers. Furthermore, the calculation of X for each partition is independent, and may be parallelized if latency becomes an issue, at the cost of extra hardware resources.

TABLE I
COMPLEXITY OF CHECKSUMS VS. CONVOLUTION.

                    additions                    multiplications
convolution         O(MNRCK²)                    O(MNRCK²)
output-checksum     O(MRC)                       0
input-checksum      O(NK²(M + C) + RCS²)         O(NK²)

Fig. 6. Number of additions and multiplications for convolution and ABFT checksums for the AlexNet [7] convolutional layers. Layer 1 has S = 4; the other layers have S = 1. The per-layer counts are:

                        layer 1   layer 2   layer 3   layer 4   layer 5
conv. additions         53M       112M      75M       56M       37M
conv. multiplications   53M       112M      75M       56M       37M
ABFT additions          342k      344k      571k      436k      315k
ABFT multiplications    363       1k        2k        2k        2k

E. Max Pooling and RELU Layers

Other layers such as pooling or rectified linear units are not compute-intensive. These layers may simply be executed without overclocking to ensure that no errors are introduced.

F. Checksum Overhead Analysis

Table I shows the complexity (in terms of the number of arithmetic operations: additions and multiplications) of both the checksums and the convolution layer. Note that R × C is the size of the output images; the input image dimensions are (K + S(R − 1)) × (K + S(C − 1)). The complexity reduction translates into differences of several orders of magnitude in the number of arithmetic operations, as exemplified in Figure 6. The checksums require far fewer arithmetic operations than the convolution kernel for the AlexNet convolution layers.
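The formulas in Table I can be turned into a small calculator to estimate operation counts for a given layer. The sketch below mirrors the table's asymptotic terms, so the results are order-of-magnitude estimates rather than exact counts (e.g., it reproduces the roughly 37M convolution operations of AlexNet layer 5 from Figure 6).

#include <cstdio>

// Sketch: estimate operation counts from the Table I complexity terms
// for a layer with parameters (M, N, R, C, K, S).
int main() {
  long long M = 128, N = 192, R = 13, C = 13, K = 3, S = 1;  // AlexNet layer 5
  long long conv_ops = M * N * R * C * K * K;   // additions ~ multiplications
  long long out_adds = M * R * C;               // output-checksum additions
  long long in_adds  = N * K * K * (M + C) + R * C * S * S;
  long long in_muls  = N * K * K;
  printf("conv: %lld ops, output-checksum: %lld adds, "
         "input-checksum: %lld adds + %lld muls\n",
         conv_ops, out_adds, in_adds, in_muls);
  return 0;
}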

G. Correctness of Error Detection

In some cases, our checksum-based mechanism may lead to a false positive (detecting an error when no error occurred) or a false negative (not detecting an actual error), like any error detection mechanism.

A timing error can impact any computation in the circuit, and the erroneous results will be stored in registers. The erroneous values will in turn impact the convolution output and/or the checksums. Therefore, there are several scenarios, depending on the number of timing errors and their impact.

The scenarios are summarized in Table II. The two basic cases are highlighted: when no error occurs, and when a single error in the convolution layer occurs. The algebraic properties guarantee that a single error in the convolution output is detected. In the other cases, the result may be a false positive or a false negative. A false positive only affects performance: the tile is computed again even though it was correct. A false negative is problematic, as the errors are not detected. We have empirically confirmed that false negatives are rare in Section V-E. In the following, we give an intuition for the low rate of false negatives using a simplistic error model.

TABLE II
ERROR DETECTION OUTCOME FOR ALL SCENARIOS: NO ERROR, SINGLE ERROR, OR MULTIPLE ERRORS IN THE CHECKSUMS AND/OR THE CONVOLUTION OUTPUT. EACH OUTCOME IS LABELED AS ERROR-FREE EXECUTION, FALSE POSITIVE (FP), OR FALSE NEGATIVE (FN).

Errors in checksums \ Errors in convolution layer:
                 none           single            multiple
none             error-free     100% detection    possible FN
single           100% FP        possible FN       possible FN
multiple         possible FP    possible FN       possible FN

TABLE III
DATA AND CHECKSUM WORDLENGTHS USED IN THE EXPERIMENTAL RESULTS (SEE SECTION V). THIS LAYER'S DIMENSIONS ARE N = 192, M = 128, R = C = 13, AND S = 1, BUT THE CHECKSUMS ARE COMPUTED ON TILES. THE DESIGNS USE TILES OF SIZE 32 FOR THE N DIMENSION AND 64 FOR THE M DIMENSION.

Input data   Weight data   K    Checksums
16 bits      16 bits       3    55 bits
16 bits      8 bits        3    47 bits
16 bits      4 bits        3    43 bits
16 bits      2 bits        3    41 bits
16 bits      1 bit         3    40 bits
8 bits       8 bits        3    39 bits
4 bits       4 bits        3    31 bits
1 bit        1 bit         3    25 bits
16 bits      8 bits        5    48 bits
16 bits      8 bits        11   50 bits

Although they are unlikely, false negatives are still possible when multiple errors occur. Errors during the convolution layer affect the output-checksum. Several errors can cancel each other out if their effects on the checksum are opposite. Intuitively, the probability of such a cancellation is affected by the precision (wordlength) used for the checksum calculation. Since the checksum is computed with full precision, its wordlength b depends on the input data types and the number of accumulations. For a full convolution checksum, we have b = D + W + ⌈log₂(NK²)⌉ + ⌈log₂(RCM)⌉, where D is the input data wordlength, W is the weight data wordlength, and N, K, R, C, and M are the convolution parameters as defined in Section II-A.

Table III shows the wordlengths of the checksums for the designs used in our evaluation. The wider the checksum's wordlength, the lower the chance of a false negative. Assuming a simple model where an error ultimately leads to a single bit flip in the checksum with uniform probability, the chance that multiple errors cancel each other out is 1/2^b.
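As a concrete check of this formula, the computation below evaluates b for the tile dimensions used in our designs (N = 32 and M = 64 tiles, R = C = 13); for the 16 × 16 bits, K = 3 design it yields the 55 bits listed in Table III.

#include <cmath>
#include <cstdio>

// Sketch: checksum wordlength b = D + W + ceil(log2(N*K*K)) + ceil(log2(R*C*M)).
int checksum_bits(int D, int W, int N, int K, int R, int C, int M) {
  return D + W + (int)std::ceil(std::log2((double)N * K * K))
             + (int)std::ceil(std::log2((double)R * C * M));
}

int main() {
  // Tile dimensions from Table III: N = 32, M = 64, R = C = 13.
  printf("%d\n", checksum_bits(16, 16, 32, 3, 13, 13, 64));  // prints 55
  return 0;
}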

Errors can also occur in the checksum computation circuits. If errors occur only in the checksum computation, a false positive is highly probable (unless the errors cancel each other out). If errors occur in both the convolution and the output-checksum computation, a false negative is possible, as explained above. However, the checksum circuits are expected to be much less sensitive to timing errors than the convolution kernel circuit. This is because the hardware implementation of the checksums is much simpler than the convolution implementation, due to its lower level of parallelism. Therefore, we expect the convolution computation to be impacted by timing errors at a lower frequency than the checksum computation.

IV. IMPLEMENTATION

In this section, we describe how we designed the fault-tolerant convolution accelerator for FPGAs and give an overview of how the errors and the frequency can be managed.

A. Baseline Accelerator Design

Our accelerator kernel follows an implementation strategy proposed by Zhang et al. [6], which is described in Figure 7. Note that we do not claim that this is the optimal convolution accelerator implementation. Finding the best design instance often requires extensive exploration of a large design space involving loop transformations and data layout optimizations [15], and this problem is completely orthogonal to our approach. One important notion from their work that we also use is the tile. The CNN computations are largely data-parallel, and can be decomposed into smaller chunks to reduce on-chip memory cost. The granularity of these chunks is usually selected such that the accelerator fits on the target board. We use the word tile to refer to the unit of computation performed by the accelerator.

B. Accelerator with Error Detection

Our modified architecture is depicted in Figure 8. Two additional modules that compute the checksums are added between the convolution kernel and the FIFOs. The important property is that the checksums are computed in a streaming fashion, such that the latency of the accelerator is unaffected. The entire accelerator sits within its own clock domain, managed by software through dynamic reconfiguration of the Mixed-Mode Clock Manager (MMCM) module.

To preserve the overall performance of the accelerator, both the input- and output-checksum components are designed to sustain the same I/O throughput as the convolution accelerator. Our current implementation supports up to one input datum per cycle, but higher throughput is possible by exposing more parallelism in the hardware component, which can easily be achieved in HLS through an unroll directive.

In our proposed design, the entire accelerator operates in the overclocked domain. We made this choice to simplify the design of the checksum components; more resources would have been needed to compensate for the clock speed gap.

for (to = 0; to < M; to++) {
  for (row = 0; row < R; row++) {
    for (col = 0; col < C; col++) {
#pragma pipeline
      for (i = 0; i < K; i++) {
#pragma unroll
        for (ti = 0; ti < N; ti++) {
#pragma unroll
          for (j = 0; j < K; j++) {
            y[to][row][col] +=
              w[to][ti][i][j] *
              x[ti][S*row+i][S*col+j];
          }
        }
      }
    }
  }
}

Fig. 7. The convolution kernel based on the design by Zhang et al. [6]; the HLS input code is shown above. The convolution kernel has three sources of parallelism. The main convolution has ample parallelism, which is used as the fine-grained parallelism (unrolling in HLS). The datapath is also aggressively pipelined to process different inputs. Furthermore, the datapath itself is replicated for convolutions applying different weights to the same input. The factor of unrolling and/or replication controls the throughput of the accelerator.

Fig. 8. Accelerator with Error Detection: the convolution accelerator, input-checksum, and output-checksum units sit in a dedicated clock domain, connected to the rest of the system through asynchronous FIFOs; a checksum mismatch raises an error signal.

C. Implementation of Checksum Calculations

The output-checksum is implemented as a simple accumulator over the convolution outputs, with a small area overhead compared to the rest of the datapath. The hardware component responsible for computing the input-checksum (Eq. 3) is more costly, as it involves a multiplier, storage for partial sums, and non-trivial control logic.

Both of these components are significantly less complex than the main kernel. The main performance constraint is to ensure that the data processing rate matches that of the convolution kernel. The checksum calculations need sufficient parallelism to keep up with the rate of input consumption as well as the rate of output production. The parallelism in these calculations is realized as aggressive pipelining (II = 1) to minimize area cost.

The output-checksum, which is an accumulation, has trivial parallelism to process the outputs. The computation is independent except for the aggregation at the end.

The input-checksum computes the input groups and the summation of the weights over M as the data is streamed through. The rest of the computation is then overlapped with the convolution kernel: the computation of the input groups (X), the multiplication by the summed weights, and the final accumulation. Note that the input memory transfer (and summations) for the next tile is overlapped with the kernel, the rest of the input-checksum calculation, and the output memory transfer for the previous tile. This ensures that the input-checksum calculation does not impact the overall latency of the accelerator.
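A minimal HLS-style sketch of the output-checksum unit is shown below: an accumulator pipelined at II = 1 that taps the output stream without adding latency. The stream widths, interface, and names are simplified placeholders, not our exact component.

#include <ap_int.h>
#include <hls_stream.h>

// Sketch: output-checksum as a streaming accumulator (II = 1).
// Taps each convolution output on its way to the output FIFO.
void output_checksum(hls::stream<ap_int<32>>& in,
                     hls::stream<ap_int<32>>& out,
                     ap_int<55>& sigma, int n_outputs) {
  ap_int<55> acc = 0;
  for (int k = 0; k < n_outputs; k++) {
#pragma HLS pipeline II=1
    ap_int<32> v = in.read();
    acc += v;      // full-precision accumulation (see Section III-G)
    out.write(v);  // pass-through: accelerator latency is unaffected
  }
  sigma = acc;
}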

D. Integration with Complete CNN Accelerators

As mentioned in Section II-B, the accelerator for the convolution kernel is only one component of a CNN accelerator. In this section, we discuss how our convolution accelerator can be integrated with different styles of CNN accelerators.

In the case of a single computation engine, using our approach is straightforward. The convolution accelerator operates within its specific clock domain, and asynchronous FIFOs are used to stream data in and out of the accelerator. As described in Section V, this architecture was chosen for our prototype.

Taking advantage of our technique in streaming accelerator or fused accelerator architectures is not more difficult. Convolution layers can be mapped to different pieces of hardware; thus, they may be overclocked at different frequencies, depending on the number of clock managers available on the device (Virtex-7 devices have up to 24 such components). When the number of clock managers is not high enough, the layers may need to be clustered into a smaller set of clock domains. Asynchronous FIFOs must be used for data transfers between clock domains.

Note that RELU and max-pool layers cannot benefit from ABFT, and thus they must either operate at a safe clock speed or use alternative timing error detection techniques. Since these layers are not computational bottlenecks, this does not impact the performance of the accelerator.

E. Failure Recovery Cost

The misspeculation overhead depends on the system-level architecture discussed in Section II-B. This section proposes an analytical model of the misspeculation cost.

For a single computation engine, an erroneous tile is simply fed back to the accelerator at a later time. The macro-pipeline does not have to be stalled because the tiles are parallel; they can be executed in any order.

For a streaming accelerator, the overhead depends on the details of the architecture. If there are buffers between layers, e.g., for crossing FPGA boundaries, then a full pipeline stall can be avoided. In the worst case, the full pipeline must be restarted, discarding all intermediate results.

As a simplistic model, consider the following:

• each tile takes 1 unit of time at the baseline frequency;
• failed tiles are re-executed at the baseline frequency;
• N: total number of tiles;
• O: mean overclocked frequency, normalized to the baseline;
• E: error rate, i.e., the probability that a tile fails;
• S: number of stages in the SA.

Note that the time it takes to switch the frequency is negligible (less than 1% of a tile computation in our prototype).

The total time T can then be modeled as:

T = N/O + S · N · E

where the SCE is a special case with S = 1. The gain from overclocking is N − N/O, which must exceed the overhead S · N · E for overclocking to be profitable. Clearly, the error rate plays an important role, and we may derive the break-even point (overhead = gain) as:

E = (O − 1) / S

In other words, even for a small gain in frequency, completely negating that gain takes a high error rate. For instance, with 25% overclocking on average on a 5-stage SA architecture, a 5% error rate is required to negate the gain. With a strict dynamic scaling policy that only attempts to increase the frequency at long intervals once an error is detected, the error rate is extremely low (< 0.1%), rendering the recovery cost negligible. In addition, we expect that error-free execution is not necessary in most use cases.
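The model is easy to evaluate directly; the sketch below reproduces the break-even example above (O = 1.25 and S = 5 give E = 5%).

#include <cstdio>

// Sketch of the misspeculation cost model from Section IV-E:
// T = N/O + S*N*E ; break-even error rate E = (O - 1) / S.
int main() {
  double N = 1000, O = 1.25, S = 5;
  double E_breakeven = (O - 1) / S;     // 0.05, i.e., 5%
  double E = 0.001;                     // a typical observed error rate
  double T = N / O + S * N * E;         // total time in tile units
  printf("break-even E = %.1f%%, T = %.1f (vs. %.1f at baseline)\n",
         100 * E_breakeven, T, N);
  return 0;
}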

In the case where the application can tolerate a few errors (e.g., misclassifications), for instance in a non-critical context, a small number of erroneous tiles may be tolerated. In this specific case, the detected errors do not need to be corrected. Hence, there is no additional latency overhead with our approach. However, precisely finding the tolerable amount of errors is use-case specific and orthogonal to this work.

F. Frequency Scaling

This section discusses how to achieve high throughput given an accelerator with an adjustable frequency and online error detection capability.

As discussed in Section IV-E, the error rate is the main factor that determines throughput. Figure 9 illustrates this behavior. It shows the normalized throughput (relative to the non-overclocked throughput), taking into account the misspeculation cost, over a frequency range, based on the macro-pipeline depth and the measured error rate. We measured the error rate and the average processing time for 100,000 tiles computed from uniform random data per frequency, with a step of 0.25 MHz around the frequency where errors start to occur. We then calculated the effective throughput using the model described in Section IV-E. We can see that the resulting throughput is linear in the frequency until errors occur. The macro-pipeline depth influences the throughput penalty, but the main factor is the error rate.

As the environment (temperature, input data, etc.) changes, the frequency at which the maximum throughput is achieved may change over time and cannot be known in advance, as exemplified by the red arrows in the figure. Note that these results reflect only one instance of the experiment under specific operating conditions, and the results may differ depending on the environment.

Therefore, there should be a dynamic mechanism able to select the suitable frequency, i.e., the highest one without errors, while keeping the error rate as low as possible. As the target may change over time, the dynamic mechanism needs to be used during the whole application execution, not only at the beginning.

We propose a simple algorithm to dynamically scale the frequency at runtime. The algorithm has two parameters:

• G: the granularity of the frequency adjustment at a time;
• I: the interval, i.e., the number of successful tiles before attempting to increase the frequency.

The algorithm initially increases the frequency by G after each tile until an error is observed. Then the algorithm repeatedly takes one of the following actions:

• decrease by G if an error is observed;
• increase by G after I successful tiles.

Parameters I and G expose a tradeoff between reactivity and error rate, and they must be chosen by the user.
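A sketch of this controller is shown below; run_tile() and set_frequency() are hypothetical hooks standing in for the tile execution/checksum comparison and the MMCM reconfiguration driver.

// Hypothetical platform hooks (not part of our published interface):
bool run_tile();                 // executes one tile; true if checksums mismatch
void set_frequency(double mhz);  // reconfigures the MMCM clock output

// Sketch of the dynamic frequency scaling policy (Section IV-F).
// Runs for the lifetime of the application; clamping to a minimum
// frequency is omitted for brevity.
void scale_frequency(double f_base, double G, int I) {
  double f = f_base;
  int successes = 0;
  bool seen_error = false;
  while (true) {
    bool error = run_tile();
    if (error) {
      f -= G;                    // back off on error
      successes = 0;
      seen_error = true;
    } else if (!seen_error) {
      f += G;                    // initial ramp: raise after every tile
    } else if (++successes >= I) {
      f += G;                    // steady state: raise after I clean tiles
      successes = 0;
    }
    set_frequency(f);
  }
}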

This algorithm is very simple, and we acknowledge that more sophisticated control algorithms could be used. However, our results (see Section V-F) show that this approach performs well in our context.

V. EXPERIMENTAL RESULTS

We present our empirical study in this section. The study consists of two parts: (i) the area overhead, and (ii) frequency scaling on AlexNet image classification.

A. Experimental Platform

The whole accelerator was designed using SDSoC (2018.2), demonstrating the suitability of our approach to modern FPGA tools that operate at higher levels of abstraction. We use the ZC706 board (XC7Z045) from the Xilinx Zynq-7000 family as our target platform.

The hardware designs used in this section target the fifth convolutional layer of the AlexNet CNN architecture [7] with the following standard configuration: N = 192, M = 128, R = C = 13, K = 3, and S = 1. Preliminary experiments showed that this layer was the one with the most impact on the classification rate when affected by timing errors. Note that the third and fourth layers of AlexNet also have similar configurations with larger N and/or M.

B. Area Results

We report results for various kernel sizes (K) and wordlengths. The wordlengths are denoted as D × W, where D is the input data wordlength and W is the weight data wordlength. The other design parameters (tile sizes and unroll factors) were selected to achieve the highest throughput on the target board with the largest wordlengths, and the same parameters were used for the lower wordlength designs (except for K = 11, where an unroll factor was divided by 2).

All designs were synthesized with several target frequencies in order to find the maximum achievable STA frequency. All reported numbers are after place-and-route (P&R).

The amount of hardware resources used by the different hardware configurations is shown in Figure 10, from which we can make the following observations:


Fig. 9. Normalized throughput as a function of the frequency for three scenarios (S = 1, S = 10, S = 100) computed from the model introduced in Section IV-E using empirically collected error rates and processing times. The left plot shows the full frequency range (140-240 MHz); the red arrows illustrate variability: the frequency at which the maximum throughput is achieved cannot be known in advance and can change. The right plot shows the same data over a narrower frequency range (236-242 MHz) where errors start to manifest; the labels correspond to error rates, ranging from 0% to 100%. At a given frequency, the error rate is the same for the three scenarios, the only difference being the misspeculation cost.

• The DSP and BRAM cost, which are often the limiting resources for CNN implementations, is mostly unaffected by the ABFT mechanism.

• There is a small overhead in terms of slice count.

Overall, our results show that the overhead is negligible.

For the sake of completeness, we also compared our approach to error detection techniques based on Residue Number System (RNS) protection, as proposed in previous work [16]. We considered RNS protection using mod 3, mod 7, and mod 15 residue codes protecting 16-bit data. Protection with the mod 3 code is the RNS error detection strategy with the lowest possible area overhead, and also the lowest error coverage.

Figure 10 shows the area overhead of RNS protection compared to the unprotected baseline. These results show that our ABFT approach has a significantly lower area footprint (1.3k slices vs. 2.7k/8k/15.2k slices). This is despite the fact that ABFT provides much better error coverage than RNS. For example, mod 3 RNS provides only 66% error coverage in the case of multiple bit-flips, even when these bit-flips are caused by a single timing error. In addition, RNS codes are much more fine-grained, and additional mechanisms are needed to protect the full accelerator in the same manner as our approach. For instance, errors in control logic may cause a loop to be skipped, which would go undetected by typical uses of RNS codes if the check happens at the end of the loop.

C. Performance Results

The overclocked frequencies enabled by our approach are shown in Figure 11 for accelerators with various wordlengths and K = 3. For all data wordlengths except W = 1 bit, the results show a mean performance improvement of 68%, with little difference across configurations. For the W = 1 bit cases, the performance improvement is more modest (21.3% and 26.6%) and, for the 1 × 1 case, well below the average 50% improvement reported by Nunez-Yanez [8] with the Elongate framework. We believe that this is because our accelerator is designed to work for multiple wordlengths, and is not specialized for binary CNNs. In particular, we observed that for designs with 1-bit weights, errors manifest much earlier.

Figure 14a shows the performance of the accelerators with and without overclocking; the mean improvement was 54%. The ABFT-enabled frequency would be the one eventually reached by a monitoring system in the same environment. The potential failure recovery penalty is not taken into account, as it was shown to be negligible in Section IV-E.

D. Observed Errors

We also took a detailed look at how timing errors affect the output of convolution layers, and observed that LSBs have much higher chances of being corrupted in all frequency ranges, as shown in Figure 12. This may seem counter-intuitive, but it makes sense, as the routing phase tries to balance paths. On FPGAs, routing has become an important portion of the delays, and it impacts the timing behavior of the circuit as well as the combinational paths. Furthermore, the dot product operations use DSPs, which are black boxes whose timing behavior cannot easily be analyzed.

At frequency ranges where timing errors start to manifest, almost all errors are at the LSBs. Many of these errors are benign, since some of the LSBs are usually truncated to match the data representation used in the next CNN layer. This gives an opportunity to detect timing errors before they impact the final output, with no penalty, as the tile does not need to be recomputed. The convolution is computed in an exact manner, with the intermediate results stored with enough bits for the multiply-accumulate chain. The output-checksum is computed on the untruncated output; otherwise the invariant would not hold due to rounding errors. Therefore, a timing error may eventually be masked by the truncation, resulting in correct truncated outputs although the checksums do not match. When comparing the two checksums, it is possible to know whether an error affected the truncated output by looking at which bits are incorrect. Assuming that only one error happened, and truncation of T LSBs, the error did not affect the truncated output if the error in the checksum is located in the T LSBs. For such errors, recomputing the tile is unnecessary, but this information can still be used to tune the frequency and avoid subsequent errors.
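A sketch of this check is given below, under the single-error assumption (T is the number of truncated LSBs; the names are illustrative). The comparison simply asks whether the checksum mismatch is confined to the truncated bits.

#include <cstdint>
#include <cstdlib>

// Sketch: decide whether a detected error is masked by truncation of T LSBs.
// Under the single-error assumption, the error is benign if the checksum
// difference fits entirely within the T truncated bits.
bool error_is_benign(int64_t sigma, int64_t rho, int T) {
  uint64_t diff = (uint64_t)std::llabs((long long)(sigma - rho));
  return diff < (uint64_t{1} << T);  // mismatch confined to the T LSBs
}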

Fig. 10. Area results for ZC706 after P&R with and without error detection for different configurations (BRAM / DSP / SLICE):

Configuration          Base                 Base+ABFT
1 × 1 bits, K = 3      16 / 1 / 7.7k        16 / 1 / 8.5k
8 × 8 bits, K = 3      112 / 514 / 13.9k    112 / 514 / 14.9k
16 × 4 bits, K = 3     112 / 512 / 11.6k    112 / 512 / 13.2k
16 × 8 bits, K = 3     112 / 512 / 14.4k    112 / 512 / 15.8k
16 × 16 bits, K = 3    128 / 512 / 20.1k    128 / 512 / 21.5k
16 × 8 bits, K = 5     160 / 512 / 19.5k    160 / 512 / 20.2k
16 × 8 bits, K = 11    672 / 261 / 8.5k     673 / 264 / 9.7k

Error detection overhead for the 16 × 16 bits design, ABFT vs. RNS (slices): ABFT 1.3k, mod 3 RNS 2.7k, mod 7 RNS 8.0k, mod 15 RNS 15.2k.

Fig. 11. ZC706 clock speed by Static Timing Analysis (STA), by our proposed approach (ABFT) for a 0.1% tile error rate at room temperature (20°C), and by the Elongate technique proposed by Nunez-Yanez [8], for the different wordlength configurations.


E. False Negatives

As discussed in Section III-G, the risk of false negatives exists. The rate of false negatives depends on several factors, such as the overclocking factor, the fabric, the wordlengths of the data and operators, the design, etc.

We ran the 8 × 8 and 4 × 4 bits accelerators over a range of frequencies to empirically show that false negatives are highly improbable. We define the false negative rate as the ratio of the number of missed erroneous tiles to the number of erroneous tiles. We computed more than 700,000 tiles for 21 different frequencies, selected to observe 0% to 10% failed tiles. We deliberately overclocked the circuit until large numbers of errors were observed: since the false negative rate is expected to be extremely small, we would not be able to observe false negatives under a normal configuration (with less than 0.1% error rate) in a reasonable amount of time.

Fig. 12. Observed bit-wise error rate in the outputs with the 16 × 16 bits design, with uniform random inputs (error rate in % per output bit, 0 to 15, over the 236-246 MHz range).

We did not observe any false negatives for the 8 × 8 bits configuration, but observed some for the 4 × 4 bits configuration. This agrees with our analysis in Section III-G, which shows that designs with shorter checksum lengths are more likely to have false negatives. The false negative rate was at most 0.4% of the incorrectly computed tiles. This translates to 0.06% of the total number of tiles computed at the same frequency, which is being aggressively overclocked to exhibit many errors. The false negative rate is expected to be much lower in a normal setting where the error rate is much lower. Therefore, we claim that false negatives are not an issue in our approach.

Fig. 13. The fifth layer of AlexNet with frequency scaling for the classification of 1000 images (= tiles for this configuration) using the 16 × 16 bits design. The algorithm in Section IV-F was configured with G = 1 MHz and I = 100.

F. Dynamic Frequency Scaling

Figure 13 shows the result of applying frequency scaling on the fifth layer of AlexNet with the 16 × 16 bits design. The frequency scales up to around 230 MHz and then stabilizes with occasional bumps. The mean frequency over the 1000 images was 225.5 MHz. Compared to the baseline frequency, this is about a 66% increase in throughput. Note that this result is from a single board and a single setup; the gains are expected to vary under different operating conditions.

G. Impact on Energy Efficiency

Since energy efficiency is one of the reasons for resorting to FPGAs, we measured the impact of our technique on the overall energy efficiency of the system. The power dissipated by an FPGA accelerator can be decomposed into a static power contribution P_stat and a dynamic power contribution P_dyn, the latter being proportional to the circuit clock frequency.

Because FPGAs are built out of SRAM technology, they tend to have a relatively high static-to-dynamic power ratio. In addition, for a fair analysis, the value P_stat must also include the static power contributions from all the components of the system: external memory, peripherals, regulators, etc.

Our results are provided in Figure 14b, which shows the energy efficiency in Giga-operations per Joule (Gop/J) for the different wordlengths. We measured the execution time of the accelerator computing 64 images, and the power consumption of the entire ZC706 board (except the power adapter) during that time. As expected, our overclocking scheme impacts only a fraction of the total power consumption of the system, and improves the overall energy efficiency of the system by 46% on average. This is perfectly in line with prior observations [8], [17].

VI. DISCUSSION AND RELATED WORK

CNN is a rapidly changing field, and many approaches to hardware acceleration have been proposed in the literature.

Fig. 14. Performance comparison for the different wordlengths in (a) throughput (Giga-operations per second; Gop/s) and (b) energy efficiency (Gop/J) at two frequencies: Base (f_STA) and the ABFT-enabled frequency (f_oc), which is the maximum frequency at which we did not observe any error during the experiments.

In the following, we review several CNN strategies and design choices, and discuss their impact on the feasibility and efficiency of our proposed ABFT scheme.

A. Data Representations

Early implementations of CNNs on FPGAs used floating-point arithmetic to perform the convolution operations [6]. Floating-point arithmetic is not associative and leads to rounding errors during computations. Thus, the values of the input- and output-checksums obtained through ABFT may not match even in the absence of timing errors.

The invariant may be relaxed to tolerate a certain error margin to alleviate this issue, but at the price of significant increases in the rates of both false positives and false negatives. Since the checksum computations are linear, it is possible to derive analytical models of the error distribution and to determine the most adequate threshold. Moreover, such relaxations do not affect the capability to detect large errors, e.g., corrupted exponents.


In this work, we focused on fixed-point implementations, since it has been shown that CNN inference can be performed with fixed-point arithmetic (including ternary and binary neural networks, which are special cases of integer arithmetic) with little or no impact on classification accuracy.

B. Algorithmic Variations

Independently of the choice of data representation, much research has proposed algorithm-level optimizations to improve the performance of CNNs.

For example, FFT-based convolutions have been used to improve the performance of CNN accelerators [18], [19]. Implementing the convolution in the frequency domain helps reduce the overall computational complexity from O(N²) to O(N log N). However, all existing implementations are based on floating-point arithmetic, for which our proposed ABFT scheme cannot guarantee the same error coverage due to rounding errors, as explained above.

Winograd-based convolutions have been shown to improve the performance of CNNs on FPGAs [20] by reducing the average number of multiplications required per convolution output. However, this reduction comes at the price of an increased number of additions and more complex reuse patterns. In contrast to FFT-based convolutions, Winograd-based CNNs do not require intermediate quantization stages. As a consequence, our approach can be applied at the layer level almost as is.

Weight pruning is an effective technique to reduce storage/bandwidth requirements when implementing CNNs. Earlier results [21], [22] have shown that up to 90% of the weights can be pruned without impacting classification accuracy. Efficiently implementing pruned CNNs requires convolution accelerators that differ from the one used in this work, as they operate on sparse data. Nevertheless, our error detection scheme can be reused as is; the only difference is a slight increase in its relative overhead. Since the number of operations in a 90% pruned network is one-tenth that of an unpruned network, the relative cost of ABFT is ten times higher. This is at most 3% for the configuration shown in Figure 6, compared to more than 50% speed improvement.
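Spelling out the scaling argument (the 0.3% dense overhead below is the value implied by the 3% figure above, not an independently reported number):

    \mathrm{overhead}_{pruned} = \frac{C_{checksum}}{0.1 \cdot C_{conv}} = 10 \cdot \mathrm{overhead}_{dense} \approx 10 \times 0.3\% = 3\%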

C. Comparison with Existing ABFT Techniques

Our approach is similar to other ABFT techniques presented in previous work [9], [23]. The idea behind Algorithm-Based Fault Tolerance (ABFT) is to use high-level algebraic properties (or algebraic invariants) relating the inputs and outputs of an algorithm. When the results do not satisfy these invariants, at least one error has occurred during the execution. In some cases, the residual of the violated invariant may even be used to locate (and correct) errors.

ABFT was initially proposed for error detection and correction in matrix operations [9], but was later adapted to other algebraic kernels, such as matrix decompositions (QR, LU) [24], iterative stencils [25], and Fast Fourier Transforms [24]. The original ABFT for matrix product is illustrated in Figure 15.

The convolution layer and the fully-connected layer can be viewed as collections of matrix products, making the classical ABFT directly applicable.


Fig. 15. ABFT for matrix product [9]. The input matrices are augmented with an additional row and column (shaded) that are column-wise and row-wise checksums, respectively. The product of the augmented matrices has an additional row and column that are themselves checksums of the inner sub-matrix (the original output without augmentation), due to the algebraic property. In this example, the value at the 4th row, 2nd column is incorrect (it should be 2). This can be detected and located by the mismatch between the row-wise and column-wise checksums computed during the matrix product and those computed separately from the output matrix.
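To make the invariant concrete, here is a minimal, self-contained sketch of checksum-based detection and location for a matrix product (a toy integer version of the scheme in Figure 15, not the accelerator implementation evaluated in this paper):

    #include <cstdio>
    #include <vector>

    using Mat = std::vector<std::vector<long>>;

    // Plain matrix product C = A * B.
    Mat matmul(const Mat& A, const Mat& B) {
        size_t m = A.size(), n = B.size(), p = B[0].size();
        Mat C(m, std::vector<long>(p, 0));
        for (size_t i = 0; i < m; ++i)
            for (size_t k = 0; k < n; ++k)
                for (size_t j = 0; j < p; ++j)
                    C[i][j] += A[i][k] * B[k][j];
        return C;
    }

    // Append a row of column-wise sums to A.
    Mat add_checksum_row(Mat A) {
        std::vector<long> sums(A[0].size(), 0);
        for (const auto& row : A)
            for (size_t j = 0; j < row.size(); ++j) sums[j] += row[j];
        A.push_back(sums);
        return A;
    }

    // Append a column of row-wise sums to B.
    Mat add_checksum_col(Mat B) {
        for (auto& row : B) {
            long s = 0;
            for (long v : row) s += v;
            row.push_back(s);
        }
        return B;
    }

    int main() {
        Mat A = {{1, 2}, {3, 4}};
        Mat B = {{5, 6}, {7, 8}};
        // The 3x3 product of the augmented 3x2 and 2x3 matrices carries a
        // checksum row and a checksum column for free.
        Mat C = matmul(add_checksum_row(A), add_checksum_col(B));

        C[1][0] += 9;  // inject a fault into one inner output element

        // A mismatching row sum and column sum jointly locate the fault.
        int bad_row = -1, bad_col = -1;
        for (int i = 0; i < 2; ++i)
            if (C[i][0] + C[i][1] != C[i][2]) bad_row = i;
        for (int j = 0; j < 2; ++j)
            if (C[0][j] + C[1][j] != C[2][j]) bad_col = j;
        if (bad_row >= 0 && bad_col >= 0)
            std::printf("error located at (%d, %d)\n", bad_row, bad_col); // (1, 0)
        return 0;
    }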

The lightweight checksum calculation proposed in this paper is not a direct application of the ABFT for matrix operations. We use algorithmic invariants over collections of convolutions to further reduce the cost of the checksums. This is evident in the fact that we reduce the algorithmic complexity of the checksums by two degrees, exploiting reuse along two dimensions, whereas the original ABFT brings one degree of savings. Some ABFT extensions employ a variant of the original method by identifying pieces of computations that can be viewed as matrix operations [25], [26]. However, ABFT is a more general concept that can be applied to other computations, e.g., FFT [23]. The original ABFT for matrix operations has been used in the context of FPGAs as well [27], [28]. However, these earlier works did not consider ABFT as a means to support overclocking.
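The layer-level checksums rest on the same kind of linearity. As a rough illustration only (a simplified 1-D, single-channel version, not the exact two-dimensional scheme of this paper): since convolution is linear in the filter, the sum over F filter outputs at each position must equal the output of a single checksum filter obtained by summing the F filters, so one extra convolution checks all F.

    #include <cstdio>

    const int N = 8, K = 3, F = 4;  // input length, kernel size, filter count

    // Toy 1-D integer convolution: y[i] = sum_j x[i+j] * w[j].
    void conv(const int* x, const int* w, int* y) {
        for (int i = 0; i < N - K + 1; ++i) {
            y[i] = 0;
            for (int j = 0; j < K; ++j) y[i] += x[i + j] * w[j];
        }
    }

    int main() {
        int x[N] = {1, 3, 0, 2, 5, 1, 4, 2};
        int w[F][K] = {{1, 0, 2}, {2, 1, 0}, {0, 3, 1}, {1, 1, 1}};
        int y[F][N - K + 1];

        for (int f = 0; f < F; ++f) conv(x, w[f], y[f]);  // protected work

        // Checksum filter: element-wise sum of all F filters.
        int wsum[K] = {0, 0, 0};
        for (int f = 0; f < F; ++f)
            for (int j = 0; j < K; ++j) wsum[j] += w[f][j];

        int ychk[N - K + 1];
        conv(x, wsum, ychk);  // one extra convolution instead of F

        // Invariant: at each position, the sum of the F outputs must equal
        // the checksum filter's output (exact in integer arithmetic).
        for (int i = 0; i < N - K + 1; ++i) {
            int s = 0;
            for (int f = 0; f < F; ++f) s += y[f][i];
            if (s != ychk[i]) std::printf("timing error suspected at %d\n", i);
        }
        return 0;
    }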

Some ABFT techniques employ two or more checksums to enable error correction. It would be possible to use a similar method for the convolution layer by giving up one degree of complexity savings. However, this is not attractive, because:

• Checksum-based error correction fails when two or more errors occur (unless one gives up further complexity reduction), which is frequently the case with overclocking.

• Recovery by recomputing the erroneous outputs does not add significant cost, as discussed in Section IV-E.

• Adding error recovery logic in hardware can be costly in terms of area.

D. Other Techniques for Timing Error Detection

Other error detection techniques, such as those based on Residue Number Systems (RNS), could be used. RNS protection provides low-overhead error detection for convolution kernels [16]. However, it provides limited protection against typical timing errors, which impact several outputs at a time. We compared the area overhead of RNS with that of our approach in Section V-B.
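For context, a sketch of the residue-code idea on a multiply-accumulate (a generic textbook illustration, not the transposed-FIR design of [16]; the modulus M = 251 is an arbitrary choice for the example). The check exploits the fact that reduction modulo M is a ring homomorphism:

    #include <cstdio>

    const int M = 251;  // check modulus (real designs pick moduli matched to
                        // the datapath width, e.g. 2^k - 1)

    int main() {
        int a[4] = {120, 37, 200, 5};
        int b[4] = {14, 99, 3, 250};

        long acc = 0;     // main datapath: full-width multiply-accumulate
        int acc_mod = 0;  // shadow datapath: the same MAC modulo M (narrow)
        for (int i = 0; i < 4; ++i) {
            acc += (long)a[i] * b[i];
            acc_mod = (acc_mod + (a[i] % M) * (b[i] % M)) % M;
        }

        // Residue invariant: the dot product reduced modulo M must agree
        // with the result of the modular shadow datapath.
        if (acc % M != acc_mod)
            std::printf("timing error suspected\n");
        else
            std::printf("residues agree: %ld mod %d = %d\n", acc, M, acc_mod);
        return 0;
    }

A single corrupted output is caught with high probability, but a timing fault that corrupts the main and shadow datapaths consistently, or several outputs at once, can escape the check, which is the limitation noted above.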

Some error detection techniques have been designed specifically to detect timing errors. Both Razor [4] and online slack measurement [5], [14] use shadow registers to detect timing violations or to measure timing slacks.


These approaches adjust the frequency or voltage using the measured errors or timing slack, which is similar to our work. The main difference between our approach and those based on shadow registers is that we provide error detection at the algorithm level, in contrast to the register/circuit level. This makes our approach more flexible, as it can be reused without additional design effort over a large design space (tile size, unrolling, data wordlength, etc.) of CNN accelerators. In contrast, circuit-level protections must be redesigned for every new design. Although some proof-of-concept toolchains have been proposed [8], [29] to automate such approaches, they are not publicly available, and it is difficult to assess how robust they are in practice.

Existing circuit-level techniques trade coverage for area overhead. One of the main difficulties is to select the smallest subset of registers to protect/measure while keeping the overhead low. This process is complicated because the addition of shadow registers may itself impact timing, and it is also limited by the ability of static timing analysis to identify the (near-)critical paths. Although a low overhead (≤ 4%) has been claimed for small kernels prototyped on FPGAs [5], the overhead of a variant of Razor added to an ASIC implementation of the ARM Cortex-M3 [13] is reported to be much higher, at 20% or more depending on the expected coverage.

It is also unclear how device-level variability impacts error coverage. Because of the large variability in logic block delays within a single chip (see the work of Gojman et al. [30]), the 10% longest paths reported by STA are seldom the actual longest paths for the device at hand. Therefore, it is difficult to evaluate the actual error coverage of such approaches.

More importantly, as acknowledged by Nunez-Yanez [8], key FPGA components such as DSP blocks include internal (e.g., pipelining) registers that cannot be protected due to the lack of adequate routing resources. Given that CNN workloads mainly consist of multiply-accumulate patterns, such error detectors have limited applicability to FPGA designs. Specifically, they are limited to binary neural network accelerators that do not use the accumulators in DSP blocks.

In spite of these challenges, Nunez-Yanez [8] has used this approach to explore energy/performance trade-offs using DVFS on FPGAs in the context of binary neural networks. The results reported in that work are in line with our observations (50% performance margins are common for Zynq) and suggest that more recent generations of FPGAs, such as the UltraScale, can tolerate even larger overclocking margins, possibly due to the increasing variability in smaller process nodes. The main difference lies in the results obtained for the overclocking margin of 1-bit CNNs. However, this is completely orthogonal to the error detection technique and is more related to the convolution implementation.

Our approach does not share these limitations of circuit-level techniques, because it operates at the algorithm level. It provides excellent error coverage for a lower hardware overhead: less than 1%, compared to reference values of around 5% reported in prior work [5], [14].

The main limitation of our work is that the computation must have algebraic properties that enable lightweight error detection. Although convolution is not the only computation for which ABFT is available, our technique is not applicable to arbitrary computations, unlike those based on shadow registers. An interesting direction for future work is to automatically identify the necessary properties to improve the applicability of our technique.

E. Applicability to Other Architectures

Although the proposed approach is specifically tailored to FPGA accelerators, we expect our technique to be useful on other types of architectures (CPUs, GPUs).

For example, Leng et al. [31] evaluated how timing errors due to voltage reduction (DVFS) impact the correctness of program outputs. Their results show that the impact of timing errors is highly dependent on program behavior. They also show that, under aggressive voltage reduction, convolution and FFT kernels induce many silent data errors. These results suggest that our approach would be directly applicable to GPU-accelerated CNNs, provided that fixed-point arithmetic is used.

We also expect the technique to be applicable to in-order CPU micro-architectures with simple control logic, such as those used for CNN workloads in embedded systems [32].

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we propose timing speculation coupled with lightweight error detection as an approach to further improve the performance of hardware accelerators for CNNs. We have demonstrated the efficacy of our approach with a prototype implementation and an extensive empirical study. In addition, our approach is well suited for implementation in high-level design tools such as Vivado HLS/SDSoC, which are becoming more and more attractive for productivity reasons.

We believe that similar techniques can be applied to many other application domains (bioinformatics, iterative solvers, etc.) by taking advantage of existing ABFT techniques, or by devising new algorithms tailored to this task.

ACKNOWLEDGEMENTS

This work was partially funded by the Brittany Region.

REFERENCES

[1] S. Chaudhuri, J. S. J. Wong, and P. Y. K. Cheung, "Timing speculation in FPGAs: Probabilistic inference of data dependent failure rates," in Proceedings of the 2011 International Conference on Field-Programmable Technology, ser. FPT '11, 2011, pp. 1–8.

[2] G. Li, S. K. S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and S. W. Keckler, "Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17, 2017, pp. 8:1–8:12.

[3] K. Shi, D. Boland, and G. A. Constantinides, "Accuracy-Performance Tradeoffs on an FPGA through Overclocking," in Proceedings of the IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '13, 2013, pp. 29–36.

[4] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," in Proceedings of the 36th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 36, 2003, pp. 7–.

[5] J. M. Levine, E. Stott, and P. Y. Cheung, "Dynamic Voltage & Frequency Scaling with Online Slack Measurement," in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '14, 2014, pp. 65–74.

[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '15, 2015, pp. 161–170.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[8] J. L. Nunez-Yanez, "Energy proportional neural network inference with adaptive voltage and frequency scaling," IEEE Transactions on Computers, pp. 1–1, 2018.

[9] K.-H. Huang and J. A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518–528, 1984.

[10] Z. Chen, "Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '13, 2013, pp. 167–176.

[11] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, "Algorithm-based Fault Tolerance for Dense Matrix Factorizations," in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '12, 2012, pp. 225–234.

[12] S. I. Venieris, A. Kouris, and C.-S. Bouganis, "Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions," ACM Comput. Surv., vol. 51, no. 3, pp. 56:1–56:39, 2018.

[13] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. M. Harris, D. Blaauw, and D. Sylvester, "Bubble Razor: Eliminating Timing Margins in an ARM Cortex-M3 Processor in 45 nm CMOS Using Architecturally Independent Error Detection and Correction," IEEE Journal of Solid-State Circuits, vol. 48, no. 1, pp. 66–81, 2013.

[14] J. L. Nunez-Yanez, M. Hosseinabady, and A. Beldachi, "Energy optimization in commercial FPGAs with voltage, frequency and logic scaling," IEEE Transactions on Computers, vol. 65, no. 5, pp. 1484–1493, May 2016.

[15] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, and M. Herbordt, "A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing," in Proceedings of the 28th International Conference on Field Programmable Logic and Applications, ser. FPL '18. IEEE, 2018, pp. 394–3944.

[16] S. J. Piestrak and P. Patronik, "Design of Fault-Secure Transposed FIR Filters Protected Using Residue Codes," in Proceedings of the 17th Euromicro Conference on Digital System Design, 2014, pp. 575–582.

[17] T. Yuki and S. Rajopadhye, "Folklore confirmed: Compiling for speed = compiling for energy," in Proceedings of the 26th International Workshop on Languages and Compilers for Parallel Computing, ser. LCPC '13, Sep. 2013.

[18] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., "CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-circulant Weight Matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395–408.

[19] C. Zhang and V. Prasanna, "Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 35–44.

[20] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," in Proceedings of the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '17. IEEE, 2017, pp. 101–108.

[21] S. Li, W. Wen, Y. Wang, S. Han, Y. Chen, and H. Li, "An FPGA design framework for CNN sparsification and acceleration," in Proceedings of the IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '17, April 2017, pp. 28–28.

[22] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS '15. Cambridge, MA, USA: MIT Press, 2015, pp. 1135–1143.

[23] S.-J. Wang and N. K. Jha, "Algorithm-based fault tolerance for FFT networks," IEEE Transactions on Computers, vol. 43, no. 7, pp. 849–854, 1994.

[24] A. L. N. Reddy and P. Banerjee, "Algorithm-based fault detection for signal processing applications," IEEE Transactions on Computers, vol. 39, no. 10, pp. 1304–1308, 1990.

[25] A. Roy-Chowdhury, N. Bellas, and P. Banerjee, "Algorithm-based error-detection schemes for iterative solution of partial differential equations," IEEE Transactions on Computers, vol. 45, no. 4, pp. 394–407, 1996.

[26] G. Bosilca, A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra, "Composing resilience techniques: ABFT, periodic and incremental checkpointing," International Journal of Networking and Computing, vol. 5, no. 1, pp. 2–25, 2015.

[27] A. Jacobs, G. Cieslewski, and A. D. George, "Overhead and reliability analysis of algorithm-based fault tolerance in FPGA systems," in Proceedings of the 22nd International Conference on Field Programmable Logic and Applications, ser. FPL '12, 2012, pp. 300–306.

[28] J. J. Davis and P. Y. K. Cheung, "Achieving low-overhead fault tolerance for parallel accelerators with dynamic partial reconfiguration," in Proceedings of the 24th International Conference on Field Programmable Logic and Applications, ser. FPL '14, 2014, pp. 1–6.

[29] J. M. Levine, E. Stott, G. A. Constantinides, and P. Y. K. Cheung, "SMI: Slack Measurement Insertion for online timing monitoring in FPGAs," in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications, ser. FPL '13, 2013, pp. 1–4.

[30] B. Gojman, S. Nalmela, N. Mehta, N. Howarth, and A. Dehon, "GROK-LAB: Generating real on-chip knowledge for intra-cluster delays using timing extraction," ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 4, pp. 32:1–32:23, 2014.

[31] J. Leng, A. Buyuktosunoglu, R. Bertran, P. Bose, and V. J. Reddi, "Safe limits on voltage reduction efficiency in GPUs: A direct measurement approach," in Proceedings of the 48th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-48. IEEE, 2015, pp. 294–307.

[32] (2019) HERO: Open Heterogeneous Research Platform. [Online]. Available: https://pulp-platform.org/hero.html

Thibaut Marty obtained his BSc and MSc degrees from the University of Rennes, France, in 2015 and 2017, respectively. He is currently working toward the PhD degree in computer science at the University of Rennes, under the supervision of Steven Derrien and Tomofumi Yuki. His research interests include High-Level Synthesis and fault tolerance on FPGAs.

Steven Derrien obtained his PhD from the University of Rennes 1 in 2003, and is now a professor at the University of Rennes 1. He is also a member of the Cairn research group at IRISA. His research interests include High-Level Synthesis, loop parallelization, and reconfigurable systems design.

Tomofumi Yuki received his PhD in computer science from Colorado State University in 2012. He is currently a researcher at Inria, France. His main research interests are in parallel/high-performance computing, compilers and programming models for parallel/HPC, and High-Level Synthesis.