



Appears in the Proceedings of the 41st International Symposium on Computer Architecture, 2014

General-Purpose Code Acceleration with Limited-Precision Analog Computation

Renée St. Amant* Amir Yazdanbakhsh§ Jongse Park§ Bradley Thwaites§ Hadi Esmaeilzadeh§ Arjang Hassibi* Luis Ceze† Doug Burger‡

*University of Texas at Austin §Georgia Institute of Technology †University of Washington ‡Microsoft Research

[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]

Abstract

As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate "approximable" code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an "analog" neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of approximation-tolerant, emerging applications including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.

1. Introduction

Energy efficiency now fundamentally limits microprocessor performance gains. CMOS scaling no longer provides gains in efficiency commensurate with transistor density increases [16, 25]. As a result, both the semiconductor industry and the research community are increasingly focused on specialized accelerators, which provide large gains in efficiency and performance by restricting the workloads that benefit. The community is facing an "iron triangle"; we can choose any two of performance, efficiency, and generality at the expense of the third. Before the effective end of Dennard scaling, we improved all three consistently for decades. Solutions that improve performance and efficiency, while retaining as much generality as possible, are highly desirable, hence the growing interest in GPGPUs and FPGAs. A growing body of recent work [13, 17, 44, 2, 35, 8, 28, 38, 18, 51, 46] has focused on approximation as a strategy for the iron triangle. Many classes of applications can tolerate small errors in their outputs with no discernible loss in QoR (Quality of Result).

Many conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension—that of error.

Many design alternatives become possible once precision is relaxed. An obvious candidate is the use of analog circuits for computation. However, computation in the analog domain has several major challenges, even when small errors are permissible. First, analog circuits tend to be special purpose, good for only specific operations. Second, the bit widths they can accommodate are smaller than current floating-point standards (i.e., 32/64 bits), since the ranges must be represented by physical voltage or current levels. Another consideration is determining where the boundaries between digital and analog computation lie. Using individual analog operations will not be effective due to the overhead of A/D and D/A conversions. Finally, effective storage of temporary analog results is challenging in current CMOS technologies. These limitations have made it ineffective to design analog von Neumann processors that can be programmed with conventional languages.

Despite these challenges, the potential performance and energy gains from analog execution are highly attractive. An important challenge is thus to architect designs where a significant portion of the computation can be run in the analog domain, while also addressing the issues of value range, domain conversions, and relative error. Recent work on Neural Processing Units (NPUs) may provide a possible approach [18]. NPU-enabled systems rely on an algorithmic transformation that converts regions of approximable general-purpose code into a neural representation (specifically, multi-layer perceptrons) at compile time. At run time, the processor invokes the NPU instead of running the original code. NPUs have shown large performance and efficiency gains, since they subsume an entire code region (including all of the instruction fetch, decode, etc., overheads). They have an added advantage in that they convert many distinct code patterns into a common representation that can be run on a single physical accelerator, improving generality.

NPUs may be a good match for mixed-signal implementations for a number of reasons. First, prior research has shown that neural networks can be implemented in the analog domain to solve classes of domain-specific problems, such as pattern recognition [5, 47, 49, 32]. Second, the process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion.


Third, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary.

In this paper we evaluate an NPU design with mixed-signal components and develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate challenges and define potential solutions to enable effective mixed-signal NPU execution. The objective is to both bound application error to sufficiently low levels and achieve worthwhile performance or efficiency gains for general-purpose approximable code. This study makes the following four findings:

1. Due to range limitations, it is necessary to limit the scope of the analog execution to a single neuron; inter-neuron communication should be in the digital domain.

2. Again due to range issues, there is an interplay between the bit widths (inputs and weights) that neurons can use and the number of inputs that they can process. We found that the best design limited weights and inputs to eight bits, while also restricting the number of inputs to each neuron to eight. The input count limitation restricts the topological space of feasible neural networks.

3. We found that using a customized continuous-discrete learning method (CDLM) [10], which accounts for limited-precision computation at training time, is necessary to reduce error due to analog range limitations.

4. Given the analog-imposed topology restrictions, we found that using a Resilient Back Propagation (RPROP) [30] training algorithm can further reduce error over a conventional backpropagation algorithm.

We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. The latter three findings were all used at training time; we trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, plus RPROP and CDLM for training. Using these techniques together, we were able to bound error on all applications but one to a 10% limit, which is commensurate with entirely digital approximation techniques. The average time required to compute a neural result was 3.3× lower than with a previous digital implementation, with an additional energy savings of 12.1×. The performance gains result in an average full-application-level improvement of 3.7× and 23.3× in performance and energy-delay product, respectively. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous. While it may be possible to move more of the accelerator architecture design into the analog domain, the current mixed-signal design performs well enough that only 3% and 46% additional improvements in application-level energy consumption and performance are possible with improved accelerator designs. However, improving the performance of the analog NPU may lead to higher overall performance gains.

2. Overview and Background

Programming. We use a programming model similar to the one described in [18], which enables programmers to mark error-tolerant regions of code as candidates for transformation using a simple keyword, approximable. Explicit annotation of code for approximation is a common practice in approximate programming languages [45, 7]. A candidate region is an error-tolerant function of any size, containing function calls, loops, and complex control flow. Frequently executed functions provide a greater opportunity for gains. In addition to error tolerance, the candidate function must have well-defined inputs and outputs. That is, the number of inputs and outputs must be known at compile time. Additionally, the code region must not read any data other than its inputs, nor affect any data other than its outputs. No major changes are necessary to the programming language beyond adding the approximable keyword; a sketch of such an annotation appears below.
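For concreteness, here is a minimal sketch of what an annotated candidate region might look like, assuming a C-like syntax; the function body and its name are hypothetical and only illustrate the well-defined input/output requirement:

```c
/* Hypothetical example of the annotation. The #define only makes the
 * sketch compile as plain C; in the proposed model, approximable is a
 * language keyword understood by the compiler. */
#define approximable /* marker keyword */

/* Error-tolerant candidate: a fixed number of inputs and outputs, and it
 * reads and writes nothing but its arguments, so the compiler can learn
 * its input-to-output mapping and replace calls with an A-NPU invocation. */
approximable void rgb_to_luma(float r, float g, float b, float *luma)
{
    *luma = 0.299f * r + 0.587f * g + 0.114f * b;
}
```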

Exposing analog circuits to the compiler. Although an analog accelerator presents the opportunity for gains in efficiency over a digital NPU, it suffers from reduced accuracy and flexibility, which results in limitations on possible network topologies and limited-precision computation, potentially reducing the range of applications that can utilize the acceleration. These shortcomings at the hardware level, however, can be exposed as a high-level model and considered in the training phase. Four characteristics need to be exposed: (1) limited precision for input and output encoding, (2) limited precision for encoding weights, (3) the behavior of the activation function (sigmoid), and (4) the limited set of feasible neural topologies. Other low-level circuit behavior, such as the response to noise, can also be exposed to the compiler. Section 5 describes this necessary hardware/software interface in more detail; a sketch of how these characteristics might be packaged for the compiler appears at the end of this overview.

Analog neural accelerator circuit design. To extract the high-level model for the compiler and to be able to accelerate execution, we design mixed-signal neural hardware for multi-layer perceptrons. The accelerator must support a large enough variety of neural network topologies to be useful over a wide range of applications. As we will show, each application requires a different topology for the neural network that replaces its approximable regions of code. Section 4 describes a candidate A-NPU circuit design, and outlines the challenges and tradeoffs present in an analog implementation.

Compiling for analog neural hardware. The compiler aims to mimic approximable regions of code with neural networks that can be executed on the mixed-signal accelerator. While considering the limitations of the analog hardware, the compiler searches the topology space of the neural networks and selects and trains a neural network to produce outputs comparable to those produced by the original code segment.
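Returning to the four exposed characteristics above, a hardware/software interface of this kind can be packaged as a small descriptor that the compiler's high-level model consumes. The struct below is only an illustration; the field names and the tabulated-sigmoid representation are our assumptions, not the paper's actual interface:

```c
#include <stddef.h>

/* Illustrative descriptor of the analog circuit characteristics exposed
 * to the compiler (used during topology search and training, Section 5). */
typedef struct {
    int input_bits;    /* (1) precision of input/output encoding, e.g. 8   */
    int weight_bits;   /* (2) precision of weight encoding, e.g. 8         */
    int max_inputs;    /* (4) maximum synapses per analog neuron, e.g. 8   */
    /* (3) measured, non-ideal sigmoid: sampled at the circuit level and
     * interpolated during training instead of an ideal sigmoid. */
    const float *sigmoid_samples;
    size_t       num_sigmoid_samples;
    float        sigmoid_min_x, sigmoid_max_x;  /* sampled input range */
} anpu_model_t;
```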


Figure 1: Framework for using limited-precision analog computation to accelerate code written in conventional languages. [The figure shows the three-phase flow: Programming (the programmer produces annotated source code), Compilation (a profiling path collects training data; a custom training algorithm for the limited-precision analog accelerator, informed by the A-NPU circuit design and its high-level model, produces a trained neural network; code generation emits an instrumented binary and an accelerator configuration), and Acceleration (the core invokes the A-NPU).]

1) Profile-driven training data collection. During a profiling stage, the compiler runs the application with representative profiling inputs and collects the inputs and outputs to the approximable code region. This step provides the training data for the rest of the compilation workflow (an illustrative instrumentation sketch appears after step 3).

2) Training for a limited-precision A-NPU. This stage is where our compilation workflow significantly deviates from the framework presented in [18] that targets digital NPUs. The compiler uses the collected training data to train a multi-layer perceptron neural network, choosing a network topology, i.e., the number of neurons and their connectivity, and taking a gradient descent approach to find the synaptic weights of the network while minimizing the error with respect to the training data. This compilation stage performs a neural topology search to find the smallest neural network that (a) adheres to the organization of the analog circuit and (b) delivers acceptable accuracy at the application level. The network training algorithm, which finds optimal synaptic weights, uses a combination of a resilient back propagation algorithm, RPROP [30], which we found to outperform traditional backpropagation for restricted network topologies, and a continuous-discrete learning method, CDLM [10], which attempts to correct for error due to limited-precision computation. Section 5 describes these techniques that address analog limitations.

3) Code generation for hybrid analog-digital execution. Similar to prior work [18], in the code generation phase, the compiler replaces each instance of the original program code with code that initiates a computation on the analog neural accelerator. Similar ISA extensions are used to specify the neural network topology, send input and weight values to the A-NPU, and retrieve computed outputs from the A-NPU.
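To illustrate step (1) above, the instrumented binary might simply log every input/output pair observed at the boundary of the candidate function during the profiling runs. The sketch below is hypothetical; the paper does not specify the logging mechanism or format:

```c
#include <stdio.h>

/* Hypothetical instrumentation emitted around an approximable function:
 * each profiling-time invocation appends one training record consisting
 * of the inputs followed by the outputs. */
static FILE *trace;

void log_training_record(const float *in, int n_in,
                         const float *out, int n_out)
{
    if (!trace)
        trace = fopen("candidate.train", "w");  /* assumed file name */
    for (int i = 0; i < n_in; i++)
        fprintf(trace, "%f ", in[i]);
    for (int i = 0; i < n_out; i++)
        fprintf(trace, "%f ", out[i]);
    fprintf(trace, "\n");
}
```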

3. Analog Circuits for Neural Computation

This section describes how analog circuits can perform the computation of neurons in multi-layer perceptrons, which are widely used neural networks. We also discuss, at a high level, how limitations of the analog circuits manifest in the computation. We explain how these restrictions are exposed to the compilation framework. The next section presents a concrete design for the analog neural accelerator.

As Figure 2a illustrates, each neuron in a multi-layer perceptron takes in a set of inputs (x_i) and performs a weighted sum of those input values (∑_i x_i w_i). The weights (w_i) are the result of training the neural network on the collected training data. After the summation stage, which produces a linear combination of the weighted inputs, the neuron applies a nonlinearity function, sigmoid, to the result of the summation.

Figure 2: One neuron and its conceptual analog circuit. [Part (a) shows a neuron with inputs x_0 ... x_n, weights w_0 ... w_n, and output y = sigmoid(∑_i x_i w_i). Part (b) shows the conceptual analog implementation: DACs convert the digital inputs to currents I(x_i), resistor ladders R(w_i) scale them, the products I(x_i)R(w_i) are converted back to currents and summed, and an ADC produces y ≈ sigmoid(∑_i I(x_i)R(w_i)).]

Figure 2b depicts a conceptual analog circuit that performs the three necessary operations of a neuron: (1) scaling inputs by weights (x_i w_i), (2) summing the scaled inputs (∑_i x_i w_i), and (3) applying the nonlinearity function (sigmoid). This conceptual design first encodes the digital inputs (x_i) as analog current levels (I(x_i)). Then, these current levels pass through a set of variable resistances whose values (R(w_i)) are set proportional to the corresponding weights (w_i). The voltage level at the output of each resistance, I(x_i)R(w_i), is proportional to x_i w_i. These voltages are then converted to currents that can be summed quickly according to Kirchhoff's current law (KCL). Analog circuits only operate linearly within a small range of voltage and current levels, outside of which the transistors enter saturation mode with I-V characteristics similar in shape to a non-linear sigmoid function. Thus, at a high level, the non-linearity is naturally applied to the result of the summation when the final voltage reaches the analog-to-digital converter (ADC). Compared to a digital implementation of a neuron, which requires multipliers, adder trees, and sigmoid lookup tables, the analog implementation leverages the physical properties of the circuit elements and is orders of magnitude more efficient. However, it operates in limited ranges and therefore offers limited precision.

Analog-digital boundaries. The conceptual design in Figure 2b draws the analog-digital boundary at the level of an algorithmic neuron. As we will discuss, the analog neural accelerator will be a composition of these analog neural units (ANUs). However, an alternative design, primarily optimizing for efficiency, may lay out the entirety of a neural network with only analog components, limiting the D-to-A and A-to-D conversions to the inputs and outputs of the neural network and not the individual neurons.

Page 4: General-Purpose Code Acceleration with Limited-Precision ...hadi/doc/paper/2014-isca-anpu.pdfAppears in the Proceedings of the 41st International Symposium on Computer Architecture,

The overhead of conversions in the ANUs significantly limits the potential efficiency gains of an analog approach to neural computation. However, there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, limiting potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff to achieve acceptable accuracy for code acceleration, while delivering higher performance and efficiency when analog neural circuits are used for general-purpose code acceleration.

As prior work [18] has shown and we corroborate, regions of code from different applications require different topologies of neural networks. While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, as the optimal network topology varies with the application. Additionally, routing analog signals among neurons and the limited capability of analog circuits for buffering signals negatively impact accuracy and make the circuit susceptible to noise. In order to provide additional flexibility, we set the digital-analog boundary in conjunction with an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights are converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. ANUs can be composed, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.

One of the most prevalent limitations in analog design is the bounded range of currents and voltages within which the circuits can operate effectively. These range limitations restrict the bit widths of input and weight values and the network topologies that can be computed accurately and efficiently. We expose these limitations to the compiler, and our custom training algorithm and compilation workflow consider these restrictions when searching for optimal network topologies and training neural networks. As we will show, one of the insights from this work is that even with limited bit width (≤ 8) and a restricted neural topology, many general-purpose approximate applications achieve acceptable accuracy and significantly benefit from mixed-signal neural acceleration.

Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit width of inputs and weights. Increasing the number of bits results in an exponential increase in ADC and DAC energy dissipation and can significantly limit the benefits of analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities finer than practical ADCs can handle. In addition, the fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when more bits are used for weights and inputs. We design our ANUs such that the granularity of the voltage and current levels used for information encoding is to a large degree robust to variations and noise.

Topology restrictions. Another important design choice is the number of inputs to the ANU. Similar to bit width, increasing the number of ANU inputs translates to encoding a larger value range in a bounded voltage and current range, which, as discussed, becomes impractical. There is a tradeoff between accuracy and efficiency in choosing the number of ANU inputs. The larger the number of inputs, the larger the number of multiply and add operations that can be done in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. A digital implementation does not impose such restrictions on the number of inputs to the hardware neuron, and it can potentially compute arbitrary topologies of neural networks. However, this unique ANU limitation restricts the topology of the neural networks that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.

Non-ideal sigmoid. The saturation behavior of the analog circuit that leads to sigmoid-like behavior after the summation stage represents an approximation of the ideal sigmoid. We measure this behavior at the circuit level and expose it to the compiler and the training algorithm.
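Taken together, the three restrictions above (8-bit inputs and weights, at most eight inputs per neuron, and a measured rather than ideal sigmoid) can be approximated in software as a behavioral model of one analog neuron. The sketch below is our illustration of such a model, not the paper's implementation; it assumes the 8 bits include the sign and uses tanh() as a stand-in for the measured sigmoid curve:

```c
#include <math.h>

#define MAX_INPUTS 8     /* topology restriction: at most 8 synapses/neuron */
#define LEVELS     127   /* 8-bit sign-magnitude: sign + 7-bit magnitude (assumed split) */

/* Quantize a value in [-1, 1] to one of the discrete levels the DAC and
 * resistor ladder can represent, then map it back to a real number. */
static float quantize(float v)
{
    if (v > 1.0f)  v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    return roundf(v * LEVELS) / LEVELS;
}

/* Behavioral model of one ANU: y ~= sigmoid(sum_i w_i * x_i) under limited
 * precision. A fuller model would also quantize the output to the ADC's
 * 8 bits and replace tanhf() with the sigmoid measured from the circuit. */
float anu_model(const float *x, const float *w, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n && i < MAX_INPUTS; i++)
        sum += quantize(w[i]) * quantize(x[i]);
    return tanhf(sum);   /* stand-in for the measured non-ideal sigmoid */
}
```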

4. Mixed-Signal Neural Accelerator (A-NPU)

This section describes a concrete ANU design and the mixed-signal neural accelerator, A-NPU.

4.1. ANU Circuit Design

Figure 3 illustrates the design of a single analog neuron (ANU). The ANU performs the computation of one neuron, or y ≈ sigmoid(∑_i w_i x_i). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form. In the figure, s_wi and s_xi represent the sign bits, and w_i and x_i represent the magnitudes. Digital input values are converted to the analog domain through current-steering DACs that translate digital values to analog currents; current-steering DACs are used for their speed and simplicity.


Figure 3: A single analog neuron (ANU). [The figure shows the ANU datapath: current-steering DACs, resistor ladders, differential pairs, current summation, a differential amplifier, and a flash ADC that produces y ≈ sigmoid(∑_i w_i x_i).]

In Figure 3, I(|x_i|) is the analog current that represents the magnitude of the input value x_i. Digital weight values control resistor-string ladders that create a variable resistance depending on the magnitude of each weight, R(|w_i|). We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches. The digital weight bits control the switches, adjusting the effective resistance, R(|w_i|), seen by the input current, I(|x_i|). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage: V(|w_i x_i|) = I(|x_i|) × R(|w_i|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits. This resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.

V(|w_i x_i|), as well as the XOR of the weight and input sign bits, feeds a differential pair that converts the voltage to two differential currents, I^+(w_i x_i) and I^-(w_i x_i), that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|w_i x_i|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I^+(w_i x_i) = I_bias/2 + ΔI and I^-(w_i x_i) = I_bias/2 − ΔI. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs, and consequently a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value to a single wire to be efficiently summed according to Kirchhoff's current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages.

Resistors convert the resulting pair of differential currents to voltages, V^+(∑_i w_i x_i) and V^-(∑_i w_i x_i), that represent the weighted sum of the inputs to the ANU. These voltages are used as input to an additional amplification stage (implemented as a current-mode differential amplifier with a diode-connected load). The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function. Our experiments show that neural networks are sensitive to the steepness of this non-linear function, losing accuracy with shallower non-linear activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain, as a small range of interest must be mapped to a much larger output range in accordance with ADC input range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain, but also allowing for saturation with inputs outside of this range.

Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs. [The figure shows eight-wide analog neurons, each with a dedicated weight buffer, fed from an input FIFO through a row selector and writing through a column selector to an output FIFO; a config FIFO holds the preloaded schedule.]

The amplified voltage is used as input to an ADC that converts the analog voltage to a digital value. We chose a flash ADC design (named for its speed), which consists of a set of reference voltages and comparators [1, 31]. The ADC requires 2^n comparators, where n is the number of digital output bits. Flash ADC designs are capable of converting 8 bits at a frequency on the order of one GHz. We require 2–3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages to capture the behavior of a sigmoid function, which also improves the accuracy of the analog sigmoid.
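As an illustration of how sigmoid-shaped references could be chosen, suppose the amplified sum is mapped to a normalized input x and we want the n-bit output code k to approximate 2^n·sigmoid(x); the k-th comparator threshold then sits at the inverse sigmoid (logit) of k/2^n. The paper does not give its exact reference-generation procedure or voltage scaling, so the sketch below is only an assumed construction that reproduces the qualitative behavior described above:

```c
#include <math.h>

/* Fill refs[1 .. (1<<n_bits)-1] with comparator thresholds (in normalized
 * input units) so that a flash ADC's output code approximates a logistic
 * sigmoid of its input: code(x) ~= 2^n * 1/(1 + exp(-x)). */
void sigmoid_adc_references(double *refs, int n_bits)
{
    int levels = 1 << n_bits;            /* e.g. 256 for an 8-bit ADC       */
    for (int k = 1; k < levels; k++) {
        double p = (double)k / levels;   /* code boundary k/2^n             */
        refs[k] = log(p / (1.0 - p));    /* logit(p): thresholds are densest
                                            where the sigmoid is steepest   */
    }
}
```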

4.2. Reconfigurable Mixed-Signal A-NPU

We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural topologies, since each application requires a different topology. Figure 4 illustrates the A-NPU design with some details omitted for clarity. The figure shows four ANUs, while the actual design has eight. The A-NPU is a time-multiplexed architecture in which the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm, which is loaded to the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer.


The Input Buffer always contains the inputs to the neurons, either coming from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs. When all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the input buffer will be fed to the ANUs. The outputs of the ANUs are written to the single-entry output buffer. The Column Selector determines which column of the output buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures.
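A simplified software model of this layer-by-layer, time-multiplexed execution is sketched below. It is an illustration under stated assumptions: each layer is assumed to fit in a single pass over the ANUs, the dense weights[layer][neuron][input] layout is ours, and anu_model() is the behavioral neuron model sketched at the end of Section 3.

```c
#include <string.h>

#define MAX_WIDTH 8   /* assume <= 8 neurons and <= 8 inputs per layer here */

float anu_model(const float *x, const float *w, int n);  /* Section 3 sketch */

/* One A-NPU invocation. In the real design, layers wider than the eight
 * physical ANUs are time-multiplexed by the preloaded schedule. */
void anpu_invoke(const float *net_in, int n_in,
                 int n_layers, const int *layer_size,
                 float weights[][MAX_WIDTH][MAX_WIDTH],
                 float *net_out)
{
    float input_buf[MAX_WIDTH], output_buf[MAX_WIDTH];
    memcpy(input_buf, net_in, n_in * sizeof(float));      /* Input FIFO -> Input Buffer  */

    for (int l = 0; l < n_layers; l++) {
        for (int j = 0; j < layer_size[l]; j++)           /* column select per neuron    */
            output_buf[j] = anu_model(input_buf, weights[l][j], n_in);
        memcpy(input_buf, output_buf,                     /* push results back as inputs */
               layer_size[l] * sizeof(float));
        n_in = layer_size[l];
    }
    memcpy(net_out, output_buf, n_in * sizeof(float));    /* Output Buffer -> Output FIFO */
}
```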

Each ANU has eight inputs. As shown in Figure 4, each ANU is augmented with a dedicated weight buffer, storing the 8-bit weights. The weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights to the ANUs based on the pre-loaded order.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that consists of the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons of the neural network, in which the neurons will be computed on the ANUs. The scheduler then takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs; (2) assign an order to the neurons; (3) assign an order to the weights; (4) generate the order in which inputs feed the ANUs; (5) generate the order in which the outputs are written to the Output Buffer. The scheduler also assigns a unique order for the inputs and outputs of the neural network in which the core communicates data with the A-NPU. A simplified sketch of such a scheduling pass appears below.
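The sketch below illustrates only the skeleton of such a compile-time pass (steps 1–3): neurons are assigned to ANUs round-robin, layer by layer, and the resulting entry order doubles as the weight-preload order. The data layout is our assumption; the real scheduler also emits the input-feed and output-writeback orders of steps 4 and 5.

```c
#define NUM_ANUS 8

/* One schedule entry, loaded into the Config FIFO before invocation. */
typedef struct { int layer, neuron, anu, pass; } sched_entry_t;

int schedule(const int *layer_size, int n_layers, sched_entry_t *out)
{
    int n = 0;
    for (int l = 0; l < n_layers; l++)
        for (int j = 0; j < layer_size[l]; j++) {
            out[n].layer  = l;
            out[n].neuron = j;               /* step (2): neuron order          */
            out[n].anu    = j % NUM_ANUS;    /* step (1): ANU assignment        */
            out[n].pass   = j / NUM_ANUS;    /* time-multiplexing round         */
            n++;                             /* entry index also fixes the
                                                weight-preload order, step (3)  */
        }
    return n;   /* number of schedule entries */
}
```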

4.3. Architectural Interface for A-NPU

We adopt the same FIFO-based architectural interface through which a digital NPU communicates with the processor [18]. The A-NPU is tightly integrated into the processor pipeline. The processor communicates with the ANUs only through the Input, Output, and Config FIFOs. The processor ISA is extended with special instructions that can enqueue and dequeue data from these FIFOs, as shown in Figure 4. When a data value is queued/dequeued to/from the Input/Output FIFO, the A-NPU converts the value to the appropriate representation for the A-NPU/processor.
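At the program level, the generated code wraps each A-NPU invocation in these enqueue/dequeue instructions. A hypothetical intrinsic-level rendering is sketched below; the intrinsic names are ours, not the actual ISA mnemonics, and the wrapped function is the illustrative rgb_to_luma candidate from Section 2:

```c
/* Hypothetical compiler intrinsics for the ISA extensions that move data
 * through the Input/Output FIFOs; the hardware converts between the
 * processor's representation and the A-NPU's 8-bit encoding. */
extern void  __anpu_enq_input(float v);   /* enqueue one input element  */
extern float __anpu_deq_output(void);     /* dequeue one output element */

/* Code the compiler might emit in place of a call to the original
 * approximable function rgb_to_luma(r, g, b, &luma). */
static inline float rgb_to_luma_anpu(float r, float g, float b)
{
    __anpu_enq_input(r);
    __anpu_enq_input(g);
    __anpu_enq_input(b);          /* the A-NPU starts once all inputs arrive */
    return __anpu_deq_output();   /* blocks until the result is ready        */
}
```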

5. Compilation for Analog Acceleration

As Figure 1 illustrates, the compilation for A-NPU execution consists of three stages: (1) profile-driven data collection, (2) training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven data collection stage, the compiler instruments the application to collect the inputs and outputs of approximable functions. The compiler then runs the application with representative inputs and collects the inputs and their corresponding outputs. These input-output pairs constitute the training data. Section 4 briefly discussed ISA extensions and code generation.

While compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [18], the training phase is unique to an analog approach, accounting for analog-imposed topology restrictions and adjusting weight selection to account for limited-precision computation.

Hardware/software interface for exposing analog circuits to the compiler. As we discussed in Section 3, we expose the following analog circuit restrictions to the compiler through a hardware/software interface that captures the following circuit characteristics: (1) input bit-width limitations, (2) weight bit-width limitations, (3) the limited number of inputs to each analog neuron (topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to an analog implementation.

Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptron neural networks use a gradient descent approach to minimize the average network error, over a set of training input-output pairs, by backpropagating the output error through the network and iteratively adjusting the weight values to minimize that error. Traditional training techniques that do not consider limited-precision inputs, weights, and outputs, however, perform poorly when these values are saturated to adhere to the bit-width requirements that are feasible for an implementation in the analog domain. Simply limiting weight values during training is also detrimental to achieving quality outputs because the algorithm does not have sufficient precision to converge to a quality solution.

To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [10]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM first trains a fully-precise network according to a standard training algorithm, such as backpropagation [43]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification. The algorithm calculates the new error and backpropagates that error through the fully-precise network using full-precision computation, then updates the weight values according to the algorithm also used in the first phase. This process repeats, backpropagating the "discrete" errors through a precise network. The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights. We customize this algorithm by incorporating the input bit-width limitation and the output bit-width limitation in addition to limited weight values. Additionally, this training scheme is advantageous for an analog implementation because it is general enough to also compensate for errors that arise due to an analog implementation, such as a non-ideal sigmoid function and any other analog non-ideality that behaves consistently. In essence, after one round of full-precision training, the compiler models an analog-like version of the network. A second, CDLM-based training pass adjusts for these analog-imposed errors, enabling the inaccurate and limited A-NPU as an option for a beneficial NPU implementation by maintaining acceptable accuracy and generality.
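As a concrete illustration of the second-phase idea, the sketch below applies one CDLM-style update to a single sigmoid neuron: the error is measured on the discretized (8-bit) forward pass, but the gradient step is applied to the full-precision weights. This is a deliberate simplification of ours; the actual flow backpropagates the discrete errors through a full multi-layer network and uses RPROP rather than plain gradient descent as the base update rule.

```c
#include <math.h>

#define N_IN   8
#define LEVELS 127   /* 8-bit sign-magnitude model (assumed split) */

static float quantize(float v)
{
    if (v > 1.0f)  v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    return roundf(v * LEVELS) / LEVELS;
}

static float sigmoid(float z) { return 1.0f / (1.0f + expf(-z)); }

/* One CDLM-style update for a single neuron with full-precision weights w.
 * Phase 1 (standard full-precision training) is assumed to have run already. */
void cdlm_step(float *w, const float *x, float target, float lr)
{
    float net_d = 0.0f, net_f = 0.0f;
    for (int i = 0; i < N_IN; i++) {
        net_d += quantize(w[i]) * quantize(x[i]);  /* discretized forward pass */
        net_f += w[i] * x[i];                      /* full-precision path      */
    }
    float err   = sigmoid(net_d) - target;         /* error of the discrete net */
    float delta = err * sigmoid(net_f) * (1.0f - sigmoid(net_f));
    for (int i = 0; i < N_IN; i++)
        w[i] -= lr * delta * x[i];                 /* update full-precision weights */
}
```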


Training with topology restrictions. In addition to determining weight values for a given network topology, the compiler searches the space of possible topologies to find an optimal network for a given approximable code region. Conventional multi-layer perceptron networks are fully connected, i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases, the network cannot be fully connected.

We impose the circuit restriction on the connectivity between the neurons during the topology search, and we use a simple algorithm guided by the mean-squared error of the network to determine the best topology given the exposed restriction. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set (70% of the data) and a test set (the remaining 30%). The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers. We also limit the number of neurons per hidden layer to powers of two, up to 32. The numbers of neurons in the input and output layers are predetermined based on the number of inputs and outputs in the candidate function.
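The search itself can be sketched as a simple enumeration over this restricted space. In the sketch below, train_and_evaluate() and estimate_latency() stand in for the per-topology training runs and the A-NPU latency model; both helpers, and the exact tie-breaking rule, are assumptions of this illustration:

```c
#include <float.h>

/* Candidate hidden-layer sizes: powers of two up to 32 (0 = layer absent).
 * Input and output sizes are fixed by the candidate function's signature. */
static const int hidden_sizes[] = { 0, 2, 4, 8, 16, 32 };

float train_and_evaluate(int h1, int h2);   /* test-set error (assumed helper) */
int   estimate_latency(int h1, int h2);     /* A-NPU cycles   (assumed helper) */

void topology_search(int *best_h1, int *best_h2)
{
    float best_err = FLT_MAX;
    int   best_lat = 0;
    for (int i = 0; i < 6; i++)
        for (int j = 0; j < 6; j++) {
            int h1 = hidden_sizes[i], h2 = hidden_sizes[j];
            if (h1 == 0 && h2 != 0) continue;      /* second layer needs a first */
            float err = train_and_evaluate(h1, h2);
            int   lat = estimate_latency(h1, h2);
            /* Prefer accuracy; break ties by lower A-NPU latency. */
            if (err < best_err || (err == best_err && lat < best_lat)) {
                best_err = err; best_lat = lat;
                *best_h1 = h1; *best_h2 = h2;
            }
        }
}
```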

To further improve accuracy and compensate for topology-restricted networks, we utilize a Resilient Back Propagation (RPROP) [30] training algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values based on the magnitude of the backpropagated error (as in conventional backpropagation [43]), the RPROP algorithm increases or decreases each weight by a predefined step based on the sign of the error gradient. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half the number of training epochs to converge on a quality solution. The main advantage of applying RPROP training in an analog approach to neural computing is its robustness to the sigmoid function and topology restrictions imposed by the analog design. Backpropagation, for example, is extremely sensitive to the steepness of the sigmoid function, and allowing for a variety of steepness levels in a fixed, analog implementation is challenging. Additionally, backpropagation performs poorly with a shallow sigmoid function. The requirement of a steep sigmoid function exacerbates analog range challenges, possibly making the implementation infeasible. RPROP tolerates a shallower sigmoid activation steepness and performs consistently using a constant activation steepness over all applications. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm.
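For reference, the per-weight RPROP update rule [30] that serves as the base training algorithm can be summarized as follows. The increase/decrease factors (1.2 and 0.5) and step-size bounds are the commonly used defaults and are assumptions of this sketch, not values reported in the paper:

```c
#include <math.h>

#define ETA_PLUS  1.2f   /* grow the step when the gradient keeps its sign */
#define ETA_MINUS 0.5f   /* shrink the step when the gradient changes sign */
#define STEP_MAX  50.0f
#define STEP_MIN  1e-6f

/* One RPROP update for a single weight (the variant without weight
 * backtracking). 'step' and 'prev_grad' persist per weight across epochs;
 * only the sign of the gradient is used, not its magnitude. */
void rprop_update(float *w, float grad, float *step, float *prev_grad)
{
    float s = (*prev_grad) * grad;
    if (s > 0.0f) {                       /* same direction: take a larger step */
        *step = fminf(*step * ETA_PLUS, STEP_MAX);
        *w -= copysignf(*step, grad);
        *prev_grad = grad;
    } else if (s < 0.0f) {                /* sign change: overshot, back off    */
        *step = fmaxf(*step * ETA_MINUS, STEP_MIN);
        *prev_grad = 0.0f;                /* suppress an update this epoch      */
    } else {                              /* first epoch or after a sign change */
        if (grad != 0.0f)
            *w -= copysignf(*step, grad);
        *prev_grad = grad;
    }
}
```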

6. Evaluations

Cycle-accurate simulation and energy modeling. We use the MARSSx86 x86-64 cycle-accurate simulator [39] to model the performance of the processor. The processor is modeled after a single-core Intel Nehalem to evaluate the performance benefits of A-NPU acceleration over an aggressive out-of-order architecture (see footnote 1). We extended the simulator to include ISA-level support for A-NPU queue and dequeue instructions. We also augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [18]. We use GCC v4.7.3 with -O3 to enable compiler optimization. The baseline in our experiments is the benchmark run solely on the processor without neural transformation. We use McPAT [33] for processor energy estimations. We model the energy of an 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [37], and [22]. Both the D-NPU and the processor operate at 3.4 GHz at 0.9 V, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.

Circuit design for the ANU. We built a detailed transistor-level SPICE model of the analog neuron, the ANU. We designed and simulated the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [6]. We ran detailed Spectre SPICE simulations to understand circuit behavior and measure ANU energy consumption. We used CACTI to estimate the energy of the A-NPU buffers. Evaluations consider all A-NPU components, both digital and analog. For the analog parts, we used direct measurements from the transistor-level SPICE simulations. For SRAM accesses, we used CACTI. We built an A-NPU cycle-accurate simulator to evaluate the performance improvements. Similar to McPAT, we combined simulation statistics with measurements from SPICE and CACTI to calculate A-NPU energy. To avoid biasing our study toward analog designs, all energy and performance comparisons are to an 8-bit, fixed-point D-NPU (8-bit inputs/weights/multiply-adders). For consistency with the available McPAT model for the baseline processor, we used McPAT and CACTI to estimate D-NPU energy.

Footnote 1: Processor: Fetch/Issue Width: 4/5, INT ALUs/FPUs: 6/6, Load/Store FUs: 1/1, ROB Entries: 128, Issue Queue Entries: 36, INT/FP Physical Registers: 256/256, Branch Predictor: Tournament 48 KB, BTB Sets/Ways: 1024/4, RAS Entries: 64, Load/Store Queue Entries: 48/48, Dependence Predictor: 4096-entry Bloom Filter, ITLB/DTLB Entries: 128/256. L1: 32 KB Instruction, 32 KB Data, Line Width: 64 bytes, 8-Way, Latency: 3 cycles. L2: 256 KB, Line Width: 64 bytes, 8-Way, Latency: 6 cycles. L3: 2 MB, Line Width: 64 bytes, 16-Way, Latency: 27 cycles. Memory Latency: 50 ns.


Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network.

blackscholes: Mathematical model of a financial market (Financial Analysis). Function calls: 5; loops: 0; ifs/elses: 5; x86-64 instructions: 309. Evaluation input set: 4096 data points from PARSEC; training input set: 16384 data points from PARSEC. Neural network topology: 6 -> 8 -> 8 -> 1. Fully digital NN MSE: 0.000011; analog NN MSE (8-bit): 0.00228. Error metric: avg. relative error. Fully digital error: 6.02%; analog error: 10.2%.

fft: Radix-2 Cooley-Tukey fast Fourier transform (Signal Processing). Function calls: 2; loops: 0; ifs/elses: 0; x86-64 instructions: 34. Evaluation input set: 2048 random floating-point numbers; training input set: 32768 random floating-point numbers. Neural network topology: 1 -> 4 -> 4 -> 2. Fully digital NN MSE: 0.00002; analog NN MSE (8-bit): 0.00194. Error metric: avg. relative error. Fully digital error: 2.75%; analog error: 4.1%.

inversek2j: Inverse kinematics for a 2-joint arm (Robotics). Function calls: 4; loops: 0; ifs/elses: 0; x86-64 instructions: 100. Evaluation input set: 10000 (x, y) random coordinates; training input set: 10000 (x, y) random coordinates. Neural network topology: 2 -> 8 -> 2. Fully digital NN MSE: 0.000341; analog NN MSE (8-bit): 0.00467. Error metric: avg. relative error. Fully digital error: 6.2%; analog error: 9.4%.

jmeint: Triangle intersection detection (3D Gaming). Function calls: 32; loops: 0; ifs/elses: 23; x86-64 instructions: 1,079. Evaluation input set: 10000 random pairs of 3D triangle coordinates; training input set: 10000 random pairs of 3D triangle coordinates. Neural network topology: 18 -> 32 -> 8 -> 2. Fully digital NN MSE: 0.05235; analog NN MSE (8-bit): 0.06729. Error metric: miss rate. Fully digital error: 17.68%; analog error: 19.7%.

jpeg: JPEG encoding (Compression). Function calls: 3; loops: 4; ifs/elses: 0; x86-64 instructions: 1,257. Evaluation input set: 220x200-pixel color image; training input set: three 512x512-pixel color images. Neural network topology: 64 -> 16 -> 8 -> 64. Fully digital NN MSE: 0.0000156; analog NN MSE (8-bit): 0.0000325. Error metric: image diff. Fully digital error: 5.48%; analog error: 8.4%.

kmeans: K-means clustering (Machine Learning). Function calls: 1; loops: 0; ifs/elses: 0; x86-64 instructions: 26. Evaluation input set: 220x200-pixel color image; training input set: 50000 pairs of random (r, g, b) values. Neural network topology: 6 -> 8 -> 4 -> 1. Fully digital NN MSE: 0.00752; analog NN MSE (8-bit): 0.009589. Error metric: image diff. Fully digital error: 3.21%; analog error: 7.3%.

sobel: Sobel edge detector (Image Processing). Function calls: 3; loops: 2; ifs/elses: 1; x86-64 instructions: 88. Evaluation input set: 220x200-pixel color image; training input set: one 512x512-pixel color image. Neural network topology: 9 -> 8 -> 1. Fully digital NN MSE: 0.000782; analog NN MSE (8-bit): 0.00405. Error metric: image diff. Fully digital error: 3.89%; analog error: 5.2%.

Table 2: Area estimates for the analog neuron (ANU).

Sub-circuit: Area
8× 8-bit DAC: 3,096 T*
8× Resistor Ladder (8-bit weights): 4,096 T + 1 kΩ (≈ 450 T)
8× Differential Pair: 48 T
I-to-V Resistors: 20 kΩ (≈ 30 T)
Differential Amplifier: 244 T
8-bit ADC: 2,550 T + 1 kΩ (≈ 450 T)
Total: ≈ 10,964 T
* Transistor with width/length = 1

Even though we do not have a fabrication-ready layout for the design, Table 2 provides an estimate of the ANU area in terms of the number of transistors. T denotes a transistor with width/length = 1. As shown, each ANU (which performs eight 8-bit analog multiply-adds in parallel followed by a sigmoid) requires about 10,964 transistors. An equivalent digital neuron that performs eight 8-bit multiply-adds and a sigmoid would require about 72,456 T, of which 56,000 T are for the eight 8-bit multiply-adds and 16,456 T for the sigmoid lookup. With the same compute capability, the analog neuron requires 6.6× fewer transistors than its equivalent digital implementation.

Benchmarks. We use the benchmarks in [18] and add one more, blackscholes. These benchmarks represent a diverse set of application domains, including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing. Table 1 summarizes information about each benchmark: application domain, target code, neural-network topology, training/test data, and final application error levels for fully digital neural networks and analog neural networks using our customized RPROP-based CDLM training algorithm. The neural networks were trained using either typical program inputs, such as sample images, or a limited number of random inputs. Accuracy results are reported using an independent data set, e.g., an input image different from the image used during training. Each benchmark requires an application-specific error metric, which is used in our evaluations and is shown in Table 1.

[The remainder of this extract consists of the raw data behind the evaluation figures (axis labels, tick marks, and per-benchmark chart series). The clearly recoverable results are summarized below.]

Per-invocation comparison of the 8-bit D-NPU and the A-NPU (D-NPU cycles / energy vs. A-NPU cycles / energy; cycle and energy improvement with the A-NPU at one third of the digital clock frequency):
blackscholes (6 -> 8 -> 8 -> 1): 45 cycles / 8.15 nJ vs. 6 cycles / 0.86 nJ; 2.50× cycles, 9.53× energy
fft (1 -> 4 -> 4 -> 2): 34 / 2.04 vs. 6 / 0.54; 1.89×, 3.79×
inversek2j (2 -> 8 -> 2): 29 / 2.33 vs. 4 / 0.51; 2.42×, 4.57×
jmeint (18 -> 32 -> 8 -> 2): 141 / 56.26 vs. 12 / 1.99; 3.92×, 28.28×
jpeg (64 -> 16 -> 8 -> 64): 365 / 111.54 vs. 8 / 1.36; 15.21×, 82.21×
kmeans (6 -> 8 -> 4 -> 1): 44 / 5.77 vs. 6 / 0.67; 2.44×, 8.56×
sobel (9 -> 8 -> 1): 33 / 5.49 vs. 4 / 0.46; 2.75×, 11.81×
geomean: 3.33× cycle improvement, 12.14× energy improvement

Whole-application speedup (Core + D-NPU, Core + A-NPU, Core + Ideal NPU):
blackscholes: 14.1, 24.5, 48.0; fft: 1.1, 1.3, 1.6; inversek2j: 8.0, 11.0, 15.0; jmeint: 2.4, 6.3, 14.1; jpeg: 1.6, 1.9, 1.9; kmeans: 0.6, 0.8, 1.2; sobel: 2.5, 3.1, 3.6; geomean: 2.5, 3.8, 5.4

Whole-application energy reduction (Core + D-NPU, Core + A-NPU, Core + Ideal NPU):
blackscholes: 42.6, 51.3, 52.5; fft: 1.7, 1.7, 1.7; inversek2j: 25.9, 30.0, 31.4; jmeint: 7.3, 17.9, 18.9; jpeg: 2.2, 2.4, 2.4; kmeans: 1.2, 1.3, 1.4; sobel: 2.7, 2.8, 2.8; geomean: 5.1, 6.4, 6.5

Percentage of dynamic instructions subsumed by the A-NPU:
blackscholes: 97.2%; fft: 67.4%; inversek2j: 95.9%; jmeint: 95.1%; jpeg: 56.3%; kmeans: 29.7%; sobel: 57.1%

Application-level error:
Floating-point D-NPU: blackscholes 6.0%, fft 2.7%, inversek2j 6.2%, jmeint 17.6%, jpeg 5.4%, kmeans 3.2%, sobel 3.8%
A-NPU with fully precise digital sigmoid: blackscholes 8.4%, fft 3.0%, inversek2j 8.1%, jmeint 18.4%, jpeg 6.6%, kmeans 6.1%, sobel 4.3%, swaptions 2.6%, geomean 6.0%
A-NPU with analog sigmoid: blackscholes 10.2%, fft 4.1%, inversek2j 9.4%, jmeint 19.7%, jpeg 8.4%, kmeans 7.3%, sobel 5.2%, swaptions 3.3%, geomean 7.3%

Sensitivity to the maximum number of incoming synapses per neuron (application-level error, image diff): 4 inputs: 14.32%; 8 inputs: 6.62%; 16 inputs: 5.76%

jpeg 64 -> 16 -> 8 -> 64 365 111.54 8 1.36 15.21 22.81 82.21 73.76

kmeans 6 -> 8 -> 4 -> 1 44 5.77 6 0.67 2.44 3.67 8.56 7.57

sobel 9 -> 8 -> 1 33 5.49 4 0.46 2.75 4.13 11.81 10.45

geomean 61.47 9.63 6.15 0.79 3.33 5.00 12.14 10.79

2.5

9.5

3.7

1.8

2.4

4.5

28.2

3.9

82.2

15.2

8.5

2.4 2.

711

.8

12.1

3.3

Application Speedup-1

Application Topology Core + D-NPU Core + A-NPU Core + Ideal NPU Core + D-NPU Core + A-NPU speedup

blackscholes 6 -> 8 -> 8 -> 1

fft 1 -> 4 -> 4 -> 2

inversek2j 2 -> 8 -> 2

jmeint 18 -> 32 -> 8 -> 2

jpeg 64 -> 16 -> 8 -> 64

kmeans 6 -> 8 -> 4 -> 1

sobel 9 -> 8 -> 1

swaptions 1 -> 16 -> 8 -> 1

geomean

Application Energy Reduction-1

Application Topology Core + D-NPU Core + A-NPU Core + Ideal NPU Core + D-NPU Core + A-NPU

blackscholes 6 -> 8 -> 8 -> 1 42.5954312379609 51.250425520663 52.5015883985599 0.811317000822956 0.976169047145797 0.0238309528542033

fft 1 -> 4 -> 4 -> 2 1.66144638762811 1.70352109148241 1.71911629291664 0.966453749797993 0.990928361566649 0.00907163843335124

inversek2j 2 -> 8 -> 2 25.8726893148605 30.0198258158588 31.432603181923 0.823116340861644 0.955053758739376 0.0449462412606239

jmeint 18 -> 32 -> 8 -> 2 7.32010374044854 17.8930069836166 18.8933415016608 0.387443573165978 0.947053594624526 0.052946405375474

jpeg 64 -> 16 -> 8 -> 64 2.21156760942508 2.39631940662302 2.39878691088084 0.921952508325545 0.998971353292519 0.0010286467074806

kmeans 6 -> 8 -> 4 -> 1 1.15321498752577 1.33727439731608 1.36611738885167 0.844155119418499 0.978886886463078 0.0211131135369225

sobel 9 -> 8 -> 1 2.74676510816413 2.83229403346624 2.84047691805956 0.967008424078502 0.997119186379832 0.00288081362016845

swaptions 1 -> 16 -> 8 -> 1

geomean 5.13306978649764 6.37013417935512 6.51636491720012 0.011817544342672blackscholes G inversek2j jmeint jpeg kmeans sobel

FloaEngBPointB

DHNPU

6.0% 2.7% 6.2% 17.6% 5.4% 3.2% 3.8%

AHNPUB+BIdealB

Sigmoid

8.4% 3.0% 8.1% 18.4% 6.6% 6.1% 4.3%

AHNPU 10.2% 4.1% 9.4% 19.7% 8.4% 7.3% 5.2%

Error

0%

5%

10%

15%

20%

blackscholes 0 inversek2j jmeint jpeg kmeans sobel

FloaEngBPointBDHNPU

AHNPUB+BIdealBSigmoid

AHNPU

9.5

2.5

3.7

1.8

4.5

2.4

28.2

3.9

82.2

15.2

8.5

2.4

11.8

2.7

3.3

12.1

3.3

EnergyBSaving

Speedup

CoreB+BAHNPU

CoreB+BDHNPU

CoreB+BIdealB

CoreB+BAHNPU

CoreB+BDHNPU

CoreB+BIdealB

Figure 5: A-NPU with 8 ANUs vs. D-NPU with 8 PEs.

As shown in Table 1, each application benefits from a different neural network topology, so the ability to reconfigure the A-NPU is critical.

A-NPU vs. 8-bit D-NPU. Figure 5 shows the average energy improvement and speedup for one invocation of an A-NPU over one invocation of an 8-bit D-NPU, where the A-NPU is clocked at 1/3 the D-NPU frequency. On average, the A-NPU is 12.1× more energy efficient and 3.3× faster than the D-NPU. While consuming significantly less energy, the A-NPU can perform 64 multiply-adds in parallel, while the D-NPU can only perform eight. This energy-efficient, parallel computation explains why jpeg, with the largest neural network (64→16→8→64), achieves the highest energy and performance improvements, 82.2× and 15.2×, respectively. The larger the network, the higher the benefit from the A-NPU. Compared to a D-NPU, an A-NPU offers a higher level of parallelism at low energy cost, which can potentially enable using larger neural networks to replace more complicated code.

Whole application speedup and energy savings. Figure 6 shows the whole-application speedup and energy savings when the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 6c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefit from neural acceleration.
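To make the Amdahl's Law relationship concrete, the following is a minimal sketch (illustrative only, not the paper's evaluation infrastructure) of how whole-application speedup follows from the fraction of dynamic instructions subsumed and the speedup of the offloaded region; the numbers in the example are assumptions for illustration.

```python
# Minimal sketch of the Amdahl's-Law argument above (illustrative only; not the
# paper's evaluation infrastructure). The whole-application speedup is bounded by
# the fraction of dynamic instructions the neural transformation subsumes.

def whole_app_speedup(subsumed_fraction: float, region_speedup: float) -> float:
    """Amdahl's Law: the non-subsumed fraction runs at its original speed,
    while the subsumed fraction is accelerated by region_speedup."""
    remaining = 1.0 - subsumed_fraction
    return 1.0 / (remaining + subsumed_fraction / region_speedup)

# High-coverage benchmarks (e.g., ~97% of dynamic instructions subsumed) can
# approach large whole-application gains, while low-coverage ones (~30%) are
# capped well below 1.5x regardless of how fast the accelerator is.
print(round(whole_app_speedup(0.97, 20.0), 1))  # ~12.7
print(round(whole_app_speedup(0.30, 20.0), 2))  # ~1.40
```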


[Chart data embedded behind Figure 6; the recoverable per-benchmark values are reconstructed below.]

(a) Whole application speedup.

Application   Topology              Core + D-NPU  Core + A-NPU  Core + Ideal NPU
blackscholes  6 -> 8 -> 8 -> 1      14.1          24.5          48.0
fft           1 -> 4 -> 4 -> 2      1.1           1.3           1.6
inversek2j    2 -> 8 -> 2           7.9           10.9          14.9
jmeint        18 -> 32 -> 8 -> 2    2.3           6.2           14.0
jpeg          64 -> 16 -> 8 -> 64   1.5           1.8           1.9
kmeans        6 -> 8 -> 4 -> 1      0.5           0.8           1.2
sobel         9 -> 8 -> 1           2.4           3.1           3.6
geomean                             2.5           3.7           5.4

(b) Whole application energy saving.

Application   Topology              Core + D-NPU  Core + A-NPU  Core + Ideal NPU
blackscholes  6 -> 8 -> 8 -> 1      42.5          51.2          52.5
fft           1 -> 4 -> 4 -> 2      1.6           1.7           1.7
inversek2j    2 -> 8 -> 2           25.8          30.0          31.4
jmeint        18 -> 32 -> 8 -> 2    7.3           17.8          18.8
jpeg          64 -> 16 -> 8 -> 64   2.2           2.3           2.3
kmeans        6 -> 8 -> 4 -> 1      1.1           1.3           1.3
sobel         9 -> 8 -> 1           2.7           2.8           2.8
geomean                             5.1           6.3           6.5

(c) % dynamic instructions subsumed.

blackscholes 97.2%, fft 67.4%, inversek2j 95.9%, jmeint 95.1%, jpeg 56.3%, kmeans 29.7%, sobel 57.1%

Figure 6: Whole application speedup and energy saving with D-NPU, A-NPU, and an Ideal NPU that consumes zero energy and takes zero cycles for neural computation.

The geometric-mean speedup and energy savings with an A-NPU are 3.7× and 6.3×, respectively, which is 48% and 24% better than an 8-bit, 8-PE D-NPU. Among the benchmarks, kmeans sees a slowdown with both D-NPU- and A-NPU-based acceleration. All benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×, and the energy savings range from 1.3× to 51.2×. As the results show, the savings with an A-NPU closely follow the ideal case, and, in terms of energy, there is little value in designing a more sophisticated A-NPU. This is because the energy cost of executing instructions in the von Neumann, out-of-order pipeline is much higher than that of performing simple multiply-adds in the analog domain. Using physical laws (Ohm's law for multiplication and Kirchhoff's law for summation) and the analog properties of devices to perform computation can lead to significant energy and performance benefits.
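As a behavioral illustration of this point (an idealized model for exposition, not the paper's circuit design), the weighted sum of an analog neuron can be viewed as currents produced by conductances summing at a node:

```python
# Idealized behavioral model of an analog multiply-accumulate (not the paper's
# circuit): each weight is a conductance G_i, each input a voltage V_i.
# Ohm's law gives a current I_i = G_i * V_i per synapse, and Kirchhoff's
# current law sums the currents at the neuron's input node.
import math

def analog_neuron(voltages, conductances, i_max=1.0):
    # Per-synapse currents (Ohm's law), summed at one node (Kirchhoff's law).
    i_total = sum(g * v for g, v in zip(conductances, voltages))
    # A sigmoid-like, saturating transfer function stands in for the analog
    # sigmoid stage; real circuits deviate from this ideal shape.
    return math.tanh(i_total / i_max)

# Eight "synapses" evaluated in one shot, with no per-MAC instruction cost.
print(analog_neuron([0.1] * 8, [0.5] * 8))
```

The key design point is that the summation is free in the wiring itself, which is why the per-invocation energy of the A-NPU is dominated by conversion and the sigmoid stage rather than by the multiply-adds.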

Application error. Table 3 shows the application-level errors with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and our A-NPU, which incorporates the non-idealities of the analog sigmoid. Except for jmeint, which shows error above 10%, all of the applications show error less than or around 10%. Application average error rates with the A-NPU range from 4.1% to 10.2%.

Table 3: Error with a floating point D-NPU, A-NPU with ideal sigmoid, and A-NPU with non-ideal sigmoid.

Benchmark      Floating-Point D-NPU   A-NPU + Ideal Sigmoid   A-NPU
blackscholes   6.0%                   8.4%                    10.2%
fft            2.7%                   3.0%                    4.1%
inversek2j     6.2%                   8.1%                    9.4%
jmeint         17.6%                  18.4%                   19.7%
jpeg           5.4%                   6.6%                    8.4%
kmeans         3.2%                   6.1%                    7.3%
sobel          3.8%                   4.3%                    5.2%


Figure 7: CDF plot of application output error. A point (x, y) indicates that y% of the output elements see error ≤ x%.

This quality-of-result loss is commensurate with other work on quality trade-offs. Among digital hardware approximation techniques, Truffle [17] and EnerJ [45] show similar error (3–10%) for some applications and much greater error (above 80%) for others in a moderate configuration. Green [3] has error rates below 1% for some applications but greater than 20% for others. A case study [36] explores manual optimizations of the x264 video encoder that trade off 0.5–10% quality loss. As expected, the quality-of-results degradation with an A-NPU is greater than with a floating-point D-NPU. However, the quality losses are commensurate with digital approximate computing techniques.

To study the application-level quality loss in more detail, Figure 7 illustrates the CDF (cumulative distribution function) of the final error for each element of the application's output. Each benchmark's output consists of a collection of elements: an image consists of pixels, a vector consists of scalars, and so on. The CDF reveals the distribution of error among an application's output elements and shows that only a small fraction of the output elements see large quality loss with analog acceleration. The majority (80% to 100%) of each application's output elements have error less than 10%, except for jmeint.
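The sketch below (assumed post-processing, not the paper's measurement scripts) shows how such a per-element error CDF can be computed from a precise output and an approximate output; the array names and threshold grid are assumptions for illustration.

```python
# Minimal sketch of building a per-element error CDF: compute the relative error
# of every output element, then report what fraction of elements falls under
# each error threshold. A point (x, y) on the resulting curve means y of the
# elements have error no larger than x.
import numpy as np

def error_cdf(exact: np.ndarray, approx: np.ndarray, thresholds=np.linspace(0, 1, 101)):
    """Return (thresholds, fraction of output elements with relative error <= threshold)."""
    denom = np.maximum(np.abs(exact), 1e-12)           # avoid division by zero
    rel_err = np.abs(approx - exact) / denom           # per-element relative error
    cdf = np.array([(rel_err <= t).mean() for t in thresholds])
    return thresholds, cdf

# Example with synthetic data standing in for a precise and an approximate output.
exact = np.random.rand(10_000)
approx = exact + np.random.normal(scale=0.02, size=exact.shape)
xs, ys = error_cdf(exact, approx)
print(float(ys[10]))  # fraction of elements with <= 10% relative error
```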

Exposing circuit limitations to the compiler. Figure 8 shows the effect of bit-width restrictions on application-level error, assuming 8 inputs per neuron. As the results suggest, exposing the bit-width limitations and the topology restrictions to the compiler enables our RPROP-based, customized CDLM training algorithm to find and train neural networks that achieve accuracy levels commensurate with the digital approximation techniques, using only eight bits of precision for inputs, outputs, and weights, and eight inputs to the analog neurons. Several applications show less than 10% error even with fewer than eight bits. The results show that there are many applications that can significantly benefit from analog acceleration.
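The following is a minimal sketch of evaluating one neuron under these exposed limits (8-bit values and weights, at most eight incoming synapses). It models the constraints the compiler exposes to training; it is an assumption-level illustration, not the paper's RPROP/CDLM implementation, and the value range [-1, 1] is assumed.

```python
# Minimal sketch of a neuron evaluated under the exposed circuit limits:
# 8-bit inputs, weights, and outputs, and at most 8 incoming synapses.
import math

LEVELS = 2 ** 8  # 8-bit precision

def quantize(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clamp x to [lo, hi] and snap it to one of 256 evenly spaced levels."""
    x = min(max(x, lo), hi)
    step = (hi - lo) / (LEVELS - 1)
    return lo + round((x - lo) / step) * step

def limited_precision_neuron(inputs, weights):
    assert len(inputs) <= 8 and len(weights) <= 8, "topology limit: 8 inputs per neuron"
    acc = sum(quantize(w) * quantize(x) for w, x in zip(weights, inputs))
    return quantize(math.tanh(acc))  # sigmoid-like output, re-quantized to 8 bits

print(limited_precision_neuron([0.25, -0.5, 0.75], [0.1, 0.2, -0.3]))
```

Training against this quantized forward path, rather than against full-precision arithmetic, is what lets the compiler find weights that remain accurate once mapped onto the limited-precision analog hardware.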

Figure 8: Application error with limited bit-width analog neural computation.

Application speedup (columns give the fraction of invocations offloaded to the A-NPU):

Application     Topology                80%      85%      90%      95%      100%
blackscholes    6 -> 8 -> 8 -> 1        4.30     5.42     7.32     11.27    24.52
fft             1 -> 4 -> 4 -> 2        1.24     1.26     1.28     1.30     1.32
inversek2j      2 -> 8 -> 2             3.67     4.40     5.50     7.33     10.99
jmeint          18 -> 32 -> 8 -> 2      3.05     3.50     4.10     4.96     6.26
jpeg            64 -> 16 -> 8 -> 64     1.60     1.66     1.73     1.80     1.88
kmeans          6 -> 8 -> 4 -> 1        0.87     0.86     0.86     0.85     0.84
sobel           9 -> 8 -> 1             2.19     2.36     2.57     2.81     3.11
geomean                                 2.10     2.32     2.60     3.02     3.78

(a) Total application speedup with limited A-NPU invocations.

Application energy improvement (columns give the fraction of invocations offloaded to the A-NPU):

Application     Topology                80%      85%      90%      95%      100%
blackscholes    6 -> 8 -> 8 -> 1        4.64     6.00     8.51     14.59    51.25
fft             1 -> 4 -> 4 -> 2        1.49     1.54     1.59     1.65     1.70
inversek2j      2 -> 8 -> 2             4.41     5.61     7.69     12.25    30.02
jmeint          18 -> 32 -> 8 -> 2      4.09     5.06     6.65     9.70     17.89
jpeg            64 -> 16 -> 8 -> 64     1.87     1.98     2.10     2.24     2.40
kmeans          6 -> 8 -> 4 -> 1        1.25     1.27     1.29     1.32     1.34
sobel           9 -> 8 -> 1             2.07     2.22     2.39     2.59     2.83
geomean                                 2.50     2.83     3.33     4.17     6.37

(b) Total application energy saving for limited A-NPU invocations.

Figure 9: Speedup/energy saving with limited A-NPU invocations.

Limited analog acceleration. We examine how the benefits change when, due to noise or pathological inputs, only a fraction of the invocations are offloaded to the A-NPU. In this case, the application falls back to the original code for the remaining invocations. Figure 9 depicts the application speedup and energy improvement when only 80%, 85%, 90%, 95%, or 100% of the invocations are offloaded to the A-NPU. The results suggest that even limited analog acceleration can provide significant energy and performance improvements.
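A minimal sketch of this invocation-level fallback, assuming a hypothetical should_offload() guard (standing in for a noise or quality monitor) and an anpu_invoke() stub standing in for the accelerator call; the region bodies here are placeholders, not the benchmarks' code.

    import random

    def accelerated_region(inputs, should_offload, anpu_invoke, precise_fn):
        """Offload an approximable region when permitted; otherwise run the original code."""
        if should_offload(inputs):
            return anpu_invoke(inputs)   # approximate, low-energy path
        return precise_fn(inputs)        # exact fallback bounds the output quality loss

    def precise_region(x):
        return sum(x)                    # placeholder for the original, exact region

    def anpu_stub(x):
        return sum(x) * 1.01             # placeholder for the (approximate) A-NPU result

    guard = lambda _x: random.random() < 0.90   # offload roughly 90% of invocations

    result = accelerated_region([1.0, 2.0, 3.0], guard, anpu_stub, precise_region)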

7. Limitations and Considerations

Applicability. Not all applications can benefit from analog acceleration; however, our work shows that many can. More rigorous optimization at the circuit level, as well as broadening the scope of application coverage through continued advances in the neural transformation step, may provide significant improvements in accuracy and generality.

Other design points. This study evaluates the performance and energy improvements of an A-NPU assuming integration with a modern, high-performance processor. If low-power cores are used instead, we expect, and preliminary results confirm, that the performance benefits of an A-NPU increase while the energy benefits decrease.

Variability and noise. We designed the circuit with variability and noise as first-order concerns, and we made several design decisions to mitigate them. We limit the input and weight bit widths, as well as the analog neuron input count, to eight to provide quantization margins for variation and noise. We designed the sigmoid circuit to be an order of magnitude more shallow than the digital implementation to provide additional margins for variation and noise. We used a differential design, which provides resilience to noise by representing a value as the difference between two signals; because noise affects the pair of nearby signals similarly, the difference between them remains intact and the computation correct. Conversion to the digital domain after each analog neuron computation enforces computation integrity and reduces susceptibility to variation and noise, while incurring energy and speed overheads. As mentioned in Section 6, to further improve the quality of the final result, we can refrain from A-NPU invocations and fall back to the original code as needed. An online noise-monitoring system could potentially limit A-NPU invocation to low-noise situations. Incorporating a quantitative noise model into the training algorithm may improve robustness to analog noise.

Training for variability. A neural approach to approximate computing presents the opportunity to correct for certain types of analog-imposed inaccuracy, such as process variation, non-linearity, and other forms of non-ideality that are consistent for executions on a particular A-NPU hardware instance over some period of time. After an initial training phase that accounts for the predictable, compiler-exposed analog limitations, a second (shorter) training phase can adjust for hardware-specific non-idealities, sending training inputs and outputs to the A-NPU and adjusting network weights to minimize error. This correction technique can address inter- and intra-chip process variation and hardware-dependent, non-ideal analog behavior.
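One possible shape for this second, hardware-specific training phase is a hardware-in-the-loop weight adjustment. The sketch below is an assumption-laden stand-in: it uses a hypothetical anpu_forward(weights, x) call that returns the measured A-NPU output for a scalar-output network and a plain finite-difference gradient step, whereas the paper's own training machinery is the RPROP-based, customized CDLM algorithm.

    import numpy as np

    def hardware_in_the_loop_tune(weights, samples, anpu_forward, lr=0.01, eps=1e-3, epochs=5):
        """Nudge pretrained weights to compensate for device-specific non-idealities."""
        w = np.array(weights, dtype=float)
        for _ in range(epochs):
            for x, target in samples:
                base_err = (anpu_forward(w, x) - target) ** 2
                grad = np.zeros_like(w)
                for i in range(w.size):      # finite-difference gradient, one weight at a time
                    w[i] += eps
                    grad[i] = ((anpu_forward(w, x) - target) ** 2 - base_err) / eps
                    w[i] -= eps
                w -= lr * grad               # descend toward lower measured error on the device
        return w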


Smaller technology nodes. This work is a first step toward using analog circuits for code acceleration. Providing benefits at smaller technology nodes may require using larger transistors for the analog parts, trading area for resilience. Energy-efficient performance is growing in importance relative to area efficiency, especially as the benefits of CMOS scaling continue to diminish.

8. Related Work

This research lies at the intersection of (a) general-purpose approximate computing, (b) accelerators, (c) analog and digital neural hardware, (d) neural-based code acceleration, and (e) limited-precision learning. This work combines techniques from all of these areas to provide a compilation workflow and an architecture/circuit design that enable code acceleration with limited-precision, mixed-signal neural hardware. In each area, we discuss the key related work that inspired our approach.

General-purpose approximate computing. Several studies have shown that diverse classes of applications are tolerant to imprecise execution [20, 54, 34, 12, 45]. A growing body of work has explored relaxing the abstraction of full accuracy at the circuit and architecture level for gains in performance, energy, and resource utilization [13, 17, 44, 2, 35, 8, 28, 38, 18, 51, 46]. These circuit and architecture studies, although proven successful, are limited to purely digital techniques. We explore how a mixed-signal, analog-digital approach can go beyond what digital approximation techniques offer.

Accelerators. Research on accelerators seeks to synthesize efficient circuits or FPGA configurations to accelerate general-purpose code [41, 42, 11, 19, 29]. Similarly, static specialization has shown significant efficiency gains for irregular and legacy code [52, 53]. More recently, configurable accelerators have been proposed that allow the main CPU to offload certain code to a small, efficient structure [23, 24]. This paper extends prior work on digital accelerators with a new class of mixed-signal, analog-digital accelerators.

Analog and digital neural hardware. There is an extensive body of work on hardware implementations of neural networks, both digital [40, 15, 55] and analog [5, 47, 49, 32, 48]. Recent work has proposed higher-level abstractions for the implementation of neural networks [27]. Other work has examined fault-tolerant hardware neural networks [26, 50]. In particular, Temam [50] uses datasets from the UCI machine learning repository [21] to explore the fault tolerance of a hardware neural network design. In contrast, our compilation, neural-network selection/training framework, and architecture design aim at applying neural networks to general-purpose code written in familiar programming models and languages, not code explicitly written to use neural networks directly.

Neural-based code acceleration. A recent study [9] shows that a number of applications can be manually reimplemented with explicit use of various kinds of neural networks.

That study did not prescribe a programming workflow or a preferred hardware architecture. More recent work exposes analog spiking neurons as primitive operators [4]. That work devises a new programming model that allows programmers to express digital signal-processing applications as a graph of analog neurons and automatically maps the expressed graph onto tiled, analog, spiking-neural hardware. The work in [4] is restricted to the domain of applications whose inputs are real-world signals that should be encoded as pulses. Our approach addresses the long-standing challenges of using analog computation, programmability and generality, by not imposing domain-specific limitations and by providing analog circuitry that is integrated with a conventional digital processor in a way that does not require a new programming paradigm.

Limited-precision learning. The work in [14] provides a comprehensive survey of learning algorithms that consider limited-precision neural hardware implementations. We tried various algorithms and found CDLM [10] to be the most effective. More sophisticated limited-precision learning techniques could improve the quality results reported in this paper and further confirm the feasibility and effectiveness of the mixed-signal approach to neural-based code acceleration.

9. Conclusions

For decades, before the effective end of Dennard scaling, we consistently improved performance and efficiency while maintaining generality in general-purpose computing. As the benefits from scaling diminish, the community faces an iron triangle: we can choose any two of performance, efficiency, and generality at the expense of the third. Solutions that improve performance and efficiency while retaining as much generality as possible are growing in importance. Analog circuits inherently trade accuracy for significant gains in energy efficiency. However, it is challenging to utilize them in a way that is both programmable and generally useful. As this paper has shown, the neural transformation of general-purpose approximable code provides an avenue for realizing the benefits of analog computation while targeting code written in conventional languages. This work provided an end-to-end solution, from circuits to compiler design, for using analog circuits to accelerate approximate applications. The insights from this work show that it is crucial to expose analog circuit characteristics to the compilation and neural-network training phases. The NPU model offers a way to exploit analog efficiencies, despite their challenges, for a wider range of applications than is typically possible. Further, mixed-signal execution delivers much larger savings for NPUs than digital execution. However, this study is not conclusive. The full range of applications that can exploit mixed-signal NPUs is still unknown, as is whether it will be sufficiently large to drive adoption in high-volume microprocessors. It is still an open question how developers might reason about the acceptable level of error when an application undergoes approximate execution that includes analog acceleration. Finally, it is not clear that an analog NPU would remain unaffected in a noisy, high-performance microprocessor environment.


However, the significant gains from A-NPU acceleration and the diversity of the studied applications suggest a potentially promising path forward.

Acknowledgments

This work was supported in part by gifts from Microsoft Research and Google, by NSF award CCF-1216611, and by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

References

[1] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design. Oxford University Press, 2002.
[2] C. Alvarez, J. Corbal, and M. Valero, “Fuzzy memoization for floating-point multimedia applications,” IEEE TC, 2005.
[3] W. Baek and T. M. Chilimbi, “Green: A framework for supporting energy-conscious programming using controlled approximation,” in PLDI, 2010.
[4] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, “Continuous real-world inputs can open up alternative accelerator designs,” in ISCA, 2013.
[5] B. E. Boser, E. Säckinger, J. Bromley, Y. L. Cun, L. D. Jackel, and S. Member, “An analog neural network processor with programmable topology,” JSSC, 1991.
[6] Y. Cao, “Predictive technology models,” 2013. Available: http://ptm.asu.edu
[7] M. Carbin, S. Misailovic, and M. C. Rinard, “Verifying quantitative reliability for programs that execute on unreliable hardware,” in OOPSLA, 2013.
[8] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) SOC architectures based on probabilistic CMOS (PCMOS) technology,” in DATE, 2006.
[9] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, “BenchNN: On the broad potential application scope of hardware neural network accelerators,” in IISWC, 2012.
[10] F. Choudry, E. Fiesler, A. Choudry, and H. J. Caulfield, “A weight discretization paradigm for optical neural networks,” in ICOE, 1990.
[11] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner, “Application-specific processing on a general-purpose core via transparent instruction set customization,” in MICRO, 2004.
[12] M. de Kruijf and K. Sankaralingam, “Exploring the synergy of emerging workloads and silicon reliability trends,” in SELSE, 2009.
[13] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An architectural framework for software recovery of hardware faults,” in ISCA, 2010.
[14] S. Draghici, “On the capabilities of neural networks using limited precision weights,” Elsevier NN, 2002.
[15] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie, “Neural network stream processing core (NnSP) for embedded systems,” in ISCAS, 2006.
[16] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in ISCA, 2011.
[17] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture support for disciplined approximate programming,” in ASPLOS, 2012.
[18] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[19] K. Fan, M. Kudlur, G. Dasika, and S. Mahlke, “Bridging the computation gap between programmable processors and hardwired accelerators,” in HPCA, 2009.
[20] Y. Fang, H. Li, and X. Li, “A fault criticality evaluation framework of digital systems for error tolerant video applications,” in ATS, 2011.
[21] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. Available: http://archive.ics.uci.edu/ml
[22] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE TC, 2011.
[23] V. Govindaraju, C. Ho, and K. Sankaralingam, “Dynamically specialized datapaths for energy efficient computing,” in HPCA, 2011.
[24] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August, “Bundled execution of recurring traces for energy-efficient general purpose processing,” in MICRO, 2011.
[25] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Toward dark silicon in servers,” IEEE Micro, 2011.
[26] A. Hashmi, H. Berry, O. Temam, and M. Lipasti, “Automatic abstraction and fault tolerance in cortical microarchitectures,” in ISCA, 2011.
[27] A. Hashmi, A. Nere, J. J. Thomas, and M. Lipasti, “A case for neuromorphic ISAs,” in ASPLOS, 2011.
[28] R. Hegde and N. R. Shanbhag, “Energy-efficient signal processing via algorithmic noise-tolerance,” in ISLPED, 1999.
[29] Y. Huang, P. Ienne, O. Temam, Y. Chen, and C. Wu, “Elastic CGRAs,” in FPGA, 2013.
[30] C. Igel and M. Hüsken, “Improving the RPROP learning algorithm,” in NC, 2000.
[31] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John Wiley and Sons, Inc., 1997.
[32] A. Joubert, B. Belhadj, O. Temam, and R. Héliot, “Hardware spiking neurons design: Analog or digital?” in IJCNN, 2012.
[33] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009.
[34] X. Li and D. Yeung, “Exploiting soft computing for increased fault tolerance,” in ASGI, 2006.
[35] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, “Flikker: Saving DRAM refresh-power through critical data partitioning,” in ASPLOS, 2011.
[36] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, “Quality of service profiling,” in ICSE, 2010.
[37] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in MICRO, 2007.
[38] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, “Scalable stochastic processors,” in DATE, 2010.
[39] A. Patel, F. Afram, S. Chen, and K. Ghose, “MARSSx86: A full system simulator for x86 CPUs,” in DAC, 2011.
[40] K. W. Przytula and V. K. P. Kumar, Eds., Parallel Digital Implementations of Neural Networks. Prentice Hall, 1993.
[41] A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers, “CHiMPS: A high-level compilation flow for hybrid CPU-FPGA architectures,” in FPGA, 2008.
[42] R. Razdan and M. D. Smith, “A high-performance microarchitecture with hardware-programmable functional units,” in MICRO, 1994.
[43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[44] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “SAGE: Self-tuning approximation for graphics engines,” in MICRO, 2013.
[45] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in PLDI, 2011.
[46] A. Sampson, J. Nelson, K. Strauss, and L. Ceze, “Approximate storage in solid-state memories,” in MICRO, 2013.
[47] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of analog neural networks,” in IJCNN, 2008.
[48] R. St. Amant, D. A. Jiménez, and D. Burger, “Mixed-signal approximate computation: A neural predictor case study,” IEEE Micro Top Picks, vol. 29, no. 1, January/February 2009.
[49] S. M. Tam, B. Gupta, H. A. Castro, and M. Holler, “Learning on an analog VLSI neural network chip,” in SMC, 1990.
[50] O. Temam, “A defect-tolerant accelerator for emerging high-performance applications,” in ISCA, 2012.
[51] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Quality-programmable vector processors for approximate computing,” in MICRO, 2013.
[52] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor, “Conservation cores: Reducing the energy of mature computations,” in ASPLOS, 2010.
[53] G. Venkatesh, J. Sampson, N. Goulding, S. K. Venkata, M. Taylor, and S. Swanson, “QsCores: Trading dark silicon for scalable energy efficiency with quasi-specific cores,” in MICRO, 2011.
[54] V. Wong and M. Horowitz, “Soft error resilience of probabilistic inference applications,” in SELSE, 2006.
[55] J. Zhu and P. Sutton, “FPGA implementations of neural networks: A survey of a decade of progress,” in FPL, 2003.