

RRAM-based Analog Approximate Computing

Boxun Li, Student Member, IEEE, Peng Gu, Student Member, IEEE, Yi Shan, Yu Wang, Senior Member, IEEE, Yiran Chen, Member, IEEE, Huazhong Yang, Senior Member, IEEE

Abstract—Approximate computing is a promising design paradigm for better performance and power efficiency. In this work, we propose a power efficient framework for analog approximate computing with the emerging metal-oxide resistive switching random-access memory (RRAM) devices. A programmable RRAM-based approximate computing unit (RRAM-ACU) is introduced first to accelerate approximated computation, and a scalable approximate computing framework is then proposed on top of the RRAM-ACU. In order to program the RRAM-ACU efficiently, we also present a detailed configuration flow, which includes a customized approximator training scheme, an approximator-parameter-to-RRAM-state mapping algorithm, and an RRAM state tuning scheme. Finally, the proposed RRAM-based computing framework is modeled at the system level. A predictive compact model is developed to estimate the configuration overhead of the RRAM-ACU and help explore the application scenarios of RRAM-based analog approximate computing. The simulation results on a set of diverse benchmarks demonstrate that, compared with an x86-64 CPU at 2GHz, the RRAM-ACU is able to achieve 4.06∼196.41× speedup and power efficiency of 24.59∼567.98 GFLOPS/W with quality loss of 8.72% on average. And the implementation of the HMAX application demonstrates that the proposed RRAM-based approximate computing framework can achieve >12.8× better power efficiency than its pure digital implementation counterparts (CPU, GPU, and FPGA).

Index Terms—RRAM, approximate computing, power efficiency, neural network

I. INTRODUCTION

Power efficiency has become a major concern in modern computing system design [1]. The limited battery capacity demands power efficiency of hundreds of giga floating point operations per second per watt (GFLOPS/W) for mobile embedded systems to achieve the desirable portability and performance [2]. However, the highest power efficiency of contemporary CPU and GPU systems is only ∼10 GFLOPS/W, and it is not expected to improve substantially in the predictable scaled technology nodes [3], [4]. As a result, researchers are looking for alternative architectures and technologies to achieve further performance and efficiency gains [5].

Approximate computing provides a promising solution to close the gap of power efficiency between present-day capabilities and future requirements [6].

This work was supported by 973 Project 2013CB329000, National Natural Science Foundation of China (No. 61373026), Brain Inspired Computing Research, Tsinghua University (20141080934), Tsinghua University Initiative Scientific Research Program, the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions, NSF Grant CNS-1253424 and ECCS-1202225.

B. Li, P. Gu, Y. Wang, and H. Yang are with the Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).

Y. Shan is with the Baidu Research - Institute for Deep Learning (IDL), Baidu, Inc., Beijing 100085, China (e-mail: [email protected]).

Y. Chen is with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA (e-mail: [email protected]).

Approximate computing takes advantage of the characteristic that many modern applications, ranging from signal processing and pattern recognition to computer vision, are able to produce results with acceptable quality even if many computations are executed imprecisely [7]. This tolerance of imprecise computation can be leveraged for substantial performance and efficiency gains and has inspired a wide range of architectural innovations [1], [8].

Recent work in approximate computing mainly focuses on hardware design of the basic computing elements, such as approximate adders and logic [9]–[11]. These techniques have adequately demonstrated the benefit of approximate computing, but their fixed functionality and low-level design stage limit further improvements in performance and efficiency. Moreover, these techniques are all based on traditional CMOS technology, even though innovations in device technology have offered a great opportunity for radically different forms of architecture design and can significantly improve the performance and efficiency of computing systems [12].

Our objective is to use the emerging metal-oxide resistive random-access memory (RRAM) devices to design a reconfigurable approximate computing framework with both power efficiency and computation generality. The RRAM device (or the memristor) is one of the promising innovations that can advance Moore's Law beyond the present silicon roadmap horizons [13]. RRAM devices are able to support a large number of signal connections within a small footprint by taking advantage of their ultra-high integration density. More importantly, RRAM devices can be used to build the resistive cross-point structure [14], also known as the RRAM crossbar array, which can naturally transfer the weighted combination of input signals to output voltages and realize matrix-vector multiplication with remarkable power efficiency [15], [16].

To realize this goal, the following challenges must be overcome. First of all, an architecture, from the basic processing unit to a scalable framework, is required to provide an efficient hardware implementation for RRAM-based analog approximate computing. Secondly, from the perspective of software, a detailed configuration flow is needed to program the hardware efficiently for each specific application. Finally, a comprehensive analysis of the system performance and major trade-offs is needed to explore the application scenarios of RRAM-based analog approximate computing.

In this work, we explore the potential of RRAM-based analog approximate computing. The main contributions of this work include:

• We propose a power efficient RRAM-based approximate computing framework. The framework is scalable and is integrated with our programmable RRAM-based approximate computing units (RRAM-ACUs), which work as universal approximators. Simulation results show that the RRAM-ACU offers less than 1.87% error for 6 common complex functions.

• A configuration flow is proposed to program RRAM-ACUs. The configuration flow includes three phases: i). a training scheme customized for RRAM-ACU to train its neural approximator; ii). a parameter mapping scheme to convert the parameters of a trained neural approximator to appropriate RRAM resistance states; and iii). a state tuning scheme to tune RRAM devices to target states.

• The proposed RRAM-based computing system is modeled at the system level to estimate the system performance and explore the major trade-offs and application scenarios of RRAM-based analog approximate computing. Particularly, a predictive compact model is developed to evaluate the configuration overhead of the RRAM-ACU.

• A set of diverse benchmarks is used to evaluate the performance of RRAM-based approximate computing. Experiment results demonstrate that, compared with an x86-64 CPU at 2GHz, our RRAM-ACU provides power efficiency of 249.14 GFLOPS/W and speedup of 67.29× with quality loss of 8.72% on average. And the implementation of the HMAX application demonstrates that the proposed RRAM-based approximate computing framework is able to support large scale applications under different noise conditions, and achieves >12.8× power efficiency improvements over the CPU, GPU, and FPGA implementation counterparts.

The rest of this paper is organized as follows: Section II provides the basic background knowledge. Section III introduces the details of the proposed RRAM-based approximate computing framework. The configuration flow and the modeling method are depicted in Sections IV and V, respectively. Experimental results on different benchmarks are presented in Section VI. Finally, Section VII concludes this work.

II. PRELIMINARIES

A. RRAM Characteristics and Device Model

The RRAM device is a passive two-port element based on TiOx, WOx, HfOx [17] or other materials with variable resistance states. The most attractive feature of RRAM devices is that they can be used to build the resistive cross-point structure, also known as the RRAM crossbar array. Compared with other non-volatile memories like flash, the RRAM crossbar array can naturally transfer the weighted combination of input signals to output voltages and realize matrix-vector multiplication efficiently by reducing the computation complexity from O(n^2) to O(1). And the continuously variable resistance states of RRAM devices enable a wide range of matrices to be represented by the crossbar. These unique properties make RRAM devices and the RRAM crossbar array promising tools to realize analog computing with great efficiency.

Fig. 1. (a). Physical model of the HfOx based RRAM. The RRAM resistance state is determined by the tunneling gap distance d, and d will evolve due to the field and thermally driven oxygen ion migration. (b). Typical DC I-V bipolar switching curves of HfOx RRAM devices reported in [18].

Fig. 1(a) demonstrates a model of the HfOx based RRAM device [18]. The structure is a resistive switching layer sandwiched between two electrodes. The conductance is exponentially dependent on the tunneling gap distance (d) as:

I = I_0 \cdot \exp(-d/d_0) \cdot \sinh(V/V_0)    (1)
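As a concrete illustration of Eq. (1), the following Python sketch evaluates the device current and its small-signal conductance. The fitting constants I0, d0, and V0 below are illustrative placeholders, not the calibrated values of the HfOx model in [18]:

import math

# Fitting constants of Eq. (1). The numbers below are illustrative
# placeholders, NOT the calibrated values of the HfOx model in [18].
I0 = 1e-3   # current prefactor (A), assumed
D0 = 0.25   # characteristic tunneling length (nm), assumed
V0 = 0.4    # characteristic voltage (V), assumed

def rram_current(d_nm, v):
    """Eq. (1): I = I0 * exp(-d/d0) * sinh(V/V0)."""
    return I0 * math.exp(-d_nm / D0) * math.sinh(v / V0)

def rram_conductance(d_nm, v_read=0.1):
    """Small-signal conductance at a low read voltage, where
    sinh(V/V0) ~ V/V0 and the I-V curve is approximately linear."""
    return rram_current(d_nm, v_read) / v_read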

Ideal resistive crossbar-based analog computing requires both a linear I-V relationship and continuously variable resistance states. However, today's RRAM devices cannot satisfy these requirements perfectly. Therefore, we introduce the practical characteristics of RRAM devices in this section:

• The I-V relationship of RRAM devices is non-linear. However, when V is very small, the approximation \sinh(V/V_0) \approx V/V_0 can be applied. Therefore, the voltages applied on RRAM devices should be limited to achieve an approximately linear I-V relationship [19].

• As shown in Fig. 1(b), the SET process (from a high resistance state (HRS) to a low resistance state (LRS)) is abrupt, while the RESET process (the opposite switching event from LRS to HRS) is gradual. The RESET process is therefore usually used to achieve multiple resistance states [20].

• Even in the RESET process, the RRAM resistance change is stochastic and abrupt. This phenomenon is called 'variability'. The RRAM variability can be approximated as a lognormal distribution and can make the RRAM device miss the target state in the switching process.

In this paper, we use the HfOx based RRAM device for study because it is one of the most mature materials explored [17]. The analytical model is put into the circuit with Verilog-A [18], [21]. We use HSPICE to simulate the circuit performance and study the device and circuit interaction issues for RRAM-based approximate computing.

B. Neural Approximator

Fig. 3 illustrates a simple model of a 3-layer feedforward artificial neural network with one hidden layer. The computation between neighbouring layers of the network can be expressed as:

y_j = f_j\left(\sum_{i=1}^{n} w_{ij} \cdot x_i + b_j\right)    (2)

or:

\vec{y} = f(W \cdot \vec{x} + \vec{b})    (3)

where x_i is the value of node i in the input (hidden) layer, and y_j represents the result of node j in the hidden (output) layer.


Fig. 2. The overview of the hardware architecture of RRAM-based analog approximate computing. (a)&(b). RRAM approximate computing framework. (c)&(d). RRAM-based approximate computing unit (RRAM-ACU).

Fig. 3. A 3-layer feedforward neural network with one hidden layer.

w_{ij} is the connection weight between x_i and y_j, b_j is an offset, and f_j(x) is an activation function, e.g. the sigmoid function:

f(x) = \frac{1}{1 + e^{-x}}    (4)

It has been proven that a universal approximator can be implemented by a 3-layer feedforward network with one hidden layer and a sigmoid activation function [22], [23]. Table I gives the maximum errors of approximating six common functions by this method, based on MATLAB simulation. The mean square errors (MSE) of the approximations are less than 10^{-6} after the network training algorithm completes¹ [24]. The neural approximator offers less than 1.87% error for the 6 common complex functions. This precision level is able to satisfy the requirements of many approximate computing applications [1].
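For reference, Eqs. (2)-(4) amount to the following forward pass. This is a minimal NumPy sketch in which random weights stand in for a trained approximator; the 9×8×1 topology is borrowed from the 'Sobel' benchmark in Table II:

import numpy as np

def sigmoid(x):
    # Eq. (4): f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def approximator_forward(x, W1, b1, W2, b2):
    """Eqs. (2)-(3) applied twice: input -> hidden -> output layer."""
    hidden = sigmoid(W1 @ x + b1)     # hidden layer activations
    return sigmoid(W2 @ hidden + b2)  # output layer

# Toy example: a 9x8x1 network with random weights standing in
# for trained parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 9)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)
y = approximator_forward(rng.normal(size=9), W1, b1, W2, b2)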

III. RRAM-BASED ANALOG APPROXIMATE COMPUTING

Fig. 2 demonstrates an overview of the hardware implementation of RRAM-based analog approximate computing. In this section, we will introduce this framework from the basic RRAM-based approximate computing unit (RRAM-ACU) to the scalable RRAM-based approximate computing framework.

A. RRAM-based Approximate Computing Unit

Fig. 2(c)&(d) shows the proposed RRAM-based approximate computing unit (RRAM-ACU). The RRAM-ACU is based on an RRAM hardware implementation of a 3-layer network (with one hidden layer) that works as a universal approximator.

¹Theoretically, the network's accuracy should increase with the network size. However, it is usually more difficult to train a bigger network. The network may easily fall into a local minimum instead of the global optimum, and thus sometimes provides a worse result [24].

TABLE I
MAXIMUM ERRORS (%) OF NEURAL APPROXIMATORS

Function     | Nodes in the Hidden Layer
             |     0      5     10     15     20     25
x1·x2·x3     | 22.79   1.10   0.68   0.28   0.34   0.27
x^-1         |  9.53   0.25   0.20   0.14   0.10   0.05
sin(x)       | 10.90   0.05   0.07   0.05   0.07   0.06
log(x)       |  7.89   0.21   0.13   0.14   0.12   0.14
exp(-x^2)    | 20.27   0.04   0.03   0.05   0.03   0.04
sqrt(x)      | 13.76   1.87   1.19   1.43   0.35   0.49

The mechanism is as follows.

As described in Eqs. (2)-(4), the neural approximator can be conceptually expressed as: i). a matrix-vector multiplication between the network weights and the input variables; and ii). a sigmoid activation function.

For the matrix-vector multiplication, this basic operation can be mapped to the RRAM crossbar array illustrated in Fig. 4. The output of the crossbar array can be expressed as:

V_{oj} = \sum_k V_{ik} \cdot c_{kj}    (5)

where, for Fig. 4(a), c_{kj} can be represented as:

c_{kj} = -\frac{g_{kj}}{g_s}    (6)

and for Fig. 4(b):

c_{kj} = \frac{g_{kj}}{g_s + \sum_{l=1}^{N} g_{lj}}    (7)

where g_{kj} is the RRAM conductance state in the crossbar array, and g_s represents the conductance of the load resistance.

Both types of crossbar array realize matrix-vector multiplication efficiently by reducing the computation complexity from O(n^2) to O(1).
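The following sketch emulates both crossbar read-outs of Eqs. (5)-(7). The conductance matrix and the load conductance g_s are illustrative values, not device data:

import numpy as np

def crossbar_output(v_in, G, g_s, with_op_amps=True):
    """Computes V_oj = sum_k V_ik * c_kj (Eq. (5)) for one crossbar.
    G[k, j] is the conductance of the cell at input row k, output column j."""
    if with_op_amps:
        C = -G / g_s                   # Eq. (6): c_kj = -g_kj / g_s
    else:
        C = G / (g_s + G.sum(axis=0))  # Eq. (7): column conductance sum
    return v_in @ C

# A 3x3 toy array; conductances (S) and voltages (V) are illustrative.
G = np.array([[2.0e-3, 1.0e-3, 0.5e-3],
              [1.5e-3, 2.5e-3, 1.0e-3],
              [0.5e-3, 1.0e-3, 2.0e-3]])
v_out = crossbar_output(np.array([0.10, 0.20, 0.05]), G, g_s=0.5e-3)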

The latter implementation, which does not require Op Amps, consumes less power and can be smaller in size. However, it has some drawbacks when building multilayer networks:

• First of all, c_{kj} not only depends on the corresponding g_{kj}, but also on all the RRAM devices in the same column. It is difficult to realize a linear one-to-one mapping between the network weight w_{ij} and the RRAM conductance g_{ij}. Although previous work proposed some approximate mapping algorithms, the computation accuracy is still a problem [25].


Fig. 4. Two implementations of RRAM crossbar arrays.


• Secondly, the parameters of neighbouring layers will influence each other through R_S. Voltage followers or buffer amplifiers are needed to isolate the different circuit stages and guarantee the driving strength [26], [27]. The size and energy savings over the first implementation are then lost.

The first implementation can overcome these drawbacks. Op Amps can enhance the output accuracy, make c_{kj} linearly depend on the corresponding g_{kj}, and isolate neighbouring layers. So we choose the first implementation to build the RRAM-ACU.

Since both R (the load resistance) and g (the conductance states of RRAM devices) can only be positive, two crossbar arrays are needed to represent the positive and negative weights of a neural approximator, respectively, with the help of analog inverters [28] as shown in Fig. 6.

The practical weights of the network can be expressed as:

w_{kj} = R \cdot (g_{kj(positive)} - g_{kj(negative)})    (8)

We also note that the polarities of the terminals of the RRAM devices in the two crossbar arrays should be set to opposite directions. This technique is aimed at making the resistance state deviations caused by the currents passing through the paired RRAM devices cancel each other [29]. We refer to this technique as RRAM pairing; it is shown in Fig. 6.

The sigmoid activation function can be generated by the circuit described in [30], and a complete feedforward network without a hidden layer is accomplished.

Finally, by combining two such networks together, a three-layer feedforward network unit is realized. As described in Section II-B, this network can work as a universal approximator to perform approximated computation. And a basic RRAM approximate computing unit is accomplished.

B. RRAM-based Approximate Computing Framework

The overview of the proposed RRAM approximate computing framework is shown in Fig. 2(a)&(b). The building blocks of the framework are the RRAM processing elements (RRAM PEs). Each RRAM PE consists of several RRAM-ACUs to accomplish algebraic calculus. Each RRAM PE is also equipped with its own digital-to-analog converters (DACs) to generate analog signals for processing. In addition, the RRAM PE may also have several local memories, e.g., analog data stored in the form of the resistance states of RRAM devices, or digital data stored in DRAM or SRAM. Both the use and the type of local memory depend on the application requirements, and we will not limit or discuss its implementation in detail in this work.

Fig. 6. RRAM pairing technique.

On top of that, all the RRAM PEs are organized by two multiplexers with a Round-Robin algorithm.

In the processing stage, the data are injected into the platform sequentially. The input multiplexer delivers the data to the relevant RRAM PE to perform approximate computing. The data are fed into the RRAM PE in digital format, and the DACs in each RRAM PE convert the data into analog signals. Each RRAM PE may work at a low frequency, but a group of RRAM PEs can work in parallel to achieve high performance. Finally, the output data are transmitted out of the RRAM PE by the output multiplexer for further processing, e.g., to be converted back into digital format by a high performance analog-to-digital converter (ADC).

The framework is scalable, and the user can configure it according to individual demands. For example, for tasks requiring power efficiency, it is better to choose low power Op Amps to form the RRAM-ACUs, and each RRAM PE may work at a low frequency. On the other hand, high speed Op Amps, AD/DAs, and even a hierarchical task allocation architecture will be preferred for high performance applications.

IV. CONFIGURATION FLOW FOR RRAM-ACU

The RRAM-based analog approximate computing hardware requires a configuration flow to get programmed for each specific task. In this section, we discuss the detailed configuration flow for the proposed RRAM-based approximate computing units (RRAM-ACUs). The flow is illustrated in Fig. 5. It includes three phases to solve the following problems:

1) Training Phase: How to train a neural approximator in an RRAM-ACU to learn the required approximate computing task?

2) Mapping Phase: The parameters of a trained approximator can NOT be directly configured into the RRAM-ACU. We need to map these parameters to appropriate RRAM resistance states in the RRAM crossbar array.

3) Tuning Phase: After we achieve a set of RRAM resistance states for an approximate computing task, how to tune the RRAM devices accurately and efficiently to the target states?

All these phases will be introduced in detail in the followingsubsections.

A. Training Phase: Neural Approximator Training Algorithm

The RRAM approximate computing unit is based on an RRAM implementation of a neural approximator. The approximator must be trained efficiently for each specific function.


Fig. 5. Configuration flow for RRAM-ACU. The flow includes three phases: i). a training scheme customized for RRAM-ACU to train the neural approximator; ii). a parameter mapping scheme to convert the parameters of a trained neural approximator to appropriate RRAM resistance states; and iii). an RRAM state tuning scheme to tune RRAM devices to target states efficiently.

Fig. 7. Comparison between the mathematical sigmoid function and its analog implementation reported in [30]. The output of the analog implementation is multiplied by 0.5 for normalization. A significant difference can be observed.

The training process can be realized by adjusting the weights in the network layer by layer [24]. The update of each weight w_{ji} can be expressed as:

w_{ji} \leftarrow w_{ji} + \eta \cdot \delta_j \cdot x_i    (9)

where x_i is the value of node i, \eta is the learning rate, and \delta_j is the error back-propagated from node j in the next neighbouring layer. \delta_j depends on the derivative of the activation function (e.g. the sigmoid function) as described in Section II-B.

In the RRAM-ACU training phase, the calculations of both the sigmoid function and its derivative should be adjusted according to the analog sigmoid circuit. Fig. 7 illustrates a comparison between the exact mathematical sigmoid function and its hardware implementation reported in [30]. The I-V relationship is simulated with HSPICE. There is a significant difference between them. Therefore, we replace the mathematical sigmoid activation function with its simulation results in the training scheme of RRAM-ACU.
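A sketch of the customized weight update of Eq. (9), in which the mathematical sigmoid and its derivative are replaced by an interpolated lookup table of the simulated circuit response. The table values here are placeholders (an ideal curve); in the actual flow they would come from the HSPICE simulation:

import numpy as np

# Simulated transfer curve of the analog sigmoid circuit [30]. In the
# real flow these points come from HSPICE; an ideal curve stands in here.
sim_in = np.linspace(-6.0, 6.0, 121)       # assumed input grid (mA)
sim_out = 1.0 / (1.0 + np.exp(-sim_in))    # placeholder for simulated output (V)

def analog_sigmoid(x):
    """Tabulated circuit response used in place of the exact Eq. (4)."""
    return np.interp(x, sim_in, sim_out)

def analog_sigmoid_deriv(x, eps=1e-3):
    """Numerical derivative of the tabulated response; it feeds the
    back-propagated error term delta_j used in Eq. (9)."""
    return (analog_sigmoid(x + eps) - analog_sigmoid(x - eps)) / (2 * eps)

def update_weight(w_ji, eta, delta_j, x_i):
    # Eq. (9): w_ji <- w_ji + eta * delta_j * x_i
    return w_ji + eta * delta_j * x_i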

Finally. it’s worth noting that most weights are small(around zero) after a proper training2. For example, more than90% weights of the trained network3 are within the range of[−1.5, 1.5] for all the benchmarks used in this paper. Thelimitation of weight amplitude can simplify the design ofRRAM state tuning scheme and help improve the tuning speed.

²A neural network tends to overfit when many of its weights are large [31]. Overfitting is the problem that the model learns too much, including the noise, from the training data. The trained model will then have poor predictive performance on unknown testing data not covered by the training set.

³We use ℓ2 regularization in the training scheme. Regularization is a technique widely used in neural network training to limit the amplitude of the network weights, avoid overfitting, and improve model generalization [31]. To be specific, for ℓ2 regularization, a penalty proportional to the squared 2-norm of the network weights is added to the loss function of the network. So the error of the network and the amplitude of the weights are balanced and optimized simultaneously in the training process [31].

B. Mapping Phase: Mapping Neural Approximator Weights to RRAM Conductance States

Once the weights of a neural approximator are determined, the parameters need to be mapped to appropriate states of the RRAM devices in the crossbar arrays. Improperly converting the network weights to RRAM conductance states may result in the following problems:

• The converted results are beyond the actual range of the RRAM device.

• The dynamic range of the converted results is so small that the RRAM state may easily saturate.

• The converted results are so high that the summation of output voltages will exceed the working range of the Op Amps.

In order to prevent the above problems, we propose a parameter mapping algorithm to convert the weights of a neural approximator to appropriate conductance states of RRAM devices.

The mapping process can be abstracted as an optimization problem. The feasible range of the weights of neural approximators can be expressed as a function of RRAM parameters:

-R_S \cdot (g_{ON} - g_{OFF}) \leq w \leq R_S \cdot (g_{ON} - g_{OFF})    (10)

where g_{ON} = 1/R_{ON} and g_{OFF} = 1/R_{OFF}. R_{ON} and R_{OFF} are the lowest and highest resistance states of the RRAM devices. All the weights should be scaled into this range.

In order to extend the dynamic range and reduce the impact of process variation, we adjust g_{ON} and g_{OFF} to:

g'_{ON} = \frac{1}{\eta \cdot \Delta_{ON} + R_{ON}}    (11)

g'_{OFF} = \frac{1}{R_{OFF} - \eta \cdot \Delta_{OFF}}    (12)

where \Delta_{ON} and \Delta_{OFF} represent the maximum deviations of R_{ON} and R_{OFF} induced by process variation of the crossbar array, respectively. \eta is a scale coefficient, which is set to 1.1∼1.5 in our design to achieve a safety margin.

The risk of improper conversion can be measured by the following risk function:

Risk(g_{pos}, g_{neg}) = |g_{pos} - g'_{mid}| + |g_{neg} - g'_{mid}|    (13)

where:

g'_{mid} = \frac{g'_{ON} + g'_{OFF}}{2}    (14)

and g_{pos} and g_{neg} represent the conductance states of each pair of RRAM devices in the positive and negative crossbar arrays, respectively, as in Eq. (8).


Fig. 8. Program-and-verify (P&V) scheme for multi-level RRAM conductance state tuning.

Combining the constraints and the risk function, the parameter mapping problem can be described as the optimization problem shown below:

(g^*_{pos}, g^*_{neg}) = \arg\min Risk    (15)

s.t. \quad R_S \cdot (g^*_{pos} - g^*_{neg}) = w, \quad g'_{ON} \leq g_{pos} \leq g'_{OFF}, \quad g'_{ON} \leq g_{neg} \leq g'_{OFF}    (16)

The optimal solutions of this optimization problem are:

g^*_{pos} = g'_{mid} + \frac{w}{2R_S}, \qquad g^*_{neg} = g'_{mid} - \frac{w}{2R_S}    (17)

These are the appropriate conductance states of the RRAM devices with the minimum risk of improper parameter conversion.
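A compact sketch of the whole mapping phase, combining Eqs. (11), (12), (14), and (17). The default device parameters follow Table III, the variation margins ∆_ON/∆_OFF are left at zero as placeholders, and η is within the 1.1∼1.5 range stated above:

def map_weight(w, r_on=290.0, r_off=500e3, r_s=2e3,
               delta_on=0.0, delta_off=0.0, eta=1.2):
    """Map one network weight w to a (g_pos, g_neg) conductance pair.

    The dynamic range is shrunk by the variation margins, then the
    pair is centered on g'_mid so that R_S * (g_pos - g_neg) = w
    with minimum risk."""
    g_on = 1.0 / (eta * delta_on + r_on)     # Eq. (11)
    g_off = 1.0 / (r_off - eta * delta_off)  # Eq. (12)
    g_mid = 0.5 * (g_on + g_off)             # Eq. (14)
    g_pos = g_mid + w / (2.0 * r_s)          # Eq. (17)
    g_neg = g_mid - w / (2.0 * r_s)
    lo, hi = min(g_on, g_off), max(g_on, g_off)
    if not (lo <= g_pos <= hi and lo <= g_neg <= hi):  # Eq. (10)/(16) check
        raise ValueError("weight outside the mappable range")
    return g_pos, g_neg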

C. Tuning Phase: Tuning RRAM Devices to Target States

After the weights of a neural approximator are converted into RRAM conductance states, a state tuning scheme is required to program the RRAM devices in an RRAM-ACU to the target states.

Due to the stochastic characteristics of RRAM resistance change, the program-and-verify (P&V) method is commonly used in multi-level state tuning [32]. As shown in Fig. 8, the RRAM device is first initialized to LRS. Then a sequence of write pulses is applied to tune the RRAM device gradually. Each write pulse is followed by a read pulse to verify the current conductance state. The amplitude of the read pulse should be small enough not to change the RRAM conductance state. The P&V operation is repeated until the verify step detects that the RRAM device has reached the target range.

The P&V method chooses LRS as the initial state for the following reasons:

• LRS is much more uniform than HRS. When an RRAM device is switched between HRS and LRS repeatedly, LRS remains uniform while HRS usually varies a lot among different cycles [17], [18], [33];

• As shown in Fig. 1(b), the resistance change process from LRS to HRS is gradual, while the opposite process is abrupt. It is easier to achieve multiple resistance states from LRS than from HRS, although HRS may help reduce the power consumption.

Fig. 9. Proposed state tuning scheme for RRAM-ACU.

• Finally, the target resistance states are closer to LRS according to Eq. (17). As HRS is usually >100× larger than LRS, initializing RRAM devices to LRS requires far fewer pulses to reach the target resistance range.

However, tuning RRAM devices accurately to g'_{mid}, g^*_{pos}, or g^*_{neg} as in Eq. (17) still requires large effort with the P&V method. Considering the physical characteristics of RRAM devices and the circuit architecture of the RRAM-ACU, we propose a simple but efficient RRAM state tuning scheme, illustrated in Fig. 9. The proposed RRAM state tuning scheme includes the following two steps:

Step 1: Initialize all the RRAM devices in the paired crossbar arrays to the same initial state g_i. We hope that only one RRAM device in each pair needs tuning after we initialize all the RRAM devices to g_i. The choice of g_i is a major concern in this state tuning scheme. It should be able to approximate most of the optimal states (g'_{mid} + |w|/(2R_S)) in the crossbar array, and should be both uniform and easy to reach for RRAM devices. Therefore, we choose g_i to be close to g'_{mid}, because most w_{ij} are close to zero as discussed in Section IV-A, and the optimal states (g'_{mid} + |w|/(2R_S)) will accordingly be close to g'_{mid}. On top of that, g_i should be a uniform low resistance state that can be achieved easily according to the physical characteristics of RRAM devices. For example, for the HfOx RRAM devices used in this paper, the lowest resistance state is R_{ON} ≈ 290Ω [18], and we set g_i to ∼(500Ω)^{-1} as it is both close to g_{ON}/2 and can be easily achieved by limiting the compliance current [18].

Step 2: Tune the positive and negative crossbar arrays to satisfy R_S \cdot (g_{pos} - g_{neg}) = w. After initializing the RRAM devices to g_i ≈ g'_{mid}, only one RRAM device in each RRAM pair needs to be tuned according to Eq. (17). The state tuning scheme performs P&V operations on the corresponding RRAM device until Eq. (16) is satisfied.

Another problem of the state tuning scheme is that the variability of the resistance state change may make RRAM devices miss the target conductance range. Considering that the set-back process is abrupt and hard to control, and that most target states are close to g_i (the required resistance change is usually around tens of ohms), the proposed state tuning scheme resets the RRAM device to the initial state g_i in this case. There is no need for a complicated partial set-back operation at the cost of increased circuit complexity.
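A behavioral sketch of the two-step tuning scheme of Fig. 9: initialize to g_i, then apply P&V write/read pulses and RESET back to g_i whenever the stochastic state change overshoots the target window. The lognormal step model and its parameters below are assumptions standing in for real device behavior, not measured data:

import random

def tune_rram(g_target, g_init=1.0/500.0, eps=0.01, max_steps=100_000,
              step_mean=2e-5, step_sigma=0.5):
    """Program-and-verify with reset-on-overshoot (Fig. 9).
    Conductances are in siemens. The per-pulse conductance decrease is
    drawn from a lognormal distribution as a stand-in for the RRAM
    variability described in Section II-A."""
    g = g_init
    for step in range(1, max_steps + 1):
        if abs(g - g_target) <= eps * g_target:        # verify step
            return g, step
        g -= step_mean * random.lognormvariate(0.0, step_sigma)  # write pulse
        if g < g_target * (1.0 - eps):                 # missed the window:
            g = g_init                                 # RESET to initial state
    raise RuntimeError("target state not reached")

# Example: tune one device to a state slightly below the initial state.
g_final, n_steps = tune_rram(g_target=1.9e-3)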


Fig. 10. Tuning RRAM devices with the half-select method to mitigate the sneak path problem.


The last problem in the state tuning scheme is the sneak path problem. Sneak paths usually exist in memory architectures. As only one cell is selected in a memory read or write operation, it is difficult for the architecture to isolate the selected RRAM device from the unselected cells. The unselected cells form a sneak path, which disturbs the output signals and impacts the unselected cells' states [34]. However, when an RRAM crossbar array is used for computation, all the cells are selected for computation; in other words, no sneak path can be formed in this case. By contrast, each output port can only be used to tune one RRAM device in the corresponding column. We cannot select and tune all the RRAM devices in the crossbar array at the same time. The sneak path therefore still exists in the state tuning scheme.

In order to mitigate the impact of sneak paths in the state tuning scheme, the half-select method is adopted [14]. Fig. 10 illustrates the principle of the half-select method. The method aims to reduce the voltage drop between the selected and unselected cells, and thereby the sneak path current and its impact. A half-select voltage (V_W/2), instead of a connection to ground, is applied on the unselected word lines and bit lines. The maximum voltage drop between the selected and unselected cells is then V_W/2 instead of V_W. Therefore, the sneak path current is reduced and the unselected cells are protected.

The half-select method mitigates the sneak path problem at the cost of extra power consumption. We further reduce the DC component of the original half-select method to alleviate this problem. To be specific, voltages of (V_W/2) and (−V_W/2) are applied on the WL and BL of the selected cell, respectively, and the other unselected cells are connected to ground instead of a half-select voltage (V_W/2). This technique reduces around 75% of the power consumption compared with the original method.

Finally, we note that only RRAM devices in different word lines and bit lines can be tuned in parallel. A parallel state tuning scheme can significantly improve the tuning speed of the RRAM-ACU but requires extra copies of the peripheral circuits and additional control logic. As the energy consumption (the product of tuning time and power consumption) of tuning the entire RRAM crossbar array remains almost the same, there is a trade-off between the tuning speed and the circuit size in the RRAM state tuning scheme. In order to save more area for AD/DAs and Op Amps, each RRAM-ACU is equipped with only one set of tuning circuits in this work.

V. SYSTEM MODELING AND OVERHEAD DISCUSSION

In this section, we model the performance and energy consumption of the proposed RRAM-based analog approximate computing system at the system level. The model is used to analyze the system performance, quantify and demonstrate the major trade-offs, and explore the application scenarios of RRAM-based analog approximate computing.

A. System Level Modeling

Modeling an RRAM-based approximate computing system at the system level mainly includes 3 parts:

• Modeling the RRAM crossbar array and its peripheral circuits, such as Op Amps, sigmoid circuits, and analog inverters.

• Modeling the interface: AD/DAs.

• Evaluating the configuration overhead, especially the time and energy consumption of RRAM state tuning.

For the first part, we use a Verilog-A RRAM device model to build up the SPICE-level crossbar array as described in Section II-A. We use a fine-grained SPICE-level simulation because the physical characteristics of RRAM devices differ from an ideal linear resistance with continuously variable states, and other non-ideal factors, such as the IR-drop in the crossbar, are also difficult to estimate.

For the second part, we extract parameters like accuracy, speed, power, and area from fabricated chips. Because this work aims to explore the feasibility and potential of RRAM-based analog approximate computing, we mainly focus on the choice, instead of the design, of AD/DAs. Extracting the necessary parameters from fabricated chips is able to satisfy the requirements of modeling the RRAM-based computing system at the system level.

Finally, after the RRAM-ACUs are configured properly, the system can perform approximate computing with high power efficiency. However, tuning RRAM devices to target states is usually time and energy consuming, due to the large number of RRAM devices in an RRAM-ACU and the stochastic resistance change of RRAM devices. As a result, the configuration of the RRAM-ACU becomes a major overhead of RRAM-based approximate computing. A frequent reconfiguration will drastically decrease the efficiency of the RRAM-based computing system. The system should operate continuously without reconfiguration to alleviate this overhead.

At the system level, the average energy cost per instruction of the whole system (the inverse of the energy efficiency in FLOPS/J) along with the operating cycles can be calculated through the following equation:

\eta_{overall} = \frac{E_{configure} + E_{operate} \times Cycles}{Cycles \cdot Insts}    (18)

where \eta_{overall} is the amortized energy cost per instruction of an RRAM-ACU when the configuration overhead is included. E_{configure} is the total energy consumed to program the RRAM-ACU. E_{operate} is the average⁴ energy cost of each approximate computing operation.

⁴We can achieve a fine-grained energy consumption with SPICE-based simulation, but an average energy consumption is enough to evaluate the energy efficiency along with operating cycles at the system level.


Fig. 11. Performance of the proposed predictive model (expected program steps versus gap distance (nm), model versus experimental data). The widths of the write pulses are 1, 2, 5, and 10 ns in (a), (b), (c), and (d), respectively. The experiment results are achieved by statistically analyzing 5,000 independent simulation results for each set of parameters.

Cycles is the number of operating cycles after the RRAM-ACU is programmed. Insts represents the number of x86-64 instructions required to complete the same application as an RRAM-ACU.

In this model, the best achievable energy cost per instruction is E_{operate}/Insts, and E_{configure} will significantly impact the performance when the system is reconfigured frequently.
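A one-function transcription of Eq. (18), which can be used to reproduce amortization curves like those in Fig. 15 (bottom); the numbers in the example loop are arbitrary placeholders:

def energy_per_instruction(e_configure, e_operate, cycles, insts):
    """Eq. (18): amortized energy cost per x86-64-equivalent instruction.
    e_configure: total energy to program the RRAM-ACU (J)
    e_operate:   average energy per approximate operation (J)
    cycles:      operating cycles executed after configuration
    insts:       x86-64 instructions replaced by one RRAM-ACU operation"""
    return (e_configure + e_operate * cycles) / (cycles * insts)

# Example: the cost approaches the asymptote e_operate / insts as the
# configuration energy is amortized over more operating cycles.
for cycles in (1e2, 1e4, 1e6):
    print(cycles, energy_per_instruction(1e-6, 1e-11, cycles, 100))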

B. Predictive Model of RRAM-ACU Configuration Overhead

The overhead of configuring an RRAM-ACU mainly depends on the efficiency of RRAM state tuning. The estimation of the configuration overhead E_{configure} requires the number of steps needed to tune an RRAM device to the target state and the energy consumption of each step⁵. For the latter, as the RRAM devices are tuned around the low-resistance state g'_{mid} of Eq. (17), we can use the energy consumed by a tuning pulse applied at g'_{mid} to approximate the energy consumption of each step. However, it is usually very hard to predict the number of tuning steps, as the RRAM resistance change is stochastic, as described in Section II-A.

To simplify the estimation of the tuning steps⁶, we develop a predictive compact model to calculate the expected tuning steps efficiently. The relationship between the change of the gap distance (∆d) and the expected tuning steps (E(N)) can be approximated through the following equation:

E(N) = \left\lceil e^{\xi/(\varepsilon \cdot T_w)} \cdot \left( \frac{\alpha_w}{\sqrt{T_w}} \cdot \Delta d + \beta_w \right) \right\rceil    (19)

where the gap distance change ∆d (nm) represents the difference of the RRAM tunneling gap (d) between the target and initial conductance states. ∆d can be calculated through Eq. (1). V_w (V) and T_w (ns) are the amplitude and width of the RRAM write pulses, respectively. \varepsilon (‰) is the maximum acceptable deviation of the RRAM conductance state. e is Euler's number. \alpha_w (∼2000) and \beta_w (∼25) are fitting parameters. \xi is a parameter influenced by the gap change speed and can be represented as:

\xi \propto \frac{\Delta d}{\Delta t}    (20)

⁵An RRAM device usually requires an initial forming process to make the resistance state changeable. The initial forming process is required only once after the RRAM device is fabricated. In this work, we assume that all the RRAM devices are already formed before executing approximate computing tasks, and we do NOT include the forming process in the predictive model.

⁶Tuning an RRAM device to the target state can be modeled as a stochastic process. The exact probability of successfully tuning an RRAM device within N steps can be calculated recursively. The detailed derivation process is provided in the Appendix.

TABLE III
DETAILED PARAMETERS OF PERIPHERAL CIRCUITS IN RRAM-ACU

Technology Node          180nm
RRAM Tunneling Gap       0.2nm - 1.9nm
RRAM Resistance Range    290Ω - 500kΩ
R_S                      2kΩ
Op Amp                   ∼4.8mW
ADC                      8bit, ∼3.1mW
DAC                      12bit, ∼40mW
Frequency                800MHz

where ∆d/∆t depends on the device parameters as in Ref. [18]. \xi ≈ 1 when V_w = -1.2V and T_w = 1ns.
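A direct transcription of Eqs. (19)-(20) as reconstructed above, with ξ = 1 for V_w = −1.2V and T_w = 1ns as stated in the text; the fitting constants use the quoted approximate values (α_w ≈ 2000, β_w ≈ 25):

import math

def expected_tuning_steps(delta_d_nm, t_w_ns=1.0, eps_permille=10.0,
                          alpha_w=2000.0, beta_w=25.0, xi=1.0):
    """Eq. (19): expected number of P&V steps to move the tunneling
    gap by delta_d (nm) with write pulses of width t_w (ns), for a
    maximum acceptable state deviation eps (in permille)."""
    prefactor = math.exp(xi / (eps_permille * t_w_ns))
    return math.ceil(prefactor * (alpha_w / math.sqrt(t_w_ns) * delta_d_nm + beta_w))

# Example: a 0.2nm gap change with 5ns write pulses (cf. Fig. 11(c)).
print(expected_tuning_steps(0.2, t_w_ns=5.0))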

Fig. 11 verifies the predictive compact model. The reference 'Exp Data' are simulation data collected by using a MATLAB-based RRAM device model from Refs. [18], [21] to simulate the stochastic tuning process of RRAM devices. We generated 5,000 independent simulation results for each set of parameters. The amplitude of the write pulse is set to −1.2V and the read pulse is set to 0.1V. The initial resistance state is set to 500Ω. The lines in Fig. 11 represent the results calculated by the predictive compact model; the points are the reference data generated by the simulation of the RRAM state tuning process. The predictive compact model fits the 'Exp Data' well.

VI. EVALUATION

To evaluate the performance and efficiency of the proposed RRAM-based analog approximate computing, we apply our design to several benchmarks, ranging from signal processing, gaming, and compression to object recognition. A sensitivity analysis is also performed to evaluate the robustness of the RRAM-based computing system.

A. Experiment Setup

In the experiment, a Verilog-A RRAM device model reported in [18], [21] is used to build up the SPICE-level crossbar array. We choose the 65nm technology node to model the interconnect of the crossbar array and reduce the IR drop. The parameters of the interconnect are calculated with the International Technology Roadmap for Semiconductors 2013 [35]. The sigmoid circuit is the same as reported in [30]. The Op Amps, ADCs, and DACs used for simulation are those reported in [36], [37], and [38], respectively. The working frequency of each RRAM-ACU is set to 800MHz. Detailed parameters of the peripheral circuits are summarized in Table III.


TABLE II
BENCHMARK DESCRIPTION

Name        Description             Type               x86-64  Training Set                   Testing Set               NN        NN MSE  NN MSE  Error Metric     Error
                                                       Insts                                                            Topology  (CPU)   (RRAM)
FFT         Radix-2 Cooley-Tukey    Signal Processing  34      32,768 Random Floating         2,048 Random Floating     1×8×2     0.0046  0.0071  Average          10.72%
            Fast Fourier                                       Point Numbers                  Point Numbers                                       Relative Error
Inversek2j  Inverse Kinematics      Robotics           100     10,000 (x, y) Random           10,000 (x, y) Random      2×8×2     0.0038  0.0053  Average          9.07%
            for 2-Joint Arm                                    Coordinates                    Coordinates                                         Relative Error
Jmeint      Triangle Intersection   3D Gaming          1,079   10,000 Pairs of 3D             10,000 Pairs of 3D        18×48×2   0.0117  0.0258  Miss Rate        9.50%
            Detection                                          Triangle Coordinates           Triangle Coordinates
JPEG        JPEG Encoding           Compression        1,257   Three 512×512 Color Images     One 220×220 Color Image   64×16×64  0.0081  0.0153  Image Diff       11.44%
K-Means     K-Means Clustering      Machine Learning   26      50,000 Pairs of (R,G,B)        One 220×220 Color Image   6×20×1    0.0052  0.0081  Image Diff       7.59%
                                                               Values
Sobel       Sobel Edge Detector     Image Processing   88      One 512×512 Color Image        One 220×220 Color Image   9×8×1     0.0286  0.0026  Image Diff       4.00%

Moreover, the maximum amplitude of the input voltage is set to 0.5V to achieve an approximately linear I-V relationship of the RRAM devices. The state tuning scheme described in Section IV-C is used to program the RRAM-ACU. The amplitude of the write pulse is set to -1.2V and the read pulse is set to 0.1V. The pulse width is set to 5ns. The maximum acceptable deviation (ε) of the RRAM conductance state is set to 1% when programming the RRAM-ACU. All the simulation results of the RRAM crossbar array are achieved with HSPICE. And the configuration overhead is estimated with the predictive compact model introduced in Section V-B.

B. Benchmark Evaluation

Table II summarizes the benchmarks used in the evaluation. The benchmarks are the same as those described in [1], which were used to test the performance of an x86-64 CPU at 2GHz equipped with a CMOS-based digital neural processing unit. The 'NN Topology' column in the table represents the size of each neural network. For example, '9 × 8 × 1' represents a neural approximator with 9 nodes in the input layer, 8 nodes in the hidden layer, and 1 node in the output layer. The mean square error (MSE) is tested both on the CPU and on the SPICE-based RRAM-ACU after training. The training scheme, modified for the RRAM-ACU, has been described in Section IV-A. The size of the crossbar array in the RRAM-ACU is set to 64 × 64 to satisfy all the benchmarks. The unused RRAM devices in the crossbar array are set to the highest resistance states to minimize the sneak path problem, and the unused input and output ports are connected to ground.

The simulation results are illustrated in Fig. 13 and Fig. 14. Compared with the x86-64 CPU at 2GHz, the RRAM-ACU achieves up to 567.98 GFLOPS/W power efficiency and up to 196.41× speedup. And for the whole set of selected diverse benchmarks, the RRAM-ACU provides 249.14 GFLOPS/W and a speedup of 67.29× with a quality loss of 8.72% on average. The improvement in processing speed mainly depends on the capability of the neural approximator. As the RRAM-ACU is able to transfer a set of instructions into a neural approximator and execute them within only one cycle, the speedup achieved by an RRAM-ACU increases linearly with the number of instructions the neural approximator represents. For example, the 'Jmeint' and 'JPEG' benchmarks achieve >150× speedup as their neural approximators successfully implement complex tasks that require more than a thousand instructions on traditional x86-64 architectures.

Fig. 13. Speedup of the RRAM-ACU under different benchmarks (FFT: 5.31×, Inversek2j: 15.63×, Jmeint: 168.59×, JPEG: 196.41×, K-Means: 4.06×, Sobel: 13.75×).

Fig. 14. Power efficiency of the RRAM-ACU under different benchmarks (FFT: 85.00, Inversek2j: 223.01, Jmeint: 422.87, JPEG: 567.98, K-Means: 24.59, Sobel: 171.39 GFLOPS/W).

Fig. 15. RRAM-ACU power consumption breakdowns (normalized power consumption shares of the RRAM, sigmoid, OP, and AD/DA components for each benchmark).

In contrast, the 'K-Means' and 'FFT' benchmarks achieve the least speedup (∼10×) because of the simplicity of the tasks.


Fig. 15. Energy efficiency of RRAM-ACU along with the operating time when the configuration overhead is considered (energy consumption per flop (pJ/Flop) versus operating cycles for each benchmark).

And for the improvement of power efficiency, although the RRAM-ACU for a complex task is able to achieve a larger speedup, a bigger neural approximator may also be demanded to accomplish the more power-consuming task. However, as the NN topology grows more slowly than the instruction count in the experiment, the complex tasks still achieve better power efficiency.

Fig. 15 illustrates the power consumption breakdowns of the RRAM-ACUs. The sigmoid circuit is power efficient, as only 6 MOSFETs are used in the circuit [30]. The power consumption of the sigmoid circuit mainly depends on the output voltage. For example, most outputs are close to zero after JPEG encoding, and therefore the sigmoid circuit accounts for a negligible part of the power consumption in the 'JPEG' benchmark. In contrast, the outputs of the sigmoid circuits in 'Inversek2j' and 'K-Means' are much larger, and the power consumption increases as a result. Compared with the sigmoid circuit, most of the power is consumed by the Op Amps and AD/DAs. RRAM devices take only 10%∼20% of the total energy consumption in the RRAM-ACU, and the ratio increases with the NN topology. Therefore, reducing the energy consumed by the peripheral circuits may be a key challenge in further improving the efficiency of RRAM-based analog approximate computing.

Finally, Fig. 15 illustrates the energy efficiency of the RRAM-ACU along with the operating time when the configuration overhead is considered. The energy efficiency of the whole system is calculated according to Eq. (18). It can be seen that the RRAM-ACU should keep operating for a period of time to reduce the impact of the configuration overhead and increase the energy efficiency. The configuration overhead increases with the size of the neural approximator. For the benchmarks with a small NN topology, e.g. 'Sobel' and 'FFT', the configuration overhead is small: only ∼10³ cycles (@800MHz) are needed to reach good performance. However, for the complex tasks, more operating cycles (∼10⁵) are required to achieve better energy efficiency.

In conclusion, the simulation results demonstrate the efficiency of the RRAM-ACU as well as the feasibility of dynamic reconfiguration. And there is a trade-off among task complexity, power efficiency, and configuration overhead: the more difficult the task, the better the power efficiency an RRAM-ACU can achieve, but the more operating cycles are required to hide the larger configuration overhead.

Fig. 16. Performance of RRAM-based HMAX under different noise conditions, where 'DV' represents device variation and 'SF' represents input signal fluctuation. Recognition accuracy (%): CPU 97.6, ideal RRAM 95.2, DV=5% 84.4, DV=10% 71.9, DV=20% 40.6, SF=5% 93.9, SF=10% 90.9, SF=20% 80.8.


C. System Level Evaluation: HMAX

In order to evaluate the performance of the RRAM-ACU at system level, we conduct a case study on the HMAX application. HMAX is a well-known bio-inspired model for general object recognition in complex environments [39]. The model spends more than 95% of its computation on pattern matching in the S2 layer, calculating the distance between the prototypes and units [13], [39]. The amount of computation is too large to realize real-time video processing on conventional CPUs, while the computation accuracy requirement is not strict [40]. In this evaluation, we apply the proposed RRAM-based approximate computing framework to the distance calculations to improve the data processing efficiency.

We use 1,000 images (350 of cars and 650 of other categories) from the PASCAL Challenge 2011 database [41] to evaluate the performance of the HMAX system on the digital and the RRAM-based approximate computing frameworks. Each image is of 320×240 pixels with a complex background. The HMAX model contains 500 patterns of car images, which remain the same on each platform. A correct result requires both a right judgment on the classification of the object and a successful detection of the object location.

The RRAM approximate computing framework illustrated in Fig. 2 is used to support the HMAX approximation. Each RRAM processing element (PE) consists of four 6-input RRAM-ACUs for Gaussian calculations and one 4-input RRAM-ACU for multiplication. Therefore, each RRAM PE can realize a 24-input distance calculation per clock cycle [13].
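Why this decomposition is exact can be seen from the factorization of the Gaussian: the squared distance sums over disjoint 6-input chunks, so the four partial Gaussians multiply back to the full 24-input one. A minimal Python sketch with idealized ACUs (the function names are hypothetical):

```python
import numpy as np

def gaussian_6in(x, p, sigma=1.0):
    """Idealized 6-input Gaussian RRAM-ACU: exp(-||x - p||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - p) ** 2) / (2.0 * sigma ** 2))

def rram_pe_distance(x24, p24, sigma=1.0):
    """One RRAM PE: four 6-input Gaussian ACUs feeding a 4-input multiplier."""
    partials = [gaussian_6in(x24[i:i + 6], p24[i:i + 6], sigma)
                for i in range(0, 24, 6)]
    return np.prod(partials)  # the 4-input multiplication ACU

x, p = np.random.rand(24), np.random.rand(24)
direct = np.exp(-np.sum((x - p) ** 2) / 2.0)
assert np.isclose(rram_pe_distance(x, p), direct)
```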

The correct rates are shown in Fig. 16, where the performance of RRAM-based approximate computing under different noise conditions is also considered. The device variation represents the deviation of the RRAM conductance states, and the signal fluctuation represents the deviation of the input signals. As we can observe, the correct rate degradation is only 2.4% on the ideal RRAM-based approximate computing platform w.r.t. the CPU platform. This degradation can be easily compensated by increasing the number of patterns [39].


TABLE IV
POWER EFFICIENCY OF THE RRAM-BASED HMAX

AD/DA (mW)   Analog (mW)   Total (mW)   x86-64 Insts   Frequency (MHz)   Efficiency (GFlops/W)
963.1        511.96        1475.06      558            800               302.64

TABLE V
POWER EFFICIENCY COMPARISON WITH DIFFERENT PLATFORMS (FPGA, GPU, CPUS IN [40])

Parameters                   Proposed    FPGA      GPU       CPUs
Size of input image          320×240     256×256   256×256   256×256
HMAX orientations            12          12        12        12
HMAX scale                   12          12        12        12
HMAX prototypes              500         5120      5120      5120
Average size of prototypes   8           8         8         8
Cycles for calculation       32          -         -         -
Calculation amount/frame     5455×500    -         -         -
Frequency (MHz)              800         -         -         -
Power (W)                    1.475       40        144       116
Unified fps/W                6.214       0.483     0.091     0.023
Speed Up                     -           12.86     68.29     270.17

Moreover, when taking the noise into consideration, the device variation significantly impacts the recognition accuracy. As the performance of the neural approximator mainly depends on the RRAM conductance states, the device variation directly degrades the computation quality and thus the recognition accuracy. For example, a 20% device variation can result in a >50% decrease of the recognition accuracy (40.6% vs. 95.2% in Fig. 16). Therefore, the device variation should be suppressed to satisfy applications requiring high accuracy. Compared with the device variation, the impact of signal fluctuation is much smaller, which suggests that we may use lower-precision but lower-power DACs in the RRAM-ACU to further improve the power efficiency of the whole system.
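The sensitivity to device variation can be illustrated with a toy Monte Carlo experiment: perturb nominal conductances with zero-mean Gaussian noise of the stated relative magnitude and observe the error of a weighted sum, the core crossbar operation. This is only a sketch; the accuracy numbers in Fig. 16 come from the full HMAX pipeline.

```python
import numpy as np

# Toy model of device variation: each conductance deviates by a
# zero-mean Gaussian with relative std 'dv'; we track the resulting
# error of the analog weighted sum g @ v.
rng = np.random.default_rng(0)
g = rng.uniform(0.1, 1.0, size=6)        # nominal conductance states
v = rng.uniform(0.0, 1.0, size=6)        # input voltages (signals)

for dv in (0.05, 0.10, 0.20):            # 5%, 10%, 20% device variation
    noisy = g * (1.0 + dv * rng.standard_normal((10000, 6)))
    err = np.abs(noisy @ v - g @ v) / np.abs(g @ v)
    print(f"DV={dv:.0%}: mean output error {err.mean():.1%}")
```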

The power efficiency evaluation of the RRAM-based HMAX accelerator is given in Table IV, and detailed comparisons with other platforms are given in Table V. The parameters of the HMAX model as well as the evaluation image datasets differ among platforms, so it is hard to compare the recognition accuracy of different implementations. However, we can still compare the efficiency of different platforms through the unified power consumption per frame. The simulation results show that the power efficiency of the RRAM-based approximate computing framework is higher than 300 GFLOPS/W. Compared to the FPGA, GPU, and CPU platforms in [40], the RRAM-based HMAX achieves up to 6.214 fps/W, which is 12.8∼270.2× higher than its digital counterparts.
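The 'Speed Up' row of Table V is simply the ratio of the unified fps/W figures, which a few lines verify:

```python
# The "Speed Up" row in Table V is the ratio of unified fps/W values.
fps_per_watt = {"Proposed": 6.214, "FPGA": 0.483, "GPU": 0.091, "CPUs": 0.023}
for name, v in fps_per_watt.items():
    if name != "Proposed":
        print(f"{name}: {fps_per_watt['Proposed'] / v:.2f}x")
# FPGA: 12.86x, GPU: 68.29x, CPUs: 270.17x  (matches the table)
```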

VII. CONCLUSION

In this work, we propose a power efficient approximate computing framework with the emerging RRAM technology. We first introduce an RRAM-based approximate computing framework by integrating our programmable RRAM-ACU. We also introduce a complete configuration flow to program the RRAM-based computing hardware efficiently. Finally, the proposed RRAM-based computing system is modeled at system level, and a predictive compact model is developed to estimate the configuration overhead and explore the application scenarios of RRAM-based analog approximate computing.

Besides exploring the potential of RRAM-based approximate computing, this work still faces many challenges. For example, the IR-drop caused by the interconnect resistance influences the RRAM computation quality and severely limits the scale of the crossbar system [25]. IR-drop reduction or compensation techniques are demanded to support applications, such as deep learning, which require a large crossbar size. Besides, many RRAM-specific issues, such as the impact of temperature on the resistive switching behavior and I-V relationship, should also be considered to enhance the system reliability in future work [42].

APPENDIX

The probability of successfully tuning an RRAM device to the target resistance range within N steps can be calculated by the following expansion:

P(\mathrm{Step} = N) = \sum_{i=1}^{N-2} P_r(i) \cdot P_s(N - i - 1 \mid \mathrm{Initial}) + P_s(N \mid \mathrm{Initial})    (21)

where P_r(n) represents the probability that the state tuning scheme detects that the RRAM device misses the required range at the n-th step and resets it at the (n+1)-th step, and P_s(n | Initial) represents the probability that the RRAM device is successfully tuned to the required range within n steps after an initialization.

According to Ref. [18], the tunneling gap change caused by a voltage pulse follows a Gaussian distribution, whose mean depends on the previous gap (d) of the RRAM device. As the RRAM-ACU mainly takes advantage of the low resistance states of RRAM devices, d is usually very small (0.2nm ∼ 0.5nm). We can therefore assume that the tunneling gap changes caused by the voltage pulses are approximately i.i.d. and follow the same Gaussian distribution. Because the sum of a series of independent Gaussian random variables is still Gaussian, P_s(n | Initial) can be represented as follows:

P_s(n \mid \mathrm{Initial}) = \int_{D-\varepsilon}^{D+\varepsilon} N(x \mid n\mu, n\sigma^2) \, dx    (22)

where N(x | \mu, \sigma^2) is the Gaussian probability density function that represents the tunneling gap change caused by one voltage pulse, D is the distance between the target and the initial tunneling gap, and \varepsilon is the maximum allowed absolute deviation of the resistance state.
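Since the n-pulse gap change is N(nμ, nσ²), the integral in Eq. (22) reduces to a difference of Gaussian CDFs. A numeric sketch, with placeholder device parameters rather than fitted values:

```python
from math import erf, sqrt

def phi(x, mean, var):
    """Gaussian CDF at x for N(mean, var)."""
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))

def p_success(n, mu, sigma, D, eps):
    """Eq. (22): P(D - eps <= total gap change after n pulses <= D + eps)."""
    return (phi(D + eps, n * mu, n * sigma**2)
            - phi(D - eps, n * mu, n * sigma**2))

# Illustrative placeholder parameters (not measured device values).
print(p_success(n=10, mu=0.03, sigma=0.01, D=0.3, eps=0.02))
```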

For the other part, P_r(n) can be calculated recursively as:

P_r(n) = \sum_{i=1}^{n-2} P_r(i) \cdot P_m(n - i - 1 \mid \mathrm{Initial}) + P_m(n \mid \mathrm{Initial})    (23)

where P_m(n | Initial) represents the probability that the RRAM device misses the required resistance range within n steps after an initialization. P_m(n | Initial) can be expressed as:

P_m(n \mid \mathrm{Initial}) = \int_{D+\varepsilon}^{+\infty} N(x \mid n\mu, n\sigma^2) \, dx    (24)

Finally, by combining Eqs. (21)-(24), the detailed probability of tuning an RRAM device within N steps can be obtained.
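Putting Eqs. (21)-(24) together gives a short recursion; the sketch below evaluates P(Step = N) directly, again with the same illustrative placeholder parameters:

```python
from math import erf, sqrt
from functools import lru_cache

MU, SIGMA, D, EPS = 0.03, 0.01, 0.3, 0.02   # placeholder device parameters

def phi(x, mean, var):
    return 0.5 * (1.0 + erf((x - mean) / sqrt(2.0 * var)))

def p_s(n):   # Eq. (22): land inside [D - eps, D + eps] after n pulses
    return phi(D + EPS, n * MU, n * SIGMA**2) - phi(D - EPS, n * MU, n * SIGMA**2)

def p_m(n):   # Eq. (24): overshoot past D + eps after n pulses
    return 1.0 - phi(D + EPS, n * MU, n * SIGMA**2)

@lru_cache(maxsize=None)
def p_r(n):   # Eq. (23): first reset is triggered at the n-th step
    return sum(p_r(i) * p_m(n - i - 1) for i in range(1, n - 1)) + p_m(n)

def p_step(N):  # Eq. (21): tuned successfully in exactly N steps
    return sum(p_r(i) * p_s(N - i - 1) for i in range(1, N - 1)) + p_s(N)

print([round(p_step(N), 4) for N in range(2, 12)])
```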


REFERENCES

[1] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in International Symposium on Microarchitecture (MICRO), 2012, pp. 449–460.
[2] DARPA. Power efficiency revolution for embedded computing technologies. [Online]. Available: https://www.fbo.gov/
[3] N. T. K.-S. Datasheet. (2013) Kepler family product overview.
[4] Intel. (2015) Intel microprocessor export compliance metrics.
[5] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Computer Architecture (ISCA), 2011 38th Annual International Symposium on. IEEE, 2011, pp. 365–376.
[6] S. Chakradhar and A. Raghunathan, "Best-effort computing: Re-thinking parallel software and hardware," in Design Automation Conference (DAC), 2010 47th ACM/IEEE, June 2010, pp. 865–870.
[7] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, "On reconfiguration-oriented approximate adder design and its application," in Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 2013, pp. 48–54.
[8] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Quality programmable vector processors for approximate computing," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013, pp. 1–12.
[9] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 32, no. 1, pp. 124–137, Jan 2013.
[10] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. Roy, and A. Raghunathan, "SALSA: systematic logic synthesis of approximate circuits," in Proceedings of the 49th Annual Design Automation Conference. ACM, 2012, pp. 796–801.
[11] C. Liu, J. Han, and F. Lombardi, "A high-performance approximate multiplier with configurable partial error recovery," University of Alberta, Tech. Rep., 2013.
[12] V. Narayanan, S. Datta, G. Cauwenberghs, D. Chiarulli, S. Levitan, and P. Wong, "Video analytics using beyond CMOS devices," in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE '14, 2014, pp. 344:1–344:5.
[13] B. Li, Y. Shan, M. Hu, Y. Wang, Y. Chen, and H. Yang, "Memristor-based approximated computation," in Low Power Electronics and Design (ISLPED), 2013 IEEE International Symposium on, Sept 2013, pp. 242–247.
[14] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design implications of memristor-based RRAM cross-point structures," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011. IEEE, 2011, pp. 1–6.
[15] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu, "Nanoscale memristor device as synapse in neuromorphic systems," Nano Letters, vol. 10, no. 4, pp. 1297–1301, 2010.
[16] M. Hu, H. Li, Q. Wu, and G. S. Rose, "Hardware realization of BSB recall function using memristor crossbar arrays," in Design Automation Conference, 2012, pp. 498–503.
[17] H.-S. P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F. Chen, and M.-J. Tsai, "Metal-oxide RRAM," Proceedings of the IEEE, vol. 100, no. 6, pp. 1951–1970, June 2012.
[18] S. Yu, B. Gao, Z. Fang, H. Yu, J. Kang, and H.-S. P. Wong, "A low energy oxide-based electronic synaptic device for neuromorphic visual systems with tolerance to device variation," Advanced Materials, vol. 25, no. 12, pp. 1774–1779, 2013.
[19] Y. Deng, P. Huang, B. Chen, X. Yang, B. Gao, J. Wang, L. Zeng, G. Du, J. Kang, and X. Liu, "RRAM crossbar array with cell selection device: A device and circuit interaction study," Electron Devices, IEEE Transactions on, vol. 60, no. 2, pp. 719–726, Feb 2013.
[20] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, p. 075201, 2012.
[21] X. Guan, S. Yu, and H.-S. Wong, "A SPICE compact model of metal oxide resistive switching memory with variations," Electron Device Letters, IEEE, vol. 33, no. 10, pp. 1405–1407, Oct 2012.
[22] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[23] Y. Ito, "Approximation capability of layered neural networks with sigmoid units on two layers," Neural Computation, vol. 6, no. 6, pp. 1233–1243, 1994.
[24] L. Fausett, Ed., Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994.
[25] P. Gu, B. Li, T. Tang, S. Yu, Y. Cao, Y. Wang, and H. Yang, "Technological exploration of RRAM crossbar array for matrix-vector multiplication," in The 20th Asia and South Pacific Design Automation Conference (ASPDAC). IEEE, 2015, pp. 106–111.
[26] S. O. Cannizzaro, A. D. Grasso, R. Mita, G. Palumbo, and S. Pennisi, "Design procedures for three-stage CMOS OTAs with nested-Miller compensation," Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 54, no. 5, pp. 933–940, 2007.
[27] W. Oh and B. Bakkaloglu, "A CMOS low-dropout regulator with current-mode feedback buffer amplifier," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 54, no. 10, pp. 922–926, 2007.
[28] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design. Oxford Univ. Press, 2002.
[29] B. Li, Y. Wang, Y. Chen, H. H. Li, and H. Yang, "ICE: inline calibration for memristor crossbar-based computing engine," in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2014, p. 184.
[30] G. Khodabandehloo, M. Mirhassani, and M. Ahmadi, "Analog implementation of a novel resistive-type sigmoidal neuron," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 20, no. 4, pp. 750–754, April 2012.
[31] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Computation, vol. 7, no. 2, pp. 219–269, 1995.
[32] F. Bedeschi, R. Fackenthal, C. Resta, E. Donze, M. Jagasivamani, E. Buda, F. Pellizzer, D. Chow, A. Cabrini, G. Calvi, R. Faravelli, A. Fantini, G. Torelli, D. Mills, R. Gastaldi, and G. Casagrande, "A bipolar-selected phase change memory featuring multi-level cell storage," Solid-State Circuits, IEEE Journal of, vol. 44, no. 1, pp. 217–227, Jan 2009.
[33] H. Lee, P. Chen, T. Wu, Y. Chen, C. Wang, P. Tzeng, C. Lin, F. Chen, C. Lien, and M. Tsai, "Low power and high speed bipolar switching with a thin reactive Ti buffer layer in robust HfO2 based RRAM," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.
[34] S. Kannan, J. Rajendran, R. Karri, and O. Sinanoglu, "Sneak-path testing of crossbar-based nonvolatile random access memories," Nanotechnology, IEEE Transactions on, vol. 12, no. 3, pp. 413–426, 2013.
[35] ITRS, "International technology roadmap for semiconductors," 2013.
[36] K. Gulati and H.-S. Lee, "A high-swing CMOS telescopic operational amplifier," Solid-State Circuits, IEEE Journal of, vol. 33, no. 12, pp. 2010–2019, 1998.
[37] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International. IEEE, 2013, pp. 468–469.
[38] W.-T. Lin and T.-H. Kuo, "A 12b 1.6 GS/s 40mW DAC in 40nm CMOS with >70dB SFDR over entire Nyquist bandwidth," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International. IEEE, 2013, pp. 474–475.
[39] J. Mutch and D. G. Lowe, "Object class recognition and localization using sparse features with limited receptive fields," Int. J. Comput. Vision, vol. 80, no. 1, pp. 45–57, Oct. 2008.
[40] A. A. Maashri, M. Debole, M. Cotter, N. Chandramoorthy, Y. Xiao, V. Narayanan, and C. Chakrabarti, "Accelerating neuromorphic vision algorithms for recognition," in Proceedings of the 49th Annual Design Automation Conference, 2012, pp. 579–584.
[41] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[42] C. Walczyk, D. Walczyk, T. Schroeder, T. Bertaud, M. Sowinska, M. Lukosius, M. Fraschke, D. Wolansky, B. Tillack, E. Miranda, and C. Wenger, "Impact of temperature on the resistive switching behavior of embedded HfO2-based RRAM devices," Electron Devices, IEEE Transactions on, vol. 58, no. 9, pp. 3124–3131, Sept 2011.


Boxun Li received the B.S. degree in Electronic Engineering from Tsinghua University, China, in 2013. He is currently pursuing the M.S. degree in the Department of Electronic Engineering, Tsinghua University. His research mainly focuses on energy efficient hardware computing system design and parallel computing based on GPU.

Peng Gu is currently pursuing the B.S. degree in the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research interests include low power system design, hardware acceleration, and computing with emerging devices. He has authored and co-authored several papers in DAC, ASPDAC, GLSVLSI, etc.

Yi Shan is a senior R&D engineer at the Institute for Deep Learning (IDL), Baidu, Inc. Before joining Baidu, he received the B.S. degree from Tsinghua University, China, in 2008, and the Ph.D. degree from the NICS Group, Department of Electronic Engineering, Tsinghua University, in 2014. Dr. Shan's work mainly focuses on heterogeneous parallel/distributed computing based on GPU clusters for deep learning applications, and hardware computing on FPGA for other applications, such as stereo vision, search engines, and brain network analysis.

Yu Wang (S'05-M'07-SM'14) received the B.S. degree in 2002 and the Ph.D. degree (with honor) in 2007 from Tsinghua University, Beijing, China.

He is currently an Associate Professor with the Department of Electronic Engineering, Tsinghua University. His research interests include parallel circuit analysis, application specific hardware computing (especially on brain related problems), and power/reliability aware system design methodology.

Dr. Wang has authored and coauthored over 130 papers in refereed journals and conferences. He is the recipient of the IBM X10 Faculty Award in 2010, the Best Paper Award at ISVLSI 2012, the Best Poster Award at HEART 2012, and 6 Best Paper Nominations in ASPDAC/CODES/ISLPED. He serves as an Associate Editor for IEEE Transactions on CAD and the Journal of Circuits, Systems, and Computers. He was the TPC Co-Chair of ICFPT 2011 and the Finance Chair of ISLPED 2012-2015, and serves as a TPC member in many important conferences (DAC, FPGA, DATE, ASPDAC, ISLPED, ISQED, ICFPT, ISVLSI, etc.).

Yiran Chen (M'05) received the B.S. (Hons.) and M.S. (Hons.) degrees from Tsinghua University, Beijing, China, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 2005. After five years in industry, he joined the University of Pittsburgh, Pittsburgh, PA, USA, in 2010, as an Assistant Professor and was promoted to Associate Professor with the ECE Department in 2014. He has published one book, several book chapters, and over 200 technical publications. He has been granted 86 U.S. and international patents, with another 15 applications pending.

Dr. Chen was the recipient of three Best Paper Awards, from ISQED'08, ISLPED'10, and GLSVLSI'13, and several other nominations from DAC, DATE, and ASPDAC. He was also the recipient of the NSF CAREER Award in 2013 and the ACM SIGDA Outstanding Young Faculty Award in 2014, and was an invitee of the 2013 U.S. Frontiers of Engineering Symposium of the National Academy of Engineering. He is an Associate Editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, the ACM Journal on Emerging Technologies in Computing Systems, and ACM SIGDA E-News, and has served on the technical and organization committees of about 40 conferences.

Huazhong Yang (M'97-SM'00) was born in Ziyang, Sichuan Province, China, on August 18, 1967. He received the B.S. degree in microelectronics and the M.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1989, 1993, and 1998, respectively.

In 1993, he joined the Department of Electronic Engineering, Tsinghua University, where he is currently a Specially Appointed Professor of the Cheung Kong Scholars Program. He has authored and co-authored over 300 technical papers and holds 70 granted patents. His current research interests include wireless sensor networks, data converters, parallel circuit simulation algorithms, nonvolatile processors, and energy-harvesting circuits.